
Knowledge Card: Maximum Likelihood Estimation (MLE)

Basic Info

| Attribute | Content |
| --- | --- |
| Topic | Maximum Likelihood Estimation (MLE) |
| Mastery target | ★★★★★ |
| Priority | P0 |
| Estimated time | 6 hours |
| Interview frequency | ★★★★★ |

Core Principle

Goal of MLE: find the parameters under which the observed data is most probable.

Likelihood: L(θ|D) = P(D|θ) = ∏ᵢ P(xᵢ|θ)
Log-likelihood: log L(θ|D) = Σᵢ log P(xᵢ|θ)

MLE: θ* = argmax_θ log L(θ|D)
      = argmin_θ [-log L(θ|D)]  (negative log-likelihood = the loss function!)
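
The argmax above can be made concrete with a minimal sketch (assuming i.i.d. Bernoulli coin flips): a grid search over θ recovers the sample mean as the maximizer, which is the closed-form Bernoulli MLE.

```python
import numpy as np

# 7 heads out of 10 i.i.d. Bernoulli flips
data = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

# log L(theta) = sum_i [x_i log theta + (1 - x_i) log(1 - theta)]
thetas = np.linspace(0.01, 0.99, 99)
log_lik = np.array([np.sum(data * np.log(t) + (1 - data) * np.log(1 - t))
                    for t in thetas])

theta_mle = thetas[np.argmax(log_lik)]
print(f"grid MLE: {theta_mle:.2f}, sample mean: {data.mean():.2f}")  # both 0.70
```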

Deriving Loss Functions from Distributions

1. Bernoulli → Sigmoid → Binary Cross Entropy

```python
"""
Binary classification: y ∈ {0, 1}
P(y|x) = p(x)^y * (1-p(x))^(1-y)
where p(x) = sigmoid(w^T x)

Log-likelihood:
log L = Σ [y log p + (1-y) log(1-p)]

Negative log-likelihood = Binary Cross Entropy:
BCE = -Σ [y log p + (1-y) log(1-p)]
"""

import numpy as np

def sigmoid(x):
    # Clip to avoid overflow in exp for large |x|
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def binary_cross_entropy(y_true, y_pred):
    eps = 1e-15  # avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Verify
y_true = np.array([1, 0, 1, 1, 0])
logits = np.array([2.0, -1.0, 1.5, 0.5, -2.0])
probs = sigmoid(logits)
print(f"BCE: {binary_cross_entropy(y_true, probs):.4f}")
```

2. Categorical → Softmax → Cross Entropy

```python
"""
Multi-class classification: y ∈ {1, ..., K} (one-hot encoded)
P(y=k|x) = p_k, where p = softmax(z)

Log-likelihood:
log L = Σᵢ Σₖ y_ik log(p_ik) = Σᵢ log(p_i,y_i)

Negative log-likelihood = Cross Entropy:
CE = -Σᵢ log(p_i,y_i)
"""

def softmax(x):
    # Subtract the row max for numerical stability
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    """targets: class indices"""
    probs = softmax(logits)
    n = len(targets)
    return -np.log(probs[np.arange(n), targets]).mean()

# Verify
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.0, 1.0]])
targets = np.array([0, 1])
print(f"CE: {cross_entropy(logits, targets):.4f}")
```

3. Gaussian → MSE

```python
"""
Regression: y = f(x) + ε, where ε ~ N(0, σ²)
P(y|x) = (1/√(2πσ²)) * exp(-(y-f(x))² / (2σ²))

Log-likelihood:
log L = Σ [-½ log(2πσ²) - (y-f(x))²/(2σ²)]
      = -1/(2σ²) * Σ (y-f(x))² + const

Maximizing the log-likelihood = minimizing Σ (y-f(x))², i.e. MSE!
"""

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Verify the derivation against the Gaussian density
import scipy.stats as stats
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 2.2, 2.8])
w = 1.0  # assumed parameter

mu = w * x  # mean (the prediction)
nll = -stats.norm.logpdf(y, loc=mu, scale=1.0).sum()  # negative log-likelihood
sse = ((y - mu) ** 2).sum()  # squared error, not yet divided by n
print(f"NLL: {nll:.4f}")
print(f"Sum of squared errors: {sse:.4f}")
# Proportional up to a constant: NLL = SSE/(2σ²) + const
```

MLE in PyTorch

```python
import torch
import torch.nn.functional as F
import torch.distributions as dist

# 1. Cross entropy = NLL (negative log-likelihood)
logits = torch.randn(3, 5)
targets = torch.tensor([1, 2, 3])

ce_loss = F.cross_entropy(logits, targets)

# Equivalent to NLL on the log-probabilities
log_probs = F.log_softmax(logits, dim=-1)
nll_loss = F.nll_loss(log_probs, targets)
print(f"CE == NLL: {torch.allclose(ce_loss, nll_loss)}")

# 2. NLL via a distribution object
probs = F.softmax(logits, dim=-1)
cat_dist = dist.Categorical(probs=probs)
log_prob = cat_dist.log_prob(targets)
print(f"Distribution NLL: {-log_prob.mean():.4f}")

# 3. MSE for regression = Gaussian NLL (up to scale and an additive constant)
pred = torch.randn(10)
target = torch.randn(10)
mse = F.mse_loss(pred, target)

gaussian = dist.Normal(pred, 1.0)
nll = -gaussian.log_prob(target).mean()
print(f"MSE: {mse:.4f}, Gaussian NLL: {nll:.4f}")
```

High-Frequency Interview Questions

Q1: Why is cross entropy the natural loss for classification?

  1. Probabilistic view: CE = negative log-likelihood, so minimizing CE = MLE, the standard method of parameter estimation
  2. Clean gradient: ∂CE/∂logits = softmax(logits) - y (prediction minus label)
  3. Information-theoretic view: CE measures the gap between the predicted and true distributions
  4. Convergence speed: the CE gradient stays well-scaled, while the MSE gradient decays as the sigmoid saturates
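
The last point can be illustrated with a minimal sketch: for a confidently wrong, saturated prediction (large negative logit, true label 1), the BCE gradient w.r.t. the logit stays near -1 while the gradient of MSE-on-probabilities is crushed by the sigmoid's derivative p(1-p).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

y, z = 1.0, -8.0           # true label 1, confidently wrong logit
p = sigmoid(z)             # ~0.0003

grad_bce = p - y                       # d(BCE)/dz = p - y
grad_mse = 2 * (p - y) * p * (1 - p)   # d((p-y)^2)/dz, chain rule through sigmoid

print(f"BCE grad: {grad_bce:.4f}")     # ~ -1, the model still learns
print(f"MSE grad: {grad_mse:.6f}")     # ~ -0.0007, nearly stalled
```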

Q2: Derive sigmoid from the Bernoulli distribution, start to finish?

Let P(y=1|x) = p, P(y=0|x) = 1-p

Model p = 1/(1+exp(-w^T x)) = sigmoid(w^T x)

Why sigmoid?
Define odds = p/(1-p)
Take the log: log(p/(1-p)) = w^T x  (a linear model fits the log-odds)
Solve for p: p = 1/(1+exp(-w^T x)) = sigmoid(w^T x)

Hence: sigmoid is the natural probability function for binary classification!
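
The last algebraic step can be checked numerically with a quick sketch: applying the log-odds (logit) to sigmoid(z) recovers z exactly, confirming that sigmoid inverts the log-odds map.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)
p = sigmoid(z)
log_odds = np.log(p / (1 - p))   # logit(p) = log(p / (1-p))

print(np.allclose(log_odds, z))  # True: logit is the inverse of sigmoid
```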

Q3: What are the limitations of MLE?

  1. Overfitting: with little data, MLE overfits easily (mitigated by regularization / MAP)
  2. Point estimate: gives a single parameter value with no measure of uncertainty
  3. Distributional assumptions: relies on the assumed distribution being correct (e.g. Gaussian noise)
  4. Tractability: for some models the MLE has no closed form and the resulting optimization is non-convex
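
Limitation 1 in a minimal sketch (assuming a Bernoulli model with a Beta(2, 2) prior, a choice made here for illustration): with 3 flips that all come up heads, MLE commits to θ = 1, while MAP pulls the estimate back toward 0.5.

```python
import numpy as np

data = np.array([1, 1, 1])     # tiny sample: 3 heads, 0 tails
n, k = len(data), data.sum()

theta_mle = k / n              # MLE: argmax of the likelihood alone

# MAP with a Beta(a, b) prior: argmax of likelihood * prior
a, b = 2, 2
theta_map = (k + a - 1) / (n + a + b - 2)

print(f"MLE: {theta_mle:.2f}")   # 1.00 -- claims heads is certain
print(f"MAP: {theta_map:.2f}")   # 0.80 -- regularized toward 0.5
```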

Exercises

```python
# 1. Derive the logistic-regression MLE gradient by hand
# L = -Σ[y log σ(wx) + (1-y) log(1-σ(wx))]
# ∂L/∂w = X^T (σ(Xw) - y)

# 2. Verify the CE gradient
import torch
import torch.nn.functional as F

logits = torch.randn(3, 5, requires_grad=True)
targets = torch.tensor([1, 2, 3])

loss = F.cross_entropy(logits, targets)
loss.backward()

# Analytical gradient = (softmax(logits) - one_hot(targets)) / batch_size,
# divided by 3 because cross_entropy averages over the batch
analytical = F.softmax(logits, dim=-1).detach()
analytical[range(3), targets] -= 1
analytical /= 3
print(torch.allclose(logits.grad, analytical, atol=1e-5))  # True

# 3. Simulate the generative process
# Sample from the true distribution → recover the parameters by MLE
# (uses numpy and the sigmoid defined in the first code block)
w_true = np.array([1.0, -2.0])
X = np.random.randn(100, 2)
p_true = sigmoid(X @ w_true)
y = np.random.binomial(1, p_true)
# Recover w by gradient-descent MLE...
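
One possible solution sketch for exercise 3 (plain NumPy gradient ascent on the log-likelihood; the seed, learning rate, iteration count, and sample size are arbitrary choices for this illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
X = rng.standard_normal((2000, 2))
y = rng.binomial(1, sigmoid(X @ w_true))

# Gradient of the average log-likelihood: X^T (y - sigmoid(Xw)) / n
w = np.zeros(2)
lr = 0.1
for _ in range(500):
    grad = X.T @ (y - sigmoid(X @ w)) / len(y)
    w += lr * grad  # ascent on log L = descent on NLL

print(f"recovered w: {np.round(w, 2)}")  # close to [1.0, -2.0]
```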

Related Topics