# Knowledge Card: Maximum Likelihood Estimation (MLE)
## Basic Info
| Attribute | Value |
|---|---|
| Topic | Maximum Likelihood Estimation (MLE) |
| Mastery target | ★★★★★ |
| Priority | P0 |
| Estimated time | 6 hours |
| Interview frequency | ★★★★★ |
## Core Principle
The goal of MLE: find the parameters under which the observed data are most probable.
Likelihood: L(θ|D) = P(D|θ) = ∏ᵢ P(xᵢ|θ)
Log-likelihood: log L(θ|D) = Σᵢ log P(xᵢ|θ)
MLE: θ* = argmax_θ log L(θ|D)
        = argmin_θ [-log L(θ|D)]   (negative log-likelihood = the loss function!)
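Before the derivations, a quick numeric sanity check of this equivalence (a minimal sketch of my own, not from the card): for Bernoulli data the MLE has the closed form p̂ = sample mean, and the same value minimizes the negative log-likelihood on a grid.
```python
import numpy as np

# Bernoulli MLE: the closed form p̂ = mean(data) should match the grid argmin of the NLL
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=1000)  # observed coin flips, true p = 0.7
p_mle = data.mean()                     # closed-form MLE

grid = np.linspace(0.01, 0.99, 99)
k = data.sum()
nll = -(k * np.log(grid) + (len(data) - k) * np.log(1 - grid))
print(f"closed-form MLE: {p_mle:.3f}, grid argmin of NLL: {grid[nll.argmin()]:.3f}")
```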
## Deriving Loss Functions from Distributions
### 1. Bernoulli → Sigmoid → Binary Cross Entropy
```python
"""
二分类:y ∈ {0, 1}
P(y|x) = p(x)^y * (1-p(x))^(1-y)
其中 p(x) = sigmoid(w^T x)
对数似然:
log L = Σ [y log p + (1-y) log(1-p)]
负对数似然 = Binary Cross Entropy:
BCE = -Σ [y log p + (1-y) log(1-p)]
"""
import numpy as np

def sigmoid(x):
    # Clip to keep exp from overflowing on extreme inputs
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def binary_cross_entropy(y_true, y_pred):
    eps = 1e-15  # avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
# Quick check
y_true = np.array([1, 0, 1, 1, 0])
logits = np.array([2.0, -1.0, 1.5, 0.5, -2.0])
probs = sigmoid(logits)
print(f"BCE: {binary_cross_entropy(y_true, probs):.4f}")2. Categorical → Softmax → Cross Entropy
```python
"""
多分类:y ∈ {1, ..., K}(one-hot编码)
P(y=k|x) = p_k,其中 p = softmax(z)
对数似然:
log L = Σ Σ y_ik log(p_ik) = Σ log(p_i,y_i)
负对数似然 = Cross Entropy:
CE = -Σ log(p_i,y_i)
"""
def softmax(x):
    # Subtract the row max for numerical stability
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    """targets: class indices"""
    probs = softmax(logits)
    n = len(targets)
    return -np.log(probs[np.arange(n), targets]).mean()
# Quick check
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.0, 1.0]])
targets = np.array([0, 1])
print(f"CE: {cross_entropy(logits, targets):.4f}")3. Gaussian → MSE
```python
"""
回归:y = f(x) + ε,其中 ε ~ N(0, σ²)
P(y|x) = (1/√(2πσ²)) * exp(-(y-f(x))² / (2σ²))
对数似然:
log L = Σ [-½log(2πσ²) - (y-f(x))²/(2σ²)]
= -1/(2σ²) * Σ (y-f(x))² + const
最大化对数似然 = 最小化 Σ (y-f(x))² = MSE!
"""
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)
# Verify the derivation from the Gaussian distribution
import scipy.stats as stats
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 2.2, 2.8])
w = 1.0  # assumed parameter value
mu = w * x  # mean (the prediction)
nll = -stats.norm.logpdf(y, loc=mu, scale=1.0).sum()  # negative log-likelihood
mse = ((y - mu) ** 2).sum()
print(f"NLL: {nll:.4f}")
print(f"MSE (sum, not divided by n): {mse:.4f}")
# Proportional: NLL = MSE/(2σ²) + const (σ=1 here, so NLL = MSE/2 + (n/2)·log(2π))
```
## MLE in PyTorch
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as dist
# 1. Cross entropy = NLL (negative log-likelihood)
logits = torch.randn(3, 5)
targets = torch.tensor([1, 2, 3])
ce_loss = F.cross_entropy(logits, targets)
# Equivalent to log_softmax followed by the NLL loss
log_probs = F.log_softmax(logits, dim=-1)
nll_loss = F.nll_loss(log_probs, targets)
print(f"CE == NLL: {torch.allclose(ce_loss, nll_loss)}")
# 2. Compute the NLL via a Distribution object
probs = F.softmax(logits, dim=-1)
cat_dist = dist.Categorical(probs=probs)
log_prob = cat_dist.log_prob(targets)
print(f"Distribution NLL: {-log_prob.mean():.4f}")
# 3. MSE for regression = Gaussian NLL (up to scale and an additive constant)
pred = torch.randn(10)
target = torch.randn(10)
mse = F.mse_loss(pred, target)
gaussian_dist = dist.Normal(pred, 1.0)
nll = -gaussian_dist.log_prob(target).mean()
print(f"MSE: {mse:.4f}, Gaussian NLL: {nll:.4f}")  # NLL = MSE/2 + 0.5*log(2π) when σ=1
```
## High-Frequency Interview Questions
Q1: Why is cross entropy the natural loss for classification?
Answer:
- Probabilistic view: CE = negative log-likelihood, so minimizing CE = MLE, the standard approach to parameter estimation
- Clean gradient: ∂CE/∂logits = softmax(logits) - y (prediction minus label)
- Information-theoretic view: CE measures the gap between the predicted and true distributions
- Convergence speed: the CE gradient stays well-scaled, while the MSE gradient decays as the sigmoid saturates (see the sketch below)
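A minimal numeric sketch of the last point (my own illustration, not part of the original card). For a sigmoid output p = σ(z) with label y, dBCE/dz = p - y, whereas dMSE/dz = 2(p - y)·p·(1 - p), which vanishes once the sigmoid saturates:
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Gradients w.r.t. the logit z for label y = 1, at increasingly wrong logits
for z in [-1.0, -5.0, -10.0]:
    p = sigmoid(z)
    grad_bce = p - 1                      # dBCE/dz = p - y
    grad_mse = 2 * (p - 1) * p * (1 - p)  # dMSE/dz = 2(p-y)·p·(1-p)
    print(f"z={z:6.1f}  p={p:.5f}  dBCE/dz={grad_bce:+.5f}  dMSE/dz={grad_mse:+.5f}")
# BCE keeps a gradient near -1; the MSE gradient collapses toward 0
```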
Q2: Derive the sigmoid from the Bernoulli distribution, step by step?
Answer:
Let P(y=1|x) = p, P(y=0|x) = 1 - p
Model p = 1/(1 + exp(-w^T x)) = sigmoid(w^T x)
Why sigmoid?
Define odds = p/(1-p)
Take the log: log(p/(1-p)) = w^T x (a linear model fits the log-odds)
Solve for p: p = 1/(1 + exp(-w^T x)) = sigmoid(w^T x)
Hence sigmoid is the natural probability function for binary classification! (A numeric check follows below.)
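A quick confirmation of the last algebra step (my own sketch): applying the log-odds transform to a sigmoid output recovers the linear score exactly, i.e. logit(sigmoid(z)) = z.
```python
import numpy as np

z = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])  # linear scores w^T x
p = 1 / (1 + np.exp(-z))                   # sigmoid
log_odds = np.log(p / (1 - p))             # logit, the inverse of sigmoid
print(np.allclose(log_odds, z))            # True
```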
Q3: What are the limitations of MLE?
Answer:
- Overfitting: with little data, MLE overfits easily (mitigated by MAP estimation / regularization); see the toy example below
- Point estimate: it returns a single parameter value with no uncertainty quantification
- Assumption-dependent: it relies on the distributional assumption being correct (e.g. Gaussian noise)
- No closed form: for many models the MLE has no closed-form solution and must be found by numerical optimization, which may be non-convex
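To make the overfitting point concrete (a toy illustration of my own, not from the card): after 3 coin flips that all land heads, the MLE is certain the coin never lands tails, while a MAP estimate with a Beta(2,2) prior pulls the estimate back toward 0.5.
```python
# 3 flips, all heads
heads, n = 3, 3
p_mle = heads / n  # MLE: p̂ = 1.0 — claims tails is impossible
# MAP with a Beta(alpha, beta) prior: mode = (heads + alpha - 1) / (n + alpha + beta - 2)
alpha, beta = 2, 2
p_map = (heads + alpha - 1) / (n + alpha + beta - 2)  # = 0.8
print(f"MLE: {p_mle:.2f}, MAP with Beta(2,2) prior: {p_map:.2f}")
```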
## Exercises
```python
# 1. Derive the logistic-regression MLE gradient by hand
# L = -Σ [y log σ(wx) + (1-y) log(1-σ(wx))]
# ∂L/∂w = X^T (σ(Xw) - y)
# 2. Verify the CE gradient
import torch
import torch.nn.functional as F
logits = torch.randn(3, 5, requires_grad=True)
targets = torch.tensor([1, 2, 3])
loss = F.cross_entropy(logits, targets)
loss.backward()
# Analytical gradient = (softmax(logits) - one_hot(targets)) / N under mean reduction
analytical = F.softmax(logits, dim=-1).detach()
analytical[range(3), targets] -= 1
analytical /= 3  # F.cross_entropy averages over the batch by default
print(torch.allclose(logits.grad, analytical, atol=1e-5))  # True
# 3. Simulate the generative process
# Sample from the true distribution, then recover the parameters via MLE
w_true = np.array([1.0, -2.0])
X = np.random.randn(100, 2)
p_true = sigmoid(X @ w_true)  # sigmoid as defined earlier
y = np.random.binomial(1, p_true)
# Recover w with gradient-descent MLE... (see the sketch below)
```
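One possible completion of exercise 3 (a minimal sketch of my own; the learning rate and iteration count are arbitrary choices). It runs gradient descent on the mean negative log-likelihood, using the gradient X^T(σ(Xw) - y) from exercise 1, and continues the session above:
```python
# Gradient-descent MLE for logistic regression (continues the block above)
w_hat = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ w_hat) - y) / len(y)  # mean NLL gradient
    w_hat -= lr * grad
print(f"w_true: {w_true}, w_hat: {np.round(w_hat, 2)}")  # w_hat ≈ w_true up to sampling noise
```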