
Knowledge Card: Learning Rate Scheduling

Basic Information

Attribute           | Content
Topic               | Learning rate scheduling strategies
Mastery level       | ★★★★☆
Priority            | P1
Estimated time      | 4 hours
Interview frequency | ★★★★☆

Core Principle

A learning rate scheduler adjusts the learning rate dynamically during training. The usual policy is "large first, then small":

  • Large LR early: converge quickly toward the neighborhood of a good solution
  • Small LR late: fine-tune the parameters and avoid oscillation

Common Strategies

1. Step Decay

python
# Every step_size epochs, multiply the LR by gamma
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# With an initial LR of 0.01: 0.01 → 0.001 → 0.0001 → ...
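
Like most PyTorch schedulers, StepLR is advanced once per epoch after training. A minimal usage sketch; num_epochs and train_one_epoch are hypothetical placeholders:

python
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)  # hypothetical helper: one pass over the data
    scheduler.step()                   # apply the step decay after each epoch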

2. Cosine Annealing

python
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
# LR follows a cosine curve from the initial LR down to eta_min (default 0)
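
The decay follows eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T_max)). A quick sketch that prints the schedule so the curve can be inspected; the single dummy tensor stands in for a real model:

python
import torch

# Dummy parameter/optimizer purely to visualize the schedule
opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10, eta_min=1e-4)
for t in range(10):
    print(t, sched.get_last_lr()[0])  # cosine from 0.1 down toward 1e-4
    opt.step()
    sched.step()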

3. Cosine with Warmup

python
import numpy as np

# Warmup: LR grows linearly for the first warmup_steps
# Cosine: cosine decay afterwards
class CosineWarmupScheduler:
    def __init__(self, optimizer, warmup_steps, total_steps, min_lr=0.0):
        self.optimizer = optimizer
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.min_lr = min_lr  # expressed as a fraction of the base LR
        self.base_lrs = [group['lr'] for group in optimizer.param_groups]
        self.step_count = 0

    def step(self):
        self.step_count += 1
        if self.step_count <= self.warmup_steps:
            # Linear warmup
            scale = self.step_count / self.warmup_steps
        else:
            # Cosine decay
            progress = (self.step_count - self.warmup_steps) / (self.total_steps - self.warmup_steps)
            progress = min(progress, 1.0)  # clamp so extra steps stay at min_lr
            scale = 0.5 * (1 + np.cos(np.pi * progress))
            scale = self.min_lr + (1 - self.min_lr) * scale

        for i, param_group in enumerate(self.optimizer.param_groups):
            param_group['lr'] = self.base_lrs[i] * scale
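
A usage sketch for the class above, assuming it is stepped once per optimizer update; model, criterion, and train_loader are placeholders:

python
scheduler = CosineWarmupScheduler(optimizer, warmup_steps=1000, total_steps=10000)
for x, y in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # step-level, not epoch-level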

4. OneCycleLR

python
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=epochs, steps_per_epoch=len(train_loader)
)
# LR first rises then falls; momentum first falls then rises
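
Note that OneCycleLR expects one scheduler step per batch (epochs * steps_per_epoch steps in total), not one per epoch. A minimal sketch; train_loader and criterion are placeholders:

python
for epoch in range(epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()  # inside the batch loop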

5. ReduceLROnPlateau

python
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', patience=10, factor=0.1
)
# Lower the LR when the monitored metric stops improving
for epoch in range(epochs):
    train_loss = train()
    val_loss = validate()
    scheduler.step(val_loss)  # pass in the monitored metric

Strategy Comparison

Scheduler         | Typical use case     | Pros                      | Cons
StepLR            | General purpose      | Simple                    | Abrupt drops can cause oscillation
Cosine            | Image classification | Smooth, converges well    | Total epochs must be known
Warmup + Cosine   | LLM training         | Stable, converges well    | Total steps must be known
OneCycleLR        | Fast training        | Very fast convergence     | Needs careful tuning
ReduceLROnPlateau | Adaptive training    | No preset schedule needed | May lower the LR too early

Why Warmup Matters

Warmup gradually increases the learning rate at the start of training.

Why it helps:

  1. The initial parameters are random, so early gradient directions are unstable
  2. A large LR at the very start can wreck a good initialization
  3. A few warmup steps give time for:
     - the optimizer's gradient statistics (Adam's m and v) to accumulate
     - BatchNorm running statistics to accumulate
     - the parameters to settle into a stable region

Transformers in particular are hard to train successfully without warmup
(Pre-LN architectures mitigate the problem, but warmup is still standard practice).

Complete PyTorch Example

python
import math

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = nn.Linear(10, 2)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Scheme 1: Cosine Annealing with Warm Restarts
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)
# Restarts every T_0 epochs; each cycle is T_mult times longer than the last

# Scheme 2: warmup + cosine composed with SequentialLR
scheduler_warmup = optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=100
)
scheduler_cosine = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=900)
scheduler = optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[scheduler_warmup, scheduler_cosine],
    milestones=[100]
)

# Scheme 3: custom warmup + cosine with LambdaLR (step-based: 1000 warmup steps, 10000 total)
def lr_lambda(step):
    if step < 1000:
        return step / 1000  # linear warmup
    else:
        return 0.5 * (1 + math.cos(math.pi * (step - 1000) / 9000))

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Training loop (Scheme 1 is epoch-based as configured here; the step-based
# Schemes 2/3 should instead call scheduler.step() after every batch)
for epoch in range(epochs):
    train()
    scheduler.step()  # once per epoch for epoch-based schedules
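
To monitor the schedule while training, the current LR can be read back from the scheduler. A sketch for the step-based Schemes 2/3; total_steps and the elided training step are placeholders:

python
for step in range(total_steps):
    ...                        # forward / backward / optimizer.step() go here
    scheduler.step()
    if step % 100 == 0:
        print(step, scheduler.get_last_lr()[0])  # LR of the first param group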

Frequent Interview Questions

Q1: Why do we need learning rate scheduling?

  • A fixed LR is hard to choose: too large causes early oscillation or divergence, too small makes convergence slow
  • A scheduler gives a smooth transition from "exploration" to "exploitation"
  • In LLM training, warmup + cosine is the de facto standard
  • A good schedule helps keep the model from getting stuck in sharp local minima

Q2: Why does Cosine Annealing work well?

  1. A large LR at the start converges quickly
  2. A small LR at the end allows fine adjustment
  3. The transition in between is smooth, with no sudden jumps
  4. Compared with StepLR's abrupt drops, cosine avoids the resulting oscillation
  5. The Warm Restart variant can hop between local optima

Related Topics