
Knowledge Card: Automatic Differentiation

Basic Info

Attribute            Content
Topic                Automatic Differentiation
Mastery level        ★★★★☆
Learning priority    P0
Estimated time       6 hours
Interview frequency  ★★★☆☆

Core Idea

Automatic differentiation is the core of every deep learning framework. It is neither symbolic nor numerical differentiation: it computes gradients exactly and efficiently by applying the chain rule over a computation graph.

Three ways to compute gradients:
1. Numerical differentiation: f'(x) ≈ (f(x+h) - f(x)) / h
   - truncation error, and slow (one extra evaluation per input dimension)
2. Symbolic differentiation: manipulate the formula algebraically (by hand or a computer algebra system)
   - exact, but suffers from expression swell
3. Automatic differentiation: apply the chain rule over the computation graph
   - exact and efficient ✓ (contrasted with the numerical estimate in the sketch below)
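
To make the truncation-error point concrete, here is a minimal sketch comparing the forward-difference estimate against the exact derivative of f(x) = x² + 2x + 1 (the same test function used later in this card):

python
def f(x):
    return x**2 + 2*x + 1

def numerical_grad(f, x, h=1e-5):
    # Forward difference: truncation error is O(h)
    return (f(x + h) - f(x)) / h

x = 3.0
exact = 2*x + 2  # hand-derived: f'(x) = 2x + 2, so f'(3) = 8
for h in (1e-1, 1e-3, 1e-5):
    approx = numerical_grad(f, x, h)
    print(f"h={h:.0e}  approx={approx:.6f}  error={abs(approx - exact):.2e}")
# The error shrinks with h until floating-point rounding takes over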

Forward Mode vs Reverse Mode

Forward Mode: from input toward output
Carries the tangents ∂v_i/∂x forward through the graph, evaluating
∂y/∂x = ∂y/∂v_n · (∂v_n/∂v_{n-1} · ( ... · (∂v_1/∂x)))
Best when: input dimension << output dimension

Reverse Mode (what DL uses): from output back toward input
Carries the adjoints ∂y/∂v_i backward through the graph, evaluating
∂y/∂x = (((∂y/∂v_n) · ∂v_n/∂v_{n-1}) · ... ) · ∂v_1/∂x
Best when: output dimension << input dimension (e.g., a scalar loss w.r.t. millions of parameters)
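
Forward mode falls out naturally from dual numbers: each value carries its tangent (its derivative w.r.t. one seeded input) alongside it, so a single forward pass produces one column of the Jacobian. A minimal sketch (the Dual class below is illustrative, not from any library):

python
class Dual:
    """Forward-mode AD: carry (value, derivative) through the computation."""
    def __init__(self, val, dot=0.0):
        self.val = val   # primal value v
        self.dot = dot   # tangent ∂v/∂x

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __radd__ = __add__
    __rmul__ = __mul__

# Seed the input with tangent 1.0, then read the derivative off the output
x = Dual(3.0, 1.0)
y = x * x + 2 * x + 1
print(y.val, y.dot)  # 16.0 8.0

Note that one pass yields the derivative w.r.t. the single seeded input; differentiating a scalar loss w.r.t. a million parameters this way would take a million passes, which is why training uses reverse mode instead.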

A Micro Autograd From Scratch

python
import numpy as np

class Tensor:
    """支持自动微分的张量"""
    def __init__(self, data, children=(), op='', requires_grad=False):
        self.data = np.array(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data) if requires_grad else None
        self._backward = lambda: None
        self.children = children
        self.op = op
        self.requires_grad = requires_grad

    def backward(self):
        """Topological sort + backward pass"""
        # Topological sort: children end up before their parents in topo
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for c in v.children:
                    build_topo(c)
                topo.append(v)
        build_topo(self)

        # Backward pass, in reverse topological order
        self.grad = np.ones_like(self.data)  # ∂L/∂L = 1
        for v in reversed(topo):
            v._backward()

    def __add__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        # Set requires_grad at construction so out.grad is allocated
        out = Tensor(self.data + other.data, children=(self, other), op='+',
                     requires_grad=self.requires_grad or other.requires_grad)

        def _backward():
            # d(a+b)/da = d(a+b)/db = 1: pass the upstream gradient through
            if self.requires_grad: self.grad += out.grad
            if other.requires_grad: other.grad += out.grad
        out._backward = _backward
        return out

    def __matmul__(self, other):
        """Matrix multiplication"""
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data @ other.data, children=(self, other), op='@',
                     requires_grad=self.requires_grad or other.requires_grad)

        def _backward():
            # For C = A @ B: dL/dA = dL/dC @ Bᵀ, dL/dB = Aᵀ @ dL/dC
            if self.requires_grad: self.grad += out.grad @ other.data.T
            if other.requires_grad: other.grad += self.data.T @ out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data * other.data, children=(self, other), op='*',
                     requires_grad=self.requires_grad or other.requires_grad)

        def _backward():
            # Product rule: d(ab)/da = b, d(ab)/db = a
            if self.requires_grad: self.grad += out.grad * other.data
            if other.requires_grad: other.grad += out.grad * self.data
        out._backward = _backward
        return out

    def relu(self):
        out = Tensor(np.maximum(0, self.data), children=(self,), op='ReLU',
                     requires_grad=self.requires_grad)

        def _backward():
            if self.requires_grad:
                # Gradient flows only where the input was positive
                self.grad += out.grad * (self.data > 0)
        out._backward = _backward
        return out

    def sum(self):
        out = Tensor(self.data.sum(), children=(self,), op='sum',
                     requires_grad=self.requires_grad)

        def _backward():
            if self.requires_grad:
                # Sum broadcasts the upstream gradient to every element
                self.grad += np.ones_like(self.data) * out.grad
        out._backward = _backward
        return out

# Test: derivative of x² + 2x + 1 at x = 3
x = Tensor(3.0, requires_grad=True)
y = x * x + Tensor(2.0) * x + Tensor(1.0)
y.backward()
print(f"y = x² + 2x + 1, x=3")
print(f"y = {y.data}")
print(f"dy/dx = {x.grad}")  # 2*3+2 = 8

PyTorch's Autograd Mechanism

python
import torch

# requires_grad marks tensors whose gradients should be computed
x = torch.tensor([3.0], requires_grad=True)

# The computation graph is built automatically
y = x ** 2 + 2 * x + 1
# y.grad_fn = <AddBackward0>

y.backward()  # trigger backpropagation
print(x.grad)  # tensor([8.])

# Disable gradient tracking (inference/evaluation)
with torch.no_grad():
    y = x ** 2 + 2 * x + 1
    # no graph is built, which saves memory

# Detach from the computation graph
z = y.detach()  # z has the same values as y but is not tracked
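
One behavior worth knowing alongside this: .grad buffers accumulate across backward() calls rather than being overwritten, which is why training loops zero them each step. A quick demonstration:

python
import torch

x = torch.tensor([3.0], requires_grad=True)
(x ** 2).backward()
print(x.grad)   # tensor([6.])
(x ** 2).backward()
print(x.grad)   # tensor([12.]) -- accumulated, not replaced
x.grad.zero_()  # reset before the next step (optimizers do this via zero_grad())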

Frequently Asked Interview Questions

Q1: Automatic vs. numerical vs. symbolic differentiation?

Aspect         Numerical            Symbolic            Automatic
Accuracy       truncation error     exact               exact
Speed          slow (O(n) evals)    expression swell    fast
Applicability  any function         closed-form exprs   differentiable programs
Use in DL      debugging/checking   impractical         the core approach
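
The "debugging/checking" cell is how numerical differentiation earns its keep in practice: gradient checking. A minimal sketch validating PyTorch's gradient with a central difference (PyTorch also ships torch.autograd.gradcheck for this purpose):

python
import torch

x = torch.tensor([3.0], requires_grad=True)
y = x ** 2 + 2 * x + 1
y.backward()

# Central difference: O(h²) truncation error, good enough to catch bugs
h = 1e-4
with torch.no_grad():
    num = (((x + h) ** 2 + 2 * (x + h) + 1)
           - ((x - h) ** 2 + 2 * (x - h) + 1)) / (2 * h)
print(x.grad, num)  # both ≈ tensor([8.])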

Q2: Why does DL use Reverse Mode?

A: In DL the output is usually a scalar loss, while the inputs are an enormous number of parameters (millions to billions).

  • Forward Mode: needs N passes (N = number of parameters)
  • Reverse Mode: a single backward pass yields the gradients of all parameters
  • This is why backpropagation only needs a single pass! (see the sketch below)
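
The "one pass, all gradients" claim in code: a single backward() on a scalar loss fills in .grad for every parameter tensor at once (a minimal sketch):

python
import torch

w1 = torch.randn(5, requires_grad=True)
w2 = torch.randn(5, requires_grad=True)
loss = (w1 * w2).sum()  # scalar loss depending on 10 parameters

loss.backward()  # one reverse pass
print(w1.grad.shape, w2.grad.shape)  # both populated: torch.Size([5]) each
# Forward mode would need one pass per parameter (10 here)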

Q3: What is PyTorch's detach() for?

  1. Blocking gradient flow: e.g., in GAN training, detach the generator's output while the generator is held fixed (see the sketch after this list)
  2. Saving memory: tensors that don't need gradients are cut off from the computation graph
  3. Reading values: when you need a tensor's value without it participating in gradient computation
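
A minimal sketch of use 1, with stand-in generator/discriminator modules G and D (plain Linear layers here, purely illustrative):

python
import torch

G = torch.nn.Linear(8, 4)  # stand-in generator: noise -> fake sample
D = torch.nn.Linear(4, 1)  # stand-in discriminator: sample -> score

z = torch.randn(2, 8)
fake = G(z)

# Discriminator step: detach so no gradients flow back into G
d_loss = D(fake.detach()).mean()
d_loss.backward()  # populates D's grads; G's parameters stay untouched

# Generator step: keep the graph so G receives gradients through D
g_loss = -D(fake).mean()
g_loss.backward()  # now G's parameters get gradients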

Related Topics