Gating Networks • Jeefy's Blog

Updated Apr 30, 2026 • Created May 03, 2026

Gating networks are the mainstream paradigm for Mixture of Experts (MoE) models.

The model's output is:

$$y = \sum_{i} G(x)_i E_i(x) $$

where $E_i(x)$ is the output of the $i$-th expert model and $G(x)$ is the output of the gating network. Specifically:

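$$\begin{aligned} G(x) &= {\rm softmax}( {\rm top\text{-}k}(H(x), k)) \\ H(x)_i &= (W_g x)_i + {\rm StandardNormal}() \cdot {\rm softplus}((W_{noise} x)_i) \end{aligned} $$

Here ${\rm top\text{-}k}(v, k)$ keeps the $k$ largest components of $v$ and sets the rest to $-\infty$, so their gate values are exactly zero after the softmax. ${\rm StandardNormal}()$ draws an independent sample from ${\mathcal N}(0, 1)$ for each component.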
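The gating formula can be sketched in plain Python. This is only a minimal illustration under a few assumptions: $W_g$ and $W_{noise}$ are stored as lists of rows, each expert is a callable that returns a float, and the helper names (`softplus`, `matvec`, `noisy_logits`, `top_k_gates`, `moe_output`) are made up for this example rather than taken from any library.

```python
import math
import random


def softplus(v):
    # Numerically stable softplus: log(1 + exp(v)).
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def matvec(W, x):
    # W is a list of rows (one row per expert), x is a list of floats.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def noisy_logits(x, W_g, W_noise, rng):
    # H(x)_i = (W_g x)_i + StandardNormal() * softplus((W_noise x)_i)
    clean = matvec(W_g, x)
    scale = [softplus(v) for v in matvec(W_noise, x)]
    return [c + rng.gauss(0.0, 1.0) * s for c, s in zip(clean, scale)]


def top_k_gates(h, k):
    # Softmax over the k largest logits; every other expert gets gate 0,
    # which is what softmax yields once its logit is set to -infinity.
    keep = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k]
    m = max(h[i] for i in keep)
    exps = {i: math.exp(h[i] - m) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(h))]


def moe_output(x, experts, W_g, W_noise, k, rng):
    # y = sum_i G(x)_i * E_i(x); experts with a zero gate are never evaluated.
    gates = top_k_gates(noisy_logits(x, W_g, W_noise, rng), k)
    return sum(g * expert(x) for g, expert in zip(gates, experts) if g > 0.0)
```

Skipping the experts whose gate is zero is the point of the top-k step: only $k$ experts run per sample, no matter how many experts exist.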

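The gating network suffers from two problems. First, the Matthew effect: a few experts are selected too often, and the imbalance reinforces itself. Second, in a distributed setting, if some experts are overloaded (they receive too many samples), the devices hosting them run out of memory or become congested with computation while other devices sit idle, and cluster utilization collapses.

The loss function is therefore designed as follows: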

$$J(\theta) = {\rm Loss} + \omega_{importance} \cdot CV({\rm Importance}(X)) + \omega_{load} \cdot CV({\rm Load}(X)) $$

where $X$ denotes a batch of data and $CV: {\mathbb R}^n \to {\mathbb R}$ is the coefficient of variation (the standard deviation divided by the mean). The remaining quantities are defined as:

$$\begin{aligned} e &= \text{number of experts} \\ {\rm Importance} (X) &= \sum_{x \in X} G(x) \in {\mathbb R}^{e} \\ P(x, i) &= P\Big[H(x)_i > {\rm kth\_excluding}(H(x), k, i)\Big] \\ \implies P(x, i) &= \Phi \left(\frac {(W_g x)_i - {\rm kth\_excluding}(H(x), k, i)} {{\rm softplus}((W_{noise} x)_i)} \right) \\ {\rm Load}(X)_i &= \sum_{x \in X} P(x, i) \end{aligned} $$

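Here ${\rm kth\_excluding}(v, k, i)$ is the $k$-th largest component of $v$ when component $i$ is left out, and $\Phi$ is the CDF (cumulative distribution function) of the standard normal distribution, which gives the probability that the random variable is less than or equal to a given value. It turns the discrete question "is expert $i$ in the top $k$?" into a smooth, differentiable probability. With $Z \sim {\mathcal N}(0, 1)$: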

$$\Phi(z) = P(Z \le z) = \frac 1 {\sqrt {2\pi} }\int_{-\infty}^z e^{-t^2 / 2} \, {\rm d} t $$
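The step from the probability to $\Phi$ works because the only randomness in $H(x)_i$ is the noise term: $H(x)_i$ exceeds a threshold $t$ exactly when $Z > (t - (W_g x)_i) / {\rm softplus}((W_{noise} x)_i)$, and by symmetry of the normal distribution that event has probability $\Phi$ of the negated argument. The sketch below puts the pieces of the balancing loss together in plain Python. It rests on several assumptions: $CV$ uses the population standard deviation, ${\rm Load}(X)_i$ is the batch sum of $P(x, i)$, $k$ is smaller than the number of experts, and all function names are hypothetical.

```python
import math


def cv(values):
    # Coefficient of variation: population standard deviation / mean.
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(var) / mean


def std_normal_cdf(z):
    # Phi(z) for Z ~ N(0, 1), expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kth_excluding(h, k, i):
    # k-th largest component of h (k is 1-based), ignoring component i.
    others = sorted((v for j, v in enumerate(h) if j != i), reverse=True)
    return others[k - 1]


def importance(gates_batch):
    # Importance(X): per-expert sum of the gate vectors G(x) over the batch.
    return [sum(column) for column in zip(*gates_batch)]


def load(clean_batch, scale_batch, noisy_batch, k):
    # Load(X)_i: batch sum of P(x, i), the smooth probability that expert i
    # stays in the top k if its noise is resampled.
    #   clean_batch[n][i] = (W_g x)_i
    #   scale_batch[n][i] = softplus((W_noise x)_i)
    #   noisy_batch[n][i] = H(x)_i
    totals = [0.0] * len(clean_batch[0])
    for clean, scale, noisy in zip(clean_batch, scale_batch, noisy_batch):
        for i in range(len(clean)):
            threshold = kth_excluding(noisy, k, i)
            totals[i] += std_normal_cdf((clean[i] - threshold) / scale[i])
    return totals


def balanced_loss(task_loss, imp, ld, w_importance, w_load):
    # J = Loss + w_importance * CV(Importance(X)) + w_load * CV(Load(X))
    return task_loss + w_importance * cv(imp) + w_load * cv(ld)
```

When every expert receives the same importance and load, both $CV$ terms are zero and `balanced_loss` reduces to the task loss. Any imbalance adds a penalty that grows with the spread across experts.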