A series of experiments conducted for SIP for GPT
Code Release: parchgrad/v23.11.07.1
[2023.11.05] The impact of a large number of small-magnitude gradients.
Because the signs of the gradients are flipped by the normally distributed weights, the impact of multiple gradients is magnified through the variance of a sum of multiple random variables. As a result, a large number of small signals has a high impact. The figure below shows an example of this effect. We sample 10 high-magnitude gradients $\mathcal{H}$ (blue) and 100 gradients $\mathcal{L}$ of 10 times smaller magnitude. Contrary to the intuition that $\mathcal{H}$ should dominate, the resulting direction is strongly affected by $\mathcal{L}$ (see Appendix for details).
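A back-of-the-envelope check of this claim (our own calculation, assuming zero-mean entries and that the second argument of $\mathcal{N}(\mu,\sigma)$ denotes the standard deviation): for $n$ i.i.d. entries with standard deviation $\sigma$,

$$\operatorname{std}\!\left(\sum_{i=1}^{n} g_i\right) = \sqrt{n}\,\sigma,$$

so the 100 small gradients contribute a summed magnitude of about $\sqrt{100}\cdot 0.1 = 1$ per dimension, on the same order as a single high-magnitude gradient with $\sigma = 1$.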
Let $\mathcal{H}$ be a group of 10 random gradients whose first entry is sampled from $\mathcal{N}(0, 0.1)$ and whose second entry is sampled from $\mathcal{N}(0.0, 1.0)$, and let $\mathcal{L}$ be a group of 100 random gradients whose entries are both sampled from $\mathcal{N}(0, 0.1)$. The summed vector of each group and of the combined group is shown in the figure below. Although the $\mathcal{H}$ group is dominant in Dim2 $(0,1)$, the summed vector of the $\mathcal{L}$ group is dominant in Dim1 $(1,0)$ because of the large number of vectors. Consequently, the final direction is the $(1,1)$ direction (black). This experiment shows the impact of a large number of irrelevant channels (red).
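The sampling above can be reproduced in a few lines of NumPy. This is a minimal sketch of our own (not the released parchgrad code), again assuming the second argument of $\mathcal{N}(\mu,\sigma)$ is the standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)

# H: 10 "high-magnitude" gradients; Dim1 ~ N(0, 0.1), Dim2 ~ N(0, 1.0).
H = np.stack([rng.normal(0.0, 0.1, size=10),
              rng.normal(0.0, 1.0, size=10)], axis=1)   # shape (10, 2)

# L: 100 "low-magnitude" gradients; both dims ~ N(0, 0.1).
L = rng.normal(0.0, 0.1, size=(100, 2))                 # shape (100, 2)

sum_H = H.sum(axis=0)      # dominated by Dim2 (std ~ sqrt(10) * 1.0)
sum_L = L.sum(axis=0)      # Dim1 std ~ sqrt(100) * 0.1 = 1, not negligible
combined = sum_H + sum_L   # Dim1 driven mainly by L, Dim2 by H

print("sum over H:", sum_H)
print("sum over L:", sum_L)
print("combined  :", combined)
```

Across repeated seeds, the Dim1 component of the combined sum has standard deviation around 1, far from negligible next to the Dim2 component of roughly $\sqrt{11} \approx 3.3$, despite each $\mathcal{L}$ vector being 10 times smaller.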
For ResNet18, conserving the gradient variance yields gradient signals on more of the bounding-box regions. Because only a small number of channels is used in the convolutional modules, a larger gap is obtained.
Because VGG16 has only sequential gradient computation, the conservation of gradient variance has no effect.
ResNet has more submodules with different kernel sizes. We conjecture that filtering convolutions in EfficientNet requires further discussion.
The above experiment is done by filtering sequential convolutions. In the paper, however, we filter only the upper parts to remove unrelated concepts.