[1] Motivation of the Problem

A series of experiments conducted for SIP for GPT.

📌 Code Release: parchgrad/v23.11.07.1

Gradient Aggregation Phenomena

[2023.11.05] The impact of a large number of small-magnitude gradients.

Because the signs of the gradients are flipped by the normally distributed weights, the aggregate impact of many gradients is magnified through the variance of a sum of random variables. As a result, a large number of small signals has a high impact. The figure below shows an example of this effect. We sample 10 high-magnitude gradients $\mathcal{H}$ (blue) and 100 gradients $\mathcal{L}$ whose magnitude is 10 times smaller. Contrary to the intuition that $\mathcal{H}$ should dominate, the resulting direction is strongly affected by $\mathcal{L}$ (see Appendix for details).

Although each random gradient in the $\mathcal{L}$ group has a small magnitude, the large number of gradients yields a large summed magnitude that changes the direction from the blue arrow to the black arrow.

Let $\mathcal{H}$ be a group of 10 random gradients whose first entry is sampled from $\mathcal{N}(0, 0.1)$ and whose second entry is sampled from $\mathcal{N}(0, 1.0)$, and let $\mathcal{L}$ be a group of 100 random gradients whose entries are both sampled from $\mathcal{N}(0, 0.1)$. The summed vector of each group and of the combined group is shown in the figure below. Although the $\mathcal{H}$ group is dominant along Dim2 $(0,1)$, the summed vector of the $\mathcal{L}$ group is dominant along Dim1 $(1,0)$ because of the large number of vectors. Consequently, the final direction is close to the $(1,1)$ direction (black). This experiment illustrates the impact of a large number of irrelevant channels (red).
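A minimal numpy sketch of this toy setup (our own reconstruction; we read $\mathcal{N}(\mu, \sigma)$ as mean and standard deviation, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# H: 10 gradients with a small first entry and a large second entry.
H = np.stack([rng.normal(0.0, 0.1, size=10),    # Dim1 ~ N(0, 0.1)
              rng.normal(0.0, 1.0, size=10)],   # Dim2 ~ N(0, 1.0)
             axis=1)

# L: 100 gradients whose entries are both small.
L = rng.normal(0.0, 0.1, size=(100, 2))         # Dim1, Dim2 ~ N(0, 0.1)

sum_H = H.sum(axis=0)
sum_L = L.sum(axis=0)
combined = sum_H + sum_L

# The std of a sum of n iid N(0, sigma) variables is sigma * sqrt(n):
#   H: Dim1 ~ 0.1*sqrt(10)  ~ 0.32,  Dim2 ~ 1.0*sqrt(10) ~ 3.16
#   L: Dim1 ~ 0.1*sqrt(100) = 1.0,   Dim2 ~ 0.1*sqrt(100) = 1.0
# so the Dim1 component of the combined vector is driven mostly by L.
print("sum H   :", sum_H)
print("sum L   :", sum_L)
print("combined:", combined)
```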

Conserve Magnitude

ResNet18

For ResNet18, conserving the gradient variance produces gradient signals on more of the bounding-box regions. Because only a small number of channels is used in the convolutional modules, a larger gap is obtained.
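As a rough sketch of what conserving the gradient variance during channel filtering can look like (this is our illustration, not the released implementation; the channel-selection mask `keep_mask` is a placeholder):

```python
import torch

def make_variance_conserving_hook(keep_mask):
    # keep_mask: (C,) boolean tensor marking the channels to keep (hypothetical selection rule).
    def forward_hook(module, inputs, output):
        def rescale_grad(grad):  # grad: (B, C, H, W) gradient w.r.t. the conv output
            mask = keep_mask.view(1, -1, 1, 1).to(dtype=grad.dtype, device=grad.device)
            var_before = grad.var()
            filtered = grad * mask                                  # zero out filtered channels
            scale = (var_before / (filtered.var() + 1e-12)).sqrt()  # conserve the gradient variance
            return filtered * scale
        output.register_hook(rescale_grad)
    return forward_hook

# Usage (hypothetical):
# handle = conv.register_forward_hook(make_variance_conserving_hook(mask))
```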

VGG16

Because VGG16 has only sequential gradient computation, conserving the gradient variance has no effect.

EfficientNetB0

Compared with ResNet, EfficientNet has more submodules with different kernel sizes. We conjecture that filtering convolutions in EfficientNet requires further discussion.

Experimental Setting

The experiments above filter the sequential convolutions. In the paper, however, we filter only the upper part of the network to remove unrelated concepts.

Hook Modules

```python
# ResNet18: select the convolutions in the last two stages.
selected_convolutions = []
for layer in [self.model.layer3, self.model.layer4]:
    for basic_block in layer:
        for name in ['conv1', 'conv2', 'conv3']:
            if hasattr(basic_block, name):
                conv = getattr(basic_block, name)
                selected_convolutions.append(conv)

# VGG16: select the convolutions in the upper half of the feature extractor.
selected_convolutions = []
num = len(self.model.features)
start_num = int(num * 0.5)
print("[INFO] hook upper half convolutions")
for n in range(start_num, num):
    layer = self.model.features[n]
    if layer.__class__.__name__ == "Conv2d":
        selected_convolutions.append(layer)

# EfficientNet B0: select convolutions from the last feature blocks.
selected_convolutions = []
for num in [6, 7]:
    layer = self.model.features[num]
    conv = layer[0].block[1][0]
    selected_convolutions.append(conv)
selected_convolutions.append(self.model.features[8][0])
```
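For completeness, a minimal example of attaching hooks to the selected convolutions (an illustration only; the release code applies its own filtering logic inside the hooks):

```python
import torch

# Capture the gradient flowing into each selected convolution's output
# so it can later be inspected or filtered.
captured_grads = {}

def make_capture_hook(name):
    def hook(module, grad_input, grad_output):
        captured_grads[name] = grad_output[0].detach()
    return hook

handles = [
    conv.register_full_backward_hook(make_capture_hook(f"conv_{i}"))
    for i, conv in enumerate(selected_convolutions)
]

# ... forward pass, loss.backward() ...
# for h in handles:
#     h.remove()
```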