TOWARD EFFICIENT LOW-PRECISION TRAINING: DATA FORMAT OPTIMIZATION AND HYSTERESIS QUANTIZATION
Seoul National University, Seoul, Korea
ICLR 2022
1 INTRODUCTION
研究背景与意义:
- As larger and more complex neural networks are adopted, the energy and time consumed for training have become a critical issue in hardware implementation.
- Using low-bit representations in training significantly reduces hardware overhead and memory footprint; hence, neural network training with limited precision has been extensively studied recently.
- Low-precision format: FP16 in GPU, bfloat in TPU, 8-bit formats
存在的问题:
- Optimal data format for low-precision training
目前8-bit的训练框架越来越多
8-bit情况下,数据格式也有很多种,应该使用哪一种?
- Performance degradation in from-scratch training
However, in low-precision training where a neural network is trained from scratch using low-precision values and computations, the trained model typically shows a noticeable accuracy drop.
提出的方法:
- We divide quantization in low-precision training into two types: network quantization and data flow quantization
network quantization: W
data flow quantization: A, E, G
- A method that can predict the training performance of various numeric formats for data flow quantization
- An optimal 8-bit format suitable for low-precision training of various models
- A new quantization scheme that utilizes the hysteresis effect to improve the performance of from-scratch training
2 DATA FLOW QUANTIZATION
2.1 NUMERIC FORMATS
本文考虑了多种数据格式的情况:8-bit 定点数,8-bit浮点数(1-xy),float-fix format
几种提出的数据格式,没有细看…
使用了几种约束条件,排除了一些情况之后:
498 formats in total (8, 7, 6-bit)
2.2 ACTIVATION AND ERROR QUANTIZATION
针对A,E,G直接选择了其他论文的量化方法。
Sean Fox, Seyedramin Rasoulinezhad, Julian Faraone, Philip Leong, et al. A block minifloat representation for training deep neural networks. In International Conference on Learning Representations, 2020.
2.3 INDICATORS OF TRAINING PERFORMANCE
cosine相似度
Effect of quantized error :考虑了error计算过程中的量化噪声,对G计算带来的偏差夹角
Effect of quantized activation:考虑了activation计算过程中的量化噪声,对G计算带来的偏差夹角
蓝色是full precision,黄绿红分别是8,7,6-bit
-
夹角是整个网络的。
-
We average angles from 100 mini-batches after quantizing a pre-trained model. 夹角的计算是根据 a pre-trained model计算出来的
- we could determine the optimal format for a specific neural network model, dataset, and task very efficiently . 对不同的数据集,任务,模型计算出各个数据格式带来的夹角。模型-数据集-任务 的组合会有预训练模型,根据预训练模型的结果去反推。
2.4 OPTIMAL FORMAT FOR DATA FLOW QUANTIZATION
we select six models with different architectures, layer types, and target tasks that are widely used in quantized training research for experiments .
Fig. 3 suggests that FP134 and FP143 are the best candidates across all models. For hardware implementation, FP134 is the most optimal format due to its low implementation cost.
3 NETWORK QUANTIZATION
3.1 FLUCTUATION OF WEIGHT PARAMETERS
本文使用了参数 主备份,每次计算前将参数拿出来,量化后进行计算。
本文认为,普通的量化会造成在最优解周围的震荡。参数训练本来应该是平滑改变的。
3.2 HYSTERESIS QUANTIZATION
使用了延迟更新,让量化后的参数更加稳定。
- stabilizing the training process and allowing the network to reach global optima more efficiently.
- if the weight change ∆W is small, then enough number of those changes should be accumulated to flip Qw.
- frequency is now proportional to the weight gradient.
- 如果是主备份+普通更新,可能会在最优值附近波动,从而影响收敛。
- 如果没有主备份,有些参数就无法完成微小的调整。
3.3 ULTRA-LOW-PRECISION FORMAT FOR NETWORK QUANTIZATION
we select 4-bit logarithmic representation as an ultra-low-precision format for weight parameters.
Mostafa Elhoushi, Zihao Chen, Farhan Shafiq, Ye Henry Tian, and Joey Yiwei Li. Deepshift: Towards multiplication-less neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2359–2368, 2021.
This format has the same dynamic range as INT8 which is widely used for weight quantization .
We apply channel-wise quantization to the convolutional layers to compensate for the insufficient expression range and layer-wise quantization to the other types of layers.
Fig. 5 clearly shows that using hysteresis significantly reduces weight change frequency and stabilizes the training process.
4 EXPERIMENTAL RESULTS
4.1 LOW-PRECISION TRAINING SCHEME
We need to quantize four variables: activation, error, weight, and weight gradient.
4.2 8-BIT LOW-PRECISION TRAINING
预训练后fine-tune?
4.3 ULTRA-LOW-PRECISION TRAINING WITH 4-BIT LOGARITHMIC WEIGHTS
5 CONCLUSION
整个流程:用一个能够估计量化后训练效果好坏的预测器,对(模型,数据集,数据格式)进行预测,这个预测不是实际训练,而是通过对比预训练模型得到。然后找到一个合适的数据格式,这个数据格式应该是针对(模型,数据集,数据格式)的,但大多数都是FP134。而对于from-scratch的训练,效果会比较差,所以提出了延迟更新。
- 一个训练效果指示器
- 分析得出FP134比较好
- 延迟更新参数可以让量化波动小