The bfloat16 data format can shorten the time it takes to train GPT-style deep learning models while preserving the accuracy of the model on downstream tasks.
In this article, we show how the bfloat16 data format works and how it fits into automatic mixed precision training for large language models, and we share some experimental results.
Automatic mixed precision
Automatic mixed precision is a mode that allows training deep learning models with a mix of single precision floating point (float32) and half precision floating point formats such as float16 or bfloat16.
The benefit of the mixed precision mode lies primarily in performance. It is an optimization technique that lets you train your networks faster without a loss in quality. This works because some layers of a neural network, such as convolutional or linear layers, do not need a high precision level: they have proven to be much faster when executed in float16 or bfloat16. Other operations, such as reductions, often require a higher precision level in order to maintain the same quality of results.
This trade-off of what should be cast to a half precision dtype and what should be kept in single precision is captured in the recipe of the "automatic mixed precision" algorithm. In a nutshell, this recipe measures the performance of the network in its default precision, then walks through adding casts to run the same network in a mixed precision setting, optimizing performance without hurting accuracy.
Mixed precision does not require you to use bfloat16 as the half precision floating point format; however, bfloat16 has shown some benefits over float16. Below we discuss bfloat16 in more detail.
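As an illustration of the idea (outside of the Cerebras software stack), here is a minimal sketch of mixed precision with bfloat16 using PyTorch's generic torch.autocast API; the toy model, sizes, and optimizer are our own assumptions, not part of the recipe above.

import torch
import torch.nn as nn

# Toy setup: the model and data sizes here are illustrative only.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 512, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits = model(inputs)           # linear layers run in bfloat16
    loss = loss_fn(logits, targets)  # loss/reductions are kept in float32 by autocast
loss.backward()                      # weights and gradients remain in float32
optimizer.step()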
The Bfloat16 Floating Point Format
bfloat16 is a 16-bit floating point format for deep learning, comprised of one sign bit, eight exponent bits, and seven mantissa bits. This is different from the industry-standard IEEE 16-bit floating point format, which was not designed with deep learning applications in mind. The figure below shows the internals of three floating point formats: (a) float16: IEEE half precision, (b) float32: IEEE single precision, and (c) bfloat16.
We can see that bfloat16 has a greater dynamic range (more exponent bits) than float16; in fact, its exponent width is identical to that of float32.
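A quick way to check these properties yourself is to print torch.finfo for each dtype (the snippet and its formatting are our own; the limits shown in the comments are the standard values for these formats).

import torch

# Print the largest finite value, smallest normal value, and machine epsilon
# for the three formats discussed above.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "min normal:", info.tiny, "eps:", info.eps)

# Roughly:
#   float16   max ~6.55e4    min normal ~6.10e-5   eps ~9.77e-4
#   bfloat16  max ~3.39e38   min normal ~1.18e-38  eps ~7.81e-3
#   float32   max ~3.40e38   min normal ~1.18e-38  eps ~1.19e-7

Note that bfloat16 matches float32's range but trades away mantissa bits (a much larger eps); as discussed below, deep learning models tend to be more sensitive to range than to mantissa precision.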
Experiments: automatic mixed precision and bfloat16
We experimented with a large number of deep learning networks, and we are happy to share results for the GPT-3 XL network. Comparing the bfloat16 and float16 modes, we see that using bfloat16 increases training throughput by 18% and is significantly less prone to weight growth.
As many readers may know, during training the weights of a neural network are "learned" through an optimization process. If the norms of those weights keep increasing over a long period, it may be an indicator that the model is becoming numerically unstable. If some of the weights become very large, it may also mean that the model is paying undue attention to certain features, i.e. the model is overfitting.
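One simple way to watch for this during training is to log a global weight norm every few steps; the helper below is a hypothetical sketch we added for illustration, not part of the original experiments.

import torch

def global_weight_norm(model: torch.nn.Module) -> float:
    # L2 norm over all trainable parameters; a cheap proxy for weight growth.
    squares = [p.detach().float().pow(2).sum() for p in model.parameters() if p.requires_grad]
    return torch.sqrt(torch.stack(squares).sum()).item()

# Log this periodically: a norm that keeps climbing over many steps can be an
# early warning of numerical instability or overfitting.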
In addition, models trained with bfloat16 show better eval scores. (Eval scores are metrics, such as accuracy, computed on the evaluation set. This data is usually not seen by the model during training, so it is a good way to validate whether the trained model is likely to perform well in practice.)
Benefits of Bfloat16 vs Float16/32
Our experiments demonstrated that choosing bfloat16 is beneficial over pure float32 or a mixed version with float16. It improves training efficiency, uses less memory during training, and saves storage space, all while maintaining the same accuracy level. This is because deep learning models are generally more sensitive to changes in the exponent than in the mantissa. The memory savings come from the fact that it takes fewer bits to store weights in bfloat16 than in float32, so they take less space during training. If you store checkpoints in bfloat16 you can also save disk space.
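As a hypothetical illustration of the checkpoint point (this is generic PyTorch, not the Cerebras checkpointing flow), floating point tensors can be down-cast to bfloat16 before saving, roughly halving the file size relative to float32.

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in for a real network

# Cast floating point tensors to bfloat16 before saving to disk.
state = {name: t.to(torch.bfloat16) if t.is_floating_point() else t
         for name, t in model.state_dict().items()}
torch.save(state, "checkpoint_bf16.pt")

# When resuming, cast back to the training dtype.
restored = {k: v.float() if v.is_floating_point() else v
            for k, v in torch.load("checkpoint_bf16.pt").items()}
model.load_state_dict(restored)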
Training behavior with the bfloat16 setting is robust: it is about as resistant to underflows, overflows, and other numerical instability as training in pure float32, because the exponent size of bfloat16 is the same as that of float32. If you are using float16, its smaller exponent (5 bits) cannot represent the same range of numbers, so some numbers will overflow (go above the representable range) or underflow (go below it). When either of those happens, you will see NaN (not a number) or Inf (infinity) values in the loss instead of a real number, which means the model has diverged and training will stop. Using bfloat16 avoids this numerical instability.
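To make the overflow and underflow behavior concrete, here is a small illustrative snippet (the specific values are ours, chosen only to sit outside float16's range).

import torch

big = torch.tensor(70000.0)   # above float16's largest finite value (~65504)
tiny = torch.tensor(1e-10)    # below float16's smallest subnormal (~6e-8)

print(big.to(torch.float16))    # inf  -> overflow
print(tiny.to(torch.float16))   # 0.   -> underflow to zero

print(big.to(torch.bfloat16))   # still finite (~7.0e4), just coarsely rounded
print(tiny.to(torch.bfloat16))  # still non-zero (~1e-10)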
How to enable Bfloat16
To enable bfloat16 in the mixed precision mode, make the following changes in the config file:
model.use_bfloat16: True
optimizer.loss_scaling_factor: 1.0
model.mixed_precision: True
As you can see, in addition to the changes specific to mixed precision and the bfloat16 parameter, we need to disable loss scaling (by setting the loss scaling factor to 1.0). As described above, bfloat16 has the same exponent size as float32, so it behaves identically with respect to underflows, overflows, and other numerical instability during training. The loss scaling factor was originally introduced for mixed precision with the float16 setting, where scaling the loss was necessary to avoid these side effects. bfloat16 does not require loss scaling, and thus comes close to being a drop-in replacement for float32 when training and running deep neural networks.
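For readers more familiar with generic PyTorch than with the config above, the contrast looks roughly like this sketch (our own illustration, not the Cerebras API): float16 mixed precision pairs autocast with a gradient scaler, while bfloat16 mixed precision does not need one.

import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data = torch.randn(16, 512, device="cuda")

# float16 mixed precision: gradients can underflow, so the loss is scaled.
scaler = torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(data).pow(2).mean()
scaler.scale(loss).backward()   # scale the loss up before backward
scaler.step(optimizer)          # unscale gradients, then update
scaler.update()

# bfloat16 mixed precision: same exponent range as float32, no scaler needed.
optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(data).pow(2).mean()
loss.backward()
optimizer.step()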
Conclusion
In this article, we demonstrated how to enable the bfloat16 dtype within the mixed precision setting on the CS-2. This data format improves deep learning model training time while preserving the same accuracy level.
To try out some of our networks with this setting yourself, please refer to GPT-2, GPT-3, GPT-J and GPT-NeoX references.
Daria Soboleva | January 30, 2023