The bfloat16 data format can shorten the time it takes to train GPT-style deep learning models while preserving the accuracy of the model on downstream tasks.
In this article, we show how the bfloat16 data format works and how it fits into automatic mixed precision training for large language models, and we share some experimental results.
Automatic mixed precision
Automatic mixed precision is a mode that allows training deep learning models with a mix of single precision floating point (float32) and half precision floating point formats such as float16 or bfloat16.
The benefit of the mixed precision mode lies primarily in performance. It is an optimization technique that lets you train your networks faster without a loss in quality. This works because some layers of a neural network, such as convolutional or linear layers, do not need a high precision level: they have proven to be much faster when executed in float16 or bfloat16. Other operations, such as reductions, often require a higher precision level in order to maintain the same quality of results.
This trade-off of what should be cast to a half precision dtype and what should be kept in single precision is captured in the recipe of the "automatic mixed precision" algorithm. In a nutshell, this recipe measures the performance of the network in its default precision, then walks through adding casts to run the same network in a mixed precision setting, optimizing performance without hurting accuracy.
Mixed precision does not require you to use bfloat16 as the half precision floating point format; however, bfloat16 has shown some benefits over float16. Below we discuss bfloat16 in more detail.
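As an illustration of the idea (outside of the Cerebras software stack), here is a minimal sketch of mixed precision with bfloat16 using PyTorch's generic torch.autocast API; the toy model, sizes, and optimizer are our own assumptions, not part of the recipe above.

import torch
import torch.nn as nn

# Toy setup: the model and data sizes here are illustrative only.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 512, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits = model(inputs)           # linear layers run in bfloat16
    loss = loss_fn(logits, targets)  # loss/reductions are kept in float32 by autocast
loss.backward()                      # weights and gradients remain in float32
optimizer.step()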
The Bfloat16 Floating Point Format
bfloat16 is a 16-bit floating point format for deep learning, comprised of one sign bit, eight exponent bits, and seven mantissa bits. This is different from the industry-standard IEEE 16-bit floating point format, which was not designed with deep learning applications in mind. The figure below shows the internals of three floating point formats: (a) float16: IEEE half precision, (b) float32: IEEE single precision, and (c) bfloat16.
We can see that bfloat16 has a greater dynamic range (more exponent bits) than float16; in fact, its exponent width is identical to that of float32.
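A quick way to check these properties yourself is to print torch.finfo for each dtype (the snippet and its formatting are our own; the limits shown in the comments are the standard values for these formats).

import torch

# Print the largest finite value, smallest normal value, and machine epsilon
# for the three formats discussed above.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "min normal:", info.tiny, "eps:", info.eps)

# Roughly:
#   float16   max ~6.55e4    min normal ~6.10e-5   eps ~9.77e-4
#   bfloat16  max ~3.39e38   min normal ~1.18e-38  eps ~7.81e-3
#   float32   max ~3.40e38   min normal ~1.18e-38  eps ~1.19e-7

Note that bfloat16 matches float32's range but trades away mantissa bits (a much larger eps); as discussed below, deep learning models tend to be more sensitive to range than to mantissa precision.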
Experiments: automatic mixed precision and bfloat16
We experimented with a large number of deep learning networks, and we are happy to share results for the GPT-3 XL network. Comparing the bfloat16 and float16 modes, we see that using bfloat16 increases training throughput by 18% and is significantly less prone to weight growth.
As many readers may know, during training the weights of a neural network are "learned" through an optimization process. If the norms of those weights keep increasing over a long period, it may be an indicator that the model is becoming numerically unstable. If some of the weights become very large, it may also mean that the model is paying undue attention to certain features, i.e. the model is overfitting.
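One simple way to watch for this during training is to log a global weight norm every few steps; the helper below is a hypothetical sketch we added for illustration, not part of the original experiments.

import torch

def global_weight_norm(model: torch.nn.Module) -> float:
    # L2 norm over all trainable parameters; a cheap proxy for weight growth.
    squares = [p.detach().float().pow(2).sum() for p in model.parameters() if p.requires_grad]
    return torch.sqrt(torch.stack(squares).sum()).item()

# Log this periodically: a norm that keeps climbing over many steps can be an
# early warning of numerical instability or overfitting.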
In addition, models trained with bfloat16 show better eval scores. (Eval scores are metrics, such as accuracy, computed on the evaluation set. This data is usually not seen by the model during training, so it is a good way to validate whether the trained model is likely to perform well in practice.)
Benefits of Bfloat16 vs Float16/32
Our experiments demonstrated that choosing bfloat16 is beneficial over pure float32 or a mixed version with float16. It improves training efficiency, uses less memory during training, and saves storage space, all while maintaining the same accuracy level. This is because deep learning models are generally more sensitive to changes in the exponent than in the mantissa. The memory savings come from the fact that it takes fewer bits to store weights in bfloat16 than in float32, so they take less space during training. If you store checkpoints in bfloat16 you can also save disk space.
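As a hypothetical illustration of the checkpoint point (this is generic PyTorch, not the Cerebras checkpointing flow), floating point tensors can be down-cast to bfloat16 before saving, roughly halving the file size relative to float32.

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in for a real network

# Cast floating point tensors to bfloat16 before saving to disk.
state = {name: t.to(torch.bfloat16) if t.is_floating_point() else t
         for name, t in model.state_dict().items()}
torch.save(state, "checkpoint_bf16.pt")

# When resuming, cast back to the training dtype.
restored = {k: v.float() if v.is_floating_point() else v
            for k, v in torch.load("checkpoint_bf16.pt").items()}
model.load_state_dict(restored)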
Training behavior with the bfloat16 setting is robust: it is about as resistant to underflows, overflows, and other numerical instability as training in pure float32, because the exponent size of bfloat16 is the same as that of float32. If you are using float16, its smaller exponent (5 bits) cannot represent the same range of numbers, so some numbers will overflow (go above the representable range) or underflow (go below it). When either of those happens, you will see NaN (not a number) or Inf (infinity) values in the loss instead of a real number, which means the model has diverged and training will stop. Using bfloat16 avoids this numerical instability.
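To make the overflow and underflow behavior concrete, here is a small illustrative snippet (the specific values are ours, chosen only to sit outside float16's range).

import torch

big = torch.tensor(70000.0)   # above float16's largest finite value (~65504)
tiny = torch.tensor(1e-10)    # below float16's smallest subnormal (~6e-8)

print(big.to(torch.float16))    # inf  -> overflow
print(tiny.to(torch.float16))   # 0.   -> underflow to zero

print(big.to(torch.bfloat16))   # still finite (~7.0e4), just coarsely rounded
print(tiny.to(torch.bfloat16))  # still non-zero (~1e-10)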
How to enable Bfloat16
To enable bfloat16 in the mixed precision mode, make the following changes in the config file:
model.use_bfloat16: True
optimizer.loss_scaling_factor: 1.0
model.mixed_precision: True
As you can see, in addition to the changes specific to mixed precision and the bfloat16 parameter, we need to disable loss scaling (by setting the loss scaling factor to 1.0). As described above, bfloat16 has the same exponent size as float32, so it behaves identically with respect to underflows, overflows, and other numerical instability during training. The loss scaling factor was originally introduced for mixed precision with the float16 setting, where scaling the loss was necessary to avoid these side effects. bfloat16 does not require loss scaling, and thus comes close to being a drop-in replacement for float32 when training and running deep neural networks.
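For readers more familiar with generic PyTorch than with the config above, the contrast looks roughly like this sketch (our own illustration, not the Cerebras API): float16 mixed precision pairs autocast with a gradient scaler, while bfloat16 mixed precision does not need one.

import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data = torch.randn(16, 512, device="cuda")

# float16 mixed precision: gradients can underflow, so the loss is scaled.
scaler = torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(data).pow(2).mean()
scaler.scale(loss).backward()   # scale the loss up before backward
scaler.step(optimizer)          # unscale gradients, then update
scaler.update()

# bfloat16 mixed precision: same exponent range as float32, no scaler needed.
optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(data).pow(2).mean()
loss.backward()
optimizer.step()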
Conclusion
In this article, we demonstrated how to enable the bfloat16 dtype within the mixed precision setting on the CS-2. This data format improves deep learning model training time while preserving the same accuracy level.
To try out some of our networks with this setting yourself, please refer to GPT-2, GPT-3, GPT-J and GPT-NeoX references.
Daria Soboleva | January 30, 2023