Abstract

Training large language models with long sequence lengths is expensive and often impractical due to long training times. In this article, we introduce a simple-to-use training method called Variable Sequence Length (VSL) training, which can reduce the wall-clock time of training large language models with long-sequence-length capabilities without any changes to the model architecture or training hyperparameters. Training a GPT model using VSL (2k sequence length followed by 8k sequence length) uses 29% fewer FLOPs than training with an 8k sequence length all the way through, while achieving the same model performance.

Introduction

The recent popularity of next-generation AI assistants powered by large language models (LLMs), like ChatGPT and Claude, has increased the demand for long-context capabilities, especially for applications like long multi-turn conversations, document summarization, and code completion, which require the model to understand long-range dependencies. Figure 1 shows the rapid growth in sequence lengths supported by some prevalent foundation models.

Figure 1: Sequence (context) lengths supported by various state-of-the-art foundation models. There is an increasing demand for models supporting longer sequence lengths to enable many real-world applications, as reflected in the exponential growth of models supporting exceedingly long sequence lengths (up to 100k). This figure is sourced from the Hazy Research blog on long context-length training and modified to reflect the latest trends in the industry (as of June 2023).

An inherent challenge in scaling to long sequence lengths is the quadratic scaling of memory and compute in self-attention, which limits the training of most large GPT models to sequence lengths of around 2k. Recent techniques like Flash Attention [1] and Memory-Efficient Attention [2] drastically reduce the memory overhead by splitting the attention computation into smaller sub-blocks. Other methods, such as ALiBi [3], incorporate contextual dependencies during model training to enable scaling of sequence lengths during inference. These techniques, however, do not fully address the quadratic growth in compute as we scale to longer sequences. Sparse attention methods such as Longformer [4] and Performer [5] reduce computation and memory costs by approximating the full attention pattern using mechanisms such as sliding windows or kernel representations. However, these methods require changes to the model architecture, often lead to a drop in model quality, and need optimized implementations to accelerate training on hardware.
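To make the quadratic term concrete, the short sketch below counts the elements of the per-head attention score matrix at different sequence lengths; the head count is an illustrative assumption rather than a property of any specific model discussed here.

```python
# Illustrative only: each attention head materializes an (s x s) score matrix,
# so both its memory and the matmul that produces it grow quadratically in s.
def attention_score_elements(seq_len: int, n_heads: int = 12) -> int:
    # One (seq_len x seq_len) score matrix per head, per layer.
    return n_heads * seq_len * seq_len

for s in (2048, 8192):
    print(f"seq_len={s:5d} -> score-matrix elements per layer: "
          f"{attention_score_elements(s):,}")

# Going from 2k to 8k (4x longer sequences) inflates this term by 16x,
# which is why naive long-sequence training is so much more expensive.
```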

Complementary to these approaches, we explore an efficient strategy to pre-train large language models (LLMs) with the desired long-context capabilities by reducing the overall cost of self-attention throughout training without using any approximations. We revisit a simple, staged training recipe from Shortformer [6] in the context of GPT [7, 8] models, which power most state-of-the-art language applications today. We call this method Variable Sequence Length (VSL) training: we pre-train mostly with short sequences and use the desired longer sequences for only a small fraction of pre-training. VSL training does not introduce any architectural changes, so we can compound the gains from techniques such as Flash Attention [1] and Performer [5] to improve training times further. Compared to naïve pre-training approaches that train with long sequence lengths all the way through, the recipe requires 29% fewer FLOPs for pre-training a compact GPT model without sacrificing model quality.

In the rest of the blog, we will discuss the fundamental idea behind VSL training and show the simplicity of enabling this method using the Cerebras Software Platform (CSoft). We follow this up with experimental results and qualitative insights which highlight the following key advantages of the method:

  • It requires no architectural change or additional hyperparameter tuning.
  • It works independently of position embeddings and is additive to the gains from using better-designed position embeddings.
  • It generalizes well across different training configurations, scaling with model size and the number of training tokens.

Variable Sequence Length Training

The pre-training of most LLMs is done with a fixed sequence length, for example, 2k tokens per sequence, throughout the entire training duration. One can follow this standard training method for models with long sequences, such as 8k and above. However, this will increase training times due to the compute overhead introduced by longer sequences.

Figure 2 depicts the Variable Sequence Length (VSL) method, which breaks standard LLM training into two stages. In the first stage, Stage-1, the model is trained with sequence lengths much shorter than the desired long sequence length, for example, 2k tokens per sequence. This is followed by an adaptation or fine-tuning phase, called Stage-2, where the model is trained at the desired long sequence length, for example, 8k tokens per sequence. The choice of the short sequence length and the fraction of pre-training steps spent in Stage-1 depend on the desired reduction in pre-training FLOPs and the available compute resources: the shorter the Stage-1 sequence length and the larger its fraction of the total steps, the larger the reduction in pre-training FLOPs. We do not modify the optimizer state, learning rate schedule, or any other hyperparameters between the two stages.

Figure 2: VSL Training. The method breaks a standard GPT training into two stages, characterized by a short sequence length phase for a fraction of the total steps followed by a long sequence length phase for the remainder of the training.
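The sketch below illustrates this two-stage schedule in a PyTorch-style training loop, assuming a toy causal language model and random token data as stand-ins for a real GPT model and dataset; the sequence lengths, model dimensions, and optimizer settings are illustrative placeholders, not the configuration used in our experiments. The key point is that only the sequence length (and, to keep tokens per step constant, the batch size) changes at the stage boundary.

```python
# Minimal two-stage VSL sketch (illustrative; not the actual training setup).
import torch
import torch.nn as nn

VOCAB = 512

class TinyCausalLM(nn.Module):
    """A toy stand-in for a GPT-style decoder."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        s = tokens.size(1)
        causal_mask = torch.triu(torch.full((s, s), float("-inf")), diagonal=1)
        return self.lm_head(self.encoder(self.embed(tokens), mask=causal_mask))

def train_vsl(model, total_steps=20, stage1_frac=0.9,
              short_len=128, long_len=512, tokens_per_step=1024):
    # In the actual recipe the lengths would be, e.g., 512 -> 2048 or 2k -> 8k.
    # Same optimizer and LR schedule throughout; only seq_len/batch_size change.
    opt = torch.optim.AdamW(model.parameters(), lr=6e-4)
    loss_fn = nn.CrossEntropyLoss()
    stage1_steps = int(total_steps * stage1_frac)
    for step in range(total_steps):
        seq_len = short_len if step < stage1_steps else long_len
        batch_size = tokens_per_step // seq_len  # keep tokens/step constant
        tokens = torch.randint(0, VOCAB, (batch_size, seq_len))
        logits = model(tokens[:, :-1])
        loss = loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

train_vsl(TinyCausalLM())
```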

Experimental Results and Insights

We apply the VSL recipe to pre-training a GPT-3 [8] model (125M) on the Pile dataset [9], using the standard hyperparameter settings from the paper [8] but training for 2.5B tokens following the recommendations from Chinchilla [10]. We set the sequence length to 512 in Stage-1 and to 2048 in Stage-2, and do not modify any other hyperparameters during the entire pre-training process.

First, we train the model with three different position embeddings: no position embeddings (NoPE) [11], learned position embeddings [12], and rotary position embeddings (RoPE) [13]. Table 1 shows the average evaluation scores on tasks from the Open LLM Leaderboard for the models trained using VSL compared to a baseline model trained with a sequence length of 2048. Our experiments show that models trained to the desired long sequence length with the VSL recipe achieve evaluation scores similar to models trained entirely with long sequences, while VSL training uses 13.8% fewer FLOPs than training with a 2k sequence length all the way through. We observe similar trends when training larger models using VSL; more details about this experiment can be found in Appendix B.

Table 1: Average evaluation scores (higher is better) on the Open LLM Leaderboard for GPT-3 Small models using different position embeddings. The first row reports scores for VSL training, the second row for training with a 512 sequence length only, and the last row for dense training with a 2k sequence length only. VSL training achieves scores similar to the 2k sequence length baseline for most position embeddings while using 13.8% fewer FLOPs than the baseline model. A detailed breakdown of the individual task metrics can be found in Appendix A.

We further explore VSL training with ALiBi [3] position embeddings. Following contemporary trends where training compact models for longer durations is beneficial [14], we pre-train a 111M GPT-3 model with ALiBi [3] embeddings for 41B tokens. Table 2 shows the average evaluation scores on tasks from the Open LLM Leaderboard at sequence length 8k, highlighting the advantage of VSL training. With Stage-2 comprising only 25% of training, the model achieves evaluation scores on par with the model trained with an 8k sequence length only.

Table 2: Average evaluation scores (higher is better) on the Open LLM Leaderboard at sequence length 8k for the 111M GPT-3 model. The first row reports scores for VSL training, the second row for training with a 2k sequence length only, and the last row for dense training with an 8k sequence length only. The model trained with VSL matches the 8k baseline while using 29% fewer pre-training FLOPs. We also note that the model trained with a 2k sequence length does not degrade much, as it relies on ALiBi embeddings for sequence length extrapolation during evaluation. A detailed breakdown of the individual task metrics can be found in Appendix C.

VSL also compounds the gains from improvements in modeling techniques like ALiBi while reducing pre-training FLOPs. Figure 4 shows that VSL training benefits from the inference-time length generalization of ALiBi: when evaluating at a sequence length of 32k, our VSL model offers extrapolation capabilities similar to the model trained entirely at 8k. In Figure 5, we show the breakdown of pre-training FLOPs for dense training (with an 8k sequence length only) and VSL training for the 111M GPT model. By reducing the time spent in the long sequence length phase (Stage-2), VSL training requires 29% fewer FLOPs than a model trained with an 8k sequence length all the way through.

Figure 4: Evaluation loss (lower is better) on the Pile at a maximum sequence length (MSL) of 32k for GPT-3 (111M). The model trained with VSL extrapolates as well as the 8k baseline model. Note that the extrapolation capabilities come from using ALiBi position embeddings in the model.

Figure 5: FLOPs spent pre-training the 111M GPT-3 model with ALiBi position embeddings on 41B tokens. The green bars indicate the FLOPs used for pre-training with an 8k sequence length, and the orange bar indicates the FLOPs used for pre-training with a 2k sequence length. VSL training reduces FLOPs by 29% over standard dense training with an 8k sequence length. Note that the reduction in FLOPs from VSL scales proportionally with the increase in the desired context-length capability.
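To see where the savings come from, the sketch below uses a common approximation of transformer training compute, roughly 6N FLOPs per token for the parameter matmuls plus an attention term that grows linearly with sequence length, to compare dense 8k training against a 75%/25% VSL schedule. The model dimensions and the constant in the attention term are rough assumptions for a ~111M-parameter configuration, so the resulting ratio is indicative rather than an exact reproduction of the 29% figure reported above.

```python
# Rough FLOPs accounting (assumption: fwd+bwd ~= 6*N FLOPs/token for parameter
# matmuls plus ~12*n_layer*d_model*seq_len FLOPs/token for attention matmuls).
def train_flops(tokens, seq_len, n_params, n_layer, d_model):
    per_token = 6 * n_params + 12 * n_layer * d_model * seq_len
    return tokens * per_token

# Hypothetical ~111M-parameter-scale dimensions, for illustration only.
N, LAYERS, D_MODEL = 111e6, 10, 768
TOTAL_TOKENS = 41e9

dense_8k = train_flops(TOTAL_TOKENS, 8192, N, LAYERS, D_MODEL)
vsl = (train_flops(0.75 * TOTAL_TOKENS, 2048, N, LAYERS, D_MODEL) +
       train_flops(0.25 * TOTAL_TOKENS, 8192, N, LAYERS, D_MODEL))

print(f"VSL uses {100 * (1 - vsl / dense_8k):.0f}% fewer FLOPs than dense 8k")
```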

Qualitative Analysis and Insights

In this section, we analyze the self-attention patterns (as heatmaps) of GPT-3 Small (125M) models trained with VSL and provide insights into why Stage-2 training is effective. We use the model trained with no position embeddings (NoPE) from Table 1, with Stage-1 set to 90% of pre-training using short sequences of 512 tokens. We use the NoPE setting as it simplifies the analysis of attention patterns while being competitive with learned position embeddings. We select three heads with distinctive attention patterns that handle short, medium, and long-range dependencies. For the heatmaps, the x- and y-axes represent the token positions of the key (K) and query (Q) matrices, respectively, and a darker color represents a higher attention score between two token positions.
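For readers who want to reproduce this style of analysis, the sketch below shows one way to extract and plot per-head attention heatmaps. It uses a public GPT-2 checkpoint from Hugging Face as a stand-in model, and the layer and head indices, sample text, and styling are arbitrary illustrative choices, not the exact setup used for Figures 6 and 7.

```python
# Sketch: extract attention heatmaps from a causal LM (GPT-2 as a stand-in).
import torch
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)
model.eval()

text = "Long documents require the model to track long-range dependencies. " * 8
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    out = model(**inputs)

layer, heads = 7, (0, 3, 9)        # arbitrary layer and heads to inspect
attn = out.attentions[layer][0]    # shape: (num_heads, seq_len, seq_len)

fig, axes = plt.subplots(1, len(heads), figsize=(12, 4))
for ax, h in zip(axes, heads):
    # x-axis: key (K) positions, y-axis: query (Q) positions; darker = higher.
    ax.imshow(attn[h].numpy(), cmap="Greys")
    ax.set_title(f"layer {layer}, head {h}")
    ax.set_xlabel("key position")
    ax.set_ylabel("query position")
plt.tight_layout()
plt.show()
```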

First, we examine why Stage-1 training alone does worse than the baseline model. Figure 6 compares the model at the end of Stage-1, trained with sequence length 512 (top), to the baseline model trained with sequence length 2048 (bottom). The figure shows that Stage-1 training only learns to attend to positions up to the short sequence length. For the first 700 tokens, both models have similar attention patterns, after which the short-sequence model deviates from the baseline. The Stage-1 model then assigns uniform scores, either to groups of tokens, as seen in the short-range head (top-left), or to all tokens, as seen in the uniform patches in the other two heads up to position 1200. For all heads, the Stage-1 model cannot differentiate the relative importance of tokens past this threshold. This is expected, since the model sees sequence lengths of at most 512 in Stage-1 training.

Figure 6: Heat maps showing attention scores between different token positions in decoder layer 7 for a selected sample. The top row shows results from the Stage-1 model trained using VSL (512 to 2k), and the bottom row shows results from the baseline model trained with 2k sequence length only. From left to right, we show the representative patterns in a few heads that handle short, medium, and long-range dependencies. The two gray dashed lines correspond to tokens 700 and 1200.

In contrast, we observe significant changes in the attention patterns after Stage-2 adaptation. Figure 7 compares the self-attention heatmaps of the Stage-2 model, adapted with longer sequences of 2048 tokens (top), to the baseline model (bottom). The Stage-2 model adapts to handle longer sequences and shows behavior similar to the baseline model.

We highlight the key attention patterns observable in the Stage-2 and baseline models. First, heads with short- and medium-range capabilities (first two columns) attend only to a context window immediately before the position being predicted, with thicker lines indicating a longer context window. This helps explain why relative position embeddings like ALiBi [3] work well: the model only needs to make predictions based on a local context window much smaller than the desired long sequence length.

Block attention patterns characterize the remaining heads (last column). Upon inspecting the sample, we find that the edges of the triangles correspond to the “<|endoftext|>” token, a delimiter used to separate individual documents when packing them into longer sequences. The model can learn to attend to this delimiter and only to the relevant tokens within a document.

Figure 7: Heat maps showing attention scores between different token positions in decoder layer 7 for a selected sample. The top row shows results from the Stage-2 model trained using VSL (512 to 2k), and the bottom row shows results from the baseline model trained with 2k sequence length only. From left to right, we show the representative patterns in a few heads that handle short, medium, and long-range dependencies. The three gray dashed lines correspond to the positions of <|endoftext|> tokens.
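The packing scheme behind these block patterns can be sketched as follows. We use GPT-2's BPE tokenizer purely for illustration, since its end-of-text token is literally "<|endoftext|>"; the actual data pipeline used in this work may differ.

```python
# Sketch: pack tokenized documents into fixed-length sequences, separated by
# an <|endoftext|> delimiter (GPT-2 tokenizer used for illustration only).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
EOT_ID = tokenizer.eos_token_id  # id of "<|endoftext|>" in GPT-2's vocabulary

def pack_documents(documents, seq_len):
    """Concatenate tokenized documents with EOT delimiters, then slice the
    resulting token stream into fixed-length sequences (dropping the tail)."""
    stream = []
    for doc in documents:
        stream.extend(tokenizer(doc)["input_ids"])
        stream.append(EOT_ID)
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]

docs = ["First short document.",
        "A second document about something else.",
        "A third, slightly longer document used to fill out the sequence."]
for seq in pack_documents(docs, seq_len=16):
    print(seq)
```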

Push Button Software for VSL Training

The Cerebras Software Platform (CSoft) makes it extremely simple to train models using the VSL method. For a given dataset, we recommend splitting it into two subsets, one for shorter sequences and the other for longer sequences, and following the data preparation steps to ensure optimal training performance. Then it is as simple as modifying the data subset path between the VSL stages. Figure 8 shows the YAML configuration files with the changes highlighted to switch from Stage-1 to Stage-2 of VSL training. The only differences are the path to the data directory and the batch size. To reiterate, no other changes to the Python code are needed, no extra hyperparameters are modified, and one can continue training efficiently on the CS-2 system.

Figure 8: YAML configuration file changes are all that is required to switch between Stage-1 and Stage-2 of VSL training for GPT models. In this example, we show how to switch from a 2k to an 8k sequence length between the two stages. We reduce the batch size in proportion to the increase in sequence length from Stage-1 to Stage-2. This ensures a consistent number of tokens per training iteration between the two stages and further reduces the need to tweak any extra hyperparameters like the learning rate.
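As a quick worked example of the batch-size rule from the caption above (with hypothetical Stage-1 values chosen only to make the arithmetic concrete):

```python
# Hypothetical Stage-1 values; only the ratio to Stage-2 matters.
stage1_seq_len, stage1_batch_size = 2048, 480
tokens_per_step = stage1_seq_len * stage1_batch_size  # held constant

stage2_seq_len = 8192                                  # 4x longer sequences...
stage2_batch_size = tokens_per_step // stage2_seq_len  # ...so a 4x smaller batch

print(stage2_batch_size)  # 120
```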

Conclusion

In this work, we show that the simple two-stage VSL method can help accelerate time to convergence for models with long-context capabilities. VSL does not require any architecture changes or hyperparameter tuning, while using 29% fewer pre-training FLOPs. In parallel to our work, recent state-of-the-art models such as XGen and MPT-7B-8k have also successfully employed a similar approach to train models for up to 8k sequence length. To train models for even longer contexts, such as 32k or 65k, users can adopt a multi-stage (2+) approach [15] and obtain training acceleration by spending most of the training on shorter sequences, for example, 2k tokens.

With our initial results backed by qualitative insights, we see tremendous promise in VSL for accelerating large GPT training. Enabled by the Cerebras CS-2’s ability to train large models without the complexities of distributed training, together with our integrated support for long contexts, users should be able to unlock the next frontier of state-of-the-art long-context models in a push-button manner. To learn more about how the Cerebras CS-2 can empower your AI research, or to learn more about this study, contact us.

References

  1. Dao, Tri, et al. “Flashattention: Fast and memory-efficient exact attention with io-awareness.” Advances in Neural Information Processing Systems 35 (2022): 16344-16359.
  2. Rabe, Markus N., and Charles Staats. “Self-attention Does Not Need O (n^2) Memory.” arXiv preprint arXiv:2112.05682 (2021).
  3. Press, Ofir, Noah A. Smith, and Mike Lewis. “Train short, test long: Attention with linear biases enables input length extrapolation.” arXiv preprint arXiv:2108.12409 (2021).
  4. Beltagy, Iz, Matthew E. Peters, and Arman Cohan. “Longformer: The long-document transformer.” arXiv preprint arXiv:2004.05150 (2020).
  5. Choromanski, Krzysztof, et al. “Rethinking attention with performers.” arXiv preprint arXiv:2009.14794 (2020).
  6. Press, Ofir, Noah A. Smith, and Mike Lewis. “Shortformer: Better language modeling using shorter inputs.” arXiv preprint arXiv:2012.15832 (2020).
  7. Radford, Alec, et al. “Language models are unsupervised multitask learners.” OpenAI blog 1.8 (2019): 9.
  8. Brown, Tom, et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877-1901.
  9. Gao, Leo, et al. “The pile: An 800gb dataset of diverse text for language modeling.” arXiv preprint arXiv:2101.00027 (2020).
  10. Hoffmann, Jordan, et al. “Training compute-optimal large language models.” arXiv preprint arXiv:2203.15556 (2022).
  11. Haviv, Adi, et al. “Transformer language models without positional encodings still learn positional information.” arXiv preprint arXiv:2203.16634 (2022).
  12. Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
  13. Su, Jianlin, et al. “Roformer: Enhanced transformer with rotary position embedding.” arXiv preprint arXiv:2104.09864 (2021).
  14. Touvron, Hugo, et al. “Llama: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).
  15. Li, Conglong, Minjia Zhang, and Yuxiong He. “The stability-efficiency dilemma: Investigating sequence length warmup for training GPT models.” Advances in Neural Information Processing Systems. 2022.

Appendix

A. Detailed Evaluation Metrics for GPT Small with Different Position Embeddings

Table 3 reports the task-level breakdown of the evaluation scores for the models from Table 1. Following standard evaluation guidelines, we use the EleutherAI evaluation harness for the ARC-Challenge, HellaSwag, and TruthfulQA tasks. For MMLU, we use the original implementation from the authors and report the average accuracy.

Table 3: Detailed evaluation scores (higher is better) on the Open LLM Leaderboard for GPT-3 Small models using different position embeddings. For each position embedding, we show results for the model trained with 90% Stage-1 + 10% Stage-2, the model trained with only a short sequence length (512), and the model trained with a 2k sequence length (the standard training baseline). For ARC-Challenge (25-shot) and HellaSwag (10-shot), we report the normalized accuracy (acc_norm). For TruthfulQA (0-shot), we report the mc2 metric, and for MMLU (5-shot), following the original implementation from the authors, we report the average accuracy across the four categories (Humanities, Social Sciences, STEM, and Others).

B. Detailed Evaluation Metrics for Training Large Models with VSL

We apply the VSL recipe to pre-training a GPT-3 XL model (1.3B parameters) on the Pile dataset [9], using the standard hyperparameter settings from the paper [8] but training for 26B tokens following the recommendations from Chinchilla [10]. We set the sequence length to 512 in Stage-1 and to 2048 in Stage-2, and do not modify any other hyperparameters during the entire pre-training process. Table 4 reports the task-level breakdown of evaluation scores on the Open LLM Leaderboard.

Table 4: Detailed evaluation scores (higher is better) on the Open LLM Leaderboard for the GPT-3 XL model using learned position embeddings. We show results for the model trained with 75% Stage-1 + 25% Stage-2, the model trained with only a short sequence length (512), and the model trained with a 2k sequence length (the standard training baseline).

C. Detailed Evaluation Metrics for Training GPT Models to 8K Sequences

Table 5 reports the task-level breakdown of the evaluation scores for the models from Table 2. Following standard evaluation guidelines, we use the EleutherAI evaluation harness for the ARC-Challenge, HellaSwag, and TruthfulQA tasks. For MMLU, we use the original implementation from the authors and report the average accuracy.

Table 5: Detailed evaluation scores (higher is better) on the Open LLM Leaderboard for the GPT-3 (111M) model using ALiBi position embeddings. We show results for the model trained with 75% Stage-1 + 25% Stage-2, the model trained with only a short sequence length (2k), and the model trained with an 8k sequence length (the standard training baseline).

Contributors

Siyun Li led the research efforts, evaluated the technique in different training settings on the CS-2, conducted the qualitative analysis, and contributed to the writing of this blog. Abhay Gupta advised the project, summarized insights, and contributed to the writing of this blog. Kevin Leong and Mark Browning supported the training infrastructure and provided crucial debugging support. Daria Soboleva, Nolan Dey, Hemant Khachane, and Ribhu Pathria conducted the ALiBi experiments and associated ablations. Sean Lie coordinated the setup of the training infrastructure and was involved in the experimental analysis. Shreyas Saxena advised the project, presented the initial proof of concept, and provided a detailed review of this writing.