Cerebras and Neural Magic have achieved a major milestone in the field of large language models (LLMs). By combining state-of-the-art pruning techniques, sparse pretraining, and purpose-built hardware, we have unlocked unprecedented levels of sparsity in LLMs, enabling up to 70% parameter reduction without compromising accuracy. This breakthrough paves the way for more efficient training and deployment of LLMs, making them accessible to a broader range of organizations and industries.

Key Achievements:

    • 70% Sparsity: Our novel approach achieves a record 70% unstructured sparsity in LLMs. By comparison, GPU hardware supports at most 50% (2:4) structured sparsity, and even that is rarely used for LLMs in production.
    • Full Accuracy: This is the first time a foundation model such as Llama has been sparsified to 50-70% with full recovery of accuracy on challenging downstream tasks.
    • Sparse Training Acceleration: The Cerebras CS-3 system provides up to 8x training acceleration, leveraging its native support for unstructured sparsity.
    • Accelerated Inference: Neural Magic’s DeepSparse engine delivers up to 3x faster inference compared to dense models.

Sparse LLMs: From Research to Reality

The quest for sparsity in deep learning models has been an ongoing endeavor, with the goal of reducing computational and memory requirements. Pruning techniques, which remove less important weights, have proven effective at shrinking computer vision models. However, applying these methods to large language models (LLMs) has so far not yielded great results. LLMs operate on high-dimensional data and require a vast number of parameters to capture the complexity and nuances of language. Removing weights through pruning can disrupt the delicate balance and relationships between these parameters, leading to a significant loss in accuracy, particularly on downstream tasks such as chat and coding. This accuracy degradation, combined with the complexity of sparse training, is why no major LLM today employs sparsity.

Another key hindrance to sparsity research has been the limited support for sparsity in GPU hardware. GPUs such as the H100 offer only a single, restrictive option: 2:4 structured sparsity, in which 2 out of every 4 adjacent weights must be zero. This caps sparsity at 50% and imposes a rigid pattern that the highly varied weight distributions of LLMs rarely follow. As a result, GPU sparsity is rarely used for LLMs, as it fails to capture the intricate and diverse sparsity patterns present in these models.
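To make the 2:4 constraint concrete, here is a small sketch (plain NumPy; the helper name and array shapes are illustrative, not part of any GPU API) that checks whether a weight matrix fits the pattern of at most 2 nonzero values in every group of 4 consecutive weights, and shows that a randomly pruned, unstructured 50%-sparse matrix usually does not:

```python
import numpy as np

def satisfies_2_4(weights: np.ndarray) -> bool:
    """Return True if every group of 4 consecutive weights (along the
    flattened last axis) has at most 2 nonzero values, i.e. the 2:4
    structured sparsity pattern accelerated by GPUs such as the H100."""
    assert weights.shape[-1] % 4 == 0, "last dimension must be divisible by 4"
    groups = weights.reshape(-1, 4)                  # one row per group of 4 weights
    return bool(np.all(np.count_nonzero(groups, axis=1) <= 2))

# A randomly pruned, unstructured 50%-sparse matrix usually breaks the pattern:
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
w[rng.random(w.shape) < 0.5] = 0.0                   # unstructured random mask
print(satisfies_2_4(w))                              # typically False
```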

In contrast, Cerebras Wafer Scale Engines (WSEs) have been designed from the ground up to support arbitrary sparsity patterns at any sparsity level. This unique hardware architecture allows researchers to adapt to the natural structure and learned weights of the model, enabling them to leverage sparsity more effectively. With WSEs, it is possible to achieve high speedups wherever the weights are sparse while retaining full accuracy in the parts of the model that remain dense.

Sparse Fine-Tuning: A Revolutionary Approach

In collaboration with Neural Magic, we have developed a novel approach called sparse fine-tuning, which overcomes the limitations of previous sparsity techniques for LLMs. By leveraging the CS-3’s support for unlimited unstructured sparsity, this method combines one-shot pruning, sparse pretraining, and fine-tuning on specific datasets to create highly sparse LLMs without sacrificing accuracy.

The process begins by applying one-shot pruning to a dense model, such as the Llama architecture, removing 50% of the model’s weights. The pruned model then undergoes additional pretraining on the SlimPajama dataset developed by Cerebras. This sparse pretraining step recovers most of the accuracy lost to pruning, effectively adapting the model to its new sparse structure.
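To illustrate the one-shot pruning step, the sketch below applies simple layer-wise magnitude pruning to 50% sparsity in PyTorch. This is a simplification for illustration only: the actual recipe may rely on a more accuracy-aware one-shot pruner, and the masks returned here are just a stand-in for however the real pipeline tracks the sparsity pattern.

```python
import torch

def one_shot_magnitude_prune(model: torch.nn.Module, sparsity: float = 0.5) -> dict:
    """Zero out the smallest-magnitude weights of every Linear layer and
    return the binary masks so the sparsity pattern can be kept fixed
    during subsequent sparse pretraining and fine-tuning."""
    masks = {}
    with torch.no_grad():
        for name, module in model.named_modules():
            if isinstance(module, torch.nn.Linear):
                w = module.weight
                k = int(w.numel() * sparsity)                  # number of weights to remove
                threshold = w.abs().flatten().kthvalue(k).values
                mask = (w.abs() > threshold).to(w.dtype)       # 1 = keep, 0 = pruned
                w.mul_(mask)                                   # apply the mask in place
                masks[name] = mask
    return masks
```

The masks are returned because the pruned positions must stay at zero throughout the sparse pretraining and fine-tuning that follow.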

Following the sparse pretraining, the model is fine-tuned on specific datasets tailored for downstream tasks, such as chatbots or code generation. The resulting sparse LLM reaches the same level of accuracy as its dense counterpart while containing up to 70% fewer nonzero parameters.
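To keep the model sparse through pretraining and fine-tuning, the pruned weights must remain zero after every update. A common way to enforce this in a standard PyTorch training loop is to reapply the pruning masks after each optimizer step; the snippet below is a generic sketch of that idea, not Cerebras’ or Neural Magic’s actual training code, and it assumes the masks produced by the pruning sketch above.

```python
import torch

def apply_masks(model: torch.nn.Module, masks: dict) -> None:
    """Re-zero pruned weights so the sparsity pattern stays fixed."""
    with torch.no_grad():
        for name, module in model.named_modules():
            if name in masks:
                module.weight.mul_(masks[name])

# Inside the fine-tuning loop:
#   loss.backward()
#   optimizer.step()
#   apply_masks(model, masks)   # keep pruned weights at zero
```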

Neural Magic DeepSparse: Lightning-Fast Inference

Deploying sparse LLMs for inference presents its own set of challenges, particularly on resource-constrained devices. Neural Magic’s DeepSparse engine addresses this issue by delivering exceptional inference performance on CPUs, the most widely available and cost-effective computing resource.

DeepSparse leverages advanced techniques such as Just-In-Time (JIT) compilation, kernel fusion, and vectorized instructions to optimize sparse operations on CPUs. By exploiting the sparsity patterns in LLMs, DeepSparse minimizes memory accesses and computational overhead, resulting in up to 3x faster inference compared to dense models.
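For orientation, a typical way to run such a model with DeepSparse is through its text-generation pipeline, roughly as sketched below. The model identifier is a placeholder, and the exact constructor arguments and supported model stubs should be checked against the DeepSparse documentation for your version.

```python
from deepsparse import TextGeneration

# Placeholder: substitute one of the released sparse Llama checkpoints
# from the Neural Magic Hugging Face page linked in the next section.
model_path = "<path-or-stub-to-sparse-llama-checkpoint>"

pipeline = TextGeneration(model=model_path)
result = pipeline(prompt="Explain weight sparsity in one sentence.",
                  max_new_tokens=64)
print(result.generations[0].text)
```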

The efficiency gains achieved by DeepSparse have significant implications for organizations seeking to deploy LLMs in real-world applications. With faster inference times and reduced hardware requirements, sparse LLMs become more accessible and cost-effective, enabling a wider range of use cases across industries.

Empowering the AI Community

To facilitate the adoption and further development of sparse LLMs, we are releasing a comprehensive package containing the training recipe, model weights, code, data, and documentation. These resources empower researchers and practitioners to build upon our work and explore the potential of sparsity in their own applications.

https://huggingface.co/papers/2405.03594 (paper)
https://huggingface.co/neuralmagic (models and weights)
https://github.com/neuralmagic (code)

The breakthrough achieved by Cerebras and Neural Magic marks a significant milestone in the evolution of LLMs. By demonstrating the practicality and effectiveness of sparsity in these models, we have opened the door to a new era of efficient and accessible natural language processing.