Large Language Models (LLMs) such as GPT-3 are the cutting edge of artificial intelligence, with remarkable deployed uses today and unbounded promise for new capabilities. As you may have seen, Cerebras has recently announced some very cool results in the LLM space. We achieved them with advanced platform technologies that we'll describe here.

LLMs stress today’s AI compute systems with their model size, their time to train, and the difficulty of launching jobs, marshaling resources, and tuning parameters well enough for training to succeed. To address the urgent need to train extremely large language models, in the hundred-billion to trillion-parameter range, Cerebras has innovated in three directions:

First, to cope with model size, we have introduced a Weight Streaming training modality that stores the model weights off the wafer and streams them onto the wafer one layer at a time.

Second, while Weight Streaming is already an easy and robust way to use one CS-2 system to train models of moderately large size, training time for very large models can be excessive. To reduce the time to train, Release 1.5 of the Cerebras Software Platform, CSoft, now supports training a single model on our Cerebras Wafer-Scale Cluster with up to 4 CS-2 systems. And we’re just getting started: we’ve demonstrated results with 8-node clusters, and our unique cluster architecture, which brings together several novel hardware, software, and fabric technologies, allows us to build clusters of up to 192 Cerebras CS-2 systems. (That multiplies out to more than 150 million cores, which is a startling number.) Because these clusters run data parallel, there is no distributed computing for the user to manage, and they deliver near-perfect linear scaling, making them ideal for rapidly training extreme-scale AI models without any code changes as the cluster size increases.

If you have done similar work on traditional hardware, you might say: great that you can do it, but it probably takes a really long time, and a ton of work, to set everything up before you can actually run anything. Fear not; this is where Appliance Mode, the third innovation, comes in. We created Appliance Mode for simplicity, ease of use, and robustness in training any LLM on the Cerebras cluster. Appliance Mode approaches usability from a different angle than traditional hardware: because we control every aspect of the Cerebras cluster and the unique capabilities of the Wafer-Scale Engine (WSE) that powers our CS-2 systems, we can provide an easily accessible way to scale up to LLMs and other very large ML workloads. An Appliance Mode user describes the model and specifies how many CS-2 systems (out of the number in the cluster to which they submit) to use for the training.
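
To make that concrete, here is a minimal sketch of the kind of specification involved. The dictionary keys and values below are illustrative stand-ins, not the exact schema of the Cerebras software; the point is that the user supplies a model description, an input data location, and a count of CS-2 systems, and nothing about distribution or placement.

```python
# Illustrative only: these keys are hypothetical stand-ins for the minimal
# specification Appliance Mode needs (model description, input data, and the
# number of CS-2 systems), not the exact Cerebras configuration schema.
training_spec = {
    "model": {
        "name": "gpt-style-decoder",   # hypothetical model identifier
        "hidden_size": 5120,
        "num_layers": 40,
        "num_heads": 40,
        "vocab_size": 50257,
    },
    "train_input": {
        "data_dir": "/shared/datasets/my_corpus",  # on the mounted shared filesystem
        "batch_size": 512,                         # global batch, sharded across CS-2s
    },
    "runconfig": {
        "num_cs2_systems": 4,   # how many of the cluster's CS-2 systems to use
        "mode": "train",
        "max_steps": 100_000,
    },
}

if __name__ == "__main__":
    import json
    print(json.dumps(training_spec, indent=2))
```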

The Many Benefits of Appliance Mode

There are many benefits to the Appliance Mode flow:

    • As a user, you no longer need to worry about the distributed nature of launching multiple tasks to get the maximum performance on the cluster.
    • Rather than running within a container, we allow the user to pip install the needed Cerebras software and add additional packages to meet their needs.
    • Previously, our run scripts were exactly that: scripts used to deploy jobs. We are now much more “pythonic,” and the user can modify these scripts to match their needs and use the cluster as a single device (a sketch of what such a script might look like follows this list).
    • We will no longer ship custom TensorFlow or PyTorch wheels, but will instead work with the vanilla wheels distributed by Google and Meta, respectively.
    • These simplifications allow our users to focus on what’s important: their research and the results of their model, and not to worry about how to get better performance or make the model “fit” the hardware.
    • The appliance flow makes it simple to run in either our earlier pipelined execution mode or the weight streaming execution mode across the Wafer-Scale Cluster.
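
Here is the sketch promised above: a deliberately small example of what a “pythonic” run script could look like. The `train` and `evaluate` helpers are hypothetical placeholders, not the actual Cerebras entry points; the point is that the control flow, for instance interleaving training and evaluation, is plain, editable Python rather than custom bash scripting.

```python
# A minimal sketch, not the real Cerebras run script. `train` and `evaluate`
# are hypothetical placeholders for calls that submit work to the appliance;
# the structure shows that the flow is ordinary, editable Python.
import argparse

def train(params, num_steps):
    # Hypothetical: submit a training segment of `num_steps` to the appliance.
    print(f"training {num_steps} steps on {params['num_cs2_systems']} CS-2 system(s)")

def evaluate(params):
    # Hypothetical: run an evaluation pass on the appliance.
    print("running evaluation")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_cs2_systems", type=int, default=1)
    parser.add_argument("--total_steps", type=int, default=100_000)
    parser.add_argument("--eval_every", type=int, default=10_000)
    args = parser.parse_args()

    params = {"num_cs2_systems": args.num_cs2_systems}

    # Interleave training and evaluation; previously this needed custom bash.
    steps_done = 0
    while steps_done < args.total_steps:
        train(params, num_steps=args.eval_every)
        steps_done += args.eval_every
        evaluate(params)

if __name__ == "__main__":
    main()
```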

Launching a Training Job

Users interact with the cluster via a user node: a customer-controlled node on which the user installs the Cerebras Python packages. It is also where the user stores the model parameters and a data loader. The data itself is stored in a filesystem that is mounted on the appliance, so the appliance can access it through the user’s data loader. As in our previous workflow, a shared storage system works well. Depending on the bandwidth and performance needed, a single user node or several can interact with the cluster.
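
As an illustration of what lives on the user node, below is a sketch of a PyTorch data loader reading pre-tokenized data from the shared filesystem. The directory layout and the function signature are assumptions made for this example; the real requirement is simply that the appliance can reach the same mounted storage and call the user’s loader.

```python
# Sketch of a user-supplied input pipeline. The .npy shard layout and the
# `train_dataloader(params)` signature are assumptions for illustration; the
# appliance mounts the same shared filesystem and invokes the user's loader.
import os
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TokenizedTextDataset(Dataset):
    """Pre-tokenized sequences stored as one-sequence-per-row .npy shards."""

    def __init__(self, data_dir, seq_len=2048):
        self.seq_len = seq_len
        files = sorted(
            os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith(".npy")
        )
        self.shards = [np.load(f, mmap_mode="r") for f in files]
        self.lengths = [s.shape[0] for s in self.shards]

    def __len__(self):
        return sum(self.lengths)

    def __getitem__(self, idx):
        for shard, n in zip(self.shards, self.lengths):
            if idx < n:
                tokens = np.array(shard[idx][: self.seq_len], dtype=np.int64)
                return {"input_ids": torch.from_numpy(tokens)}
            idx -= n
        raise IndexError(idx)

def train_dataloader(params):
    # `params` would come from the user's model/run configuration.
    dataset = TokenizedTextDataset(params["data_dir"], seq_len=params.get("seq_len", 2048))
    return DataLoader(dataset, batch_size=params["batch_size"], shuffle=True, drop_last=True)
```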

Pipelined and Weight Streaming Execution Modes 

In our pipelined execution mode, all of the model weights are present on one WSE. The memory capacity of the WSE limits the model size. This approach works well for small to medium-sized models. (Small is a relative term; models with up to about a billion parameters will work in pipelined mode. A model of that size was state-of-the-art only a few years ago.)

In the newer weight streaming mode, model weights are streamed into the WSE one network layer at a time. Activations for one batch are also present on the WSE. This permits training models larger than the memory capacity of the wafer.
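
Some back-of-the-envelope arithmetic shows why streaming is necessary: the WSE-2 has 40 GB of on-chip SRAM, and at 16-bit precision a model’s weights alone occupy 2 bytes per parameter, before counting gradients, optimizer state, or activations. The numbers below are a rough illustration, not a capacity planner.

```python
# Rough illustration: weight storage at 16-bit precision versus the roughly
# 40 GB of on-wafer SRAM in the WSE-2 (gradients, optimizer state, and
# activations would need space on top of this).
BYTES_PER_PARAM = 2      # fp16 / bf16 weights
WSE2_SRAM_GB = 40        # on-chip memory of the WSE-2

for billions in (1, 10, 100, 1000):
    weight_gb = billions * 1e9 * BYTES_PER_PARAM / 1e9
    verdict = "can reside on the wafer" if weight_gb <= WSE2_SRAM_GB else "must be streamed"
    print(f"{billions:>5} B parameters -> {weight_gb:>6.0f} GB of weights ({verdict})")
```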

Figure 1. The two execution modes of the MLIR-based Cerebras Graph Compiler.

Weight Streaming Execution on the Cerebras Wafer-Scale Cluster

To train on a cluster of CS-2s using weight streaming, we apply the data parallel idea at the wafer level. Each batch of training data is sharded into smaller batches, and each CS-2 in the cluster is given a shard, that is, a smaller number of example inputs from the training set. We take the network weights layer by layer, starting with the input layer, from a large-capacity memory service called MemoryX, and we broadcast the current layer so that the WSE within each CS-2 has its own copy of the weights. The WSEs compute the output of this layer for their shards. After the loss layer and the first gradient computation are done, we repeat the broadcasts of weights layer by layer in the reverse order and carry out back-propagation.

After each layer’s backprop is done, each WSE has its own weight gradient for the current layer. These gradients are sent into the interconnect fabric, SwarmX, which earlier accomplished the broadcast through data replication. SwarmX now provides data aggregation, adding together the weight gradients provided by the WSEs and returning their sum to MemoryX. MemoryX then applies the weight update for the batch via the chosen learning algorithm (ADAM, momentum, or another method). This separation of weight storage and the learning algorithm from the job of forward and backward propagation lets us scale memory capacity to meet the needs of large models and, independently, scale the cluster size to meet the compute throughput needs for rapid training.
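
To ground that description, here is a deliberately tiny, serial NumPy re-enactment of the dataflow. It is conceptual only: the toy layers, the plain-SGD update, and the Python loops standing in for the CS-2 systems, SwarmX, and MemoryX are illustrative, not Cerebras code, but the order of operations mirrors the layer-by-layer broadcast, per-shard compute, gradient reduction, and centralized weight update described above.

```python
# Conceptual, serial re-enactment of the weight-streaming dataflow in NumPy.
# Each "CS-2" is just a loop iteration; on the real cluster the shards run in
# parallel and SwarmX/MemoryX are fabric and service components.
import numpy as np

rng = np.random.default_rng(0)

NUM_CS2 = 4                      # systems in the cluster
LAYERS = 3                       # toy network: a stack of linear + ReLU layers
DIM = 8
GLOBAL_BATCH = 32
LR = 1e-2

# MemoryX: holds the weights and applies the learning rule (plain SGD here).
memoryx_weights = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(LAYERS)]

# One training batch, sharded across the CS-2 systems.
x = rng.normal(size=(GLOBAL_BATCH, DIM))
y = rng.normal(size=(GLOBAL_BATCH, DIM))
shards = list(zip(np.array_split(x, NUM_CS2), np.array_split(y, NUM_CS2)))

# Forward pass: broadcast each layer's weights; every CS-2 applies it to its shard.
acts = [[xs] for xs, _ in shards]          # per-CS-2 saved activations
for w in memoryx_weights:                  # SwarmX broadcast, layer by layer
    for k in range(NUM_CS2):
        acts[k].append(np.maximum(acts[k][-1] @ w, 0.0))   # linear + ReLU

# Loss gradient at the output (mean-squared error over the global batch).
deltas = [2.0 * (acts[k][-1] - shards[k][1]) / GLOBAL_BATCH for k in range(NUM_CS2)]

# Backward pass: broadcast weights in reverse order; SwarmX sums the per-CS-2
# weight gradients and MemoryX applies the update.
for layer in reversed(range(LAYERS)):
    w = memoryx_weights[layer]
    grads = []
    for k in range(NUM_CS2):
        relu_mask = (acts[k][layer + 1] > 0).astype(float)
        d = deltas[k] * relu_mask
        grads.append(acts[k][layer].T @ d)          # local weight gradient
        deltas[k] = d @ w.T                         # delta for the next layer down
    swarmx_sum = sum(grads)                         # SwarmX reduction
    memoryx_weights[layer] = w - LR * swarmx_sum    # MemoryX weight update
```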

Dynamically coordinating the cluster and its CS-2s, the SwarmX nodes, and the compute and storage nodes whose worker and coordinator processes constitute MemoryX is the responsibility of the Cerebras Appliance Mode software, which runs on host-system CPU nodes.

In Appliance Mode, the user’s minimal specification is automatically amplified into commands that assemble resources and deploy software, creating a full parallel training system as shown in the figure. The assembled hardware includes: the user-chosen number of CS-2 systems; custom networking gear (switches and in-network compute) that implements the SwarmX broadcast-and-reduce network for streaming weights and gradients; compute and network nodes that make up MemoryX, with sufficient bandwidth to provide training data batches (the activation servers) and model weights (the weight server or servers); and the data-preprocessing and coordinator nodes that access and supply training data from the file system and monitor, synchronize, and otherwise direct the activities of these parallel components. Appliance Mode then deploys Cerebras software on the assembled resources to run the job.

Figure 2. Topology of weight streaming on the Cerebras Wafer-Scale Cluster.
Figure 3. Topology of pipeline mode on a single Cerebras CS-2 system.

As seen in the diagrams above, for both pipelined and weight streaming execution, the user deploys their job from the user node and does not need to worry about the complexity taking place on the appliance.

Appliance Mode Makes Training a Snap

So with the cluster and Appliance Mode, we get a friendlier user interface, an easy way to scale to a cluster, and the ability to run models from the smallest (pipelined) to the largest (weight streaming) with almost no code changes, in the most popular frameworks, PyTorch and TensorFlow. The user works on the user node and needs only to focus on what is deployed there, namely the Python environment. You can even launch training from a Jupyter notebook.

With the wafer scale cluster and the appliance flow, we make training large language models just as easy as training MNIST. (AI humor. Sorry.)

What Was Training Like before Appliance Mode?

Our previous training mode was straightforward, but it required the user to understand how to program both the CS-2 and its support nodes, and it was limited to a single CS-2. It was built on top of Slurm to deploy multiple tasks, which executed within a sif container and communicated over gRPC.

Some limitations of this approach were:

    • The user had to deploy nodes to interact with the CS-2.
    • The user had to be aware of the distributed nature of using the CS-2 and structure the run scripts accordingly.
    • Custom deployments needed custom scripts; for example, training and evaluating in a loop required custom bash scripts.
    • The user could not easily add the additional Python packages needed for their work.
    • It did not play well with datacenter orchestration software other than Slurm.

Now, with the Appliance Mode flow, we avoid all of these issues. The appliance flow is built on top of Python packages on the user node, with gRPC to communicate between the various roles and Kubernetes to schedule and deploy on the appliance. This lets users integrate the cluster into their datacenter with the tools and settings they are most familiar and comfortable with.

There are also advantages compared with the workflows used for conventional GPU-based hardware.

    • There is no need to figure out distributed TensorFlow or PyTorch. The model is described by the same code that runs on a single GPU; one does not need to break the model into shards that run across multiple devices. (The sketch after this list shows what such model code looks like.)
    • Memory, bandwidth, and compute performance constraints disappear or are pushed far into the background.
    • Third party software integration with, e.g., OpenMPI and Horovod, is no longer needed.
    • Efficient use of a very large number of devices, as well as fault tolerance, is taken care of for the user.
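
To illustrate the first point in the list: the model is written exactly as it would be for a single device. The module below is a generic transformer language model in plain PyTorch, not the Cerebras reference implementation; note the absence of process groups, device placement, or sharding logic.

```python
# Plain single-device PyTorch: no process groups, no sharding, no device maps.
# Generic example code, not the Cerebras model zoo implementation.
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=50257, d_model=1024, n_heads=16, n_layers=24, seq_len=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(seq_len, d_model)
        block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids):
        T = input_ids.size(1)
        positions = torch.arange(T, device=input_ids.device)
        h = self.tok_emb(input_ids) + self.pos_emb(positions)
        # Causal mask so each position attends only to earlier positions.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=input_ids.device), diagonal=1)
        h = self.blocks(h, mask=causal)
        return self.lm_head(h)
```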

Running Appliance Mode on the Cerebras Wafer-Scale Cluster is, in short, amazing. It offers previously unheard-of compute power and simultaneously eliminates all the complexity associated with scaling training out to millions upon millions of cores.

If you are curious to learn more about our software stack, visit the developer page on our website. If you have questions, visit our Discourse page. To get a demo of our cluster capabilities and appliance workflow, contact Cerebras support by sending email to support@cerebras.net.

Vishal Subbiah, ML Engineering Manager

Rob Schreiber, Distinguished Engineer

September 14, 2021