The latest release of our SDK, 0.6.0, includes a host of new features to improve usability and squeeze even more performance out of your applications. Some of the highlights include new communications primitives, enhanced support for generics, more example programs and a new runtime.

The Cerebras SDK

The Cerebras SDK is our platform for enabling developers to utilize the full power of the Cerebras Wafer-Scale Engine, the massive 850,000-core chip that sits at the heart of the Cerebras CS-2 system. The SDK allows users to write completely custom applications using the Cerebras Software Language (CSL), a C-like language built around a dataflow programming model. In this model, computation is activated by the arrival of data on a processing element (PE) from the fabric that knits the cores together.

CSL and the Cerebras SDK are already enabling innovation in computational science. The single-cycle memory accesses and ultra-fast fabric present a totally new paradigm for HPC applications. In particular, TotalEnergies has been using the SDK to reinvent high performance simulations for energy research. You can read about that work here, but in short, TotalEnergies used the Cerebras CS-2 system to turn a stencil code problem that had long been accepted as memory-bound into a compute-bound one. On a benchmark case based on a seismic kernel used to image the Earth, the CS-2 delivered more than 200x the performance of an NVIDIA® A100 GPU.

Library support

The CSL libraries are better than ever. We’ve introduced a new library with collective communications primitives to implement broadcast, scatter, gather, and single precision add-reduce operations across rows and columns of PEs. All of the routing configuration is handled completely behind the scenes, so users no longer have to manually implement these common operations.

To developers who have used MPI for distributed memory programming, these primitives should look very familiar. For example, a broadcast operation across a row of PEs looks something like this:

// PE 0 is the root of the broadcast: it sends from send_buf, while
// every other PE in the row receives into recv_buf.
if (collective_comms.pe_id == 0) {
  collective_comms.broadcast(0, send_buf, num_elems, broadcast_color);
} else {
  collective_comms.broadcast(0, recv_buf, num_elems, broadcast_color);
}

In this case, the PE with pe_id 0 sends num_elems elements from its send_buf to the recv_bufs of all other PEs along its row, using broadcast_color as the color for routing. The collective_comms module in this example is instantiated on import to specify that the operations occur across rows of PEs. Future work will include implementations of all-scatter and all-gather primitives, as well as additional reduce operations.
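
For comparison, here is roughly how the same pattern looks in MPI from Python using mpi4py. This snippet is purely illustrative and is not part of the SDK; the buffer size and data type are arbitrary.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
num_elems = 16
buf = np.empty(num_elems, dtype=np.float32)

if comm.Get_rank() == 0:
    buf[:] = np.arange(num_elems, dtype=np.float32)  # the root fills its buffer

# Rank 0 broadcasts buf to every other rank, much as PE 0 above broadcasts
# send_buf to the recv_bufs of the other PEs in its row.
comm.Bcast(buf, root=0)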

Additionally, the math library has gained more efficient implementations of several critical functions, including half-precision sin and cos. Several small libraries with quality-of-life utility functions have been introduced as well.

CSL language improvements

The CSL language itself has several major enhancements. CSL now features improved support for generics, giving users a way to write functions over unspecified types: generic functions are specialized at compile time based on the types of their input arguments. In fact, CSL’s entire math library has been rewritten to take advantage of this, and now generically implements IEEE floating point functions, increasing the code reuse and maintainability of our libraries.
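
CSL resolves these generics at compile time, but the basic idea of a single function body serving multiple numeric types can be sketched in plain Python with NumPy. The snippet below is only a loose analogy: Python dispatches on dtype at run time, and axpy is a made-up example function, not a CSL library routine.

import numpy as np

def axpy(a, x, y):
    # One generic body: the precision of the result follows the dtypes of the
    # inputs, loosely analogous to a CSL generic function being specialized by
    # the types of its arguments.
    return a * x + y

x16, y16 = np.ones(4, np.float16), np.ones(4, np.float16)
x32, y32 = np.ones(4, np.float32), np.ones(4, np.float32)

print(axpy(np.float16(2.0), x16, y16).dtype)  # float16
print(axpy(np.float32(2.0), x32, y32).dtype)  # float32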

For advanced users, CSL now supports several low-level features to control task prioritization and to execute operations on data structure descriptors (DSDs) more efficiently. Initial support for a new remote procedure call (RPC) mechanism for host-device communication has been introduced. Additionally, compile times are greatly improved, and compiler diagnostics are more descriptive and helpful.

Example programs

Several new example programs now ship with our SDK to showcase some of this new functionality. This includes a dense Cholesky decomposition, which takes a symmetric positive-definite matrix A and finds the lower triangular matrix L such that LLᵀ = A, implementing a novel algorithm on a triangle of processing elements to compute L. Additionally, we’ve introduced a new example to document our debug library, which allows users to easily trace values and record timestamps during execution.
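
As a quick refresher on what the Cholesky example computes, the NumPy check below illustrates the math only; it says nothing about the on-wafer algorithm, and the matrix values are arbitrary.

import numpy as np

# Build a small symmetric positive-definite matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4.0 * np.eye(4)

# The decomposition the example computes: a lower triangular L with L @ L.T == A.
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)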

We’ve also provided a new GEMV (general matrix-vector product) example built around collective communications. This example, which demonstrates the power and simplicity of our new collective scatter, gather, broadcast, and reduce operations, deserves a bit more explanation. We can walk through how this program calculates y = Ax + b on a 4 by 4 grid of PEs, with the corresponding submatrix of A already resident on each PE.

The program begins by streaming b and x into the top left PE, and then performs scatter operations to send portions of b and x across the leftmost column of PEs and topmost row of PEs, respectively.

A broadcast is then performed across all four columns, copying the appropriate subvector of x to each PE. After the broadcast is finished, each PE can calculate its own contribution to the resulting vector y. While the rightmost three columns just compute their local piece of Ax, the leftmost column additionally adds in the contribution from b. A reduction across each row of PEs then sums these partial contributions into the final y.
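
The data decomposition is easy to model on the host. The NumPy sketch below mirrors the on-wafer layout described above; the sizes are arbitrary, and this is an illustration of the math rather than of the CSL code.

import numpy as np

M, N, P = 8, 8, 4          # matrix dimensions and PE-grid size (arbitrary)
A = np.random.rand(M, N).astype(np.float32)
x = np.random.rand(N).astype(np.float32)
b = np.random.rand(M).astype(np.float32)

# Split A into a P x P grid of blocks, and x and b into P chunks each,
# mirroring the scatters of b down the left column and x across the top row.
A_blocks = [np.hsplit(block_row, P) for block_row in np.vsplit(A, P)]
x_chunks = np.split(x, P)   # broadcast down each column of PEs
b_chunks = np.split(b, P)   # held only by the leftmost column of PEs

# Each PE (i, j) computes its local contribution; column 0 also adds b_i.
partial = [[A_blocks[i][j] @ x_chunks[j] + (b_chunks[i] if j == 0 else 0)
            for j in range(P)] for i in range(P)]

# The row-wise reduction sums the partial results into the final y.
y = np.concatenate([sum(partial[i]) for i in range(P)])

assert np.allclose(y, A @ x + b, rtol=1e-4)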

New host runtime

We’ve also released an early version of a new host runtime, the part of the software stack responsible for launching programs on the wafer and moving data between the CS-2 and a host CPU server. Because this runtime introduces new Python APIs for host code, we’ve added new versions of a couple of our example programs to showcase it. The new runtime delivers a greater than 4x improvement in host-to-device and device-to-host data transfer speeds, unlocking significant performance gains for applications that frequently checkpoint and write data over the course of a simulation, or that process more data than fits on the wafer at once.
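
To give a flavor of the new Python host APIs, here is a simplified sketch of the general shape of a host program. The directory, symbol, and handler names ("out", "x", "y", "compute") are placeholders, and the arguments shown are abbreviated; see the shipped example programs for complete, working versions.

import numpy as np
from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder

# Load and start the program compiled into the "out" directory (placeholder name).
runner = SdkRuntime("out")
runner.load()
runner.run()

x = np.arange(16, dtype=np.float32)
y = np.zeros(16, dtype=np.float32)

# Copy x to the device, run the kernel's "compute" handler, and copy y back.
runner.memcpy_h2d(runner.get_id("x"), x, 0, 0, 1, 1, x.size,
                  streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
                  order=MemcpyOrder.ROW_MAJOR, nonblock=False)
runner.launch("compute", nonblock=False)
runner.memcpy_d2h(y, runner.get_id("y"), 0, 0, 1, 1, y.size,
                  streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
                  order=MemcpyOrder.ROW_MAJOR, nonblock=False)

runner.stop()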

Our stencil example program, which illustrates many of the core computational motifs in TotalEnergies’ seismic modelling work, has been rewritten to take advantage of the new runtime. We’ve also ported over our residual example program, which illustrates calculating the norm |b - Ax| distributed over several PEs.

Future releases will focus on improving the usability and performance of this new runtime even further, and it will eventually replace our legacy runtime altogether. We’ll also be releasing a version of the new runtime that supports C++ host code, which will allow users to integrate CSL kernels into their existing C++ code bases.

How to get started

If you’re already an SDK user and you haven’t yet downloaded 0.6.0, reach out to developer@cerebras.net to get set up with the new version. If you’re not yet a Cerebras SDK user and you want to dive in, please request access to our SDK here.

Have questions?

Interested in exploring if your application or algorithm might be a fit for the CS-2? Send us an email at developer@cerebras.net, or join the discussion at discourse.cerebras.net.

Leighton Wilson, HPC Solutions Engineer / February 15, 2023