The Cerebras SDK

Cerebras Systems is a team of pioneering engineers of all types, driven by the world’s largest computing challenges. Our newly announced flagship product, the CS-3 system, is powered by the world’s largest and fastest AI processor, our Wafer-Scale Engine-3 (WSE-3). Our software makes scaling the largest models dead simple, avoiding the complexity of traditional distributed computing. Leading institutions use Cerebras solutions to develop pathbreaking proprietary models and to train open-source models with millions of downloads.

Additionally, many scientists and researchers want to harness the full power of the WSE for domains beyond AI. The Cerebras SDK is our publicly available platform for HPC, enabling users to write completely custom, low-level kernels for all generations of the WSE using the Cerebras Software Language (CSL), a C-like language built around a dataflow programming model. In this model, computation is activated by the arrival of data on a processing element (PE) from the fabric that knits the cores together.

Last week, we released version 1.1.0 of the Cerebras SDK, our second publicly available release and our first with initial support for the WSE-3. The Cerebras SDK ships as a Singularity container with a fabric simulator, allowing anybody to develop programs for the Cerebras architecture on their own x86 computing resources.

The rest of this post gives an overview of the Wafer-Scale Engine and the opportunity it presents for high-performance computing, along with some research highlights from Cerebras SDK users. If you’re interested in downloading the latest SDK or in opportunities for researchers, see the end of this post for more information.

Harnessing the Power of the Wafer-Scale Engine

The Cerebras WSE is a massively parallel compute accelerator containing hundreds of thousands of independent processing elements (PEs); the WSE-3 features 900,000 of them. The PEs are interconnected by communication links into a two-dimensional rectangular mesh on a single silicon wafer. Each PE has its own 48 kB of memory, used by it and no other, containing both instructions and data. 32-bit messages, called wavelets, can be sent to or received from neighboring PEs in just a couple of clock cycles. Wavelets travel the fabric along a virtual channel called a color, and the arrival of a wavelet on a given color can activate a chunk of executable code known as a task. Thus, the PE has dataflow characteristics: a program’s execution is driven by the flow of data.
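
To make this model concrete, here is a minimal CSL sketch in the spirit of the SDK tutorials. It binds a task to a color’s data task ID, so that each 32-bit wavelet arriving on that color activates the task with the payload as its argument. The color numbers and names are illustrative, and the color-based binding shown here follows the WSE-2-style tutorials; treat this as a sketch rather than a complete program:

    // A fabric color along which wavelets arrive (color number illustrative)
    const recv_color: color = @get_color(8);

    // The data task ID associated with that color
    const recv_task_id: data_task_id = @get_data_task_id(recv_color);

    var sum: f32 = 0.0;

    // Activated each time a wavelet arrives on recv_color; the 32-bit
    // payload is delivered as the task's argument
    task recv_wavelet(value: f32) void {
      sum += value;
    }

    // Sending is the mirror image: a fabric output DSD targets a color,
    // and @fmovs pushes one wavelet onto the fabric (router and queue
    // configuration omitted for brevity)
    const send_color: color = @get_color(9);
    const out_dsd = @get_dsd(fabout_dsd, .{ .extent = 1, .fabric_color = send_color });

    fn send(value: f32) void {
      @fmovs(out_dsd, value);
    }

    comptime {
      // Associate the receive task with its color's task ID
      @bind_data_task(recv_wavelet, recv_task_id);
    }

How wavelets on each color traverse the mesh is configured separately on each PE’s router, so the same per-PE kernel can be composed into row, column, or broadcast communication patterns.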

The extremely low-latency fabric connecting the PEs has a massive aggregate peak bandwidth of 214 petabits per second. Additionally, the 48 kB of memory owned by each PE is accessible in a single cycle. More specifically, each PE can read 128 bits from and write 64 bits to memory each cycle: that’s 24 bytes per PE per cycle, which across 900,000 PEs at a clock rate on the order of a gigahertz yields an aggregate peak memory bandwidth of more than 20 petabytes per second. Wafer-scale computing means not only unprecedented scaling, but unprecedented memory bandwidth, making more algorithms compute-bound than ever. The single-cycle memory accesses and ultra-fast fabric present a totally new paradigm for HPC applications.
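
Within a PE, CSL exposes this bandwidth through data structure descriptors (DSDs), compact descriptions of memory access patterns that vector builtins then stream through the memory port. Below is a short, illustrative sketch; the vector names, sizes, and the AXPY-style operation are my own example rather than anything benchmarked in this post:

    // Two vectors in the PE's local 48 kB memory (sizes illustrative)
    var x = @zeros([1024]f32);
    var y = @zeros([1024]f32);

    // Memory DSDs describing a linear walk over each vector
    const x_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{1024} -> x[i] });
    const y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{1024} -> y[i] });

    // y = y + alpha * x: one fused multiply-add per element, streamed
    // through the PE's 128-bit-read / 64-bit-write memory interface
    fn axpy(alpha: f32) void {
      @fmacs(y_dsd, y_dsd, x_dsd, alpha);
    }

Because the operands live entirely in the PE’s own single-cycle memory, loops like this are limited by compute throughput rather than by memory, which is what makes so many kernels compute-bound on this architecture.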

Recent Research with the Cerebras SDK

CSL and the Cerebras SDK are already enabling innovation in computational science. Researchers at TotalEnergies, KAUST, ANL, PSC, and EPCC, among others, have used the Cerebras SDK to develop applications in areas such as seismic processing and Monte Carlo particle transport. I’ll highlight some of this research here.

KAUST scaled the memory wall for multi-dimensional seismic processing on Cerebras, redesigning a Tile Low-Rank Matrix-Vector Multiplication (TLR-MVM) algorithm for the Cerebras CS-2 to take advantage of its ultra-high memory bandwidth. The kernel achieved a sustained memory bandwidth of 92.58 PB/s on 48 CS-2s within the Condor Galaxy 1 (CG-1) AI supercomputer. That is 3x the aggregate theoretical bandwidth of Leonardo or Summit, two of the world’s current top-five supercomputers, and approaches the estimated upper bound (95.38 PB/s) for Fugaku, another top-five supercomputer, at a fraction of the energy consumption. This work was selected as a 2023 Gordon Bell Prize finalist.

TotalEnergies achieved 204x the performance of an Nvidia A100 GPU on a finite-volume flux calculation used in a single-phase-flow carbon sequestration simulation, at 2.2x greater energy efficiency; this work was presented at SC23. Previously, TotalEnergies implemented a 25-point stencil for the 3D wave equation with source perturbation, achieving a 228x speedup over the A100, and presented that work at SC22. TotalEnergies has also implemented a proprietary reverse time migration (RTM) code for seismic imaging that runs on a Cerebras system.

Argonne National Laboratory demonstrated 130x A100 performance on a continuous-energy Monte Carlo particle transport kernel central to nuclear energy applications. The researchers implemented the macroscopic cross-section lookup kernel and compared its performance to a highly optimized CUDA implementation. A forthcoming publication accepted at the International Conference on Physics of Reactors (PHYSOR) further optimizes this kernel and demonstrates a 180x speedup over the A100.

How to get started

If you’re not yet a Cerebras SDK user and want to dive in, get access to our SDK here. You can find documentation and tutorials here, and example programs here.

If you’re a researcher interested in getting hands-on with Cerebras hardware, our partners at ANL and PSC provide research grants for CS-2 access.

Have questions? Interested in exploring if your application or algorithm might be a fit for Cerebras? Send us an email at developer@cerebras.net, or join the discussion at discourse.cerebras.net.

Learn more

Selected Cerebras SDK Publications

  • Luczynski & Gianinazzi et al. Near-optimal wafer-scale reduce. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing (2024). arXiv:2404.15888
  • Tramm et al. Efficient algorithms for Monte Carlo particle transport on AI accelerator hardware. Computer Physics Communications (2024). arXiv:2311.01739
  • Moreno et al. Trackable agent-based evolution models at wafer scale. arXiv preprint (2024). arXiv:2404.10861
  • Ltaief et al. Scaling the memory wall for multi-dimensional seismic processing with algebraic compression on Cerebras CS-2 systems. Proceedings of SC23 (2023). doi.org/10.1145/3581784.3627042
  • Sai et al. Massively distributed finite-volume flux computation. Proceedings of SC23 (2023). arXiv:2304.11274
  • Jacquelin et al. Massively scalable stencil algorithm. Proceedings of SC22 (2022). arXiv:2204.03775