IBM Squeezes AI Into Tiny Scalable Cores

Posted by alphaatlas 12:11 PM (CDT)

Wednesday October 10, 2018

At VLSI 2018, IBM showed off an interesting machine learning architecture. Instead of squeezing a ton of throughput through huge cores, as AMD, Nvidia, and Google do with their AI products, IBM tiled tiny cores in a giant 16x32 2D array. David Kanter says that each "PE is a tiny core; including an instruction buffer, fetch/decode stage, 16-entry register file, a 16-bit floating-point execution units, and binary and ternary ALUs, and fabric links to and from the neighboring PEs." There are also special function units (SFUs) that handle 32-bit floating-point data, and separate X and Y caches with 192GB/s of bandwidth each. Much like Intel's Skylake Xeons, the cores are connected to each other with a mesh fabric. IBM's test accelerator reportedly delivered 1.5 TFLOPS of machine learning training throughput from a 9mm^2 chip built on a 14nm process, and reached 95% utilization when training on a batch of images.
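To put those figures in perspective, here is a rough back-of-the-envelope calculation in Python. It is only a sketch under assumptions the summary does not confirm: that the test chip contains the full 16x32 array, that 1.5 TFLOPS is a peak figure, and that the 95% utilization applies to that peak. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope figures derived from the numbers quoted in the summary.
# ASSUMPTIONS (not stated in the source): the test chip holds the full 16x32
# PE array, 1.5 TFLOPS is a peak number, and 95% utilization applies to it.

ROWS, COLS = 16, 32          # PE array dimensions
PEAK_TFLOPS = 1.5            # reported training throughput
DIE_AREA_MM2 = 9.0           # reported chip area
UTILIZATION = 0.95           # reported utilization on a batch of images


def derived_figures(rows, cols, peak_tflops, area_mm2, utilization):
    """Return per-PE and per-area throughput implied by the headline numbers."""
    pe_count = rows * cols
    return {
        "pe_count": pe_count,
        "gflops_per_pe": peak_tflops * 1000.0 / pe_count,
        "tflops_per_mm2": peak_tflops / area_mm2,
        "effective_tflops": peak_tflops * utilization,
    }


figures = derived_figures(ROWS, COLS, PEAK_TFLOPS, DIE_AREA_MM2, UTILIZATION)
for name, value in figures.items():
    print(f"{name}: {value:.3f}")
```

Under those assumptions the array works out to 512 PEs at roughly 2.9 GFLOPS each, which underlines how small each core is compared with a GPU compute unit.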

As a research project, the absolute performance is not terribly important. However, the key architectural choices are quite interesting. IBM's processor uses a large array of very small processor cores with very little SIMD. This architectural choice enables better performance for sparse dataflow (e.g., sparse activations in a neural network). In contrast, Google, Intel, and Nvidia all rely on a small number of large cores with lots of dense data parallelism to achieve good performance. Related, IBM's PEs are arranged in a 2D array with a mesh network, a natural organization for planar silicon and a workload with a reasonable degree of locality. While Intel processors also use a mesh fabric for inter-core communication, GPUs have a rather different architecture that looks more similar to a crossbar. The IBM PEs are optimized for common operations (e.g., multiply-accumulate) and sufficiently programmable to support different dataflows and reuse patterns. Less common operations are performed outside of the core in the special function units. As with many machine learning processors, a variety of reduced precision data formats are used to improve throughput. Last, the processor relies on software-managed data (and instruction) movement in explicitly addressed SRAMs, rather than hardware-managed caches. This approach is similar to the Cell processor and offers superior flexibility and power-efficiency (compared to caches) at the cost of significant programmer and tool chain complexity. While not every machine learning processor will share all these attributes, it certainly illustrates a different approach from any of the incumbents - and more consistent with the architectures chosen by start-ups such as Graphcore or Wave that solely focus on machine learning and neural networks.
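The point about narrow cores and sparse data can be illustrated with a toy model in Python. It is not a description of IBM's hardware. It simply assumes that a group of SIMD lanes can only be skipped when every activation in the group is zero, and counts how many lane slots end up occupied for different SIMD widths. The function name and the sample activation vector are hypothetical.

```python
# Toy model of sparse activations versus SIMD width.
# ASSUMPTION (for illustration only): hardware can skip a group of lanes
# only when every activation in that group is zero.

def lane_slots_used(activations, simd_width):
    """Count lane slots occupied when zero-skipping works per SIMD group."""
    slots = 0
    for start in range(0, len(activations), simd_width):
        group = activations[start:start + simd_width]
        if any(a != 0 for a in group):
            # One nonzero value forces the whole group to issue.
            slots += len(group)
    return slots


# Hypothetical sparse activation vector: 3 nonzero values out of 16.
activations = [0, 3, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 2]

scalar_slots = lane_slots_used(activations, 1)   # one value per tiny core
quad_slots = lane_slots_used(activations, 4)     # 4-wide SIMD groups
wide_slots = lane_slots_used(activations, 8)     # 8-wide SIMD groups

print(scalar_slots, quad_slots, wide_slots)
```

In this model the scalar case does work only for the three nonzero values, while the 8-wide case cannot skip anything, which is the intuition behind trading SIMD width for many small cores.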