Fine-Grain Many-Core Processor Arrays for Efficient and High-Performance Computation
Thursday, January 17, 2019 - 4:30pm to 5:30pm
Bldg. 380 Rm. 380X
Prof. Bevan Baas - Electrical and Computer Engineering - UC Davis
Abstract / Description:
The continually-growing number of devices available per chip assures the presence of many processing blocks per die communicating by some type of inter-processor interconnect. It is interesting to consider what the granularity of the processing blocks should be given a fixed amount of die area. The smallest reasonable tile size is on the order of an FPGA's LUT. Between the domains of FPGAs and traditional processors lies a lightly-explored region which we call fine-grain many-core, whose processors: can be programmed by simple traditional programs; typically operate with high throughput and high energy-efficiency; are well suited for deep submicron fabrication technologies; and are well matched to many DSP, multimedia, and embedded workloads and, somewhat counterintuitively, also to some enterprise and scientific kernels.
The AsAP project has developed fine-grain many-core systems composed of large numbers of programmable reduced-complexity processors with no algorithm-specific hardware and with individual per-processor digitally-tunable clock oscillators operating completely independently with respect to each other (GALS). Due to the independence of the MIMD cores and individual near-optimal oscillator halting, the system operates with a power dissipation that is almost ideally proportional to the system load.
A third-generation 32 nm design that integrates 1000 independent programmable processors and 12 memory modules has been designed and fabricated. The processors and memory modules communicate through a reconfigurable full-rate circuit-switched mesh network and a complementary very-small-area packet router, and they operate up to an average maximum clock frequency of 1.78 GHz, which is believed to be the highest clock frequency achieved by a fabricated processor designed in a university. At a supply voltage of 0.9 V, processors operating at an average of 1.24 GHz dissipate 17 mW while issuing one instruction per cycle. At 0.56 V, processors operating at 115 MHz dissipate 0.61 mW, resulting in 5.3 pJ/instruction and enabling 1000 100%-active cores to be powered by a single AA battery.
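The energy figures above follow from dividing per-processor power by instruction rate. The short Python sketch below reproduces that arithmetic from the numbers quoted in the abstract. The function name is illustrative, and the sketch assumes one instruction per cycle at both operating points; the abstract states this explicitly only for the 0.9 V case.

```python
def energy_per_instruction_pj(power_mw, freq_mhz, ipc=1.0):
    """Energy per instruction in picojoules: P / (f * IPC).

    ipc defaults to 1.0, an assumption taken from the abstract's
    "one instruction per cycle" statement.
    """
    power_w = power_mw * 1e-3             # mW -> W
    instr_per_s = freq_mhz * 1e6 * ipc    # instructions per second
    return power_w / instr_per_s * 1e12   # J -> pJ


# Operating points quoted in the abstract (per processor).
low_voltage_pj = energy_per_instruction_pj(0.61, 115)    # 0.56 V point
high_voltage_pj = energy_per_instruction_pj(17, 1240)    # 0.9 V point

# Total power for 1000 fully active cores at the 0.56 V point, in watts.
array_power_w = 1000 * 0.61e-3

print(f"0.56 V: {low_voltage_pj:.1f} pJ/instruction")
print(f"0.9 V:  {high_voltage_pj:.1f} pJ/instruction")
print(f"1000 cores at 0.56 V: {array_power_w:.2f} W")
```

Under these assumptions the 0.56 V point works out to about 5.3 pJ/instruction, matching the quoted figure, and 1000 fully active cores draw about 0.61 W in total, which is the basis for the AA-battery comparison.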
Several dozen DSP and general tasks have been coded, along with more complex applications including AES encryption engines, a full-rate H.264 1080p 30 fps HDTV residual encoder, a fully-compliant IEEE 802.11a/11g Wi-Fi wireless LAN baseband transmitter and receiver, a synthetic aperture radar (SAR) engine, a complete first-pass H.264 encoder, convolutional neural networks, large sparse matrix operations, sorting and processing of enterprise data, and others. Power, throughput, and die area results generally compare very well with solutions on existing programmable processors. A C++ compiler and automatic mapping tool greatly simplify programming.