Data access latencies and bandwidth bottlenecks are frequently major limiting factors for the computational effectiveness of many-core processor architectures. This talk introduces a region-based cache coherence (RBCC) approach for distributed shared memory (DSM) hierarchies to reduce the synchronization overhead of coherence maintenance and to improve the locality between computing resources and data. A 2D array of compute tiles with multiple heterogeneous RISC cores, two levels of caches and a tile-local SRAM serves as our reference processing node. Multiple such compute tiles, I/O tiles and a globally shared DDR SDRAM tile are interconnected by a meshed Network on Chip (NoC) that supports multiple quality-of-service levels. Overall, this processing architecture follows a DSM model.
Embedded system applications, with their inherently limited parallelism, rarely exploit global coherence in many-core architectures. Global coherence spanning all tiles does not scale well, and coherence can instead be confined to a limited cluster of tiles. Therefore, we favor RBCC among a limited number of compute tiles (the working set) over global coherence approaches. Coherence regions can be dynamically configured at runtime and comprise an arbitrary set of adjacent or non-adjacent compute tiles. We further extend RBCC with RBCC-malloc(), which transparently tailors coherence to the truly shareable application working sets at runtime. However, we will also show that the benefits of RBCC strongly depend on the placement of tasks and data among the tiles of the coherence region, which can affect performance by up to an order of magnitude.
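The idea of confining sharer tracking to a runtime-configured region can be illustrated with a small software model. This is a minimal sketch, not the hardware design presented in the talk: the class `CoherenceRegion`, its method names, and the choice to track sharers per tile (rather than per core) with one sharer bit per region member are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CoherenceRegion:
    """Hypothetical model of one coherence region (not the actual RBCC hardware)."""

    # Arbitrary set of tile IDs; the tiles need not be adjacent in the mesh.
    tiles: frozenset[int]
    # Maps a cache-line address to the region tiles currently holding a copy.
    sharers: dict[int, set[int]] = field(default_factory=dict)

    def _check_member(self, tile: int) -> None:
        if tile not in self.tiles:
            raise ValueError(f"tile {tile} is outside the coherence region")

    def record_read(self, tile: int, line: int) -> None:
        """Register `tile` as a sharer of cache line `line`."""
        self._check_member(tile)
        self.sharers.setdefault(line, set()).add(tile)

    def write(self, tile: int, line: int) -> set[int]:
        """Return the tiles to invalidate; the writer becomes the only sharer."""
        self._check_member(tile)
        to_invalidate = self.sharers.get(line, set()) - {tile}
        self.sharers[line] = {tile}
        return to_invalidate

    def sharer_vector_bits(self) -> int:
        """Sharer bits per directory entry, assuming one bit per region tile."""
        return len(self.tiles)


# A region of four non-adjacent tiles out of an assumed 16-tile system.
region = CoherenceRegion(tiles=frozenset({0, 5, 10, 15}))
region.record_read(0, 0x40)
region.record_read(10, 0x40)
invalidated = region.write(5, 0x40)  # tiles 0 and 10 must be invalidated
```

In this model, a directory entry needs only as many sharer bits as the region has members (4 here, versus 16 if every tile were tracked), and a sharer check never has to consult tiles outside the region. How the actual directories are organized is not specified in this abstract.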
Synthesis results for a multi-FPGA system design consisting of a 16-tile, 64-core system reveal that RBCC allows maintaining substantially smaller coherence directories (57% reduction in BRAM utilization compared to global coherence for working sets of up to 16 cores) and shorter sharer-checking latencies than global coherence. Experimental evaluations show an application acceleration of up to 42% compared to a message-passing-based implementation.
Andreas Herkersdorf is a professor in the Department of Electrical and Computer Engineering and is also affiliated with the Department of Informatics at the Technical University of Munich (TUM). He received a doctoral degree from ETH Zurich, Switzerland, in 1991. Between 1988 and 2003, he held technical and management positions at the IBM Research Laboratory in Rüschlikon, Switzerland. Since 2003, Dr. Herkersdorf has been the head of the Chair of Integrated Systems at TUM. He is a senior member of the IEEE, a member of the DFG (German Research Foundation) Review Board, and serves as an editor for Springer and De Gruyter journals on design automation and information technology. His research interests include application-specific multi-processor architectures, IP network processing, Network on Chip and self-adaptive fault-tolerant computing.