Keynotes
Keynote #1: Multi-core Multi-thread RISC-V-based System-on-Chip
Cong-Kha Pham
The University of Electro-Communications, Japan
Abstract: The relentless demand for computational power has driven the development of diverse parallel architectures, fundamentally categorized as general-purpose and special-purpose systems. General-purpose systems, characterized by programmable controllers, provide versatility across a wide range of applications. Conversely, special-purpose systems, utilizing fixed hardware controllers, prioritize efficiency for specific tasks. Multicore processors, a cornerstone of modern computing, are employed in both categories, leveraging interconnected functional modules to enable concurrent execution. General-purpose multicore architectures, such as linear arrays and trees, are adept at handling varied workloads, while specialized multiprocessors, including systolic arrays and hypercubes, are tailored to specific computational patterns. Designing these complex systems requires significant expertise to optimize performance and overcome the inherent memory bottlenecks associated with traditional architectures.
The advent of the big-data era has brought about a paradigm shift towards data-centric design, placing strong emphasis on high-quality, structured data. This shift has also blurred the traditional boundaries between general-purpose and special-purpose systems, leading to the emergence of hybrid designs that integrate features from both categories. Notable examples include NVIDIA’s GH200 and GB200 architectures, which effectively combine the flexibility of programmability with the efficiency of specialized hardware accelerators.
Furthermore, the importance of hardware/software co-design has become increasingly evident. Modern computing systems are highly complex, requiring a holistic approach in which hardware and software are developed in tandem to achieve optimal performance and efficiency. Software frameworks developed by companies such as Tenstorrent and Cerebras, alongside custom chips from Meta and Microsoft, underscore the significance of integrated design methodologies. These efforts aim to strike a balance between the flexibility of general-purpose systems and the efficiency of application-specific designs.
Our research leverages the RISC-V open-source Instruction Set Architecture (ISA) to develop advanced multicore systems, focusing on both hardware design and multithreaded software. A high-performance core initiates thread generation for non-parallel tasks, storing thread information in a queue accessible to the other cores. When the queue is written, idle cores are activated to fetch and store the thread data, mitigating shared-resource bottlenecks. Bidirectional private buses are employed to further minimize data-movement overhead.
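A minimal software model may help make this queue-based dispatch concrete. In the sketch below, POSIX threads stand in for the worker cores; the descriptor layout, queue size, and names (thread_desc_t, queue_push, worker) are illustrative assumptions rather than the actual SoC interface, where the queue resides in shared memory and a queue write wakes an idle core.

/*
 * Minimal software model of the queue-based thread dispatch described above.
 * POSIX threads stand in for the worker cores; all names and fields are
 * illustrative assumptions, not the SoC's actual interface.
 */
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4
#define QUEUE_CAP   16

/* Hypothetical thread descriptor: entry point plus argument. */
typedef struct {
    void (*func)(int);
    int arg;
} thread_desc_t;

static thread_desc_t queue[QUEUE_CAP];
static int q_head = 0, q_tail = 0, q_done = 0;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;

/* Primary core: publish a descriptor and wake an idle worker. */
static void queue_push(thread_desc_t d)
{
    pthread_mutex_lock(&q_lock);
    queue[q_tail++ % QUEUE_CAP] = d;      /* sketch: no overflow check */
    pthread_cond_signal(&q_nonempty);     /* models the queue-write wake-up */
    pthread_mutex_unlock(&q_lock);
}

/* Worker core: stay idle until the queue is written, then fetch and run. */
static void *worker(void *id)
{
    (void)id;                             /* core id unused in this sketch */
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (q_head == q_tail && !q_done)
            pthread_cond_wait(&q_nonempty, &q_lock);
        if (q_head == q_tail && q_done) {
            pthread_mutex_unlock(&q_lock);
            return NULL;
        }
        thread_desc_t d = queue[q_head++ % QUEUE_CAP]; /* fetch descriptor */
        pthread_mutex_unlock(&q_lock);
        d.func(d.arg);                    /* execute the thread locally */
    }
}

static void example_task(int i) { printf("task %d done\n", i); }

int main(void)
{
    pthread_t workers[NUM_WORKERS];
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *)i);

    /* Primary core generates threads and publishes them through the queue. */
    for (int i = 0; i < 8; i++)
        queue_push((thread_desc_t){ example_task, i });

    pthread_mutex_lock(&q_lock);
    q_done = 1;
    pthread_cond_broadcast(&q_nonempty);
    pthread_mutex_unlock(&q_lock);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}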
We have explored the integration of near-cache processing capabilities and tightly coupled accelerators, built around a hybrid Level 1 (L1) cache. This cache can function as either a traditional cache or local memory, enabling efficient data access for accelerators such as Matrix Processors. Techniques such as divide-and-conquer and task splitting are applied to optimize performance with minimal hardware overhead.
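To illustrate the divide-and-conquer and task-splitting idea, the sketch below tiles a matrix multiplication so that each tile fits a local-memory-sized buffer, mimicking the hybrid L1 configured as local memory feeding a matrix unit. The problem size, tile size, and the matrix_accel_tile stand-in are assumptions for illustration; on the SoC the inner loops would run on the tightly coupled accelerator, and each independent (i0, j0) tile could be dispatched as a task through the thread queue described earlier.

/*
 * Divide-and-conquer sketch: a large matrix multiplication split into tiles
 * sized for the hybrid L1 used as local memory. The "accelerator" is modelled
 * by a plain C loop; sizes and names are assumed placeholders.
 */
#include <stdio.h>
#include <string.h>

#define N    64     /* full problem size (assumed)                        */
#define TILE 16     /* tile edge chosen to fit the L1-as-local-memory     */

static float A[N][N], B[N][N], C[N][N];

/* Stand-in for the tightly coupled matrix unit: C_tile += A_tile * B_tile. */
static void matrix_accel_tile(int i0, int j0, int k0)
{
    for (int i = i0; i < i0 + TILE; i++)
        for (int k = k0; k < k0 + TILE; k++)
            for (int j = j0; j < j0 + TILE; j++)
                C[i][j] += A[i][k] * B[k][j];
}

int main(void)
{
    /* Simple test data: A = identity, B = all ones, so C should be all ones. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (i == j) ? 1.0f : 0.0f;
            B[i][j] = 1.0f;
        }
    memset(C, 0, sizeof C);

    /* Task splitting: each (i0, j0) tile is an independent unit of work. */
    for (int i0 = 0; i0 < N; i0 += TILE)
        for (int j0 = 0; j0 < N; j0 += TILE)
            for (int k0 = 0; k0 < N; k0 += TILE)
                matrix_accel_tile(i0, j0, k0);

    printf("C[0][0] = %.1f (expected 1.0)\n", C[0][0]);
    return 0;
}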
Software benchmarks, including matrix multiplication and convolution, have demonstrated significant performance gains. Four-core configurations achieve speed-up factors ranging from 4 to 6 times, while eight-core configurations yield speed-ups ranging from 6 to 18 times. The hybrid L1 cache, coupled with tightly coupled accelerators, achieves speed-ups ranging from 6 to over 1,000 times compared to a single core, highlighting the effectiveness of this integrated design approach.