Deep neural networks (DNNs) are the backbone of modern artificial intelligence (AI). While they deliver state-of-the-art accuracy in numerous AI tasks, deploying DNNs in the field remains challenging due to their high computational complexity and their diverse shapes and sizes. Therefore, DNN accelerators that achieve high performance and energy efficiency across a wide range of DNNs are critical for enabling AI in real-world applications.
In this thesis, we present Eyeriss, a hardware architecture for DNN processing that is optimized for performance, energy efficiency, and flexibility. Eyeriss minimizes data movement, which is the bottleneck for both performance and energy efficiency in DNN processing, with a novel dataflow named row stationary (RS). The RS dataflow supports highly parallel processing while fully exploiting data reuse in a multi-level memory hierarchy to optimize the overall system energy efficiency for any DNN shape and size. It has demonstrated 1.4× to 2.5× higher energy efficiency than existing dataflows.
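To make the RS idea concrete, the following is a minimal functional sketch (not the actual Eyeriss hardware or its exact mapping) of a 2-D convolution organized the RS way: each logical PE keeps one filter row stationary and runs a 1-D convolution primitive against one input row, while partial sums from the PEs handling different filter rows accumulate into the same output row. The function name and loop structure are illustrative assumptions.

```python
import numpy as np

def conv2d_row_stationary(ifmap, filt):
    """2-D convolution (cross-correlation, as in DNNs) decomposed as in
    the row-stationary dataflow. Illustrative sketch only: each iteration
    of the outer loop plays the role of one PE row holding filter row r
    stationary while input rows stream past it."""
    H, W = ifmap.shape
    R, S = filt.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for r in range(R):                    # one logical PE per filter row
        w_row = filt[r]                   # weights stay stationary in the PE
        for y in range(out.shape[0]):     # input rows stream through
            x_row = ifmap[y + r]
            for j in range(out.shape[1]):
                # 1-D convolution primitive inside the PE; partial sums
                # from different filter rows accumulate into out[y, :]
                out[y, j] += np.dot(w_row, x_row[j:j + S])
    return out
```

Note how each filter row is reused across every output row and each input row is reused by up to R PEs, which is the kind of reuse the multi-level memory hierarchy is meant to capture.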
We present two versions of the Eyeriss architecture that support the RS dataflow. Eyeriss v1 targets large DNNs that have plenty of data reuse. It features a flexible mapping strategy that increases the utilization of the processing elements (PEs) for high performance, a multicast on-chip network (NoC) that exploits data reuse, and it further exploits data sparsity to reduce PE power by 45% and off-chip bandwidth by 1.2×–1.9×. Fabricated in 65 nm CMOS, Eyeriss v1 consumes 278 mW at 34.7 fps on the CONV layers of AlexNet, which was 10× more energy efficient than a mobile GPU.
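The bandwidth savings from sparsity come from compressing the many zero activations before they cross the chip boundary. The sketch below shows the basic idea behind run-length coding of zero runs, encoding each nonzero value together with the count of zeros that precede it; the pair representation, field widths, and 5-bit run limit here are simplifying assumptions, not the exact on-chip format.

```python
def rlc_encode(acts, max_run=31):
    """Run-length compress a 1-D activation stream into (zero_run, value)
    pairs. A pair with value 0 is emitted when a zero run hits max_run
    (counting as max_run + 1 zeros) or to flush trailing zeros."""
    pairs, run = [], 0
    for a in acts:
        if a == 0 and run < max_run:
            run += 1
        else:
            pairs.append((run, a))
            run = 0
    if run:
        pairs.append((run - 1, 0))  # trailing zeros: run-1 zeros plus the value 0
    return pairs

def rlc_decode(pairs):
    """Invert rlc_encode: expand each (zero_run, value) pair."""
    out = []
    for run, a in pairs:
        out.extend([0] * run)
        out.append(a)
    return out
```

For typical post-ReLU activation maps, where a large fraction of values are zero, such a scheme shrinks the off-chip traffic roughly in proportion to the sparsity.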
Eyeriss v2 addresses the recent trend toward compact DNNs, which reduce size and computation but introduce higher variation in the amount of data reuse and sparsity. It has two key features: (1) a flexible and scalable NoC that provides high bandwidth when data reuse is low while still exploiting data reuse when it is available; (2) an improved dataflow, named RS Plus, that further increases PE utilization. Together, they provide over 10× higher throughput than Eyeriss v1. Eyeriss v2 also exploits sparsity for an additional throughput increase of up to 4.6×.
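The benefit of a flexible NoC can be seen with a toy delivery-time model (purely illustrative; the parameters and cost model are assumptions, not the Eyeriss v2 design): when many PEs share each datum, a single broadcast per unique datum is efficient, but when reuse is low, a broadcast-only network serializes many unique data items through one source, whereas a flexible network can stream distinct items over several ports in parallel.

```python
def cycles_to_fill_pes(num_pes, reuse, src_ports):
    """Toy model: `reuse` PEs share each unique datum, so num_pes // reuse
    distinct items must be delivered. A broadcast-only NoC issues one item
    per cycle from a single source; a flexible NoC can use all `src_ports`
    in parallel for distinct items. Returns (broadcast_only, flexible)."""
    unique = num_pes // reuse            # distinct data items needed
    broadcast_only = unique              # one item per cycle, reaching all sharers
    flexible = -(-unique // src_ports)   # ceiling division across parallel ports
    return broadcast_only, flexible
```

In this model, with 16 PEs and 4 ports, full reuse makes both networks equally fast, while zero reuse leaves the broadcast-only network 4× slower, which mirrors the high-bandwidth-versus-high-reuse tension the flexible NoC is designed to resolve.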