Architecture components
Coral NPU is a complete reference design for neural processing units (NPUs) based on the open RISC-V standard. It is composed of three core components:
- A scalar core for traditional CPU functions
- A vector execution unit for SIMD vector and ML computations
- A matrix execution unit for accelerating matrix operations in ML models

Together, these three components provide the complete feature set and performance required for an NPU built on Coral NPU. Any chip vendor designing a discrete NPU or integrating Coral NPU into a System-on-Chip (SoC) will need to incorporate all three components for a complete solution.
The main features and functions of each component are as follows:
Scalar core:
- Serves as the in-order, non-speculative frontend processor.
- Drives the command queues for the vector and matrix execution units.
- Fully compliant with the open RISC-V 32-bit base ISA standard (RV32I).
- Features 31 general-purpose scalar registers, each 32 bits wide.
- Offers a C-programmable interface for managing loops, control flow, flexible type encodings, and instruction compression for the SIMD/vector backend (sketched in the example after this list).
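The split between the scalar frontend and the vector backend can be pictured in plain C. The sketch below is illustrative only: the command layout, queue depth, and names (`vec_cmd_t`, `cmd_queue_push`, `VEC_OP_ADD_I32`) are assumptions made for this example, not the Coral NPU interface. It models the scalar core's job of running the loop and control flow while pushing work for the SIMD backend onto a FIFO command queue.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical command format for the vector backend. */
typedef struct {
    uint8_t  opcode;  /* operation for the backend to perform */
    uint32_t src_a;   /* source operand addresses */
    uint32_t src_b;
    uint32_t dst;     /* destination address */
    uint32_t len;     /* element count for this command */
} vec_cmd_t;

#define VEC_OP_ADD_I32 0x01  /* assumed opcode value */

/* Software model of the FIFO command queue; in hardware this would be a
 * memory-mapped queue that the vector unit drains asynchronously. */
#define QUEUE_DEPTH 16
static vec_cmd_t queue[QUEUE_DEPTH];
static size_t q_head = 0, q_count = 0;

static int cmd_queue_push(const vec_cmd_t *cmd)
{
    if (q_count == QUEUE_DEPTH)
        return -1;  /* queue full: frontend must wait for the backend */
    queue[(q_head + q_count) % QUEUE_DEPTH] = *cmd;
    q_count++;
    return 0;
}

/* Scalar-core role: ordinary C loop and control flow that chunk a large
 * elementwise add into backend-sized commands. */
static void enqueue_add(uint32_t a, uint32_t b, uint32_t out, uint32_t n)
{
    const uint32_t lanes = 8;  /* 256-bit register / 32-bit elements */
    for (uint32_t i = 0; i < n; i += lanes) {
        vec_cmd_t cmd = {
            .opcode = VEC_OP_ADD_I32,
            .src_a  = a + i * 4,
            .src_b  = b + i * 4,
            .dst    = out + i * 4,
            .len    = (n - i < lanes) ? (n - i) : lanes,
        };
        if (cmd_queue_push(&cmd) != 0)
            break;  /* a real driver would stall or poll here */
    }
}

int main(void)
{
    enqueue_add(0x1000, 0x2000, 0x3000, 20);  /* 20 elements -> 3 commands */
    printf("queued %zu commands\n", q_count);
    return 0;
}
```

Because the queue buffers commands, the scalar core can run ahead of the backend; the two units only synchronize when the queue fills or results are needed.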
Vector execution unit:
- Performs a wide range of vector and machine learning (ML) computations, including array operations, ML activation functions, and reductions.
- Based on a Single Instruction, Multiple Data (SIMD) design (modeled in the sketch after this list).
- Decoupled from the scalar frontend by a FIFO command queue, which buffers vector instructions.
- Equipped with 32 vector registers, each 256 bits wide (capable of holding eight 32-bit integers).
- Natively supports 8-bit, 16-bit, and 32-bit data widths.
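To make the lane structure concrete, here is a small software model of one SIMD instruction over the eight 32-bit lanes of a 256-bit vector register. The `vreg_t` type and `vadd_i32` name are invented for illustration; on real hardware, a single instruction applies the same operation to all lanes at once rather than looping over them.

```c
#include <stdint.h>
#include <stdio.h>

#define LANES 8  /* 256-bit register / 32-bit elements = 8 lanes */

/* Software model of one 256-bit vector register as 8 x 32-bit lanes. */
typedef struct { int32_t lane[LANES]; } vreg_t;

/* One SIMD add: the same operation applied to every lane. The loop here
 * stands in for what the hardware does in parallel. */
static vreg_t vadd_i32(vreg_t a, vreg_t b)
{
    vreg_t r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

int main(void)
{
    vreg_t a = {{1, 2, 3, 4, 5, 6, 7, 8}};
    vreg_t b = {{10, 20, 30, 40, 50, 60, 70, 80}};
    vreg_t c = vadd_i32(a, b);
    for (int i = 0; i < LANES; i++)
        printf("%d ", c.lane[i]);  /* 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}
```

The same register file also supports narrower types: with 8-bit elements, each 256-bit register holds 32 lanes instead of 8.
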
Matrix execution unit:
- Accelerates the multiply-accumulate operations essential for ML workloads such as matrix multiplication (Matmul) and convolutions.
- Features an outer-product engine capable of 256 multiply-accumulate (MAC) operations per cycle (see the sketch after this list).
- Status: Under development and evaluation as part of the RISC-V matrix extension task group.
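The 256-MACs-per-cycle figure is consistent with an outer-product update of two 16-element vectors, since 16 × 16 = 256; the exact vector length and element widths used below are assumptions for illustration, not the confirmed hardware configuration. The sketch models one such update in C and shows why outer products map naturally onto matrix multiplication: a Matmul with shared dimension K is K accumulated outer-product updates.

```c
#include <stdint.h>
#include <stdio.h>

#define N 16  /* assumed vector length: one update = 16 * 16 = 256 MACs */

/* Software model of one outer-product update, acc += a (outer) b. Every
 * (i, j) pair gets one multiply-accumulate, so a single call performs
 * N * N = 256 MACs -- the work the engine would retire in one cycle. */
static void outer_product_mac(int32_t acc[N][N],
                              const int8_t a[N], const int8_t b[N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            acc[i][j] += (int32_t)a[i] * (int32_t)b[j];
}

int main(void)
{
    int32_t acc[N][N] = {{0}};
    int8_t a[N], b[N];
    for (int i = 0; i < N; i++) {
        a[i] = (int8_t)(i + 1);
        b[i] = (int8_t)(2 * i);
    }

    /* One cycle's worth of work; repeating this over the shared dimension
     * of a matrix multiply accumulates the full product. */
    outer_product_mac(acc, a, b);

    printf("acc[3][5] = %d\n", acc[3][5]);  /* (3 + 1) * (2 * 5) = 40 */
    return 0;
}
```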