RVV vector engine micro-architecture

The RVV vector execution engine of Coral NPU is illustrated below.

RVV vector execution engine

Coral's RVS scalar core decodes all incoming instructions and sends vector instructions (with XRF scalar operands and vector CSR values) to the RVV command queue. The vector instructions are decoded into micro-operations (µops) according to vtype.lmul, the nf field in vld/vst instructions, and the encoding (widening, narrowing, whole-register instructions) — the results are then written to the µops queue.

vtype.lmul refers to the Vector Register Group Multiplier (LMUL) field within the Vector Type Register (vtype). This field is a critical part of the RISC-V vector programming model, as it determines how vector registers are grouped and how instructions are processed.

The vtype CSR, which is updated only by vsetvli, vsetivli, and vsetvl instructions, provides the default type used to interpret the contents of the Vector Register File (VRF). Specifically, vtype.lmul determines how multiple vector registers are combined to form a vector register group.

Micro-operations

A micro-operation (μop) is the fundamental unit into which vector instructions are decomposed for processing.

μops are generated during the decode stage of the RVV pipeline. The decode unit reads instructions from the command queue (CQ) and converts the instructions into μops. A single vector instruction may be decoded into multiple μops, particularly when the instruction involves large data sizes or multiple register groups.

An instruction (or the results of its μops) is ready to be retired only if all of its constituent μops are ready in the re-order buffer.

Example: Whole register load

A single instruction to load four whole registers

vl4re32.v v8, (a0)

is decoded into four separate μops: uop1 – uop4, corresponding to registers v8 – v11.

Scalar-to-vector interface

The following figure describes how Coral NPU's scalar core and vector engine work together.

RVV vector execution engine

Coral NPU's RVS scalar core dispatcher sends vld/vst instructions to the LSU instructions queue and to the RVV vector frontend. The LSU decodes the vld/vst instructions into µops.

If the instruction is a vector segment vld/vst, it is decoded into stride/indexed vld/vst µops.

AXI tile link is a combination of the AXI bus protocol and the TileLink protocol, often used in RISC-V ML accelerators for efficient communication between processing cores and accelerators. The AXI protocol is a point-to-point interface for data transfer, while TileLink is a message-based protocol that facilitates memory coherence and high-speed interconnects.