Coral NPU's Zve32x vector execution engine is designed to conform to the RISC-V vector ISA extensions. These SIMD vector instructions are described in the official specification document RISC-V "V" Vector Extension, version 1.0.
Version 1.0 of this specification is considered stable enough to begin developing toolchains, functional simulators, and silicon implementations, including in upstream software projects. This RISC-V specification includes the complete set of frozen vector instructions.
The RISC-V scalar core of Coral NPU handles initial instruction fetch, decode, and dispatch, including the dispatching of vector instructions to the vector execution engine.
Coral NPU's vector execution engine maintains its own SIMD instruction (command) queue, decoding unit, register file with scoreboard, and sub-computation units (ALUs). Other than the TCM memory interface, the vector unit effectively operates independently of the scalar core once instructions have been enqueued. Keeping this queue full is key to Coral NPU's performance.
The SIMD (single instruction, multiple data) instruction set additions allow parallel data processing.
RISC-V vector (RVV) processing features
RVV adds vector processing capabilities to the RISC-V architecture by introducing a set of 32 vector registers that can be dynamically configured for different data types and vector lengths. Processing multiple data elements per instruction yields significant performance gains, which is crucial for applications like machine learning. Key features include an implementation-defined vector register width (VLEN), length-agnostic code support, and a rich set of vector instructions for operations like addition, multiplication, and comparison.
RVV provides the following features and benefits:
- Vector registers: RVV adds 32 vector registers (v0 to v31), each of which is VLEN bits wide. VLEN is an implementation-defined power of two. The full V extension requires at least 128 bits, while the embedded Zve* subsets permit smaller values.
- Configurable vector length: The vector length is adjustable, allowing developers to configure it for specific performance requirements.
- Data types: Registers can be interpreted as multiple elements of different sizes (for example, 8, 16, 32, or 64 bits) and can operate on signed/unsigned integers or floating-point numbers.
- Vector instructions: A rich set of vector instructions is provided, including floating-point operations (vfadd.vv, vfmul.vv) and integer operations (vadd.vv, vmul.vv).
- Length-agnostic code: Software can be written to be vector-length agnostic, meaning the same code can run efficiently on hardware with different vector lengths, promoting code reuse.
- Tail policies: RVV defines two tail policies:
  - Undisturbed: Tail elements (those past the current vector length, vl) are left unmodified.
  - Agnostic: Tail elements may either keep their previous value or be overwritten with all ones.
- Masking: Instruction execution can be masked; only the v0 register can be used as a mask operand.
- Vector chaining: An implementation can forward results from one vector instruction to a dependent instruction before the first has completed, which can further improve throughput.
- Enhanced software development: Support for auto-vectorization in toolchains like LLVM and GCC compilers.
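
The tail and masking behavior listed above can be illustrated with a small Python model. This is only a sketch: the function name `masked_add` and the list-based register representation are hypothetical, masked-off body elements are assumed to be left undisturbed, and the agnostic case shows just one of the outcomes the specification permits.

```python
def masked_add(vd, vs2, vs1, mask, vl, tail_agnostic=False, sew=8):
    """Model a masked vadd.vv on Python lists (hypothetical helper).

    vd holds the previous destination contents, mask plays the role of v0,
    and vl is the current vector length.
    """
    all_ones = (1 << sew) - 1
    out = list(vd)
    for i in range(len(vd)):
        if i >= vl:
            # Tail element: agnostic may write all ones (keeping the old
            # value is also permitted); undisturbed always keeps it.
            if tail_agnostic:
                out[i] = all_ones
        elif mask[i]:
            # Active body element: perform the add, wrapping to SEW bits.
            out[i] = (vs2[i] + vs1[i]) & all_ones
        # Masked-off body elements keep their old value in this model
        # (assumption: mask-undisturbed policy).
    return out


old = [9] * 8
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * 8
m = [1, 0, 1, 0, 1, 0, 1, 0]
undisturbed = masked_add(old, a, b, m, vl=4)
agnostic = masked_add(old, a, b, m, vl=4, tail_agnostic=True)
```

With `vl=4`, only elements 0 and 2 are updated by the add. The four tail elements keep their old value in the undisturbed case and become all ones in this model of the agnostic case.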
The RVV specification defines encoding formats for each kind of vector load and store, including unit-stride, strided, indexed, and segment transfers. Instructions are provided for integer, fixed-point, and floating-point vector arithmetic, covering functions ranging from widening adds to fused multiply-adds and 7-bit reciprocal estimates.
Instructions like vsetvli let software query the maximum effective vector length supported by the implementation, based on the current configuration (SEW and LMUL). vtype.lmul refers to the Vector Register Group Multiplier (LMUL) field within the Vector Type Register (vtype).
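
As a rough illustration of how SEW and LMUL determine the vector length, the following Python sketch models VLMAX = LMUL * VLEN / SEW and a strip-mined loop. The function names and the default VLEN of 128 bits are assumptions for illustration only, and the model uses vl = min(AVL, VLMAX), which is one of the behaviors the specification permits.

```python
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul):
    """Maximum elements per register group: VLMAX = LMUL * VLEN / SEW."""
    return int(Fraction(lmul) * vlen_bits / sew_bits)


def vsetvli_model(avl, vlen_bits, sew_bits, lmul):
    """Hypothetical model of vsetvli: return the new vl for a requested AVL."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))


def strip_mine(n, vlen_bits=128, sew_bits=32, lmul=1):
    """Return the vl chosen on each loop iteration when processing n elements."""
    chunks = []
    remaining = n
    while remaining > 0:
        vl = vsetvli_model(remaining, vlen_bits, sew_bits, lmul)
        chunks.append(vl)
        remaining -= vl
    return chunks
```

For example, `strip_mine(10)` processes ten 32-bit elements in chunks of 4, 4, and 2 under the assumed 128-bit VLEN. The same loop works unchanged for any other VLEN, which is the idea behind length-agnostic code.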
RVV instructions
The definitive reference for RVV instructions is the official specification document RISC-V "V" Vector Extension, version 1.0.
A summary of the vector instructions is given here for quick reference. The instructions are categorized by function and include their common variants, such as vector-vector (.vv), vector-scalar (.vx), and vector-immediate (.vi), as well as pseudo-instructions where noted.
Configuration-setting instructions for CSRs
These instructions set the Vector Length Register (vl) and the Vector Type Register (vtype) to dynamically govern vector configuration:
| Instruction | Description |
|---|---|
| vsetvli | Set vector length and type with immediate type encoding |
| vsetivli | Set vector length and type with immediate vector length and immediate type encoding |
| vsetvl | Set vector length and type from register values |
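
The type encoding these instructions consume can be sketched as follows. The field positions follow the vtype layout in version 1.0 of the specification (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7). The helper name `encode_vtype` and the string keys for LMUL are hypothetical.

```python
# vsew field values for each element width in bits.
# (SEW=64 is not available on Zve32x implementations.)
VSEW = {8: 0b000, 16: 0b001, 32: 0b010, 64: 0b011}

# vlmul field values; "mf2" means LMUL = 1/2, and so on.
VLMUL = {
    "m1": 0b000, "m2": 0b001, "m4": 0b010, "m8": 0b011,
    "mf8": 0b101, "mf4": 0b110, "mf2": 0b111,
}


def encode_vtype(sew, lmul="m1", ta=False, ma=False):
    """Pack SEW, LMUL, tail-agnostic, and mask-agnostic bits into a vtype value."""
    return (int(ma) << 7) | (int(ta) << 6) | (VSEW[sew] << 3) | VLMUL[lmul]


# Corresponds to the assembly operands "e32, m1, ta, ma".
vtype_e32_m1 = encode_vtype(32, "m1", ta=True, ma=True)
```

In this model, `vtype_e32_m1` evaluates to 0xD0.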
Vector stores and loads
Vector memory access operations, including unit-stride, strided, indexed (scatter/gather), segment, mask, and whole-register transfers.
| Category | Base instructions |
|---|---|
| Unit-Stride Loads/Stores | vle8.v, vle16.v, vle32.v, vle64.v / vse8.v, vse16.v, vse32.v, vse64.v |
| Strided Loads/Stores | vlse8.v, vlse16.v, vlse32.v, vlse64.v / vsse8.v, vsse16.v, vsse32.v, vsse64.v |
| Indexed (Unordered) Loads/Stores | vluxei8.v to vluxei64.v (load) / vsuxei8.v to vsuxei64.v (store) |
| Indexed (Ordered) Loads/Stores | vloxei8.v to vloxei64.v (load) / vsoxei8.v to vsoxei64.v (store) |
| Fault-Only-First Loads | vle8ff.v, vle16ff.v, vle32ff.v, vle64ff.v |
| Mask Loads/Stores | vlm.v (load mask), vsm.v (store mask) |
| Unit-Stride Segment Loads/Stores | vlsegXeX.v, vssegXeX.v, vlsegXeXff.v (fault-only-first segment load) |
| Strided Segment Loads/Stores | vlssegXeX.v, vsssegXeX.v |
| Indexed Segment Loads/Stores | vluxsegXeiX.v, vloxsegXeiX.v / vsuxsegXeiX.v, vsoxsegXeiX.v |
| Whole Register Loads/Stores | vlXr.v (load 1, 2, 4, or 8 registers), vsXr.v (store 1, 2, 4, or 8 registers) |
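
The three basic addressing modes differ only in how each element's address is formed. The Python sketch below models that address generation over a flat byte buffer. All function names are hypothetical, memory is assumed to be little-endian, and segment, mask, and fault-only-first behavior is not modeled.

```python
def unit_stride_addrs(base, vl, eew_bits):
    """Unit-stride: consecutive elements, each EEW/8 bytes apart."""
    return [base + i * (eew_bits // 8) for i in range(vl)]


def strided_addrs(base, stride_bytes, vl):
    """Strided: element i is at base + i * stride (stride is in bytes)."""
    return [base + i * stride_bytes for i in range(vl)]


def indexed_addrs(base, byte_offsets, vl):
    """Indexed (gather/scatter): element i is at base + offset[i] (byte offsets)."""
    return [base + byte_offsets[i] for i in range(vl)]


def load_elements(memory, addrs, eew_bits):
    """Read one little-endian unsigned element per address."""
    n = eew_bits // 8
    return [int.from_bytes(memory[a:a + n], "little") for a in addrs]


memory = bytes(range(32))
unit = load_elements(memory, unit_stride_addrs(0, 4, 16), 16)
strided = load_elements(memory, strided_addrs(4, 8, 3), 8)
gathered = load_elements(memory, indexed_addrs(16, [0, 6, 2], 3), 8)
```

Because the test buffer holds the byte values 0 to 31 in order, each 8-bit load simply returns its own address, which makes the strided and gathered access patterns easy to see.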
Vector integer arithmetic instructions
These instructions support single-width, widening, narrowing, and other integer operations.
| Category | Base instructions (.vv, .vx, .vi variants) |
|---|---|
| Add/Subtract | vadd, vsub, vrsub (reverse subtract) |
| Bitwise Logical | vand, vor, vxor |
| Comparison | vmseq, vmsne, vmsltu, vmslt, vmsleu, vmsle, vmsgtu, vmsgt |
| Min/Max | vminu, vmin, vmaxu, vmax |
| Single-Width Multiply | vmul, vmulh (high bits), vmulhu (unsigned high bits), vmulhsu (signed/unsigned high bits) |
| Divide/Remainder | vdivu, vdiv, vremu, vrem |
| Shift | vsll (Shift left logical), vsrl (Shift right logical), vsra (Shift right arithmetic) |
| Multiply-Add/Subtract | vmadd, vnmsub, vmacc, vnmsac |
| Carry/Borrow | vadc, vmadc (with carry out), vsbc, vmsbc (with borrow out) |
| Merge/Move | vmerge, vmv.v.v, vmv.v.x, vmv.v.i |
| Extension | vzext.vf2, vsext.vf2, vzext.vf4, vsext.vf4, vzext.vf8, vsext.vf8 |
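
The carry instructions are typically paired to build multi-word arithmetic: vmadc produces a carry-out mask, and vadc consumes a carry-in mask. The sketch below models that pairing on Python lists of unsigned SEW-bit values. The helper names, including `add_two_limb`, are hypothetical.

```python
def vadc(a, b, carry_in, sew=8):
    """Add with carry-in, keeping only the low SEW bits of each sum."""
    mask = (1 << sew) - 1
    return [(x + y + c) & mask for x, y, c in zip(a, b, carry_in)]


def vmadc(a, b, carry_in, sew=8):
    """Return the carry-out (0 or 1) of each element's add."""
    return [(x + y + c) >> sew for x, y, c in zip(a, b, carry_in)]


def add_two_limb(a_lo, a_hi, b_lo, b_hi, sew=8):
    """Add values stored as two SEW-bit limbs per element."""
    zero = [0] * len(a_lo)
    carry = vmadc(a_lo, b_lo, zero, sew)  # carry out of the low limbs
    lo = vadc(a_lo, b_lo, zero, sew)
    hi = vadc(a_hi, b_hi, carry, sew)     # carry propagates into the high limbs
    return lo, hi


# 0x01FF + 0x0001 = 0x0200, using 8-bit limbs.
lo, hi = add_two_limb([255], [1], [1], [0])
```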
Vector widening integer instructions
These instructions produce a result that is twice the width of the input operands.
| Category | Base instructions (.vv, .vx, .wv, .wx variants) |
|---|---|
| Add/Subtract | vwaddu, vwadd, vwsubu, vwsub, vwaddu.w, vwadd.w, vwsubu.w, vwsub.w (double-width source) |
| Multiply | vwmulu, vwmul, vwmulsu |
| Multiply-Add | vwmaccu, vwmacc, vwmaccus, vwmaccsu |
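
To show why the widening forms matter, the sketch below compares a single-width multiply-accumulate with a widening one on signed 8-bit inputs. It is a Python model with hypothetical helper names, not Coral NPU code.

```python
def to_signed(value, bits):
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def vmacc(acc, a, b, sew=8):
    """Single-width multiply-accumulate: the result wraps to SEW bits."""
    return [to_signed(c + x * y, sew) for c, x, y in zip(acc, a, b)]


def vwmacc(acc, a, b, sew=8):
    """Widening multiply-accumulate: the accumulator is 2*SEW bits wide."""
    return [to_signed(c + x * y, 2 * sew) for c, x, y in zip(acc, a, b)]


a = [100, -128, 7]
b = [100, -128, -3]
acc = [0, 0, 1]
narrow = vmacc(acc, a, b)
wide = vwmacc(acc, a, b)
```

The first two products overflow 8 bits, so `narrow` wraps them while `wide` keeps the full 16-bit results. This is why quantized dot products usually accumulate with the widening forms.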
Vector fixed-point arithmetic instructions
These instructions handle fixed-point scaling, rounding, and saturation.
| Category | Base instructions (.vv, .vx, .vi variants) |
|---|---|
| Saturating Add/Subtract | vsaddu, vsadd, vssubu, vssub |
| Averaging Add/Subtract | vaaddu, vaadd, vasubu, vasub |
| Fractional Multiply | vsmul (rounding and saturating fractional multiply) |
| Scaling Shift | vssrl (scaling shift right logical), vssra (scaling shift right arithmetic) |
| Narrowing Clip | vnclipu, vnclip (rounding, scaling, and saturation) |
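
The saturation and rounding behavior can be sketched in Python as follows. The helper names are hypothetical, the values are signed, and the rounding mode is assumed to be round-to-nearest-up (rnu). Other vxrm rounding modes are not modeled.

```python
def sat_range(sew):
    """Signed range representable in SEW bits."""
    return -(1 << (sew - 1)), (1 << (sew - 1)) - 1


def vsadd(a, b, sew=8):
    """Saturating signed add: clamp instead of wrapping on overflow."""
    lo, hi = sat_range(sew)
    return [max(lo, min(hi, x + y)) for x, y in zip(a, b)]


def vaadd_rnu(a, b):
    """Averaging add, (a + b) / 2, rounded to nearest with ties rounded up."""
    return [(x + y + 1) >> 1 for x, y in zip(a, b)]


def vnclip_rnu(wide, shift, sew=8):
    """Shift right with rnu rounding, then saturate to the narrower SEW."""
    lo, hi = sat_range(sew)
    out = []
    for x in wide:
        rounded = (x + (1 << (shift - 1))) >> shift if shift else x
        out.append(max(lo, min(hi, rounded)))
    return out


saturated = vsadd([100, -100, 5], [100, -100, 5])
averaged = vaadd_rnu([1, -3, 6], [2, -4, 6])
clipped = vnclip_rnu([1000, -1000, 37], 2)
```

A common pattern in quantized kernels is to accumulate at a wider width and then use a narrowing clip to scale and saturate the result back down to the storage width.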
Vector floating-point instructions
These instructions operate on IEEE 754-2008 compatible floating-point values.
| Category | Base instructions (.vv, .vf variants) |
|---|---|
| Add/Subtract | vfadd, vfsub, vfrsub (reverse subtract) |
| Multiply/Divide | vfmul, vfdiv, vfrdiv (reverse divide) |
| Fused Multiply-Add | vfmadd, vfnmadd, vfmsub, vfnmsub, vfmacc, vfnmacc, vfmsac, vfnmsac |
| Square Root | vfsqrt.v |
| Estimate | vfrsqrt7.v (reciprocal square-root estimate), vfrec7.v (reciprocal estimate) |
| Min/Max | vfmin, vfmax |
| Sign Injection | vfsgnj, vfsgnjn, vfsgnjx |
| Comparison | vmfeq, vmfne, vmflt, vmfle, vmfgt, vmfge |
| Classification | vfclass.v |
| Move/Merge | vfmerge.vfm, vfmv.v.f |
| Widening FP | vfwadd, vfwsub, vfwmul |
| Widening FMA | vfwmacc, vfwnmacc, vfwmsac, vfwnmsac |
| Widening Convert | vfwcvt.xu.f.v, vfwcvt.x.f.v, vfwcvt.f.xu.v, vfwcvt.f.x.v, vfwcvt.f.f.v, vfwcvt.rtz.xu.f.v, vfwcvt.rtz.x.f.v |
| Narrowing Convert | vfncvt.xu.f.w, vfncvt.x.f.w, vfncvt.f.xu.w, vfncvt.f.x.w, vfncvt.f.f.w, vfncvt.rod.f.f.w, vfncvt.rtz.xu.f.w, vfncvt.rtz.x.f.w |
Vector reduction operations
These instructions take a vector register group and produce a scalar result in element 0 of a vector register.
| Category | Base instructions (.vs variants) |
|---|---|
| Integer Reduction | vredsum.vs, vredmaxu.vs, vredmax.vs, vredminu.vs, vredmin.vs, vredand.vs, vredor.vs, vredxor.vs |
| Widening Integer Reduction | vwredsumu.vs, vwredsum.vs |
| Floating-Point Reduction | vfredosum.vs (ordered sum), vfredusum.vs (unordered sum), vfredmax.vs, vfredmin.vs |
| Widening FP Reduction | vfwredosum.vs, vfwredusum.vs |
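
A reduction combines the active elements of the vector source with element 0 of a second source, which acts as the initial value. A minimal Python model, with hypothetical function names and unmasked operation assumed:

```python
def vredsum(vs2, vs1_elem0, vl, sew=32):
    """Sum the first vl elements of vs2 plus vs1[0], wrapping to SEW bits."""
    total = vs1_elem0 + sum(vs2[:vl])
    return total & ((1 << sew) - 1)


def vredmax(vs2, vs1_elem0, vl):
    """Signed maximum of the first vl elements of vs2 and vs1[0]."""
    return max([vs1_elem0] + vs2[:vl])


total = vredsum([1, 2, 3, 4, 99], vs1_elem0=10, vl=4)
wrapped = vredsum([200, 100], vs1_elem0=0, vl=2, sew=8)
```

Note that element 4 is past `vl` in the first call and does not contribute, and that the 8-bit sum wraps. The widening reductions avoid such wrapping by accumulating at twice the element width.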
Vector mask instructions
These instructions operate explicitly on mask registers.
| Category | Base instructions (.mm variants) |
|---|---|
| Mask Logical | vmand, vmnand, vmandn, vmxor, vmor, vmnor, vmorn, vmxnor |
| Population Count | vcpop.m (count population in mask) |
| Find First | vfirst.m (find first set mask bit) |
| Set Mask | vmsbf.m (set before first mask bit), vmsif.m (set including first mask bit), vmsof.m (set only first mask bit) |
| Parallel Prefix Sum | viota.m (vector iota – parallel prefix sum of mask values) |
| Element Index | vid.v (vector element index) |
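
The mask instructions are easiest to understand by example. The Python sketch below models several of them on a list of 0/1 mask bits. The function names mirror the mnemonics but are otherwise hypothetical, and all elements are assumed active.

```python
def vcpop(mask):
    """Count the set mask bits."""
    return sum(1 for bit in mask if bit)


def vfirst(mask):
    """Index of the first set mask bit, or -1 if none is set."""
    for i, bit in enumerate(mask):
        if bit:
            return i
    return -1


def vmsbf(mask):
    """Set-before-first: ones strictly before the first set bit."""
    first = vfirst(mask)
    n = len(mask) if first < 0 else first
    return [1 if i < n else 0 for i in range(len(mask))]


def vmsif(mask):
    """Set-including-first: ones up to and including the first set bit."""
    first = vfirst(mask)
    n = len(mask) if first < 0 else first + 1
    return [1 if i < n else 0 for i in range(len(mask))]


def vmsof(mask):
    """Set-only-first: a single one at the first set bit."""
    first = vfirst(mask)
    return [1 if i == first else 0 for i in range(len(mask))]


def viota(mask):
    """Element i receives the count of set mask bits in positions below i."""
    out, count = [], 0
    for bit in mask:
        out.append(count)
        count += 1 if bit else 0
    return out


m = [0, 0, 1, 0, 1, 0]
```

For the mask `m` above, `vfirst(m)` is 2 and `viota(m)` is `[0, 0, 0, 1, 1, 2]`. The iota result gives each selected element its destination slot, which is useful for compaction-style algorithms.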
Vector permutation instructions
These instructions facilitate data movement within and between the vector and scalar registers.
| Category | Base instructions |
|---|---|
| Integer Scalar Move | vmv.x.s (vector element 0 to X register), vmv.s.x (X register to vector element 0) |
| Floating-Point Scalar Move | vfmv.f.s (vector element 0 to F register), vfmv.s.f (F register to vector element 0) |
| Vector Slide | vslideup.vx, vslideup.vi, vslidedown.vx, vslidedown.vi |
| Vector Slide-1 | vslide1up.vx, vfslide1up.vf, vslide1down.vx, vfslide1down.vf |
| Vector Register Gather | vrgather.vv, vrgatherei16.vv, vrgather.vx, vrgather.vi |
| Vector Compress | vcompress.vm |
| Whole Register Move | vmvXr.v (copy X whole vector registers, where X is 1, 2, 4, or 8) |
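
The data movement performed by the slide, gather, and compress instructions can be sketched in Python as follows. The function names are hypothetical, and the model assumes unmasked operation with vl equal to the list length (VLMAX).

```python
def vslideup(vd, vs2, offset):
    """vd[i + offset] = vs2[i]; elements below offset keep vd's old values."""
    return [vd[i] if i < offset else vs2[i - offset] for i in range(len(vd))]


def vslidedown(vs2, offset):
    """vd[i] = vs2[i + offset], or 0 when the source index is past the end."""
    n = len(vs2)
    return [vs2[i + offset] if i + offset < n else 0 for i in range(n)]


def vrgather(vs2, indices):
    """vd[i] = vs2[indices[i]], or 0 when the index is out of range."""
    n = len(vs2)
    return [vs2[j] if 0 <= j < n else 0 for j in indices]


def vcompress(vs2, mask):
    """Pack the elements selected by mask into the lowest positions.

    Only the packed elements are returned; the remaining destination
    elements are treated as tail and are not modeled here.
    """
    return [x for x, m in zip(vs2, mask) if m]


v = [10, 20, 30, 40]
up = vslideup([0, 0, 0, 0], v, 1)
down = vslidedown(v, 1)
gathered = vrgather(v, [3, 3, 0, 9])
packed = vcompress(v, [1, 0, 1, 0])
```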