RVV vector execution engine

Coral NPU's Zve32x vector execution engine is designed to conform to the RISC-V vector ISA extensions. These SIMD vector instructions are described in the official specification document RISC-V "V" Vector Extension, version 1.0.

Version 1.0 of this specification is considered stable enough to begin developing toolchains, functional simulators, and silicon implementations, including in upstream software projects. This RISC-V specification includes the complete set of frozen vector instructions.

The RISC-V scalar core of Coral NPU handles initial instruction fetch, decode, and dispatch, including the dispatching of vector instructions to the vector execution engine.

Coral NPU's vector execution engine maintains its own SIMD instruction (command) queue, decoding unit, register file with scoreboard, and sub-computation units (ALU, multiply, divide, MAC). Other than the TCM memory interface, the vector unit effectively operates independently of the scalar core once instructions have been enqueued. Keeping this queue full is key to Coral NPU's performance.

The SIMD (single instruction, multiple data) instruction set additions allow parallel data processing.

RISC-V vector (RVV) processing features

RVV adds vector processing capabilities to the RISC-V architecture by introducing a set of 32 vector registers that can be dynamically configured for different data types and vector lengths. Processing multiple data elements per instruction yields significant performance gains, which is crucial for applications like machine learning. Key features include an implementation-defined vector register width (VLEN), support for vector-length-agnostic code, and a rich set of vector instructions for operations like addition, multiplication, and comparison.

RVV provides the following features and benefits:

Vector registers: RVV adds 32 vector registers (v0 to v31), each of which is VLEN bits wide. VLEN is an implementation-defined power of two; the full V extension requires at least 128 bits, while the Zve32x subset permits a VLEN as small as 32 bits.
Configurable vector length: The active vector length (vl) is set at run time, for example with vsetvli, so software can choose how many elements each instruction processes.
Data types: Registers can be interpreted as multiple elements of different sizes (for example, 8, 16, 32, or 64 bits) and can operate on signed/unsigned integers or floating-point numbers.
Vector instructions: A rich set of vector instructions is provided, including floating-point operations (vfadd.vv, vfmul.vv) and integer operations (vadd.vv, vmul.vv).
Length-agnostic code: Software can be written to be vector-length agnostic, meaning the same code can run efficiently on hardware with different vector lengths, promoting code reuse.
Tail policies: RVV defines two tail policies:
- Undisturbed: Tail elements (those at or beyond index vl) are left unmodified.
- Agnostic: Tail elements may either keep their previous values or be overwritten with all ones.
Masking: Instruction execution can be masked; only the v0 register can be used as a mask operand.
Vector chaining: An implementation can forward results element by element from one vector instruction to a dependent one, so that dependent instructions overlap in execution.
Enhanced software development: Support for auto-vectorization in toolchains like LLVM and GCC compilers.
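
The masking and tail-policy behavior listed above can be sketched with a small Python reference model. This is an illustrative model only, not Coral NPU code: the function name `masked_vadd` and its parameters are hypothetical, inactive (masked-off) elements are assumed to be left undisturbed, and for the agnostic tail the model picks the all-ones option, which is one of the two outcomes the specification permits.

```python
def masked_vadd(vd, vs2, vs1, mask, vl, sew=8, tail_agnostic=False):
    """Model of a masked vadd.vv over one register's worth of elements.

    vd, vs2, vs1: lists of unsigned element values (length = VLMAX).
    mask: list of 0/1 bits standing in for the v0 mask register.
    vl: number of body elements; indices >= vl form the tail.
    """
    all_ones = (1 << sew) - 1
    out = list(vd)
    for i in range(len(vd)):
        if i >= vl:
            # Tail: undisturbed keeps vd[i]; agnostic may keep it or write
            # all ones. This model chooses all ones (an assumption).
            if tail_agnostic:
                out[i] = all_ones
        elif mask[i]:
            # Active body element: add and wrap modulo 2**sew.
            out[i] = (vs2[i] + vs1[i]) & all_ones
        # Inactive body elements (mask bit 0) keep their old value here.
    return out
```

With `vl=3` and mask `[1, 0, 1, 1]`, element 1 is skipped by the mask and element 3 is a tail element, so only elements 0 and 2 receive sums.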

The specification defines encoding formats for the vector loads and stores, including unit-stride, strided, indexed, and segment transfers. Instructions are provided for integer, fixed-point, and floating-point vector arithmetic, covering functions ranging from widening adds to fused multiply-adds and reciprocal estimates accurate to 7 bits.

Instructions like vsetvli set vl for a requested element count and let software discover the maximum number of elements per instruction (VLMAX = LMUL × VLEN / SEW) for the chosen configuration. SEW is the selected element width and LMUL is the Vector Register Group Multiplier, which is encoded in the vlmul field of the Vector Type Register (vtype).
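
How VLEN, SEW, and LMUL determine the value that vsetvli returns can be sketched in Python as follows. The names `vlmax`, `vsetvli`, and `strip_mine_add` are hypothetical helpers, and the default VLEN of 128 bits is an assumption for illustration, not a statement about Coral NPU. The model uses the simplest legal rule, vl = min(AVL, VLMAX); the specification also allows an implementation to return a smaller vl when AVL lies between VLMAX and 2 × VLMAX.

```python
from fractions import Fraction


def vlmax(vlen, sew, lmul=1):
    """VLMAX = LMUL * VLEN / SEW (LMUL may be fractional, e.g. 0.5)."""
    return int(Fraction(lmul) * vlen // sew)


def vsetvli(avl, vlen, sew, lmul=1):
    """Simplest legal choice of vl for an application vector length avl."""
    return min(avl, vlmax(vlen, sew, lmul))


def strip_mine_add(a, b, vlen=128, sew=32, lmul=1):
    """Vector-length-agnostic elementwise add, processed vl elements at a time."""
    wrap = (1 << sew) - 1
    out = []
    i = 0
    while i < len(a):
        vl = vsetvli(len(a) - i, vlen, sew, lmul)  # elements this iteration
        out.extend((x + y) & wrap for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out
```

The loop never hard-codes the register width, so the same code produces the same result for any VLEN; only the number of iterations changes.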

RVV instructions

The definitive reference for RVV instructions is the official specification document RISC-V "V" Vector Extension, version 1.0.

A summary of the vector instructions is given here for quick reference. The instructions are categorized by function and listed with their common variants, such as vector-vector (.vv), vector-scalar (.vx), and vector-immediate (.vi), and with pseudo-instructions where noted.

Configuration-setting instructions for CSRs

These instructions set the Vector Length Register (vl) and the Vector Type Register (vtype), to dynamically govern vector configuration:

Instruction	Description
`vsetvli`	Set vector length and type with immediate type encoding
`vsetivli`	Set vector length and type with immediate vector length and immediate type encoding
`vsetvl`	Set vector length and type from register values
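
As a rough illustration of the type encoding these instructions carry, the sketch below packs the low eight bits of vtype following the field layout in the RVV 1.0 specification (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7). The helper name `encode_vtype` and the lookup tables are hypothetical and exist only for this example.

```python
# Field encodings from the RVV 1.0 specification.
VSEW = {8: 0b000, 16: 0b001, 32: 0b010, 64: 0b011}
VLMUL = {
    "m1": 0b000, "m2": 0b001, "m4": 0b010, "m8": 0b011,
    "mf8": 0b101, "mf4": 0b110, "mf2": 0b111,  # 0b100 is reserved
}


def encode_vtype(sew, lmul="m1", ta=False, ma=False):
    """Pack vma | vta | vsew | vlmul into the low 8 bits of vtype."""
    return (int(ma) << 7) | (int(ta) << 6) | (VSEW[sew] << 3) | VLMUL[lmul]
```

For example, the assembly operands `e32, m1, ta, ma` correspond to `encode_vtype(32, "m1", ta=True, ma=True)`, which evaluates to `0xD0`.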

Vector stores and loads

These instructions perform vector memory accesses, including unit-stride, strided, indexed (scatter/gather), segment, mask, and whole-register transfers.

Category	Base instructions
Unit-Stride Loads/Stores	vle8.v, vle16.v, vle32.v, vle64.v / vse8.v, vse16.v, vse32.v, vse64.v
Strided Loads/Stores	vlse8.v, vlse16.v, vlse32.v, vlse64.v / vsse8.v, vsse16.v, vsse32.v, vsse64.v
Indexed (Unordered) Loads/Stores	vluxei8.v — vluxei64.v (load) / vsuxei8.v — vsuxei64.v (store)
Indexed (Ordered) Loads/Stores	vloxei8.v — vloxei64.v (load) / vsoxei8.v — vsoxei64.v (store)
Fault-Only-First Loads	vle8ff.v, vle16ff.v, vle32ff.v, vle64ff.v
Mask Loads/Stores	vlm.v (load mask), vsm.v (store mask)
Unit-Stride Segment Loads/Stores	vlsegXeX.v, vssegXeX.v, vlsegXeXff.v (fault-only-first segment load)
Strided Segment Loads/Stores	vlssegXeX.v, vsssegXeX.v
Indexed Segment Loads/Stores	vluxsegXeiX.v, vloxsegXeiX.v / vsuxsegXeiX.v, vsoxsegXeiX.v
Whole Register Loads/Stores	vlXr.v (load 1, 2, 4, or 8 registers), vsXr.v (store 1, 2, 4, or 8 registers)
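
The difference between the unit-stride, strided, and indexed addressing modes can be sketched by computing the byte address of each element. This is a simplified model with hypothetical helper names (`element_addresses`, `vload`); it assumes little-endian memory, treats strides and index values as byte offsets as the specification does, and ignores masking, segments, and faults.

```python
def element_addresses(base, vl, eew_bytes, stride=None, indices=None):
    """Byte addresses touched by a vector load or store of vl elements."""
    if indices is not None:
        # Indexed (gather/scatter): one byte offset per element.
        return [base + indices[i] for i in range(vl)]
    if stride is not None:
        # Strided: constant byte stride between consecutive elements.
        return [base + i * stride for i in range(vl)]
    # Unit-stride: elements are contiguous in memory.
    return [base + i * eew_bytes for i in range(vl)]


def vload(mem, addrs, eew_bytes):
    """Read one little-endian element of eew_bytes from each address."""
    return [int.from_bytes(mem[a:a + eew_bytes], "little") for a in addrs]
```

For 32-bit elements at base 0, a unit-stride access touches addresses 0, 4, 8, 12, while a stride of 8 bytes touches 0, 8, 16, 24.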

Vector integer arithmetic instructions

These instructions support single-width, widening, narrowing, and other integer operations.

Category	Base instructions (.vv, .vx, .vi variants)
Add/Subtract	vadd, vsub, vrsub (reverse subtract)
Bitwise Logical	vand, vor, vxor
Comparison	vmseq, vmsne, vmsltu, vmslt, vmsleu, vmsle, vmsgtu, vmsgt
Min/Max	vminu, vmin, vmaxu, vmax
Single-Width Multiply	vmul, vmulh (high bits), vmulhu (unsigned high bits), vmulhsu (signed/unsigned high bits)
Divide/Remainder	vdivu, vdiv, vremu, vrem
Shift	vsll (Shift left logical), vsrl (Shift right logical), vsra (Shift right arithmetic)
Multiply-Add/Subtract	vmadd, vnmsub, vmacc, vnmsac
Carry/Borrow	vadc, vmadc (with carry out), vsbc, vmsbc (with borrow out)
Merge/Move	vmerge, vmv.v.v, vmv.v.x, vmv.v.i
Extension	vzext.vf2, vsext.vf2, vzext.vf4, vsext.vf4, vzext.vf8, vsext.vf8
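
The multiply variants in the table differ only in which half of the product they return and how the operands are interpreted. A minimal per-element sketch in Python, using hypothetical function names that mirror the mnemonics and operating on raw SEW-bit patterns (SEW = 8 by default):

```python
def to_signed(x, sew):
    """Interpret the low sew bits of x as a two's-complement value."""
    x &= (1 << sew) - 1
    return x - (1 << sew) if x >> (sew - 1) else x


def vmul(a, b, sew=8):
    """Low SEW bits of the product (same for signed and unsigned)."""
    return (a * b) & ((1 << sew) - 1)


def vmulhu(a, b, sew=8):
    """High SEW bits of unsigned * unsigned."""
    m = (1 << sew) - 1
    return ((a & m) * (b & m)) >> sew


def vmulh(a, b, sew=8):
    """High SEW bits of signed * signed."""
    return ((to_signed(a, sew) * to_signed(b, sew)) >> sew) & ((1 << sew) - 1)


def vmulhsu(a, b, sew=8):
    """High SEW bits of signed(a) * unsigned(b), with a taken from vs2."""
    m = (1 << sew) - 1
    return ((to_signed(a, sew) * (b & m)) >> sew) & m
```

The bit pattern 0xFF shows the difference: it is 255 for vmulhu but -1 for vmulh, so multiplying it by 2 gives a high half of 0x01 in the first case and 0xFF in the second.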

Vector widening integer instructions

These instructions produce a result that is twice the width of the input operands.

Category	Base instructions (.vv, .vx, .wv, .wx variants)
Add/Subtract	vwaddu, vwadd, vwsubu, vwsub, vwaddu.w, vwadd.w, vwsubu.w, vwsub.w (double-width source)
Multiply	vwmulu, vwmul, vwmulsu
Multiply-Add	vwmaccu, vwmacc, vwmaccus, vwmaccsu
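
Widening multiply-accumulate is the core of integer dot products: SEW-bit inputs are multiplied and added into a 2×SEW-bit accumulator. The sketch below models the signed form (vwmacc) on Python integers; `wrap_signed` and the list-based `vwmacc` signature are hypothetical and chosen only for this illustration.

```python
def wrap_signed(x, bits):
    """Wrap x to a two's-complement value of the given width."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def vwmacc(vd, vs1, vs2, sew=8):
    """vd[i] = vs1[i] * vs2[i] + vd[i], where vd elements are 2*sew bits wide."""
    return [wrap_signed(d + a * b, 2 * sew) for d, a, b in zip(vd, vs1, vs2)]
```

With 8-bit inputs, a single product such as 127 × 127 = 16129 fits in the 16-bit accumulator, but repeated accumulation can still wrap, so real kernels typically bound the number of accumulation steps or widen further.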

Vector fixed-point arithmetic instructions

These instructions handle fixed-point scaling, rounding, and saturation.

Category	Base instructions (.vv, .vx, .vi variants)
Saturating Add/Subtract	vsaddu, vsadd, vssubu, vssub
Averaging Add/Subtract	vaaddu, vaadd, vasubu, vasub
Fractional Multiply	vsmul (rounding and saturating fractional multiply)
Scaling Shift	vssrl (scaling shift right logical), vssra (scaling shift right arithmetic)
Narrowing Clip	vnclipu, vnclip (rounding, scaling, and saturation)
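
The saturating, averaging, and narrowing-clip behavior can be sketched per element as follows. The model assumes signed operands and the round-to-nearest-up (rnu) fixed-point rounding mode; other rounding modes selected through vxrm are not modeled. Function names mirror the mnemonics but are hypothetical Python helpers.

```python
def clip_signed(x, bits):
    """Saturate x to the signed range of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))


def roundoff_rnu(v, d):
    """Shift right by d bits, rounding to nearest with ties rounded up."""
    return v if d == 0 else (v + (1 << (d - 1))) >> d


def vsadd(a, b, sew=8):
    """Saturating signed add."""
    return clip_signed(a + b, sew)


def vaadd(a, b):
    """Averaging signed add: (a + b) / 2 with rnu rounding."""
    return roundoff_rnu(a + b, 1)


def vnclip(wide, shift, sew=8):
    """Narrow a 2*sew-bit value: shift with rounding, then saturate to sew bits."""
    return clip_signed(roundoff_rnu(wide, shift), sew)
```

A narrowing clip is the typical last step when a wide accumulator is scaled back to a narrow element: `vnclip(300, 2)` yields 75, while `vnclip(1000, 2)` saturates at 127.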

Vector floating-point instructions

These instructions operate on IEEE-754/2008 compatible floating-point values.

Category	Base instructions (.vv, .vf variants)
Add/Subtract	vfadd, vfsub, vfrsub (reverse subtract)
Multiply/Divide	vfmul, vfdiv, vfrdiv (reverse divide)
Fused Multiply-Add	vfmadd, vfnmadd, vfmsub, vfnmsub, vfmacc, vfnmacc, vfmsac, vfnmsac
Square Root	vfsqrt.v
Estimate	vfrsqrt7.v (reciprocal square-root estimate), vfrec7.v (reciprocal estimate)
Min/Max	vfmin, vfmax
Sign Injection	vfsgnj, vfsgnjn, vfsgnjx
Comparison	vmfeq, vmfne, vmflt, vmfle, vmfgt, vmfge
Classification	vfclass.v
Move/Merge	vfmerge.vfm, vfmv.v.f
Widening FP	vfwadd, vfwsub, vfwmul
Widening FMA	vfwmacc, vfwnmacc, vfwmsac, vfwnmsac
Widening Convert	vfwcvt.xu.f.v, vfwcvt.x.f.v, vfwcvt.f.xu.v, vfwcvt.f.x.v, vfwcvt.f.f.v, vfwcvt.rtz.xu.f.v, vfwcvt.rtz.x.f.v
Narrowing Convert	vfncvt.xu.f.w, vfncvt.x.f.w, vfncvt.f.xu.w, vfncvt.f.x.w, vfncvt.f.f.w, vfncvt.rod.f.f.w, vfncvt.rtz.xu.f.w, vfncvt.rtz.x.f.w

Vector reduction operations

These instructions take a vector register group and produce a scalar result in element 0 of a vector register.

Category	Base instructions (.vs variants)
Integer Reduction	vredsum.vs, vredmaxu.vs, vredmax.vs, vredminu.vs, vredmin.vs, vredand.vs, vredor.vs, vredxor.vs
Widening Integer Reduction	vwredsumu.vs, vwredsum.vs
Floating-Point Reduction	vfredosum.vs (ordered sum), vfredusum.vs (unordered sum), vfredmax.vs, vfredmin.vs
Widening FP Reduction	vfwredosum.vs, vfwredusum.vs
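
A reduction combines the scalar held in element 0 of one source register with the active elements of a vector register group. A minimal Python sketch of the integer sum forms, using hypothetical helper names and unsigned element values:

```python
def vredsum(vs2, vs1_elem0, vl, sew=32, mask=None):
    """Result element 0 = vs1[0] + sum of active vs2[0..vl-1], wrapped to sew bits."""
    total = vs1_elem0
    for i in range(vl):
        if mask is None or mask[i]:
            total += vs2[i]
    return total & ((1 << sew) - 1)


def vwredsumu(vs2, vs1_elem0, vl, sew=8):
    """Unsigned widening sum: sew-bit elements accumulate into 2*sew bits."""
    return (vs1_elem0 + sum(vs2[:vl])) & ((1 << (2 * sew)) - 1)
```

The widening form avoids overflow for short vectors: four 8-bit elements of 255 sum to 1020, which fits in the 16-bit result.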

Vector mask instructions

These instructions operate explicitly on mask registers.

Category	Base instructions (.mm variants)
Mask Logical	vmand, vmnand, vmandn, vmxor, vmor, vmnor, vmorn, vmxnor
Population Count	vcpop.m (count population in mask)
Find First	vfirst.m (find first set mask bit)
Set Mask	vmsbf.m (set before first mask bit), vmsif.m (set including first mask bit), vmsof.m (set only first mask bit)
Parallel Prefix Sum	viota.m (vector iota – parallel prefix sum of mask values)
Element Index	vid.v (vector element index)
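
The mask instructions in the table can be modeled on a list of 0/1 mask bits, with index 0 as the first element. The sketch below is unmasked and ignores tail handling; the function names mirror the mnemonics but are hypothetical Python helpers.

```python
def vcpop(m, vl):
    """Number of set mask bits among the first vl elements."""
    return sum(1 for i in range(vl) if m[i])


def vfirst(m, vl):
    """Index of the first set mask bit, or -1 if none is set."""
    for i in range(vl):
        if m[i]:
            return i
    return -1


def vmsbf(m, vl):
    """Set-before-first: 1 for every element before the first set bit."""
    f = vfirst(m, vl)
    return [1 if (f < 0 or i < f) else 0 for i in range(vl)]


def vmsif(m, vl):
    """Set-including-first: like vmsbf, but also sets the first set bit."""
    f = vfirst(m, vl)
    return [1 if (f < 0 or i <= f) else 0 for i in range(vl)]


def vmsof(m, vl):
    """Set-only-first: 1 only at the first set bit."""
    f = vfirst(m, vl)
    return [1 if i == f else 0 for i in range(vl)]


def viota(m, vl):
    """Exclusive prefix sum: element i counts set bits at indices below i."""
    out, count = [], 0
    for i in range(vl):
        out.append(count)
        count += 1 if m[i] else 0
    return out
```

For the mask `[0, 0, 1, 0, 1]`, vfirst returns 2, vmsbf gives `[1, 1, 0, 0, 0]`, and viota gives `[0, 0, 0, 1, 1]`.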

Vector permutation instructions

These instructions facilitate data movement within and between the vector and scalar registers.

Category	Base instructions
Integer Scalar Move	vmv.x.s (vector element 0 to X register), vmv.s.x (X register to vector element 0)
Floating-Point Scalar Move	vfmv.f.s (vector element 0 to F register), vfmv.s.f (F register to vector element 0)
Vector Slide	vslideup.vx, vslideup.vi, vslidedown.vx, vslidedown.vi
Vector Slide-1	vslide1up.vx, vfslide1up.vf, vslide1down.vx, vfslide1down.vf
Vector Register Gather	vrgather.vv, vrgatherei16.vv, vrgather.vx, vrgather.vi
Vector Compress	vcompress.vm
Whole Register Move	vmvXr.v (copy X vector registers, where X is 1, 2, 4, or 8)
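
The data movement performed by the slide, gather, and compress instructions can be sketched on Python lists. This is a simplified, unmasked model with hypothetical helper names; VLMAX is taken to be the length of the source list, and destination elements that an instruction does not write keep their previous values.

```python
def vslideup(vd, vs2, offset, vl):
    """vd[i] = vs2[i - offset] for offset <= i < vl; lower elements unchanged."""
    out = list(vd)
    for i in range(offset, vl):
        out[i] = vs2[i - offset]
    return out


def vslidedown(vs2, offset, vl):
    """vd[i] = vs2[i + offset], or 0 when the source index is past VLMAX."""
    vlmax = len(vs2)
    return [vs2[i + offset] if i + offset < vlmax else 0 for i in range(vl)]


def vrgather(vs2, idx, vl):
    """vd[i] = vs2[idx[i]], or 0 when idx[i] is out of range."""
    vlmax = len(vs2)
    return [vs2[idx[i]] if idx[i] < vlmax else 0 for i in range(vl)]


def vcompress(vd, vs2, mask, vl):
    """Pack the elements of vs2 selected by mask into the start of vd."""
    out = list(vd)
    j = 0
    for i in range(vl):
        if mask[i]:
            out[j] = vs2[i]
            j += 1
    return out
```

For example, compressing `[1, 2, 3, 4]` with mask `[0, 1, 0, 1]` places 2 and 4 in the first two destination elements and leaves the rest untouched.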
