RVV vector execution engine

Coral NPU's Zve32x vector execution engine is designed to conform to the RISC-V vector ISA extensions. These SIMD vector instructions are described in the official specification document RISC-V "V" Vector Extension, version 1.0.

Version 1.0 of this specification is considered stable enough to begin developing toolchains, functional simulators, and silicon implementations, including in upstream software projects. This RISC-V specification includes the complete set of frozen vector instructions.

The RISC-V scalar core of Coral NPU handles initial instruction fetch, decode, and dispatch, including the dispatching of vector instructions to the vector execution engine.

Coral NPU's vector execution engine maintains its own SIMD instruction (command) queue, decoding unit, register file with scoreboard, and sub-computation units (ALUs). Other than the TCM memory interface, the vector unit effectively operates independently of the scalar core once instructions have been enqueued. Keeping this queue full is key to Coral NPU's performance.

These SIMD (single instruction, multiple data) instructions apply one operation to many data elements at once.

RISC-V vector (RVV) processing features

RVV adds vector processing capabilities to the RISC-V architecture by introducing a set of 32 vector registers that can be dynamically configured for different data types and vector lengths. Processing multiple data elements per instruction yields significant performance gains, which is crucial for applications like machine learning. Key features include an implementation-defined vector register length (VLEN), length-agnostic code support, and a rich set of vector instructions for operations like addition, multiplication, and comparison.

RVV provides the following features and benefits:

  • Vector registers: RVV adds 32 vector registers (v0 to v31), each of which is VLEN bits wide. VLEN is an implementation-defined power of two: at least 128 bits for the full V extension, and at least 32 bits for the embedded Zve32x subset.
  • Configurable vector length: Software sets the active vector length (vl) at run time with the vsetvl family of instructions, so the number of elements processed per instruction can match the data at hand.
  • Data types: Registers can be interpreted as multiple elements of different sizes (for example, 8, 16, 32, or 64 bits) holding signed or unsigned integers or floating-point numbers. The Zve32x subset is limited to integer elements of 8, 16, and 32 bits.
  • Vector instructions: A rich set of vector instructions is provided, including integer operations (vadd.vv, vmul.vv) and floating-point operations (vfadd.vv, vfmul.vv).
  • Length-agnostic code: Software can be written to be vector-length agnostic, meaning the same code can run efficiently on hardware with different vector lengths, promoting code reuse.
  • Tail policies: RVV defines two tail policies:
    • Undisturbed: Tail elements (those at or past the active vector length vl) keep their previous values.
    • Agnostic: Tail elements may either keep their previous values or be overwritten with all ones.
  • Masking: Instruction execution can be masked per element; the mask that controls a masked instruction is always read from the v0 register.
  • Vector chaining: Implementations may chain dependent vector instructions, letting a consumer start on elements as the producer delivers them instead of waiting for the whole vector.
  • Enhanced software development: Toolchains such as LLVM and GCC support auto-vectorization for RVV.
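
The tail and masking rules in the list above can be sketched with a small scalar model. This is only an illustration in Python, not Coral NPU code: the function name masked_vadd and the list-based register representation are hypothetical, masked-off body elements are modelled as undisturbed, and the agnostic tail is modelled as all ones, which is one of the two outcomes the specification permits.

```python
def masked_vadd(vd, vs2, vs1, v0, vl, sew=8, tail_agnostic=False):
    """Scalar model of `vadd.vv vd, vs2, vs1, v0.t` on lists of unsigned ints."""
    all_ones = (1 << sew) - 1
    out = list(vd)
    for i in range(len(vd)):
        if i < vl:
            # Body element: only written when its mask bit in v0 is set.
            if v0[i]:
                out[i] = (vs2[i] + vs1[i]) & all_ones
        elif tail_agnostic:
            # Tail element under the agnostic policy (all-ones variant).
            out[i] = all_ones
        # Tail element under the undisturbed policy: left as it was.
    return out


vd = [9, 9, 9, 9]
vs2 = [1, 2, 3, 4]
vs1 = [10, 20, 30, 40]
v0 = [1, 0, 1, 1]

undisturbed = masked_vadd(vd, vs2, vs1, v0, vl=3)
agnostic = masked_vadd(vd, vs2, vs1, v0, vl=3, tail_agnostic=True)
print(undisturbed)
print(agnostic)
```

With vl = 3, element 3 is a tail element, so it differs between the two policies, while element 1 is masked off and keeps its old value in both runs.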

RVV specifies vector loads and stores in unit-stride, strided, indexed, and segment forms, each with its own encoding format. Arithmetic instructions are provided for integer, fixed-point, and floating-point vectors, covering functions ranging from widening adds to fused multiply-adds and 7-bit-accurate reciprocal estimates.

Instructions like vsetvli set vl and return its new value, which lets software discover the maximum vector length (VLMAX = LMUL × VLEN / SEW) that the implementation supports for the current element width (SEW) and register grouping (LMUL). LMUL, the vector register group multiplier, is encoded in the vlmul field of the vector type register (vtype).
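
As a rough sketch of how length-agnostic code uses vsetvli, the Python model below computes VLMAX and walks an array in chunks (strip-mining). The helper names are hypothetical, and VLEN = 128 is an illustrative value, not a statement about Coral NPU hardware. The model returns min(AVL, VLMAX), which is one behaviour the specification allows; implementations have some freedom when AVL lies between VLMAX and 2 × VLMAX.

```python
from fractions import Fraction


def vlmax(vlen, sew, lmul=1):
    """VLMAX = LMUL * VLEN / SEW; lmul may be an int, a Fraction, or a string like "1/2"."""
    return int(Fraction(lmul) * vlen / sew)


def vsetvli(avl, vlen, sew, lmul=1):
    """Return the vl granted for a requested application vector length (AVL)."""
    return min(avl, vlmax(vlen, sew, lmul))


def strip_mine(n, vlen, sew, lmul=1):
    """Split n elements into (start, vl) chunks, as a vsetvli loop would."""
    chunks = []
    done = 0
    while done < n:
        vl = vsetvli(n - done, vlen, sew, lmul)
        chunks.append((done, vl))
        done += vl
    return chunks


print(vlmax(128, 32))           # 32-bit elements, LMUL = 1
print(strip_mine(10, 128, 32))  # ten elements processed in three passes
```

The same loop runs unchanged for any VLEN; only the chunk sizes differ, which is the point of length-agnostic code.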

RVV instructions

The definitive reference for RVV instructions is the official specification document RISC-V "V" Vector Extension, version 1.0.

A summary of the vector instructions is given here for quick reference. The instructions are categorized by function and listed with their common variants, such as vector-vector (.vv), vector-scalar (.vx), and vector-immediate (.vi), and with pseudo-instructions where noted.

Configuration-setting instructions for CSRs

These instructions set the Vector Length Register (vl) and the Vector Type Register (vtype), to dynamically govern vector configuration:

Instruction | Description
vsetvli | Set vector length and type with immediate type encoding
vsetivli | Set vector length and type with immediate vector length and immediate type encoding
vsetvl | Set vector length and type from register values
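
The type encoding these instructions write into vtype can be illustrated with a short Python sketch. The function name encode_vtype is hypothetical; the bit positions (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7) follow the RVV 1.0 specification, and the vill bit is not modelled.

```python
# Field encodings from the RVV 1.0 vtype layout.
VSEW = {8: 0b000, 16: 0b001, 32: 0b010, 64: 0b011}
VLMUL = {"1": 0b000, "2": 0b001, "4": 0b010, "8": 0b011,
         "1/8": 0b101, "1/4": 0b110, "1/2": 0b111}


def encode_vtype(sew, lmul="1", ta=False, ma=False):
    """Pack SEW, LMUL, and the tail/mask agnostic flags into the low 8 bits of vtype."""
    return (int(ma) << 7) | (int(ta) << 6) | (VSEW[sew] << 3) | VLMUL[lmul]


# Equivalent of the assembler operands "e32, m1, ta, ma".
print(hex(encode_vtype(32, "1", ta=True, ma=True)))
```

An unsupported SEW or LMUL raises KeyError in this sketch, whereas hardware would set the vill bit instead.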

Vector stores and loads

These instructions access memory, covering unit-stride, strided, indexed (scatter/gather), segment, mask, and whole-register transfers.

Category | Base instructions
Unit-Stride Loads/Stores | vle8.v, vle16.v, vle32.v, vle64.v / vse8.v, vse16.v, vse32.v, vse64.v
Strided Loads/Stores | vlse8.v, vlse16.v, vlse32.v, vlse64.v / vsse8.v, vsse16.v, vsse32.v, vsse64.v
Indexed (Unordered) Loads/Stores | vluxei8.v through vluxei64.v (load) / vsuxei8.v through vsuxei64.v (store)
Indexed (Ordered) Loads/Stores | vloxei8.v through vloxei64.v (load) / vsoxei8.v through vsoxei64.v (store)
Fault-Only-First Loads | vle8ff.v, vle16ff.v, vle32ff.v, vle64ff.v
Mask Loads/Stores | vlm.v (load mask), vsm.v (store mask)
Unit-Stride Segment Loads/Stores | vlsegXeX.v, vssegXeX.v, vlsegXeXff.v (fault-only-first segment load)
Strided Segment Loads/Stores | vlssegXeX.v, vsssegXeX.v
Indexed Segment Loads/Stores | vluxsegXeiX.v, vloxsegXeiX.v, vsuxsegXeiX.v, vsoxsegXeiX.v
Whole Register Loads/Stores | vlXr.v (load 1, 2, 4, or 8 registers), vsXr.v (store 1, 2, 4, or 8 registers)
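
The difference between the unit-stride, strided, and indexed forms comes down to how each element's address is generated. The Python sketch below models that address generation on a flat byte array; the helper names are hypothetical, strides and index offsets are in bytes as in the specification, and little-endian element layout is assumed.

```python
def unit_stride_addrs(base, vl, eew_bytes):
    """vle<eew>.v: consecutive elements, eew_bytes apart."""
    return [base + i * eew_bytes for i in range(vl)]


def strided_addrs(base, stride, vl):
    """vlse<eew>.v: elements separated by a byte stride taken from a scalar register."""
    return [base + i * stride for i in range(vl)]


def indexed_addrs(base, offsets, vl):
    """vluxei<eew>.v / vloxei<eew>.v: per-element byte offsets from an index vector."""
    return [base + offsets[i] for i in range(vl)]


def load(mem, addrs, eew_bytes):
    """Read one little-endian element of eew_bytes at each address."""
    return [int.from_bytes(mem[a:a + eew_bytes], "little") for a in addrs]


mem = bytes(range(32))
print(load(mem, unit_stride_addrs(0, 4, 2), 2))  # four 16-bit elements
print(load(mem, strided_addrs(0, 8, 3), 1))      # every eighth byte
print(load(mem, indexed_addrs(4, [3, 0, 7], 3), 1))  # gather
```

Stores use the same address patterns in the opposite direction; the ordered and unordered indexed forms differ only in whether element accesses must occur in element order.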

Vector integer arithmetic instructions

These instructions support single-width, widening, narrowing, and other integer operations.

Category | Base instructions (.vv, .vx, .vi variants)
Add/Subtract | vadd, vsub, vrsub (reverse subtract)
Bitwise Logical | vand, vor, vxor
Comparison | vmseq, vmsne, vmsltu, vmslt, vmsleu, vmsle, vmsgtu, vmsgt
Min/Max | vminu, vmin, vmaxu, vmax
Single-Width Multiply | vmul, vmulh (high bits), vmulhu (unsigned high bits), vmulhsu (signed/unsigned high bits)
Divide/Remainder | vdivu, vdiv, vremu, vrem
Shift | vsll (shift left logical), vsrl (shift right logical), vsra (shift right arithmetic)
Multiply-Add/Subtract | vmadd, vnmsub, vmacc, vnmsac
Carry/Borrow | vadc, vmadc (with carry out), vsbc, vmsbc (with borrow out)
Merge/Move | vmerge, vmv.v.v, vmv.v.x, vmv.v.i
Extension | vzext.vf2, vsext.vf2, vzext.vf4, vsext.vf4, vzext.vf8, vsext.vf8
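
The single-width multiply variants differ only in which half of the double-width product they keep and how the operands are interpreted. A minimal per-element Python model, with hypothetical helper names and elements passed as raw SEW-bit patterns, might look like this:

```python
def to_signed(x, sew):
    """Interpret the low sew bits of x as a two's-complement value."""
    x &= (1 << sew) - 1
    return x - (1 << sew) if x >> (sew - 1) else x


def vmul(vs2, vs1, sew=8):
    """Low SEW bits of the product (the same for signed and unsigned)."""
    return (vs2 * vs1) & ((1 << sew) - 1)


def vmulh(vs2, vs1, sew=8):
    """High SEW bits of signed * signed."""
    return ((to_signed(vs2, sew) * to_signed(vs1, sew)) >> sew) & ((1 << sew) - 1)


def vmulhu(vs2, vs1, sew=8):
    """High SEW bits of unsigned * unsigned."""
    m = (1 << sew) - 1
    return ((vs2 & m) * (vs1 & m)) >> sew


def vmulhsu(vs2, vs1, sew=8):
    """High SEW bits of signed(vs2) * unsigned(vs1)."""
    m = (1 << sew) - 1
    return ((to_signed(vs2, sew) * (vs1 & m)) >> sew) & m


# 0xFF is -1 when signed and 255 when unsigned.
print(vmul(0xFF, 2), vmulh(0xFF, 2), vmulhu(0xFF, 2), vmulhsu(0xFF, 2))
```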

Vector widening integer instructions

These instructions produce a result that is twice the width of the input operands.

Category | Base instructions (.vv, .vx, .wv, .wx variants)
Add/Subtract | vwaddu, vwadd, vwsubu, vwsub (.vv and .vx: single-width sources; .wv and .wx: double-width first source)
Multiply | vwmulu, vwmul, vwmulsu
Multiply-Add | vwmaccu, vwmacc, vwmaccus, vwmaccsu
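
Widening multiply-accumulate matters for workloads such as int8 dot products, where a single product can already exceed the 8-bit range. The Python sketch below models the signed form per element; the function names are hypothetical, and the accumulator is shown as a raw 2×SEW-bit pattern.

```python
def to_signed(x, sew):
    """Interpret the low sew bits of x as a two's-complement value."""
    x &= (1 << sew) - 1
    return x - (1 << sew) if x >> (sew - 1) else x


def vwmacc(vd, vs1, vs2, vl, sew=8):
    """Model of vwmacc.vv: vd[i] += signed(vs1[i]) * signed(vs2[i]), with vd 2*SEW bits wide."""
    wide = (1 << (2 * sew)) - 1
    out = list(vd)
    for i in range(vl):
        out[i] = (to_signed(vs1[i], sew) * to_signed(vs2[i], sew) + vd[i]) & wide
    return out


acc = vwmacc([0, 0], [100, 0xFD], [100, 5], vl=2)
print(acc)                 # 16-bit accumulators
print((100 * 100) & 0xFF)  # what a single-width 8-bit result would keep
```

The first element holds the full product 10000 in 16 bits, where an 8-bit destination would have kept only the low byte. The second element is -3 × 5 = -15 as a 16-bit two's-complement pattern.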

Vector fixed-point arithmetic instructions

These instructions handle fixed-point scaling, rounding, and saturation.

Category | Base instructions (.vv, .vx, .vi variants)
Saturating Add/Subtract | vsaddu, vsadd, vssubu, vssub
Averaging Add/Subtract | vaaddu, vaadd, vasubu, vasub
Fractional Multiply | vsmul (rounding and saturating fractional multiply)
Scaling Shift | vssrl (scaling shift right logical), vssra (scaling shift right arithmetic)
Narrowing Clip | vnclipu, vnclip (rounding, scaling, and saturation)
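
Saturation and rounding are easiest to see on single elements. The Python sketch below models vsadd and vnclip on ordinary signed integers; the helper names are hypothetical, and only the round-to-nearest-up rounding mode (vxrm = rnu) is modelled, so other vxrm settings would round differently.

```python
def saturate(x, bits):
    """Clamp x to the signed range of a bits-wide element."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))


def vsadd(a, b, sew=8):
    """Signed saturating add: the sum is clamped instead of wrapping."""
    return saturate(a + b, sew)


def roundoff_rnu(v, d):
    """Shift right by d bits with round-to-nearest-up (add the bit just below the cut)."""
    return v if d == 0 else (v + (1 << (d - 1))) >> d


def vnclip(v, shift, sew=8):
    """Narrow a 2*SEW-bit signed value to SEW bits: scale, round, then saturate."""
    shift &= 2 * sew - 1  # only the low log2(2*SEW) bits of the shift amount are used
    return saturate(roundoff_rnu(v, shift), sew)


print(vsadd(100, 100))  # clamps at the int8 maximum
print(vnclip(300, 2))   # 300 / 4, fits in int8
print(vnclip(1000, 2))  # 1000 / 4 does not fit, so it saturates
```

This scale, round, and clamp sequence is the usual way to requantize wide accumulators back to a narrow element type.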

Vector floating-point instructions

These instructions operate on IEEE 754-2008 compatible floating-point values.

Category | Base instructions (.vv, .vf variants)
Add/Subtract | vfadd, vfsub, vfrsub (reverse subtract)
Multiply/Divide | vfmul, vfdiv, vfrdiv (reverse divide)
Fused Multiply-Add | vfmadd, vfnmadd, vfmsub, vfnmsub, vfmacc, vfnmacc, vfmsac, vfnmsac
Square Root | vfsqrt.v
Estimate | vfrsqrt7.v (reciprocal square-root estimate), vfrec7.v (reciprocal estimate)
Min/Max | vfmin, vfmax
Sign Injection | vfsgnj, vfsgnjn, vfsgnjx
Comparison | vmfeq, vmfne, vmflt, vmfle, vmfgt, vmfge
Classification | vfclass.v
Move/Merge | vfmerge.vfm, vfmv.v.f
Widening FP | vfwadd, vfwsub, vfwmul
Widening FMA | vfwmacc, vfwnmacc, vfwmsac, vfwnmsac
Widening Convert | vfwcvt.xu.f.v, vfwcvt.x.f.v, vfwcvt.f.xu.v, vfwcvt.f.x.v, vfwcvt.f.f.v, vfwcvt.rtz.xu.f.v, vfwcvt.rtz.x.f.v
Narrowing Convert | vfncvt.xu.f.w, vfncvt.x.f.w, vfncvt.f.xu.w, vfncvt.f.x.w, vfncvt.f.f.w, vfncvt.rod.f.f.w, vfncvt.rtz.xu.f.w, vfncvt.rtz.x.f.w

Vector reduction operations

These instructions take a vector register group and produce a scalar result in element 0 of a vector register.

Category | Base instructions (.vs variants)
Integer Reduction | vredsum.vs, vredmaxu.vs, vredmax.vs, vredminu.vs, vredmin.vs, vredand.vs, vredor.vs, vredxor.vs
Widening Integer Reduction | vwredsumu.vs, vwredsum.vs
Floating-Point Reduction | vfredosum.vs (ordered sum), vfredusum.vs (unordered sum), vfredmax.vs, vfredmin.vs
Widening FP Reduction | vfwredosum.vs, vfwredusum.vs
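
A reduction combines the active elements of the source vector with a scalar seed taken from element 0 of a second vector register. The Python model below sketches that for vredsum.vs and vredmaxu.vs; the function names are hypothetical, values are unsigned SEW-bit patterns, and the vl = 0 case (where hardware leaves the destination untouched) is not modelled.

```python
def vredsum(vs2, vs1_elem0, vl, sew=32, mask=None):
    """Model of vredsum.vs: vd[0] = vs1[0] + sum of active vs2[0..vl-1], wrapped to SEW bits."""
    total = vs1_elem0
    for i in range(vl):
        if mask is None or mask[i]:
            total += vs2[i]
    return total & ((1 << sew) - 1)


def vredmaxu(vs2, vs1_elem0, vl):
    """Model of vredmaxu.vs: unsigned maximum of vs1[0] and vs2[0..vl-1]."""
    return max([vs1_elem0] + list(vs2[:vl]))


print(vredsum([1, 2, 3, 4], 10, vl=4))
print(vredsum([200, 100], 0, vl=2, sew=8))  # the sum wraps at 8 bits
print(vredmaxu([3, 9, 4], 5, vl=2))
```

The wrap-around in the second call is the reason the widening reductions exist: they accumulate into a 2×SEW-bit result instead.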

Vector mask instructions

These instructions operate explicitly on mask registers.

Category | Base instructions (.mm variants)
Mask Logical | vmand, vmnand, vmandn, vmxor, vmor, vmnor, vmorn, vmxnor
Population Count | vcpop.m (count population in mask)
Find First | vfirst.m (find first set mask bit)
Set Mask | vmsbf.m (set before first mask bit), vmsif.m (set including first mask bit), vmsof.m (set only first mask bit)
Parallel Prefix Sum | viota.m (vector iota: parallel prefix sum of mask values)
Element Index | vid.v (vector element index)
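
The mask-scanning instructions are easiest to compare side by side on one input. The Python sketch below models them on a list of 0/1 mask bits; the function names mirror the mnemonics but are hypothetical helpers, and the optional v0 mask operand is omitted.

```python
def vcpop(m, vl):
    """Number of set mask bits among the first vl elements."""
    return sum(1 for i in range(vl) if m[i])


def vfirst(m, vl):
    """Index of the first set mask bit, or -1 if none is set."""
    for i in range(vl):
        if m[i]:
            return i
    return -1


def vmsbf(m, vl):
    """Set-before-first: ones strictly before the first set bit."""
    f = vfirst(m, vl)
    stop = vl if f < 0 else f
    return [1 if i < stop else 0 for i in range(vl)]


def vmsif(m, vl):
    """Set-including-first: ones up to and including the first set bit."""
    f = vfirst(m, vl)
    stop = vl if f < 0 else f + 1
    return [1 if i < stop else 0 for i in range(vl)]


def vmsof(m, vl):
    """Set-only-first: a single one at the first set bit."""
    f = vfirst(m, vl)
    return [1 if i == f else 0 for i in range(vl)]


def viota(m, vl):
    """Exclusive prefix sum: each element counts the set bits before it."""
    out, count = [], 0
    for i in range(vl):
        out.append(count)
        count += 1 if m[i] else 0
    return out


m = [0, 0, 1, 0, 1, 0]
print(vcpop(m, 6), vfirst(m, 6))
print(vmsbf(m, 6), vmsif(m, 6), vmsof(m, 6))
print(viota(m, 6))
```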

Vector permutation instructions

These instructions facilitate data movement within and between the vector and scalar registers.

Category | Base instructions
Integer Scalar Move | vmv.x.s (vector element 0 to X register), vmv.s.x (X register to vector element 0)
Floating-Point Scalar Move | vfmv.f.s (vector element 0 to F register), vfmv.s.f (F register to vector element 0)
Vector Slide | vslideup.vx, vslideup.vi, vslidedown.vx, vslidedown.vi
Vector Slide-1 | vslide1up.vx, vfslide1up.vf, vslide1down.vx, vfslide1down.vf
Vector Register Gather | vrgather.vv, vrgatherei16.vv, vrgather.vx, vrgather.vi
Vector Compress | vcompress.vm
Whole Register Move | vmvXr.v (copy X = 1, 2, 4, or 8 whole vector registers)
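
The slide, gather, and compress operations can be sketched as list manipulations. In the Python model below the function names are hypothetical, the length of the source list stands in for VLMAX, masking is omitted, and destination elements that an instruction does not write are simply carried over from the old vd.

```python
def vslidedown(vs2, offset, vl):
    """vd[i] = vs2[i + offset]; reads past VLMAX return zero."""
    vlmax = len(vs2)
    return [vs2[i + offset] if i + offset < vlmax else 0 for i in range(vl)]


def vslideup(vd, vs2, offset, vl):
    """vd[i + offset] = vs2[i]; elements below offset keep their old value."""
    out = list(vd)
    for i in range(offset, vl):
        out[i] = vs2[i - offset]
    return out


def vrgather(vs2, index, vl):
    """vd[i] = vs2[index[i]]; out-of-range indices return zero."""
    vlmax = len(vs2)
    return [vs2[index[i]] if index[i] < vlmax else 0 for i in range(vl)]


def vcompress(vd, vs2, mask, vl):
    """Pack the elements of vs2 whose mask bit is set into the low end of vd."""
    out = list(vd)
    k = 0
    for i in range(vl):
        if mask[i]:
            out[k] = vs2[i]
            k += 1
    return out


vs2 = [10, 20, 30, 40]
print(vslidedown(vs2, 1, 4))
print(vslideup([0, 0, 0, 0], vs2, 1, 4))
print(vrgather(vs2, [3, 3, 0, 9], 4))
print(vcompress([0, 0, 0, 0], vs2, [1, 0, 1, 0], 4))
```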