Tutorial: Writing a simple program for Coral NPU

Overview

This page presents a tutorial that introduces the basics of writing a first program for the Coral NPU. You will copy and paste some skeleton code and learn all of the steps required to write a small kernel that adds two vectors. Instructions for building the code are also included. You can then inspect the RISC-V assembly code that is generated.

The next step is to load your code into the Coral NPU simulator, then execute and test it. You will need to write a small amount of Python code to feed data to the simulator, run it, read the data out, and print it back. After that, you can write code to check the results.

In this tutorial you will:

  • Learn the basic structure of a Coral NPU program.
  • Write and compile a basic program.
  • Test your program with a cocotb test bench.

cocotb is a coroutine-based co-simulation test bench environment for verifying VHDL and SystemVerilog RTL using Python. cocotb is free, open source, and hosted on GitHub.

Prerequisites

These instructions assume you are working in the Google Coral NPU repository on GitHub.

Be sure to complete the preliminary steps described in Software prerequisites and system setup.

How to write the program

Open tests/cocotb/tutorial/program.cc, which is a skeleton program:

Sample code

// TODO: Add two inputs buffers of 8 uint32_t's (input1_buffer, input2_buffer)
// TODO: Add one output buffer of 8 uint32_t's (output_buffer)

int main(int argc, char** argv) {
  // TODO: Add code to element wise add/subtract from input1_buffer and
  // input2_buffer and store the result to output_buffer.

  return 0;
}

The typical structure of a Coral NPU program includes:

  • Input buffers, to store the inputs to the computation you want to perform. For this tutorial, we will assume the host core will write data to Coral NPU's DTCM before the program executes.
  • Output buffers, for Coral NPU to store the result of the computation. As with the input buffers, we'll assume that Coral NPU writes the result to a location in its DTCM, which the host processor reads after the program completes.
  • The actual computation to be performed.

Defining input and output buffers

For this tutorial we'll accept two input buffers and emit one output buffer, each consisting of 8 uint32_t values. We define them outside of main. The attribute __attribute__((section(".data"))) specifies that each buffer is stored in the .data section.

Sample code

#include <cstdint>  // Provides uint32_t.

uint32_t input1_buffer[8] __attribute__((section(".data")));
uint32_t input2_buffer[8] __attribute__((section(".data")));
uint32_t output_buffer[8] __attribute__((section(".data")));

int main(int argc, char** argv) {
  // TODO: Add code to element wise add/subtract from input1_buffer and
  // input2_buffer and store the result to output_buffer.

  return 0;
}

For this tutorial, we do not need to define the precise locations of these buffers. Our linker script will allocate them in DTCM and we'll query their locations in our test bench.

Defining computation

As a simple example, let's add input1_buffer and input2_buffer element-wise and store each sum in output_buffer:

Sample code

#include <cstdint>  // Provides uint32_t.

uint32_t input1_buffer[8] __attribute__((section(".data")));
uint32_t input2_buffer[8] __attribute__((section(".data")));
uint32_t output_buffer[8] __attribute__((section(".data")));

int main(int argc, char** argv) {
  // Add the two input buffers element by element.
  for (int i = 0; i < 8; i++) {
    output_buffer[i] = input1_buffer[i] + input2_buffer[i];
  }
  return 0;
}

The core will halt when returning from main.
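
To make the expected behavior concrete, here is a minimal NumPy sketch of the same computation. It is only a host-side model for reasoning about results, not code that runs on Coral NPU. The function name reference_add and the sample values are hypothetical. Note that uint32_t addition in C++ wraps modulo 2^32, and NumPy uint32 arrays behave the same way:

```python
import numpy as np


def reference_add(input1, input2):
    """Host-side model of the kernel loop: element-wise uint32 addition."""
    a = np.asarray(input1, dtype=np.uint32)
    b = np.asarray(input2, dtype=np.uint32)
    # uint32 array addition wraps modulo 2**32, like unsigned arithmetic in C++.
    return a + b


# Arbitrary sample values, chosen only for illustration.
print(reference_add([1, 2, 3], [10, 20, 30]))
# The largest uint32 value plus one wraps around to zero.
print(reference_add([0xFFFFFFFF], [1]))
```

A model like this can later serve as the source of expected values when checking what the simulator returns.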

Compiling the program

Run this build command:
bazel build tests/cocotb/tutorial:coralnpu_v2_program

This generates the coralnpu_v2_program.elf file.

Creating the test bench

Open tests/cocotb/tutorial/tutorial.py, which contains the skeleton test bench:

Sample code

@cocotb.test()
async def core_mini_axi_tutorial(dut):
    """Testbench to run your Coral NPU program"""
    # Test bench setup
    core_mini_axi = CoreMiniAxiInterface(dut)
    await core_mini_axi.init()
    await core_mini_axi.reset()
    cocotb.start_soon(core_mini_axi.clock.start())

First, we need to load your program into ITCM. A load_elf function is provided to copy all loadable sections into memory; it returns the program's entry point. Add the following to core_mini_axi_tutorial:

Sample code

@cocotb.test()
async def core_mini_axi_tutorial(dut):
    """Testbench to run your Coral NPU program"""
    # Test bench setup
    core_mini_axi = CoreMiniAxiInterface(dut)
    await core_mini_axi.init()
    await core_mini_axi.reset()
    cocotb.start_soon(core_mini_axi.clock.start())

+   r = runfiles.Create()
+   elf_path = r.Rlocation(
+       "coralnpu_hw/tests/cocotb/tutorial/coralnpu_v2_program.elf")
+   with open(elf_path, "rb") as f:
+     entry_point = await core_mini_axi.load_elf(f)
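
If you are curious where an entry point address comes from, the sketch below reads it from an ELF32 file header using only the Python standard library. This is not how load_elf is implemented (its internals are not shown in this tutorial). The helper read_entry_point and the header values are hypothetical, and we assume a 32-bit little-endian ELF file:

```python
import struct

# Layout of the 52-byte ELF32 little-endian file header:
# e_ident, e_type, e_machine, e_version, e_entry, e_phoff, e_shoff,
# e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx
ELF32_HEADER = struct.Struct("<16sHHIIIIIHHHHHH")


def read_entry_point(header_bytes):
    """Return e_entry from the first 52 bytes of a 32-bit little-endian ELF."""
    fields = ELF32_HEADER.unpack(header_bytes[:ELF32_HEADER.size])
    ident, e_entry = fields[0], fields[4]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 1 or ident[5] != 1:
        raise ValueError("expected a 32-bit little-endian ELF")
    return e_entry


# Made-up header for illustration only: executable (2), RISC-V machine (243),
# with an arbitrary entry address of 0x100.
ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
fake_header = ELF32_HEADER.pack(
    ident, 2, 243, 1, 0x100, 52, 0, 0, 52, 32, 1, 40, 0, 0)
print(hex(read_entry_point(fake_header)))
```

In the test bench you do not need any of this: load_elf gives you the entry point directly.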

Before we start the program, let's also write inputs into DTCM. We can determine the location of a buffer using lookup_symbol, then write the data to that address with write:

Sample code

@cocotb.test()
async def core_mini_axi_tutorial(dut):
    """Testbench to run your Coral NPU program."""
    # Test bench setup
    core_mini_axi = CoreMiniAxiInterface(dut)
    await core_mini_axi.init()
    await core_mini_axi.reset()
    cocotb.start_soon(core_mini_axi.clock.start())

    r = runfiles.Create()
    elf_path = r.Rlocation(
        "coralnpu_hw/tests/cocotb/tutorial/coralnpu_v2_program.elf")
    with open(elf_path, "rb") as f:
      entry_point = await core_mini_axi.load_elf(f)
+     inputs1_addr = core_mini_axi.lookup_symbol(f, "input1_buffer")
+     inputs2_addr = core_mini_axi.lookup_symbol(f, "input2_buffer")

+   input1_data = np.arange(8, dtype=np.uint32)
+   input2_data = 8994 * np.ones(8, dtype=np.uint32)
+   await core_mini_axi.write(inputs1_addr, input1_data)
+   await core_mini_axi.write(inputs2_addr, input2_data)

Now that input data has been written, let's actually run the program! Use execute_from to start the program on Coral NPU. Once it's running, wait for the core to halt, so we know it's done and we can read the result:

Sample code

@cocotb.test()
async def core_mini_axi_tutorial(dut):
    """Testbench to run your Coral NPU program."""
    # Test bench setup
    core_mini_axi = CoreMiniAxiInterface(dut)
    await core_mini_axi.init()
    await core_mini_axi.reset()
    cocotb.start_soon(core_mini_axi.clock.start())

    r = runfiles.Create()
    elf_path = r.Rlocation(
        "coralnpu_hw/tests/cocotb/tutorial/coralnpu_v2_program.elf")
    with open(elf_path, "rb") as f:
      entry_point = await core_mini_axi.load_elf(f)
      inputs1_addr = core_mini_axi.lookup_symbol(f, "input1_buffer")
      inputs2_addr = core_mini_axi.lookup_symbol(f, "input2_buffer")

    input1_data = np.arange(8, dtype=np.uint32)
    input2_data = 8994 * np.ones(8, dtype=np.uint32)
    await core_mini_axi.write(inputs1_addr, input1_data)
    await core_mini_axi.write(inputs2_addr, input2_data)

+   await core_mini_axi.execute_from(entry_point)
+   await core_mini_axi.wait_for_halted()

Finally, let's read and print the result:

Sample code

@cocotb.test()
async def core_mini_axi_tutorial(dut):
    """Testbench to run your Coral NPU program."""
    # Test bench setup
    core_mini_axi = CoreMiniAxiInterface(dut)
    await core_mini_axi.init()
    await core_mini_axi.reset()
    cocotb.start_soon(core_mini_axi.clock.start())

    r = runfiles.Create()
    elf_path = r.Rlocation(
        "coralnpu_hw/tests/cocotb/tutorial/coralnpu_v2_program.elf")
    with open(elf_path, "rb") as f:
      entry_point = await core_mini_axi.load_elf(f)
      inputs1_addr = core_mini_axi.lookup_symbol(f, "input1_buffer")
      inputs2_addr = core_mini_axi.lookup_symbol(f, "input2_buffer")
+     outputs_addr = core_mini_axi.lookup_symbol(f, "output_buffer")

    input1_data = np.arange(8, dtype=np.uint32)
    input2_data = 8994 * np.ones(8, dtype=np.uint32)
    await core_mini_axi.write(inputs1_addr, input1_data)
    await core_mini_axi.write(inputs2_addr, input2_data)
    await core_mini_axi.execute_from(entry_point)
    await core_mini_axi.wait_for_halted()

+   rdata = (await core_mini_axi.read(outputs_addr, 4 * 8)).view(np.uint32)
+   print(f"I got {rdata}")
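
The read above requests 4 * 8 = 32 bytes, because each of the 8 uint32_t elements occupies 4 bytes. The call to view(np.uint32) then reinterprets those bytes as 32-bit words. The sketch below shows that reinterpretation on its own, assuming the read returns a NumPy array of bytes in little-endian order (RISC-V's default byte order). The sample values are arbitrary:

```python
import numpy as np

# Two little-endian 32-bit words, used as a stand-in for data held in memory.
values = np.array([0x12345678, 1], dtype="<u4")

# View the same memory as raw bytes, the form we assume a byte-wise read returns.
raw = values.view(np.uint8)
print(raw.tolist())  # Least significant byte of each word comes first.

# Reinterpret the bytes as 32-bit words again; no data is copied.
words = raw.view("<u4")
print([hex(w) for w in words.tolist()])
```

Because view only changes how the bytes are interpreted, the byte count must be a multiple of 4 for the conversion to uint32 to succeed.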

Running the test bench

You can run the test bench with this command:
bazel run //tests/cocotb/tutorial:tutorial

You should see the following in the console output:
I got [8994 8995 8996 8997 8998 8999 9000 9001]
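
Printing is fine for a first run, but a test bench is more useful when it checks the result itself. Here is a minimal sketch of such a check. The helper check_result is hypothetical, and rdata below is a stand-in array rather than data read from the simulator. In a cocotb test, a failing assert marks the test as failed:

```python
import numpy as np


def check_result(rdata, input1_data, input2_data):
    """Compare the data read back against the expected element-wise sum."""
    expected = input1_data + input2_data
    assert np.array_equal(rdata, expected), (
        f"expected {expected}, got {rdata}")


# Same input values as the test bench above.
input1_data = np.arange(8, dtype=np.uint32)
input2_data = 8994 * np.ones(8, dtype=np.uint32)

# Stand-in for the array returned by the simulator read.
rdata = np.array([8994, 8995, 8996, 8997, 8998, 8999, 9000, 9001],
                 dtype=np.uint32)

check_result(rdata, input1_data, input2_data)
print("PASS")
```

In the real test bench you would call the check right after reading rdata, in place of or in addition to the print statement.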