HeAR serving API

This document describes the Application Programming Interface (API) for HeAR when deployed as an HTTPS service endpoint, referred to as the service in this document.

Overview

The serving source code for HeAR can be built and hosted on any API management system, but it is designed specifically to take advantage of Vertex AI prediction endpoints. It therefore conforms to Vertex AI's required API signature and implements a predict method.

The service supports micro batching, which is not to be confused with batch jobs. For every audio clip in the request, if processing succeeds, the service returns a one-dimensional embedding vector in the same order as the audio clips in the request. Refer to the sections on the API request, response, and micro batching for details.

You can provide audio clips to the service either directly within the request (inlined) or by providing a reference to their location. Inlining the audio data in the request is not recommended for large-scale production use; see the Inlined audio section for details. When using data storage links, the service expects corresponding OAuth 2.0 bearer tokens so that it can retrieve the data on your behalf. For detailed information on constructing API requests and the different ways to provide audio data, refer to the API request section.

To invoke the service, consult the Request section, compose a valid request JSON, and send a POST request to your endpoint. If you haven't already deployed HeAR as an endpoint, the easiest way to do so is through Model Garden. The following script shows a sample cURL command that you can use to invoke the service. Set LOCATION, PROJECT_ID, and ENDPOINT_ID to target your endpoint:

LOCATION="your endpoint location"
PROJECT_ID="your project ID"
ENDPOINT_ID="your endpoint ID"
REQUEST_JSON="path/to/your/request.json"

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:predict" \
-d "@${REQUEST_JSON}"

Request

An API request can include multiple instances, each conforming to the instance schema. Note that this schema is based on the Vertex AI PredictSchemata standard and is a partial OpenAPI specification. The complete JSON request has the following structure:

{
  "instances": [
    {...},
    {...}
  ]
}

The service accepts audio clips in three ways:

  • Directly within the HTTPS request as input_bytes: You can include audio data as base64-encoded WAV bytes using the input_bytes JSON field; read more about inlined audio.

  • Directly within the HTTPS request as input_array: You can include audio data as an array of 32000 floats using the input_array JSON field; read more about inlined audio.

  • Indirectly via storage links: You can provide links to WAV audio files stored in Google Cloud Storage (GCS) using the gcs_uri JSON field.

Each audio clip must be exactly 2 seconds long, sampled at 16 kHz (crop or pad with zeros as needed). For best results, health sounds should be detected and cropped before being sent to HeAR.

To illustrate these methods, the following example JSON request shows input_bytes, input_array, and gcs_uri all in one request.

Note: In a real-world scenario, you'll typically only use one of these options for all audio clips within a single request. See Micro batching for more details on sending multiple audio clips in one request.

{
  "instances": [
    {
      "input_bytes": "your base64 WAV encoded audio bytes",
    },
    {
      "input_array": "your array of 32000 floats",
    },
    {
      "gcs_uri": "gs://your-bucket/path/to/audio.wav",
      "bearer_token": "your-bearer-token",
    }
  ]
}
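
For the gcs_uri option, the bearer_token must be an OAuth 2.0 access token for credentials that are allowed to read the referenced object. One way to obtain such a token in Python is with the google-auth library; the following is a minimal sketch (the bucket path is a placeholder):

import google.auth
import google.auth.transport.requests

# Obtain an access token for the application default credentials; these
# credentials must be able to read the referenced GCS object.
credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(google.auth.transport.requests.Request())

instance = {
    "gcs_uri": "gs://your-bucket/path/to/audio.wav",
    "bearer_token": credentials.token,
}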

Inlined audio

You can inline the audio clips in the API request as a base64-encoded WAV string in the input_bytes field or as an array of 32000 floats in the input_array field. However, keep in mind that most API management systems enforce a limit on the maximum size of request payloads. When HeAR is hosted as a Vertex AI Prediction endpoint, Vertex AI quotas apply.

To optimize the request size, you should compress the audio data using common audio compression codecs. If you require lossless compression, use WAV encoding.

The following code snippets use these constants, which match the fixed HeAR input requirement of 2-second mono audio sampled at 16 kHz.

SAMPLE_RATE = 16000  # hertz
CLIP_DURATION = 2    # seconds

Here are two ways of loading and preprocessing audio data for HeAR:

1. Using librosa:

This approach uses the librosa library, which handles various audio formats. It loads the audio, resamples it to 16 kHz, and converts it to mono as required by HeAR.

import librosa

# librosa converts to mono and resamples on load, so the returned sampling
# rate is already SAMPLE_RATE (16 kHz).
audio_array, sampling_rate = librosa.load('audio.mp3', sr=SAMPLE_RATE, mono=True)

2. Using scipy:

This approach uses scipy to load audio from a WAV file and perform basic preprocessing: converting to mono and resampling to 16 kHz.

import numpy as np
from scipy.io import wavfile
from scipy import signal

with open('audio.wav', 'rb') as f:
  original_sampling_rate, audio_array = wavfile.read(f)

# Enforce Mono (1d)
if audio_array.ndim > 1:
  audio_mono = np.mean(audio_array, axis=1)
else:
  audio_mono = audio_array

# Resample to SAMPLE_RATE
original_sample_count = audio_mono.shape[0]
new_sample_count = int(round(original_sample_count * (SAMPLE_RATE / original_sampling_rate)))
audio_array = signal.resample(audio_mono, new_sample_count)
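
Note that wavfile.read returns raw integer PCM samples for most WAV files, whereas librosa returns floats in the range [-1, 1]. If you want both loading paths to produce samples on the same scale, you can normalize immediately after reading the file; a minimal sketch:

# Optional: scale integer PCM samples (e.g. int16) to floats in [-1, 1],
# matching the range of librosa's output. Apply right after wavfile.read.
if np.issubdtype(audio_array.dtype, np.integer):
  audio_array = audio_array.astype(np.float32) / np.iinfo(audio_array.dtype).max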

Once you have loaded your audio as a mono float array, you need to crop (or zero-pad) it to exactly 32000 samples.

CLIP_LENGTH = SAMPLE_RATE * CLIP_DURATION

# Assuming you have 'audio_array' loaded using one of the methods above,
# extract the first 2-second clip from the audio array.
start_index = 0
end_index = start_index + CLIP_LENGTH

# audio_clip can be used in the "input_array" field of the API request JSON.
audio_clip = audio_array[start_index:end_index]
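
If the loaded audio is shorter than CLIP_LENGTH, zero-pad the clip instead so that it is exactly 32000 samples long; a minimal sketch:

import numpy as np

# Zero-pad clips shorter than CLIP_LENGTH (no-op for clips that are long enough).
if audio_clip.shape[0] < CLIP_LENGTH:
  audio_clip = np.pad(audio_clip, (0, CLIP_LENGTH - audio_clip.shape[0]))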

To reduce the size of the payload, you can optionally encode the audio_clip as a base64 WAV string, which can be used to construct the API request JSON for HeAR:

import base64
import io
from scipy.io import wavfile

def encode_audio_bytes(audio_data: np.ndarray) -> str:
  with io.BytesIO() as audio_bytes:
    wavfile.write(audio_bytes, SAMPLE_RATE, audio_data)
    return base64.b64encode(audio_bytes.getvalue()).decode('utf-8')

# Optionally encode the audio_clip as base64_encoded_audio which can be used
# in the "input_bytes" field of the API request JSON.
base64_encoded_audio = encode_audio_bytes(audio_clip)
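
You can then assemble the request JSON from the encoded clip and write it to the file referenced by the cURL command in the Overview; a minimal sketch:

import json

# Build a request with a single inlined instance and save it as request.json,
# which can be passed to the cURL command shown in the Overview.
request = {"instances": [{"input_bytes": base64_encoded_audio}]}

with open('request.json', 'w') as f:
  json.dump(request, f)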

Response

An API response can include multiple predictions that correspond to the order of the instances in the request. Each prediction conforms to the prediction schema. Note that this schema is based on the Vertex AI PredictSchemata standard and is a partial OpenAPI specification. The complete JSON response has the following structure:

{
  "predictions": [
    {...},
    {...}
  ],
  "deployedModelId": "model-id",
  "model": "model",
  "modelVersionId": "version-id",
  "modelDisplayName": "model-display-name",
  "metadata": {...}
}

Each request instance can independently succeed or fail. If an instance succeeds, the corresponding prediction JSON includes an embedding field (a one-dimensional vector of length 512); if it fails, it includes an error field instead. Here is an example of a response to a request with two instances, where the first one succeeded and the second one failed:

{
  "predictions": [
    {
      "embedding": [0.1, 0.2, 0.3, 0.4]
    },
    {
      "error": {
        "description": "Some actionable text."
      }
    }
  ],
  "deployedModelId": "model-id",
  "model": "model",
  "modelVersionId": "version-id",
  "modelDisplayName": "model-display-name",
  "metadata": {...}
}
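
Because instances can fail independently, check each prediction for an error field before using its embedding. The following is a minimal sketch for handling a parsed response in Python:

# 'response' is the parsed response JSON (a dict), for example from json.loads().
embeddings = []
for i, prediction in enumerate(response["predictions"]):
  if "error" in prediction:
    print(f"Instance {i} failed: {prediction['error']['description']}")
  else:
    embeddings.append(prediction["embedding"])  # one-dimensional vector of length 512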

Micro batching

The API request supports micro batching. You can request embeddings for multiple audio clips using different instances within the same JSON request:

{
  "instances": [
    {...},
    {...}
  ]
}

Keep in mind that the total number of embeddings you can request in one API call is capped by the service at a fixed limit. A link to the service configuration is coming soon.
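
For example, you can split a longer recording into consecutive 2-second clips and request their embeddings in a single call; a minimal sketch, assuming audio_array and encode_audio_bytes from the Inlined audio section and a clip count below the service limit:

# Split the recording into non-overlapping 2-second clips and build one request.
instances = []
for start in range(0, len(audio_array) - CLIP_LENGTH + 1, CLIP_LENGTH):
  clip = audio_array[start:start + CLIP_LENGTH]
  instances.append({"input_bytes": encode_audio_bytes(clip)})

request = {"instances": instances}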