This document describes the Application Programming Interface (API) for HeAR when deployed as an HTTPS service endpoint, referred to as the service in this document.
Overview
The serving source code for HeAR can be built and hosted on any API management system, but it's specially designed to take advantage of Vertex AI prediction endpoints. Therefore, it conforms to Vertex AI's required API signature and implements a predict method.
The service is designed to support micro batching, not to be confused with batch jobs. For every audio clip in the request, if processing succeeds, the service returns a one-dimensional embedding vector, in the same order as the audio clips appear in the request. Refer to the sections on the API request, response, and micro batching for details.
You can provide audio clips to the service either by inlining them directly within the request or by providing a reference to their location. Inlining the audio data in the request is not recommended for large-scale production; see the Inlined audio section for details. When using data storage links, the service expects corresponding OAuth 2.0 bearer tokens so that it can retrieve the data on your behalf. For detailed information on constructing API requests and the different ways to provide audio data, refer to the Request section.
To invoke the service, consult the Request section, compose a valid request JSON, and send a POST request to your endpoint. If you haven't already deployed HeAR as an endpoint, the easiest way is through Model Garden.
The following script shows a sample cURL command which you can use to invoke the service. Set LOCATION, PROJECT_ID, and ENDPOINT_ID to target your endpoint:
LOCATION="your endpoint location"
PROJECT_ID="your project ID"
ENDPOINT_ID="your endpoint ID"
REQUEST_JSON="path/to/your/request.json"
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:predict" \
-d "@${REQUEST_JSON}"
Request
An API request can include multiple instances, each conforming to the instance schema. Note that this schema is based on the Vertex AI PredictSchemata standard and is a partial OpenAPI specification. The complete JSON request has the following structure:
{
  "instances": [
    {...},
    {...}
  ]
}
The service accepts audio clips in three ways:
- Directly within the HTTPS request as input_bytes: You can include audio data as base64-encoded WAV bytes using the input_bytes JSON field; see the Inlined audio section for details.
- Directly within the HTTPS request as input_array: You can include audio data as an array of 32000 floats using the input_array JSON field; see the Inlined audio section for details.
- Indirectly via storage links: You can provide links to WAV audio files stored in GCS using the gcs_uri JSON field.
The audio clips must be exactly 2 seconds long, sampled at 16kHz (crop or pad with zeros as needed). For best results, health sounds should be detected and cropped before they are sent to HeAR.
To illustrate these methods, the following example JSON request shows input_bytes, input_array, and gcs_uri all in one request.
Note: In a real-world scenario, you'll typically only use one of these options for all audio clips within a single request. See Micro batching for more details on sending multiple audio clips in one request.
{
  "instances": [
    {
      "input_bytes": "your base64 WAV encoded audio bytes"
    },
    {
      "input_array": "your array of 32000 floats"
    },
    {
      "gcs_uri": "gs://your-bucket/path/to/audio.wav",
      "bearer_token": "your-bearer-token"
    }
  ]
}
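If you use the gcs_uri option, the service needs an OAuth 2.0 bearer token with read access to the referenced object. As a rough sketch, assuming the google-auth Python library and Application Default Credentials are available, you could mint a token and build a gcs_uri instance like this:
import google.auth
from google.auth.transport.requests import Request

# Obtain an OAuth 2.0 access token from Application Default Credentials.
# The credentials must have read access to the referenced GCS object.
credentials, _ = google.auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform']
)
credentials.refresh(Request())

# One instance referencing a (hypothetical) audio file in GCS.
instance = {
    'gcs_uri': 'gs://your-bucket/path/to/audio.wav',
    'bearer_token': credentials.token,
}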
Inlined audio
You can inline the audio clips in the API request as a base64-encoded string in the input_bytes JSON field or as an array of 32000 floats in the input_array JSON field. However, keep in mind that most API management systems enforce a limit on the maximum size of request payloads. When HeAR is hosted as a Vertex AI Prediction endpoint, Vertex AI quotas apply.
To optimize the request size, you should compress the audio data using common audio compression codecs. If you require lossless compression, use WAV encoding.
The following code snippets use these constants, which match the fixed HeAR input requirement of 2-second mono audio sampled at 16kHz.
SAMPLE_RATE = 16000 # hertz
CLIP_DURATION = 2 # seconds
Here are two ways of loading and preprocessing audio data for HeAR:
1. Using librosa:
This approach uses the librosa library, which handles various audio formats. It loads the audio, resamples it to 16kHz, and converts it to mono as required by HeAR.
import librosa
# With sr=SAMPLE_RATE, librosa resamples during load, so the returned rate equals SAMPLE_RATE.
audio_array, _ = librosa.load('audio.mp3', sr=SAMPLE_RATE, mono=True)
2. Using a custom function:
This approach uses a custom function to load audio from a WAV file and perform basic preprocessing like resampling to 16kHz and converting to mono.
import numpy as np
from scipy.io import wavfile
from scipy import signal
with open('audio.wav', 'rb') as f:
  original_sampling_rate, audio_array = wavfile.read(f)

# Enforce Mono (1d)
if audio_array.ndim > 1:
  audio_mono = np.mean(audio_array, axis=1)
else:
  audio_mono = audio_array

# Resample to SAMPLE_RATE
original_sample_count = audio_mono.shape[0]
new_sample_count = int(round(original_sample_count * (SAMPLE_RATE / original_sampling_rate)))
audio_array = signal.resample(audio_mono, new_sample_count)
Once you have loaded your audio as a mono float array, you need to crop (or zero-pad) it to exactly 32000 samples.
CLIP_LENGTH = SAMPLE_RATE * CLIP_DURATION
# Assuming you have 'audio_array' loaded using one of the methods above,
# extract the first 2-second clip from the audio array.
start_index = 0
end_index = start_index + CLIP_LENGTH
# audio_clip can be used in the "input_array" field of the API request JSON.
audio_clip = audio_array[start_index:end_index]
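If the loaded audio is shorter than CLIP_LENGTH, the clip can be padded with zeros to reach the required 32000 samples. Here is a minimal sketch of that padding step, assuming the audio_clip array from the snippet above:
# Zero-pad at the end if the source audio was shorter than 2 seconds.
if audio_clip.shape[0] < CLIP_LENGTH:
  audio_clip = np.pad(audio_clip, (0, CLIP_LENGTH - audio_clip.shape[0]))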
To reduce the size of the payload, you can optionally encode the audio_clip as a base64 WAV string, which can be used to construct the API request JSON for HeAR:
import base64
import io
from scipy.io import wavfile
def encode_audio_bytes(audio_data: np.ndarray) -> str:
  with io.BytesIO() as audio_bytes:
    wavfile.write(audio_bytes, SAMPLE_RATE, audio_data)
    return base64.b64encode(audio_bytes.getvalue()).decode('utf-8')
# Optionally encode the audio_clip as base64_encoded_audio which can be used
# in the "input_bytes" field of the API request JSON.
base64_encoded_audio = encode_audio_bytes(audio_clip)
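To tie this together with the cURL command shown earlier, here is a minimal sketch, assuming the base64_encoded_audio value from the previous snippet, of writing a request.json file with one inlined clip:
import json

# Build a request with a single inlined clip; add more instances to micro batch.
request = {'instances': [{'input_bytes': base64_encoded_audio}]}

# Save it as request.json so it can be passed to the cURL command via REQUEST_JSON.
with open('request.json', 'w') as f:
  json.dump(request, f)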
Response
An API response can include multiple predictions that correspond, in order, to the instances in the request. Each prediction conforms to the prediction schema. Note that this schema is based on the Vertex AI PredictSchemata standard and is a partial OpenAPI specification. The complete JSON response has the following structure:
{
  "predictions": [
    {...},
    {...}
  ],
  "deployedModelId": "model-id",
  "model": "model",
  "modelVersionId": "version-id",
  "modelDisplayName": "model-display-name",
  "metadata": {...}
}
Each request instance can independently succeed or fail. When an instance succeeds, the corresponding prediction JSON includes an embedding field (a one-dimensional vector of length 512); when it fails, the prediction includes an error field. Here is an example response to a request with two instances, where the first one succeeded and the second one failed:
{
  "predictions": [
    {
      "embedding": [0.1, 0.2, 0.3, 0.4]
    },
    {
      "error": {
        "description": "Some actionable text."
      }
    }
  ],
  "deployedModelId": "model-id",
  "model": "model",
  "modelVersionId": "version-id",
  "modelDisplayName": "model-display-name",
  "metadata": {...}
}
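A minimal sketch of handling such a response in Python, assuming response holds the parsed JSON body of the predict call, checks each prediction for an error before using the embedding:
# 'response' is assumed to be the parsed JSON body returned by the predict call.
embeddings = []
for i, prediction in enumerate(response['predictions']):
  if 'error' in prediction:
    print(f"Instance {i} failed: {prediction['error']['description']}")
    embeddings.append(None)
  else:
    embeddings.append(prediction['embedding'])  # One-dimensional vector of length 512.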
Micro batching
The API request supports micro batching. You can request embeddings for multiple audio clips using different instances within the same JSON request:
{
  "instances": [
    {...},
    {...}
  ]
}
Keep in mind that the total number of embeddings that you can request in one API call is capped by the service at a fixed limit. A link to the service configuration is coming soon.
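If you have more clips than one request can hold, a minimal sketch of splitting them into micro batches, where BATCH_LIMIT is a placeholder for the actual cap in your service configuration, could look like this:
BATCH_LIMIT = 16  # Placeholder; use the limit from your service configuration.

# 'encoded_clips' is assumed to be a list of base64 WAV strings prepared as above.
requests = [
    {'instances': [{'input_bytes': clip} for clip in encoded_clips[i:i + BATCH_LIMIT]]}
    for i in range(0, len(encoded_clips), BATCH_LIMIT)
]
# Send each request in 'requests' to the endpoint as a separate predict call.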