MedASR Model Card

Model documentation: MedASR

Author: Google

Model information

This section describes the MedASR (Medical Automated Speech Recognition) model and how to use it.

Description

MedASR is a speech-to-text model based on the Conformer architecture, pre-trained for medical dictation. MedASR is intended as a starting point for developers and is well-suited for dictation tasks involving medical terminology, such as radiology dictation and transcribing physician-patient conversations. While MedASR has been extensively pre-trained on a corpus of medical audio data, it may exhibit performance variability on terms outside of its pre-training data, such as non-standard medication names, and on the consistent handling of temporal data (dates, times, or durations).

How to use

The following are some example code snippets to help you quickly get started running the model locally. If you want to use the model at scale, we recommend that you create a production version using Model Garden.

First, install the Transformers library. MedASR is supported starting from transformers 5.0.0; you may need to install transformers from GitHub. The following code snippets contain samples from PARROT, a synthetic dataset for radiology reporting. See: Guellec, Bastien Le, et al. "PARROT: An Open Multilingual Radiology Reports Dataset." arXiv preprint arXiv:2507.22939 (2025).

$ uv pip install git+https://github.com/huggingface/transformers.git
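
Once a transformers release that includes MedASR support is available, you can pin the released version instead (assuming, per the note above, that support lands in 5.0.0):

$ uv pip install "transformers>=5.0.0"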

Run model with the pipeline API

from transformers import pipeline
import huggingface_hub

# Download a sample audio file from the MedASR repository.
audio = huggingface_hub.hf_hub_download('google/medasr', 'test_audio.wav')

model_id = "google/medasr"
pipe = pipeline("automatic-speech-recognition", model=model_id)

# chunk_length_s is how many seconds of audio MedASR processes at a time;
# stride_length_s is the overlap in seconds between consecutive chunks.
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)
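
The pipeline returns a dictionary with the transcript under its "text" key.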

Run the model directly

from transformers import AutoModelForCTC, AutoProcessor
import huggingface_hub
import librosa
import torch

model_id = "google/medasr"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id).to(device)

# Download a sample audio file and load it at the 16 kHz sampling rate the model expects.
audio = huggingface_hub.hf_hub_download('google/medasr', 'test_audio.wav')
speech, sample_rate = librosa.load(audio, sr=16000)

inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt", padding=True)
inputs = inputs.to(device)

# A CTC model emits per-frame logits rather than generating tokens autoregressively,
# so greedy decoding is an argmax over frames followed by CTC collapsing in the decoder.
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
decoded_text = processor.batch_decode(predicted_ids)[0]
print(f"result={decoded_text}")

Examples

See the tutorial notebooks that accompany the model for examples of how to use MedASR.

Model architecture overview

The MedASR model is built on the Conformer architecture, a convolution-augmented Transformer designed for speech recognition.

Technical specifications

  • Model type: Automated speech recognition (ASR)

  • Input Modalities: Mono-channel 16 kHz audio, int16 waveform (see the conversion sketch after this list)

  • Output Modality: Text only

  • Number of parameters: 105M

  • Key publication: LAST: Scalable Lattice-Based Speech Modelling in JAX

  • Model created: December 15, 2025

  • Model version: 1.0.0
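
As a quick reference for the input format listed above, here is a minimal sketch, using librosa and numpy, that converts an arbitrary audio file to a mono 16 kHz int16 waveform; input.wav is a placeholder file name:

import librosa
import numpy as np

# Load as mono at 16 kHz; librosa returns float32 samples in [-1.0, 1.0].
speech, _ = librosa.load("input.wav", sr=16000, mono=True)

# Scale into the int16 range described in the input modality above.
waveform_int16 = (speech * np.iinfo(np.int16).max).astype(np.int16)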

Citation

When using this model, cite:
@inproceedings{wu2023last,
title={LAST: Scalable Lattice-Based Speech Modelling in JAX},
author={Wu, Ke and Variani, Ehsan and Bagby, Tom and Riley, Michael},
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2023},
organization={IEEE}
}

Performance and Evaluations

Our evaluations measure the word error rate (WER) of MedASR against held-out medical audio examples. We also evaluate a medical-specific WER, which scores only words that carry medical meaning. These audio samples were transcribed by human experts, though such transcriptions always contain some noise.
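
The exact internal scoring pipeline is not published; the following is a minimal sketch of the medical-WER idea using the jiwer library, where MEDICAL_TERMS and the example transcripts are illustrative placeholders:

import jiwer

MEDICAL_TERMS = {"pneumothorax", "effusion", "metformin"}  # placeholder vocabulary

def medical_wer(reference: str, hypothesis: str) -> float:
    # Align the hypothesis against the reference, then score only the
    # reference words that appear in the medical vocabulary.
    out = jiwer.process_words(reference, hypothesis)
    ref_words = reference.split()
    errors, total = 0, 0
    for chunk in out.alignments[0]:
        for i in range(chunk.ref_start_idx, chunk.ref_end_idx):
            if ref_words[i].lower() in MEDICAL_TERMS:
                total += 1
                if chunk.type != "equal":
                    errors += 1
    return errors / total if total else 0.0

# "effusion" is substituted, "pneumothorax" is correct -> medical WER of 0.5.
print(medical_wer("small left pneumothorax with effusion",
                  "small left pneumothorax without confusion"))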

Key performance metrics

Word error rate of MedASR versus other models*

| Dataset name | Dataset description | MedASR with greedy decoding | MedASR + 6-gram language model | Gemini 2.5 Pro | Gemini 3 Pro | Gemini 2.5 Flash | Whisper v3 Large |
|---|---|---|---|---|---|---|---|
| RAD-DICT | Private radiologist dictation dataset | 7.1% | 5.8% | 14.2% | 12.7% | 26.0% | 25.2% |
| GENERAL-DICT | Private general and internal medicine dataset | 9.8% | 7.9% | 17.9% | 18.6% | 27.8% | 32.4% |
| FM-DICT | Private family medicine dataset | 8.7% | 7.2% | 16.3% | 16.8% | 21.2% | 32.6% |
| Eye Gaze | Dictation of audio from 998 MIMIC cases (multiple speakers) | 7.2% | 5.2% | 5.9% | 5.5% | 9.3% | 12.8% |

*All results in the preceding table use greedy decoding, except the "MedASR + 6-gram language model" column, which uses language-model decoding.

Safety evaluation

Our evaluation methods include structured evaluations and internal red-team testing against relevant safety policies. The model was evaluated across various dimensions to assess safety. Human evaluations were conducted on 100 example outputs to assess potential safety impact, specifically incorrect transcriptions of medication names, dosages, diagnoses, and medical terminology, as well as semantic changes. The results of these evaluations were determined to be acceptable with respect to internal policies for overall safety.

Data card

Dataset overview

Training

The MedASR model is specifically trained on a diverse set of de-identified medical speech data. Its training utilizes approximately 5000 hours of physician dictations across a range of specialties (proprietary dataset 1) and de-identified medical conversations, primarily physician-patient dialogue (proprietary dataset 2). The model is trained on audio segments paired with corresponding transcripts and metadata, with subsets of the conversational data also including extensive annotations for medical named entities such as symptoms, medications, and conditions. MedASR therefore has a strong understanding of vocabulary used in medical contexts.

Evaluation

MedASR has been evaluated using a mix of internal and public datasets, as noted in the Key performance metrics section. We take the argmax of the model's posterior probabilities (greedy decoding) to obtain the model's hypothesis tokens. The hypothesis is compared against the ground-truth transcript using the jiwer library to calculate the word error rate.
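
As a concrete illustration of the metric, a minimal jiwer computation with placeholder transcripts:

import jiwer

reference = "start metformin 500 mg twice daily"
hypothesis = "start metformin 500 milligrams twice daily"

# One substitution out of six reference words, so WER is about 16.7%.
print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")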

Source

The datasets used to train MedASR include a public dataset for pre-training and proprietary datasets that were licensed and incorporated (described in the following section).

Data ownership and documentation

Pre-training used the full LibriHeavy training set. Fine-tuning was conducted on the de-identified, licensed datasets described below.

Private Medical Dict: Google internal dataset consisting of de-identified dictations made by physicians of different specialties, including radiology, internal medicine, family medicine, and other subspecialties, totaling more than 5000 hours of audio. This dataset was split into test sets that constitute RAD-DICT, FM-DICT, and GENERAL-DICT, referenced previously in Performance and Evaluations.

Data citation

Eye Gaze Data for Chest X-rays (evaluation set described previously in Performance and Evaluations) was derived from:

MIMIC-CXR Database v1.0.0 and MIMIC-IV v0.4

De-identification/anonymization:

Google and its partners utilize datasets that have been rigorously anonymized or de-identified to ensure the protection of individual research participants and patient privacy.

Implementation Information

Details about the model internals.

Hardware

Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p and TPUv5e). Training speech-to-text models requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:

  • Performance: TPUs are specifically designed to handle the massive computations involved in training speech-to-text models. They can speed up training considerably compared to CPUs.
  • Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
  • Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
  • Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.

These advantages are aligned with Google's commitments to operate sustainably.

Software

Training was done using JAX and ML Pathways. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is especially suitable for foundation models such as MedASR.

Together, JAX and ML Pathways are used as described in the paper about the Gemini family of models: "the 'single controller' programming model of JAX and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."

Usage and Limitations

The MedASR model has certain limitations that users should be aware of.

Intended Use

MedASR is a speech-to-text model intended to be used as a starting point that enables more efficient development of downstream healthcare applications requiring speech as input. MedASR is intended for developers in the healthcare and life sciences space. Developers are responsible for training, adapting, and making meaningful changes to MedASR to accomplish their specific intended use. The MedASR model can be fine-tuned by developers using their own proprietary data for their specific tasks or solutions.

MedASR is trained on a large corpus of medical audio and text, and enables further development and integration with generative models like MedGemma, where MedASR converts speech to text that can then be used as input for a text-to-text response. Full details of all the tasks MedASR has been evaluated and pre-trained on can be found in this model card.
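
As an illustration of that integration pattern, here is a minimal sketch that chains the two models with the transformers pipeline API; the MedGemma checkpoint name google/medgemma-27b-text-it, the file dictation.wav, and the prompt are assumptions rather than an official recipe:

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="google/medasr")
llm = pipeline("text-generation", model="google/medgemma-27b-text-it")  # assumed checkpoint

# Transcribe the dictation audio, then hand the transcript to the generative model.
transcript = asr("dictation.wav")["text"]
prompt = f"Summarize the following dictation as a structured report:\n{transcript}"
print(llm(prompt, max_new_tokens=256)[0]["generated_text"])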

MedASR is not intended to be used without appropriate validation, adaptation, or meaningful modification by developers for their specific use case. The outputs generated by MedASR may include transcription errors and are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. All outputs from MedASR should be considered preliminary and require independent verification, clinical correlation, and further investigation through established research and development methodologies.

Limitations

  • Training Data
    • English-only: All training data is in English.
    • Speaker diversity: Most training data comes from speakers who were raised in the United States and speak English as a first language. The base model's performance may be lower for other speakers, and fine-tuning may be necessary.
    • Speaker sex/gender: Training data included both men and women but had a higher proportion of men.
    • Audio quality: Training data mostly comes from high-quality microphones. The base model's performance may deteriorate on low-quality audio with background noise, and fine-tuning may be necessary.
    • Specialized medical terminology: Although MedASR has specialized medical audio training, its training may not cover all medications, procedures, or terminology, especially ones that have come into use in the past 10 years.
    • Dates: MedASR has been trained on de-identified data, so its performance on different date formats may be lacking. This can be rectified with further fine-tuning or alternative decoding approaches, such as decoding with or biasing toward an external language model.

Benefits

At the time of release, MedASR is a high-performing open speech-to-text model with specific training for medical applications. Users can update its vocabulary with few-shot fine-tuning or by decoding with external language models.

Based on the benchmark evaluation metrics in this document, MedASR represents a significant leap forward in medical speech-to-text performance relative to other comparably sized open model alternatives.