Deploy NVIDIA Parakeet for Automatic Speech Recognition (ASR) on Azure AI
This example showcases how to deploy NVIDIA Parakeet for Automatic Speech Recognition (ASR) from the Hugging Face Collection in Azure AI Foundry Hub as an Azure ML Managed Online Endpoint, powered by Hugging Face’s Inference container on top of NVIDIA NeMo. It also covers how to run inference with cURL, requests, OpenAI Python SDK, and even how to locally run a Gradio application for audio transcription from both recordings and files.
TL;DR NVIDIA NeMo is a scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech). NVIDIA NeMo Parakeet ASR Models attain strong speech recognition accuracy while being efficient for inference. Azure AI Foundry provides a unified platform for enterprise AI operations, model builders, and application development. Azure Machine Learning is a cloud service for accelerating and managing the machine learning (ML) project lifecycle.
This example will specifically deploy nvidia/parakeet-tdt-0.6b-v2 from the Hugging Face Hub (or see it on AzureML or on Azure AI Foundry) as an Azure ML Managed Online Endpoint on Azure AI Foundry Hub.
nvidia/parakeet-tdt-0.6b-v2 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction.
This XL variant of the FastConformer architecture integrates the TDT decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass. The model achieves an RTFx of 3380 on the HF-Open-ASR leaderboard with a batch size of 128. Note: RTFx Performance may vary depending on dataset audio duration and batch size.
- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song lyrics transcription
For more information, make sure to check their model card on the Hugging Face Hub and the NVIDIA NeMo Documentation.
Note that you can select any Automatic Speech Recognition (ASR) model available on the Hugging Face Hub with the NeMo tag and the “Deploy to AzureML” option enabled, or directly select any of the ASR models available on either the Azure ML or Azure AI Foundry Hub Model Catalog under the “HuggingFace” collection (note that for Azure AI Foundry the Hugging Face Collection will only be available for Hub-based projects). However, only the NVIDIA Parakeet models are powered by NVIDIA NeMo; the rest rely on the Hugging Face Inference Toolkit.
Pre-requisites
To run the following example, you will need to meet the following pre-requisites; alternatively, you can also read more about them in the Azure Machine Learning Tutorial: Create resources you need to get started.
Azure Account
A Microsoft Azure account with an active subscription. If you don’t have a Microsoft Azure account, you can now create one for free, including 200 USD worth of credits to use within the next 30 days after the account creation.
Azure CLI
The Azure CLI (az) installed on the instance that you’re running this example on; see the installation steps and follow the preferred method for your instance. Then log in to your subscription as follows:
az login
More information at Sign in with Azure CLI - Login and Authentication.
Azure CLI extension for Azure ML
Besides the Azure CLI (az), you also need to install the Azure ML CLI extension (az ml), which will be used to create the Azure ML and Azure AI Foundry required resources.
First you will need to list the current extensions and remove any ml-related extension before installing the latest one, i.e., v2:
az extension list
az extension remove --name azure-cli-ml
az extension remove --name ml
Then you can install the az ml v2 extension as follows:
az extension add --name ml
More information at Azure Machine Learning (ML) - Install and setup the CLI (v2).
Azure Resource Group
An Azure Resource Group under which you will create the Azure AI Foundry Hub-based project and the rest of the required resources (note that creating an Azure AI Foundry Hub also creates an Azure ML Workspace, but not the other way around, meaning that the Azure AI Foundry Hub will be listed as an Azure ML workspace while leveraging the Azure AI Foundry capabilities for Gen AI). If you don’t have one, you can create it as follows:
az group create --name huggingface-azure-rg --location eastus
Then, you can ensure that the resource group was created successfully by e.g. listing all the available resource groups that you have access to on your subscription:
az group list --output table
More information at Manage Azure resource groups by using Azure CLI.
You can also create the Azure Resource Group via the Azure Portal, or via the Azure Resource Management Python SDK (which requires installing it in advance with pip install azure-mgmt-resource).
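As a reference, a minimal sketch of that Python SDK path could look like the following (assuming azure-mgmt-resource and azure-identity are installed and you are logged in via az login; the resource group name and location match the CLI example above):

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Sketch: create the same resource group as the `az group create` command above
resource_client = ResourceManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<YOUR_SUBSCRIPTION_ID>",
)
resource_client.resource_groups.create_or_update("huggingface-azure-rg", {"location": "eastus"})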
Azure AI Foundry Hub-based project
An Azure AI Foundry Hub under the aforementioned subscription and resource group. If you don’t have one, you can create it as follows:
az ml workspace create \
--kind hub \
--name huggingface-azure-hub \
--resource-group huggingface-azure-rg \
--location eastus
Note that the main difference from a standard Azure ML Workspace is that the Azure AI Foundry Hub command requires you to specify --kind hub; removing it would create a standard Azure ML Workspace instead, so you wouldn’t benefit from the features that Azure AI Foundry brings. But when you create an Azure AI Foundry Hub, you can still benefit from all the features that Azure ML brings, since the Azure AI Foundry Hub still relies on Azure ML, but not the other way around.
Then, you can ensure that the workspace was created successfully by e.g. listing all the available workspaces that you have access to on your subscription:
az ml workspace list --filtered-kinds hub --query "[].{Name:name, Kind:kind}" --resource-group huggingface-azure-rg --output table
The --filtered-kinds argument has only recently been included as of Azure ML CLI 2.37.0, meaning that you may need to upgrade az ml via az extension update --name ml.
Once the Azure AI Foundry Hub is created, you need to create an Azure AI Foundry Project linked to that Hub. To do so, you first need to obtain the Azure AI Foundry Hub ID of the recently created Hub as follows (replace the resource names with yours):
az ml workspace show \
--name huggingface-azure-hub \
--resource-group huggingface-azure-rg \
--query "id" \
-o tsv
That command will return the ID formatted as /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.MachineLearningServices/workspaces/huggingface-azure-hub, meaning that you can also build it manually yourself with the appropriate replacements. Then run the following command to create the Azure AI Foundry Project for that Hub:
az ml workspace create \
--kind project \
--hub-id $(az ml workspace show --name huggingface-azure-hub --resource-group huggingface-azure-rg --query "id" -o tsv) \
--name huggingface-azure-project \
--resource-group huggingface-azure-rg \
--location eastus
Finally, you can verify that it was correctly created with the following command:
az ml workspace list --filtered-kinds project --query "[].{Name:name, Kind:kind}" --resource-group huggingface-azure-rg --output table
More information at How to create and manage an Azure AI Foundry Hub and at How to create a Hub using the Azure CLI.
You can also create the Azure AI Foundry Hub via the Azure Portal, or via the Azure ML Python SDK, among other options listed in Manage AI Hub Resources.
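As a rough sketch of the Python SDK path (assuming a recent azure-ai-ml release that exposes the Hub entity; the CLI commands above remain the reference), creating the Hub could look like:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Hub
from azure.identity import DefaultAzureCredential

# Sketch: create the Hub with the Azure ML Python SDK instead of the CLI
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<YOUR_SUBSCRIPTION_ID>",
    resource_group_name="huggingface-azure-rg",
)
ml_client.workspaces.begin_create(workspace=Hub(name="huggingface-azure-hub", location="eastus")).result()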
Setup and installation
In this example, the Azure Machine Learning SDK for Python will be used to create the endpoint and the deployment, as well as to invoke the deployed API. Along with it, you will also need to install azure-identity to authenticate with your Azure credentials via Python.
%pip install azure-ai-ml azure-identity --upgrade --quiet
More information at Azure Machine Learning SDK for Python.
Then, for convenience, setting the following environment variables is recommended, as those will be used throughout the example for the Azure ML Client; make sure to update and set those values according to your Microsoft Azure account and resources.
%env LOCATION eastus
%env SUBSCRIPTION_ID <YOUR_SUBSCRIPTION_ID>
%env RESOURCE_GROUP <YOUR_RESOURCE_GROUP>
%env AI_FOUNDRY_HUB_PROJECT <YOUR_AI_FOUNDRY_HUB_PROJECT>
Finally, you also need to define both the endpoint and deployment names, as those will be used throughout the example too:
Note that endpoint names must be globally unique per region i.e., even if you don’t have any endpoint named that way running under your subscription, if the name is reserved by another Azure customer, then you won’t be able to use the same name. Adding a timestamp or a custom identifier is recommended to prevent running into HTTP 400 validation issues when trying to deploy an endpoint with an already locked / reserved name. Also the endpoint name must be between 3 and 32 characters long.
import os
from uuid import uuid4
os.environ["ENDPOINT_NAME"] = f"nvidia-parakeet-{str(uuid4())[:8]}"
os.environ["DEPLOYMENT_NAME"] = f"nvidia-parakeet-{str(uuid4())[:8]}"
!echo $ENDPOINT_NAME
!echo $DEPLOYMENT_NAME
Authenticate to Azure ML
Initially, you need to authenticate into the Azure AI Foundry Hub via Azure ML with the Azure ML Python SDK, which will later be used to deploy nvidia/parakeet-tdt-0.6b-v2 as an Azure ML Managed Online Endpoint in your Azure AI Foundry Hub.
On standard Azure ML deployments you’d need to create the MLClient using the Azure ML Workspace name as the workspace_name, whereas for Azure AI Foundry you need to provide the Azure AI Foundry Hub name as the workspace_name instead, and that will deploy the endpoint under the Azure AI Foundry Hub too.
import os
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
client = MLClient(
credential=DefaultAzureCredential(),
subscription_id=os.getenv("SUBSCRIPTION_ID"),
resource_group_name=os.getenv("RESOURCE_GROUP"),
workspace_name=os.getenv("AI_FOUNDRY_HUB_PROJECT"),
)
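Optionally, as a quick sanity check (a small sketch using the client defined above), you can fetch the Hub through the client to confirm it points at the right workspace before creating any resources:

# Sketch: verify the client resolves the Azure AI Foundry Hub correctly
workspace = client.workspaces.get(os.getenv("AI_FOUNDRY_HUB_PROJECT"))
print(workspace.name, workspace.location)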
Create and Deploy Azure AI Endpoint
Before creating the Managed Online Endpoint, you need to build the model URI, which is formatted as azureml://registries/<REGISTRY_NAME>/models/<MODEL_ID>/labels/latest (even if the URI contains azureml, it’s the same as in Azure AI Foundry, since the model catalog is shared). That means that the REGISTRY_NAME should be set to “HuggingFace”, as you intend to deploy a model from the Hugging Face Collection, and the MODEL_ID won’t be the Hugging Face Hub ID, but rather the Hub ID with both slashes (/) and underscores (_) replaced with hyphens (-), and then lower-cased, as follows:
model_id = "nvidia/parakeet-tdt-0.6b-v2"
model_uri = (
f"azureml://registries/HuggingFace/models/{model_id.replace('/', '-').replace('_', '-').lower()}/labels/latest"
)
model_uri
Note that you will need to verify in advance that the URI is valid, and that the given Hugging Face Hub Model ID exists on Azure, since Hugging Face is publishing those models into their collection, meaning that some models may be available on the Hugging Face Hub but not yet on the Azure Model Catalog (you can request adding a model following the guide Request a model addition).
Alternatively, you can use the following snippet to verify if a model is available on the Azure Model Catalog programmatically:
import requests

response = requests.get(f"https://generate-azureml-urls.azurewebsites.net/api/generate?modelId={model_id}")
if response.status_code != 200:
    print(
        f"[{response.status_code=}] {model_id=} not available on the Hugging Face Collection in Azure ML Model Catalog"
    )
Then you can create the Managed Online Endpoint, specifying its name via the ManagedOnlineEndpoint Python class (note that the name must be unique across the entire region, not only within a single subscription, resource group, or workspace, so it’s good practice to append some sort of unique identifier to it in case the name is already taken).
Also note that by default the ManagedOnlineEndpoint will use the key authentication method, meaning that there will be a primary and a secondary key that should be sent within the Authorization header as a Bearer token; the aml_token authentication method can also be used, read more about it at Authenticate clients for online endpoints.
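For reference, the snippet below is a minimal sketch of how the endpoint would be created with the aml_token authentication method instead; the rest of this example sticks to the default key mode.

from azure.ai.ml.entities import ManagedOnlineEndpoint

# Sketch: endpoint using Azure ML token-based authentication instead of the default keys
endpoint_with_token_auth = ManagedOnlineEndpoint(
    name=os.getenv("ENDPOINT_NAME"),
    auth_mode="aml_token",
)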
The deployment, created via the ManagedOnlineDeployment Python class, will define the actual model deployment that will be exposed via the previously created endpoint. The ManagedOnlineDeployment will expect: the model i.e., the previously created URI azureml://registries/HuggingFace/models/nvidia-parakeet-tdt-0.6b-v2/labels/latest, the endpoint_name, and the instance requirements, namely the instance_type and the instance_count.
Every model in the Hugging Face Collection is powered by an efficient inference backend, and each of those can run on a wide variety of instance types (as listed in Supported Hardware); in this case, an NVIDIA H100 GPU will be used i.e., Standard_NC40ads_H100_v5.
Since some models and inference engines need to run on a GPU-accelerated instance, you may need to request a quota increase for some of the supported instances depending on the model you want to deploy. Also, keep in mind that each model comes with a list of all the supported instances, with the recommended one for each tier being the smallest instance in terms of available VRAM. Read more about quota increase requests for Azure ML at Manage and increase quotas and limits for resources with Azure Machine Learning.
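If you are unsure whether the instance type is visible to your workspace region, a quick check with the Azure ML Python SDK could look like the sketch below; note that the exact fields exposed on each VM size may vary across SDK versions.

# Sketch: list the VM sizes visible to the workspace and look for the H100 instance
for size in client.compute.list_sizes():
    if size.name == "Standard_NC40ads_H100_v5":
        print(f"{size.name} is available (GPUs: {getattr(size, 'gpus', 'unknown')})")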
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
endpoint = ManagedOnlineEndpoint(name=os.getenv("ENDPOINT_NAME"))
deployment = ManagedOnlineDeployment(
name=os.getenv("DEPLOYMENT_NAME"),
endpoint_name=os.getenv("ENDPOINT_NAME"),
model=model_uri,
instance_type="Standard_NC40ads_H100_v5",
instance_count=1,
)
client.begin_create_or_update(endpoint).wait()
In Azure AI Foundry the endpoint will only be listed within the “My assets -> Models + endpoints” tab once the deployment is created, unlike in Azure ML, where the endpoint is shown even if it doesn’t contain any active or in-progress deployments.
client.online_deployments.begin_create_or_update(deployment).wait()
Note that whilst the Azure AI Endpoint creation is relatively fast, the deployment will take longer since it needs to allocate the resources on Azure, so expect it to take ~10-15 minutes, though it could also take longer depending on the instance provisioning and availability.
Once deployed, via either the Azure AI Foundry or the Azure ML Studio, you’ll be able to inspect the endpoint details, the real-time logs, how to consume the endpoint, and even use the monitoring feature (still in preview).
Find more information about it at Azure ML Managed Online Endpoints.
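You can also pull the deployment container logs programmatically with the Azure ML Python SDK, which is handy when debugging a deployment that fails its health checks; for example:

# Fetch the most recent log lines from the deployed container
logs = client.online_deployments.get_logs(
    name=os.getenv("DEPLOYMENT_NAME"),
    endpoint_name=os.getenv("ENDPOINT_NAME"),
    lines=100,
)
print(logs)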
Send requests to the Azure AI Endpoint
Finally, now that the Azure AI Endpoint is deployed, you can send requests to it. In this case, since the task of the model is automatic-speech-recognition and it expects a multi-part request to be sent along with the audio file, the invoke method cannot be used, as it only supports JSON payloads.
This being said, you can still send requests to it programmatically via requests, via the OpenAI SDK for Python, or with cURL, to the /api/v1/audio/transcriptions route, which is the OpenAI-compatible route for the Transcriptions API.
Support for Hugging Face models via the azure-ai-inference Python SDK is still a work in progress, but it will be included soon and set as the recommended inference method, stay tuned!
To send the requests you need both the primary_key and the scoring_uri, which can be retrieved via the Azure ML Python SDK as follows:
api_key = client.online_endpoints.get_keys(os.getenv("ENDPOINT_NAME")).primary_key
api_url = client.online_endpoints.get(os.getenv("ENDPOINT_NAME")).scoring_uri
Additionally, since you will need a sample audio file to run inference over, download one such as the following, which is the audio file showcased within the nvidia/parakeet-tdt-0.6b-v2 model card:
!wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
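Alternatively, if you prefer to stay in Python rather than calling wget, the same sample file can be downloaded with requests (a small sketch):

import requests

# Download the sample audio file referenced in the model card
audio_url = "https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav"
with open("2086-149220-0033.wav", "wb") as f:
    f.write(requests.get(audio_url).content)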
Python requests
As the deployed Azure AI Endpoint for ASR expects a multi-part request, you need to send the files (in this case, the audio file) separately from the data (the request parameters such as the model name or the temperature, among others). To do so, you first need to read the audio file into an io.BytesIO object, then prepare the request with the necessary headers for both the authentication and the azureml-model-deployment header set to point to the actual Azure AI Deployment, and send the HTTP POST with both the file and the data as follows:
from io import BytesIO
import requests
audio_file = BytesIO(open("2086-149220-0033.wav", "rb").read())
audio_file.name = "2086-149220-0033.wav"
response = requests.post(
api_url,
headers={
"Authorization": f"Bearer {api_key}",
"azureml-model-deployment": os.getenv("DEPLOYMENT_NAME"),
},
files={"file": (audio_file.name, audio_file, "audio/wav")},
data={"model": model_id},
)
print(response.json())
# {'text': "Well, I don't wish to see it any more, observed Phebe, turning away her eyes. It is certainly very like the old portrait."}
OpenAI Python SDK
As the exposed scoring URI is an OpenAI-compatible route i.e., /api/v1/audio/transcriptions, you can leverage the OpenAI Python SDK to send requests to the deployed Azure AI Endpoint.
%pip install openai --upgrade --quiet
To use the OpenAI Python SDK with Azure ML Managed Online Endpoints, you need to update the api_url value defined above, since the default scoring_uri comes with the full route, whereas the OpenAI SDK expects the route only up to and including v1, meaning that the /audio/transcriptions suffix should be removed before instantiating the client.
api_url = client.online_endpoints.get(os.getenv("ENDPOINT_NAME")).scoring_uri.replace("/audio/transcriptions", "")
Alternatively, you can also build the API URL manually as follows, since the URIs are globally unique per region, meaning that there will only be one endpoint with a given name within the same region:
api_url = f"https://{os.getenv('ENDPOINT_NAME')}.{os.getenv('LOCATION')}.inference.ml.azure.com/api/v1"
Or just retrieve it from either the Azure AI Foundry or the Azure ML Studio.
Then you can use the OpenAI Python SDK normally, making sure to include the extra azureml-model-deployment header that contains the Azure AI / ML Deployment name.
Via the OpenAI Python SDK it can either be set within each call to audio.transcriptions.create via the extra_headers parameter, or via the default_headers parameter when instantiating the OpenAI client (which is the recommended approach, since the header needs to be present on each request, so setting it just once is preferred).
import os
from openai import OpenAI
openai_client = OpenAI(
base_url=api_url,
api_key=api_key,
default_headers={"azureml-model-deployment": os.getenv("DEPLOYMENT_NAME")},
)
transcription = openai_client.audio.transcriptions.create(
model=model_id,
file=open("2086-149220-0033.wav", "rb"),
response_format="json",
)
print(transcription.text)
# Well, I don't wish to see it any more, observed Phebe, turning away her eyes. It is certainly very like the old portrait.
cURL
Alternatively, you can also just use cURL to send requests to the deployed endpoint, with the api_url and api_key values programmatically retrieved in the OpenAI snippet and now set as environment variables so that cURL can use them, as follows:
os.environ["API_URL"] = api_url
os.environ["API_KEY"] = api_key
!curl -sS $API_URL/audio/transcriptions \
-H "Authorization: Bearer $API_KEY" \
-H "azureml-model-deployment: $DEPLOYMENT_NAME" \
-H "Content-Type: multipart/form-data" \
-F file=@2086-149220-0033.wav \
-F model=nvidia/parakeet-tdt-0.6b-v2
You can also just go to the Azure AI Endpoint in either the Azure AI Foundry under “My assets -> Models + endpoints” or in the Azure ML Studio via “Endpoints”, and retrieve both the scoring URI and the API Key values, as well as the Azure AI / ML Deployment name for the given model, and then send the request as follows after replacing the values:
curl -sS <API_URL> \
-H "Authorization: Bearer <PRIMARY_KEY>" \
-H "azureml-model-deployment: $DEPLOYMENT_NAME" \
-H "Content-Type: multipart/form-data" \
-F [email protected] \
-F model=nvidia/parakeet-tdt-0.6b-v2 | jq
Gradio
Gradio is the fastest way to demo your machine learning model with a friendly web interface so that anyone can use it. You can also leverage the OpenAI Python SDK to build a simple automatic speech recognition i.e., speech-to-text demo that you can use within the Jupyter Notebook cell where you are running it.
Alternatively, you can also host the Gradio demo connected to your Azure ML Managed Online Endpoint as an Azure Container App, as described in Tutorial: Build and deploy from source code to Azure Container Apps. If you’d like us to show you how to do it for Gradio in particular, feel free to open an issue requesting it.
%pip install gradio --upgrade --quiet
import os
from pathlib import Path
import gradio as gr
from openai import OpenAI
openai_client = OpenAI(
base_url=os.getenv("API_URL"),
api_key=os.getenv("API_KEY"),
default_headers={"azureml-model-deployment": os.getenv("DEPLOYMENT_NAME")},
)
def transcribe(audio: Path, temperature: float = 1.0) -> str:
return openai_client.audio.transcriptions.create(
model=model_id,
file=open(audio, "rb"),
temperature=temperature,
response_format="text",
)
demo = gr.Interface(
fn=transcribe,
inputs=[
# https://www.gradio.app/docs/gradio/audio
gr.Audio(type="filepath", streaming=False, label="Upload or Record Audio"),
gr.Slider(0, 1, value=0.0, step=0.1, label="Temperature"),
],
outputs=gr.Textbox(label="Transcribed Text"),
title="NVIDIA Parakeet on Azure AI",
description="Upload or record audio and get the transcribed text using NVIDIA Parakeet on Azure AI via the OpenAI's Transcription API.",
)
demo.launch()
Release resources
Once you are done using the Azure AI Endpoint / Deployment, you can delete the resources as follows, meaning that you will stop paying for the instance on which the model is running, as well as any attached costs.
client.online_endpoints.begin_delete(name=os.getenv("ENDPOINT_NAME")).result()
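Additionally, if you created the resource group solely for this example, you can remove it as well; the sketch below uses the Azure Resource Management Python SDK mentioned in the pre-requisites and deletes the Hub, the Project, and every other resource inside the group, so make sure nothing else lives there.

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Sketch: delete the whole resource group used throughout this example
resource_client = ResourceManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id=os.getenv("SUBSCRIPTION_ID"),
)
resource_client.resource_groups.begin_delete("huggingface-azure-rg").result()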
Conclusion
Throughout this example you learnt how to create and configure your Azure account for Azure ML and Azure AI Foundry, how to then create a Managed Online Endpoint running an open model for Automatic Speech Recognition (ASR) from the Hugging Face Collection in the Azure AI Foundry Hub / Azure ML Model Catalog, how to send inference requests to it afterwards with different alternatives, how to build a simple Gradio interface for audio transcription around it, and finally, how to stop and release the resources.
If you have any doubt, issue or question about this example, feel free to open an issue and we’ll do our best to help!
📍 Find the complete example on GitHub here!