Inference Providers


Hugging Face Inference Providers give developers unified, flexible access to machine learning models through a single interface to multiple serverless inference providers. This new approach extends our previous Serverless Inference API, providing more models, better performance, and improved reliability thanks to our inference partners.
To learn more about the launch of Inference Providers, check out our announcement blog post.
Why use Inference Providers?
Inference Providers offers a fast and simple way to explore thousands of models for a variety of tasks. Whether you’re experimenting with ML capabilities or building a new application, this API gives you instant access to high-performing models across multiple domains:
- Text Generation: Generate and experiment with high-quality responses from large language models, including tool-calling prompts.
- Image and Video Generation: Easily create customized images and videos, using LoRAs to apply your own styles.
- Document Embeddings: Build search and retrieval systems with SOTA embeddings.
- Classical AI Tasks: Ready-to-use models for text classification, image classification, speech recognition, and more.
⚡ Fast and Free to Get Started: Inference Providers comes with a free tier and additional included credits for PRO users and Enterprise Hub organizations.
Key Features
- 🎯 All-in-One API: A single API for text generation, image generation, document embeddings, NER, summarization, image classification, and more.
- 🔀 Multi-Provider Support: Easily run models from top-tier providers like fal, Replicate, Sambanova, Together AI, and others.
- 🚀 Scalable & Reliable: Built for high availability and low-latency performance in production environments.
- 🔧 Developer-Friendly: Simple requests, fast responses, and a consistent developer experience across Python and JavaScript clients.
- 💰 Cost-Effective: No extra markup on provider rates.
Inference Playground
To get started quickly with Chat Completion models, use the Inference Playground to easily test and compare models with your prompts.

Get Started
You can use Inference Providers with your preferred tools, such as Python, JavaScript, or cURL. To simplify integration, we offer both a Python SDK (`huggingface_hub`) and a JavaScript SDK (`huggingface.js`).
In this section, we will demonstrate a simple example using deepseek-ai/DeepSeek-V3-0324, a conversational Large Language Model. For the example, we will use Novita AI as the Inference Provider.
Authentication
Inference Providers requires passing a user token in the request headers. You can generate a token by signing up on the Hugging Face website and going to the settings page. We recommend creating a fine-grained token with the scope `Make calls to Inference Providers`.
For more details about user tokens, check out this guide.
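Rather than hard-coding your token, a common pattern is to export it as an environment variable and read it at runtime. Here is a minimal sketch in Python, assuming you have run `export HF_TOKEN="hf_..."` in your shell (the token value is a placeholder):

import os

# Read the token from the environment instead of embedding it in source code.
# Assumes the shell has: export HF_TOKEN="hf_..." (placeholder value)
HF_TOKEN = os.environ["HF_TOKEN"]

# Standard Bearer header used by the raw-request examples below
headers = {"Authorization": f"Bearer {HF_TOKEN}"}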
cURL
Let’s start with a cURL command highlighting the raw HTTP request. You can adapt this request to be run with the tool of your choice. Note that the raw request targets Novita’s route on the router, so the model field uses Novita’s own model ID (deepseek/deepseek-v3-0324) rather than the Hub ID; the SDK examples below handle this mapping for you.
curl https://router.huggingface.co/novita/v3/openai/chat/completions \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H 'Content-Type: application/json' \
    -d '{
        "messages": [
            {
                "role": "user",
                "content": "How many G in huggingface?"
            }
        ],
        "model": "deepseek/deepseek-v3-0324",
        "stream": false
    }'
Python
In Python, you can use the `requests` library to make raw requests to the API:
import requests

API_URL = "https://router.huggingface.co/novita/v3/openai/chat/completions"
headers = {"Authorization": "Bearer hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}
payload = {
    "messages": [
        {
            "role": "user",
            "content": "How many 'G's in 'huggingface'?"
        }
    ],
    "model": "deepseek/deepseek-v3-0324",
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()["choices"][0]["message"])
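Since the route speaks the OpenAI chat-completion schema, failures come back as HTTP errors. A minimal sketch of defensive handling, reusing the API_URL, headers, and payload defined above:

# Fail fast on HTTP errors (e.g., 401 for an invalid token)
response = requests.post(API_URL, headers=headers, json=payload)
response.raise_for_status()

# Pull out just the generated text from the OpenAI-style response body
data = response.json()
print(data["choices"][0]["message"]["content"])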
For convenience, the Python library `huggingface_hub` provides an `InferenceClient` that handles inference for you. Make sure to install it with `pip install huggingface_hub`.
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="novita",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx",
)

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[
        {
            "role": "user",
            "content": "How many 'G's in 'huggingface'?"
        }
    ],
)

print(completion.choices[0].message)
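The client can also stream the response token by token instead of waiting for the full completion. A minimal sketch, assuming the client created above; each streamed chunk carries an incremental delta of the answer:

# Stream the completion as it is generated
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "How many 'G's in 'huggingface'?"}],
    stream=True,
)

for chunk in stream:
    # delta.content may be None (e.g., on the final chunk), so guard before printing
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="")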
JavaScript
In JavaScript, you can use `fetch` to make raw requests to the API (the `node-fetch` import below is only needed on Node.js versions without a built-in `fetch`):
import fetch from "node-fetch";

const response = await fetch(
    "https://router.huggingface.co/novita/v3/openai/chat/completions",
    {
        method: "POST",
        headers: {
            Authorization: `Bearer hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx`,
            "Content-Type": "application/json",
        },
        body: JSON.stringify({
            model: "deepseek/deepseek-v3-0324",
            messages: [
                {
                    role: "user",
                    content: "How many 'G's in 'huggingface'?",
                },
            ],
        }),
    }
);

console.log(await response.json());
For convenience, the JS library `@huggingface/inference` provides an `InferenceClient` that handles inference for you. You can install it with `npm install @huggingface/inference`.
import { InferenceClient } from "@huggingface/inference";

const client = new InferenceClient("hf_xxxxxxxxxxxxxxxxxxxxxxxx");

const chatCompletion = await client.chatCompletion({
    provider: "novita",
    model: "deepseek-ai/DeepSeek-V3-0324",
    messages: [
        {
            role: "user",
            content: "How many 'G's in 'huggingface'?",
        },
    ],
});

console.log(chatCompletion.choices[0].message);
Next Steps
In this introduction, we’ve covered the basics of Inference Providers. To learn more about this service, check out our guides and API Reference:
- Pricing and Billing: everything you need to know about billing.
- Hub integration: how Inference Providers is integrated with the Hub.
- Register as an Inference Provider: everything about how to become an official partner.
- Hub API: high-level API for Inference Providers.
- API Reference: learn more about the parameters and task-specific settings.