[FEEDBACK and SHOWCASE] PRO subscription

#13
by osanseviero - opened

Feel free to add your feedback about the Inference API for PRO users as well as other features.

osanseviero pinned discussion

I am getting a very partial (truncated) response. I am using HuggingFaceHub like this:

```python
from langchain.llms import HuggingFaceHub  # LangChain wrapper around the HF Inference API

repo_id = "meta-llama/Llama-2-70b-chat-hf"
args = {
    "temperature": 1,
    "max_length": 1024,
}
HuggingFaceService.llm = HuggingFaceHub(repo_id=repo_id, model_kwargs=args)
```

The prompt is: You are an assistant who can generate the response based on the prompt.
Use the following pieces of context to answer the question at the end.
If you don't find the answer, just say Sorry I didn't understand, can you rephrase please.
[Document(page_content='Types of workflow in the DigitalChameleon platform There are two types of workflows that can be created in the platform which include: 1.\tConversation: A series of nodes with questions or text displayed to the customer in a sequence one by one, to capture the response of Customer, is referred to as Conversation workflow. The nodes of a workflow of conversation type are loaded on the webpage to the customer one at a time. The flow can be modified to return to a previous flow or allow customer to resume work at a later point in time. Workflow will go to the next node only when the customer performs the desired action in the previous node as configured in the workflow. 2.\tForm: A one time loading of the nodes/questions/messages to the end customer all at once in the UI of a form. The form will be created in the similar manner as we create for conversation in the CMS except for the workflow type in the journey properties should be selected as Form while creating/copying the workflow.
Question: explain the Types of workflow in the DigitalChameleon platform

result : "result": ". \n ')]] Sure, I'd be happy to explain the types of"

I am using LangChain to get answers based on a text file.

Hugging Face org

@it-chameleoncx can you format your post with codeblocks (```) thanks

Have any models been added for use by PRO users beyond the four listed in the announcement blog post, such as the current top Code Llama derivative from Phind? If I wanted to use that with llm-vscode, would I still need to pay for my own Inference Endpoint?
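For context, a minimal sketch of how I'd expect to query such a model if it were enabled for serverless inference (the model id and token below are placeholders; whether this model is actually covered by PRO is exactly my question):

```python
from huggingface_hub import InferenceClient

# Assumption: the Phind Code Llama derivative is served by the serverless
# Inference API under PRO (unconfirmed - that's the question above).
client = InferenceClient(model="Phind/Phind-CodeLlama-34B-v2", token="hf_xxx")

completion = client.text_generation(
    "def fibonacci(n):",  # code prompt to autocomplete
    max_new_tokens=64,
)
print(completion)
```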

Are you planning to add more models to the PRO Inference API, for example teknium/OpenHermes-2.5-Mistral-7B?

Hi,
Please add PRO Inference API support for mistralai/Mixtral-8x7B-Instruct-v0.1. It would also be nice to have access to other models that are available through HuggingChat but not to PRO subscribers.
Thank you 🙂

Hi, can you please provide a link to a privacy policy that applies to the PRO Inference API?

Hello, can you please add https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B to the PRO subscription?

Sorry if I didn't understand this correctly: are there any limitations on requests for PRO / free accounts, such as a limit on tokens?

I am trying to access meta-llama/Llama-2-70b-chat-hf, which was previously available to me as a PRO subscriber, but it seems the model does not respond.
Can you please reactivate it?

I can't apply spaces.GPU to async functions, and I can't apply spaces.GPU to wrapped functions. It would be nice if both were possible.
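For context, a minimal sketch of the patterns this seems to describe, assuming the standard `spaces` package used for ZeroGPU Spaces (the function names are placeholders):

```python
import functools
import spaces

def run_model(prompt: str) -> str:
    ...  # placeholder for the actual GPU work

# Supported case: decorating a plain synchronous function.
@spaces.GPU
def generate(prompt: str) -> str:
    return run_model(prompt)

# Reported as not working (1): decorating an async function.
@spaces.GPU
async def generate_async(prompt: str) -> str:
    return run_model(prompt)

# Reported as not working (2): applying the decorator to an already-wrapped callable.
def with_logging(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print("calling", fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

generate_wrapped = spaces.GPU(with_logging(run_model))
```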

Hello Hugging Face Support Team,

I’m interested in using the models available through the PRO subscription and have reviewed the details on Inference for PRO in the following link: https://huggingface.co/blog/inference-pro.

Specifically, I would like to use the following model:
https://api-inference.huggingface.co/models/openai/whisper-large-v3-turbo
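For reference, a minimal sketch of how this model can be queried through the serverless Inference API (the token and audio file below are placeholders; this assumes the standard raw-audio payload for speech-recognition models):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v3-turbo"
headers = {"Authorization": "Bearer hf_xxx"}  # placeholder PRO account token

# Send the raw audio bytes; the API returns a JSON transcription.
with open("sample.flac", "rb") as f:
    response = requests.post(API_URL, headers=headers, data=f.read())

print(response.json())  # e.g. {"text": "..."}
```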

I’d like to know what the monthly usage limits are for this model under the PRO subscription. Specifically, how many requests can I make in a month, and what other limitations might apply?

Could you please provide detailed information regarding rate limits, monthly request quotas, response times, and any other restrictions associated with the PRO plan?

Thank you for your assistance.

Hey folks, I'm not sure where the best place to put this is, but I'd like some clarity on the models that have increased inference usage for PROs ~

Current Inference Docs

The recently published serverless inference docs mention these models as having higher rate limits:

| Model | Size | Supported Context Length | Use |
|---|---|---|---|
| Meta Llama 3.1 Instruct | 8B, 70B | 70B: 32k tokens / 8B: 8k tokens | High quality multilingual chat model with large context length |
| Meta Llama 3 Instruct | 8B, 70B | 8k tokens | One of the best chat models |
| Meta Llama Guard 3 | 8B | 4k tokens | |
| Llama 2 Chat | 7B, 13B, 70B | 4k tokens | One of the best conversational models |
| DeepSeek Coder v2 | 236B | 16k tokens | A model with coding capabilities |
| Bark | 0.9B | - | Text to audio generation |

Old Blog Article

But there's also the old blog post that introduced the feature with these models, and it hasn't been updated:

| Model | Size | Context Length | Use |
|---|---|---|---|
| Meta Llama 3 Instruct | 8B, 70B | 8k tokens | One of the best chat models |
| Mixtral 8x7B Instruct | 45B MOE | 32k tokens | Performance comparable to top proprietary models |
| Nous Hermes 2 Mixtral 8x7B DPO | 45B MOE | 32k tokens | Further trained over Mixtral 8x7B MoE |
| Zephyr 7B β | 7B | 4k tokens | One of the best chat models at the 7B weight |
| Llama 2 Chat | 7B, 13B | 4k tokens | One of the best conversational models |
| Mistral 7B Instruct v0.2 | 7B | 4k tokens | One of the best chat models at the 7B weight |
| Code Llama Base | 7B and 13B | 4k tokens | Autocomplete and infill code |
| Code Llama Instruct | 34B | 16k tokens | Conversational code assistant |
| Stable Diffusion XL | 3B UNet | - | Generate images |
| Bark | 0.9B | - | Text to audio generation |

I assume that the new inference docs have the correct supported-models list, but the old blog post could be updated to avoid confusion.

My Suggestions

If the inference docs are correct, I think the supported-models list could use some updating!

  1. Llama-3-70B could be swapped out for Llama-3.3-70B-Instruct, while keeping Llama-3.1-8B-Instruct.
  2. We probably don't need two large Llama 3.x models, so I'd suggest replacing Llama-3.1-70B with Qwen2.5-72B-Instruct.
  3. It's time to retire Llama-2... In 2025 we have plenty of great reasoning models to prioritize like QwQ-32B-Preview or DeepSeek-R1-Distill-Qwen-32B.
  4. Although I do like the novel nature of suno/bark, it is starting to show its age. I'd suggest replacing it with hexgrad/Kokoro-82M for its small size, exceptional quality, and long inputs.
  5. DeepSeek-Coder-V2 is a very large model that is matched or outperformed by Qwen2.5-Coder-32B-Instruct. If there are any concerns about the size of models or potential load, I think aiming to replace DeepSeek-Coder-V2 would be a wise use of resources.

Notable mentions and other thoughts

I tried to keep my suggestions limited to the current paradigm of serverless inference so that each model is a drop-in replacement for an existing one, while being realistic about size. However, it would be awesome to have a text-to-image model available on this list. The best and most agreeable image-gen model is either FLUX.1-schnell or stabilityai/stable-diffusion-3.5-medium. Both models are relatively smol, and all of the models above are commercially permissive or already available on HuggingChat.

Thanks for reading :)
