Hugging Face Inference API - Free Testing of Thousands of Open-Source Models

📋 Service Overview

Service Name: Hugging Face Inference API
Provider: Hugging Face
API Endpoint: https://api-inference.huggingface.co/models/{model_id}
Service Type: Freemium (free tier includes ~$0.10/month of credits; PRO at $9/month includes ~$2/month of credits)
Registration: Required (account and API token)


✅ Service Description

Hugging Face Inference API is a serverless inference service that lets developers call thousands of open-source models hosted on the Hugging Face Hub through simple HTTP requests. There is no need to deploy models yourself, so you can quickly test and integrate various AI capabilities.

Main Features

  • Abundant Models: Supports thousands of public models covering a wide range of AI tasks
  • Free Quota: Free accounts get ~$0.10/month and PRO accounts ~$2/month of inference credits (reference values)
  • Ready to Use: No deployment needed; call models directly via the API
  • Multi-Task Support: Text generation, image generation, speech recognition, image classification, and more

๐ŸŽ Available Models

Free Model Types

Hugging Face Inference API supports the following task types:

Natural Language Processing (NLP)

| Task Type | Description | Example Models |
| --- | --- | --- |
| Text Generation | Generate text from a prompt | Llama, Mistral, Qwen, DeepSeek |
| Text Classification | Classify text into labels | BERT, RoBERTa |
| Token Classification | Named entity recognition | BERT-NER |
| Question Answering | Q&A systems | BERT-QA |
| Translation | Machine translation | MarianMT, T5 |
| Summarization | Text summarization | BART, T5 |
| Fill-Mask | Fill in masked tokens | BERT, RoBERTa |
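
Most of these NLP tasks map to a dedicated helper method on the huggingface_hub InferenceClient. A minimal sketch (the model IDs below are illustrative examples, not the only options):

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

# Machine translation (English to French)
translated = client.translation(
    "Hello, how are you?",
    model="Helsinki-NLP/opus-mt-en-fr"
)
print(translated)

# Fill-mask ([MASK] is BERT's mask token)
predictions = client.fill_mask(
    "The capital of France is [MASK].",
    model="bert-base-uncased"
)
print(predictions)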

Computer Vision (CV)

| Task Type | Description | Example Models |
| --- | --- | --- |
| Image Classification | Classify images into labels | ResNet, ViT |
| Object Detection | Detect objects in images | DETR, YOLO |
| Image Segmentation | Segment images into regions | SegFormer |
| Image-to-Image | Transform an input image | Stable Diffusion |
| Text-to-Image | Generate images from text | Stable Diffusion, DALL-E mini |
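
Vision tasks follow the same pattern; for example, text-to-image returns a PIL.Image object. A minimal sketch, assuming Pillow is installed:

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

# Generate an image from a text prompt and save it locally
image = client.text_to_image(
    "a beautiful sunset over the ocean, oil painting style",
    model="stabilityai/stable-diffusion-2-1"
)
image.save("sunset.png")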

Audio Processing

| Task Type | Description | Example Models |
| --- | --- | --- |
| Automatic Speech Recognition | Transcribe speech to text | Whisper |
| Audio Classification | Classify audio clips | Wav2Vec2 |
| Text-to-Speech | Synthesize speech from text | FastSpeech |
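
Audio helpers accept a local path, raw bytes, or a URL. A minimal speech-recognition sketch (speech.wav is a hypothetical local file):

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

# Transcribe a local audio file with Whisper
transcript = client.automatic_speech_recognition(
    "speech.wav",
    model="openai/whisper-large-v3"
)
print(transcript)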

Popular Model Examples

  • Llama 3.1 8B / 70B - Meta's open-source large language model
  • Mistral 7B / Mixtral 8x7B - Mistral AI's high-performance models
  • Qwen 2.5 - Alibaba Cloud's multilingual model
  • FLUX.1 - High-quality image generation model
  • Whisper - OpenAI's speech recognition model
  • Stable Diffusion - Image generation model

🔢 Quotas and Limits

Free Tier Limits

| Limit Item | Quota | Notes |
| --- | --- | --- |
| Monthly Quota | Free ~$0.10/month | PRO ~$2/month (reference, subject to official terms) |
| Rate Limits | Varies by tier | Free/PRO/Team/Enterprise have different limits |
| Concurrent Requests | Limited | Avoid many requests in a short time |
| Cold Start Time | May be long | First request may need model loading |
| Response Time | No guarantee | Best effort, no SLA |
| Credit Card Required | ❌ Not required | Free quota needs no credit card |

PRO Account ($9/month)

| Limit Item | Quota | Notes |
| --- | --- | --- |
| Monthly Quota | ~$2/month | Included in the $9/month subscription; supports pay-as-you-go |
| Rate Limits | Higher limit | Significantly increased rate limits |
| Priority Processing | ✅ | Requests prioritized, less waiting |
| Cold Start | Faster | Models kept active |
| Early Access | ✅ | Early access to new features and models |

โš ๏ธ Important Limitations

  1. Monthly Quota Limits: Free ~$0.10/month, PRO ~$2/month; requests are limited once the quota is exhausted (values are approximate and subject to official terms)
  2. Cold Start Delay: The first request, or the first after long inactivity, needs model loading time (possibly 10-30 seconds)
  3. Rate Limits: Different account tiers have different rate limits; exceeding them returns a 429 error
  4. Model Availability: Some models may require a PRO account or special permissions
  5. No SLA Guarantee: The free tier provides no service level agreement
  6. Production Use Limits: The free tier is not recommended for production; use dedicated inference endpoints instead

💰 Pricing

Free Tier

  • Price: Completely free
  • Monthly Quota: ~$0.10/month (reference, subject to official terms)
  • Use Cases: Testing, learning, small-scale applications

PRO Account

  • Price: $9/month
  • Included Quota: ~$2/month of Inference credits
  • Features:
    • Higher rate limits
    • Priority request processing
    • Faster cold starts
    • Pay-as-you-go support (after quota exhausted)
    • Early access to new features
  • Use Cases: Personal projects, small to medium-scale applications

Dedicated Inference Endpoints

  • Price: Pay-as-you-go, from $0.06/hour
  • Features:
    • Dedicated compute resources
    • No cold starts
    • Auto-scaling
    • SLA guarantees
  • Use Cases: Production environments, enterprise applications

🚀 How to Use

Prerequisites

1. Register Account

First register a Hugging Face account.

2. Get Access Token

Log in to Hugging Face

Visit https://huggingface.co and log in to your account

Go to Settings Page

Click your avatar in the top right → Settings → Access Tokens

Create New Token
  1. Click the "New token" button
  2. Enter a token name (e.g., my-api-token)
  3. Select the permission type (Read is recommended)
  4. Click "Generate a token"
  5. Important: Copy and securely save your token
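
Rather than hardcoding the token (see Notes below), read it from an environment variable. A minimal sketch, assuming you have set HF_TOKEN in your shell (the variable name is a common convention, not a requirement):

Python
import os

# Set beforehand with: export HF_TOKEN=hf_xxx
HF_TOKEN = os.environ["HF_TOKEN"]
headers = {"Authorization": f"Bearer {HF_TOKEN}"}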

💻 Code Examples

Python Examples

Install dependencies:

Bash
pip install requests
# Or use official library
pip install huggingface_hub

Using requests library:

Python
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B-Instruct"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# Call API
output = query({
    "inputs": "Explain the history of artificial intelligence.",
    "parameters": {
        "max_new_tokens": 500,
        "temperature": 0.7
    }
})

print(output)
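
If the model is still loading (a cold start), the API returns HTTP 503 with a JSON error body instead of results; see the retry pattern under Practical Tips below.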

Using huggingface_hub library:

Python
from huggingface_hub import InferenceClient

# Initialize client
client = InferenceClient(token="YOUR_HF_TOKEN")

# Text generation
response = client.text_generation(
    "Explain the history of artificial intelligence.",
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_new_tokens=500,
    temperature=0.7
)

print(response)

Streaming output:

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

# Stream text generation
for token in client.text_generation(
    "Write a poem about spring",
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_new_tokens=200,
    stream=True
):
    print(token, end="", flush=True)

cURL Examples

Text generation:

Bash
curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B-Instruct \
  -X POST \
  -H "Authorization: Bearer YOUR_HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Explain the history of artificial intelligence.",
    "parameters": {
      "max_new_tokens": 500,
      "temperature": 0.7
    }
  }'

Image generation:

Bash
curl https://api-inference.huggingface.co/models/stabilityai/stable-diffusion-2-1 \
  -X POST \
  -H "Authorization: Bearer YOUR_HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "a beautiful sunset over the ocean, oil painting style"
  }' \
  --output image.jpg

Node.js Examples
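
Install dependencies:

Bash
npm install @huggingface/inference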

JavaScript
import { HfInference } from '@huggingface/inference'

const hf = new HfInference('YOUR_HF_TOKEN')

async function generateText() {
  const result = await hf.textGeneration({
    model: 'meta-llama/Llama-3.1-8B-Instruct',
    inputs: 'Explain the history of artificial intelligence.',
    parameters: {
      max_new_tokens: 500,
      temperature: 0.7
    }
  })
  
  console.log(result.generated_text)
}

generateText()

🌟 Core Advantages

Technical Advantages

  1. No Deployment Needed:

    • No server and infrastructure management
    • No model installation and configuration
    • Ready to use out of the box, focus on application development
  2. Rich Model Selection:

    • Over 1 million models to choose from
    • Covers various AI tasks and scenarios
    • Continuously updated with latest models
  3. Fast Iteration:

    • Quickly test different models
    • Easily switch models
    • Accelerate prototype development

Comparison with Other APIs

| Feature | Hugging Face API | OpenAI API | Google AI Studio API | DeepSeek API |
| --- | --- | --- | --- | --- |
| Free Quota | ~$0.10/month credits | $18 / 3 months | Free usage | ¥5 / 7 days |
| Model Count | 🏆 1M+ | Few models | Gemini series | DeepSeek series |
| Open-Source Models | 🏆 Full support | ❌ Not supported | ❌ Not supported | ✅ Partially open |
| Custom Models | ✅ Can upload | ❌ Cannot | ❌ Cannot | ❌ Cannot |
| Task Types | 🏆 Most comprehensive | Mainly NLP | Mainly NLP | Mainly NLP |
| Credit Card Required | ❌ | ✅ | ❌ | ⚠️ Recharge required |
| Cold Start | ⚠️ Yes | ❌ No | ❌ No | ❌ No |

💡 Practical Tips

✅ Best Practices

  1. Choose the Right Model:

    • Select specialized models based on task type
    • Check model downloads and ratings
    • Test in Playground before integration
  2. Handle Cold Starts:

    import requests
    import time

    # API_URL and headers are as defined in the requests example above
    def query_with_retry(payload, max_retries=3):
        """API call with retry for cold start handling"""
        for attempt in range(max_retries):
            try:
                response = requests.post(API_URL, headers=headers, json=payload)
                if response.status_code == 503:
                    # Model loading, wait and retry
                    wait_time = 10 + (attempt * 5)
                    print(f"Model loading, waiting {wait_time} seconds...")
                    time.sleep(wait_time)
                    continue
                return response.json()
            except Exception as e:
                print(f"Request failed: {e}")
                if attempt == max_retries - 1:
                    raise
        return None
  3. Cache Results:

    • Cache results for identical inputs (see the sketch below)
    • Reduce the number of API calls
    • Improve application response time
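
A minimal in-memory cache, reusing the query() helper from the requests example above (for production, a shared cache such as Redis would be more appropriate):

Python
import hashlib
import json

_cache = {}

def cached_query(payload):
    """Return the cached result for an identical payload, calling the API otherwise."""
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = query(payload)  # query() as defined in the requests example
    return _cache[key]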

🎯 Error Handling and Batching

Rate Limit Handling:

Python
import time
import requests
from requests.exceptions import HTTPError

# API_URL and headers are as defined in the earlier examples

def call_api_with_rate_limit(payload):
    """API call with rate limit handling"""
    max_retries = 5
    retry_delay = 1
    
    for attempt in range(max_retries):
        try:
            response = requests.post(API_URL, headers=headers, json=payload)
            response.raise_for_status()
            return response.json()
        except HTTPError as e:
            if e.response.status_code == 429:
                # Rate limited, exponential backoff
                wait_time = retry_delay * (2 ** attempt)
                print(f"Rate limited, waiting {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                raise
    
    raise Exception("Max retries reached")

Batch Processing:

Python
import time
import requests

# API_URL and headers are as defined in the earlier examples

def batch_inference(inputs, batch_size=10):
    """Batch inference to reduce request count"""
    results = []
    
    for i in range(0, len(inputs), batch_size):
        batch = inputs[i:i+batch_size]
        payload = {"inputs": batch}
        
        try:
            response = requests.post(API_URL, headers=headers, json=payload)
            response.raise_for_status()
            results.extend(response.json())
            
            # Avoid rate limiting
            time.sleep(1)
        except Exception as e:
            print(f"Batch {i//batch_size} failed: {e}")
    
    return results

โš ๏ธ Notes

  1. Token Security: Don't hardcode tokens in code, use environment variables
  2. Rate Limits: Be mindful of free tier rate limits, avoid frequent requests
  3. Cold Starts: First request may be slow, handle timeouts properly
  4. Production Environment: Free tier not suitable for production, consider dedicated inference endpoints
  5. Model License: Check model usage licenses to ensure compliance with your use case

🎯 Real-World Use Cases

Case 1: Multi-Model Comparison Tool

Scenario: Compare answers from different models to same question

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

models = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.2",
    "Qwen/Qwen2.5-7B-Instruct"
]

def compare_models(prompt):
    """Compare outputs from multiple models"""
    results = {}
    
    for model in models:
        print(f"\nTesting model: {model}")
        try:
            response = client.text_generation(
                prompt,
                model=model,
                max_new_tokens=200
            )
            results[model] = response
            print(f"Answer: {response[:100]}...")
        except Exception as e:
            results[model] = f"Error: {e}"
    
    return results

# Usage example
prompt = "Explain artificial intelligence in one sentence."
results = compare_models(prompt)

for model, response in results.items():
    print(f"\n{model}:")
    print(response)

Case 2: Document Summarization Service

Scenario: Automatically generate document summaries

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

def summarize_document(document_text):
    """Generate document summary"""
    # Use a summarization model; generation options go in `parameters`
    # (argument names can vary across huggingface_hub versions)
    summary = client.summarization(
        document_text,
        model="facebook/bart-large-cnn",
        parameters={"max_length": 150, "min_length": 50}
    )
    
    # Recent huggingface_hub versions return a SummarizationOutput object
    return summary.summary_text

# Usage example
document = """
Artificial Intelligence (AI) is a branch of computer science
aimed at creating systems capable of performing tasks that typically
require human intelligence. These tasks include visual perception,
speech recognition, decision-making, and language translation...
"""

summary = summarize_document(document)
print(f"Summary: {summary}")

Case 3: Smart Image Classification App

Scenario: Image classification and content recognition

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

def classify_image(image_path):
    """Image classification"""
    with open(image_path, "rb") as f:
        data = f.read()
    
    result = client.image_classification(
        data,
        model="google/vit-base-patch16-224"
    )
    
    return result

# Usage example
image_path = "photo.jpg"
results = classify_image(image_path)

print("Image classification results:")
for item in results:
    print(f"- {item['label']}: {item['score']:.2%}")

🔧 Common Questions

Q: Is the Inference API completely free?
A: A free quota is provided (Free ~$0.10/month; PRO at $9/month includes ~$2/month). Limits apply once it is exhausted, and PRO supports pay-as-you-go.

Q: What is a cold start?
A: Models need time to load on the first request or after long inactivity, which may take 10-30 seconds. PRO users get faster cold starts.

Q: Can I use my own uploaded models?
A: Yes! After uploading models to Hugging Face Hub, you can call them via Inference API.

Q: Is free tier suitable for production?
A: Not recommended. Free tier has no SLA guarantee, has rate limits and cold starts. Production environments should use dedicated inference endpoints.

Q: How to handle rate limit errors?
A: Implement exponential backoff retry mechanism, or upgrade to PRO account, or use dedicated inference endpoints.

Q: Which programming languages are supported?
A: Officially supports Python, JavaScript/TypeScript. Other languages can use direct HTTP requests.


๐Ÿ“ Changelog

  • 2024: Support for more model types and tasks
  • 2023: Launched PRO account plan
  • 2022: Inference API officially released
  • 2021: Started providing hosted inference service

Service Provider: Hugging Face
