API - Hugging Face

Hugging Face Inference API - Free Testing of Thousands of Open-Source Models

šŸ“‹ Service Overview

Service Name: Hugging Face Inference API
Provider: Hugging Face
API Endpoint: https://api-inference.huggingface.co/models/{model_id}
Service Type: Freemium (Free tier includes ~$0.10/month in credits; PRO at $9/month includes ~$2/month in credits)
Registration: Account registration and an API token are required


āœ… Service Description

Hugging Face Inference API is a serverless inference service that allows developers to call thousands of open-source models hosted on the Hugging Face Hub through simple HTTP requests. There is no need to deploy models yourself, so you can quickly test and integrate a wide range of AI capabilities.

Main Features

  • Rich Model Selection: Supports thousands of public models covering a wide range of AI tasks
  • Free Quota: Free accounts get ~$0.10/month in credits, PRO accounts ~$2/month (reference values, subject to official terms)
  • Ready to Use: No deployment needed; call models directly via the API
  • Multi-Task Support: Text generation, image generation, speech recognition, image classification, and more

šŸŽ Available Models

Free Model Types

Hugging Face Inference API supports the following task types:

Natural Language Processing (NLP)

| Task Type | Description | Example Models |
|---|---|---|
| Text Generation | Text generation | Llama, Mistral, Qwen, DeepSeek |
| Text Classification | Text classification | BERT, RoBERTa |
| Token Classification | Named entity recognition | BERT-NER |
| Question Answering | Q&A systems | BERT-QA |
| Translation | Machine translation | MarianMT, T5 |
| Summarization | Text summarization | BART, T5 |
| Fill-Mask | Fill in the blank | BERT, RoBERTa |
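
For example, a pipeline task such as translation can be called with a plain HTTP request. The sketch below is illustrative: the model ID (Helsinki-NLP/opus-mt-en-fr) is just one public translation model, and the exact response shape can vary by pipeline.

Python
import requests

# Minimal sketch: call a translation model through its task endpoint
API_URL = "https://api-inference.huggingface.co/models/Helsinki-NLP/opus-mt-en-fr"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

response = requests.post(API_URL, headers=headers, json={"inputs": "Hello, how are you?"})
print(response.json())  # typically a list like [{"translation_text": "..."}]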

Computer Vision (CV)

| Task Type | Description | Example Models |
|---|---|---|
| Image Classification | Image classification | ResNet, ViT |
| Object Detection | Object detection | DETR, YOLO |
| Image Segmentation | Image segmentation | SegFormer |
| Image-to-Image | Image transformation | Stable Diffusion |
| Text-to-Image | Text-to-image generation | Stable Diffusion, DALL-E mini |
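
As a quick illustration of a vision task, text-to-image generation can be called through the Python client. This is a minimal sketch assuming a recent huggingface_hub version; the model ID matches the cURL example later in this page.

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

# text_to_image returns a PIL.Image object in recent huggingface_hub versions
image = client.text_to_image(
    "a beautiful sunset over the ocean, oil painting style",
    model="stabilityai/stable-diffusion-2-1"
)
image.save("sunset.png")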

Audio Processing

| Task Type | Description | Example Models |
|---|---|---|
| Automatic Speech Recognition | Speech recognition | Whisper |
| Audio Classification | Audio classification | Wav2Vec2 |
| Text-to-Speech | Speech synthesis | FastSpeech |
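
Speech recognition follows the same pattern. A minimal sketch with the Python client (the model ID and output attribute assume a recent huggingface_hub version):

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

# Accepts a local file path, raw bytes, or a URL
result = client.automatic_speech_recognition(
    "speech.mp3",
    model="openai/whisper-large-v3"
)
print(result.text)  # transcribed text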

Popular Model Examples

  • Llama 3.1 8B / 70B - Meta’s open-source large language model
  • Mistral 7B / Mixtral 8x7B - Mistral AI’s high-performance models
  • Qwen 2.5 - Alibaba Cloud’s multilingual model
  • FLUX.1 - High-quality image generation model
  • Whisper - OpenAI’s speech recognition model
  • Stable Diffusion - Image generation model

šŸ”¢ Quotas and Limits

Free Tier Limits

| Limit Item | Quota | Notes |
|---|---|---|
| Monthly Quota | Free ~$0.10/month | PRO ~$2/month (reference, subject to official terms) |
| Rate Limits | Varies by tier | Free/PRO/Team/Enterprise have different limits |
| Concurrent Requests | Limited | Avoid sending many requests in a short time |
| Cold Start Time | May be long | First request may need to load the model |
| Response Time | No guarantee | Best effort, no SLA |
| Credit Card Required | āŒ Not required | Free quota needs no credit card |

PRO Account ($9/month)

| Limit Item | Quota | Notes |
|---|---|---|
| Monthly Quota | ~$2/month | Included in the $9/month subscription; supports pay-as-you-go |
| Rate Limits | Higher limits | Significantly increased rate limits |
| Priority Processing | āœ… | Requests are prioritized, less waiting |
| Cold Start | Faster | Models kept active |
| Early Access | āœ… | Early access to new features and models |

āš ļø Important Limitations

  1. Monthly Quota Limits: Free ~$0.10/month, PRO ~$2/month; limitations apply once the quota is exhausted (values are for reference and subject to official terms)
  2. Cold Start Delay: The first request, or a request after long inactivity, needs loading time (possibly 10-30 seconds)
  3. Rate Limits: Different account tiers have different rate limits; exceeding them returns a 429 error
  4. Model Availability: Some models may require a PRO account or special permissions
  5. No SLA Guarantee: The free tier provides no service level agreement
  6. Production Use Limits: The free tier is not recommended for production; use dedicated Inference Endpoints instead

šŸ’° Pricing

Free Tier

  • Price: Completely free
  • Monthly Quota: ~$0.10/month (reference, subject to official terms)
  • Use Cases: Testing, learning, small-scale applications

PRO Account

  • Price: $9/month
  • Included Quota: ~$2/month of Inference credits
  • Features:
    • Higher rate limits
    • Priority request processing
    • Faster cold starts
    • Pay-as-you-go support (after quota exhausted)
    • Early access to new features
  • Use Cases: Personal projects, small to medium-scale applications

Dedicated Inference Endpoints

  • Price: Pay-as-you-go, from $0.06/hour
  • Features:
    • Dedicated compute resources
    • No cold starts
    • Auto-scaling
    • SLA guarantees
  • Use Cases: Production environments, enterprise applications

šŸš€ How to Use

Prerequisites

1. Register Account

First, register a Hugging Face account at https://huggingface.co/join.

2. Get Access Token

Log in to Hugging Face

Visit https://huggingface.co and log in to your account

Go to Settings Page

Click avatar in top right → Settings → Access Tokens

Create New Token
  1. Click “New token” button
  2. Enter token name (e.g., my-api-token)
  3. Select permission type (recommend Read)
  4. Click “Generate a token”
  5. Important: Copy and securely save your token (see the environment-variable sketch below)
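
To avoid hardcoding the token in source files (see the notes later in this page), a common pattern is to export it as an environment variable and read it at runtime. A minimal sketch (the variable name HF_TOKEN is just a convention):

Bash
# Set the token in your shell (placeholder value)
export HF_TOKEN=YOUR_HF_TOKEN

Python
import os
from huggingface_hub import InferenceClient

# Read the token from the environment instead of hardcoding it
client = InferenceClient(token=os.environ["HF_TOKEN"])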

šŸ’» Code Examples

Python Examples

Install dependencies:

Bash
pip install requests
# Or use official library
pip install huggingface_hub

Using requests library:

Python
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B-Instruct"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# Call API
output = query({
    "inputs": "Explain the history of artificial intelligence.",
    "parameters": {
        "max_new_tokens": 500,
        "temperature": 0.7
    }
})

print(output)

Using huggingface_hub library:

Python
from huggingface_hub import InferenceClient

# Initialize client
client = InferenceClient(token="YOUR_HF_TOKEN")

# Text generation
response = client.text_generation(
    "Explain the history of artificial intelligence.",
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_new_tokens=500,
    temperature=0.7
)

print(response)

Streaming output:

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

# Stream text generation
for token in client.text_generation(
    "Write a poem about spring",
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_new_tokens=200,
    stream=True
):
    print(token, end="", flush=True)

cURL Examples

Text generation:

Bash
curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B-Instruct \
  -X POST \
  -H "Authorization: Bearer YOUR_HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Explain the history of artificial intelligence.",
    "parameters": {
      "max_new_tokens": 500,
      "temperature": 0.7
    }
  }'

Image generation:

Bash
curl https://api-inference.huggingface.co/models/stabilityai/stable-diffusion-2-1 \
  -X POST \
  -H "Authorization: Bearer YOUR_HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "a beautiful sunset over the ocean, oil painting style"
  }' \
  --output image.jpg

Node.js Examples

JavaScript
import { HfInference } from '@huggingface/inference'

const hf = new HfInference('YOUR_HF_TOKEN')

async function generateText() {
  const result = await hf.textGeneration({
    model: 'meta-llama/Llama-3.1-8B-Instruct',
    inputs: 'Explain the history of artificial intelligence.',
    parameters: {
      max_new_tokens: 500,
      temperature: 0.7
    }
  })
  
  console.log(result.generated_text)
}

generateText()

🌟 Core Advantages

Technical Advantages

  1. No Deployment Needed:

    • No server and infrastructure management
    • No model installation and configuration
    • Ready to use out of the box, focus on application development
  2. Rich Model Selection:

    • The Hub hosts over 1 million models, thousands of which can be called directly via the API
    • Covers various AI tasks and scenarios
    • Continuously updated with latest models
  3. Fast Iteration:

    • Quickly test different models
    • Easily switch models
    • Accelerate prototype development

Comparison with Other APIs

| Feature | Hugging Face API | OpenAI API | Google AI Studio API | DeepSeek API |
|---|---|---|---|---|
| Free Quota | ~$0.10/month in credits | $18 for 3 months | Free usage | ļæ„5 for 7 days |
| Model Count | šŸ† 1M+ | Few models | Gemini series | DeepSeek series |
| Open Source Models | šŸ† Full support | āŒ Not supported | āŒ Not supported | āœ… Partially open |
| Custom Models | āœ… Can upload | āŒ Cannot | āŒ Cannot | āŒ Cannot |
| Task Types | šŸ† Most comprehensive | Mainly NLP | Mainly NLP | Mainly NLP |
| Credit Card Required | āŒ | āœ… | āŒ | āš ļø Recharge required |
| Cold Start | āš ļø Yes | āŒ No | āŒ No | āŒ No |

šŸ’” Practical Tips

āœ… Best Practices

  1. Choose the Right Model:

    • Select specialized models based on task type
    • Check model downloads and ratings
    • Test in Playground before integration
  2. Handle Cold Starts:

    import time
    import requests
    
    def query_with_retry(payload, max_retries=3):
        """API call with retry to handle cold starts (HTTP 503 while the model loads)"""
        for attempt in range(max_retries):
            try:
                response = requests.post(API_URL, headers=headers, json=payload)
                if response.status_code == 503:
                    # Model is still loading; wait and retry
                    wait_time = 10 + (attempt * 5)
                    print(f"Model loading, waiting {wait_time} seconds...")
                    time.sleep(wait_time)
                    continue
                return response.json()
            except Exception as e:
                print(f"Request failed: {e}")
                if attempt == max_retries - 1:
                    raise
        return None  # every attempt hit a cold start
  3. Cache Results:

    • Cache results for identical inputs (see the sketch after this list)
    • Reduce API call count
    • Improve application response time
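
A minimal in-memory caching sketch is shown below; it assumes the query() helper from the requests example earlier, and the hashing scheme is just one way to build a cache key.

Python
import hashlib
import json

_cache = {}

def cached_query(payload):
    """Return a cached result for identical payloads; call the API only on a miss"""
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = query(payload)  # query() is defined in the requests example above
    return _cache[key]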

šŸŽÆ Best Practices

Rate Limit Handling:

Python
import time
import requests
from requests.exceptions import HTTPError

def call_api_with_rate_limit(payload):
    """API call with rate limit handling"""
    max_retries = 5
    retry_delay = 1
    
    for attempt in range(max_retries):
        try:
            response = requests.post(API_URL, headers=headers, json=payload)
            response.raise_for_status()
            return response.json()
        except HTTPError as e:
            if e.response.status_code == 429:
                # Rate limited, exponential backoff
                wait_time = retry_delay * (2 ** attempt)
                print(f"Rate limited, waiting {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                raise
    
    raise Exception("Max retries reached")

Batch Processing:

Python
import time
import requests

def batch_inference(inputs, batch_size=10):
    """Batch inference to reduce request count"""
    results = []
    
    for i in range(0, len(inputs), batch_size):
        batch = inputs[i:i+batch_size]
        payload = {"inputs": batch}
        
        try:
            response = requests.post(API_URL, headers=headers, json=payload)
            results.extend(response.json())
            
            # Avoid rate limiting
            time.sleep(1)
        except Exception as e:
            print(f"Batch {i//batch_size} failed: {e}")
    
    return results

āš ļø Notes

  1. Token Security: Don’t hardcode tokens in code, use environment variables
  2. Rate Limits: Be mindful of free tier rate limits, avoid frequent requests
  3. Cold Starts: First request may be slow, handle timeouts properly
  4. Production Environment: Free tier not suitable for production, consider dedicated inference endpoints
  5. Model License: Check model usage licenses to ensure compliance with your use case

šŸŽÆ Real-World Use Cases

Case 1: Multi-Model Comparison Tool

Scenario: Compare answers from different models to same question

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

models = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.2",
    "Qwen/Qwen2.5-7B-Instruct"
]

def compare_models(prompt):
    """Compare outputs from multiple models"""
    results = {}
    
    for model in models:
        print(f"\nTesting model: {model}")
        try:
            response = client.text_generation(
                prompt,
                model=model,
                max_new_tokens=200
            )
            results[model] = response
            print(f"Answer: {response[:100]}...")
        except Exception as e:
            results[model] = f"Error: {e}"
    
    return results

# Usage example
prompt = "Explain artificial intelligence in one sentence."
results = compare_models(prompt)

for model, response in results.items():
    print(f"\n{model}:")
    print(response)

Case 2: Document Summarization Service

Scenario: Automatically generate document summaries

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

def summarize_document(document_text):
    """Generate a document summary"""
    # Generation options go in the `parameters` dict in recent huggingface_hub versions
    summary = client.summarization(
        document_text,
        model="facebook/bart-large-cnn",
        parameters={"max_length": 150, "min_length": 50}
    )
    
    # Recent versions return a SummarizationOutput object
    return summary.summary_text

# Usage example
document = """
Artificial Intelligence (AI) is a branch of computer science
aimed at creating systems capable of performing tasks that typically
require human intelligence. These tasks include visual perception,
speech recognition, decision-making, and language translation...
"""

summary = summarize_document(document)
print(f"Summary: {summary}")

Case 3: Smart Image Classification App

Scenario: Image classification and content recognition

Python
from huggingface_hub import InferenceClient

client = InferenceClient(token="YOUR_HF_TOKEN")

def classify_image(image_path):
    """Image classification"""
    with open(image_path, "rb") as f:
        data = f.read()
    
    result = client.image_classification(
        data,
        model="google/vit-base-patch16-224"
    )
    
    return result

# Usage example
image_path = "photo.jpg"
results = classify_image(image_path)

print("Image classification results:")
for item in results:
    # Recent huggingface_hub versions return typed output objects
    print(f"- {item.label}: {item.score:.2%}")

šŸ”§ Common Questions

Q: Is the Inference API completely free?
A: It provides a free quota (Free tier ~$0.10/month; PRO at $9/month includes ~$2/month), and limitations apply once the quota is exhausted. PRO supports pay-as-you-go.

Q: What is a cold start?
A: Models need to be loaded on the first request or after long inactivity, which may take 10-30 seconds. PRO users get faster cold starts.

Q: Can I use my own uploaded models?
A: Yes. After uploading a model to the Hugging Face Hub, you can call it via the Inference API.

Q: Is the free tier suitable for production?
A: Not recommended. The free tier has no SLA guarantee and is subject to rate limits and cold starts. Production environments should use dedicated Inference Endpoints.

Q: How do I handle rate limit errors?
A: Implement an exponential backoff retry mechanism, upgrade to a PRO account, or use dedicated Inference Endpoints.

Q: Which programming languages are supported?
A: Official client libraries are available for Python and JavaScript/TypeScript; any other language can call the HTTP API directly.


šŸ“ Changelog

  • 2024: Support for more model types and tasks
  • 2023: Launched PRO account plan
  • 2022: Inference API officially released
  • 2021: Started providing hosted inference service

Service Provider: Hugging Face
