API - Cloudflare Workers AI

Cloudflare Workers AI API - Edge AI Inference Interface

šŸ“‹ Service Overview

Service Name: Cloudflare Workers AI API
Provider: Cloudflare Workers AI
API Endpoint: https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/{model_name}
Service Type: Freemium (10,000 neurons free daily)
Registration Requirements: Cloudflare account required, no credit card needed


āœ… Service Description

Cloudflare Workers AI API is an edge AI inference service built on Cloudflare’s global network. Unlike traditional centralized AI APIs, Workers AI deploys models across 300+ data centers worldwide and runs inference at the location closest to each user, significantly reducing latency and improving user experience.

Key Features

  • šŸŒ Global Edge Deployment: Runs in 300+ cities worldwide, providing minimum latency (typically < 50ms)
  • šŸŽ Generous Free Tier: 10,000 neurons free daily, no credit card required
  • ⚔ Ultra-Low Cost: Only $0.011/1000 neurons when exceeding free tier, 80%+ cheaper than traditional cloud services
  • šŸ¤– Rich Model Library: 50+ open-source models covering LLM, image, speech, embeddings, and more
  • šŸ”Œ Developer Friendly: REST API + Workers bindings, compatible with OpenAI SDK
  • šŸ”§ Deep Integration: Seamlessly integrates with Cloudflare Workers, Pages, AI Gateway, and Vectorize

šŸŽ Available Models

Text Generation Models (LLM)

| Model Name | Parameters | Context Length | Features | Use Cases |
|---|---|---|---|---|
| @cf/meta/llama-3.1-8b-instruct | 8B | 8K | Meta’s latest, balanced | General chat, code generation |
| @cf/meta/llama-3-8b-instruct | 8B | 8K | Meta Llama 3, fast | Real-time chat, summarization |
| @cf/mistral/mistral-7b-instruct-v0.2 | 7B | 32K | Mistral AI, long context | Document analysis, long text |
| @cf/qwen/qwen1.5-7b-chat-awq | 7B | 32K | Alibaba Qwen, Chinese optimized | Chinese chat, translation |
| @cf/google/gemma-7b-it | 7B | 8K | Google Gemma, open source | General tasks |
| @cf/deepseek-ai/deepseek-math-7b-instruct | 7B | 4K | DeepSeek math expert | Math reasoning, problem solving |

Image Generation Models

| Model Name | Features | Use Cases |
|---|---|---|
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | SDXL, high-quality images | Art creation, design |
| @cf/lykon/dreamshaper-8-lcm | LCM, fast generation | Quick prototyping, real-time |
| @cf/bytedance/stable-diffusion-xl-lightning | Ultra-fast, 4-8 steps | Real-time applications |

Image Analysis Models

| Model Name | Features | Use Cases |
|---|---|---|
| @cf/unum/uform-gen2-qwen-500m | Image understanding + text generation | Image captioning, VQA |
| @cf/llava-hf/llava-1.5-7b-hf | Multimodal understanding | Image Q&A, analysis |

Speech Recognition Models

| Model Name | Features | Use Cases |
|---|---|---|
| @cf/openai/whisper | OpenAI Whisper, multilingual | Speech-to-text, subtitles |

Embedding Models

| Model Name | Dimensions | Features | Use Cases |
|---|---|---|---|
| @cf/baai/bge-base-en-v1.5 | 768 | English embeddings, high performance | Semantic search, RAG |
| @cf/baai/bge-large-en-v1.5 | 1024 | English embeddings, higher accuracy | Precise matching |
| @cf/baai/bge-small-en-v1.5 | 384 | English embeddings, lightweight | Fast retrieval |

Complete Model List

Visit Cloudflare Workers AI Model Catalog for the latest model list.


šŸ”¢ Quotas and Limits

Free Tier Limits

| Limit Item | Quota | Description |
|---|---|---|
| Daily Neurons | 10,000 neurons/day | Shared across all models |
| Request Rate | By task type | Text Generation ~300 req/min, Embeddings ~3000 req/min |
| Single Request Size | Model dependent | Usually 1-10MB |
| Concurrent Requests | Rate limited | Different limits for different task types |
| Credit Card Required | āŒ No | Not needed for free tier; required when exceeding quota |

āš ļø Important Limits

  1. Neuron Calculation: Different models consume different amounts of neurons per inference:

    • Small LLM (e.g., Llama-3-8B): ~5-10 neurons/request
    • Image generation (e.g., SDXL): ~50-100 neurons/image
    • Speech recognition: ~1 neuron/second of audio
  2. Rate Limits: Different task types have different rate limits (req/min):

    • Text Generation: ~300 req/min
    • Text Embeddings: ~3000 req/min
    • Speech Recognition (ASR): ~720 req/min
    • Some models may have stricter limits
  3. Daily Reset: Free quota resets at UTC 00:00 daily

  4. Excess Billing: Usage beyond the free tier is charged at $0.011/1000 neurons; the sketch below estimates daily capacity and overage cost
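
A rough back-of-the-envelope estimator based on the approximate per-request neuron costs above (these figures are assumptions; actual consumption varies with model, input length, and output length):

Python
FREE_NEURONS_PER_DAY = 10_000
OVERAGE_PRICE_PER_1000 = 0.011  # USD per 1,000 neurons beyond the free tier

def estimate_daily_cost(requests_per_day: int, neurons_per_request: float) -> float:
    """Estimate daily USD cost for a given request volume (rough sketch)."""
    total_neurons = requests_per_day * neurons_per_request
    billable = max(0.0, total_neurons - FREE_NEURONS_PER_DAY)
    return billable / 1000 * OVERAGE_PRICE_PER_1000

# ~3,000 Llama-3-8B chats/day at ~8 neurons each = 24,000 neurons,
# of which 14,000 are billable -> ~$0.15/day
print(f"${estimate_daily_cost(3000, 8):.2f}/day")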

Quota Reset Time

  • Daily Quota: Resets at UTC 00:00 (a way to compute the time remaining is sketched below)
  • Real-time Monitoring: Check usage in the Cloudflare Dashboard
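
A small helper for clients that schedule work around the reset (a sketch; the reset time is UTC 00:00 per the limits above):

Python
from datetime import datetime, timedelta, timezone

# Seconds until the free quota resets at 00:00 UTC
now = datetime.now(timezone.utc)
next_reset = (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
print(f"Quota resets in {(next_reset - now).total_seconds():.0f} seconds")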

šŸ’° Pricing

Free Tier

  • Free Quota: 10,000 neurons daily
  • How to Get: Available immediately upon Cloudflare account registration
  • Validity: Permanent, resets daily

Paid Pricing

Billing after exceeding free tier:

| Item | Price | Description |
|---|---|---|
| Neurons | $0.011/1000 neurons | Pay per actual usage |
| Minimum Spend | None | True pay-as-you-go |

Cost Comparison

| Service | Typical Cost (1,000 LLM Inferences) | Relative Cost vs Workers AI |
|---|---|---|
| Cloudflare Workers AI | ~$0.05-0.10 | Baseline (1x) |
| OpenAI GPT-3.5 | ~$0.50-1.00 | 5-20x |
| OpenAI GPT-4 | ~$10-30 | 100-600x |
| AWS Bedrock | ~$0.30-2.00 | 3-40x |

šŸš€ How to Use

Prerequisites

1. Register Cloudflare Account

See: Cloudflare Workers AI Registration Guide

2. Get API Token

Access the API Tokens Page

Log in to the Cloudflare Dashboard and go to the API Tokens page.

Create a Token
  1. Click “Create Token”
  2. Select the “Edit Cloudflare Workers” template
  3. Or create a custom token, making sure it includes the Workers AI permission

Save the Token

Copy the generated token and store it securely (it is shown only once).

3. Get Account ID

Find your Account ID on the right side of any page in Cloudflare Dashboard.
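
Before wiring the token into an application, you can sanity-check it against Cloudflare’s token verification endpoint (this sketch assumes the token is stored in the CLOUDFLARE_API_TOKEN environment variable):

Python
import os
import requests

# A 200 response with "status": "active" means the token is valid
resp = requests.get(
    "https://api.cloudflare.com/client/v4/user/tokens/verify",
    headers={"Authorization": f"Bearer {os.environ['CLOUDFLARE_API_TOKEN']}"},
    timeout=10,
)
print(resp.json())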


šŸ’» Code Examples

Method 1: Using REST API (Any Language)

Python Example

Install Dependencies:

Bash
pip install requests

Basic Usage (Text Generation):

Python
import requests
import os

# Configuration
CLOUDFLARE_ACCOUNT_ID = "your-account-id"
CLOUDFLARE_API_TOKEN = "your-api-token"

# API endpoint
url = f"https://api.cloudflare.com/client/v4/accounts/{CLOUDFLARE_ACCOUNT_ID}/ai/run/@cf/meta/llama-3.1-8b-instruct"

# Request headers
headers = {
    "Authorization": f"Bearer {CLOUDFLARE_API_TOKEN}",
    "Content-Type": "application/json"
}

# Request data
data = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Cloudflare Workers AI?"}
    ]
}

# Send request
response = requests.post(url, headers=headers, json=data, timeout=30)
response.raise_for_status()
result = response.json()

# Print result (the envelope also includes "success" and "errors" fields)
print(result["result"]["response"])

Streaming Output:

Python
import requests
import json

# Enable streaming response
data = {
    "messages": [
        {"role": "user", "content": "Write a short story about AI"}
    ],
    "stream": True
}

response = requests.post(url, headers=headers, json=data, stream=True)

# Handle the server-sent event stream (it terminates with "data: [DONE]")
for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            payload = line[6:]
            if payload == '[DONE]':
                break
            try:
                chunk = json.loads(payload)
                if 'response' in chunk:
                    print(chunk['response'], end='', flush=True)
            except json.JSONDecodeError:
                continue

Image Generation Example:

Python
import requests

# Image generation API
url = f"https://api.cloudflare.com/client/v4/accounts/{CLOUDFLARE_ACCOUNT_ID}/ai/run/@cf/stabilityai/stable-diffusion-xl-base-1.0"

data = {
    "prompt": "A beautiful sunset over mountains, digital art style"
}

response = requests.post(url, headers=headers, json=data)
response.raise_for_status()

# The REST endpoint returns the PNG bytes directly for this model
with open("output.png", "wb") as f:
    f.write(response.content)
print("Image saved as output.png")

Method 2: Using Cloudflare Workers (Recommended)

Calling Workers AI from inside a Cloudflare Worker is simpler: the AI binding handles authentication, so there are no API Tokens to manage.

Step 1: Configure wrangler.toml

wrangler.toml
name = "my-ai-worker"
main = "src/index.js"
compatibility_date = "2024-01-01"

[ai]
binding = "AI"

Step 2: Write Worker Code

JavaScript
export default {
  async fetch(request, env) {
    // Use AI binding
    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "Explain Workers AI in one sentence." }
      ]
    });

    return new Response(JSON.stringify(response), {
      headers: { "content-type": "application/json" }
    });
  }
};

Step 3: Deploy

Bash
npx wrangler deploy

More Examples:

JavaScript
// Text embeddings
const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: "Cloudflare Workers AI is amazing"
});

// Image generation
const imageResponse = await env.AI.run("@cf/stabilityai/stable-diffusion-xl-base-1.0", {
  prompt: "A futuristic city"
});

// Speech-to-text
const transcription = await env.AI.run("@cf/openai/whisper", {
  audio: audioArrayBuffer
});

Method 3: Using OpenAI SDK (Compatible)

Cloudflare Workers AI exposes an OpenAI-compatible endpoint, so the OpenAI SDK works for chat completions and embeddings:

Python
from openai import OpenAI

# Configure Workers AI
client = OpenAI(
    api_key=CLOUDFLARE_API_TOKEN,
    base_url=f"https://api.cloudflare.com/client/v4/accounts/{CLOUDFLARE_ACCOUNT_ID}/ai/v1"
)

# Use OpenAI-like API
response = client.chat.completions.create(
    model="@cf/meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)

print(response.choices[0].message.content)
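
The compatibility layer also covers the embeddings endpoint; a short sketch using the same client:

Python
# Embeddings through the OpenAI-compatible endpoint
emb = client.embeddings.create(
    model="@cf/baai/bge-base-en-v1.5",
    input="Edge inference keeps latency low"
)
print(len(emb.data[0].embedding))  # 768 dimensions for bge-base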

cURL Examples

Basic Request:

Bash
curl https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/@cf/meta/llama-3.1-8b-instruct \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is edge computing?"}
    ]
  }'

Streaming Output:

Bash
curl https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/@cf/meta/llama-3.1-8b-instruct \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'

🌟 Core Advantages

Technical Advantages

  1. Ultra-Low Latency:

    • Deployed in 300+ cities globally
    • Average latency < 50ms (vs traditional cloud 100-300ms)
    • Nearby response, no cross-region transmission
  2. Excellent Cost-Performance:

    • 10,000 neurons free daily
    • Only $0.011/1000 neurons when exceeded
    • 80%+ cheaper than AWS and GCP
  3. Serverless Architecture:

    • No GPU or server management
    • Auto-scaling, pay-per-use
    • Zero operational costs

Comparison with Other AI APIs

| Feature | Workers AI | OpenAI API | Google AI Studio | AWS Bedrock |
|---|---|---|---|---|
| Free Tier | 10K neurons/day | Trial credits | Free usage | Trial credits |
| Edge Deployment | āœ… 300+ cities | āŒ | āŒ | āŒ |
| Latency | < 50ms | 100-300ms | 100-300ms | 100-300ms |
| Cost (LLM) | ~$0.05/1K requests | ~$0.50/1K requests | Free | ~$0.30/1K requests |
| Open Source Models | āœ… 50+ | āŒ | Partial | Partial |
| Serverless | āœ… | Self-managed | āŒ | Partial |

šŸ’” Practical Recommendations

āœ… Best Practices

  1. Prioritize Workers Bindings:

    // Use in Workers, simpler without managing Tokens
    const response = await env.AI.run(model, input);
  2. Combine with AI Gateway (a REST variant is sketched after this list):

    // Add caching, logging, retry features
    const response = await env.AI.run(model, input, {
      gateway: {
        id: "my-gateway",
        skipCache: false,
        cacheTtl: 3600
      }
    });
  3. Choose Appropriate Models:

    • Real-time chat: Llama-3-8B (fast)
    • Chinese tasks: Qwen-1.5-7B (Chinese optimized)
    • Math reasoning: DeepSeek-Math-7B (math expert)
    • Fast image generation: SD-XL-Lightning
  4. Monitor Neuron Usage:

    // Log consumption per request
    console.log(`Neurons used: ${response.neuronCount}`);
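
When calling the REST API rather than a Workers binding, you can route requests through AI Gateway by swapping the base URL. A sketch, assuming an account ID and API token in environment variables and a gateway named "my-gateway" created in the dashboard:

Python
import os
import requests

ACCOUNT_ID = os.environ["CLOUDFLARE_ACCOUNT_ID"]
API_TOKEN = os.environ["CLOUDFLARE_API_TOKEN"]

# AI Gateway endpoint for the Workers AI provider
url = (
    f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/my-gateway"
    "/workers-ai/@cf/meta/llama-3.1-8b-instruct"
)

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"messages": [{"role": "user", "content": "Hello!"}]},
    timeout=30,
)
print(resp.json())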

šŸŽÆ Usage Tips

Maximize Free Tier:

  • Daily 10,000 neurons can support ~1,000-2,000 LLM requests
  • Choose appropriate model size and input length
  • Use caching to reduce redundant requests (see the sketch below)
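
A minimal client-side cache sketch for the REST API, keyed on the request payload (in-memory only; swap in a persistent store for production use):

Python
import hashlib
import json
import requests

_cache: dict = {}

def cached_ai_call(url: str, headers: dict, payload: dict) -> dict:
    """Serve repeated identical prompts from a local cache to save neurons."""
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        resp.raise_for_status()
        _cache[key] = resp.json()
    return _cache[key]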

Cost Control Tips:

  • Use smaller models or shorter contexts during development/testing
  • Leverage AI Gateway for caching and rate limiting to reduce repeat calls
  • Set usage alerts in Dashboard to avoid unexpected overages
  • Conduct load testing for high-frequency scenarios to understand actual neuron consumption

Optimize Latency:

  • Leverage edge deployment for low-latency apps
  • Use streaming output to improve UX
  • Favor commonly used models, which are more likely to be already warm at the edge

Error Handling:

Python
import time
import requests

def call_workers_ai_with_retry(url, headers, data, max_retries=3):
    """API call with retry mechanism"""
    for attempt in range(max_retries):
        try:
            response = requests.post(url, headers=headers, json=data, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt
                print(f"Request failed, retrying in {wait_time}s... ({attempt + 1}/{max_retries})")
                time.sleep(wait_time)
            else:
                raise Exception(f"Request failed: {e}")

āš ļø Cautions

  1. Neuron Consumption: Different models consume different amounts; larger models and longer inputs consume more
  2. Model Availability: Some models may not be available in certain regions
  3. Input Limits: Note input length limits for each model
  4. Billing Cycle: Billing starts immediately when free tier is exceeded, monitor usage

šŸŽÆ Practical Use Cases

Case 1: Smart Customer Service Chatbot

Build a low-latency global customer service system:

JavaScript
export default {
  async fetch(request, env) {
    const { message } = await request.json();
    
    // Use Llama 3 for fast response
    const response = await env.AI.run(
      "@cf/meta/llama-3.1-8b-instruct",
      {
        messages: [
          {
            role: "system",
            content: "You are a helpful customer service agent."
          },
          { role: "user", content: message }
        ],
        stream: true
      }
    );

    return new Response(response, {
      headers: { "content-type": "text/event-stream" }
    });
  }
};

Case 2: Document Semantic Search

Build semantic search using embedding models:

JavaScript
export default {
  async fetch(request, env) {
    const { query } = await request.json();
    
    // Generate query embeddings
    const embeddings = await env.AI.run(
      "@cf/baai/bge-base-en-v1.5",
      { text: query }
    );
    
    // Use Vectorize for similarity search
    const results = await env.VECTORIZE.query(embeddings.data[0], {
      topK: 5
    });
    
    return Response.json(results);
  }
};
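
This case assumes a Vectorize index is bound to the Worker as VECTORIZE. A minimal binding sketch (the index name is hypothetical):

wrangler.toml
[ai]
binding = "AI"

[[vectorize]]
binding = "VECTORIZE"
index_name = "docs-index"  # hypothetical index created with `wrangler vectorize create`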

Case 3: Automated Image Generation

API receives text description and generates image:

Python
import hashlib
import requests

def generate_image(prompt, account_id, api_token):
    """Generate an image and save it to disk"""
    url = f"https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/@cf/bytedance/stable-diffusion-xl-lightning"
    
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json"
    }
    
    data = {
        "prompt": prompt,
        "num_steps": 4  # Lightning model, 4 steps are sufficient
    }
    
    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()
    
    # The REST endpoint returns the PNG bytes directly for this model;
    # derive a stable filename from the prompt
    filename = f"generated_{hashlib.md5(prompt.encode()).hexdigest()[:8]}.png"
    with open(filename, "wb") as f:
        f.write(response.content)
    
    return filename

# Usage example
image_path = generate_image(
    "A futuristic cityscape at sunset, cyberpunk style",
    "your-account-id",
    "your-api-token"
)
print(f"Image saved to: {image_path}")

Case 4: Speech-to-Text Service

JavaScript
export default {
  async fetch(request, env) {
    // Receive audio file
    const formData = await request.formData();
    const audioFile = formData.get('audio');
    const audioBuffer = await audioFile.arrayBuffer();
    
    // Use Whisper for transcription
    const transcription = await env.AI.run(
      "@cf/openai/whisper",
      {
        audio: [...new Uint8Array(audioBuffer)]
      }
    );
    
    return Response.json({
      text: transcription.text,
      language: transcription.language
    });
  }
};
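
To exercise this Worker, POST a multipart form with an audio field. A minimal client sketch (the Worker URL is hypothetical):

Python
import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "https://my-ai-worker.example.workers.dev",  # hypothetical deployed URL
        files={"audio": ("sample.wav", f, "audio/wav")},
    )
print(resp.json())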

šŸ”§ FAQ

Q: How do I check my neuron usage?
A: Log in to Cloudflare Dashboard, check AI usage and neuron consumption in the Workers & Pages section.

Q: When am I charged?
A: Only when you exceed the daily 10,000 neurons free tier, charged at $0.011/1000 neurons.

Q: Can I use my own models?
A: Currently, Workers AI only supports models from Cloudflare’s open-source catalog and doesn’t support custom model uploads.

Q: Which regions does Workers AI support?
A: Workers AI runs on Cloudflare’s global network, covering 300+ cities and is available almost globally. However, some models may be restricted in certain regions.

Q: How do I get the best performance?
A: 1) Use Workers bindings instead of REST API; 2) Choose appropriately sized models; 3) Enable streaming output; 4) Use AI Gateway caching.

Q: What’s the difference between Workers AI and OpenAI API?
A: Workers AI uses open-source models deployed at the edge with lower latency and cost. OpenAI API uses proprietary models (like GPT-4) with stronger capabilities but higher cost.


šŸ“ Update Log

  • January 2024: Added support for more Llama 3.1 and Mistral models
  • December 2023: Launched image generation and speech recognition models
  • September 2023: Official Workers AI launch with 10,000 neurons free daily

Service Provider: Cloudflare Workers AI
