Groq API - Ultra-Fast Free AI API Service

📋 Service Information

Provider: Groq
Service Type: API Service
API Endpoint: https://api.groq.com/openai/v1
Free Tier: Free Service (with usage limits)
API Compatibility: Fully compatible with OpenAI API


🎯 Service Overview

Groq API provides the industry's fastest LLM inference service. It is fully compatible with the OpenAI API format, so you can switch existing code to Groq by changing only the base_url.

Key Advantages:

  • ⚡ Industry Fastest - 800+ tokens/s inference speed
  • 🔄 OpenAI Compatible - Seamless code migration
  • 🎁 High Quota - Typical quota of ~14,400 requests/day
  • 💰 Free Service - No payment required
  • 🚀 Low Latency - Millisecond-scale first-token response

🚀 Quick Start

Prerequisites

Required:

  • ✅ A Groq API key created

For detailed steps, see: Groq Registration Guide

5-Minute Quick Example

Method 1: Using the Groq SDK (Recommended)

Bash
pip install groq
Python
from groq import Groq

# Initialize client
client = Groq(api_key="YOUR_API_KEY")

# Send request
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Explain Groq's LPU technology",
        }
    ],
    model="llama-3.3-70b-versatile",
)

# Print response
print(chat_completion.choices[0].message.content)

Method 2: Using the OpenAI SDK (Compatibility Mode)

Python
from openai import OpenAI

# Just modify base_url and api_key
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_API_KEY"
)

# Rest of the code is identical to OpenAI
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)

print(response.choices[0].message.content)

🤖 Supported Models

Llama Series (Recommended)

| Model ID | Parameters | Context | Speed | Use Cases |
| --- | --- | --- | --- | --- |
| llama-3.3-70b-versatile | 70B | 128K | ⚡⚡⚡ | General tasks, best balance |
| llama-3.1-70b-versatile | 70B | 128K | ⚡⚡⚡ | Complex tasks |
| llama-3.1-8b-instant | 8B | 128K | ⚡⚡⚡⚡ | Fast responses |

Other Models

| Model ID | Parameters | Context | Features |
| --- | --- | --- | --- |
| mixtral-8x7b-32768 | 47B | 32K | Mistral Mixture of Experts |
| gemma2-9b-it | 9B | 8K | Google open-source model |
| deepseek-r1-distill-llama-70b | 70B | 32K | Reasoning expert |
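
Model availability changes over time, so it can be safer to query the list at runtime than to hard-code IDs. A minimal sketch, assuming the groq SDK mirrors the OpenAI-style models.list() interface:

Python
from groq import Groq

client = Groq(api_key="YOUR_API_KEY")

# Print the model IDs currently available to this account
models = client.models.list()
for model in models.data:
    print(model.id)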

🔢 Quotas and Limits

API Quotas

| Limit Type | Typical Quota | Notes |
| --- | --- | --- |
| Daily Requests | ~14,400 requests | Shared across all models |
| Requests/Minute | ~30 requests | Shared across all models |
| Daily Tokens | ~20,000-1,000,000 | Varies by model |
| Tokens/Minute | ~6,000 tokens | Input + output combined |
| Max Context | 128K tokens | Llama 3.x series |
| Max Output | 8,192 tokens | Per request |

Note: Actual quotas vary by account; check your console for current limits.

⚠️ Important Limitations

  1. Quota Sharing: All models share the same account quota
  2. Rate Limits: Requests beyond the limit return a 429 error
  3. Daily Reset: Quotas reset at 00:00 UTC
  4. Concurrency Limit: Recommended maximum of 10 concurrent requests (see the semaphore sketch below)
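
The concurrency recommendation in item 4 can be enforced with a semaphore. A minimal sketch using the AsyncGroq client (the cap of 10 follows the recommendation above; adjust it to your account's limits):

Python
import asyncio
from groq import AsyncGroq

client = AsyncGroq(api_key="YOUR_API_KEY")
semaphore = asyncio.Semaphore(10)  # cap in-flight requests per limitation #4

async def limited_request(prompt):
    # Wait for a free slot before sending the request
    async with semaphore:
        return await client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": prompt}],
        )

async def main():
    prompts = [f"Summarize topic {i}" for i in range(50)]
    results = await asyncio.gather(*(limited_request(p) for p in prompts))
    print(len(results), "responses received")

asyncio.run(main())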

📖 API Usage Examples

1. Basic Conversation

Python
from groq import Groq

client = Groq(api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "What is machine learning?"}
    ]
)

print(response.choices[0].message.content)

2. Streaming Output

Python
from groq import Groq

client = Groq(api_key="YOUR_API_KEY")

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Write an article about AI"}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

3. Multi-turn Conversation

Python
from groq import Groq

client = Groq(api_key="YOUR_API_KEY")

messages = [
    {"role": "system", "content": "You are a Python programming assistant"}
]

# First round
messages.append({"role": "user", "content": "How to read a file?"})
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=messages
)
messages.append({"role": "assistant", "content": response.choices[0].message.content})

# Second round (model remembers context)
messages.append({"role": "user", "content": "What about writing?"})
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=messages
)
print(response.choices[0].message.content)

4. Adjust Generation Parameters

Python
from groq import Groq

client = Groq(api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Write a poem"}
    ],
    temperature=0.9,        # Increase creativity
    max_tokens=1024,        # Limit output length
    top_p=0.95,            # Nucleus sampling
    frequency_penalty=0.5,  # Reduce repetition
    presence_penalty=0.5    # Increase topic diversity
)

print(response.choices[0].message.content)

5. Using cURL

Bash
curl "https://api.groq.com/openai/v1/chat/completions" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [
      {
        "role": "user",
        "content": "Hello, introduce Groq"
      }
    ]
  }'

6. Node.js Example

Bash
npm install groq-sdk
JavaScript
const Groq = require("groq-sdk");

const groq = new Groq({
  apiKey: "YOUR_API_KEY"
});

async function main() {
  const chatCompletion = await groq.chat.completions.create({
    messages: [
      { role: "user", content: "Introduce Groq's LPU technology" }
    ],
    model: "llama-3.3-70b-versatile",
  });

  console.log(chatCompletion.choices[0].message.content);
}

main();

💡 Best Practices

✅ Recommended Practices

  1. Choose the Right Model

    # Fast tasks
    model = "llama-3.1-8b-instant"
    
    # Balanced performance
    model = "llama-3.3-70b-versatile"
    
    # Math/code reasoning
    model = "deepseek-r1-distill-llama-70b"
  2. Implement Error Handling and Retries

    import time
    from groq import Groq, RateLimitError
    
    client = Groq(api_key="YOUR_API_KEY")
    
    def call_with_retry(messages, max_retries=3):
        for i in range(max_retries):
            try:
                return client.chat.completions.create(
                    model="llama-3.3-70b-versatile",
                    messages=messages
                )
            except RateLimitError:
                if i < max_retries - 1:
                    wait_time = 2 ** i
                    print(f"Rate limited, waiting {wait_time} seconds...")
                    time.sleep(wait_time)
                else:
                    raise
  3. Monitor Quota Usage

    import logging
    
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    
    def track_usage(response):
        usage = response.usage
        logger.info(f"Tokens used: "
                    f"input={usage.prompt_tokens}, "
                    f"output={usage.completion_tokens}, "
                    f"total={usage.total_tokens}")
        return response
  4. Use Streaming to Optimize Experience

    def stream_chat(prompt):
        stream = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
    
        full_response = ""
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
    
        return full_response
  5. Securely Manage API Keys

    import os
    from dotenv import load_dotenv
    
    # Use environment variables
    load_dotenv()
    api_key = os.getenv('GROQ_API_KEY')
    
    client = Groq(api_key=api_key)

🔧 Common Issues

1. 429 Error: Rate Limit Exceeded

Cause: You exceeded the per-minute or daily request quota

Solution:

Python
import time
from groq import RateLimitError

try:
    response = client.chat.completions.create(...)
except RateLimitError as e:
    print(f"Rate limited: {e}")
    print("Retry after waiting...")
    time.sleep(60)  # Wait 1 minute
    response = client.chat.completions.create(...)

2. 401 Error: Invalid API Key

Cause: API key is invalid or expired

Solution:

  1. Check that the API key is copied correctly (a quick programmatic check is shown below)
  2. Regenerate the key in the console
  3. Ensure the key has no leading or trailing spaces
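
A quick way to verify a key without spending tokens is to call the OpenAI-compatible /models endpoint; a minimal sketch using requests:

Python
import requests

resp = requests.get(
    "https://api.groq.com/openai/v1/models",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
# 200 means the key is valid; 401 means it is invalid or revoked
print(resp.status_code)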

3. Network Timeout

Solution:

Python
from groq import Groq

client = Groq(
    api_key="YOUR_API_KEY",
    timeout=60.0  # Set 60 second timeout
)

4. How to Check Remaining Quota?

Method:

  1. Log in to https://console.groq.com
  2. Open the "Usage" page
  3. Review your current quota usage
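
Alternatively, rate-limit status is reported in API response headers. A minimal sketch with requests; the x-ratelimit-* header names are assumed from Groq's rate-limit documentation, so confirm them against the current docs:

Python
import requests

resp = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "llama-3.1-8b-instant",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    },
)
# Header names assumed from Groq's rate-limit docs
for name in ("x-ratelimit-remaining-requests", "x-ratelimit-remaining-tokens"):
    print(name, "=", resp.headers.get(name))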

📊 Performance Optimization

1. Batch Processing

Python
import asyncio
from groq import AsyncGroq

async def batch_process(questions):
    client = AsyncGroq(api_key="YOUR_API_KEY")
    
    # Fire requests concurrently; keep batch sizes within the
    # per-minute request limit (see Quotas and Limits above)
    tasks = []
    for q in questions:
        task = client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": q}]
        )
        tasks.append(task)
    
    return await asyncio.gather(*tasks)

# Usage
questions = ["Question 1", "Question 2", "Question 3"]
responses = asyncio.run(batch_process(questions))

2. Cache Results

Python
import hashlib
import json

from groq import Groq

client = Groq(api_key="YOUR_API_KEY")

# Simple in-memory cache keyed by a hash of the messages
cache = {}

def get_cache_key(messages):
    return hashlib.md5(
        json.dumps(messages, sort_keys=True).encode()
    ).hexdigest()

def cached_completion(messages):
    key = get_cache_key(messages)
    
    if key in cache:
        print("Using cached result")
        return cache[key]
    
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages
    )
    
    cache[key] = response
    return response

3. Choose Appropriate Model

Python
def select_model(task_complexity):
    if task_complexity == "simple":
        return "llama-3.1-8b-instant"  # Fastest
    elif task_complexity == "medium":
        return "llama-3.3-70b-versatile"  # Balanced
    elif task_complexity == "complex":
        return "deepseek-r1-distill-llama-70b"  # Reasoning
    # Fall back to the balanced model for unrecognized inputs
    return "llama-3.3-70b-versatile"

📚 Related Resources

Official Documentation

SDKs and Tools

Example Code


🌟 Practical Cases

Case 1: Real-time Chatbot

Python
import streamlit as st
from groq import Groq

client = Groq(api_key="YOUR_API_KEY")

st.title("Groq Ultra-fast Chatbot")

if "messages" not in st.session_state:
    st.session_state.messages = []

# Display message history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# User input
if prompt := st.chat_input("Enter your question..."):
    # Add user message
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
    
    # Generate response
    with st.chat_message("assistant"):
        stream = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=st.session_state.messages,
            stream=True
        )
        
        response = st.write_stream(
            (chunk.choices[0].delta.content for chunk in stream 
             if chunk.choices[0].delta.content)
        )
    
    # Add assistant message
    st.session_state.messages.append({"role": "assistant", "content": response})

Case 2: Code Assistant

Python
from groq import Groq

client = Groq(api_key="YOUR_API_KEY")

def code_assistant(language, task):
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {
                "role": "system",
                "content": f"You are a professional {language} programming assistant. "
                           f"Provide clear code examples and explanations."
            },
            {
                "role": "user",
                "content": task
            }
        ],
        temperature=0.2  # Lower temperature for more deterministic code
    )
    
    return response.choices[0].message.content

# Usage
result = code_assistant("Python", "Implement a binary search algorithm")
print(result)

Service Provider: Groq
