API Error Handling and Retry Strategies for Production AI Applications
AI APIs are production infrastructure. They fail, throttle, timeout, and return unexpected responses. A robust error handling strategy is essential for applications that users depend on. This guide covers practical patterns for handling common AI API errors and keeping your application resilient.
The 5 Common Failure Modes
1. Rate Limit Errors (429)
Every provider has rate limits—requests per minute, tokens per minute, concurrent requests.
Symptoms: `429 Too Many Requests`, `rate_limit_exceeded`
Handling:
```python
import time
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)  # Exponential backoff
                    time.sleep(delay)
        return wrapper
    return decorator
```
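To see the decorator in action, here is a self-contained sketch with a simulated flaky call. The `RateLimitError` class and the tiny `base_delay` are stand-ins for a real SDK's exception and production-scale delays:

```python
import time
from functools import wraps

class RateLimitError(Exception):
    """Stand-in for a provider SDK's rate-limit exception."""

def retry_with_backoff(max_retries=3, base_delay=0.01):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator

calls = {"n": 0}

@retry_with_backoff(max_retries=3, base_delay=0.01)
def flaky_completion():
    # Fails twice with a simulated 429, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"

result = flaky_completion()  # succeeds on the third attempt
```

Because the delay doubles each attempt, three retries at `base_delay=1` would wait 1s, then 2s, then give up; many teams also add random jitter so clients don't retry in lockstep.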
2. Timeout Errors
Long requests (complex reasoning, large documents) can timeout.
Symptoms: `timeout`, `deadline_exceeded`, connection drops
Handling:
- Set explicit timeouts aligned to your needs
- Use streaming for long outputs to avoid timeout on full response
- Implement client-side timeout with fallback
```python
# Set an explicit timeout
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    timeout=30.0,  # 30 seconds
    stream=True,   # Avoid timeout on long responses
)
```
3. Context Length Errors (400)
Requests that exceed a model's context window are rejected with a 400 error.
Symptoms: `context_length_exceeded`, `maximum_context_length`
Handling:
- Pre-check token count before sending
- Implement smart truncation (keep system prompt, trim history)
- Use summarization to compress context
```python
def safe_context(messages, model, max_context):
    total = count_tokens(messages)
    if total > max_context:
        # Truncate oldest messages, keep system prompt
        messages = truncate_history(messages, max_context)
    return messages
```
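The `count_tokens` and `truncate_history` helpers above are left undefined; a minimal sketch of both follows, using a rough chars-per-token heuristic. In production you would count tokens with the provider's own tokenizer (e.g. `tiktoken` for OpenAI models) rather than this approximation:

```python
def approx_tokens(text):
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def truncate_history(messages, max_context):
    """Keep the system prompt; drop the oldest non-system turns first."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]

    budget = max_context - sum(approx_tokens(m["content"]) for m in system)
    kept = []
    for msg in reversed(history):  # walk newest-first
        cost = approx_tokens(msg["content"])
        if budget - cost < 0:
            break  # everything older than this is dropped
        budget -= cost
        kept.append(msg)
    return system + list(reversed(kept))
```

Walking newest-first means the most recent turns survive, which usually matters more for conversation quality than older history.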
4. Invalid Response Errors
Sometimes models return malformed or unexpected outputs.
Symptoms: Empty responses, truncated JSON, format mismatches
Handling:
- Validate response structure before using
- Implement structured output with JSON mode
- Add parsing fallbacks
```python
import json

# Use JSON mode for structured output
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    response_format={"type": "json_object"},
)

# Validate before using
try:
    data = json.loads(response.choices[0].message.content)
    if not validate_schema(data):
        retry_with_different_prompt()
except json.JSONDecodeError:
    handle_malformed_response()
```
5. Service Errors (500/502/503)
Provider infrastructure issues—temporary outages, overloaded servers.
Symptoms: `500 Internal Server Error`, `502 Bad Gateway`, `503 Service Unavailable`
Handling:
- These are usually temporary—retry with backoff
- Have a fallback model ready
- Implement circuit breaker pattern
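The production pattern in the next section assumes a `CircuitBreaker` class. A minimal sketch of that pattern (the threshold and timeout values are illustrative):

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures; retry after a cool-down."""

    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            # Cool-down elapsed: close the circuit and allow calls again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

While the circuit is open, callers skip the failing provider entirely instead of hammering it with doomed requests; fuller implementations add a half-open state that lets a single probe request through before fully closing.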
The Production Retry Pattern
A complete retry strategy combines multiple techniques:
```python
import time

class AIClient:
    def __init__(self, primary_model, fallback_model):
        self.primary = primary_model
        self.fallback = fallback_model
        self.circuit_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60)

    def complete(self, messages, max_retries=3):
        # Skip the primary model entirely while the circuit is open
        if self.circuit_breaker.is_open():
            return self._fallback_complete(messages)
        for attempt in range(max_retries):
            try:
                response = self._call_primary(messages)
                self.circuit_breaker.record_success()
                return response
            except RateLimitError:
                time.sleep(2 ** attempt)
            except (TimeoutError, ServiceError):
                if attempt == max_retries - 1:
                    self.circuit_breaker.record_failure()
                    return self._fallback_complete(messages)
                time.sleep(1)
            except ContextLengthError:
                messages = self._truncate_context(messages)
        return self._fallback_complete(messages)

    def _fallback_complete(self, messages):
        # Use a smaller/faster fallback model
        return client.chat.completions.create(
            model=self.fallback,
            messages=messages,
        )
```
Fallback Model Strategy
When your primary model fails, have intelligent fallbacks:
| Primary | Fallback | Use Case |
|---------|----------|----------|
| GPT-4o | GPT-4o-mini | Complex → simple tasks |
| Claude Opus | Claude Haiku | Expensive → fast |
| DeepSeek V4 | DeepSeek V3 | Premium → standard |
| Any | Gemini Flash | Provider diversity |
Key principle: Fallback should be cheaper and faster, not just different.
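The table above can be encoded as an ordered fallback chain. A sketch, where `call_model` and `ProviderError` are hypothetical stand-ins for your SDK wrapper and its retriable exceptions:

```python
class ProviderError(Exception):
    """Stand-in for any retriable provider failure (429/5xx/timeout)."""

# Ordered most-capable to cheapest; the last entry is the safety net.
FALLBACK_CHAIN = ["gpt-4o", "gpt-4o-mini", "gemini-flash"]

def complete_with_fallback(messages, call_model, chain=FALLBACK_CHAIN):
    last_error = None
    for model in chain:
        try:
            return model, call_model(model, messages)
        except ProviderError as e:
            last_error = e  # fall through to the next, cheaper model
    raise last_error
```

Returning the model name alongside the response lets you log which tier actually served each request, which is useful for the monitoring discussed below.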
Monitoring and Alerting
Track error rates to detect issues before users complain:
```python
from datetime import datetime

# Log every API call
def log_api_call(model, success, error_type=None, latency=None):
    metrics = {
        "model": model,
        "success": success,
        "error_type": error_type,
        "latency": latency,
        "timestamp": datetime.now(),
    }
    send_to_monitoring(metrics)

# Alert thresholds
if error_rate > 0.05:  # 5% error rate
    alert_team("API error rate elevated")
if rate_limit_hits_per_hour > 100:
    alert_team("Rate limit threshold exceeded")
```
Best Practices Summary
1. Always retry rate limits with exponential backoff
2. Set explicit timeouts for every request
3. Validate responses before using them
4. Have fallback models ready for provider failures
5. Track token counts to prevent context errors
6. Use streaming for long responses
7. Implement circuit breaker for persistent failures
8. Monitor error rates and alert proactively
Next Steps
- Read the docs for API specifications
- Get an API key to test these patterns
- Compare models for primary/fallback planning
Robust error handling is what separates demo apps from production systems. These patterns will keep your AI application running even when APIs behave unexpectedly.