API Error Handling and Retry Strategies for Production AI Applications

AI APIs are production infrastructure. They fail, throttle, time out, and return unexpected responses. A robust error handling strategy is essential for applications that users depend on. This guide covers practical patterns for handling common AI API errors and keeping your application resilient.

The 5 Common Failure Modes

1. Rate Limit Errors (429)

Every provider has rate limits—requests per minute, tokens per minute, concurrent requests.

Symptoms: `429 Too Many Requests`, `rate_limit_exceeded`

Handling:
```python
import time
from functools import wraps

# RateLimitError comes from your provider's SDK (e.g. openai.RateLimitError)

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)  # Exponential backoff
                    time.sleep(delay)
            return None
        return wrapper
    return decorator
```

2. Timeout Errors

Long requests (complex reasoning, large documents) can time out.

Symptoms: `timeout`, `deadline_exceeded`, connection drops

Handling:
- Set explicit timeouts aligned to your needs
- Use streaming for long outputs to avoid timeout on full response
- Implement client-side timeout with fallback

```python
# Set an explicit timeout and stream to avoid waiting on the full response
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    timeout=30.0,  # 30 seconds
    stream=True,   # Avoid timeout on long responses
)
```

3. Context Length Errors (400)

Exceeding model context limits returns errors.

Symptoms: `context_length_exceeded`, `maximum_context_length`

Handling:
- Pre-check token count before sending
- Implement smart truncation (keep system prompt, trim history)
- Use summarization to compress context

```python
def safe_context(messages, model, max_context):
    total = count_tokens(messages)
    if total > max_context:
        # Truncate oldest messages, keep system prompt
        messages = truncate_history(messages, max_context)
    return messages
```
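`count_tokens` and `truncate_history` above are placeholders. One possible `truncate_history`, assuming messages are dicts with `role` and `content` keys and using a rough character-based token estimate (swap in your provider's real tokenizer), is:

```python
def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token; replace with a real tokenizer
    return sum(len(m["content"]) // 4 for m in messages)

def truncate_history(messages, max_context):
    """Drop the oldest non-system messages until the estimate fits the limit."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    while history and estimate_tokens(system + history) > max_context:
        history.pop(0)  # drop the oldest conversational turn first
    return system + history
```

The key design choice: the system prompt is preserved unconditionally, because dropping it silently changes model behavior mid-conversation.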

4. Invalid Response Errors

Sometimes models return malformed or unexpected outputs.

Symptoms: Empty responses, truncated JSON, format mismatches

Handling:
- Validate response structure before using
- Implement structured output with JSON mode
- Add parsing fallbacks

```python
import json

# Use JSON mode for structured output
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    response_format={"type": "json_object"},
)

# Validate before using
try:
    data = json.loads(response.choices[0].message.content)
    if not validate_schema(data):
        retry_with_different_prompt()
except json.JSONDecodeError:
    handle_malformed_response()
```
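`validate_schema` above is a placeholder for whatever structural check your application needs. A minimal version just verifies the required keys are present (the key names here are illustrative, not part of any API):

```python
def validate_schema(data, required_keys=("summary", "confidence")):
    """Minimal structural validation: a dict containing all required keys."""
    return isinstance(data, dict) and all(k in data for k in required_keys)
```

For anything beyond trivial shapes, a library like `jsonschema` or `pydantic` is a better fit than hand-rolled checks.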

5. Service Errors (500/502/503)

Provider infrastructure issues—temporary outages, overloaded servers.

Symptoms: `500 Internal Server Error`, `502 Bad Gateway`, `503 Service Unavailable`

Handling:
- These are usually temporary—retry with backoff
- Have a fallback model ready
- Implement circuit breaker pattern
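The `CircuitBreaker` used in the retry pattern below is not a standard library class; a minimal sketch (thresholds and method names match how it is used below) might look like:

```python
import time

class CircuitBreaker:
    """Opens after repeated failures, blocking calls until a recovery timeout passes."""

    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at >= self.recovery_timeout:
            # Recovery window elapsed: reset and allow a trial request
            self.opened_at = None
            self.failure_count = 0
            return False
        return True

    def record_success(self):
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.time()
```

Production implementations usually add a half-open state that limits trial traffic after recovery; this sketch simply resets once the timeout elapses.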

The Production Retry Pattern

A complete retry strategy combines multiple techniques:

```python
import time

class AIClient:
    def __init__(self, primary_model, fallback_model):
        self.primary = primary_model
        self.fallback = fallback_model
        self.circuit_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60)

    def complete(self, messages, max_retries=3):
        # Skip the primary model entirely while the breaker is open
        if self.circuit_breaker.is_open():
            return self._fallback_complete(messages)

        for attempt in range(max_retries):
            try:
                response = self._call_primary(messages)
                self.circuit_breaker.record_success()
                return response
            except RateLimitError:
                time.sleep(2 ** attempt)  # Exponential backoff
            except (TimeoutError, ServiceError):
                if attempt == max_retries - 1:
                    self.circuit_breaker.record_failure()
                    return self._fallback_complete(messages)
                time.sleep(1)
            except ContextLengthError:
                messages = self._truncate_context(messages)
                continue

        return self._fallback_complete(messages)

    def _fallback_complete(self, messages):
        # Use smaller/faster fallback model
        return client.chat.completions.create(
            model=self.fallback,
            messages=messages,
        )
```

Fallback Model Strategy

When your primary model fails, have intelligent fallbacks:

| Primary | Fallback | Use Case |
|---------|----------|----------|
| GPT-4o | GPT-4o-mini | Complex → simple tasks |
| Claude Opus | Claude Haiku | Expensive → fast |
| DeepSeek V4 | DeepSeek V3 | Premium → standard |
| Any | Gemini Flash | Provider diversity |

Key principle: Fallback should be cheaper and faster, not just different.

Monitoring and Alerting

Track error rates to detect issues before users complain:

```python
from datetime import datetime

# Log every API call, success or failure
def log_api_call(model, success, error_type=None, latency=None):
    metrics = {
        "model": model,
        "success": success,
        "error_type": error_type,
        "latency": latency,
        "timestamp": datetime.now(),
    }
    send_to_monitoring(metrics)

# Alert thresholds
if error_rate > 0.05:  # 5% error rate
    alert_team("API error rate elevated")
if rate_limit_hits > 100:  # per hour
    alert_team("Rate limit threshold exceeded")
```

Best Practices Summary

1. Always retry rate limits with exponential backoff
2. Set explicit timeouts for every request
3. Validate responses before using them
4. Have fallback models ready for provider failures
5. Track token counts to prevent context errors
6. Use streaming for long responses
7. Implement circuit breaker for persistent failures
8. Monitor error rates and alert proactively

Next Steps

- Read the docs for API specifications
- Get an API key to test these patterns
- Compare models for primary/fallback planning

Robust error handling is what separates demo apps from production systems. These patterns will keep your AI application running even when APIs behave unexpectedly.