6 LLM integration patterns for existing codebases (without a full rewrite)
According to the 2026 State of AI Infrastructure Report by DDN, 54% of enterprises have delayed or cancelled AI projects in the past two years – often because they approached AI as a full-stack transformation rather than a targeted integration. The organizations succeeding with LLM adoption share a common trait: they’re not rewriting their systems. They’re augmenting them.
This article walks through six proven patterns for adding LLM capabilities to your existing systems. Whether you’re running a decade-old monolith or a sprawling microservices landscape, there’s a path forward that doesn’t involve rewriting your core.

The mindset shift: LLM as a layer, not a replacement
Before diving into patterns, let’s establish a key principle: LLM integration should be functional augmentation, not architectural revolution.
Think about how GitHub Copilot works. It doesn’t replace your editor – it sits alongside it, offering suggestions within the existing developer workflow. Products like Notion integrated AI into existing workflows and interfaces instead of turning it into a separate product experience. Salesforce Einstein GPT augments CRM workflows by adding generative capabilities to existing customer data, rather than requiring users to adopt a separate AI system.
The pattern is consistent: LLM as an overlay, not an overhaul.
This matters because it changes the conversation with stakeholders. You’re not asking for budget to rebuild. You’re proposing to add a capability layer that enhances what’s already working.
What 2026 demands from production-grade integration
Let’s be clear about what “production-ready” means in 2026. Every LLM integration in a serious codebase needs to address:
Structured outputs and schema enforcement
An “almost correct” data structure from an LLM is still broken input for your system. When output feeds into deterministic business logic, you need guaranteed schema adherence. OpenAI’s Structured Outputs (not just JSON mode) and similar features from other providers enforce this at the API level. If you’re parsing LLM responses into typed objects, this is non-negotiable.
Observability
No LLM integration without observability. This means tracing prompts and responses, tracking token usage and latency per endpoint, monitoring cost, and debugging retrieval/inference flows. Tools like Langfuse, Helicone, and Arize are standard infrastructure now.
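Even before adopting a dedicated tool, basic call-level observability is a few lines of code. A minimal sketch with stdlib logging (the record fields and the `observe` decorator are illustrative; in production you would ship these records to Langfuse, Helicone, or similar):

```python
import functools
import logging
import time

logger = logging.getLogger("llm.calls")

def observe(endpoint: str):
    # Decorator that logs latency and token usage for each LLM call
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            latency_ms = (time.perf_counter() - start) * 1000
            logger.info(
                "endpoint=%s latency_ms=%.1f tokens=%s",
                endpoint, latency_ms, getattr(result, "total_tokens", None),
            )
            return result
        return wrapper
    return decorator

@observe("summarize")
def fake_llm_call(text: str) -> str:
    return text[:10]  # stand-in for a real completion call
```

Wrapping every LLM entry point this way gives you per-endpoint latency and cost data from day one, with no vendor dependency.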
Prompt versioning and management
Treat prompts like code. Version them, review them, test them. Prompt drift is real, and rollback capability is essential when a prompt change breaks downstream logic.
Evaluation loops
How do you know the LLM is performing well? Define metrics upfront (e.g. accuracy against labeled data, latency, user satisfaction signals) and measure continuously.
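The simplest evaluation loop is accuracy against labeled examples. A minimal sketch, where `predict_fn` stands in for your LLM-backed function and the examples are invented for illustration:

```python
def evaluate_accuracy(predict_fn, labeled_examples) -> float:
    # Fraction of labeled inputs where the prediction matches the label
    correct = sum(
        1 for inp, expected in labeled_examples if predict_fn(inp) == expected
    )
    return correct / len(labeled_examples)

# Usage with a trivial stand-in classifier
examples = [
    ("refund please", "billing"),
    ("app crashes", "bug"),
    ("love it", "praise"),
]
predict = lambda text: "billing" if "refund" in text else "bug"
score = evaluate_accuracy(predict, examples)  # 2 of 3 correct
```

Run this on every prompt or model change and alert when the score drops; that is the evaluation loop, everything else is tooling.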
Privacy controls
Before sending user data to external LLM APIs, implement PII masking. GDPR and compliance teams will thank you.
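A minimal sketch of regex-based masking applied before a prompt leaves your infrastructure. The patterns below cover only obvious formats (emails, phone numbers); production deployments typically combine regexes with NER-based PII detectors.

```python
import re

# Order matters: mask emails first so the phone pattern can't eat their digits
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]

def mask_pii(text: str) -> str:
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

masked = mask_pii("Contact jane.doe@example.com or +1 555 123 4567")
```

Call `mask_pii` on user-supplied content in the request path, before it is interpolated into any prompt that goes to an external API.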
These aren’t “nice to haves” anymore. They’re table stakes for any team that wants to ship LLM features without creating operational nightmares.
Pattern 1: Sidecar / Wrapper
How it works
The LLM runs as an auxiliary service alongside your existing microservice. Your main application logic remains untouched while the sidecar handles all AI-related processing and exposes a simple API for your service to call when needed.
```
┌─────────────────┐      ┌─────────────────┐
│  Your Service   │────▶│   LLM Sidecar   │
│  (unchanged)    │◀────│  (new service)  │
└─────────────────┘      └─────────────────┘
```
When to use
- Adding AI-generated responses to existing support ticket systems
- Augmenting search results with semantic understanding
- Generating summaries or translations for content already in your system
Implementation example
```python
# llm_sidecar/main.py
# A separate microservice that handles all LLM calls
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()

# Response structure — enforces consistent output format
class SupportResponse(BaseModel):
    response_text: str
    confidence: float
    suggested_tags: list[str]

# Endpoint called by your main application
@app.post("/generate-response")
async def generate_support_response(ticket: dict) -> SupportResponse:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Generate a helpful support response."},
            {"role": "user", "content": ticket["description"]},
        ],
        response_format=SupportResponse,
        temperature=0.3,
    )
    return completion.choices[0].message.parsed
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Sidecar timeout or model error leaves main service waiting. Always set aggressive timeouts and define fallback behavior. |
| Latency fit | Acceptable for async or semi-sync flows (e.g., ticket response generation). Not ideal for sub-100ms user-facing paths. |
| Control points | Timeout (2-5s max), structured output schema, fallback to template response, request/response logging, rate limiting. |
Tools
OpenAI API with Structured Outputs, Anthropic Claude API, Ollama for local models, BentoML for model serving.
Pro tip: For latency-sensitive use cases, consider running a local model (Mistral, Llama) through Ollama. You control the infrastructure and eliminate external API dependencies.
Pattern 2: Middleware / Interceptor
How it works
The LLM is inserted into your request pipeline as middleware. It processes requests before they hit your business logic (pre-processing) or enriches responses before they’re sent to clients (post-processing).
```
Request → [LLM Middleware] → Business Logic → [LLM Middleware] → Response
```
When to use
- Semantic validation of user input before processing
- Automatic query rewriting (natural language → SQL, GraphQL)
- Response enrichment (adding context, translations, summaries)
- PII detection and masking before data reaches your backend
Implementation example
```python
# middleware/llm_interceptor.py
from fastapi import Request
from starlette.middleware.base import BaseHTTPMiddleware
from openai import AsyncOpenAI
from pydantic import BaseModel

client = AsyncOpenAI()

class SearchIntent(BaseModel):
    category: str | None
    color: str | None
    max_price: float | None
    keywords: list[str]

class LLMEnrichmentMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # Pre-processing: enrich search requests with structured intent
        if request.url.path == "/search":
            body = await request.json()
            try:
                structured_query = await self.extract_search_intent(body["query"])
                request.state.structured_query = structured_query
            except Exception:
                # Fallback: pass raw query through if LLM fails
                request.state.structured_query = None
        # Continue to your business logic
        response = await call_next(request)
        return response

    async def extract_search_intent(self, natural_query: str) -> SearchIntent:
        # LLM converts "red shoes under $100" → structured SearchIntent
        completion = await client.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract search intent from natural language query."},
                {"role": "user", "content": natural_query},
            ],
            response_format=SearchIntent,
        )
        return completion.choices[0].message.parsed
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Middleware timeout blocks entire request. Schema validation failure on LLM output corrupts downstream logic. |
| Latency fit | Only for paths where 200-500ms added latency is acceptable. Never on checkout or payment flows. |
| Control points | Path matching (don’t run on every request), strict timeout (1-2s), schema validation with Pydantic/Zod, graceful fallback to passthrough, per-path observability. |
Watch out: Never put synchronous LLM calls in middleware that runs on every request. Use path matching to limit scope. Always define what happens when the LLM fails or times out.
Pattern 3: Feature Flag + Shadow Mode
How it works
You deploy the LLM integration behind a feature flag. In shadow mode, the LLM processes requests in parallel with your existing logic, but its output is logged – not served to users. This lets you compare accuracy, latency, and cost before going live.
```
Request → Existing Logic → Response (served)
       └→ LLM Logic → Logged (not served)
```
When to use
- Validating LLM accuracy against your current system
- A/B testing AI-generated content vs. human-written
- Gradual rollout to percentage of users
- Building confidence with stakeholders before full deployment
Implementation example
```python
# handlers/support_ticket.py
from feature_flags import is_enabled, get_variant

async def handle_ticket(ticket: dict):
    # Always run existing logic first
    existing_response = await legacy_response_generator(ticket)

    # Check if LLM integration is enabled via feature flag
    if is_enabled("llm_support_responses"):
        try:
            llm_response = await llm_sidecar.generate_response(ticket)

            # Shadow mode: compare outputs without affecting users
            if get_variant("llm_support_responses") == "shadow":
                await log_comparison(
                    ticket_id=ticket["id"],
                    existing=existing_response,
                    llm=llm_response,
                    latency_delta_ms=llm_response.latency - existing_response.latency,
                )
                return existing_response  # Users still get the old response

            # Live mode: serve LLM response to users
            return llm_response
        except Exception as e:
            log_llm_failure(e)
            return existing_response  # Fallback on any LLM failure

    return existing_response
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Shadow mode doubles compute cost. Comparison metrics poorly defined → false confidence in rollout readiness. |
| Latency fit | Shadow path is async/fire-and-forget. No latency impact on served response. |
| Control points | Feature flag granularity (user %, geo, account tier), structured comparison logging, cost tracking per variant, automatic rollback triggers. |
Tools
LaunchDarkly, Optimizely, Unleash, Flipper, or a simple Redis-backed flag store.
Pro tip: Define your comparison metrics upfront. Track response time, token cost, user satisfaction (if measurable), and accuracy (if you have labeled data). Don’t roll out based on vibes.
Pattern 4: API Gateway with LLM
How it works
A centralized gateway handles all LLM traffic. Your services don’t call OpenAI or Claude directly – they call your AI Gateway, which manages routing, rate limiting, key rotation, prompt templates, and cost tracking.
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Service A   │────▶│              │────▶│    OpenAI    │
├──────────────┤      │              │      ├──────────────┤
│  Service B   │────▶│  AI Gateway  │────▶│    Claude    │
├──────────────┤      │              │      ├──────────────┤
│  Service C   │────▶│              │────▶│   Local LLM  │
└──────────────┘      └──────────────┘      └──────────────┘
```
When to use
- Multiple services need LLM access
- You need centralized cost control and observability
- Compliance requires audit logs of all prompts and responses
- You want to swap models without changing service code
Implementation example
```python
# ai_gateway/main.py
from fastapi import FastAPI, Header, HTTPException
from litellm import acompletion
import hashlib

app = FastAPI()

# Centralized prompt management
PROMPT_TEMPLATES = {
    "support_response": "You are a helpful support agent...",
    "summarize": "Summarize the following text concisely...",
}

response_cache = {}

@app.post("/v1/complete")
async def unified_completion(
    request: dict,
    x_service_name: str = Header(...),      # Identifies calling service
    x_prompt_template: str = Header(None),  # Optional template key
    x_cache_ttl: int = Header(0),           # Cache duration in seconds
):
    # Rate limiting per service
    if not await rate_limiter.check(x_service_name):
        raise HTTPException(429, "Rate limit exceeded")

    # Check cache for repeated requests
    cache_key = hashlib.sha256(str(request).encode()).hexdigest()
    if x_cache_ttl > 0 and cache_key in response_cache:
        return response_cache[cache_key]

    # Apply centralized template if specified
    system_prompt = PROMPT_TEMPLATES.get(x_prompt_template, request.get("system"))
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": request["prompt"]},
    ]

    # Route to model with automatic fallback
    try:
        response = await acompletion(
            model=request.get("model", "gpt-4o"),
            messages=messages,
        )
    except Exception:
        # Fallback to secondary provider
        response = await acompletion(
            model="claude-3-haiku-20240307",
            messages=messages,
        )

    # Log for cost tracking and compliance
    await log_request(x_service_name, request, response)

    # Cache if requested
    if x_cache_ttl > 0:
        response_cache[cache_key] = response

    return response
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Single point of failure. Gateway outage = all AI features down. Requires HA deployment. |
| Latency fit | Adds 10-50ms overhead. Acceptable for most use cases. |
| Control points | Per-service rate limits, prompt template versioning, model fallback chain, response caching, full audit logging, cost dashboards, PII filtering before external calls. |
Tools
Kong, Tyk, custom FastAPI gateway, LiteLLM Router, Portkey, Helicone.
2026 trend: AI Gateways are becoming standard infrastructure. They handle prompt versioning, A/B testing between models, automatic fallback (GPT-4 → Claude → local), and real-time cost dashboards. If you’re integrating LLM across multiple services, build this early.
Pattern 5: Event-Driven / Async Processing
How it works
The LLM operates asynchronously, triggered by events in a message queue. It processes work in the background without blocking user-facing requests.
```
User Action → Queue (Kafka/SQS) → LLM Worker → Result Store → Notification
```
When to use
- Batch processing (summarizing daily logs, generating reports)
- Non-blocking enrichment (recommendations sent after purchase)
- Heavy processing that would timeout in synchronous flow
- Cost optimization through batching
Implementation example
```python
# workers/llm_processor.py
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'content-to-summarize',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
)

def process_batch(messages: list):
    # Batch multiple items into single LLM call for efficiency
    combined_prompt = "\n---\n".join([m["content"] for m in messages])
    response = llm_client.complete(
        prompt=f"Summarize each section separated by ---:\n{combined_prompt}",
        response_format=BatchSummaryResponse,
    )
    # Store results with full tracing
    for msg, summary in zip(messages, response.summaries):
        result_store.save(
            id=msg["id"],
            summary=summary,
            trace_id=response.trace_id,
            tokens_used=response.usage.total_tokens,
        )

# Batch processing: collect 10 messages, then process
batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 10:
        process_batch(batch)
        batch = []
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Dead letter queue fills up. Results never arrive. User sees stale data indefinitely. |
| Latency fit | Not for user-facing sync flows. Results available minutes to hours later. |
| Control points | DLQ monitoring, batch size limits, processing timeout per message, idempotency keys, result TTL, cost tracking per batch. |
Tools
Kafka, RabbitMQ, AWS SQS, Redis Streams, Temporal.io for orchestration.
Pro tip: Batching can materially reduce inference cost and request overhead in asynchronous workflows, especially for repeatable summarization and enrichment jobs. Combine related items into batched requests where the use case allows.
Pattern 6: Model-Agnostic Abstraction Layer
How it works
You build an internal “AI SDK” that abstracts away the specific model provider. Your application code calls your SDK; the SDK handles routing to Claude, GPT, Mistral, or a local model.
```python
# Your code calls this:
response = await ai_sdk.complete(task="summarize", content=text)

# SDK handles:
# - Model selection based on task
# - Fallback if primary model fails
# - Response schema validation
# - Cost tracking
# - Observability
```
When to use
- You want flexibility to switch providers without code changes
- Different tasks need different models (fast/cheap vs. slow/accurate)
- You’re preparing for a future where model pricing and capabilities shift rapidly
- Enterprise policy requires multi-vendor strategy
Implementation example
```python
# ai_sdk/client.py
from litellm import acompletion
from pydantic import BaseModel

class AllModelsFailedError(Exception):
    pass

class AIClient:
    # Route tasks to optimal models with fallbacks
    MODEL_ROUTING = {
        "summarize": ["claude-3-haiku-20240307", "gpt-4o-mini"],  # Fast, cheap
        "analyze": ["gpt-4o", "claude-sonnet-4-20250514"],        # Accurate
        "generate": ["claude-sonnet-4-20250514", "gpt-4o"],       # Balanced
    }

    async def complete(
        self,
        task: str,
        content: str,
        response_schema: type[BaseModel] | None = None,
        **kwargs,
    ):
        models = self.MODEL_ROUTING.get(task, ["gpt-4o-mini"])

        # Try each model in order until one succeeds
        for model in models:
            try:
                response = await acompletion(
                    model=model,
                    messages=[{"role": "user", "content": content}],
                    response_format=response_schema,
                )
                await self.log_success(task, model, response)
                return response
            except Exception as e:
                await self.log_failure(task, model, e)
                continue

        raise AllModelsFailedError(task, models)

# Usage in your application — no direct provider dependencies
ai = AIClient()
summary = await ai.complete("summarize", long_text, response_schema=SummarySchema)
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Abstraction hides model-specific behaviors. Debugging becomes harder. Fallback chain masks repeated failures. |
| Latency fit | Depends on underlying models. Abstraction adds minimal overhead (<10ms). |
| Control points | Per-task model routing config, fallback chain definition, unified observability across providers, cost allocation per task type, capability feature flags (e.g., vision, function calling). |
Tools
LiteLLM, LangChain, Portkey, custom abstraction.
Why this matters in 2026: Enterprises increasingly use multiple model families rather than a single provider. This pattern is no longer optional for teams that want operational flexibility and cost optimization.
Decision framework: Which pattern should you use?
Instead of a simple “situation → pattern” mapping, consider these four criteria:
| Criterion | Questions to Ask | Pattern Implications |
|---|---|---|
| Latency sensitivity | Is this in a user-facing sync path? Sub-500ms requirement? | High sensitivity → Sidecar with aggressive timeout, or Async. Never Middleware on hot paths. |
| Blast radius | If this fails, what breaks? Core checkout? Internal tooling? | High blast radius → Shadow mode first, Gateway for centralized control, aggressive fallbacks. |
| Compliance / PII exposure | Does data leave your infrastructure? GDPR/HIPAA constraints? | High exposure → Gateway with PII masking, audit logging, possibly local models only. |
| Model portability | Do you need to switch providers? Multi-model strategy? | High portability need → Abstraction Layer, Gateway with routing. |
Quick reference
| Your Situation | Start With | Why |
|---|---|---|
| Monolith, low risk tolerance | Sidecar + Feature Flag | Isolated, easy rollback |
| Microservices, multiple teams | API Gateway | Centralized control, cost visibility |
| High-volume, latency-tolerant | Event-driven | Cost-efficient, non-blocking |
| Request enrichment/validation | Middleware | Clean pipeline integration (with strict timeouts) |
| Uncertain about model choice | Abstraction Layer | Flexibility to pivot |
| Regulated industry | Gateway + Shadow Mode | Audit trail, gradual validation |
Start small. Pick one use case, one pattern, and prove value before expanding.
Antipatterns to avoid
Synchronous LLM in hot path
A 2-second LLM call in your checkout flow will kill conversion. If it must be synchronous, cache aggressively, set strict timeouts, and always have a non-LLM fallback.
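When the call truly must be synchronous, the timeout-plus-fallback shape looks like this. A minimal sketch: `fetch_llm_suggestion` and the message strings are stand-ins, and the timeout value is illustrative.

```python
import asyncio

FALLBACK = "Thanks for your order! A confirmation email is on its way."

async def fetch_llm_suggestion(order_id: str) -> str:
    await asyncio.sleep(1.0)  # simulates a slow model call
    return "personalized message"

async def checkout_message(order_id: str) -> str:
    try:
        # Strict timeout: the LLM never gets to block the hot path
        return await asyncio.wait_for(fetch_llm_suggestion(order_id), timeout=0.05)
    except Exception:
        return FALLBACK  # non-LLM fallback keeps checkout moving
```

The key property: the worst case for the user is the timeout plus a canned response, never an open-ended wait on a model.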
No caching strategy
Identical prompts should return cached responses. Without this, costs spiral and latency becomes unpredictable.
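The core of a prompt cache fits in a few lines. A minimal sketch keyed by a hash of model plus prompt, with an in-memory dict standing in for what would usually be Redis; names and the TTL default are illustrative.

```python
import hashlib
import time

_cache: dict[str, tuple[float, str]] = {}

def cached_complete(model: str, prompt: str, llm_call, ttl: float = 300.0) -> str:
    # Key on everything that affects the completion
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    hit = _cache.get(key)
    if hit is not None and time.time() - hit[0] < ttl:
        return hit[1]  # cache hit: zero tokens spent
    result = llm_call(model, prompt)  # cache miss: pay for the real call
    _cache[key] = (time.time(), result)
    return result
```

Note that caching only pays off at temperature settings and prompt shapes where identical inputs recur; include anything that changes the output (template version, temperature) in the key.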
Hardcoded prompts
Treat prompts like code – version them, review them, test them. Prompt drift is real, and you need rollback capability.
LLM as black box
Log prompts, responses, latency, and token usage. You can’t optimize what you can’t measure. Observability tools like Langfuse, Helicone, or custom logging are essential infrastructure.
JSON mode instead of Structured Outputs
If you’re parsing LLM output into business logic, use proper schema enforcement (OpenAI Structured Outputs, Anthropic tool use with schemas). “Almost valid JSON” will corrupt your data.
Skipping PII considerations
Before sending user data to external LLM APIs, implement masking. GDPR and compliance teams will thank you.
No evaluation loop
How do you know quality is maintained over time? Define metrics, measure continuously, alert on drift.
Getting started
You don’t need permission to experiment. Most of these patterns can be prototyped in a day:
- Pick a low-risk use case – internal tooling, batch reports, non-critical features
- Deploy a sidecar with a simple REST endpoint and structured outputs
- Add observability from day one – even basic logging beats nothing
- Run in shadow mode for a week, collect comparison data
- Review results with your team – latency, accuracy, cost
- Expand or pivot based on evidence
The goal isn’t to “add AI” for its own sake but to solve a real problem faster or better than you could before. The patterns just help you do it without breaking what’s already working.
Ready to integrate LLM without the risk?
Boldare helps engineering teams design and deploy LLM integration patterns matched to their stack – Python, Node.js, Java, Kotlin, Go. We’ve done this for energy providers, SaaS platforms, and enterprise systems that couldn’t afford downtime.