6 LLM integration patterns for existing codebases (without a full rewrite)
According to the 2026 State of AI Infrastructure Report by DDN, 54% of enterprises have delayed or cancelled AI projects in the past two years – often because they approached AI as a full-stack transformation rather than a targeted integration. The organizations succeeding with LLM adoption share a common trait: they’re not rewriting their systems. They’re augmenting them.
This article walks through six proven patterns for adding LLM capabilities to your existing systems. Whether you’re running a decade-old monolith or a sprawling microservices landscape, there’s a path forward that doesn’t involve rewriting your core.

The mindset shift: LLM as a layer, not a replacement
Before diving into patterns, let’s establish a key principle: LLM integration should be functional augmentation, not architectural revolution.
Think about how GitHub Copilot works. It doesn’t replace your editor – it sits alongside it, offering suggestions within the existing developer workflow. Products like Notion integrated AI into existing workflows and interfaces instead of turning it into a separate product experience. Salesforce Einstein GPT augments CRM workflows by adding generative capabilities to existing customer data, rather than requiring users to adopt a separate AI system.
The pattern is consistent: LLM as an overlay, not an overhaul.
This matters because it changes the conversation with stakeholders. You’re not asking for budget to rebuild. You’re proposing to add a capability layer that enhances what’s already working.
What 2026 demands from production-grade integration
Let’s be clear about what “production-ready” means in 2026. Every LLM integration in a serious codebase needs to address:
Structured outputs and schema enforcement
An “almost correct” data structure from an LLM is still broken input for your system. When output feeds into deterministic business logic, you need guaranteed schema adherence. OpenAI’s Structured Outputs (not just JSON mode) and similar features from other providers enforce this at the API level. If you’re parsing LLM responses into typed objects, this is non-negotiable.
Observability
No LLM integration without observability. This means tracing prompts and responses, tracking token usage and latency per endpoint, monitoring cost, and debugging retrieval/inference flows. Tools like Langfuse, Helicone, and Arize are standard infrastructure now.
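Even before adopting a dedicated tool, basic call-level observability is a few lines of code. A minimal sketch with stdlib logging (the record fields and the `observe` decorator are illustrative; in production you would ship these records to Langfuse, Helicone, or similar):

```python
import functools
import logging
import time

logger = logging.getLogger("llm.calls")

def observe(endpoint: str):
    # Decorator that logs latency and token usage for each LLM call
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            latency_ms = (time.perf_counter() - start) * 1000
            logger.info(
                "endpoint=%s latency_ms=%.1f tokens=%s",
                endpoint, latency_ms, getattr(result, "total_tokens", None),
            )
            return result
        return wrapper
    return decorator

@observe("summarize")
def fake_llm_call(text: str) -> str:
    return text[:10]  # stand-in for a real completion call
```

Wrapping every LLM entry point this way gives you per-endpoint latency and cost data from day one, with no vendor dependency.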
Prompt versioning and management
Treat prompts like code. Version them, review them, test them. Prompt drift is real, and rollback capability is essential when a prompt change breaks downstream logic.
Evaluation loops
How do you know the LLM is performing well? Define metrics upfront (e.g. accuracy against labeled data, latency, user satisfaction signals) and measure continuously.
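The simplest evaluation loop is accuracy against labeled examples. A minimal sketch, where `predict_fn` stands in for your LLM-backed function and the examples are invented for illustration:

```python
def evaluate_accuracy(predict_fn, labeled_examples) -> float:
    # Fraction of labeled inputs where the prediction matches the label
    correct = sum(
        1 for inp, expected in labeled_examples if predict_fn(inp) == expected
    )
    return correct / len(labeled_examples)

# Usage with a trivial stand-in classifier
examples = [
    ("refund please", "billing"),
    ("app crashes", "bug"),
    ("love it", "praise"),
]
predict = lambda text: "billing" if "refund" in text else "bug"
score = evaluate_accuracy(predict, examples)  # 2 of 3 correct
```

Run this on every prompt or model change and alert when the score drops; that is the evaluation loop, everything else is tooling.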
Privacy controls
Before sending user data to external LLM APIs, implement PII masking. GDPR and compliance teams will thank you.
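A minimal sketch of regex-based masking applied before a prompt leaves your infrastructure. The patterns below cover only obvious formats (emails, phone numbers); production deployments typically combine regexes with NER-based PII detectors.

```python
import re

# Order matters: mask emails first so the phone pattern can't eat their digits
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]

def mask_pii(text: str) -> str:
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

masked = mask_pii("Contact jane.doe@example.com or +1 555 123 4567")
```

Call `mask_pii` on user-supplied content in the request path, before it is interpolated into any prompt that goes to an external API.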
These aren’t “nice to haves” anymore. They’re table stakes for any team that wants to ship LLM features without creating operational nightmares.
Pattern 1: Sidecar / Wrapper
How it works
The LLM runs as an auxiliary service alongside your existing microservice. Your main application logic remains untouched while the sidecar handles all AI-related processing and exposes a simple API for your service to call when needed.
```
┌─────────────────┐      ┌─────────────────┐
│  Your Service   │────▶│   LLM Sidecar   │
│  (unchanged)    │◀────│  (new service)  │
└─────────────────┘      └─────────────────┘
```
When to use
- Adding AI-generated responses to existing support ticket systems
- Augmenting search results with semantic understanding
- Generating summaries or translations for content already in your system
Implementation example
```python
# llm_sidecar/main.py
# A separate microservice that handles all LLM calls
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()

# Response structure — enforces consistent output format
class SupportResponse(BaseModel):
    response_text: str
    confidence: float
    suggested_tags: list[str]

# Endpoint called by your main application
@app.post("/generate-response")
async def generate_support_response(ticket: dict) -> SupportResponse:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Generate a helpful support response."},
            {"role": "user", "content": ticket["description"]},
        ],
        response_format=SupportResponse,
        temperature=0.3,
    )
    return completion.choices[0].message.parsed
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Sidecar timeout or model error leaves main service waiting. Always set aggressive timeouts and define fallback behavior. |
| Latency fit | Acceptable for async or semi-sync flows (e.g., ticket response generation). Not ideal for sub-100ms user-facing paths. |
| Control points | Timeout (2-5s max), structured output schema, fallback to template response, request/response logging, rate limiting. |
Tools
OpenAI API with Structured Outputs, Anthropic Claude API, Ollama for local models, BentoML for model serving.
Pro tip: For latency-sensitive use cases, consider running a local model (Mistral, Llama) through Ollama. You control the infrastructure and eliminate external API dependencies.
Pattern 2: Middleware / Interceptor
How it works
The LLM is inserted into your request pipeline as middleware. It processes requests before they hit your business logic (pre-processing) or enriches responses before they’re sent to clients (post-processing).
```
Request → [LLM Middleware] → Business Logic → [LLM Middleware] → Response
```
When to use
- Semantic validation of user input before processing
- Automatic query rewriting (natural language → SQL, GraphQL)
- Response enrichment (adding context, translations, summaries)
- PII detection and masking before data reaches your backend
Implementation example
```python
# middleware/llm_interceptor.py
from fastapi import Request
from starlette.middleware.base import BaseHTTPMiddleware
from openai import AsyncOpenAI
from pydantic import BaseModel

client = AsyncOpenAI()

class SearchIntent(BaseModel):
    category: str | None
    color: str | None
    max_price: float | None
    keywords: list[str]

class LLMEnrichmentMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # Pre-processing: enrich search requests with structured intent
        if request.url.path == "/search":
            body = await request.json()
            try:
                structured_query = await self.extract_search_intent(body["query"])
                request.state.structured_query = structured_query
            except Exception:
                # Fallback: pass raw query through if LLM fails
                request.state.structured_query = None
        # Continue to your business logic
        response = await call_next(request)
        return response

    async def extract_search_intent(self, natural_query: str) -> SearchIntent:
        # LLM converts "red shoes under $100" → structured SearchIntent
        completion = await client.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract search intent from natural language query."},
                {"role": "user", "content": natural_query},
            ],
            response_format=SearchIntent,
        )
        return completion.choices[0].message.parsed
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Middleware timeout blocks entire request. Schema validation failure on LLM output corrupts downstream logic. |
| Latency fit | Only for paths where 200-500ms added latency is acceptable. Never on checkout or payment flows. |
| Control points | Path matching (don’t run on every request), strict timeout (1-2s), schema validation with Pydantic/Zod, graceful fallback to passthrough, per-path observability. |
Watch out: Never put synchronous LLM calls in middleware that runs on every request. Use path matching to limit scope. Always define what happens when the LLM fails or times out.
Pattern 3: Feature Flag + Shadow Mode
How it works
You deploy the LLM integration behind a feature flag. In shadow mode, the LLM processes requests in parallel with your existing logic, but its output is logged – not served to users. This lets you compare accuracy, latency, and cost before going live.
```
Request → Existing Logic → Response (served)
       └→ LLM Logic → Logged (not served)
```
When to use
- Validating LLM accuracy against your current system
- A/B testing AI-generated content vs. human-written
- Gradual rollout to percentage of users
- Building confidence with stakeholders before full deployment
Implementation example
```python
# handlers/support_ticket.py
from feature_flags import is_enabled, get_variant

async def handle_ticket(ticket: dict):
    # Always run existing logic first
    existing_response = await legacy_response_generator(ticket)

    # Check if LLM integration is enabled via feature flag
    if is_enabled("llm_support_responses"):
        try:
            llm_response = await llm_sidecar.generate_response(ticket)

            # Shadow mode: compare outputs without affecting users
            if get_variant("llm_support_responses") == "shadow":
                await log_comparison(
                    ticket_id=ticket["id"],
                    existing=existing_response,
                    llm=llm_response,
                    latency_delta_ms=llm_response.latency - existing_response.latency,
                )
                return existing_response  # Users still get the old response

            # Live mode: serve LLM response to users
            return llm_response
        except Exception as e:
            log_llm_failure(e)
            return existing_response  # Fallback on any LLM failure

    return existing_response
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Shadow mode doubles compute cost. Comparison metrics poorly defined → false confidence in rollout readiness. |
| Latency fit | Shadow path is async/fire-and-forget. No latency impact on served response. |
| Control points | Feature flag granularity (user %, geo, account tier), structured comparison logging, cost tracking per variant, automatic rollback triggers. |
Tools
LaunchDarkly, Optimizely, Unleash, Flipper, or a simple Redis-backed flag store.
Pro tip: Define your comparison metrics upfront. Track response time, token cost, user satisfaction (if measurable), and accuracy (if you have labeled data). Don’t roll out based on vibes.
Pattern 4: API Gateway with LLM
How it works
A centralized gateway handles all LLM traffic. Your services don’t call OpenAI or Claude directly – they call your AI Gateway, which manages routing, rate limiting, key rotation, prompt templates, and cost tracking.
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Service A   │────▶│              │────▶│    OpenAI    │
├──────────────┤      │              │      ├──────────────┤
│  Service B   │────▶│  AI Gateway  │────▶│    Claude    │
├──────────────┤      │              │      ├──────────────┤
│  Service C   │────▶│              │────▶│   Local LLM  │
└──────────────┘      └──────────────┘      └──────────────┘
```
When to use
- Multiple services need LLM access
- You need centralized cost control and observability
- Compliance requires audit logs of all prompts and responses
- You want to swap models without changing service code
Implementation example
```python
# ai_gateway/main.py
from fastapi import FastAPI, Header, HTTPException
from litellm import acompletion
import hashlib

app = FastAPI()

# Centralized prompt management
PROMPT_TEMPLATES = {
    "support_response": "You are a helpful support agent...",
    "summarize": "Summarize the following text concisely...",
}

response_cache = {}

@app.post("/v1/complete")
async def unified_completion(
    request: dict,
    x_service_name: str = Header(...),      # Identifies calling service
    x_prompt_template: str = Header(None),  # Optional template key
    x_cache_ttl: int = Header(0),           # Cache duration in seconds
):
    # Rate limiting per service
    if not await rate_limiter.check(x_service_name):
        raise HTTPException(429, "Rate limit exceeded")

    # Check cache for repeated requests
    cache_key = hashlib.sha256(str(request).encode()).hexdigest()
    if x_cache_ttl > 0 and cache_key in response_cache:
        return response_cache[cache_key]

    # Apply centralized template if specified
    system_prompt = PROMPT_TEMPLATES.get(x_prompt_template, request.get("system"))
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": request["prompt"]},
    ]

    # Route to model with automatic fallback
    try:
        response = await acompletion(
            model=request.get("model", "gpt-4o"),
            messages=messages,
        )
    except Exception:
        # Fallback to secondary provider
        response = await acompletion(
            model="claude-3-haiku-20240307",
            messages=messages,
        )

    # Log for cost tracking and compliance
    await log_request(x_service_name, request, response)

    # Cache if requested
    if x_cache_ttl > 0:
        response_cache[cache_key] = response

    return response
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Single point of failure. Gateway outage = all AI features down. Requires HA deployment. |
| Latency fit | Adds 10-50ms overhead. Acceptable for most use cases. |
| Control points | Per-service rate limits, prompt template versioning, model fallback chain, response caching, full audit logging, cost dashboards, PII filtering before external calls. |
Tools
Kong, Tyk, custom FastAPI gateway, LiteLLM Router, Portkey, Helicone.
2026 trend: AI Gateways are becoming standard infrastructure. They handle prompt versioning, A/B testing between models, automatic fallback (GPT-4 → Claude → local), and real-time cost dashboards. If you’re integrating LLM across multiple services, build this early.
Pattern 5: Event-Driven / Async Processing
How it works
The LLM operates asynchronously, triggered by events in a message queue. It processes work in the background without blocking user-facing requests.
```
User Action → Queue (Kafka/SQS) → LLM Worker → Result Store → Notification
```
When to use
- Batch processing (summarizing daily logs, generating reports)
- Non-blocking enrichment (recommendations sent after purchase)
- Heavy processing that would timeout in synchronous flow
- Cost optimization through batching
Implementation example
```python
# workers/llm_processor.py
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'content-to-summarize',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
)

def process_batch(messages: list):
    # Batch multiple items into single LLM call for efficiency
    combined_prompt = "\n---\n".join([m["content"] for m in messages])
    response = llm_client.complete(
        prompt=f"Summarize each section separated by ---:\n{combined_prompt}",
        response_format=BatchSummaryResponse,
    )
    # Store results with full tracing
    for msg, summary in zip(messages, response.summaries):
        result_store.save(
            id=msg["id"],
            summary=summary,
            trace_id=response.trace_id,
            tokens_used=response.usage.total_tokens,
        )

# Batch processing: collect 10 messages, then process
batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 10:
        process_batch(batch)
        batch = []
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Dead letter queue fills up. Results never arrive. User sees stale data indefinitely. |
| Latency fit | Not for user-facing sync flows. Results available minutes to hours later. |
| Control points | DLQ monitoring, batch size limits, processing timeout per message, idempotency keys, result TTL, cost tracking per batch. |
Tools
Kafka, RabbitMQ, AWS SQS, Redis Streams, Temporal.io for orchestration.
Pro tip: Batching can materially reduce inference cost and request overhead in asynchronous workflows, especially for repeatable summarization and enrichment jobs. Combine related items into batched requests where the use case allows.
Pattern 6: Model-Agnostic Abstraction Layer
How it works
You build an internal “AI SDK” that abstracts away the specific model provider. Your application code calls your SDK; the SDK handles routing to Claude, GPT, Mistral, or a local model.
```python
# Your code calls this:
response = await ai_sdk.complete(task="summarize", content=text)

# SDK handles:
# - Model selection based on task
# - Fallback if primary model fails
# - Response schema validation
# - Cost tracking
# - Observability
```
When to use
- You want flexibility to switch providers without code changes
- Different tasks need different models (fast/cheap vs. slow/accurate)
- You’re preparing for a future where model pricing and capabilities shift rapidly
- Enterprise policy requires multi-vendor strategy
Implementation example
```python
# ai_sdk/client.py
from litellm import acompletion
from pydantic import BaseModel

class AllModelsFailedError(Exception):
    pass

class AIClient:
    # Route tasks to optimal models with fallbacks
    MODEL_ROUTING = {
        "summarize": ["claude-3-haiku-20240307", "gpt-4o-mini"],  # Fast, cheap
        "analyze": ["gpt-4o", "claude-sonnet-4-20250514"],        # Accurate
        "generate": ["claude-sonnet-4-20250514", "gpt-4o"],       # Balanced
    }

    async def complete(
        self,
        task: str,
        content: str,
        response_schema: type[BaseModel] | None = None,
        **kwargs,
    ):
        models = self.MODEL_ROUTING.get(task, ["gpt-4o-mini"])

        # Try each model in order until one succeeds
        for model in models:
            try:
                response = await acompletion(
                    model=model,
                    messages=[{"role": "user", "content": content}],
                    response_format=response_schema,
                )
                await self.log_success(task, model, response)
                return response
            except Exception as e:
                await self.log_failure(task, model, e)
                continue

        raise AllModelsFailedError(task, models)

# Usage in your application — no direct provider dependencies
ai = AIClient()
summary = await ai.complete("summarize", long_text, response_schema=SummarySchema)
```
Production constraints
| Aspect | Guidance |
|---|---|
| Failure mode | Abstraction hides model-specific behaviors. Debugging becomes harder. Fallback chain masks repeated failures. |
| Latency fit | Depends on underlying models. Abstraction adds minimal overhead (<10ms). |
| Control points | Per-task model routing config, fallback chain definition, unified observability across providers, cost allocation per task type, capability feature flags (e.g., vision, function calling). |
Tools
LiteLLM, LangChain, Portkey, custom abstraction.
Why this matters in 2026: Enterprises increasingly use multiple model families rather than a single provider. This pattern is no longer optional for teams that want operational flexibility and cost optimization.
Decision framework: Which pattern should you use?
Instead of a simple “situation → pattern” mapping, consider these four criteria:
| Criterion | Questions to Ask | Pattern Implications |
|---|---|---|
| Latency sensitivity | Is this in a user-facing sync path? Sub-500ms requirement? | High sensitivity → Sidecar with aggressive timeout, or Async. Never Middleware on hot paths. |
| Blast radius | If this fails, what breaks? Core checkout? Internal tooling? | High blast radius → Shadow mode first, Gateway for centralized control, aggressive fallbacks. |
| Compliance / PII exposure | Does data leave your infrastructure? GDPR/HIPAA constraints? | High exposure → Gateway with PII masking, audit logging, possibly local models only. |
| Model portability | Do you need to switch providers? Multi-model strategy? | High portability need → Abstraction Layer, Gateway with routing. |
Quick reference
| Your Situation | Start With | Why |
|---|---|---|
| Monolith, low risk tolerance | Sidecar + Feature Flag | Isolated, easy rollback |
| Microservices, multiple teams | API Gateway | Centralized control, cost visibility |
| High-volume, latency-tolerant | Event-driven | Cost-efficient, non-blocking |
| Request enrichment/validation | Middleware | Clean pipeline integration (with strict timeouts) |
| Uncertain about model choice | Abstraction Layer | Flexibility to pivot |
| Regulated industry | Gateway + Shadow Mode | Audit trail, gradual validation |
Start small. Pick one use case, one pattern, and prove value before expanding.
Antipatterns to avoid
Synchronous LLM in hot path
A 2-second LLM call in your checkout flow will kill conversion. If it must be synchronous, cache aggressively, set strict timeouts, and always have a non-LLM fallback.
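When the call truly must be synchronous, the timeout-plus-fallback shape looks like this. A minimal sketch: `fetch_llm_suggestion` and the message strings are stand-ins, and the timeout value is illustrative.

```python
import asyncio

FALLBACK = "Thanks for your order! A confirmation email is on its way."

async def fetch_llm_suggestion(order_id: str) -> str:
    await asyncio.sleep(1.0)  # simulates a slow model call
    return "personalized message"

async def checkout_message(order_id: str) -> str:
    try:
        # Strict timeout: the LLM never gets to block the hot path
        return await asyncio.wait_for(fetch_llm_suggestion(order_id), timeout=0.05)
    except Exception:
        return FALLBACK  # non-LLM fallback keeps checkout moving
```

The key property: the worst case for the user is the timeout plus a canned response, never an open-ended wait on a model.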
No caching strategy
Identical prompts should return cached responses. Without this, costs spiral and latency becomes unpredictable.
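The core of a prompt cache fits in a few lines. A minimal sketch keyed by a hash of model plus prompt, with an in-memory dict standing in for what would usually be Redis; names and the TTL default are illustrative.

```python
import hashlib
import time

_cache: dict[str, tuple[float, str]] = {}

def cached_complete(model: str, prompt: str, llm_call, ttl: float = 300.0) -> str:
    # Key on everything that affects the completion
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    hit = _cache.get(key)
    if hit is not None and time.time() - hit[0] < ttl:
        return hit[1]  # cache hit: zero tokens spent
    result = llm_call(model, prompt)  # cache miss: pay for the real call
    _cache[key] = (time.time(), result)
    return result
```

Note that caching only pays off at temperature settings and prompt shapes where identical inputs recur; include anything that changes the output (template version, temperature) in the key.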
Hardcoded prompts
Treat prompts like code – version them, review them, test them. Prompt drift is real, and you need rollback capability.
LLM as black box
Log prompts, responses, latency, and token usage. You can’t optimize what you can’t measure. Observability tools like Langfuse, Helicone, or custom logging are essential infrastructure.
JSON mode instead of Structured Outputs
If you’re parsing LLM output into business logic, use proper schema enforcement (OpenAI Structured Outputs, Anthropic tool use with schemas). “Almost valid JSON” will corrupt your data.
Skipping PII considerations
Before sending user data to external LLM APIs, implement masking. GDPR and compliance teams will thank you.
No evaluation loop
How do you know quality is maintained over time? Define metrics, measure continuously, alert on drift.
Getting started
You don’t need permission to experiment. Most of these patterns can be prototyped in a day:
- Pick a low-risk use case – internal tooling, batch reports, non-critical features
- Deploy a sidecar with a simple REST endpoint and structured outputs
- Add observability from day one – even basic logging beats nothing
- Run in shadow mode for a week, collect comparison data
- Review results with your team – latency, accuracy, cost
- Expand or pivot based on evidence
The goal isn’t to “add AI” for its own sake but to solve a real problem faster or better than you could before. The patterns just help you do it without breaking what’s already working.
Ready to integrate LLM without the risk?
Boldare helps engineering teams design and deploy LLM integration patterns matched to their stack – Python, Node.js, Java, Kotlin, Go. We’ve done this for energy providers, SaaS platforms, and enterprise systems that couldn’t afford downtime.