LLM Integration Patterns for B2B SaaS: From API Wrapper to Production-Grade AI Feature
Adding an LLM to your B2B product is not the same as building a consumer chatbot. Token costs, reliability, latency, multi-tenant data isolation, and auditability all look different when your customers are businesses using your AI feature in production workflows every day.

The gap between "we integrated GPT-4" and "we shipped a production AI feature that our enterprise customers actually trust" is larger than most product teams expect. Consumer AI apps can absorb model errors gracefully — the user tries again or laughs it off. B2B customers running production workflows cannot. A hallucinated output in an enterprise HR platform or a HIPAA-covered healthcare tool is not a minor inconvenience; it is a support ticket, a potential liability, and a trust problem that is very hard to undo.
This post covers the integration patterns we have found to work for LLM features in B2B SaaS — not the wrapper code, but the architectural decisions that determine whether the feature holds up under production usage.
Pattern 1: The prompt-as-config pattern
The single most common mistake in first-generation LLM integrations: embedding the system prompt as a string constant in application code.
const systemPrompt = `You are a helpful assistant for our HR platform.
Answer questions about employee benefits based on the provided policy documents.`
The problem: when you want to improve the prompt — and you will, constantly — you need a code change, a deployment cycle, and a rollback path. You cannot A/B test prompt variations, cannot roll out changes to a subset of customers, and cannot give customer success teams visibility into what the system is actually instructed to do.
The pattern that works: treat system prompts as configuration, not code. Store them in your database, version them, and deploy prompt changes independently from code changes.
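// Load the newest active prompt for this feature and tenant
const promptConfig = await db.prompts.findFirst({
  where: {
    feature: 'benefits-assistant',
    tenantId: context.tenantId,
    isActive: true,
  },
  orderBy: { version: 'desc' },
})

// Fall back to a built-in default when no prompt is configured
const systemPrompt = promptConfig?.template ?? DEFAULT_SYSTEM_PROMPT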
This enables prompt versioning, per-tenant customization, A/B testing at the prompt level, and instant rollback. It also means non-engineers can iterate on prompts through an admin interface rather than needing a code change for every iteration.
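One way to roll a prompt change out to a subset of customers is deterministic bucketing: hash the tenant ID into a bucket from 0 to 99 and compare it against each variant's rollout share. The sketch below is illustrative only. The `PromptVariant` shape, the `rolloutPercent` field, and the `pickVariant` helper are assumptions, not part of the schema shown above.

```typescript
// Hypothetical shape of a stored prompt variant; field names are assumptions.
interface PromptVariant {
  version: number
  template: string
  rolloutPercent: number // share of tenants that should receive this variant, 0-100
}

// FNV-1a 32-bit hash: small, dependency-free, and stable across processes.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

// Map a tenant to a bucket in [0, 100) and walk the variants' cumulative ranges.
// Returns undefined when the tenant falls outside every range, so the caller
// can fall back to the default prompt.
function pickVariant(
  feature: string,
  tenantId: string,
  variants: PromptVariant[],
): PromptVariant | undefined {
  const bucket = fnv1a(`${feature}:${tenantId}`) % 100
  let upperBound = 0
  for (const variant of variants) {
    upperBound += variant.rolloutPercent
    if (bucket < upperBound) return variant
  }
  return undefined
}
```

Because the bucket depends only on the feature name and tenant ID, a tenant keeps seeing the same variant across requests, which keeps A/B comparisons clean and makes rollback a matter of setting a rollout share to zero.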
Pattern 2: Structured output contracts
LLM outputs are strings. Your application probably wants structured data. The naive approach is to ask the LLM to format its output as JSON and then parse the string. This works until the LLM decides to add a preamble, use single quotes, or include trailing commas. All of these happen in production.
The pattern that works: define the output as a schema and have the API enforce it. With OpenAI's structured output mode:
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'benefits_answer',
      schema: {
        type: 'object',
        properties: {
          answer: { type: 'string' },
          sourceDocuments: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                title: { type: 'string' },
                section: { type: 'string' },
                relevanceScore: { type: 'number' },
              },
              required: ['title', 'section', 'relevanceScore'],
              // strict mode requires additionalProperties: false on every object
              additionalProperties: false,
            },
          },
          confidence: { type: 'string', enum: ['high', 'medium', 'low'] },
          cannotAnswer: { type: 'boolean' },
        },
        required: ['answer', 'sourceDocuments', 'confidence', 'cannotAnswer'],
        additionalProperties: false,
      },
      strict: true,
    },
  },
})
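With strict: true, the API constrains generation so that the response conforms to the schema. This is significantly more reliable than asking for JSON in the prompt and hoping the model complies. Two cases still need handling: the model can refuse (the message then carries a refusal instead of content), and a response cut off by the token limit can be incomplete JSON. Strict mode also requires every object in the schema to set additionalProperties: false and to list all of its properties as required. For Anthropic's Claude, tool use (function calling) with an input schema achieves a similar effect.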
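Even with schema enforcement at the API, it can be worth validating the payload at the application boundary, so that a refusal, a truncated response, or a provider without strict mode cannot push malformed data downstream. The following is a minimal sketch of such a check for the schema above. The `BenefitsAnswer` type and `parseBenefitsAnswer` helper are illustrative names, and a schema validation library could replace the hand-written guards.

```typescript
interface SourceDocument {
  title: string
  section: string
  relevanceScore: number
}

interface BenefitsAnswer {
  answer: string
  sourceDocuments: SourceDocument[]
  confidence: 'high' | 'medium' | 'low'
  cannotAnswer: boolean
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

function isSourceDocument(value: unknown): value is SourceDocument {
  return (
    isRecord(value) &&
    typeof value.title === 'string' &&
    typeof value.section === 'string' &&
    typeof value.relevanceScore === 'number'
  )
}

// Returns the typed answer, or null when the content is missing (for example
// on a refusal), is not valid JSON, or does not match the expected shape.
function parseBenefitsAnswer(raw: string | null): BenefitsAnswer | null {
  if (raw === null) return null
  let parsed: unknown
  try {
    parsed = JSON.parse(raw)
  } catch {
    return null
  }
  if (!isRecord(parsed)) return null
  const { answer, sourceDocuments, confidence, cannotAnswer } = parsed
  if (typeof answer !== 'string') return null
  if (!Array.isArray(sourceDocuments) || !sourceDocuments.every(isSourceDocument)) return null
  if (confidence !== 'high' && confidence !== 'medium' && confidence !== 'low') return null
  if (typeof cannotAnswer !== 'boolean') return null
  return { answer, sourceDocuments, confidence, cannotAnswer }
}
```

A null result is a natural trigger for the fallback path rather than an exception that reaches the user.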
Pattern 3: Multi-tenant data isolation
This is the pattern most B2B LLM integrations get wrong, with consequences that range from embarrassing to catastrophic. If multiple customers' data can appear in the same LLM context — in retrieved chunks, in conversation history, or in the prompt — you have a data isolation problem.
The rule: a customer's data must never appear in another customer's LLM context. The failure modes are subtle:
- Shared vector store without tenant filtering — searching for "employee benefits policy" returns chunks from all tenants' documents, not just the querying tenant's
- Caching at the wrong layer — caching LLM responses based on the query string alone without including the tenant ID in the cache key, so a response generated from Tenant A's data is returned to Tenant B asking the same question
// Every vector store query MUST include the tenant filter
const results = await vectorStore.query({
  vector: queryEmbedding,
  topK: 10,
  filter: {
    tenantId: { $eq: context.tenantId }, // mandatory
  },
})

// Cache key includes tenant ID
const cacheKey = `llm:${context.tenantId}:${hashQuery(queryText)}`
We add a lint rule and a code review checklist item: every vector store query in the codebase must have a tenant filter.
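A lint rule catches omissions at review time. A complementary approach is to make the unfiltered query hard to write at all, by routing every query through a helper that injects the tenant filter. The sketch below assumes a generic query shape modeled on the example above. `VectorQuery` and `withTenantFilter` are hypothetical names, and the `$eq` filter syntax varies between vector stores.

```typescript
// Hypothetical query shape, modeled on the example above.
interface VectorQuery {
  vector: number[]
  topK: number
  filter?: Record<string, unknown>
}

// Returns a copy of the query with the tenant filter applied.
// The tenant clause is written last, so a caller-supplied filter
// can add conditions but can never override the tenant scope.
function withTenantFilter(query: VectorQuery, tenantId: string): VectorQuery {
  if (tenantId.trim() === '') {
    throw new Error('tenantId is required for every vector store query')
  }
  return {
    ...query,
    filter: { ...query.filter, tenantId: { $eq: tenantId } },
  }
}
```

Application code would then call something like `vectorStore.query(withTenantFilter(query, context.tenantId))`, and the lint rule can be narrowed to flagging any direct query that bypasses the helper.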
Pattern 4: Cost control by design
LLM API costs at B2B scale are not like other API costs. At a few cents per 1,000 tokens and a few thousand tokens per request, a single feature processing 100,000 requests per month can easily cost $1,000-$5,000 monthly. Without attribution, you cannot distinguish between "this is expensive because our best enterprise customer is using it heavily" and "this is expensive because we have a bug." The pattern that works: log every call with tenant, user, and feature attribution, and enforce quotas per tenant.
// Log every LLM call with full attribution
await db.llmUsage.create({
  data: {
    tenantId: context.tenantId,
    userId: context.userId,
    feature: 'benefits-assistant',
    model: 'gpt-4o',
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    latencyMs: Date.now() - startTime,
    requestId: response.id,
  },
})

// Enforce per-tenant quotas (run this check before calling the model)
const tenantUsage = await getMonthlyUsage(context.tenantId)
if (tenantUsage.totalTokens > PLAN_LIMITS[context.plan].monthlyTokens) {
  return { error: 'ai_quota_exceeded', message: 'Monthly AI quota reached. Upgrade to continue.' }
}
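To see how quickly the monthly figure adds up, the arithmetic can be written as a small estimator. The prices in the usage note below are placeholder assumptions for illustration, not any provider's current rates, and `ModelPricing` and `estimateMonthlyCostUsd` are hypothetical names.

```typescript
// Per-1,000-token prices in USD. Real prices vary by model and change over time.
interface ModelPricing {
  inputPer1kUsd: number
  outputPer1kUsd: number
}

// Estimated monthly spend for one feature, given average token counts per request.
function estimateMonthlyCostUsd(
  requestsPerMonth: number,
  avgPromptTokens: number,
  avgCompletionTokens: number,
  pricing: ModelPricing,
): number {
  const costPerRequest =
    (avgPromptTokens / 1000) * pricing.inputPer1kUsd +
    (avgCompletionTokens / 1000) * pricing.outputPer1kUsd
  return requestsPerMonth * costPerRequest
}
```

With assumed prices of $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens, 100,000 requests averaging 2,000 prompt tokens and 500 completion tokens cost about $0.035 each, or roughly $3,500 per month. Running the same function over the logged usage rows, grouped by tenant or feature, shows where that spend is coming from.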
Pattern 5: Async processing for slow operations
GPT-4o with a 2,000-token context and 500-token response takes 3-8 seconds. Generating a 2,000-word document can take 30+ seconds. Holding an HTTP connection open that long is not acceptable UX and is fragile — load balancers and proxies with shorter timeouts will kill the connection.
For operations over a few seconds, use an async processing pattern: accept the request and return a job ID immediately (202 Accepted), queue the LLM work, and notify the client when done via WebSocket, SSE, or webhook.
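// API handler: responds immediately with 202
export async function POST(req: Request) {
  // getRequestContext is a hypothetical helper that resolves the authenticated tenant and user
  const context = await getRequestContext(req)
  const { query } = await req.json()
  // Assumes a Bull-style queue, where add() resolves to a Job object rather than an ID
  const job = await queue.add('benefits-query', {
    tenantId: context.tenantId,
    userId: context.userId,
    query,
  })
  return Response.json({ jobId: job.id, status: 'queued' }, { status: 202 })
}

// Worker: runs asynchronously
queue.process('benefits-query', async (job) => {
  const result = await runBenefitsQuery(job.data)
  // Persist the result first so the client can fetch it even if the push is missed
  await db.aiResults.upsert({
    where: { jobId: job.id },
    create: { jobId: job.id, ...result },
    update: { ...result },
  })
  await pusher.trigger(`user-${job.data.userId}`, 'ai-result', { jobId: job.id })
})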
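Push notifications can be missed, for example when a client reconnects after the event has fired, so an async design usually also needs a way to ask for a job's current state. The sketch below models the job lifecycle and the response a status endpoint might return. The `JobRecord` shape, the status names beyond 'queued', and both helper functions are assumptions for illustration.

```typescript
type JobStatus = 'queued' | 'running' | 'succeeded' | 'failed'

// Hypothetical persisted job record.
interface JobRecord {
  jobId: string
  tenantId: string
  status: JobStatus
  result?: unknown
  error?: string
}

// Legal lifecycle moves; terminal states allow none.
const ALLOWED_TRANSITIONS: Record<JobStatus, JobStatus[]> = {
  queued: ['running'],
  running: ['succeeded', 'failed'],
  succeeded: [],
  failed: [],
}

// Returns an updated copy of the job, or throws on an illegal move.
function transition(job: JobRecord, next: JobStatus): JobRecord {
  if (ALLOWED_TRANSITIONS[job.status].indexOf(next) === -1) {
    throw new Error(`invalid transition: ${job.status} -> ${next}`)
  }
  return { ...job, status: next }
}

// Builds the status response for a polling client. A job that belongs to
// another tenant is reported as not found, so job IDs leak nothing.
function statusResponse(
  job: JobRecord | undefined,
  callerTenantId: string,
): { httpStatus: number; body: Record<string, unknown> } {
  if (!job || job.tenantId !== callerTenantId) {
    return { httpStatus: 404, body: { error: 'job_not_found' } }
  }
  if (job.status === 'succeeded') {
    return { httpStatus: 200, body: { jobId: job.jobId, status: job.status, result: job.result } }
  }
  if (job.status === 'failed') {
    return { httpStatus: 200, body: { jobId: job.jobId, status: job.status, error: job.error ?? 'unknown_error' } }
  }
  // Still queued or running: tell the client to keep waiting.
  return { httpStatus: 202, body: { jobId: job.jobId, status: job.status } }
}
```

Returning 202 while the job is in flight mirrors the initial response, so a client can use the same handling for the first call and every subsequent poll.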
Pattern 6: Graceful degradation
LLM APIs go down. They also rate-limit and return malformed outputs. Your feature must handle all of these gracefully.
For each AI feature, define a fallback hierarchy: primary model → secondary/cheaper model → cached response from a similar recent query → static fallback acknowledging the AI is temporarily unavailable.
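async function runWithFallback(prompt: string, options: LLMOptions) {
  // Try the primary model first, then the fallback model
  for (const model of [options.primaryModel, options.fallbackModel]) {
    try {
      return await callLLM(model, prompt, options)
    } catch (error) {
      if (isRetryableError(error)) continue // outage or rate limit: try the next model
      break // non-retryable error: another model will not help
    }
  }
  // Cache key is scoped by tenant (see Pattern 3); assumes LLMOptions carries tenantId
  const cached = await cache.get(`llm:${options.tenantId}:${hashPrompt(prompt)}`)
  if (cached) return { ...cached, fromCache: true }
  return STATIC_FALLBACK_RESPONSE
}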
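The fallback loop depends on telling transient failures apart from permanent ones. One possible implementation of isRetryableError is sketched below, together with a jittered backoff delay for retrying the same model before moving on. The status codes and error codes treated as retryable are assumptions that should be checked against the provider's documentation, and `backoffDelayMs` is a hypothetical helper.

```typescript
// Minimal shape shared by many HTTP client errors; the fields are assumptions.
interface HttpLikeError {
  status?: number
  code?: string
}

// Timeouts, rate limits, and server-side failures are usually worth retrying.
const RETRYABLE_STATUS = new Set([408, 429, 500, 502, 503, 504])
const RETRYABLE_CODES = new Set(['ECONNRESET', 'ETIMEDOUT'])

function isRetryableError(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) return false
  const { status, code } = error as HttpLikeError
  if (typeof status === 'number' && RETRYABLE_STATUS.has(status)) return true
  return typeof code === 'string' && RETRYABLE_CODES.has(code)
}

// Exponential backoff with full jitter: a random delay between 0 and
// min(capMs, baseMs * 2^attempt). The random source is injectable for tests.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 8000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt)
  return Math.floor(random() * ceiling)
}
```

Client errors such as 400 or 401 are deliberately not retryable here: the same request would fail the same way on any model, so the loop should stop and fall through to the cache or the static response.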

Pattern 7: Audit trail for regulated industries
B2B customers in healthcare, finance, and legal require an audit trail for AI-generated outputs. Who asked what, what model generated the response, what source documents were used, and whether the response was modified before use — these are compliance requirements in regulated industries, not nice-to-haves.
interface AIAuditRecord {
  id: string
  tenantId: string
  userId: string
  feature: string
  modelId: string
  modelVersion: string
  systemPromptVersion: number
  retrievedChunks: Array<{ documentId: string; chunkId: string; score: number }>
  rawResponse: string
  parsedResponse: Record<string, unknown>
  latencyMs: number
  inputTokens: number
  outputTokens: number
  wasModified: boolean // did a human edit the AI output before use?
  modifiedBy?: string
  createdAt: Date
}
The wasModified flag captures a legally meaningful distinction: an AI output used directly versus one reviewed and edited by a human before use are different situations that must be distinguishable in the audit log.
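How wasModified gets set is a policy decision. A minimal sketch, assuming the text the user finally submits is compared against the AI-generated text and that whitespace-only differences do not count as an edit, might look like this. `detectModification` and its normalization rule are illustrative, and a compliance team may require a stricter comparison.

```typescript
interface ModificationInfo {
  wasModified: boolean
  modifiedBy?: string
}

// Normalize line endings and surrounding whitespace before comparing.
// Treating such differences as "unmodified" is an assumption of this sketch.
function normalizeText(text: string): string {
  return text.replace(/\r\n/g, '\n').trim()
}

// Compares the AI-generated text with the text that was actually used.
// A modified output must be attributed to the user who edited it.
function detectModification(
  generatedText: string,
  finalText: string,
  editorUserId?: string,
): ModificationInfo {
  const wasModified = normalizeText(generatedText) !== normalizeText(finalText)
  if (!wasModified) return { wasModified: false }
  if (!editorUserId) {
    throw new Error('a modified AI output must record who modified it')
  }
  return { wasModified: true, modifiedBy: editorUserId }
}
```

Refusing to record an unattributed edit keeps the audit log from containing modified outputs with no accountable reviewer.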
Shipping AI features in B2B SaaS is genuinely harder than it looks from the outside. The gap between a demo that impresses and a feature that enterprise customers trust enough to use in production workflows every day is significant, and the seven patterns above are what bridge it.
Written by
Founder & CEO
Gaurang Ghinaiya is the Founder & CEO of Nexios Technologies. He is passionate about building innovative software solutions that drive business growth. With years of experience in technology leadership, he guides teams toward excellence.