LLM Integration Patterns for B2B SaaS: From API Wrapper to Production-Grade AI Feature

Adding an LLM to your B2B product is not the same as building a consumer chatbot. Token costs, reliability, latency, multi-tenant data isolation, and auditability all look different when your customers are businesses using your AI feature in production workflows every day.

Gaurang Ghinaiya

Founder & CEO

May 28, 2026

The gap between "we integrated GPT-4" and "we shipped a production AI feature that our enterprise customers actually trust" is larger than most product teams expect. Consumer AI apps can absorb model errors gracefully: the user tries again or laughs it off. B2B customers running production workflows cannot. A hallucinated output in an enterprise HR platform or a HIPAA-covered healthcare tool is not a minor inconvenience; it is a support ticket, a potential liability, and a trust problem that is very hard to undo.

This post covers the integration patterns we have found to work for LLM features in B2B SaaS: not the wrapper code, but the architectural decisions that determine whether the feature holds up under production usage.

Pattern 1: The prompt-as-config pattern

The single most common mistake in first-generation LLM integrations: embedding the system prompt as a string constant in application code.

const systemPrompt = `You are a helpful assistant for our HR platform.
Answer questions about employee benefits based on the provided policy documents.`

The problem: when you want to improve the prompt, and you will, constantly, you need a code change, a deployment cycle, and a rollback path. You cannot A/B test prompt variations, cannot roll out changes to a subset of customers, and cannot give customer success teams visibility into what the system is actually instructed to do.

The pattern that works: treat system prompts as configuration, not code. Store them in your database, version them, and deploy prompt changes independently from code changes.

const promptConfig = await db.prompts.findFirst({
  where: {
    feature: 'benefits-assistant',
    tenantId: context.tenantId,
    isActive: true,
  },
  orderBy: { version: 'desc' },
})

const systemPrompt = promptConfig?.template ?? DEFAULT_SYSTEM_PROMPT

This enables prompt versioning, per-tenant customization, A/B testing at the prompt level, and instant rollback. It also means non-engineers can iterate on prompts through an admin interface rather than needing a code change for every iteration. Prompt iteration alone will not carry a production feature, a point we expand in why prompt engineering alone will not save your LLM product, but slow prompt iteration will definitely sink one.
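The lookup and rollback behavior can be sketched without a database. This is a minimal illustration, not the schema used above: the `PromptRecord` shape, the `resolvePrompt` function, and the convention that a `null` tenant ID marks a global prompt are all assumptions made for the example.

```typescript
// Hypothetical record shape; a null tenantId marks a global prompt.
interface PromptRecord {
  feature: string
  tenantId: string | null
  version: number
  isActive: boolean
  template: string
}

const DEFAULT_SYSTEM_PROMPT = 'You are a helpful assistant.'

// Pick the newest active tenant-specific prompt, then the newest active
// global prompt, then the hard-coded default as a last resort.
function resolvePrompt(
  records: PromptRecord[],
  feature: string,
  tenantId: string,
): string {
  const newestActive = (owner: string | null): PromptRecord | undefined =>
    records
      .filter((r) => r.feature === feature && r.tenantId === owner && r.isActive)
      .sort((a, b) => b.version - a.version)[0]

  return (
    newestActive(tenantId)?.template ??
    newestActive(null)?.template ??
    DEFAULT_SYSTEM_PROMPT
  )
}

const records: PromptRecord[] = [
  { feature: 'benefits-assistant', tenantId: null, version: 1, isActive: true, template: 'global v1' },
  { feature: 'benefits-assistant', tenantId: 'acme', version: 1, isActive: true, template: 'acme v1' },
  { feature: 'benefits-assistant', tenantId: 'acme', version: 2, isActive: true, template: 'acme v2' },
]

// Rollback is a data change: deactivate the bad version and the previous
// active version is served on the next request, with no deployment.
const beforeRollback = resolvePrompt(records, 'benefits-assistant', 'acme')
records[2].isActive = false
const afterRollback = resolvePrompt(records, 'benefits-assistant', 'acme')
const otherTenant = resolvePrompt(records, 'benefits-assistant', 'globex')
```

Keeping the resolution order in one function also gives the admin interface a single place to preview which prompt a given tenant will actually receive.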

Pattern 2: Structured output contracts

LLM outputs are strings. Your application probably wants structured data. The naive approach is to ask the LLM to format its output as JSON and then parse the string. This works until the LLM decides to add a preamble, use single quotes, or include trailing commas. All of these happen in production.

With OpenAI's structured output mode:

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'benefits_answer',
      schema: {
        type: 'object',
        properties: {
          answer: { type: 'string' },
          sourceDocuments: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                title: { type: 'string' },
                section: { type: 'string' },
                relevanceScore: { type: 'number' },
              },
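              required: ['title', 'section', 'relevanceScore'],
              additionalProperties: false, // strict mode requires this on every object
            },
          },
          confidence: { type: 'string', enum: ['high', 'medium', 'low'] },
          cannotAnswer: { type: 'boolean' },
        },
        required: ['answer', 'sourceDocuments', 'confidence', 'cannotAnswer'],
        additionalProperties: false, // strict mode requires this on every object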
      },
      strict: true,
    },
  },
})

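With strict: true, the API constrains generation so that a completed response conforms to the schema. In return, strict mode requires every object in the schema to set additionalProperties: false and to list all of its properties under required. Two cases still need handling: the model can return a refusal instead of content, and a response cut off by the token limit can be incomplete JSON. This is still significantly more reliable than asking the model to produce JSON in the prompt and hoping it complies. For Anthropic's Claude, tool use (function calling) with an input schema achieves a similar effect.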
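Even with schema-constrained output, it is worth validating at the application boundary, since a refusal or a truncated response will not parse into the expected shape. The sketch below mirrors the benefits_answer schema; the `BenefitsAnswer` type and the `parseBenefitsAnswer` function are illustrative names, and the hand-written checks could equally be replaced by a schema validation library.

```typescript
// Types mirror the benefits_answer schema; all names are illustrative.
interface SourceDocument {
  title: string
  section: string
  relevanceScore: number
}

type Confidence = 'high' | 'medium' | 'low'

interface BenefitsAnswer {
  answer: string
  sourceDocuments: SourceDocument[]
  confidence: Confidence
  cannotAnswer: boolean
}

type ParseResult =
  | { ok: true; value: BenefitsAnswer }
  | { ok: false; reason: string }

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v)
}

function isSourceDocument(v: unknown): v is SourceDocument {
  return (
    isRecord(v) &&
    typeof v.title === 'string' &&
    typeof v.section === 'string' &&
    typeof v.relevanceScore === 'number'
  )
}

function isConfidence(v: unknown): v is Confidence {
  return v === 'high' || v === 'medium' || v === 'low'
}

// Parse the raw message content and check it against the contract.
// Returns a typed result instead of throwing, so callers can fall back.
function parseBenefitsAnswer(raw: string | null): ParseResult {
  if (raw === null || raw.trim() === '') {
    return { ok: false, reason: 'empty content (possible refusal)' }
  }
  let data: unknown
  try {
    data = JSON.parse(raw)
  } catch {
    return { ok: false, reason: 'invalid JSON (possibly truncated)' }
  }
  if (!isRecord(data)) return { ok: false, reason: 'not an object' }

  const { answer, sourceDocuments, confidence, cannotAnswer } = data
  if (typeof answer !== 'string') return { ok: false, reason: 'answer missing' }
  if (!Array.isArray(sourceDocuments) || !sourceDocuments.every(isSourceDocument)) {
    return { ok: false, reason: 'sourceDocuments malformed' }
  }
  if (!isConfidence(confidence)) return { ok: false, reason: 'confidence out of range' }
  if (typeof cannotAnswer !== 'boolean') return { ok: false, reason: 'cannotAnswer missing' }

  return { ok: true, value: { answer, sourceDocuments, confidence, cannotAnswer } }
}
```

Returning a result object rather than throwing keeps the failure path explicit: the caller decides whether to retry, degrade, or surface an error.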

Pattern 3: Multi-tenant data isolation

This is the pattern most B2B LLM integrations get wrong, with consequences that range from embarrassing to catastrophic. If multiple customers' data can appear in the same LLM context, in retrieved chunks, in conversation history, or in the prompt, you have a data isolation problem.

The rule: a customer's data must never appear in another customer's LLM context. The failure modes are subtle:

Shared vector store without tenant filtering: searching for "employee benefits policy" returns chunks from all tenants' documents, not just the querying tenant's
Caching at the wrong layer: caching LLM responses based on the query string alone without including the tenant ID in the cache key, so a response generated from Tenant A's data is returned to Tenant B asking the same question

// Every vector store query MUST include the tenant filter
const results = await vectorStore.query({
  vector: queryEmbedding,
  topK: 10,
  filter: {
    tenantId: { $eq: context.tenantId },  // mandatory
  },
})

// Cache key includes tenant ID
const cacheKey = `llm:${context.tenantId}:${hashQuery(queryText)}`

We add a lint rule and a code review checklist item: every vector store query in the codebase must have a tenant filter. The isolation model you chose for the rest of your platform applies here too; if you are on a pooled database with row-level security, the vector store needs the equivalent discipline. Our guide to multi-tenant SaaS architecture covers how these decisions compound.
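A lint rule catches a missing filter at review time; a small wrapper can make the filter hard to omit at runtime. The following is a sketch under assumptions: the `VectorQuery` and `VectorMatch` shapes are hypothetical, loosely modeled on the query above, and `withTenantFilter` and `assertTenantOwned` are illustrative helpers rather than part of any vector store client.

```typescript
// Hypothetical query and match shapes, loosely modeled on the snippet above.
interface VectorQuery {
  vector: number[]
  topK: number
  filter?: Record<string, unknown>
}

interface VectorMatch {
  id: string
  metadata: { tenantId: string }
}

// Build the final query. The tenant clause is written last, so a
// caller-supplied filter can never override or remove it.
function withTenantFilter(tenantId: string, query: VectorQuery): VectorQuery {
  if (tenantId.trim() === '') {
    throw new Error('tenantId is required for every vector query')
  }
  return {
    ...query,
    filter: { ...(query.filter ?? {}), tenantId: { $eq: tenantId } },
  }
}

// Defense in depth: verify what came back before it reaches the prompt.
function assertTenantOwned(tenantId: string, matches: VectorMatch[]): VectorMatch[] {
  const foreign = matches.filter((m) => m.metadata.tenantId !== tenantId)
  if (foreign.length > 0) {
    throw new Error(`isolation violation: ${foreign.length} foreign chunk(s) returned`)
  }
  return matches
}

// Usage: a caller-supplied tenant clause is overwritten, other clauses survive.
const scoped = withTenantFilter('tenant-a', {
  vector: [0.12, 0.98],
  topK: 10,
  filter: { docType: 'policy', tenantId: { $eq: 'tenant-b' } },
})
```

If application code only ever reaches the vector store through a wrapper like this, the lint rule reduces to banning direct imports of the raw client.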

Pattern 4: Cost control by design

LLM API costs at B2B scale are not like other API costs, because they scale with tokens rather than requests. At a few cents per 1,000 tokens, a single feature processing 100,000 requests per month can easily cost $1,000-$5,000 monthly; for example, 100,000 requests at 1,500 tokens each is 150 million tokens, or $3,000 at $0.02 per 1,000 tokens. Without attribution, you cannot distinguish between "this is expensive because our best enterprise customer is using it heavily" and "this is expensive because we have a bug."

// Log every LLM call with full attribution
await db.llmUsage.create({
  data: {
    tenantId: context.tenantId,
    userId: context.userId,
    feature: 'benefits-assistant',
    model: 'gpt-4o',
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    latencyMs: Date.now() - startTime,
    requestId: response.id,
  },
})

// Enforce per-tenant quotas (run this check before making the LLM call)
const tenantUsage = await getMonthlyUsage(context.tenantId)
if (tenantUsage.totalTokens > PLAN_LIMITS[context.plan].monthlyTokens) {
  return { error: 'ai_quota_exceeded', message: 'Monthly AI quota reached. Upgrade to continue.' }
}

Attribution data also answers the pricing question every B2B AI feature eventually faces: whether AI usage is bundled, metered, or gated by plan tier. You cannot price what you do not measure.
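Turning logged token counts into dollars per tenant is a small aggregation. The sketch below assumes usage rows shaped like the log record above; the prices are placeholders, not real provider prices, and `costUsd` and `costByTenant` are illustrative names.

```typescript
// Placeholder prices in USD per 1M tokens. These are NOT real provider
// prices; load current values from configuration.
interface ModelPrice {
  inputPerMTok: number
  outputPerMTok: number
}

const PRICES: Record<string, ModelPrice> = {
  'example-model': { inputPerMTok: 2, outputPerMTok: 8 },
}

interface UsageRow {
  tenantId: string
  feature: string
  model: string
  promptTokens: number
  completionTokens: number
}

// Input and output tokens are priced separately by most providers.
function costUsd(row: UsageRow): number {
  const price = PRICES[row.model]
  if (!price) throw new Error(`no price configured for model ${row.model}`)
  return (
    (row.promptTokens / 1_000_000) * price.inputPerMTok +
    (row.completionTokens / 1_000_000) * price.outputPerMTok
  )
}

// Sum cost per tenant so a spike can be traced to a customer or a bug.
function costByTenant(rows: UsageRow[]): Record<string, number> {
  const totals: Record<string, number> = {}
  for (const row of rows) {
    totals[row.tenantId] = (totals[row.tenantId] ?? 0) + costUsd(row)
  }
  return totals
}

const monthly = costByTenant([
  { tenantId: 'acme', feature: 'benefits-assistant', model: 'example-model', promptTokens: 1_000_000, completionTokens: 500_000 },
  { tenantId: 'globex', feature: 'benefits-assistant', model: 'example-model', promptTokens: 250_000, completionTokens: 0 },
])
```

The same aggregation grouped by feature instead of tenant shows which features, rather than which customers, drive the bill.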

Pattern 5: Async processing for slow operations

A GPT-4o call with a 2,000-token context and a 500-token response typically takes 3-8 seconds. Generating a 2,000-word document can take 30+ seconds. Holding an HTTP connection open that long is poor UX and fragile: load balancers and proxies with shorter timeouts will kill the connection.

For operations over a few seconds, use an async processing pattern: accept the request and return a job ID immediately (202 Accepted), queue the LLM work, and notify the client when done via WebSocket, SSE, or webhook.

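// API handler: immediate 202
export async function POST(req: Request) {
  // getRequestContext is an assumed helper that resolves the tenant and user
  // from the authenticated session; never take these IDs from the request body
  const context = await getRequestContext(req)
  const { query } = await req.json()
  // Bull-style queues resolve add() to a job object, so read the ID from it
  const job = await queue.add('benefits-query', {
    tenantId: context.tenantId,
    userId: context.userId,
    query,
  })
  return Response.json({ jobId: job.id, status: 'queued' }, { status: 202 })
}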

// Worker: runs async, persists result, notifies the client
queue.process('benefits-query', async (job) => {
  const result = await runBenefitsQuery(job.data)
  await db.aiResults.upsert({
    where: { jobId: job.id },
    create: { jobId: job.id, ...result },
    update: { ...result },
  })
  await notifyClient(job.data.userId, { jobId: job.id, status: 'complete' })
})

The queue also gives you retry semantics, rate limiting against provider quotas, and a natural place to implement priority (interactive requests ahead of batch jobs).
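Clients that cannot hold a WebSocket or SSE connection need a way to fetch the result by job ID, and that lookup has to respect the isolation rule from Pattern 3. A minimal sketch, assuming a hypothetical `JobRecord` that stores the tenant ID at enqueue time; `getJobStatus` and the `findJob` lookup are illustrative, not part of any queue library.

```typescript
type JobStatus = 'queued' | 'running' | 'complete' | 'failed'

// Hypothetical job record; tenantId is stored when the job is enqueued.
interface JobRecord {
  jobId: string
  tenantId: string
  status: JobStatus
  result?: unknown
}

type StatusResponse =
  | { httpStatus: 200; body: { jobId: string; status: JobStatus; result?: unknown } }
  | { httpStatus: 404; body: { error: 'job_not_found' } }

// Return 404 for both unknown jobs and jobs owned by another tenant,
// so the endpoint does not reveal that a foreign job ID exists.
function getJobStatus(
  findJob: (jobId: string) => JobRecord | undefined,
  jobId: string,
  tenantId: string,
): StatusResponse {
  const job = findJob(jobId)
  if (!job || job.tenantId !== tenantId) {
    return { httpStatus: 404, body: { error: 'job_not_found' } }
  }
  // Only attach the result once the job has finished successfully.
  const body =
    job.status === 'complete'
      ? { jobId: job.jobId, status: job.status, result: job.result }
      : { jobId: job.jobId, status: job.status }
  return { httpStatus: 200, body }
}

// In-memory stand-in for the results table.
const jobs: JobRecord[] = [
  { jobId: 'j1', tenantId: 'acme', status: 'complete', result: { answer: 'Dental is covered.' } },
  { jobId: 'j2', tenantId: 'acme', status: 'running' },
]
const findJob = (id: string): JobRecord | undefined => jobs.filter((j) => j.jobId === id)[0]
```

Passing the lookup in as a function keeps the ownership check independent of where results are stored.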

Pattern 6: Graceful degradation and fallbacks

LLM providers have outages, rate limits, and latency spikes. A production B2B feature needs a defined behavior for every one of those, decided before launch rather than during the incident.

Provider fallback: if the primary model is unavailable, fail over to a second provider with an equivalent prompt and schema. Keep the abstraction thin; provider schemas differ more than the marketing suggests.
Capability fallback: if the smart path fails, degrade to a simpler deterministic path (keyword search instead of semantic answer) with honest UI labeling, rather than erroring out.
Refusal as a feature: when retrieval confidence is low, the correct output is "I cannot answer this from your documents," with an escalation path. For accuracy-critical domains this is non-negotiable; the full approach is in the anti-hallucination stack.
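The three fallbacks above compose into a single ordered chain. This is a sketch under assumptions: the `Provider` interface, the `keywordSearch` callback, and the `servedBy` label are hypothetical, and a real implementation would also apply timeouts and distinguish retryable from non-retryable errors.

```typescript
interface Answer {
  text: string
  servedBy: string   // which path produced the answer, for honest UI labeling
  degraded: boolean  // true when no LLM provider produced the answer
}

// Hypothetical provider abstraction; each provider wraps its own client.
interface Provider {
  name: string
  ask: (query: string) => Promise<string>
}

async function answerWithFallback(
  query: string,
  providers: Provider[],
  keywordSearch: (query: string) => string[],
): Promise<Answer> {
  // 1. Provider fallback: try each provider in order.
  for (const provider of providers) {
    try {
      const text = await provider.ask(query)
      return { text, servedBy: provider.name, degraded: false }
    } catch {
      // Outage, rate limit, or timeout: move on to the next provider.
    }
  }
  // 2. Capability fallback: deterministic keyword search, labeled as such.
  const hits = keywordSearch(query)
  if (hits.length > 0) {
    return {
      text: `Related documents: ${hits.join(', ')}`,
      servedBy: 'keyword-search',
      degraded: true,
    }
  }
  // 3. Refusal: an explicit "cannot answer" instead of an error page.
  return {
    text: 'I cannot answer this from your documents.',
    servedBy: 'refusal',
    degraded: true,
  }
}
```

The `servedBy` field is what lets the UI tell the user which path answered, and what lets the usage log show how often each fallback fires.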

Pattern 7: Evaluation before and after shipping

B2B customers will ask what your accuracy is, and "it seems good" is not an answer procurement accepts. Build an evaluation set from real (permissioned, de-identified) usage: representative queries with reviewed expected answers. Run it on every prompt change, model upgrade, and retrieval tweak. Track answer accuracy, citation correctness, and refusal correctness separately; a model that answers more questions by citing worse sources is a regression, not an improvement.

Post-launch, sample production outputs for human review on a schedule, and wire user feedback (thumbs down, corrections, support tickets tagged to AI answers) back into the evaluation set. The evaluation loop is what turns an impressive demo into a feature that survives an enterprise renewal conversation. If the feature answers from customer documents, the evaluation loop rides on retrieval quality, which is where production RAG pipeline design does the heavy lifting.
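Tracking the three metrics separately is straightforward once each evaluation case records whether the correct behavior is to answer or to refuse. A minimal sketch: the `EvalCase` and `ModelOutput` shapes are assumptions, and the `judge` callback stands in for whatever answer-grading method you use (human review, a rubric, or a grader model).

```typescript
interface EvalCase {
  query: string
  expectedAnswer: string | null // null means the correct behavior is to refuse
  expectedSources: string[]
}

interface ModelOutput {
  answer: string
  citedSources: string[]
  cannotAnswer: boolean
}

interface EvalReport {
  answerAccuracy: number      // correct answers / cases that should be answered
  citationCorrectness: number // correctly cited answers / cases that should be answered
  refusalCorrectness: number  // refusals / cases that should be refused
}

// An empty bucket is reported as 0 so it is not mistaken for a perfect score.
function ratio(hits: number, total: number): number {
  return total === 0 ? 0 : hits / total
}

function evaluate(
  cases: EvalCase[],
  outputs: ModelOutput[],
  judge: (expected: string, actual: string) => boolean,
): EvalReport {
  if (cases.length !== outputs.length) {
    throw new Error('each case needs exactly one output')
  }
  let answerable = 0
  let answeredCorrectly = 0
  let citedCorrectly = 0
  let refusable = 0
  let refusedCorrectly = 0

  cases.forEach((c, i) => {
    const out = outputs[i]
    if (c.expectedAnswer === null) {
      refusable++
      if (out.cannotAnswer) refusedCorrectly++
      return
    }
    answerable++
    if (out.cannotAnswer) return
    if (judge(c.expectedAnswer, out.answer)) answeredCorrectly++
    // Citations count only if at least one is given and all are expected.
    const allExpected = out.citedSources.every((s) => c.expectedSources.indexOf(s) !== -1)
    if (out.citedSources.length > 0 && allExpected) citedCorrectly++
  })

  return {
    answerAccuracy: ratio(answeredCorrectly, answerable),
    citationCorrectness: ratio(citedCorrectly, answerable),
    refusalCorrectness: ratio(refusedCorrectly, refusable),
  }
}
```

Running this on every prompt, model, or retrieval change and comparing the three numbers against the previous run is what exposes the "answers more, cites worse" regression described above.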

Where teams actually stumble

None of these patterns is individually difficult. The failure mode is shipping patterns 1 and 2 (the visible demo) while deferring 3 through 7 (the production substrate) to "after launch." After launch never comes; the roadmap moves on, and the feature accumulates isolation risks and unattributed costs until an enterprise security review or a billing surprise forces the retrofit at ten times the price.

If you are scoping an LLM feature for a B2B product and want the production substrate designed in from the start, that is exactly what our AI and automation practice does, from architecture review through shipped feature.
