AI Engineering · RAG · Knowledge Retrieval

Building a RAG pipeline that answers from real content, not hallucinations

Generic AI chatbots give confident wrong answers. We built a production RAG pipeline with semantic search, BM25 hybrid retrieval, and anti-hallucination guardrails that answers only from what it actually knows.

Client: SaaS platform (MyQRGuide)

Stack: Laravel, PHP, MySQL, OpenAI text-embedding-3-large, GPT-4, BM25, Pinecone

[Figure: RAG Pipeline for AI Knowledge Assistant architecture diagram]

  • Retrieval hit rate (relevant source in top 5): >80%
  • Questions answered from grounded context: >70%
  • Empty context when data exists: <5%
  • Average RAG confidence score: >0.50

The challenge

Most AI chatbots share the same fatal flaw: they hallucinate. The client needed an AI assistant embedded across thousands of seller mini-sites, each with different content. The AI needed to answer accurately from that specific seller's content, not generic knowledge. Off-the-shelf AI integrations had no source grounding, no confidence thresholds, and no way to know when the AI was guessing.

  • Hallucinated product names, prices, and URLs that did not exist
  • No source grounding: answers were not traceable to actual content
  • Generic fallback answers when questions got specific
  • No confidence scoring: no way to detect when AI was guessing
  • Pinecone retrieval returning irrelevant chunks with no quality threshold
What we built
  • MySQL vector store (guidy_knowledge_chunks) with cosine similarity search, replacing Pinecone for read-heavy workloads
  • OpenAI text-embedding-3-large (3072 dimensions) for higher semantic accuracy on domain-specific content
  • Sentence-aware chunking (~2,800 chars, 350 overlap) preserving context across chunk boundaries
  • Hybrid BM25 + semantic retrieval: dense vectors for paraphrased queries, BM25 for exact product names
  • MMR reranking: top-50 candidates reranked to 8-10 maximizing relevance and diversity
  • Anti-hallucination prompt rules: answer only from context, cite sources, return "not in knowledge base" below 0.45 confidence
  • RAG temperature set to 0.1 for near-deterministic factual answers
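
Two of the steps listed above, sentence-aware chunking and MMR reranking, can be sketched roughly as follows. This is a minimal Python illustration, not the production Laravel code: the function names, the regex sentence splitter, and the lam weight of 0.7 are assumptions made for the example. Only the chunk size (about 2,800 characters), the 350-character overlap, and the 8-10 result target come from the case study.

```python
import math
import re


def chunk_sentences(text, max_chars=2800, overlap=350):
    """Pack whole sentences into chunks of roughly max_chars characters.

    Each new chunk starts with the trailing sentences of the previous one
    (up to about `overlap` characters) so context survives the boundary.
    A single sentence longer than max_chars is kept whole, not split.
    """
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate_len = len(" ".join(current + [sentence]))
        if current and candidate_len > max_chars:
            chunks.append(" ".join(current))
            # Carry trailing sentences into the next chunk as overlap.
            carried, size = [], 0
            for prev in reversed(current):
                if size + len(prev) > overlap:
                    break
                carried.insert(0, prev)
                size += len(prev) + 1
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(query_vec, candidates, k=8, lam=0.7):
    """Greedy Maximal Marginal Relevance over (chunk_id, vector) pairs.

    Each step picks the candidate that maximises
    lam * sim(query, c) - (1 - lam) * max sim(c, already selected).
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [chunk_id for chunk_id, _ in selected]
```

In a pipeline shaped like the one described, chunk_sentences would run at ingestion time, and mmr_rerank would take the top-50 retrieval candidates and return the 8-10 chunks passed to the model. A lower lam favors diversity; a higher one favors raw relevance.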

The switch from Pinecone to a MySQL vector store was both the most counterintuitive decision in this project and the most impactful. Pinecone is the obvious choice for production RAG: it is purpose-built for vector similarity search at scale. But the read pattern for Guidy, the assistant, is highly concentrated: most queries on any given mini-site hit the same 15–20 knowledge chunks repeatedly. On this workload, a well-indexed MySQL cosine similarity query is faster than a Pinecone call because it benefits from buffer pool caching in a way a remote API cannot. The cost difference is also significant at the request volume we were operating at.

The hybrid BM25 + semantic retrieval layer addressed a specific failure mode we observed in testing: dense vector search alone misses exact product names and SKU codes when the query phrasing deviates from the training distribution. BM25 retrieves by keyword overlap regardless of semantic similarity. Running both in parallel and reranking on the combined score recovered a class of failures that semantic-only retrieval could not handle.
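
To make the hybrid step concrete, here is a small Python sketch of combining BM25 and dense scores. It is an illustration under stated assumptions: the case study does not say how the two scores were combined, so the min-max normalization, the alpha weight, and all function and field names here are hypothetical. The BM25 formula is the standard Okapi variant with common default parameters.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenised document against the query terms."""
    n = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def hybrid_rank(query_terms, query_vec, chunks, alpha=0.5):
    """Rank chunks by alpha * dense + (1 - alpha) * BM25 (both normalised).

    chunks: list of dicts with 'id', 'tokens' and 'vector' keys (assumed shape).
    Returns chunk ids, best first.
    """
    sparse = min_max(bm25_scores(query_terms, [c["tokens"] for c in chunks]))
    dense = min_max([cosine(query_vec, c["vector"]) for c in chunks])
    combined = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    order = sorted(range(len(chunks)), key=lambda i: combined[i], reverse=True)
    return [chunks[i]["id"] for i in order]
```

With alpha=1.0 this degenerates to dense-only ranking, which is a convenient way to reproduce the exact-match failure described above and compare it against the hybrid ordering.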

The results
  • Retrieval hit rate (relevant source in top 5): >80%
  • RAG path vs general fallback: >70%
  • Empty context when data exists: <5%
  • Average RAG confidence score: >0.50

All four production metrics hit their targets. The hallucination rate was meaningfully lower than in the baseline generic AI implementation.
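
For readers who want to track the first metric themselves, the sketch below shows one common way to compute a retrieval hit rate at top 5 from a labeled evaluation set. The function name and the input shape are assumptions; the case study does not describe how the evaluation was actually run.

```python
def hit_rate_at_k(results, k=5):
    """Share of queries with at least one relevant source in the top k.

    results: list of (retrieved_ids, relevant_ids) pairs, where retrieved_ids
    is ordered best first and relevant_ids is the labeled ground truth.
    """
    if not results:
        return 0.0
    hits = sum(
        1 for retrieved, relevant in results
        if set(retrieved[:k]) & set(relevant)
    )
    return hits / len(results)
```

A value above 0.80 on a representative query set would correspond to the ">80%" target listed above.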

The outcome

Guidy went from a generic chatbot that hallucinated confidently to a grounded knowledge assistant that answers accurately from real content and says "I don't know" when it doesn't. The pipeline architecture (chunking, embedding, MySQL vector store, and hybrid retrieval) is now a repeatable internal capability we deploy for healthcare document Q&A, enterprise knowledge bases, and e-commerce product assistants.

The most useful engineering decision we made was setting the confidence threshold before we started tuning retrieval quality. Defining what counts as a "not in knowledge base" response, and holding to it, forced us to improve retrieval rather than paper over failures with fallback generation. Teams that skip confidence thresholds end up with AI that sounds confident about things it does not actually know. The threshold is not a safety feature bolted on at the end; it is the primary quality signal that drives every other tuning decision in a production RAG system.
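
A threshold gate of this kind might look like the following Python sketch. The 0.45 cut-off, the "not in knowledge base" response, and the 0.1 temperature come from the case study; everything else is an assumption for illustration, including the names, the prompt wording, and in particular the choice to treat the best retrieval score as the confidence value.

```python
from dataclasses import dataclass

NOT_IN_KB = "not in knowledge base"
CONFIDENCE_THRESHOLD = 0.45  # from the case study
RAG_TEMPERATURE = 0.1        # from the case study


@dataclass
class RetrievedChunk:
    source: str   # hypothetical: a page URL or document title
    text: str
    score: float  # retrieval score, assumed to lie in [0, 1]


def plan_answer(question, chunks, threshold=CONFIDENCE_THRESHOLD):
    """Decide between a grounded prompt and an explicit refusal."""
    # Assumption: confidence is the best retrieval score among the chunks.
    confidence = max((c.score for c in chunks), default=0.0)
    if confidence < threshold:
        # Refuse instead of falling back to ungrounded generation.
        return {"grounded": False, "confidence": confidence,
                "answer": NOT_IN_KB, "prompt": None}
    context = "\n\n".join(f"[{c.source}] {c.text}" for c in chunks)
    prompt = (
        "Answer only from the context below and cite the source in brackets. "
        f'If the context does not contain the answer, reply "{NOT_IN_KB}".\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return {"grounded": True, "confidence": confidence, "answer": None,
            "prompt": prompt, "temperature": RAG_TEMPERATURE}
```

Because the refusal path never reaches the model, every "not in knowledge base" response points back at a retrieval gap, which is what makes the threshold usable as a tuning signal.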
