RAG Pipeline for AI Knowledge Assistant

The challenge

Most AI chatbots share the same fatal flaw: they hallucinate. The client needed an AI assistant embedded across thousands of seller mini-sites, each with different content. The AI needed to answer accurately from that specific seller's content, not generic knowledge. Off-the-shelf AI integrations had no source grounding, no confidence thresholds, and no way to know when the AI was guessing.

Hallucinated product names, prices, and URLs that did not exist
No source grounding: answers were not traceable to actual content
Generic fallback answers when questions got specific
No confidence scoring: no way to detect when AI was guessing
Pinecone retrieval returning irrelevant chunks with no quality threshold

What we built

MySQL vector store (guidy_knowledge_chunks) with cosine similarity search, replacing Pinecone for read-heavy workloads
OpenAI text-embedding-3-large (3072 dimensions) for higher semantic accuracy on domain-specific content
Sentence-aware chunking (~2,800 chars, 350 overlap) preserving context across chunk boundaries
Hybrid BM25 + semantic retrieval: dense vectors for paraphrased queries, BM25 for exact product names
MMR reranking: top-50 candidates reranked to 8-10 maximizing relevance and diversity
Anti-hallucination prompt rules: answer only from context, cite sources, return "not in knowledge base" below 0.45 confidence
RAG temperature set to 0.1 for near-deterministic factual answers
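
The sentence-aware chunking step can be sketched roughly as follows. This is a minimal illustration, not the production code: the 2,800-character target and 350-character overlap come from the list above, while the regex sentence splitter and the function names `split_sentences` and `chunk_text` are assumptions made for the example.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_text(text: str, max_chars: int = 2800, overlap: int = 350) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters,
    repeating up to `overlap` characters of trailing sentences in the next chunk."""
    chunks: list[str] = []
    current: list[str] = []
    length = 0  # length of " ".join(current)
    for sentence in split_sentences(text):
        added = len(sentence) + (1 if current else 0)
        if current and length + added > max_chars:
            chunks.append(" ".join(current))
            # Carry over whole trailing sentences that fit in the overlap budget.
            tail: list[str] = []
            tail_len = 0
            for prev in reversed(current):
                cost = len(prev) + (1 if tail else 0)
                if tail_len + cost > overlap:
                    break
                tail.insert(0, prev)
                tail_len += cost
            # Drop the overlap if it would push the new chunk over the limit.
            if tail and tail_len + 1 + len(sentence) > max_chars:
                tail, tail_len = [], 0
            current, length = tail, tail_len
            added = len(sentence) + (1 if current else 0)
        current.append(sentence)
        length += added
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because sentences are never split, a single sentence longer than `max_chars` would still become its own oversized chunk in this sketch; a real pipeline would need a fallback for that case.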

The switch from Pinecone to a MySQL vector store was the most counterintuitive decision in this project, and the most impactful. Pinecone is the obvious choice for production RAG: it is purpose-built for vector similarity search at scale. But the read pattern for Guidy, the assistant, is highly concentrated: most queries on any given mini-site hit the same 15–20 knowledge chunks repeatedly. On this workload, a well-indexed cosine similarity query in MySQL is faster than a Pinecone call because it benefits from buffer pool caching in a way a remote API call cannot. The cost difference was also significant at the request volume we were operating at.

The hybrid BM25 + semantic retrieval layer addressed a specific failure mode we observed in testing: dense vector search alone misses exact product names and SKU codes when the query phrasing deviates from what the embedding model saw in training. BM25 retrieves by keyword overlap regardless of semantic similarity. Running both in parallel and reranking on the combined score recovered a class of failures that semantic-only retrieval could not handle.
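
To make the hybrid idea concrete, here is a small self-contained sketch that scores documents with BM25 and with cosine similarity, normalizes both, and ranks on a weighted sum. The fusion rule (min-max normalization and an `alpha` weight), the BM25 parameters, and all function names are assumptions for illustration; the case study only states that both retrievers run in parallel and that results are reranked on a combined score.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Plain BM25 over a small in-memory corpus (standard k1/b defaults)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def min_max(values: list[float]) -> list[float]:
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_rank(query, query_vec, docs, doc_vecs, alpha: float = 0.5) -> list[int]:
    """Return document indices, best first. alpha=1.0 is dense-only, 0.0 is BM25-only."""
    sparse = min_max(bm25_scores(query, docs))
    dense = min_max([cosine(query_vec, v) for v in doc_vecs])
    fused = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)

# Toy data: the 2-d vectors stand in for real embeddings and are made up.
docs = [
    "Blue ceramic mug, SKU MX-4471, dishwasher safe.",
    "Red ceramic mug with a matching saucer.",
    "Shipping and returns policy for all orders.",
]
doc_vecs = [[0.8, 0.6], [0.9, 0.1], [0.0, 1.0]]
query, query_vec = "Do you have MX-4471 in stock?", [1.0, 0.0]

dense_only = hybrid_rank(query, query_vec, docs, doc_vecs, alpha=1.0)
hybrid = hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5)
```

In this toy setup the dense scores alone prefer the wrong mug, while the exact SKU match from BM25 pulls the right document to the top once the two scores are combined.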

The results

Retrieval hit rate (relevant source in top 5): >80%
RAG path vs general fallback: >70%
Empty context when data exists: <5%
Average RAG confidence score: >0.50

All four production metrics were achieved. The hallucination rate was meaningfully reduced compared with the baseline generic AI implementation.
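
For readers who want to track the same retrieval metric, the hit rate (relevant source in the top 5) can be computed along these lines. This is a sketch under assumptions: it presumes each evaluation query has a hand-labelled set of relevant chunk IDs, and the function name and data shapes are illustrative rather than taken from the project.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant chunk ID in the top k results.

    retrieved[i] is the ranked list of chunk IDs returned for query i;
    relevant[i] is the set of chunk IDs labelled as relevant for query i.
    """
    if not retrieved:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(retrieved, relevant)
        if any(chunk_id in gold for chunk_id in ranked[:k])
    )
    return hits / len(retrieved)
```

A result above 0.80 on a labelled query set would correspond to the first target listed above.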

The outcome

Guidy went from a generic chatbot that hallucinated confidently to a grounded knowledge assistant that answers accurately from real content and says "I don't know" when the content does not cover the question. The pipeline architecture (chunking, embedding, MySQL vector store, and hybrid retrieval) is now a repeatable internal capability we deploy for healthcare document Q&A, enterprise knowledge bases, and e-commerce product assistants.

The most useful engineering decision we made was setting the confidence threshold before we started tuning retrieval quality. Defining what counts as a "not in knowledge base" response, and holding to it, forced us to improve retrieval rather than paper over failures with fallback generation. Teams that skip confidence thresholds end up with AI that sounds confident about things it does not actually know. The threshold is not a safety feature bolted on at the end; it is the primary quality signal that drives every other tuning decision in a production RAG system.
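
A threshold gate of this kind might look like the sketch below. The 0.45 cutoff, the "not in knowledge base" response, and the 0.1 temperature come from this case study; everything else is an assumption for illustration, including the use of the top retrieval score as the confidence value, the `RetrievedChunk` type, and the prompt wording.

```python
from dataclasses import dataclass

NOT_FOUND = "not in knowledge base"
CONFIDENCE_THRESHOLD = 0.45  # cutoff stated in the case study
RAG_TEMPERATURE = 0.1        # temperature stated in the case study

@dataclass
class RetrievedChunk:
    source_id: str
    text: str
    score: float  # assumed to be a similarity score in [0, 1]

def build_grounded_request(question: str, chunks: list[RetrievedChunk],
                           threshold: float = CONFIDENCE_THRESHOLD) -> dict:
    """Refuse below the threshold; otherwise build a prompt from qualifying chunks only."""
    # Assumption: confidence is the best retrieval score. The real system may differ.
    confidence = max((c.score for c in chunks), default=0.0)
    if confidence < threshold:
        # No generation call is made: the refusal is returned directly.
        return {"answer": NOT_FOUND, "confidence": confidence,
                "prompt": None, "temperature": None}
    context = "\n\n".join(
        f"[{c.source_id}] {c.text}" for c in chunks if c.score >= threshold
    )
    prompt = (
        "Answer only from the context below and cite the source IDs you used. "
        f'If the context does not contain the answer, reply "{NOT_FOUND}".\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return {"answer": None, "confidence": confidence,
            "prompt": prompt, "temperature": RAG_TEMPERATURE}
```

Because the refusal happens before any generation call, weak retrieval shows up as a visible "not in knowledge base" response instead of being hidden behind a fluent fallback answer, which is the pressure toward better retrieval described above.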
