Most AI chatbots share the same fatal flaw: they hallucinate. The client needed an AI assistant, Guidy, embedded across thousands of seller mini-sites, each with different content, and it had to answer accurately from that specific seller's content rather than from generic knowledge. Off-the-shelf AI integrations offered no source grounding, no confidence thresholds, and no way to tell when the AI was guessing.
- Hallucinated product names, prices, and URLs that did not exist
- No source grounding: answers were not traceable to actual content
- Generic fallback answers when questions got specific
- No confidence scoring: no way to detect when AI was guessing
- Pinecone retrieval returning irrelevant chunks with no quality threshold

The rebuilt pipeline addressed these failures with:
- MySQL vector store (guidy_knowledge_chunks) with cosine similarity search, replacing Pinecone for read-heavy workloads
- OpenAI text-embedding-3-large (3072 dimensions) for higher semantic accuracy on domain-specific content
- Sentence-aware chunking (~2,800 chars, 350 overlap) preserving context across chunk boundaries
- Hybrid BM25 + semantic retrieval: dense vectors for paraphrased queries, BM25 for exact product names
- MMR reranking: top-50 candidates reranked to 8-10 maximizing relevance and diversity
- Anti-hallucination prompt rules: answer only from context, cite sources, return "not in knowledge base" below 0.45 confidence
- RAG temperature set to 0.1 for near-deterministic factual answers
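
The MMR reranking step listed above can be illustrated with a short sketch. This is not the production code: the function names (`cosine`, `mmr_rerank`) and the `lambda_weight` of 0.7 are assumptions made for illustration. Only the shape of the step (a pool of about 50 candidates reduced to 8 to 10 results that balance relevance and diversity) comes from the description above.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_rerank(
    query_vec: Sequence[float],
    candidate_vecs: List[Sequence[float]],
    k: int = 8,
    lambda_weight: float = 0.7,  # assumed trade-off; not stated in the write-up
) -> List[int]:
    """Greedy maximal marginal relevance; returns indices of up to k candidates.

    Each step picks the candidate that maximizes
        lambda * sim(query, c) - (1 - lambda) * max sim(c, already selected).
    """
    relevance = [cosine(query_vec, c) for c in candidate_vecs]
    remaining = list(range(len(candidate_vecs)))
    selected: List[int] = []
    while remaining and len(selected) < k:
        best_idx, best_score = remaining[0], -math.inf
        for i in remaining:
            # Similarity to the closest already-selected chunk (0.0 for the first pick).
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_weight * relevance[i] - (1 - lambda_weight) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected
```

With two near-duplicate chunks and one distinct chunk, plain similarity ranking would return both duplicates first; the redundancy penalty lets the distinct chunk take the second slot instead, which is the behavior a small context window benefits from.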
The switch from Pinecone to a MySQL vector store was the most counterintuitive decision in this project and the most impactful. Pinecone is the obvious choice for production RAG: it is purpose-built for vector similarity search at scale. But the read pattern for Guidy is highly concentrated: most queries on any given mini-site hit the same 15–20 knowledge chunks repeatedly. On this workload, a well-indexed MySQL cosine similarity query is faster than a Pinecone call because it benefits from buffer pool caching in a way a remote API cannot. The cost difference is also significant at the request volume we were operating at.
The hybrid BM25 + semantic retrieval layer addressed a specific failure mode we observed in testing: dense vector search alone misses exact product names and SKU codes when the query phrasing deviates from the embedding model's training distribution. BM25 retrieves by keyword overlap regardless of semantic similarity. Running both in parallel and reranking on the combined score recovered a class of failures that semantic-only retrieval could not handle.
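
To make the combined-score idea concrete, here is a minimal sketch of BM25 and dense score fusion. Everything specific in it is an assumption for illustration: the tokenizer, the BM25 parameters (`k1`, `b`), min-max normalization, and the equal `alpha` weighting are not taken from the project. The dense scores are passed in as plain numbers; in a real pipeline they would come from cosine similarity over the stored embeddings.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumerics, so 'MG-2042' becomes ['mg', '2042']."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every document for the query (assumed parameters)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def min_max(values: List[float]) -> List[float]:
    """Scale scores to [0, 1] so BM25 and cosine scores are comparable."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(bm25: List[float], dense: List[float], alpha: float = 0.5) -> List[int]:
    """Document indices sorted by alpha * dense + (1 - alpha) * bm25, best first."""
    b_norm, d_norm = min_max(bm25), min_max(dense)
    combined = [alpha * d + (1 - alpha) * s for s, d in zip(b_norm, d_norm)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical chunks and dense scores, invented for this example.
    docs = [
        "Blue ceramic mug, SKU MG-2041, dishwasher safe",
        "Red ceramic mug, SKU MG-2042, hand wash only",
        "Shipping takes three to five business days",
    ]
    dense = [0.62, 0.60, 0.20]  # dense search slightly prefers the wrong SKU
    print(hybrid_rank(bm25_scores("MG-2042", docs), dense))
```

In the example, dense similarity alone ranks the MG-2041 chunk first because the two product descriptions are nearly identical semantically; the exact token match on 2042 is what moves the right chunk to the top after fusion.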
The four production metrics:
- Retrieval hit rate (relevant source in top 5)
- RAG path vs general fallback
- Empty context when data exists
- Average RAG confidence score
All four production metrics were achieved, and the hallucination rate was meaningfully reduced against the baseline generic AI implementation.
Guidy went from a generic chatbot that hallucinated confidently to a grounded knowledge assistant that answers accurately from real content and says "I don't know" when it doesn't. The pipeline architecture (chunking, embedding, MySQL vector store, and hybrid retrieval) is now a repeatable internal capability we deploy for healthcare document Q&A, enterprise knowledge bases, and e-commerce product assistants.
The most useful engineering decision we made was setting the confidence threshold before we started tuning retrieval quality. Defining what counts as a "not in knowledge base" response, and holding to it, forced us to improve retrieval rather than paper over failures with fallback generation. Teams that skip confidence thresholds end up with AI that sounds confident about things it does not actually know. The threshold is not a safety feature bolted on at the end; it is the primary quality signal that drives every other tuning decision in a production RAG system.
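
A threshold gate of this kind can be sketched in a few lines. The 0.45 cutoff and the "not in knowledge base" response come from the description above; everything else is an assumption for illustration, including the `RetrievedChunk` type, the `generate` callback, and in particular the definition of confidence as the best chunk similarity, since the write-up does not say how the score is computed.

```python
from typing import Callable, Dict, List, NamedTuple

CONFIDENCE_THRESHOLD = 0.45  # cutoff stated in the write-up
NOT_IN_KB = "not in knowledge base"


class RetrievedChunk(NamedTuple):
    """Hypothetical shape of one retrieved chunk."""
    text: str
    source_url: str
    score: float  # similarity to the query, assumed to lie in [0, 1]


def retrieval_confidence(chunks: List[RetrievedChunk]) -> float:
    """Assumed definition: best similarity among the chunks, 0.0 if none were retrieved."""
    return max((c.score for c in chunks), default=0.0)


def answer_or_refuse(
    question: str,
    chunks: List[RetrievedChunk],
    generate: Callable[[str, List[RetrievedChunk]], str],
) -> Dict[str, object]:
    """Call the model only when retrieval clears the threshold; otherwise refuse."""
    confidence = retrieval_confidence(chunks)
    if confidence < CONFIDENCE_THRESHOLD:
        # No fallback generation: a weak retrieval surfaces as an explicit refusal.
        return {"answer": NOT_IN_KB, "confidence": confidence, "sources": []}
    return {
        "answer": generate(question, chunks),
        "confidence": confidence,
        "sources": [c.source_url for c in chunks],
    }
```

Because the gate runs before generation, every refusal shows up as a measurable retrieval failure rather than being hidden behind a plausible-sounding answer, which is what makes the threshold usable as a tuning signal.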


