How We Built and Evaluated AI Chatbots with Self-Hosted n8n and LangSmith

Introduction: Why Chatbot Evaluation Is No Longer Optional
The first wave of AI chatbots focused on response generation. The second wave must focus on reliability.
At Framework Friday, we build agentic systems that don’t just talk — they trace, evaluate, and improve themselves.
In this post, we share how our Tiger Team built a production-ready chatbot stack using self-hosted n8n for orchestration and LangSmith for evaluation and observability. It’s modular, private, and performance-verified — built for operators who care about what’s actually happening under the hood.
From Automation to Observability: The Shift in Chatbot Design
Today’s LLM apps are complex, multi-step systems. Without full evaluation, you risk:
- Burned tokens and unknown costs
- Hallucinated answers that go unnoticed
- Manual QA that can’t scale
- No way to prove ROI
That’s why we built evaluation into the system from the start — not as an afterthought.
The Framework Friday Stack
Our chatbot system is structured around five core layers:
- n8n (self-hosted via Docker): Acts as the central workflow engine, orchestrating logic, memory, and external tools.
- LangSmith: Provides full evaluation and observability — tracing every step, scoring responses, and logging token usage.
- OpenAI GPT-4 (with optional Ollama fallback): Powers the assistant's natural language responses, tuned for accuracy with a low temperature.
- Supabase: Hosts our vector store, which holds embedded documents and supports high-precision retrieval.
- Session-based memory (10-turn buffer): Maintains conversational context across multiple messages, scoped by user session ID.
This modular setup gives us control, visibility, and performance — without relying on cloud-native SaaS stacks.
Implementation: How We Built It
1. Self-Hosting n8n with LangSmith Integration
We used Docker Desktop to deploy n8n locally, exposing port 5678 and mapping volumes for persistence.
Key environment variables connected the system to LangSmith:
```
LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=your_key
LANGCHAIN_PROJECT=chatbot-evaluation
```
This gave us a GUI-accessible flow builder with built-in trace logging for every agent run.
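For reference, a minimal launch command looks like the sketch below. It follows n8n's standard Docker instructions; the volume name and detached-run flags are our own conventions, and the LangSmith variables are simply passed into the container (substitute your real key).

```bash
# Create a named volume so workflows and credentials survive restarts.
docker volume create n8n_data

# Run n8n on port 5678 with the LangSmith tracing variables set.
docker run -d --name n8n \
  -p 5678:5678 \
  -v n8n_data:/home/node/.n8n \
  -e LANGCHAIN_TRACING_V2=true \
  -e LANGCHAIN_ENDPOINT=https://api.smith.langchain.com \
  -e LANGCHAIN_API_KEY=your_key \
  -e LANGCHAIN_PROJECT=chatbot-evaluation \
  docker.n8n.io/n8nio/n8n
```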
2. Vector Search from Private Docs
Knowledge ingestion pipeline:
Google Drive → Data Loader → Chunking → Embeddings → Supabase Vector Store
Optimization decisions:
- Chunk size: 1000 characters
- Overlap: 200 characters
- Retrieval: Top 5 results, threshold ≥ 0.8
- Metadata: File source, section title, date
This setup delivered high-relevance context without noise.
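The pipeline itself runs as n8n nodes, but the same settings are easy to express in code. Here is a minimal Python sketch using LangChain's Supabase integration; the table name, query name, and inline sample document are illustrative placeholders, not our production values.

```python
# Minimal sketch of the ingestion and retrieval settings described above.
# Assumes a Supabase table "documents" with a "match_documents" RPC (both
# placeholder names, per LangChain's Supabase vector store documentation).
import os

from langchain_community.vectorstores import SupabaseVectorStore
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from supabase import create_client

# Stand-in for the Google Drive data loader output.
raw_docs = [
    Document(
        page_content=open("exported_doc.txt").read(),
        metadata={"source": "exported_doc.txt", "section": "Overview", "date": "2024-01-01"},
    )
]

# Chunking: 1000 characters with 200-character overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(raw_docs)

# Embed the chunks and write them to the Supabase vector store.
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])
store = SupabaseVectorStore.from_documents(
    chunks,
    OpenAIEmbeddings(),
    client=supabase,
    table_name="documents",
    query_name="match_documents",
)

# Retrieval: top 5 results, similarity threshold of 0.8 or higher.
retriever = store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.8},
)
```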
3. Configuring the Agent
Using LangChain’s Tools Agent, we configured:
- Retrieval as a conditional step (not default)
- System prompt with rules for citation, clarity, and fallback behavior
- GPT-4 as the LLM, with temperature set at 0.1
- 10-message memory buffer tied to session ID
Each interaction logged tool use, response metadata, and agent reasoning paths.
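In n8n this lives in the Tools Agent node; the sketch below shows roughly equivalent wiring in Python with LangChain. The tool name, prompt wording, and the `retriever` from the ingestion sketch are assumptions for illustration.

```python
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.memory import ConversationBufferWindowMemory
from langchain.tools.retriever import create_retriever_tool
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Retrieval exposed as a tool the agent may choose to call (not a default step).
docs_tool = create_retriever_tool(
    retriever,  # from the ingestion sketch above
    name="search_private_docs",
    description="Search the internal knowledge base. Use only when internal docs are needed.",
)

# System prompt encoding the citation, clarity, and fallback rules.
prompt = ChatPromptTemplate.from_messages([
    ("system", "Cite your sources, answer clearly, and say you don't know rather than guess."),
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

llm = ChatOpenAI(model="gpt-4", temperature=0.1)
agent = create_tool_calling_agent(llm, [docs_tool], prompt)

# 10-message rolling window; in production this is keyed by session ID.
memory = ConversationBufferWindowMemory(
    k=10, memory_key="chat_history", return_messages=True
)
executor = AgentExecutor(agent=agent, tools=[docs_tool], memory=memory)

print(executor.invoke({"input": "What does our onboarding doc say about SSO?"})["output"])
```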
4. Evaluation via LangSmith
LangSmith added full observability:
- Traces of all tool/LLM/memory steps
- Token usage and latency per run
- Quality scores using LLM-as-a-Judge
- Custom session tags for chatbot versioning and A/B testing
LangSmith didn’t just show us what happened — it showed why and at what cost.
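With the tracing variables set, every run already lands in the chatbot-evaluation project. Offline scoring over a saved dataset looks roughly like the sketch below; the dataset name and experiment prefix are placeholders, and it assumes the `executor` from the agent sketch.

```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Replay each dataset row's question through the agent.
def run_agent(inputs: dict) -> dict:
    return {"output": executor.invoke({"input": inputs["question"]})["output"]}

# LLM-as-a-Judge scoring over a saved dataset ("chatbot-regression" is a
# placeholder name); results appear as an experiment in LangSmith.
results = evaluate(
    run_agent,
    data="chatbot-regression",
    evaluators=[
        LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}),
    ],
    experiment_prefix="chatbot-v2",  # tag used for versioning and A/B comparison
)
```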
What Changed for Our Team
This wasn’t just a chatbot project. It became a blueprint for agentic evaluation.
Key wins:
- Debugging time dropped by 70%
- Token spend stabilized through early prompt optimization
- Edge cases flagged before they reached users
- Stakeholders gained traceable QA visibility
Instead of shipping another black-box tool, we built an agentic layer we can trust.
Governance: Traceability, Not Guesswork
Every chat flow now generates structured, reviewable logs.
We don’t rely on anecdotal feedback. We trace and score everything — from the first message to the final token.
Evaluation isn’t just about performance. It’s about confidence, governance, and repeatability.
Final Thoughts: A Smarter Path to AI Chatbots
This system proves that self-hosted, evaluation-first AI is not only possible — it’s practical.
By combining n8n’s flexibility with LangSmith’s evaluation backbone, we turned an experimental chatbot into an operator-ready system.
Coming next: This full setup — workflows, config, and prompt logic — will be available as a Framework Friday template.
👉 Join the All-in on AI community to access templates, workflows, and full configuration packs.
Let’s build agentic systems you can measure, trace, and trust.