I had a growing collection of internal documentation (product specs, meeting notes, technical guides) that was becoming impossible to navigate. Searching was slow, and remembering which document contained which information was worse.
A RAG chatbot changed how I interact with that documentation. I ask a question in natural language; it retrieves the relevant passages from my documents and generates a coherent answer. The quality of the answer depends on the quality of the retrieval, which is why chunk size and embedding model choice matter.
This guide builds the full pipeline with LangChain, ChromaDB, and Ollama (so no API fees). I'll explain the design decisions that actually affect answer quality.
I run this on Tencent Cloud Lighthouse. The 4 GB RAM plan works for small document collections with 3B models; use 8 GB RAM for larger collections or 7B models. Lighthouse's TencentOS AI application image is worth considering for AI workloads: it comes pre-installed with Python 3, Node.js, Docker, Git, and major AI frameworks (PyTorch, TensorFlow), so the Python and ML library setup that normally takes an hour is already done.

The main reason to self-host a RAG system is privacy: your documents stay on your server. Internal wikis, proprietary documentation, client materials: none of it is sent to a third-party API for embedding or retrieval.
The process has two phases:
Indexing (done once, or when documents change):

```
Documents → Split into chunks → Generate embeddings → Store in vector DB
```

Querying (every time a question is asked):

```
Question → Generate embedding → Find similar chunks in vector DB
→ Send question + relevant chunks to LLM → Get answer
```
The key insight: instead of training the model on your documents, you let it read the relevant parts at query time. This works with any LLM and can be updated just by re-indexing your documents.
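To make "find similar chunks" concrete, here is the math behind retrieval in miniature. This sketch uses a hypothetical three-chunk index and made-up 3-dimensional vectors; real embedding models produce vectors with hundreds of dimensions, and the vector DB handles the search for you:

```python
import numpy as np

# Hypothetical index: three chunks with toy 3-dimensional embeddings
chunks = [
    "Ollama installation steps",
    "ChromaDB persistence settings",
    "Nginx basic auth setup",
]
chunk_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
])

# Toy embedding of the user's question
question_vector = np.array([0.85, 0.2, 0.05])

# Cosine similarity: dot product divided by the product of vector lengths
norms = np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(question_vector)
scores = chunk_vectors @ question_vector / norms

# The highest-scoring chunks are what gets pasted into the LLM prompt
best_first = np.argsort(scores)[::-1]
print(chunks[best_first[0]])  # -> "Ollama installation steps"
```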
| Requirement | Details |
|---|---|
| Server | Ubuntu 22.04, 4 GB+ RAM |
| Ollama | Running with at least one model |
| Python | 3.10+ |
| Documents | PDF, TXT, or Markdown files |
You need two models: one for generating embeddings (converting text to vectors), and one for generating answers.
```bash
curl -fsSL https://ollama.com/install.sh | sh

# Embedding model - converts text to vectors
ollama pull nomic-embed-text

# LLM for generating answers - choose based on your RAM
ollama pull llama3.2:3b   # 4 GB RAM
# or
ollama pull llama3.1:8b   # 8 GB RAM
```
Verify both are available:
```bash
ollama list
```
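You can also sanity-check the embedding model directly against Ollama's REST API before wiring LangChain in (Ollama listens on port 11434 by default; the test string here is arbitrary):

```python
import json
import urllib.request

# Ask the local Ollama server to embed a test string
payload = json.dumps({
    "model": "nomic-embed-text",
    "prompt": "hello world",
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/embeddings",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    embedding = json.load(resp)["embedding"]

# nomic-embed-text produces 768-dimensional vectors
print(f"Embedding dimension: {len(embedding)}")
```

If this prints a dimension, embedding generation works end to end.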
```bash
sudo apt install -y python3 python3-pip python3-venv

mkdir -p /opt/rag-chatbot
cd /opt/rag-chatbot
python3 -m venv venv
source venv/bin/activate

pip install \
    langchain \
    langchain-community \
    langchain-ollama \
    chromadb \
    pypdf \
    sentence-transformers \
    chainlit \
    unstructured \
    python-docx
```
Create `index_documents.py`:
```python
import os

from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Configuration
DOCS_DIR = "./documents"      # Put your documents here
CHROMA_DIR = "./chroma_db"    # Vector database storage
EMBEDDING_MODEL = "nomic-embed-text"


def load_documents(docs_dir: str):
    """Load all supported documents from a directory."""
    documents = []
    for root, _, files in os.walk(docs_dir):
        for file in files:
            filepath = os.path.join(root, file)
            if file.endswith(".pdf"):
                loader = PyPDFLoader(filepath)
            elif file.endswith((".txt", ".md")):
                loader = TextLoader(filepath)
            else:
                continue
            print(f"Loading: {filepath}")
            documents.extend(loader.load())
    return documents


def index_documents():
    print("Loading documents...")
    documents = load_documents(DOCS_DIR)
    print(f"Loaded {len(documents)} documents")

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,     # Characters per chunk
        chunk_overlap=200,   # Overlap between chunks (for context continuity)
        length_function=len,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")

    # Generate embeddings and store in ChromaDB
    print("Generating embeddings (this may take a few minutes)...")
    embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=CHROMA_DIR,
    )
    print(f"Indexed {len(chunks)} chunks into ChromaDB at {CHROMA_DIR}")
    return vectorstore


if __name__ == "__main__":
    os.makedirs(DOCS_DIR, exist_ok=True)
    if not os.listdir(DOCS_DIR):
        print(f"No documents found in {DOCS_DIR}")
        print("Add your PDF, TXT, or Markdown files to the documents/ folder and run again")
    else:
        index_documents()
        print("\nIndexing complete. Run 'python3 query.py' to test queries.")
```
```bash
mkdir -p /opt/rag-chatbot/documents

# Copy your PDFs, text files, or Markdown files here
cp /path/to/your/docs/* /opt/rag-chatbot/documents/

cd /opt/rag-chatbot
source venv/bin/activate
python3 index_documents.py
```
This takes 1–5 minutes depending on document size. ChromaDB is created at ./chroma_db/.
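Before building the full chain, it's worth confirming that retrieval works on its own. A quick check from the same directory (the query string is just an example; use something your documents actually cover):

```python
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Fetch the 3 chunks most similar to a test query
for doc in vectorstore.similarity_search("deployment checklist", k=3):
    print(doc.metadata.get("source"), "->", doc.page_content[:80])
```

If the printed chunks look relevant, indexing worked; if they look random, revisit chunking before blaming the LLM.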
Create `query.py` for command-line testing:
```python
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Configuration
CHROMA_DIR = "./chroma_db"
EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.2:3b"


def create_rag_chain():
    # Load the vector store
    embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma(
        persist_directory=CHROMA_DIR,
        embedding_function=embeddings,
    )

    # Create a retriever - finds the top 5 most relevant chunks
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5},
    )

    # Define the prompt template
    template = """You are a helpful assistant. Answer the question based on the context provided.
If the answer isn't in the context, say so clearly rather than making something up.

Context:
{context}

Question: {question}

Answer:"""
    prompt = ChatPromptTemplate.from_template(template)

    # Create the LLM
    llm = ChatOllama(model=LLM_MODEL, temperature=0)

    # Build the RAG chain
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return rag_chain


def main():
    print(f"Loading RAG chain with {LLM_MODEL}...")
    chain = create_rag_chain()
    print("Ready! Type your questions (Ctrl+C to quit)\n")
    while True:
        try:
            question = input("You: ").strip()
        except (KeyboardInterrupt, EOFError):
            print("\nBye!")
            break
        if not question:
            continue
        print("Assistant: ", end="", flush=True)
        for chunk in chain.stream(question):
            print(chunk, end="", flush=True)
        print("\n")


if __name__ == "__main__":
    main()
```
```bash
python3 query.py
```
Type a question about your documents and see the AI answer based on your content.
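The chain returns only the answer text. To see which files an answer draws from, query the retriever directly; this sketch assumes you expose retriever from create_rag_chain (for example, by returning it alongside the chain):

```python
# Inspect what the retriever would hand to the LLM for a given question
docs = retriever.invoke(question)
for i, doc in enumerate(docs, 1):
    print(f"[{i}] {doc.metadata.get('source', 'unknown')}")
    print(f"    {doc.page_content[:100]}...")
```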
Create `app.py` for a chat web interface:
```python
import chainlit as cl
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

CHROMA_DIR = "./chroma_db"
EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.2:3b"


def get_rag_chain():
    embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma(
        persist_directory=CHROMA_DIR,
        embedding_function=embeddings,
    )
    retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

    template = """You are a helpful assistant. Answer based on the provided context only.
If the answer isn't in the context, say "I don't have information about that in the provided documents."

Context:
{context}

Question: {question}

Answer:"""
    prompt = ChatPromptTemplate.from_template(template)
    llm = ChatOllama(model=LLM_MODEL, temperature=0)

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )


@cl.on_chat_start
async def start():
    chain = get_rag_chain()
    cl.user_session.set("chain", chain)
    await cl.Message(
        content="Hello! I'm ready to answer questions about your documents. What would you like to know?"
    ).send()


@cl.on_message
async def main(message: cl.Message):
    chain = cl.user_session.get("chain")
    msg = cl.Message(content="")
    async for chunk in chain.astream(message.content):
        await msg.stream_token(chunk)
    await msg.send()
```
```bash
chainlit run app.py --host 0.0.0.0 --port 8000
```
Access it via an SSH tunnel: `ssh -L 8000:localhost:8000 ubuntu@YOUR_SERVER_IP`
Then open http://localhost:8000 in your browser.
To keep the chatbot running in the background, create a systemd service:

```bash
sudo nano /etc/systemd/system/rag-chatbot.service
```
```ini
[Unit]
Description=RAG Chatbot
After=network.target ollama.service

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/opt/rag-chatbot
ExecStart=/opt/rag-chatbot/venv/bin/chainlit run app.py --host 0.0.0.0 --port 8000
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```
```bash
sudo systemctl daemon-reload
sudo systemctl enable rag-chatbot
sudo systemctl start rag-chatbot
```
Set up Nginx with HTTPS and basic auth (same as other services in this guide) pointing to port 8000.
My RAG chatbot was running but giving poor answers: it kept saying "the context doesn't contain this information" even for questions that were clearly answered in my documents.
The issue: the chunk size in my first attempt was too small (200 characters), so the retrieved chunks lacked enough context for the model to understand them.
The fix: increase chunk size and overlap:
```python
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Was 200 - too small
    chunk_overlap=200,  # Was 0 - no overlap meant boundary cuts
)
```
Also, I was only retrieving 2 chunks (k=2). Increasing to 5 gave the model more context:
```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
```
After re-indexing with these settings, answer quality improved significantly. The rule of thumb: chunks should be large enough to contain complete thoughts, with enough overlap that concepts split across chunk boundaries still get retrieved.
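You can watch these parameters work by splitting a sample string and comparing the results; a small experiment (the sample text is arbitrary):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample = "RAG quality depends on chunking. " * 40  # roughly 1,300 characters

for size, overlap in [(200, 0), (1000, 200)]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
    pieces = splitter.split_text(sample)
    print(f"chunk_size={size}, overlap={overlap}: "
          f"{len(pieces)} chunks, first chunk {len(pieces[0])} chars")
```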
| Issue | Likely Cause | Fix |
|---|---|---|
| "Context doesn't contain..." despite relevant docs | Chunks too small or k too low | Increase chunk_size to 1000+, k to 5+ |
| Slow embedding generation | CPU-only embedding | Expected; nomic-embed-text on CPU takes time |
| ChromaDB errors on startup | DB corrupted | Delete ./chroma_db dir and re-index |
| LLM gives wrong answers | Poor prompt | Improve system prompt; add explicit "only use context" |
| PDF pages not loading | pypdf not installed | pip install pypdf |
| Out of memory | Too many chunks in memory | Process documents in batches (see the sketch below) |
| Chainlit not accessible | Port not open | Check UFW rules; use SSH tunnel |
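For the out-of-memory case, here is one way to batch the work: a hedged sketch that replaces the single Chroma.from_documents call in index_documents.py (chunks comes from the splitter as before; the batch size of 100 is an arbitrary starting point):

```python
# Index in batches instead of embedding every chunk at once
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)
vectorstore = Chroma(persist_directory=CHROMA_DIR, embedding_function=embeddings)

BATCH_SIZE = 100  # tune to your RAM; smaller batches use less memory
for i in range(0, len(chunks), BATCH_SIZE):
    batch = chunks[i:i + BATCH_SIZE]
    vectorstore.add_documents(batch)  # embeds and persists only this batch
    print(f"Indexed {i + len(batch)}/{len(chunks)} chunks")
```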
✅ What you built: a private chatbot that answers questions from your own documents, running entirely on your server. It works with any document type that can be converted to text. Add new documents, re-run the indexer, and the chatbot immediately knows about them.
How much RAM do I need to run a RAG chatbot on a VPS?
It depends on the model size. 3B parameter models need ~3–4 GB RAM; 7B models need ~5–6 GB; 13B+ models need 12+ GB. Check the requirements section for specific recommendations.
Can a RAG chatbot run on a CPU-only server without a GPU?
Yes, but inference speed varies significantly. 3B models are responsive on CPU. 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.
Is my data private when using self-hosted AI models?
Yes — data is processed entirely on your server with no external API calls. Conversations, documents, and prompts never leave your infrastructure. This is a key advantage of self-hosting AI.
What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.
👉 Get started with Tencent Cloud Lighthouse
👉 View current pricing and launch promotions
👉 Explore all active deals and offers