I had a growing collection of internal documentation (product specs, meeting notes, technical guides) that was becoming impossible to navigate. Searching was slow, and remembering which document contained which information was worse.
A RAG chatbot changed how I interact with that documentation. I ask a question in natural language; it retrieves the relevant passages from my documents and generates a coherent answer. The quality of the answer depends on the quality of the retrieval, which is why chunk size and embedding model choice matter.
This guide builds the full pipeline with LangChain, ChromaDB, and Ollama (so no API fees). I'll explain the design decisions that actually affect answer quality.
I run this on Tencent Cloud Lighthouse. The 4 GB RAM plan works for small document collections with 3B models; use 8 GB RAM for larger collections or 7B models. Lighthouse's TencentOS AI application image is worth considering for AI workloads: it comes pre-installed with Python 3, Node.js, Docker, Git, and major AI frameworks (PyTorch, TensorFlow), so the Python and ML library setup that normally takes an hour is already done.

The main reason to self-host a RAG system is privacy: your documents stay on your server. Internal wikis, proprietary documentation, client materials: none of it is sent to a third-party API for embedding or retrieval.
The process has two phases:
Indexing (done once, or when documents change):

```
Documents → Split into chunks → Generate embeddings → Store in vector DB
```

Querying (every time a question is asked):

```
Question → Generate embedding → Find similar chunks in vector DB
→ Send question + relevant chunks to LLM → Get answer
```
The key insight: instead of training the model on your documents, you let it read the relevant parts at query time. This works with any LLM and can be updated just by re-indexing your documents.
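To make "find similar chunks" concrete, here is the math behind retrieval in miniature. This sketch uses a hypothetical three-chunk index and made-up 3-dimensional vectors; real embedding models produce vectors with hundreds of dimensions, and the vector DB handles the search for you:

```python
import numpy as np

# Hypothetical index: three chunks with toy 3-dimensional embeddings
chunks = [
    "Ollama installation steps",
    "ChromaDB persistence settings",
    "Nginx basic auth setup",
]
chunk_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
])

# Toy embedding of the user's question
question_vector = np.array([0.85, 0.2, 0.05])

# Cosine similarity: dot product divided by the product of vector lengths
norms = np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(question_vector)
scores = chunk_vectors @ question_vector / norms

# The highest-scoring chunks are what gets pasted into the LLM prompt
best_first = np.argsort(scores)[::-1]
print(chunks[best_first[0]])  # -> "Ollama installation steps"
```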
| Requirement | Details |
|---|---|
| Server | Ubuntu 22.04, 4 GB+ RAM |
| Ollama | Running with at least one model |
| Python | 3.10+ |
| Documents | PDF, TXT, or Markdown files |
You need two models: one for generating embeddings (converting text to vectors), and one for generating answers.
```bash
curl -fsSL https://ollama.com/install.sh | sh

# Embedding model - converts text to vectors
ollama pull nomic-embed-text

# LLM for generating answers - choose based on your RAM
ollama pull llama3.2:3b   # 4 GB RAM
# or
ollama pull llama3.1:8b   # 8 GB RAM
```
Verify both are available:
```bash
ollama list
```
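You can also sanity-check the embedding model directly against Ollama's REST API before wiring LangChain in (Ollama listens on port 11434 by default; the test string here is arbitrary):

```python
import json
import urllib.request

# Ask the local Ollama server to embed a test string
payload = json.dumps({
    "model": "nomic-embed-text",
    "prompt": "hello world",
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/embeddings",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    embedding = json.load(resp)["embedding"]

# nomic-embed-text produces 768-dimensional vectors
print(f"Embedding dimension: {len(embedding)}")
```

If this prints a dimension, embedding generation works end to end.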
```bash
sudo apt install -y python3 python3-pip python3-venv

mkdir -p /opt/rag-chatbot
cd /opt/rag-chatbot
python3 -m venv venv
source venv/bin/activate

pip install \
    langchain \
    langchain-community \
    langchain-ollama \
    chromadb \
    pypdf \
    sentence-transformers \
    chainlit \
    unstructured \
    python-docx
```
Create `index_documents.py`:
```python
import os

from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Configuration
DOCS_DIR = "./documents"      # Put your documents here
CHROMA_DIR = "./chroma_db"    # Vector database storage
EMBEDDING_MODEL = "nomic-embed-text"


def load_documents(docs_dir: str):
    """Load all supported documents from a directory."""
    documents = []
    for root, _, files in os.walk(docs_dir):
        for file in files:
            filepath = os.path.join(root, file)
            if file.endswith(".pdf"):
                loader = PyPDFLoader(filepath)
            elif file.endswith((".txt", ".md")):
                loader = TextLoader(filepath)
            else:
                continue
            print(f"Loading: {filepath}")
            documents.extend(loader.load())
    return documents


def index_documents():
    print("Loading documents...")
    documents = load_documents(DOCS_DIR)
    print(f"Loaded {len(documents)} documents")

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,     # Characters per chunk
        chunk_overlap=200,   # Overlap between chunks (for context continuity)
        length_function=len,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")

    # Generate embeddings and store in ChromaDB
    print("Generating embeddings (this may take a few minutes)...")
    embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=CHROMA_DIR,
    )
    print(f"Indexed {len(chunks)} chunks into ChromaDB at {CHROMA_DIR}")
    return vectorstore


if __name__ == "__main__":
    os.makedirs(DOCS_DIR, exist_ok=True)
    if not os.listdir(DOCS_DIR):
        print(f"No documents found in {DOCS_DIR}")
        print("Add your PDF, TXT, or Markdown files to the documents/ folder and run again")
    else:
        index_documents()
        print("\nIndexing complete. Run 'python3 query.py' to test queries.")
```
```bash
mkdir -p /opt/rag-chatbot/documents

# Copy your PDFs, text files, or Markdown files here
cp /path/to/your/docs/* /opt/rag-chatbot/documents/

cd /opt/rag-chatbot
source venv/bin/activate
python3 index_documents.py
```
This takes 1–5 minutes depending on document size. ChromaDB is created at ./chroma_db/.
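Before building the full chain, it's worth confirming that retrieval works on its own. A quick check from the same directory (the query string is just an example; use something your documents actually cover):

```python
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Fetch the 3 chunks most similar to a test query
for doc in vectorstore.similarity_search("deployment checklist", k=3):
    print(doc.metadata.get("source"), "->", doc.page_content[:80])
```

If the printed chunks look relevant, indexing worked; if they look random, revisit chunking before blaming the LLM.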
Create `query.py` for command-line testing:
```python
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Configuration
CHROMA_DIR = "./chroma_db"
EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.2:3b"


def create_rag_chain():
    # Load the vector store
    embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma(
        persist_directory=CHROMA_DIR,
        embedding_function=embeddings,
    )

    # Create a retriever - finds the top 5 most relevant chunks
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5},
    )

    # Define the prompt template
    template = """You are a helpful assistant. Answer the question based on the context provided.
If the answer isn't in the context, say so clearly rather than making something up.

Context:
{context}

Question: {question}

Answer:"""
    prompt = ChatPromptTemplate.from_template(template)

    # Create the LLM
    llm = ChatOllama(model=LLM_MODEL, temperature=0)

    # Build the RAG chain
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return rag_chain


def main():
    print(f"Loading RAG chain with {LLM_MODEL}...")
    chain = create_rag_chain()
    print("Ready! Type your questions (Ctrl+C to quit)\n")
    while True:
        try:
            question = input("You: ").strip()
        except (KeyboardInterrupt, EOFError):
            print("\nBye!")
            break
        if not question:
            continue
        print("Assistant: ", end="", flush=True)
        for chunk in chain.stream(question):
            print(chunk, end="", flush=True)
        print("\n")


if __name__ == "__main__":
    main()
```
```bash
python3 query.py
```
Type a question about your documents and see the AI answer based on your content.
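The chain returns only the answer text. To see which files an answer draws from, query the retriever directly; this sketch assumes you expose retriever from create_rag_chain (for example, by returning it alongside the chain):

```python
# Inspect what the retriever would hand to the LLM for a given question
docs = retriever.invoke(question)
for i, doc in enumerate(docs, 1):
    print(f"[{i}] {doc.metadata.get('source', 'unknown')}")
    print(f"    {doc.page_content[:100]}...")
```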
Create `app.py` for a chat web interface:
```python
import chainlit as cl
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

CHROMA_DIR = "./chroma_db"
EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.2:3b"


def get_rag_chain():
    embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma(
        persist_directory=CHROMA_DIR,
        embedding_function=embeddings,
    )
    retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

    template = """You are a helpful assistant. Answer based on the provided context only.
If the answer isn't in the context, say "I don't have information about that in the provided documents."

Context:
{context}

Question: {question}

Answer:"""
    prompt = ChatPromptTemplate.from_template(template)
    llm = ChatOllama(model=LLM_MODEL, temperature=0)

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )


@cl.on_chat_start
async def start():
    chain = get_rag_chain()
    cl.user_session.set("chain", chain)
    await cl.Message(
        content="Hello! I'm ready to answer questions about your documents. What would you like to know?"
    ).send()


@cl.on_message
async def main(message: cl.Message):
    chain = cl.user_session.get("chain")
    msg = cl.Message(content="")
    async for chunk in chain.astream(message.content):
        await msg.stream_token(chunk)
    await msg.send()
```
```bash
chainlit run app.py --host 0.0.0.0 --port 8000
```
Access it via an SSH tunnel: `ssh -L 8000:localhost:8000 ubuntu@YOUR_SERVER_IP`
Then open http://localhost:8000 in your browser.
To keep the chatbot running in the background, create a systemd service:

```bash
sudo nano /etc/systemd/system/rag-chatbot.service
```
```ini
[Unit]
Description=RAG Chatbot
After=network.target ollama.service

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/opt/rag-chatbot
ExecStart=/opt/rag-chatbot/venv/bin/chainlit run app.py --host 0.0.0.0 --port 8000
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```
```bash
sudo systemctl daemon-reload
sudo systemctl enable rag-chatbot
sudo systemctl start rag-chatbot
```
Set up Nginx with HTTPS and basic auth (same as other services in this guide) pointing to port 8000.
My RAG chatbot was running but giving poor answers: it kept saying "the context doesn't contain this information" even for questions that were clearly answered in my documents.
The issue: the chunk size in my first attempt was too small (200 characters), so the retrieved chunks lacked enough context for the model to understand them.
The fix: increase chunk size and overlap:
```python
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Was 200 - too small
    chunk_overlap=200,  # Was 0 - no overlap meant boundary cuts
)
```
Also, I was only retrieving 2 chunks (k=2). Increasing to 5 gave the model more context:
```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
```
After re-indexing with these settings, answer quality improved significantly. The rule of thumb: chunks should be large enough to contain complete thoughts, with enough overlap that concepts split across chunk boundaries still get retrieved.
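You can watch these parameters work by splitting a sample string and comparing the results; a small experiment (the sample text is arbitrary):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample = "RAG quality depends on chunking. " * 40  # roughly 1,300 characters

for size, overlap in [(200, 0), (1000, 200)]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
    pieces = splitter.split_text(sample)
    print(f"chunk_size={size}, overlap={overlap}: "
          f"{len(pieces)} chunks, first chunk {len(pieces[0])} chars")
```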
| Issue | Likely Cause | Fix |
|---|---|---|
| "Context doesn't contain..." despite relevant docs | Chunks too small or k too low | Increase chunk_size to 1000+, k to 5+ |
| Slow embedding generation | CPU-only embedding | Expected; nomic-embed-text on CPU takes time |
| ChromaDB errors on startup | DB corrupted | Delete ./chroma_db dir and re-index |
| LLM gives wrong answers | Poor prompt | Improve system prompt; add explicit "only use context" |
| PDF pages not loading | pypdf not installed | pip install pypdf |
| Out of memory | Too many chunks in memory | Process documents in batches (see the sketch below) |
| Chainlit not accessible | Port not open | Check UFW rules; use SSH tunnel |
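For the out-of-memory case, here is one way to batch the work: a hedged sketch that replaces the single Chroma.from_documents call in index_documents.py (chunks comes from the splitter as before; the batch size of 100 is an arbitrary starting point):

```python
# Index in batches instead of embedding every chunk at once
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)
vectorstore = Chroma(persist_directory=CHROMA_DIR, embedding_function=embeddings)

BATCH_SIZE = 100  # tune to your RAM; smaller batches use less memory
for i in range(0, len(chunks), BATCH_SIZE):
    batch = chunks[i:i + BATCH_SIZE]
    vectorstore.add_documents(batch)  # embeds and persists only this batch
    print(f"Indexed {i + len(batch)}/{len(chunks)} chunks")
```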
✅ What you built: a private chatbot that answers questions from your own documents, running entirely on your server. It works with any document type that can be converted to text. Add new documents, re-run the indexer, and the chatbot immediately knows about them.
How much RAM do I need to run a RAG chatbot on a VPS?
It depends on the model size. 3B parameter models need ~3–4 GB RAM; 7B models need ~5–6 GB; 13B+ models need 12+ GB. Check the requirements section for specific recommendations.
Can a RAG chatbot run on a CPU-only server without a GPU?
Yes, but inference speed varies significantly. 3B models are responsive on CPU. 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.
Is my data private when using self-hosted AI models?
Yes — data is processed entirely on your server with no external API calls. Conversations, documents, and prompts never leave your infrastructure. This is a key advantage of self-hosting AI.
What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.
👉 Get started with Tencent Cloud Lighthouse
👉 View current pricing and launch promotions
👉 Explore all active deals and offers