I'd been using AI chatbots for writing assistance and code review, but I was increasingly conscious of what I was sending to external APIs — sometimes client code, sometimes sensitive project details. Setting up a private alternative seemed worth trying.
Ollama makes it surprisingly practical. Download a model, run it, get a local API that responds like OpenAI's. Open WebUI wraps it in a proper chat interface with conversation history, multiple model support, and a system prompt editor.
Running it on a cloud server instead of my laptop means it's available from any device and I don't have to keep my computer on. For CPU-only inference, a 3B model responds in seconds. For better quality, a 7B model on 8 GB RAM is workable.
This guide deploys Ollama with Open WebUI on Ubuntu 22.04 using Docker Compose, with Nginx as the reverse proxy and HTTPS.
I run this on Tencent Cloud Lighthouse. For 7B parameter models (like Qwen 2.5-7B or Llama 3.1-8B), the 4 vCPU / 8 GB RAM plan is the recommended minimum. Smaller 3B models run on the 4 GB RAM plan. A practical advantage of Lighthouse for AI workloads: you can start with a smaller plan and upgrade the spec from the control panel as you figure out which models you actually use — no need to re-provision. The OrcaTerm browser terminal also lets you pull new models and check Ollama's status without a local SSH client, which is useful when managing the server from different machines.
Key Takeaways
| Model Size | RAM Required | Recommended Plan | Quality |
|---|---|---|---|
| 3B (e.g., Qwen2.5-3B) | 4 GB | Basic | Good for simple tasks |
| 7B (e.g., Llama3.1-8B) | 8 GB | Standard | Good general use |
| 13B | 16 GB | Pro | Better reasoning |
| 70B | 64 GB | Large instance | Excellent, near GPT-4 |
CPU inference is slower than GPU but functional. A 7B model on a modern CPU generates about 5-10 tokens per second — slow but usable for non-interactive tasks.
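If you'd rather measure this on your own instance than trust a ballpark figure, ollama run can print timing statistics once Ollama is installed (installation steps below); the model name and prompt here are just examples:

```bash
# --verbose prints timing stats after the answer; the "eval rate" line is tokens per second
ollama run qwen2.5:7b "Explain what a reverse proxy does in two sentences." --verbose
```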
| Requirement | Notes |
|---|---|
| Cloud server | Tencent Cloud Lighthouse Ubuntu 22.04 |
| 8 GB+ RAM | For 7B models |
| 20 GB+ free disk | Models are 4-8 GB each |
| Docker + Compose | Installed |
💡 If you selected the Lighthouse Docker CE application image, Docker is already installed and running. Skip the Docker install lines below and start from sudo apt install -y nginx.
ssh ubuntu@YOUR_SERVER_IP
sudo apt update && sudo apt upgrade -y
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER
newgrp docker
sudo apt install -y nginx
sudo ufw allow ssh
sudo ufw allow 'Nginx Full'
sudo ufw enable
# Official Ollama installation script
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Check service status
sudo systemctl status ollama
Ollama installs as a systemd service and starts automatically. It listens on localhost:11434 by default.
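You can confirm the API is answering locally before changing any settings; both endpoints are part of Ollama's standard HTTP API:

```bash
# The root path returns a short liveness message; /api/tags lists downloaded models
curl http://localhost:11434/
curl http://localhost:11434/api/tags
```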
By default, Ollama only accepts connections from localhost. Open WebUI (running in Docker) needs to connect to it:
sudo systemctl edit ollama
Add in the editor:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama
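To check that the new bind address took effect, look at what the service is listening on. Note that the firewall rules above only open SSH and HTTP/HTTPS, so port 11434 is still not reachable from the internet:

```bash
# Should show ollama bound to 0.0.0.0:11434 instead of 127.0.0.1:11434
sudo ss -ltnp | grep 11434
```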
# Download a 7B model (takes 5-15 minutes depending on connection)
# Good general-purpose English model:
ollama pull llama3.1:8b
# Good Chinese + English model:
ollama pull qwen2.5:7b
# Lightweight 3B model for limited RAM:
ollama pull qwen2.5:3b
# Code-focused model:
ollama pull codellama:7b
# Check downloaded models
ollama list
Test the model in the terminal:
ollama run qwen2.5:7b
>>> Hello! What can you help me with?
I'm a helpful AI assistant...
>>> /bye
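ollama run also accepts a prompt as an argument and exits after answering, which is handy for quick scripted checks:

```bash
# One-shot, non-interactive invocation
ollama run qwen2.5:7b "Summarize what a Dockerfile is in one sentence."
```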
mkdir -p ~/apps/open-webui && cd ~/apps/open-webui
Create docker-compose.yml:
version: '3.8'
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open-webui_data:/app/backend/data
    environment:
      # Connect to Ollama running on the host
      OLLAMA_BASE_URL: http://host.docker.internal:11434
      # Security settings
      WEBUI_SECRET_KEY: generate_a_long_random_string
      WEBUI_AUTH: "true"
      # Allow registration on the first visit so you can create the admin account;
      # set to "false" after setup to block new sign-ups
      ENABLE_SIGNUP: "true"
      DEFAULT_MODELS: qwen2.5:7b
      DEFAULT_USER_ROLE: user
    extra_hosts:
      - "host.docker.internal:host-gateway" # Allows container to reach host network
volumes:
  open-webui_data:
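WEBUI_SECRET_KEY should be a genuinely random value, since it is used to sign session tokens. One simple way to generate one (openssl ships with Ubuntu):

```bash
# Generate a 64-character random hex string and paste it into docker-compose.yml
openssl rand -hex 32
```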
docker compose up -d
docker compose logs -f open-webui
# Wait for: Application startup complete
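Before wiring up Nginx, it's worth confirming the container answers on the published port. Assuming the 3000:8080 mapping above:

```bash
# Container status plus a quick HTTP check; expect a 200 status code
docker compose ps
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:3000
```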
sudo nano /etc/nginx/sites-available/open-webui
server {
    listen 80;
    server_name ai.yourdomain.com;

    client_max_body_size 100m; # For document uploads

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;

        # WebSocket support (required for streaming responses)
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;

        # Long timeout for AI responses
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
sudo ln -s /etc/nginx/sites-available/open-webui /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
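At this point Nginx should already proxy plain HTTP to the container. A quick check from the server itself, sending the Host header so the right server block matches (substitute your own domain):

```bash
# Expect a 200 from Open WebUI via Nginx on port 80
curl -s -o /dev/null -w "%{http_code}\n" -H "Host: ai.yourdomain.com" http://127.0.0.1/
```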
sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d ai.yourdomain.com
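Certbot installs a systemd timer that renews certificates automatically; a dry run confirms the renewal will succeed when the time comes:

```bash
# Simulate renewal without touching the real certificate
sudo certbot renew --dry-run
```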
Visit https://ai.yourdomain.com.
Register your account right away: the first registered user automatically becomes the administrator.
After login:
- Pick a model from the dropdown at the top of the chat (qwen2.5:7b, or whatever you set in DEFAULT_MODELS).
- Once your account exists, set ENABLE_SIGNUP to "false" in docker-compose.yml and run docker compose up -d again so strangers can't register accounts.

Key features:
- Per-user conversation history
- Switching between any of the models you've pulled with Ollama
- A system prompt editor for customizing the assistant's behavior
- Document uploads for asking questions about your own files
# General conversation (English)
ollama pull llama3.1:8b
# General conversation (Chinese + English, very good)
ollama pull qwen2.5:7b
# Reasoning and math
ollama pull deepseek-r1:7b
# Code generation
ollama pull codellama:7b
ollama pull qwen2.5-coder:7b
# Very small models (4 GB RAM)
ollama pull qwen2.5:3b
ollama pull phi3:mini
# Embedding models (for RAG)
ollama pull nomic-embed-text
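The embedding model isn't something you chat with; it turns text into vectors for retrieval. If you want to sanity-check it from the command line, Ollama exposes an embeddings endpoint (the prompt text is arbitrary):

```bash
# Returns a JSON object containing an "embedding" array of floats
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "self-hosted AI on a VPS"}'
```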
Ollama exposes an API compatible with OpenAI's format. Use it from any code:
# Direct Ollama API
curl http://localhost:11434/api/chat \
-d '{"model": "qwen2.5:7b", "messages": [{"role": "user", "content": "Hello"}]}'
# OpenAI-compatible API (works with any OpenAI SDK)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen2.5:7b", "messages": [{"role": "user", "content": "Hello"}]}'
Python example using the OpenAI SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Explain Docker in simple terms"}]
)
print(response.choices[0].message.content)
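The OpenAI-compatible endpoint can also stream tokens as they are generated, which is what makes the chat UI feel responsive. A minimal curl sketch with streaming enabled:

```bash
# "stream": true returns server-sent events, one chunk at a time; -N disables curl's buffering
curl -N http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:7b", "stream": true, "messages": [{"role": "user", "content": "Count to five"}]}'
```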
Ollama loads the model into RAM when the first request comes in and keeps it there for a short idle period (five minutes by default) before unloading it. If your prompt plus the conversation history exceeds the model's context window, the oldest messages are silently dropped.
Practical issue: you notice the AI "forgets" earlier parts of long conversations. This isn't a bug — it's the context window limit.
Workarounds:
- Start a new chat when you switch topics so the history stays short.
- Ask the model to summarize the conversation so far, then continue from that summary.
- Use a model with a larger context window, or raise the context size explicitly (at the cost of more RAM), as shown in the sketch after this list.
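If you do raise the context window, you can pass it per request through Ollama's native API. A sketch, where 8192 is just an example value:

```bash
# options.num_ctx sets the context window for this request; larger values need more RAM
curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Hello"}],
    "options": {"num_ctx": 8192}
  }'
```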
Check available RAM before loading large models:
free -h
# Make sure available RAM > model size + 2 GB overhead
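free -h shows overall headroom; ollama ps shows which models are currently loaded and how much memory each one is using:

```bash
# Lists loaded models with their size, CPU/GPU placement, and how long they stay loaded
ollama ps
```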
If Ollama crashes or returns errors, check memory:
sudo journalctl -u ollama -f
# List downloaded models
ollama list
# Download a model
ollama pull MODEL_NAME
# Remove a model
ollama rm MODEL_NAME
# Show model details
ollama show MODEL_NAME
# Run model in terminal (interactive)
ollama run MODEL_NAME
# Check Ollama service
sudo systemctl status ollama
sudo journalctl -u ollama -f
# View model storage location (the install script's systemd service stores models here)
sudo ls /usr/share/ollama/.ollama/models/
sudo du -sh /usr/share/ollama/.ollama/models/  # Check total disk usage
| Issue | Likely Cause | Fix |
|---|---|---|
| Connection refused | Service not running or wrong port | Check systemctl status SERVICE and verify firewall rules |
| Permission denied | Wrong file ownership or permissions | Check file ownership with ls -la and use chown/chmod to fix |
| 502 Bad Gateway | Backend service not running | Restart the backend service; check logs with journalctl -u SERVICE |
| SSL certificate error | Certificate expired or domain mismatch | Run sudo certbot renew and verify domain DNS points to server IP |
| Service not starting | Config error or missing dependency | Check logs with journalctl -u SERVICE -n 50 for specific error |
| Out of disk space | Logs or data accumulation | Run df -h to identify usage; clean logs or attach CBS storage |
| High memory usage | Too many processes or memory leak | Check with htop; consider upgrading instance plan if consistently high |
| Firewall blocking traffic | Port not open in UFW or Lighthouse console | Open port in Lighthouse console firewall AND sudo ufw allow PORT |
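Most problems with this stack come down to one of three layers: Ollama, the Open WebUI container, or Nginx. A quick triage pass using only commands already covered in this guide (run the docker compose commands from ~/apps/open-webui):

```bash
# 1. Is Ollama up and answering?
sudo systemctl status ollama
curl http://localhost:11434/api/tags

# 2. Is the Open WebUI container healthy?
docker compose ps
docker compose logs --tail=50 open-webui

# 3. Is Nginx configured correctly and running?
sudo nginx -t
sudo systemctl status nginx
```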
How much RAM do I need to run Ollama on a VPS?
It depends on the model size. 3B parameter models need ~3–4 GB RAM; 7B models need ~5–6 GB; 13B+ models need 12+ GB. Check the requirements section for specific recommendations.
Can Ollama run on a CPU-only server without a GPU?
Yes, but inference speed varies significantly. 3B models are responsive on CPU. 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.
Is my data private when using self-hosted AI models?
Yes — data is processed entirely on your server with no external API calls. Conversations, documents, and prompts never leave your infrastructure. This is a key advantage of self-hosting AI.
What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.
To call the API from your own code on another machine, point the client's base_url at your server's address instead of localhost.
Run your own AI today:
👉 Tencent Cloud Lighthouse — High-RAM instances for AI workloads
👉 View current pricing and promotions
👉 Explore all active deals and offers