I got curious about running AI models locally after noticing how much I was spending on API calls for personal projects. The math was simple: at a certain usage level, a cloud server running Ollama costs less per month than the API fees, and you get unlimited calls.
More importantly, the data stays on your server. Conversations aren't logged by a third party, and you can deploy models that are fine-tuned for specific use cases without sending sensitive information anywhere.
Ollama makes running large language models on a server surprisingly straightforward. Combined with Open WebUI (which I covered in the Docker guide earlier), you get a ChatGPT-like interface connected to models you control.
I run this on Tencent Cloud Lighthouse. For CPU-only inference, a 4 GB RAM instance runs 3B models at reasonable speed; 8 GB for 7B+ models. For GPU-accelerated inference, Lighthouse offers a TencentOS AI application image that comes pre-installed with Python 3, Node.js, Docker, Git, and major AI frameworks (PyTorch, TensorFlow, PaddlePaddle) along with GPU drivers — eliminating the complex CUDA setup that typically takes hours. The primary advantage for private AI: your conversations and documents are processed entirely on your own server — nothing sent to any third-party API. The spec upgrade path lets you start with a CPU plan and move to a GPU plan as needed.
Key Takeaways
Ollama is a tool for running open-source large language models locally. It:
- downloads and manages models with a single pull command
- runs them on CPU or GPU, quantized to 4-bit by default
- serves an HTTP API on localhost:11434 that other tools (like Open WebUI) can connect to
Models you can run include Llama 3, Mistral, Phi-3, Gemma, Qwen, and many other major open-source families.
Match the model size to your available RAM:
| Server RAM | Recommended Models | Notes |
|---|---|---|
| 4 GB | Llama 3.2:3b, Phi-3:mini, Gemma2:2b | Fast responses, good for general chat |
| 8 GB | Llama 3.1:8b, Mistral:7b, Qwen2.5:7b | Better quality, still responsive |
| 16 GB | Llama 2:13b, Qwen2.5:14b | High quality, good for coding tasks |
| 32 GB+ | Mixtral:8x7b, CodeLlama:34b, Llama 3.1:70b (quantized) | Near GPT-4 quality for many tasks |
Rule of thumb: at 4-bit quantization (Ollama's default), a model needs roughly 0.5-0.7 GB of RAM per billion parameters, plus a little headroom for context. A 7B model needs ~4-5 GB RAM.
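As a quick sanity check before pulling a model, here is a rough back-of-the-envelope sketch of that rule in Python (the per-parameter cost and overhead constant are approximations, not exact figures):

# Rough RAM estimate for a quantized model; the overhead constant is an
# approximation and real usage varies with model architecture and context size.
def estimate_ram_gb(params_billion: float, bits: int = 4, overhead_gb: float = 1.5) -> float:
    bytes_per_param = bits / 8            # 4-bit quantization = 0.5 bytes per parameter
    weights_gb = params_billion * bytes_per_param
    return weights_gb + overhead_gb       # room for KV cache and runtime overhead

for size in (3, 7, 13, 70):
    print(f"{size}B parameters -> ~{estimate_ram_gb(size):.1f} GB RAM")
# 3B -> ~3.0 GB, 7B -> ~5.0 GB, 13B -> ~8.0 GB, 70B -> ~36.5 GB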
For GPU-accelerated inference: Select the TencentOS AI application image when creating your Lighthouse instance. It comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers — you still need to install Ollama itself (one command below), but the CUDA/driver setup that usually takes hours is already done.
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama and sets it up as a systemd service. Verify the service is running:
sudo systemctl status ollama
Check the API responds:
curl http://localhost:11434/api/tags
# Returns: {"models":[]} (empty until you pull a model)
By default, Ollama only listens on localhost. If you want to access it from outside the server (e.g., from your laptop), configure it to listen on all interfaces — but only do this behind Nginx with authentication:
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama
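If you do open the port, put authentication in front of it. A minimal sketch using Nginx HTTP basic auth (the hostname ollama-api.yourdomain.com and the htpasswd path are placeholders; if Nginx runs on the same server you can instead leave OLLAMA_HOST alone, proxy to 127.0.0.1:11434, and keep port 11434 closed in the firewall):

# Create a username/password file (apache2-utils provides htpasswd)
sudo apt install -y apache2-utils
sudo htpasswd -c /etc/nginx/.ollama_htpasswd youruser

# /etc/nginx/sites-available/ollama-api
server {
    listen 80;
    server_name ollama-api.yourdomain.com;

    location / {
        auth_basic "Ollama API";
        auth_basic_user_file /etc/nginx/.ollama_htpasswd;
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }
}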
# Download Llama 3.2 3B (good for 4 GB RAM servers)
ollama pull llama3.2:3b
# Or Phi-3 Mini (very lightweight, 2.3 GB download)
ollama pull phi3:mini
# Or Mistral 7B (better quality, needs 8 GB RAM)
ollama pull mistral:7b
Downloads go to ~/.ollama/models/ (or /usr/share/ollama/.ollama/models/ when running as a service).
Start an interactive chat:
ollama run llama3.2:3b
You're now chatting with the model directly in your terminal. Type /bye to exit.
For a one-off prompt without an interactive session:
ollama run llama3.2:3b "Explain how HTTPS works in 3 sentences"
List installed models:
ollama list
Remove a model you no longer need:
ollama rm llama3.2:3b
Open WebUI gives you a ChatGPT-like browser interface connected to Ollama. If Docker isn't already installed (see the earlier Docker guide), install it first:
curl -fsSL https://get.docker.com | sh
Then start the Open WebUI container:
docker run -d \
--name open-webui \
--restart=always \
--network=host \
-v open-webui:/app/backend/data \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
ghcr.io/open-webui/open-webui:main
--network=host lets the container reach Ollama at localhost:11434.
Open WebUI runs on port 8080. Access it via SSH tunnel first:
ssh -L 8080:localhost:8080 ubuntu@YOUR_SERVER_IP
Open http://localhost:8080 in your browser. Create an account (first user becomes admin). Select a model from the dropdown and start chatting.
To access Open WebUI from anywhere (not just via SSH tunnel), set up Nginx with HTTPS.
sudo apt install -y nginx certbot python3-certbot-nginx
sudo nano /etc/nginx/sites-available/ai
server {
    listen 80;
    server_name ai.yourdomain.com;

    location / {
        proxy_pass http://localhost:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # Increase timeout for long AI responses
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
sudo ln -s /etc/nginx/sites-available/ai /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
sudo certbot --nginx -d ai.yourdomain.com
Open WebUI has its own user authentication built in — the first user to register becomes the admin. You can disable public registration in Open WebUI settings after creating your account.
Ollama exposes two HTTP APIs: its own native API (used in the curl example below) and an OpenAI-compatible endpoint under /v1, so any OpenAI SDK works with minimal changes.
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:3b",
"messages": [
{"role": "user", "content": "Write a Python function to check if a number is prime"}
],
"stream": false
}'
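The same request also works through the OpenAI-compatible endpoint at /v1/chat/completions, which is what the SDK examples below use:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Say hello in one sentence"}]
  }'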
In Python, using the official openai package:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Ollama doesn't require a real key
)
response = client.chat.completions.create(
model="llama3.2:3b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain Docker in simple terms"}
]
)
print(response.choices[0].message.content)
The same in JavaScript:
import OpenAI from "openai";
const ollama = new OpenAI({
baseURL: "http://localhost:11434/v1",
apiKey: "ollama",
});
const response = await ollama.chat.completions.create({
model: "llama3.2:3b",
messages: [{ role: "user", content: "What is a REST API?" }],
});
console.log(response.choices[0].message.content);
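Responses can also be streamed token by token as they are generated. A minimal Python sketch using the same client configuration as above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# stream=True yields chunks as the model generates them
stream = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Explain DNS in two sentences"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()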
To check how much disk space your models are using:
du -sh ~/.ollama/models/
# or
du -sh /usr/share/ollama/.ollama/models/
Models are large (2–20 GB each). Plan your storage accordingly.
Customize a model's system prompt and parameters:
Create Modelfile:
FROM llama3.2:3b
SYSTEM """
You are a helpful assistant for a tech support team.
Respond concisely and technically.
When asked about code, always provide working examples.
"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
Build and use the custom model:
ollama create myassistant -f Modelfile
ollama run myassistant
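The custom model is also visible to Open WebUI's model dropdown and to the API under the name you gave it. For example, through the OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# "myassistant" is the name used in `ollama create` above
response = client.chat.completions.create(
    model="myassistant",
    messages=[{"role": "user", "content": "A user reports the VPN client will not connect"}],
)
print(response.choices[0].message.content)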
Ollama unloads models from RAM after 5 minutes of inactivity by default. To change:
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
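As with the earlier override, reload and restart for the change to take effect. In recent Ollama versions, ollama ps shows which models are loaded and how long they will stay in memory:

sudo systemctl daemon-reload
sudo systemctl restart ollama

# After a chat request, check what's loaded and when it will be unloaded
ollama ps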
My first attempts with the Mistral 7B model on a 4 GB RAM server resulted in extremely slow responses — 30-40 seconds per token. The model was technically running, but practically unusable.
The cause: 7B models at full precision need about 14 GB RAM. Even quantized to 4-bit (which Ollama does by default), they need ~5 GB. My 4 GB server was swapping to disk.
What I learned about model selection:
# Check if Ollama is swapping
free -h
# If 'used' under Swap is non-zero during inference, the model is too large
# Check which quantization level you have (the output includes a quantization field)
ollama show mistral:7b
The fix: Switch to a smaller model that fits comfortably in RAM:
# For 4 GB RAM: use 3B models
ollama pull llama3.2:3b # 2.0 GB download, ~2.5 GB RAM usage
ollama pull phi3:mini # 2.3 GB download, ~2.5 GB RAM usage
# For 8 GB RAM: 7B models work well
ollama pull llama3.1:8b # 4.7 GB download, ~5 GB RAM usage
The 3B models are genuinely useful for most tasks — coding help, writing assistance, Q&A. For serious technical work or nuanced reasoning, upgrade to an 8 GB RAM instance and run a 7B model.
| Issue | Likely Cause | Fix |
|---|---|---|
| Slow responses | Model too large for available RAM | Use smaller model or upgrade RAM |
| connection refused on port 11434 | Ollama not running | sudo systemctl start ollama |
| Out of disk space | Model files are large | ollama rm modelname; check with ollama list |
| Open WebUI can't reach Ollama | Network mismatch | Use --network=host or set OLLAMA_BASE_URL correctly |
| Model download fails | Network timeout | Retry; models are large, downloads can take time |
| High memory usage after chat | Model still loaded | Ollama keeps model in RAM for 5 min; expected behavior |
| API returns 404 | Model name wrong | Check exact name with ollama list |
✅ What you built:
- Ollama running as a systemd service, with models sized to your server's RAM
- Open WebUI as a ChatGPT-style interface, published through Nginx with HTTPS
- An OpenAI-compatible API endpoint your own scripts and apps can call
- Optional custom models with their own system prompts via a Modelfile
Total monthly cost: the price of your cloud server (starting from ~$5–10/month). No per-token fees, no usage limits, no data leaving your server.
How much RAM do I need to run Ollama AI on a VPS?
It depends on the model size. 3B parameter models need ~3–4 GB RAM; 7B models need ~5–6 GB; 13B+ models need 12+ GB. Check the requirements section for specific recommendations.
Can Ollama AI run on a CPU-only server without a GPU?
Yes, but inference speed varies significantly. 3B models are responsive on CPU. 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.
Is my data private when using self-hosted AI models?
Yes — data is processed entirely on your server with no external API calls. Conversations, documents, and prompts never leave your infrastructure. This is a key advantage of self-hosting AI.
What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.
Can I call the API from my own applications?
Yes. Ollama exposes an OpenAI-compatible API; point any OpenAI SDK's base_url to your server address.
👉 Get started with Tencent Cloud Lighthouse
👉 View current pricing and launch promotions
👉 Explore all active deals and offers