I had three different applications calling three different LLM APIs — each with slightly different request formats, different error handling, different ways to handle streaming. Every time I wanted to test a different model, I had to update code.
LiteLLM is the proxy that eliminates this. It exposes a single OpenAI-compatible endpoint and routes requests to whatever backend you configure — Ollama, OpenAI, Anthropic, Gemini, Azure OpenAI, or any other supported provider. Switch from Ollama to OpenAI to Anthropic by updating a config file, not application code; your existing OpenAI SDK integration works without modification.
I also use it for virtual API key management across a small team — different keys for different projects, with spending limits per key.
I use it to route traffic between a local Ollama model (for most tasks) and OpenAI GPT-4 (for complex tasks) without changing any application code.
I run LiteLLM on Tencent Cloud Lighthouse. The 2 GB RAM / 2 vCPU plan is sufficient for the proxy itself. If you're pairing LiteLLM with local Ollama models on the same server, Lighthouse's TencentOS AI application image saves significant setup time — it comes pre-installed with Python 3, Docker, Node.js, Git, and AI frameworks, so the environment for running both LiteLLM and local models is ready without manual dependency setup. Lighthouse's static public IP means your applications always reach the proxy at the same address, and OrcaTerm makes configuration updates accessible from any browser.
Key Takeaways
LiteLLM as a proxy server gives you:
- One OpenAI-compatible endpoint in front of every backend, local Ollama models and cloud APIs alike
- Model and provider switching via config file, with zero application code changes
- Virtual API keys with per-key spending limits
- Automatic retries and fallbacks when a backend fails
- Centralized usage tracking through a built-in dashboard
For teams or applications making many LLM calls, this centralized control is valuable.
| Requirement | Details |
|---|---|
| Server | Ubuntu 22.04, 2 GB+ RAM |
| Python | 3.10+ |
| Ollama | Running (if using local models) |
| API keys | For any cloud providers you want to use (optional) |
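If you plan to route to local models, confirm Ollama is reachable before going further. A quick check against its default port (11434, the same address used later in config.yaml):
# Lists the models Ollama is serving; a JSON response means it's up
curl http://localhost:11434/api/tags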
sudo apt update
sudo apt install -y python3 python3-pip python3-venv
mkdir -p /opt/litellm
cd /opt/litellm
python3 -m venv venv
source venv/bin/activate
pip install 'litellm[proxy]'
The [proxy] extra installs the proxy server dependencies.
litellm --version
LiteLLM is configured via a YAML file. Create /opt/litellm/config.yaml:
model_list:
  # Local Ollama models
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.2:3b
      api_base: http://localhost:11434
  - model_name: mistral
    litellm_params:
      model: ollama/mistral:7b
      api_base: http://localhost:11434
  # OpenAI models (requires OPENAI_API_KEY env var)
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
  # Anthropic models (requires ANTHROPIC_API_KEY env var)
  - model_name: claude-3-haiku
    litellm_params:
      model: claude-3-haiku-20240307
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  # Enable detailed logging
  success_callback: []
  failure_callback: []
  # Cache responses (optional)
  cache: false

general_settings:
  # Master key for admin operations
  master_key: sk-master-key-change-this
  # Store usage data in SQLite
  database_url: "sqlite:///./litellm.db"
Access control lives in general_settings: every request to the proxy must present the master_key (or a virtual key created from it) as a Bearer token.
general_settings:
  master_key: sk-master-key-change-this
  # Virtual keys will be created via API
export OPENAI_API_KEY=sk-your-openai-key # if using OpenAI
export ANTHROPIC_API_KEY=sk-ant-your-key # if using Anthropic
cd /opt/litellm
source venv/bin/activate
litellm --config config.yaml --port 4000 --host 0.0.0.0
You should see:
LiteLLM: Proxy initialized with config, starting proxy
LiteLLM Proxy: Listening on http://0.0.0.0:4000
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-master-key-change-this" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}]
  }'
You should get a response from your local Ollama llama3.2:3b model.
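You can also confirm that every entry from config.yaml is registered by querying the OpenAI-compatible models endpoint:
# Should list llama3, mistral, gpt-4o, gpt-3.5-turbo, and claude-3-haiku
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-master-key-change-this"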
Since LiteLLM is OpenAI-compatible, use the standard OpenAI SDK:
from openai import OpenAI

# Point to your LiteLLM proxy
client = OpenAI(
    base_url="http://YOUR_SERVER_IP:4000",
    api_key="sk-master-key-change-this"
)

# Use any configured model
response = client.chat.completions.create(
    model="llama3",  # Routes to Ollama llama3.2:3b
    messages=[{"role": "user", "content": "Explain REST APIs"}]
)
print(response.choices[0].message.content)

# Switch to OpenAI with zero code change
response = client.chat.completions.create(
    model="gpt-4o",  # Routes to OpenAI
    messages=[{"role": "user", "content": "Explain REST APIs"}]
)
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://YOUR_SERVER_IP:4000",
  apiKey: "sk-master-key-change-this",
});

// Use local model
const response = await client.chat.completions.create({
  model: "llama3",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
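Streaming works through the same endpoint regardless of backend, which removes one of the per-provider differences from the intro. A quick check with curl (any configured model works; llama3 is used here):
# Request a streamed response; LiteLLM relays SSE chunks in OpenAI's format
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-master-key-change-this" \
  -d '{
    "model": "llama3",
    "stream": true,
    "messages": [{"role": "user", "content": "Count to five."}]
  }'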
Configure automatic fallback if a model fails:
model_list:
  - model_name: smart-llm
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  # If smart-llm (gpt-4o) fails, try llama3, then mistral
  # (both are defined in model_list earlier in this guide)
  fallbacks: [{"smart-llm": ["llama3", "mistral"]}]
  # Retry on failure
  num_retries: 3
  retry_after: 5
Now calling smart-llm tries gpt-4o first and falls back to the local llama3 (then mistral) if the call fails.
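To watch the fallback work, call the alias like any other model. If the OpenAI request fails (for example, with OPENAI_API_KEY unset), the answer should come from llama3 instead:
# Tries gpt-4o first; falls back to the local llama3 on failure
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-master-key-change-this" \
  -d '{"model": "smart-llm", "messages": [{"role": "user", "content": "ping"}]}'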
LiteLLM includes a built-in dashboard for monitoring and managing the proxy.
Start with the UI enabled:
litellm --config config.yaml --port 4000 --host 0.0.0.0 --ui
Access the dashboard at http://YOUR_SERVER_IP:4000/ui
The dashboard shows:
- Request volume and token usage per model
- Spend per virtual key
- Recent requests and errors
In the UI, go to API Keys → Create Key. Each key can be scoped to specific models and given its own budget and expiry.
Share these virtual keys with team members instead of your master key.
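If you prefer to script key creation instead of clicking through the UI, the proxy exposes a /key/generate endpoint. A sketch, assuming a $10 budget and a 30-day lifetime (adjust to your needs):
# Create a virtual key limited to two models, with a spend cap and expiry
curl http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key-change-this" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["llama3", "gpt-3.5-turbo"],
    "max_budget": 10,
    "duration": "30d"
  }'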
sudo nano /etc/systemd/system/litellm.service
[Unit]
Description=LiteLLM Proxy
After=network.target ollama.service
[Service]
Type=simple
User=ubuntu
WorkingDirectory=/opt/litellm
# Bind to localhost only; Nginx (configured below) handles external traffic
ExecStart=/opt/litellm/venv/bin/litellm --config config.yaml --port 4000 --host 127.0.0.1
Restart=on-failure
RestartSec=10
Environment=OPENAI_API_KEY=sk-your-key
Environment=ANTHROPIC_API_KEY=sk-ant-your-key
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable litellm
sudo systemctl start litellm
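Confirm the service came up and watch its logs while you send a test request:
# Check service state, then follow the proxy's logs
sudo systemctl status litellm
journalctl -u litellm -f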
sudo nano /etc/nginx/sites-available/litellm
server {
    listen 80;
    server_name llm.yourdomain.com;

    location / {
        proxy_pass http://localhost:4000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        client_max_body_size 50m;
    }
}
sudo ln -s /etc/nginx/sites-available/litellm /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
sudo certbot --nginx -d llm.yourdomain.com
Now your applications point to https://llm.yourdomain.com for LLM calls.
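A quick end-to-end check from any machine (this assumes LiteLLM's unauthenticated liveness probe at /health/liveliness; if your version differs, rerun the chat completions test from earlier instead):
# Verifies DNS, TLS, Nginx, and the proxy in one call
curl https://llm.yourdomain.com/health/liveliness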
My application was getting authentication errors even with the correct master key.
The issue: I hadn't set the master_key in config.yaml, so LiteLLM was running without any authentication. When I added the master key later and restarted, all existing API clients got 401 errors because they were sending the key in the wrong format.
LiteLLM expects the key as: Authorization: Bearer sk-your-key
My application was sending: Authorization: sk-your-key (missing "Bearer")
The fix: Update the OpenAI client initialization to pass the key correctly:
client = OpenAI(
    base_url="https://llm.yourdomain.com",
    api_key="sk-your-master-key"  # OpenAI SDK adds "Bearer" automatically
)
Or if using raw HTTP, include the "Bearer" prefix yourself:
curl https://llm.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer sk-your-master-key" \
  -H "Content-Type: application/json" \
  -d '...'
| Issue | Likely Cause | Fix |
|---|---|---|
| 401 Unauthorized | Missing or wrong API key | Add Authorization: Bearer YOUR_KEY header |
| Model not found | Model name not in config | Check model_list in config.yaml |
| Ollama connection refused | Ollama not running | sudo systemctl start ollama |
| Slow responses | Inference time, not LiteLLM | LiteLLM adds <5ms overhead; slowness is from model |
| Config changes not applied | Service not restarted | sudo systemctl restart litellm |
| Database errors | SQLite file permissions | Check file ownership matches service user |
| 504 Gateway Timeout | Response taking too long | Increase proxy_read_timeout in Nginx |
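When the table doesn't explain a failure, stop the service and rerun the proxy in the foreground with verbose logging; the --detailed_debug flag prints full request and routing traces:
# Foreground run with detailed traces (Ctrl+C to stop)
sudo systemctl stop litellm
cd /opt/litellm && source venv/bin/activate
litellm --config config.yaml --port 4000 --detailed_debug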
✅ What you built:
- A single OpenAI-compatible endpoint routing to Ollama, OpenAI, and Anthropic models
- Virtual API keys with per-project spending limits
- Automatic fallback from cloud models to local ones
- A systemd-managed service behind Nginx with HTTPS
Your applications now talk to one endpoint. Switch from Ollama to GPT-4 in configuration, not in code. Add Anthropic Claude as a backup — no code changes required.
How much RAM do I need to run LiteLLM on a VPS?
The proxy itself is lightweight; the 2 GB plan covers it. RAM pressure comes from any local models you run alongside it: 3B-parameter models need ~3–4 GB, 7B models ~5–6 GB, and 13B+ models 12 GB or more. Check the requirements section for specific recommendations.
Can LiteLLM run on a CPU-only server without a GPU?
Yes, but inference speed varies significantly. 3B models are responsive on CPU. 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.
Is my data private when using self-hosted AI models?
Yes — data is processed entirely on your server with no external API calls. Conversations, documents, and prompts never leave your infrastructure. This is a key advantage of self-hosting AI.
What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.
All it takes is pointing your application's base_url at your server address.
👉 Get started with Tencent Cloud Lighthouse
👉 View current pricing and launch promotions
👉 Explore all active deals and offers