
Self-Host an OpenAI-Compatible API with LiteLLM — One Unified Endpoint for Any LLM

I had three different applications calling three different LLM APIs — each with slightly different request formats, different error handling, different ways to handle streaming. Every time I wanted to test a different model, I had to update code.

LiteLLM is the proxy that eliminates this. One endpoint, OpenAI-compatible format, routes to whatever backend you configure. Change from Ollama to OpenAI to Anthropic by updating a config file, not application code. Your existing OpenAI SDK integration works without modification.

I also use it for virtual API key management across a small team: different keys for different projects, with spending limits per key.

The underlying problem is that every provider speaks a slightly different dialect: OpenAI uses one format, Anthropic uses another, and your self-hosted Ollama uses yet another. When you switch models or providers, your code breaks.

LiteLLM solves this by acting as a proxy: it exposes a single OpenAI-compatible API endpoint and routes requests to whatever backend you configure — Ollama, OpenAI, Anthropic, Gemini, Azure OpenAI, or any other supported provider. Your application code never changes.

I use it to route traffic between a local Ollama model (for most tasks) and OpenAI GPT-4 (for complex tasks) without changing any application code.

I run LiteLLM on Tencent Cloud Lighthouse. The 2 GB RAM / 2 vCPU plan is sufficient for the proxy itself. If you're pairing LiteLLM with local Ollama models on the same server, Lighthouse's TencentOS AI application image saves significant setup time — it comes pre-installed with Python 3, Docker, Node.js, Git, and AI frameworks, so the environment for running both LiteLLM and local models is ready without manual dependency setup. Lighthouse's static public IP means your applications always reach the proxy at the same address, and OrcaTerm makes configuration updates accessible from any browser.


Table of Contents

  1. Why LiteLLM?
  2. What You Need
  3. Part 1: Install LiteLLM
  4. Part 2: Configure Models
  5. Part 3: Run as a Proxy Server
  6. Part 4: Use the API from Your Applications
  7. Part 5: Add a Web UI (LiteLLM Dashboard)
  8. Part 6: Deploy as a Service with Nginx
  9. The Thing That Tripped Me Up
  10. Troubleshooting
  11. Summary
  12. Frequently Asked Questions

Key Takeaways

  • Use the appropriate Lighthouse application image to skip manual installation steps where available
  • Lighthouse snapshots provide one-click full-server backup before major changes
  • OrcaTerm browser terminal lets you manage the server from any device
  • CBS cloud disk expansion handles growing storage needs without server migration
  • Console-level firewall + UFW = two independent protection layers

Why LiteLLM? {#why}

LiteLLM as a proxy server gives you:

  • Unified API — one endpoint, OpenAI-compatible, works with any OpenAI SDK
  • Multi-provider routing — switch between Ollama, OpenAI, Anthropic without code changes
  • Load balancing — distribute requests across multiple models or instances
  • Fallback handling — if one model fails, automatically try another
  • Cost tracking — log token usage and cost per model
  • API key management — create virtual API keys with per-key model access and rate limits
  • Caching — cache identical requests to reduce latency and cost

For teams or applications making many LLM calls, this centralized control is valuable.


What You Need {#prerequisites}

| Requirement | Details |
|---|---|
| Server | Ubuntu 22.04, 2 GB+ RAM |
| Python | 3.10+ |
| Ollama | Running (if using local models) |
| API keys | For any cloud providers you want to use (optional) |

Part 1: Install LiteLLM {#part-1}

1.1 — Install Python and Create Virtual Environment

sudo apt update
sudo apt install -y python3 python3-pip python3-venv
mkdir -p /opt/litellm
cd /opt/litellm
python3 -m venv venv
source venv/bin/activate

1.2 — Install LiteLLM

pip install 'litellm[proxy]'

The [proxy] extra installs the proxy server dependencies.

1.3 — Quick Test

litellm --version

Part 2: Configure Models {#part-2}

LiteLLM is configured via a YAML file. Create /opt/litellm/config.yaml:

model_list:
  # Local Ollama models
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.2:3b
      api_base: http://localhost:11434

  - model_name: mistral
    litellm_params:
      model: ollama/mistral:7b
      api_base: http://localhost:11434

  # OpenAI models (requires OPENAI_API_KEY env var)
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY

  # Anthropic models (requires ANTHROPIC_API_KEY env var)
  - model_name: claude-3-haiku
    litellm_params:
      model: claude-3-haiku-20240307
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  # Callbacks for logging/analytics integrations (empty = none configured)
  success_callback: []
  failure_callback: []

  # Cache responses (optional)
  cache: false

general_settings:
  # Master key for admin operations
  master_key: sk-master-key-change-this
  
  # Store usage data in SQLite
  database_url: "sqlite:///./litellm.db"

Virtual Keys for Team Members

Add this section to control access. The virtual keys themselves are created through the proxy's API once it is running, as shown in the sketch after this config:

general_settings:
  master_key: sk-master-key-change-this
  
  # Virtual keys will be created via API
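
With the proxy running (Part 3), you create virtual keys by calling the proxy's key-management API with the master key. Here is a minimal sketch using Python's requests library; the /key/generate endpoint and the models, max_budget, and duration fields follow LiteLLM's key-management docs, so confirm them against the version you installed:

import requests

# Mint a virtual key limited to one model, a $10 budget, and a 30-day lifetime
resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": "Bearer sk-master-key-change-this"},
    json={
        "models": ["llama3"],  # models this key is allowed to call
        "max_budget": 10.0,    # spend cap in USD
        "duration": "30d",     # key expires after 30 days
    },
)
resp.raise_for_status()
print(resp.json()["key"])  # the sk-... virtual key to hand out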

Part 3: Run as a Proxy Server {#part-3}

3.1 — Set Environment Variables

export OPENAI_API_KEY=sk-your-openai-key        # if using OpenAI
export ANTHROPIC_API_KEY=sk-ant-your-key        # if using Anthropic

3.2 — Start the Proxy

cd /opt/litellm
source venv/bin/activate
litellm --config config.yaml --port 4000 --host 0.0.0.0

You should see:

LiteLLM: Proxy initialized with config, starting proxy
LiteLLM Proxy: Listening on http://0.0.0.0:4000

3.3 — Test the API

curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-master-key-change-this" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}]
  }'

You should get a response from your local Ollama llama3.2:3b model.
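
You can also confirm which models the proxy exposes. LiteLLM implements the OpenAI-standard model listing endpoint, so a quick check with Python's requests library (using the same master key) looks like this:

import requests

# List the names accepted in the "model" field of chat requests
resp = requests.get(
    "http://localhost:4000/v1/models",
    headers={"Authorization": "Bearer sk-master-key-change-this"},
)
for m in resp.json()["data"]:
    print(m["id"])  # e.g. llama3, mistral, gpt-4o, claude-3-haiku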


Part 4: Use the API from Your Applications {#part-4}

Since LiteLLM is OpenAI-compatible, use the standard OpenAI SDK:

Python

from openai import OpenAI

# Point to your LiteLLM proxy
client = OpenAI(
    base_url="http://YOUR_SERVER_IP:4000",
    api_key="sk-master-key-change-this"
)

# Use any configured model
response = client.chat.completions.create(
    model="llama3",  # Routes to Ollama llama3.2:3b
    messages=[{"role": "user", "content": "Explain REST APIs"}]
)

print(response.choices[0].message.content)

# Switch to OpenAI with zero code change
response = client.chat.completions.create(
    model="gpt-4o",  # Routes to OpenAI
    messages=[{"role": "user", "content": "Explain REST APIs"}]
)

Node.js

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://YOUR_SERVER_IP:4000",
  apiKey: "sk-master-key-change-this",
});

// Use local model
const response = await client.chat.completions.create({
  model: "llama3",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(response.choices[0].message.content);
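
Streaming

Streaming works through the proxy too: request it exactly as you would against OpenAI, and LiteLLM translates each backend's streaming format into OpenAI-style chunks. A minimal Python sketch, using the same placeholder server address and master key as above:

from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR_SERVER_IP:4000",
    api_key="sk-master-key-change-this",
)

# Print tokens as they arrive instead of waiting for the full response
stream = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Write a haiku about proxies"}],
    stream=True,
)

for chunk in stream:
    # Some chunks carry no content (role markers, final chunk); guard for that
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()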

Model Fallback Configuration

Configure automatic fallback if a model fails:

model_list:
  - model_name: smart-llm
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  # If smart-llm (gpt-4o) fails, retry the request against local llama3
  fallbacks: [{"smart-llm": ["llama3"]}]

  # Retry on failure
  num_retries: 3
  retry_after: 5

Now calling smart-llm tries gpt-4o first, falls back to local llama3 if it fails.


Part 5: Add a Web UI (LiteLLM Dashboard) {#part-5}

LiteLLM includes a built-in dashboard for monitoring and managing the proxy.

Start with the UI enabled:

litellm --config config.yaml --port 4000 --host 0.0.0.0 --ui

Access the dashboard at http://YOUR_SERVER_IP:4000/ui

The dashboard shows:

  • Active models and their status
  • Request volume and latency
  • Token usage and cost tracking
  • API key management
  • Model routing configuration

Create Virtual API Keys via Dashboard

In the UI, go to API Keys → Create Key:

  • Set spending limits
  • Restrict to specific models
  • Set rate limits (requests per minute)

Share these virtual keys with team members instead of your master key.


Part 6: Deploy as a Service with Nginx {#part-6}

Create systemd Service

sudo nano /etc/systemd/system/litellm.service

[Unit]
Description=LiteLLM Proxy
After=network.target ollama.service

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/opt/litellm
ExecStart=/opt/litellm/venv/bin/litellm --config config.yaml --port 4000 --host 127.0.0.1
Restart=on-failure
RestartSec=10
Environment=OPENAI_API_KEY=sk-your-key
Environment=ANTHROPIC_API_KEY=sk-ant-your-key

[Install]
WantedBy=multi-user.target

Note that the service now binds to 127.0.0.1, so Nginx becomes the only public entry point. Enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable litellm
sudo systemctl start litellm

Nginx HTTPS Proxy

sudo nano /etc/nginx/sites-available/litellm

server {
    listen 80;
    server_name llm.yourdomain.com;

    location / {
        proxy_pass http://localhost:4000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering off;        # don't buffer streamed (SSE) responses
        client_max_body_size 50m;
    }
}

Enable the site and issue a certificate:

sudo ln -s /etc/nginx/sites-available/litellm /etc/nginx/sites-enabled/
sudo certbot --nginx -d llm.yourdomain.com

Now your applications point to https://llm.yourdomain.com for LLM calls.
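
It is worth a quick end-to-end check once the service and Nginx are in place. The sketch below assumes LiteLLM's liveness endpoint (/health/liveliness in recent releases; confirm the path for your installed version) and the placeholder domain from the Nginx config:

import requests

BASE = "https://llm.yourdomain.com"

# Liveness: does the proxy process answer at all?
print(requests.get(f"{BASE}/health/liveliness", timeout=10).text)

# Full round trip: one short completion through Nginx, LiteLLM, and the backend
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    headers={"Authorization": "Bearer sk-master-key-change-this"},
    json={"model": "llama3", "messages": [{"role": "user", "content": "ping"}]},
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])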


The Thing That Tripped Me Up {#gotcha}

My application was getting authentication errors even with the correct master key.

The issue: I hadn't set the master_key in config.yaml, so LiteLLM was running without any authentication. When I added the master key later and restarted, all existing API clients got 401 errors because they were sending the key in the wrong format.

LiteLLM expects the key as: Authorization: Bearer sk-your-key

My application was sending: Authorization: sk-your-key (missing "Bearer")

The fix: Update the OpenAI client initialization to pass the key correctly:

client = OpenAI(
    base_url="https://llm.yourdomain.com",
    api_key="sk-your-master-key"  # OpenAI SDK adds "Bearer" automatically
)

Or if using raw HTTP:

# Note: the "Bearer " prefix is required
curl https://llm.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer sk-your-master-key" \
  -H "Content-Type: application/json" \
  -d '...'

Troubleshooting {#troubleshooting}

| Issue | Likely Cause | Fix |
|---|---|---|
| 401 Unauthorized | Missing or wrong API key | Add Authorization: Bearer YOUR_KEY header |
| Model not found | Model name not in config | Check model_list in config.yaml |
| Ollama connection refused | Ollama not running | sudo systemctl start ollama |
| Slow responses | Inference time, not LiteLLM | LiteLLM adds <5 ms overhead; slowness comes from the model |
| Config changes not applied | Service not restarted | sudo systemctl restart litellm |
| Database errors | SQLite file permissions | Check file ownership matches the service user |
| 504 Gateway Timeout | Response taking too long | Increase proxy_read_timeout in Nginx |

Summary {#verdict}

What you built:

  • LiteLLM proxy running on your cloud server
  • Single OpenAI-compatible endpoint routing to multiple backends
  • Local Ollama models + optional cloud provider integration
  • Fallback routing when models fail
  • Virtual API keys with per-key limits and restrictions
  • Web dashboard for monitoring and management
  • HTTPS endpoint via Nginx

Your applications now talk to one endpoint. Switch from Ollama to GPT-4 in configuration, not in code. Add Anthropic Claude as a backup — no code changes required.

Frequently Asked Questions {#faq}

How much RAM do I need to run LiteLLM on a VPS?
The LiteLLM proxy itself is lightweight; a 2 GB RAM plan is enough. RAM only becomes a concern if you host Ollama models on the same server: 3B parameter models need ~3–4 GB, 7B models ~5–6 GB, and 13B+ models 12 GB or more.

Can LiteLLM run on a CPU-only server without a GPU?
Yes. The proxy itself adds negligible CPU load. Inference speed is a separate question: if you run local models on the same box, 3B models are responsive on CPU, while 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.

Is my data private when using self-hosted AI models?
Requests routed to local Ollama models are processed entirely on your server with no external API calls, so those conversations, documents, and prompts never leave your infrastructure. Keep in mind that any requests LiteLLM routes to cloud providers (OpenAI, Anthropic) do go to those providers.

What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.

Can I use LiteLLM as a drop-in replacement for the OpenAI API?
Yes. That is exactly what the proxy is: it exposes the standard OpenAI API surface, so you switch an existing application by pointing your OpenAI SDK client's base_url (and api_key) at your server.

👉 Get started with Tencent Cloud Lighthouse
👉 View current pricing and launch promotions
👉 Explore all active deals and offers