
Self-Host an OpenAI-Compatible API with LiteLLM — One Unified Endpoint for Any LLM

I had three different applications calling three different LLM APIs — each with slightly different request formats, different error handling, different ways to handle streaming. Every time I wanted to test a different model, I had to update code.

LiteLLM is the proxy that eliminates this. One endpoint, OpenAI-compatible format, routes to whatever backend you configure. Change from Ollama to OpenAI to Anthropic by updating a config file, not application code. Your existing OpenAI SDK integration works without modification.

I also use it for virtual API key management across a small team: different keys for different projects, with spending limits per key.

The underlying problem is that every provider speaks a slightly different dialect: OpenAI uses one format, Anthropic uses another, and your self-hosted Ollama uses yet another. When you switch models or providers, your code breaks.

LiteLLM solves this by acting as a proxy: it exposes a single OpenAI-compatible API endpoint and routes requests to whatever backend you configure — Ollama, OpenAI, Anthropic, Gemini, Azure OpenAI, or any other supported provider. Your application code never changes.

I use it to route traffic between a local Ollama model (for most tasks) and OpenAI GPT-4 (for complex tasks) without changing any application code.

I run LiteLLM on Tencent Cloud Lighthouse. The 2 GB RAM / 2 vCPU plan is sufficient for the proxy itself. If you're pairing LiteLLM with local Ollama models on the same server, Lighthouse's TencentOS AI application image saves significant setup time — it comes pre-installed with Python 3, Docker, Node.js, Git, and AI frameworks, so the environment for running both LiteLLM and local models is ready without manual dependency setup. Lighthouse's static public IP means your applications always reach the proxy at the same address, and OrcaTerm makes configuration updates accessible from any browser.


Table of Contents

  1. Why LiteLLM?
  2. What You Need
  3. Part 1: Install LiteLLM
  4. Part 2: Configure Models
  5. Part 3: Run as a Proxy Server
  6. Part 4: Use the API from Your Applications
  7. Part 5: Add a Web UI (LiteLLM Dashboard)
  8. Part 6: Deploy as a Service with Nginx
  9. The Thing That Tripped Me Up
  10. Troubleshooting
  11. Summary
  12. Frequently Asked Questions

Key Takeaways

  • Use the appropriate Lighthouse application image to skip manual installation steps where available
  • Lighthouse snapshots provide one-click full-server backup before major changes
  • OrcaTerm browser terminal lets you manage the server from any device
  • CBS cloud disk expansion handles growing storage needs without server migration
  • Console-level firewall + UFW = two independent protection layers

Why LiteLLM? {#why}

LiteLLM as a proxy server gives you:

  • Unified API — one endpoint, OpenAI-compatible, works with any OpenAI SDK
  • Multi-provider routing — switch between Ollama, OpenAI, Anthropic without code changes
  • Load balancing — distribute requests across multiple models or instances
  • Fallback handling — if one model fails, automatically try another
  • Cost tracking — log token usage and cost per model
  • API key management — create virtual API keys with per-key model access and rate limits
  • Caching — cache identical requests to reduce latency and cost

For teams or applications making many LLM calls, this centralized control is valuable.


What You Need {#prerequisites}

| Requirement | Details |
|---|---|
| Server | Ubuntu 22.04, 2 GB+ RAM |
| Python | 3.10+ |
| Ollama | Running (if using local models) |
| API keys | For any cloud providers you want to use (optional) |

Part 1: Install LiteLLM {#part-1}

1.1 — Install Python and Create Virtual Environment

sudo apt update
sudo apt install -y python3 python3-pip python3-venv
mkdir -p /opt/litellm
cd /opt/litellm
python3 -m venv venv
source venv/bin/activate

1.2 — Install LiteLLM

pip install 'litellm[proxy]'

The [proxy] extra installs the proxy server dependencies.

1.3 — Quick Test

litellm --version

Part 2: Configure Models {#part-2}

LiteLLM is configured via a YAML file. Create /opt/litellm/config.yaml:

model_list:
  # Local Ollama models
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.2:3b
      api_base: http://localhost:11434

  - model_name: mistral
    litellm_params:
      model: ollama/mistral:7b
      api_base: http://localhost:11434

  # OpenAI models (requires OPENAI_API_KEY env var)
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY

  # Anthropic models (requires ANTHROPIC_API_KEY env var)
  - model_name: claude-3-haiku
    litellm_params:
      model: claude-3-haiku-20240307
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  # Callbacks for logging/analytics integrations (empty = none configured)
  success_callback: []
  failure_callback: []

  # Cache responses (optional)
  cache: false

general_settings:
  # Master key for admin operations
  master_key: sk-master-key-change-this
  
  # Store usage data in SQLite
  database_url: "sqlite:///./litellm.db"

Virtual Keys for Team Members

Add this section to control access. The virtual keys themselves are created through the proxy's API once it is running, as shown in the sketch after this config:

general_settings:
  master_key: sk-master-key-change-this
  
  # Virtual keys will be created via API
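
With the proxy running (Part 3), you create virtual keys by calling the proxy's key-management API with the master key. Here is a minimal sketch using Python's requests library; the /key/generate endpoint and the models, max_budget, and duration fields follow LiteLLM's key-management docs, so confirm them against the version you installed:

import requests

# Mint a virtual key limited to one model, a $10 budget, and a 30-day lifetime
resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": "Bearer sk-master-key-change-this"},
    json={
        "models": ["llama3"],  # models this key is allowed to call
        "max_budget": 10.0,    # spend cap in USD
        "duration": "30d",     # key expires after 30 days
    },
)
resp.raise_for_status()
print(resp.json()["key"])  # the sk-... virtual key to hand out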

Part 3: Run as a Proxy Server {#part-3}

3.1 — Set Environment Variables

export OPENAI_API_KEY=sk-your-openai-key        # if using OpenAI
export ANTHROPIC_API_KEY=sk-ant-your-key        # if using Anthropic

3.2 — Start the Proxy

cd /opt/litellm
source venv/bin/activate
litellm --config config.yaml --port 4000 --host 0.0.0.0

You should see:

LiteLLM: Proxy initialized with config, starting proxy
LiteLLM Proxy: Listening on http://0.0.0.0:4000

3.3 — Test the API

curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-master-key-change-this" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}]
  }'

You should get a response from your local Ollama llama3.2:3b model.
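
You can also confirm which models the proxy exposes. LiteLLM implements the OpenAI-standard model listing endpoint, so a quick check with Python's requests library (using the same master key) looks like this:

import requests

# List the names accepted in the "model" field of chat requests
resp = requests.get(
    "http://localhost:4000/v1/models",
    headers={"Authorization": "Bearer sk-master-key-change-this"},
)
for m in resp.json()["data"]:
    print(m["id"])  # e.g. llama3, mistral, gpt-4o, claude-3-haiku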


Part 4: Use the API from Your Applications {#part-4}

Since LiteLLM is OpenAI-compatible, use the standard OpenAI SDK:

Python

from openai import OpenAI

# Point to your LiteLLM proxy
client = OpenAI(
    base_url="http://YOUR_SERVER_IP:4000",
    api_key="sk-master-key-change-this"
)

# Use any configured model
response = client.chat.completions.create(
    model="llama3",  # Routes to Ollama llama3.2:3b
    messages=[{"role": "user", "content": "Explain REST APIs"}]
)

print(response.choices[0].message.content)

# Switch to OpenAI with zero code change
response = client.chat.completions.create(
    model="gpt-4o",  # Routes to OpenAI
    messages=[{"role": "user", "content": "Explain REST APIs"}]
)

Node.js

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://YOUR_SERVER_IP:4000",
  apiKey: "sk-master-key-change-this",
});

// Use local model
const response = await client.chat.completions.create({
  model: "llama3",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(response.choices[0].message.content);
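
Streaming

Streaming works through the proxy too: request it exactly as you would against OpenAI, and LiteLLM translates each backend's streaming format into OpenAI-style chunks. A minimal Python sketch, using the same placeholder server address and master key as above:

from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR_SERVER_IP:4000",
    api_key="sk-master-key-change-this",
)

# Print tokens as they arrive instead of waiting for the full response
stream = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Write a haiku about proxies"}],
    stream=True,
)

for chunk in stream:
    # Some chunks carry no content (role markers, final chunk); guard for that
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()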

Model Fallback Configuration

Configure automatic fallback if a model fails:

model_list:
  - model_name: smart-llm
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  # If smart-llm (gpt-4o) fails, retry the request against local llama3
  fallbacks: [{"smart-llm": ["llama3"]}]

  # Retry on failure
  num_retries: 3
  retry_after: 5

Now calling smart-llm tries gpt-4o first, falls back to local llama3 if it fails.


Part 5: Add a Web UI (LiteLLM Dashboard) {#part-5}

LiteLLM includes a built-in dashboard for monitoring and managing the proxy.

Start with the UI enabled:

litellm --config config.yaml --port 4000 --host 0.0.0.0 --ui

Access the dashboard at http://YOUR_SERVER_IP:4000/ui

The dashboard shows:

  • Active models and their status
  • Request volume and latency
  • Token usage and cost tracking
  • API key management
  • Model routing configuration

Create Virtual API Keys via Dashboard

In the UI, go to API Keys → Create Key:

  • Set spending limits
  • Restrict to specific models
  • Set rate limits (requests per minute)

Share these virtual keys with team members instead of your master key.


Part 6: Deploy as a Service with Nginx {#part-6}

Create systemd Service

sudo nano /etc/systemd/system/litellm.service

[Unit]
Description=LiteLLM Proxy
After=network.target ollama.service

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/opt/litellm
ExecStart=/opt/litellm/venv/bin/litellm --config config.yaml --port 4000 --host 127.0.0.1
Restart=on-failure
RestartSec=10
Environment=OPENAI_API_KEY=sk-your-key
Environment=ANTHROPIC_API_KEY=sk-ant-your-key

[Install]
WantedBy=multi-user.target

Note that the service now binds to 127.0.0.1, so Nginx becomes the only public entry point. Enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable litellm
sudo systemctl start litellm

Nginx HTTPS Proxy

sudo nano /etc/nginx/sites-available/litellm

server {
    listen 80;
    server_name llm.yourdomain.com;

    location / {
        proxy_pass http://localhost:4000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        proxy_buffering off;        # don't buffer streamed (SSE) responses
        client_max_body_size 50m;
    }
}

Enable the site and issue a certificate:

sudo ln -s /etc/nginx/sites-available/litellm /etc/nginx/sites-enabled/
sudo certbot --nginx -d llm.yourdomain.com

Now your applications point to https://llm.yourdomain.com for LLM calls.
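
It is worth a quick end-to-end check once the service and Nginx are in place. The sketch below assumes LiteLLM's liveness endpoint (/health/liveliness in recent releases; confirm the path for your installed version) and the placeholder domain from the Nginx config:

import requests

BASE = "https://llm.yourdomain.com"

# Liveness: does the proxy process answer at all?
print(requests.get(f"{BASE}/health/liveliness", timeout=10).text)

# Full round trip: one short completion through Nginx, LiteLLM, and the backend
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    headers={"Authorization": "Bearer sk-master-key-change-this"},
    json={"model": "llama3", "messages": [{"role": "user", "content": "ping"}]},
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])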


The Thing That Tripped Me Up {#gotcha}

My application was getting authentication errors even with the correct master key.

The issue: I hadn't set the master_key in config.yaml, so LiteLLM was running without any authentication. When I added the master key later and restarted, all existing API clients got 401 errors because they were sending the key in the wrong format.

LiteLLM expects the key as: Authorization: Bearer sk-your-key

My application was sending: Authorization: sk-your-key (missing "Bearer")

The fix: Update the OpenAI client initialization to pass the key correctly:

client = OpenAI(
    base_url="https://llm.yourdomain.com",
    api_key="sk-your-master-key"  # OpenAI SDK adds "Bearer" automatically
)

Or if using raw HTTP:

# Note: the "Bearer " prefix is required
curl https://llm.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer sk-your-master-key" \
  -H "Content-Type: application/json" \
  -d '...'

Troubleshooting {#troubleshooting}

| Issue | Likely Cause | Fix |
|---|---|---|
| 401 Unauthorized | Missing or wrong API key | Add Authorization: Bearer YOUR_KEY header |
| Model not found | Model name not in config | Check model_list in config.yaml |
| Ollama connection refused | Ollama not running | sudo systemctl start ollama |
| Slow responses | Inference time, not LiteLLM | LiteLLM adds <5 ms overhead; slowness comes from the model |
| Config changes not applied | Service not restarted | sudo systemctl restart litellm |
| Database errors | SQLite file permissions | Check file ownership matches the service user |
| 504 Gateway Timeout | Response taking too long | Increase proxy_read_timeout in Nginx |

Summary {#verdict}

What you built:

  • LiteLLM proxy running on your cloud server
  • Single OpenAI-compatible endpoint routing to multiple backends
  • Local Ollama models + optional cloud provider integration
  • Fallback routing when models fail
  • Virtual API keys with per-key limits and restrictions
  • Web dashboard for monitoring and management
  • HTTPS endpoint via Nginx

Your applications now talk to one endpoint. Switch from Ollama to GPT-4 in configuration, not in code. Add Anthropic Claude as a backup — no code changes required.

Frequently Asked Questions {#faq}

How much RAM do I need to run LiteLLM on a VPS?
The LiteLLM proxy itself is lightweight; a 2 GB RAM plan is enough. RAM only becomes a concern if you host Ollama models on the same server: 3B parameter models need ~3–4 GB, 7B models ~5–6 GB, and 13B+ models 12 GB or more.

Can LiteLLM run on a CPU-only server without a GPU?
Yes. The proxy itself adds negligible CPU load. Inference speed is a separate question: if you run local models on the same box, 3B models are responsive on CPU, while 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.

Is my data private when using self-hosted AI models?
Requests routed to local Ollama models are processed entirely on your server with no external API calls, so those conversations, documents, and prompts never leave your infrastructure. Keep in mind that any requests LiteLLM routes to cloud providers (OpenAI, Anthropic) do go to those providers.

What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.

Can I use LiteLLM as a drop-in replacement for the OpenAI API?
Yes. That is exactly what the proxy is: it exposes the standard OpenAI API surface, so you switch an existing application by pointing your OpenAI SDK client's base_url (and api_key) at your server.

👉 Get started with Tencent Cloud Lighthouse
👉 View current pricing and launch promotions
👉 Explore all active deals and offers