Run LocalAI on a Cloud Server — A Drop-In OpenAI Replacement That Runs Entirely on Your Hardware

I had a project already built against the OpenAI API. The application code was clean, but I wanted to explore running it against local models — for cost reasons and because some of the data I was sending was sensitive.

LocalAI's value proposition is simple: it exposes the same API endpoints as OpenAI (not just /v1/chat/completions, but embeddings, speech-to-text, image generation) and routes them to local models. Change OPENAI_BASE_URL to point to your server, and your existing code works.

The catch is that chat templates are model-specific, and using the wrong one gives you garbled output or errors. I hit this on my first attempt. I'll explain how to get it right.

In my case the switch really was that small: I changed one environment variable (the base URL), and the application ran against a local model with no code changes. The responses differ, of course, because a different model is generating them.

I run LocalAI on Tencent Cloud Lighthouse. The 4 GB RAM / 2 vCPU plan is enough for 3B-parameter models; plan on 8 GB for 7B and larger. Lighthouse's TencentOS AI application image is particularly useful for LocalAI deployments: it comes pre-installed with Python 3, Docker, Git, and AI frameworks including PyTorch and TensorFlow, along with GPU driver support for GPU instances, which removes the multi-hour CUDA and driver setup. Running LocalAI on a server also means your applications point to a permanent URL, and all inference runs on your Lighthouse instance with no data sent to third-party AI providers.

LocalAI vs Ollama — Key Differences
What You Need
Part 1: Install LocalAI with Docker
Part 2: Download and Configure Models
Part 3: Test the API Endpoints
Part 4: Embeddings and Additional Capabilities
Part 5: Connect Your Existing Application
Part 6: Set Up Nginx and Deploy as a Service
The Thing That Tripped Me Up
Troubleshooting
Summary

Key Takeaways

Use the appropriate Lighthouse application image to skip manual installation steps where available
Lighthouse snapshots provide one-click full-server backup before major changes
OrcaTerm browser terminal lets you manage the server from any device
CBS cloud disk expansion handles growing storage needs without server migration
Console-level firewall + UFW = two independent protection layers

LocalAI vs Ollama — Key Differences {#comparison}

Feature	LocalAI	Ollama
OpenAI API compatibility	Full drop-in replacement	Chat + embeddings only
Image generation	Yes (Stable Diffusion)	No
Text-to-speech	Yes	No
Speech-to-text	Yes (Whisper)	No
Vision (image input)	Yes	Yes (limited)
Model format	GGUF, GPTQ, and others	Primarily GGUF
Web UI	No (API only)	No (needs Open WebUI)
Installation	Docker or binary	Simple install script
Setup complexity	Moderate	Easy

Use LocalAI when: you need full OpenAI API compatibility (including non-chat endpoints), or need multi-modal capabilities.
Use Ollama when: you want the simplest setup for chat and embeddings.

What You Need {#prerequisites}

Requirement	Details
Server	Ubuntu 22.04, 4 GB+ RAM
Docker	Installed
Storage	10–20 GB for models

Part 1: Install LocalAI with Docker {#part-1}

1.1 — Install Docker

curl -fsSL https://get.docker.com | sh
sudo systemctl enable docker
sudo usermod -aG docker $USER
newgrp docker

1.2 — Create a Directory for Models

sudo mkdir -p /opt/localai/models
sudo chown "$USER" /opt/localai/models   # so later wget/nano steps work without sudo

1.3 — Run LocalAI (CPU mode)

docker run -d \
  --name localai \
  --restart=always \
  -p 127.0.0.1:8080:8080 \
  -v /opt/localai/models:/models \
  -e MODELS_PATH=/models \
  localai/localai:latest-aio-cpu

The aio-cpu (all-in-one CPU) image includes all required backends for text generation, embeddings, image generation, and speech.

For GPU (NVIDIA):

docker run -d \
  --name localai \
  --restart=always \
  --gpus all \
  -p 127.0.0.1:8080:8080 \
  -v /opt/localai/models:/models \
  -e MODELS_PATH=/models \
  localai/localai:latest-aio-gpu-nvidia-cuda-12

1.4 — Check LocalAI is Running

docker logs localai
curl http://localhost:8080/v1/models
# Returns a JSON list such as {"object":"list","data":[...]}
# "data" is empty if no models are installed; the aio image may already list its preconfigured models
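
The container can take a while to start answering, especially on first boot. The following is a minimal sketch of a readiness check that polls the same `/v1/models` endpoint from Python using only the standard library. The function names (`fetch_model_ids`, `wait_until_ready`) and the retry defaults are illustrative choices, not part of LocalAI.

```python
import json
import time
import urllib.request


def fetch_model_ids(base_url):
    """Return the model ids reported by GET {base_url}/v1/models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        payload = json.load(resp)
    # "data" may be missing or null on an empty server, so fall back to [].
    return [entry["id"] for entry in (payload.get("data") or [])]


def wait_until_ready(base_url, attempts=30, delay=2.0,
                     fetch=fetch_model_ids, sleep=time.sleep):
    """Poll until the API answers, then return the list of model ids.

    Connection failures (URLError, refused connections, socket timeouts)
    are all subclasses of OSError, so one except clause covers them.
    Raises TimeoutError if the server never answers.
    """
    for _ in range(attempts):
        try:
            return fetch(base_url)
        except OSError:
            sleep(delay)
    raise TimeoutError(f"{base_url} did not respond after {attempts} attempts")
```

Calling `wait_until_ready("http://localhost:8080")` on the server returns the list of model ids once the API is reachable. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.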

Part 2: Download and Configure Models {#part-2}

LocalAI uses YAML configuration files to define models. Each model needs a config file and the model weights.

2.1 — Download a Model from the Gallery

LocalAI has a model gallery with pre-configured models:

# List available gallery models
curl http://localhost:8080/models/available | python3 -m json.tool | head -50

# Install a model from the gallery
curl -X POST http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"id": "ggml-gpt4all-j"}'

Wait for the download (can take several minutes):

# Check download progress
curl http://localhost:8080/models/jobs

2.2 — Manually Configure a GGUF Model

For more control, configure models manually.

Download a model (example — Llama 3.2 3B Q4):

cd /opt/localai/models
wget "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf" \
  -O llama3.2-3b.gguf

Create a model config file:

nano /opt/localai/models/llama3.2-3b.yaml

name: llama3.2-3b
context_size: 4096
threads: 2          # match your vCPU count
backend: llama-cpp
parameters:
  model: llama3.2-3b.gguf
  temperature: 0.7
  top_k: 40
  top_p: 0.9

stopwords:
  - "<|eot_id|>"
  - "<|end_of_text|>"

template:
  chat: |
    {{.Input}}
    <|start_header_id|>assistant<|end_header_id|>
  chat_message: |
    <|start_header_id|>{{.RoleName}}<|end_header_id|>

    {{.Content}}<|eot_id|>
  completion: |
    {{.Input}}

Restart LocalAI to load the new model:

docker restart localai

2.3 — Verify the Model is Available

curl http://localhost:8080/v1/models
# Should show llama3.2-3b in the list

Part 3: Test the API Endpoints {#part-3}

3.1 — Chat Completion

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2-3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
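
The response follows the OpenAI chat completion shape, so the reply text lives at `choices[0].message.content`. Here is a small sketch of pulling it out in Python. The sample body is hand-written for illustration (it is not captured LocalAI output), and `extract_reply` is a hypothetical helper name.

```python
import json

# Hand-written sample in the OpenAI chat completion shape (not captured output).
SAMPLE_RESPONSE = """
{
  "object": "chat.completion",
  "model": "llama3.2-3b",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {"role": "assistant", "content": "The capital of France is Paris."}
    }
  ]
}
"""


def extract_reply(raw_json):
    """Return the assistant text from a chat completion response body."""
    body = json.loads(raw_json)
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


print(extract_reply(SAMPLE_RESPONSE))
```

If the request fails, the body usually carries an `error` object instead of `choices`, which is why the helper raises rather than returning an empty string.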

3.2 — Text Completion

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2-3b",
    "prompt": "The capital of France is",
    "max_tokens": 50
  }'

3.3 — Embeddings

LocalAI supports embeddings for vector search — you'll need an embedding-capable model:

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "The quick brown fox jumps over the lazy dog"
  }'

For local embeddings, configure a GGUF embedding model (like nomic-embed-text) in a separate YAML config.

Part 4: Embeddings and Additional Capabilities {#part-4}

Configure an Embedding Model

Download an embedding model:

cd /opt/localai/models
wget "https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/nomic-embed-text-v1.5.Q8_0.gguf" \
  -O nomic-embed-text.gguf

Create config file /opt/localai/models/nomic-embed-text.yaml:

name: text-embedding-ada-002   # Use OpenAI's name for drop-in compatibility
backend: llama-cpp
embeddings: true
parameters:
  model: nomic-embed-text.gguf
  embeddings: true

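Using the OpenAI model name (text-embedding-ada-002) means any code that calls OpenAI's embeddings API will work unchanged. One caveat: the vectors themselves are not interchangeable. OpenAI's text-embedding-ada-002 returns 1536-dimensional vectors, while nomic-embed-text-v1.5 returns 768, so any existing vector index has to be rebuilt with the new model.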
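
For vector search, the embeddings that come back are typically compared with cosine similarity. The sketch below shows that comparison in plain Python, using toy 3-dimensional vectors in place of real embeddings. The function names and the sample data are made up for illustration. The dimension check is there because vectors from two different embedding models cannot be compared.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, docs):
    """Return document names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in docs.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy vectors standing in for real embeddings returned by /v1/embeddings.
docs = {
    "fox": [0.9, 0.1, 0.0],
    "dog": [0.7, 0.3, 0.1],
    "tax": [0.0, 0.2, 0.9],
}
print(rank([1.0, 0.0, 0.0], docs))  # ['fox', 'dog', 'tax']
```

In a real application the vectors would come from the `data[i].embedding` field of the embeddings response, and a vector database would do the ranking at scale.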

Configure Whisper for Speech-to-Text

cd /opt/localai/models
wget "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin" \
  -O whisper-base-en.bin

Config /opt/localai/models/whisper.yaml:

name: whisper-1
backend: whisper
parameters:
  model: whisper-base-en.bin

Test transcription:

curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@/path/to/audio.mp3 \
  -F model=whisper-1

Part 5: Connect Your Existing Application {#part-5}

The key advantage of LocalAI: change one line in your application.

Python (OpenAI SDK)

from openai import OpenAI

# Before: using OpenAI
# client = OpenAI(api_key="sk-your-openai-key")

# After: using LocalAI. Point base_url at your server; the api_key is a placeholder
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # LocalAI doesn't require authentication by default
)

# This code is identical whether using OpenAI or LocalAI:
response = client.chat.completions.create(
    model="llama3.2-3b",   # Or "gpt-3.5-turbo" if you name the LocalAI model that
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
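
If you would rather not hardcode the URL, the settings can come from the environment, which is what makes the switch a deployment change instead of a code change. This is a sketch under the assumption that you use the `OPENAI_BASE_URL` and `OPENAI_API_KEY` variables (the OpenAI Python SDK reads both by default). The helper name `client_settings` and its fallback values are illustrative.

```python
import os


def client_settings(env=None):
    """Build OpenAI client keyword arguments from environment variables.

    Falls back to OpenAI's public endpoint when OPENAI_BASE_URL is unset,
    and to a placeholder key, which is enough for a LocalAI server that
    has no API key configured.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        "api_key": env.get("OPENAI_API_KEY", "not-needed"),
    }


# Usage with the SDK (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(**client_settings())
print(client_settings({"OPENAI_BASE_URL": "http://localhost:8080/v1"}))
```

With this in place, `export OPENAI_BASE_URL=http://localhost:8080/v1` sends traffic to LocalAI, and unsetting it returns to OpenAI.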

Naming Models for Transparent Compatibility

For complete drop-in compatibility, name your LocalAI models to match OpenAI's names:

In your model YAML:

name: gpt-3.5-turbo    # OpenAI-compatible name

Now any application calling gpt-3.5-turbo will use your local model without knowing it changed.

Node.js

import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed",
});

const completion = await openai.chat.completions.create({
  model: "llama3.2-3b",
  messages: [{ role: "user", content: "Translate to Spanish: Hello world" }],
});

console.log(completion.choices[0].message.content);

Part 6: Set Up Nginx and Deploy as a Service {#part-6}

Nginx Configuration

sudo apt install -y nginx certbot python3-certbot-nginx
sudo nano /etc/nginx/sites-available/localai

server {
    listen 80;
    server_name localai.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        client_max_body_size 100m;
    }
}

sudo ln -s /etc/nginx/sites-available/localai /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
sudo certbot --nginx -d localai.yourdomain.com

Enable API Authentication

Once Nginx exposes the API to the internet, anyone who finds the URL can run inference on your server. Recreate the container with an API key:

docker stop localai && docker rm localai

docker run -d \
  --name localai \
  --restart=always \
  -p 127.0.0.1:8080:8080 \
  -v /opt/localai/models:/models \
  -e MODELS_PATH=/models \
  -e API_KEY=your-api-key-here \
  localai/localai:latest-aio-cpu

Now pass the key in requests: Authorization: Bearer your-api-key-here
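
As an illustration of what that header looks like from client code, here is a standard-library sketch that builds (but does not send) an authenticated chat request. The function name `build_chat_request` is hypothetical, and the domain and key are the same placeholders used above.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, user_message):
    """Build a POST request for /v1/chat/completions with a Bearer token."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request(
    "https://localai.yourdomain.com",  # placeholder domain
    "your-api-key-here",               # placeholder key
    "llama3.2-3b",
    "Hello!",
)
print(req.get_header("Authorization"))
```

Passing `req` to `urllib.request.urlopen` would send it. With the OpenAI SDK, the equivalent is setting `api_key` to the same value, since the SDK sends it as a Bearer token.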

The Thing That Tripped Me Up {#gotcha}

My first LocalAI model was configured and the API showed it in the model list, but every chat completion returned a 500 error with no useful error message in the response body.

The issue was the chat template in my YAML config. LocalAI uses templates to format the conversation into a prompt that the model understands. Different models expect different formats: Llama 2 uses [INST] tags, Llama 3 uses <|start_header_id|>, Mistral uses [INST] with slight variations.

Using the wrong template produces garbage output or errors.

How I diagnosed it:

docker logs localai 2>&1 | grep -i error

Found: template parse error: unexpected token

The fix: Check the model card on Hugging Face for the exact prompt format, or use a pre-validated template from LocalAI's examples:

# Check LocalAI's template examples on GitHub
# github.com/mudler/LocalAI/tree/master/embedded/templates

For Llama 3 models specifically, the correct template uses <|start_header_id|> format (as shown in Part 2 above). For Mistral, it's different. Always match the template to the model family.
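
To make the difference between families concrete, here is a rough Python sketch of the prompt strings the two styles produce. It mimics the published formats by hand and is not LocalAI's template engine. The `[INST]` version is simplified to a single user turn with no system prompt, and both omit the beginning-of-sequence token, which the backend may add on its own.

```python
def render_llama3_prompt(messages):
    """Render a message list in the Llama 3 header format."""
    parts = []
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Open assistant header: the model continues from here with its reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


def render_inst_prompt(user_message):
    """Simplified single-turn [INST] style used by Llama 2 and Mistral."""
    return f"[INST] {user_message} [/INST]"


messages = [{"role": "user", "content": "Hi"}]
print(repr(render_llama3_prompt(messages)))
print(repr(render_inst_prompt("Hi")))
```

A model trained on one of these layouts has never seen the other's markers, which is why a mismatched template tends to produce rambling or empty output even when nothing fails outright.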

Troubleshooting {#troubleshooting}

Issue	Likely Cause	Fix
500 error on completions	Wrong chat template	Check model's prompt format on Hugging Face
Model not found in `/v1/models`	Config file not loaded	Check YAML syntax; restart container
Very slow responses	CPU inference	Expected; upgrade to GPU for faster inference
Out of memory	Model too large for RAM	Use a smaller or more quantized model
Docker container won't start	Port conflict	Check `lsof -i :8080`; change port if needed
Empty responses	Context too short	Increase `context_size` in model config
Gallery model download fails	Network issue	Retry; check server has internet access
Embeddings return wrong dimensions	Wrong embedding model	Use a model specifically configured for embeddings

Summary {#verdict}

✅ What you built:

LocalAI running as a Docker service with auto-restart
At least one GGUF language model configured and responding
OpenAI-compatible API at https://localai.yourdomain.com/v1
Embeddings endpoint for vector search
Optional: Whisper speech-to-text endpoint
Existing OpenAI SDK applications work with a single base_url change

LocalAI's main value proposition is seamless compatibility with the entire OpenAI API surface. For teams or projects already built on OpenAI, it's the lowest-friction path to local/private inference.

Frequently Asked Questions {#faq}

How much RAM do I need to run LocalAI on a VPS?
It depends on the model size. 3B parameter models need ~3–4 GB RAM; 7B models need ~5–6 GB; 13B+ models need 12+ GB. Check the requirements section for specific recommendations.
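
A back-of-the-envelope way to sanity-check those numbers is weights plus a fixed overhead. The sketch below assumes roughly 4.5 bits per weight for a 4-bit quantized GGUF and about 1.5 GB for context cache and runtime. Both figures are rule-of-thumb assumptions, not measurements, and the result covers the model process only, so leave headroom for the OS and for longer contexts.

```python
def estimate_ram_gb(params_billion, bits_per_weight=4.5, overhead_gb=1.5):
    """Rough RAM estimate in GB for a quantized model.

    weights (GB) = parameters in billions * bits per weight / 8
    overhead_gb is an assumed allowance for context cache and runtime.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)


for size in (3, 7, 13):
    print(f"{size}B: ~{estimate_ram_gb(size)} GB")
# 3B: ~3.2 GB
# 7B: ~5.4 GB
# 13B: ~8.8 GB
```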

Can LocalAI run on a CPU-only server without a GPU?
Yes, but inference speed varies significantly. 3B models are responsive on CPU. 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.

Is my data private when using self-hosted AI models?
Yes — data is processed entirely on your server with no external API calls. Conversations, documents, and prompts never leave your infrastructure. This is a key advantage of self-hosting AI.

What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.

Can I use LocalAI as a drop-in replacement for the OpenAI API?
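Yes, for most applications. LocalAI exposes OpenAI-compatible endpoints for chat, completions, embeddings, and audio transcription, so you can usually switch by changing the `base_url` to your server address. The model name in each request must match a model configured in LocalAI, and output quality depends on the local model you choose.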
