I had a project already built against the OpenAI API. The application code was clean, but I wanted to explore running it against local models — for cost reasons and because some of the data I was sending was sensitive.
LocalAI's value proposition is simple: it exposes the same API endpoints as OpenAI (not just /v1/chat/completions, but embeddings, speech-to-text, image generation) and routes them to local models. Change OPENAI_BASE_URL to point to your server, and your existing code works.
The catch is that chat templates are model-specific, and using the wrong one gives you garbled output or errors. I hit this on my first attempt. I'll explain how to get it right.
In my case, switching to LocalAI meant changing one environment variable (the base URL); the application then ran against a local model with no code changes.
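A minimal Python sketch of that one-variable switch. The helper name `resolve_base_url` is hypothetical and only makes the lookup explicit; recent versions of the official `openai` package read `OPENAI_BASE_URL` on their own.

```python
import os

# Default endpoint used when no override is configured.
DEFAULT_OPENAI_URL = "https://api.openai.com/v1"


def resolve_base_url(env=None):
    """Return the API base URL, preferring OPENAI_BASE_URL when it is set."""
    env = os.environ if env is None else env
    # Strip a trailing slash so paths can be appended consistently.
    return env.get("OPENAI_BASE_URL", DEFAULT_OPENAI_URL).rstrip("/")
```

With `OPENAI_BASE_URL=http://localhost:8080/v1` exported, the same application code talks to LocalAI; with the variable unset, it falls back to OpenAI.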
I run LocalAI on Tencent Cloud Lighthouse. The 4 GB RAM / 2 vCPU plan handles 3B-parameter models; for 7B and larger models, use a plan with at least 8 GB. Lighthouse's TencentOS AI application image is particularly useful for LocalAI deployments: it comes with Python 3, Docker, Git, and AI frameworks including PyTorch and TensorFlow pre-installed, along with GPU driver support on GPU instances, which removes the multi-hour CUDA and driver setup. Running LocalAI on a server also gives your applications a permanent URL, and all inference runs on your Lighthouse instance, with no data sent to third-party AI providers.
| Feature | LocalAI | Ollama |
|---|---|---|
| OpenAI API compatibility | Full drop-in replacement | Chat + embeddings only |
| Image generation | Yes (Stable Diffusion) | No |
| Text-to-speech | Yes | No |
| Speech-to-text | Yes (Whisper) | No |
| Vision (image input) | Yes | Yes (limited) |
| Model format | GGUF, GPTQ, and others | Primarily GGUF |
| Web UI | No (API only) | No (needs Open WebUI) |
| Installation | Docker or binary | Simple install script |
| Setup complexity | Moderate | Easy |
Use LocalAI when: you need full OpenAI API compatibility (including non-chat endpoints), or need multi-modal capabilities.
Use Ollama when: you want the simplest setup for chat and embeddings.
| Requirement | Details |
|---|---|
| Server | Ubuntu 22.04, 4 GB+ RAM |
| Docker | Installed |
| Storage | 10–20 GB for models |
curl -fsSL https://get.docker.com | sh
sudo systemctl enable docker
sudo usermod -aG docker $USER
newgrp docker
sudo mkdir -p /opt/localai/models
sudo chown -R "$USER" /opt/localai  # lets the later wget and nano commands run without sudo
docker run -d \
--name localai \
--restart=always \
-p 127.0.0.1:8080:8080 \
-v /opt/localai/models:/models \
-e MODELS_PATH=/models \
localai/localai:latest-aio-cpu
The aio-cpu (all-in-one CPU) image includes all required backends for text generation, embeddings, image generation, and speech.
For GPU (NVIDIA):
docker run -d \
--name localai \
--restart=always \
--gpus all \
-p 127.0.0.1:8080:8080 \
-v /opt/localai/models:/models \
-e MODELS_PATH=/models \
localai/localai:latest-aio-gpu-nvidia-cuda-12
docker logs localai
curl http://localhost:8080/v1/models
# Returns a JSON list such as {"object":"list","data":[]}; data stays empty until models are added (all-in-one images may list their own preconfigured models)
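The same check can be scripted. This is a small sketch using only the Python standard library; the function names are hypothetical, and it assumes the response follows the OpenAI-style list shape shown above.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model names from a /v1/models style response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_models(base_url="http://localhost:8080/v1", timeout=10):
    """Query the server and return the list of model names it reports."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Calling `fetch_models()` on the server should return an empty list on a fresh install and the configured model names afterwards.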
LocalAI uses YAML configuration files to define models. Each model needs a config file and the model weights.
LocalAI has a model gallery with pre-configured models:
# List available gallery models
curl http://localhost:8080/models/available | python3 -m json.tool | head -50
# Install a model from the gallery
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "ggml-gpt4all-j"}'
Wait for the download (can take several minutes):
# Check download progress
curl http://localhost:8080/models/jobs
For more control, configure models manually.
Download a model (example — Llama 3.2 3B Q4):
cd /opt/localai/models
wget "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf" \
-O llama3.2-3b.gguf
Create a model config file:
nano /opt/localai/models/llama3.2-3b.yaml
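name: llama3.2-3b
context_size: 4096
backend: llama-cpp
threads: 4  # set this to the number of vCPUs on your instance
parameters:
  model: llama3.2-3b.gguf
  temperature: 0.7
  top_k: 40
  top_p: 0.9
stopwords:
  - <|eot_id|>
# Llama 3 prompt format: the chat template ends with the assistant header
# so the model knows it is its turn to reply.
template:
  chat: |
    {{.Input}}
    <|start_header_id|>assistant<|end_header_id|>
  chat_message: |
    <|start_header_id|>{{.RoleName}}<|end_header_id|>

    {{.Content}}<|eot_id|>
  completion: |
    {{.Input}}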
Restart LocalAI to load the new model:
docker restart localai
curl http://localhost:8080/v1/models
# Should show llama3.2-3b in the list
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2-3b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
]
}'
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2-3b",
"prompt": "The capital of France is",
"max_tokens": 50
}'
LocalAI supports embeddings for vector search — you'll need an embedding-capable model:
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "text-embedding-ada-002",
"input": "The quick brown fox jumps over the lazy dog"
}'
For local embeddings, configure a GGUF embedding model (like nomic-embed-text) in a separate YAML config.
Download an embedding model:
cd /opt/localai/models
wget "https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/nomic-embed-text-v1.5.Q8_0.gguf" \
-O nomic-embed-text.gguf
Create config file /opt/localai/models/nomic-embed-text.yaml:
name: text-embedding-ada-002  # Use OpenAI's name for drop-in compatibility
backend: llama-cpp
embeddings: true
parameters:
  model: nomic-embed-text.gguf
Using the OpenAI model name (text-embedding-ada-002) means any code that calls OpenAI's embeddings API will work unchanged.
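For the vector-search side, here is a minimal sketch of how returned embeddings are typically compared. It uses plain cosine similarity, which is my choice for illustration; the function names are hypothetical, and the vectors are whatever the embeddings endpoint returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm


def rank(query_vec, docs):
    """docs is a list of (text, vector) pairs; returns texts, most similar first."""
    ordered = sorted(docs, key=lambda d: cosine_similarity(query_vec, d[1]), reverse=True)
    return [text for text, _ in ordered]
```

The dimension check also catches the case where documents were embedded with one model and queries with another.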
cd /opt/localai/models
wget "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin" \
-O whisper-base-en.bin
Config /opt/localai/models/whisper.yaml:
name: whisper-1
backend: whisper
parameters:
  model: whisper-base-en.bin
Test transcription:
curl http://localhost:8080/v1/audio/transcriptions \
-F file=@/path/to/audio.mp3 \
-F model=whisper-1
The key advantage of LocalAI: change one line in your application.
from openai import OpenAI

# Before: using OpenAI
# client = OpenAI(api_key="sk-your-openai-key")

# After: using LocalAI, only the base_url changes
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # LocalAI doesn't require authentication by default
)

# This code is identical whether using OpenAI or LocalAI:
response = client.chat.completions.create(
    model="llama3.2-3b",  # Or "gpt-3.5-turbo" if you name the LocalAI model that
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
For complete drop-in compatibility, name your LocalAI models to match OpenAI's names:
In your model YAML:
name: gpt-3.5-turbo # OpenAI-compatible name
Now any application calling gpt-3.5-turbo will use your local model without knowing it changed.
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed",
});

const completion = await openai.chat.completions.create({
  model: "llama3.2-3b",
  messages: [{ role: "user", content: "Translate to Spanish: Hello world" }],
});
console.log(completion.choices[0].message.content);
sudo apt install -y nginx certbot python3-certbot-nginx
sudo nano /etc/nginx/sites-available/localai
server {
    listen 80;
    server_name localai.yourdomain.com;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        client_max_body_size 100m;
    }
}
sudo ln -s /etc/nginx/sites-available/localai /etc/nginx/sites-enabled/
sudo certbot --nginx -d localai.yourdomain.com
Edit LocalAI's Docker command to add an API key:
docker stop localai && docker rm localai
docker run -d \
--name localai \
--restart=always \
-p 127.0.0.1:8080:8080 \
-v /opt/localai/models:/models \
-e MODELS_PATH=/models \
-e API_KEY=your-api-key-here \
localai/localai:latest-aio-cpu
Now pass the key in requests: Authorization: Bearer your-api-key-here
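As a sketch of what that header looks like from client code, the helper below builds an authenticated chat request with the Python standard library. The function name is hypothetical, and the URL and key are the placeholders used above.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, user_message):
    """Build (but do not send) an authenticated chat completion request."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": user_message}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # same key as API_KEY on the server
        },
        method="POST",
    )
```

Pass the result to `urllib.request.urlopen` to send it. With the `openai` client, setting `api_key` to the same value has the same effect.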
My first LocalAI model was configured and the API showed it in the model list, but every chat completion returned a 500 error with no useful error message in the response body.
The issue was the chat template in my YAML config. LocalAI uses templates to format the conversation into a prompt that the model understands. Different models expect different formats: Llama 2 uses [INST] tags, Llama 3 uses <|start_header_id|>, Mistral uses [INST] with slight variations.
Using the wrong template produces garbage output or errors.
How I diagnosed it:
docker logs localai 2>&1 | grep -i error
Found: template parse error: unexpected token
The fix: Check the model card on Hugging Face for the exact prompt format, or use a pre-validated template from LocalAI's examples:
# Check LocalAI's template examples on GitHub
# github.com/mudler/LocalAI/tree/master/embedded/templates
For Llama 3 models specifically, the correct template uses <|start_header_id|> format (as shown in Part 2 above). For Mistral, it's different. Always match the template to the model family.
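To make the template mechanics concrete, here is a minimal Python sketch of what a Llama 3 style prompt looks like once a message list is rendered. This is not LocalAI's templating code, only an illustration of the target format as I understand it from the model card. It leaves out the leading `<|begin_of_text|>` token, on the assumption that the backend adds it.

```python
def render_llama3(messages):
    """Render OpenAI-style messages into a Llama 3 style prompt string."""
    parts = []
    for m in messages:
        # Each turn: a role header, a blank line, the content, an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    # An open assistant header tells the model it is its turn to answer.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

Printing the rendered prompt for a short conversation and comparing it with the format on the model card is a quick way to spot a mismatched template before debugging server logs.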
| Issue | Likely Cause | Fix |
|---|---|---|
| 500 error on completions | Wrong chat template | Check model's prompt format on Hugging Face |
| Model not found in /v1/models | Config file not loaded | Check YAML syntax; restart container |
| Very slow responses | CPU inference | Expected; upgrade to GPU for faster inference |
| Out of memory | Model too large for RAM | Use a smaller or more quantized model |
| Docker container won't start | Port conflict | Check lsof -i :8080; change port if needed |
| Empty responses | Context too short | Increase context_size in model config |
| Gallery model download fails | Network issue | Retry; check server has internet access |
| Embeddings return wrong dimensions | Wrong embedding model | Use a model specifically configured for embeddings |
✅ What you built:
https://localai.yourdomain.com/v1
LocalAI's main value proposition is seamless compatibility with the entire OpenAI API surface. For teams or projects already built on OpenAI, it's the lowest-friction path to local/private inference.
How much RAM do I need to run LocalAI on a VPS?
It depends on the model size. 3B parameter models need ~3–4 GB RAM; 7B models need ~5–6 GB; 13B+ models need 12+ GB. Check the requirements section for specific recommendations.
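A back-of-the-envelope way to sanity-check those figures. The defaults are my assumptions, not measurements: about 4.8 bits per weight for a 4-bit GGUF quantization and 1.5 GB of overhead for the context cache and runtime.

```python
def estimate_weights_gb(params_billion, bits_per_weight=4.8):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimate_ram_gb(params_billion, bits_per_weight=4.8, overhead_gb=1.5):
    """Weights plus a flat allowance for context cache and runtime overhead."""
    return estimate_weights_gb(params_billion, bits_per_weight) + overhead_gb
```

Under these assumptions a 3B model comes out a little above 3 GB and a 7B model a little under 6 GB, in line with the ranges above. For 13B the result is lower than the 12+ GB quoted, so treat the output as a floor, not a recommendation.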
Can LocalAI run on a CPU-only server without a GPU?
Yes, but inference speed varies significantly. 3B models are responsive on CPU. 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.
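To put a number on "responsive" for your own server, you can time a request and divide the generated tokens by the elapsed time. The helper names below are hypothetical, and the approach assumes the server fills in the `usage` field of the response the way OpenAI's API does.

```python
import time


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generation throughput for one request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def timed(call):
    """Run call() and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


# Example use with an OpenAI-compatible client object:
#   response, elapsed = timed(lambda: client.chat.completions.create(...))
#   print(tokens_per_second(response.usage.completion_tokens, elapsed))
```

The measured time includes prompt processing, so short prompts give the cleanest comparison between models or instance sizes.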
Is my data private when using self-hosted AI models?
Yes — data is processed entirely on your server with no external API calls. Conversations, documents, and prompts never leave your infrastructure. This is a key advantage of self-hosting AI.
What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.