Run LocalAI on a Cloud Server — A Drop-In OpenAI Replacement That Runs Entirely on Your Hardware

I had a project already built against the OpenAI API. The application code was clean, but I wanted to explore running it against local models — for cost reasons and because some of the data I was sending was sensitive.

LocalAI's value proposition is simple: it exposes the same API endpoints as OpenAI (not just /v1/chat/completions, but embeddings, speech-to-text, image generation) and routes them to local models. Change OPENAI_BASE_URL to point to your server, and your existing code works.

The catch is that chat templates are model-specific, and using the wrong one gives you garbled output or errors. I hit this on my first attempt. I'll explain how to get it right.

Apart from that, the switch was what the project promises: I changed one environment variable (the base URL), and the application worked against a local model without code changes.

I run LocalAI on Tencent Cloud Lighthouse. The 4 GB RAM / 2 vCPU plan handles 3B-parameter models; plan on 8 GB or more for 7B and larger models. Lighthouse's TencentOS AI application image is particularly useful for LocalAI deployments: it comes pre-installed with Python 3, Docker, Git, and AI frameworks including PyTorch and TensorFlow, along with GPU driver support for GPU instances, which removes the multi-hour CUDA and driver setup. Running LocalAI on a server also gives your applications a permanent URL, and all inference stays on your Lighthouse instance with no data sent to third-party AI providers.


Table of Contents

  1. LocalAI vs Ollama — Key Differences
  2. What You Need
  3. Part 1: Install LocalAI with Docker
  4. Part 2: Download and Configure Models
  5. Part 3: Test the API Endpoints
  6. Part 4: Embeddings and Additional Capabilities
  7. Part 5: Connect Your Existing Application
  8. Part 6: Set Up Nginx and Deploy as a Service
  9. The Thing That Tripped Me Up
  10. Troubleshooting
  11. Summary

Key Takeaways

  • Use the appropriate Lighthouse application image to skip manual installation steps where available
  • Lighthouse snapshots provide one-click full-server backup before major changes
  • OrcaTerm browser terminal lets you manage the server from any device
  • CBS cloud disk expansion handles growing storage needs without server migration
  • Console-level firewall + UFW = two independent protection layers

LocalAI vs Ollama — Key Differences {#comparison}

| Feature | LocalAI | Ollama |
| --- | --- | --- |
| OpenAI API compatibility | Full drop-in replacement | Chat + embeddings only |
| Image generation | Yes (Stable Diffusion) | No |
| Text-to-speech | Yes | No |
| Speech-to-text | Yes (Whisper) | No |
| Vision (image input) | Yes | Yes (limited) |
| Model format | GGUF, GPTQ, and others | Primarily GGUF |
| Web UI | No (API only) | No (needs Open WebUI) |
| Installation | Docker or binary | Simple install script |
| Setup complexity | Moderate | Easy |

Use LocalAI when: you need full OpenAI API compatibility (including non-chat endpoints), or need multi-modal capabilities.
Use Ollama when: you want the simplest setup for chat and embeddings.


What You Need {#prerequisites}

| Requirement | Details |
| --- | --- |
| Server | Ubuntu 22.04, 4 GB+ RAM |
| Docker | Installed |
| Storage | 10 to 20 GB for models |
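The RAM requirement is easier to reason about with a rough estimate. The following is a minimal sketch, assuming the quantized weights dominate memory use and that a 4-bit K-quant averages a little under 5 bits per weight. The function name, the default bit count, and the 1 GB overhead allowance are my own assumptions for illustration, not figures published by LocalAI.

```python
def estimate_model_ram_gb(params_billion: float,
                          bits_per_weight: float = 4.8,
                          overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate for a quantized GGUF model.

    weights = parameters * bits_per_weight / 8 bytes. overhead_gb is a
    guessed allowance for the KV cache and runtime buffers (assumption).
    """
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 2)


if __name__ == "__main__":
    for size in (3, 7, 13):
        print(f"{size}B model: roughly {estimate_model_ram_gb(size)} GB")
```

Treat the result as a floor: longer context windows and concurrent requests push real usage higher.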

Part 1: Install LocalAI with Docker {#part-1}

1.1 — Install Docker

curl -fsSL https://get.docker.com | sh
sudo systemctl enable docker
sudo usermod -aG docker $USER
newgrp docker

1.2 — Create a Directory for Models

sudo mkdir -p /opt/localai/models
sudo chown -R $USER:$USER /opt/localai   # lets you download and edit models without sudo

1.3 — Run LocalAI (CPU mode)

docker run -d \
  --name localai \
  --restart=always \
  -p 127.0.0.1:8080:8080 \
  -v /opt/localai/models:/models \
  -e MODELS_PATH=/models \
  localai/localai:latest-aio-cpu

The aio-cpu (all-in-one CPU) image includes all required backends for text generation, embeddings, image generation, and speech.

For GPU (NVIDIA):

docker run -d \
  --name localai \
  --restart=always \
  --gpus all \
  -p 127.0.0.1:8080:8080 \
  -v /opt/localai/models:/models \
  -e MODELS_PATH=/models \
  localai/localai:latest-aio-gpu-nvidia-cuda-12

1.4 — Check LocalAI is Running

docker logs localai
curl http://localhost:8080/v1/models
# Returns a JSON model list, e.g. {"object":"list","data":[...]}
# The all-in-one images ship with preconfigured models, so the list may already have entries.
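If you would rather script this check than eyeball the curl output, here is a small Python sketch using only the standard library. It calls the same /v1/models endpoint shown above; the helper names `model_ids` and `fetch_model_ids` are hypothetical, and the default base URL assumes the port mapping from step 1.3.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model names from an OpenAI-style model list response."""
    return [item["id"] for item in (payload.get("data") or [])]


def fetch_model_ids(base_url: str = "http://localhost:8080") -> list[str]:
    """GET /v1/models and return the model names it reports."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return model_ids(json.load(resp))


if __name__ == "__main__":
    try:
        print("Models:", fetch_model_ids())
    except OSError as exc:  # connection refused, timeout, DNS failure
        print("LocalAI is not reachable yet:", exc)
```

An empty list only means no models are configured yet; a connection error means the container is still starting or not running.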

Part 2: Download and Configure Models {#part-2}

LocalAI uses YAML configuration files to define models. Each model needs a config file and the model weights.

2.1 — Install a Model from the Gallery

LocalAI has a model gallery with pre-configured models:

# List available gallery models
curl http://localhost:8080/models/available | python3 -m json.tool | head -50

# Install a model from the gallery
curl -X POST http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"id": "ggml-gpt4all-j"}'

Wait for the download (can take several minutes):

# Check download progress
curl http://localhost:8080/models/jobs
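Polling by hand gets tedious for large downloads. Below is a hedged sketch of a polling loop in Python. It deliberately takes the status lookup as a callable, because the exact shape of the job status JSON depends on your LocalAI version: the `processed` and `error` field names used here are assumptions that you should confirm against a real response, and `wait_for_job` itself is a hypothetical helper.

```python
import time
from typing import Callable


def wait_for_job(fetch_status: Callable[[], dict],
                 poll_seconds: float = 5.0,
                 max_polls: int = 120) -> dict:
    """Poll until the job reports it is done, then return the last status.

    Assumption: the status JSON has a boolean "processed" field and an
    optional "error" field. Verify against your LocalAI version.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("error"):
            raise RuntimeError(f"model install failed: {status['error']}")
        if status.get("processed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("model install did not finish in time")
```

In practice `fetch_status` would wrap an HTTP GET against the jobs endpoint shown above and return the parsed JSON.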

2.2 — Manually Configure a GGUF Model

For more control, configure models manually.

Download a model (example — Llama 3.2 3B Q4):

cd /opt/localai/models
wget "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf" \
  -O llama3.2-3b.gguf

Create a model config file:

nano /opt/localai/models/llama3.2-3b.yaml
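name: llama3.2-3b
backend: llama-cpp
context_size: 4096
threads: 4            # top-level setting; set this to your server's vCPU count
parameters:
  model: llama3.2-3b.gguf
  temperature: 0.7
  top_k: 40
  top_p: 0.9

template:
  # Ends with an open assistant header so the model knows it is its turn to answer
  chat: |
    {{.Input}}
    <|start_header_id|>assistant<|end_header_id|>
  chat_message: |
    <|start_header_id|>{{.RoleName}}<|end_header_id|>

    {{.Content}}<|eot_id|>
  completion: |
    {{.Input}}

# Stop generating at the end-of-turn marker instead of running on
stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"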
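To see what a template like this expands to, here is a minimal Python sketch that mimics the Llama 3 instruct layout for a list of chat messages. It is an illustration only: `render_llama3_prompt` is a hypothetical helper, it leaves out the `<|begin_of_text|>` token (usually added during tokenization), and it assumes the prompt ends with an open assistant header, as described on the Llama 3 model card. LocalAI renders prompts with its own Go templates, so whitespace may differ.

```python
def render_llama3_prompt(messages: list[dict]) -> str:
    """Expand chat messages into a Llama 3 style prompt string."""
    parts = []
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # An open assistant header tells the model that its turn starts here.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_llama3_prompt([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ]))
```

Comparing this output with the prompt format on the model card is a quick way to spot a template that does not match the model family.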

Restart LocalAI to load the new model:

docker restart localai

2.3 — Verify the Model is Available

curl http://localhost:8080/v1/models
# Should show llama3.2-3b in the list

Part 3: Test the API Endpoints {#part-3}

3.1 — Chat Completion

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2-3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

3.2 — Text Completion

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2-3b",
    "prompt": "The capital of France is",
    "max_tokens": 50
  }'

3.3 — Embeddings

LocalAI supports embeddings for vector search — you'll need an embedding-capable model:

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "The quick brown fox jumps over the lazy dog"
  }'

For local embeddings, configure a GGUF embedding model (like nomic-embed-text) in a separate YAML config.
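Since the point of embeddings here is vector search, it may help to see the comparison step. The sketch below computes cosine similarity between two embedding vectors in plain Python; it is a generic illustration rather than anything LocalAI-specific, and a real application would normally delegate this to a vector database or NumPy.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The inputs would be the `embedding` arrays from two /v1/embeddings responses; values closer to 1 mean the texts are more similar.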


Part 4: Embeddings and Additional Capabilities {#part-4}

Configure an Embedding Model

Download an embedding model:

cd /opt/localai/models
wget "https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/nomic-embed-text-v1.5.Q8_0.gguf" \
  -O nomic-embed-text.gguf

Create config file /opt/localai/models/nomic-embed-text.yaml:

name: text-embedding-ada-002   # Use OpenAI's name for drop-in compatibility
backend: llama-cpp
embeddings: true               # top-level flag that marks this as an embedding model
parameters:
  model: nomic-embed-text.gguf

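Using the OpenAI model name (text-embedding-ada-002) means code that calls OpenAI's embeddings API can send the same requests unchanged. The vectors themselves come from a different model and have a different size than OpenAI's, so re-embed any existing data instead of mixing the two in one index.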
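One thing the borrowed model name does not carry over is vector size: nomic-embed-text-v1.5 produces 768-dimensional vectors, while OpenAI's text-embedding-ada-002 produces 1536. A guard like the following can catch a mismatch before it reaches a vector store. This is a sketch; `check_embedding_dimension` is a hypothetical helper that expects an OpenAI-style embeddings response already parsed into a dict.

```python
def check_embedding_dimension(response: dict, expected_dim: int) -> int:
    """Return the vector length from an embeddings response.

    Raises ValueError if it does not match the size the index was built for.
    """
    vector = response["data"][0]["embedding"]
    if len(vector) != expected_dim:
        raise ValueError(
            f"got {len(vector)} dimensions, index expects {expected_dim}"
        )
    return len(vector)
```

Running this once at application startup, with the dimension your index was created with, turns a silent search-quality problem into an immediate error.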

Configure Whisper for Speech-to-Text

cd /opt/localai/models
wget "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin" \
  -O whisper-base-en.bin

Config /opt/localai/models/whisper.yaml:

name: whisper-1
backend: whisper
parameters:
  model: whisper-base-en.bin

Test transcription:

curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@/path/to/audio.mp3 \
  -F model=whisper-1

Part 5: Connect Your Existing Application {#part-5}

The key advantage of LocalAI: change one line in your application.

Python (OpenAI SDK)

from openai import OpenAI

# Before: using OpenAI
# client = OpenAI(api_key="sk-your-openai-key")

# After: using LocalAI — only the base_url changes
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # LocalAI doesn't require authentication by default
)

# This code is identical whether using OpenAI or LocalAI:
response = client.chat.completions.create(
    model="llama3.2-3b",   # Or "gpt-3.5-turbo" if you name the LocalAI model that
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
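To keep the switch at the level of an environment variable instead of a code edit, the client settings can be resolved from the environment. A minimal sketch follows, assuming the variable names OPENAI_BASE_URL and OPENAI_API_KEY; `client_settings` is a hypothetical helper, not part of the OpenAI SDK.

```python
import os


def client_settings(env=None) -> dict:
    """Resolve base_url and api_key for the OpenAI SDK from the environment."""
    env = os.environ if env is None else env
    return {
        # Falls back to OpenAI's hosted API when no override is set.
        "base_url": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        # The SDK wants a non-empty key even if the server ignores it.
        "api_key": env.get("OPENAI_API_KEY", "not-needed"),
    }


if __name__ == "__main__":
    print(client_settings({"OPENAI_BASE_URL": "http://localhost:8080/v1"}))
```

The result can be passed straight through as `OpenAI(**client_settings())`. Recent versions of the official Python SDK also appear to read OPENAI_BASE_URL on their own, so check whether you need the helper at all.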

Naming Models for Transparent Compatibility

For complete drop-in compatibility, name your LocalAI models to match OpenAI's names:

In your model YAML:

name: gpt-3.5-turbo    # OpenAI-compatible name

Now any application calling gpt-3.5-turbo will use your local model without knowing it changed.

Node.js

import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed",
});

const completion = await openai.chat.completions.create({
  model: "llama3.2-3b",
  messages: [{ role: "user", content: "Translate to Spanish: Hello world" }],
});

console.log(completion.choices[0].message.content);

Part 6: Set Up Nginx and Deploy as a Service {#part-6}

Nginx Configuration

sudo apt install -y nginx certbot python3-certbot-nginx
sudo nano /etc/nginx/sites-available/localai
server {
    listen 80;
    server_name localai.yourdomain.com;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
        client_max_body_size 100m;
    }
}
sudo ln -s /etc/nginx/sites-available/localai /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx   # validate the config and apply it
sudo certbot --nginx -d localai.yourdomain.com

Enable API Authentication

Edit LocalAI's Docker command to add an API key:

docker stop localai && docker rm localai

docker run -d \
  --name localai \
  --restart=always \
  -p 127.0.0.1:8080:8080 \
  -v /opt/localai/models:/models \
  -e MODELS_PATH=/models \
  -e API_KEY=your-api-key-here \
  localai/localai:latest-aio-cpu

Now pass the key in requests: Authorization: Bearer your-api-key-here
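A placeholder like your-api-key-here should be replaced with a random value. The sketch below generates one with Python's `secrets` module and builds the request headers described above; both function names are hypothetical.

```python
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Create a random URL-safe key suitable for the API_KEY variable."""
    return secrets.token_urlsafe(num_bytes)


def auth_headers(api_key: str) -> dict:
    """Headers for a JSON request to a key-protected LocalAI server."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


if __name__ == "__main__":
    print("New key:", generate_api_key())
```

With the OpenAI SDK you would not build headers yourself: passing the same key as `api_key` when creating the client sends the same Authorization header.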


The Thing That Tripped Me Up {#gotcha}

My first LocalAI model was configured and the API showed it in the model list, but every chat completion returned a 500 error with no useful error message in the response body.

The issue was the chat template in my YAML config. LocalAI uses templates to format the conversation into a prompt that the model understands. Different models expect different formats: Llama 2 uses [INST] tags, Llama 3 uses <|start_header_id|>, Mistral uses [INST] with slight variations.

Using the wrong template produces garbage output or errors.

How I diagnosed it:

docker logs localai 2>&1 | grep -i error   # 2>&1 so that lines written to stderr are searched too

Found: template parse error: unexpected token
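The same check can be scripted, which is handy when you are iterating on a template. A sketch in Python: it shells out to `docker logs`, merges stdout and stderr (the command can write to both), and filters for lines that mention an error. The helper names are hypothetical, and the container name `localai` matches the docker run command used earlier.

```python
import subprocess


def error_lines(log_text: str, needle: str = "error") -> list[str]:
    """Return log lines that contain the needle, case-insensitively."""
    return [line for line in log_text.splitlines() if needle in line.lower()]


def container_logs(name: str = "localai", tail: int = 500) -> str:
    """Fetch the most recent container log lines from stdout and stderr."""
    result = subprocess.run(
        ["docker", "logs", "--tail", str(tail), name],
        capture_output=True, text=True, check=False,
    )
    return result.stdout + result.stderr
```

Calling `error_lines(container_logs())` right after a failed request keeps the output short enough to read.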

The fix: Check the model card on Hugging Face for the exact prompt format, or use a pre-validated template from LocalAI's examples:

# Check LocalAI's template examples on GitHub
# github.com/mudler/LocalAI/tree/master/embedded/templates

For Llama 3 models specifically, the correct template uses <|start_header_id|> format (as shown in Part 2 above). For Mistral, it's different. Always match the template to the model family.


Troubleshooting {#troubleshooting}

| Issue | Likely Cause | Fix |
| --- | --- | --- |
| 500 error on completions | Wrong chat template | Check model's prompt format on Hugging Face |
| Model not found in /v1/models | Config file not loaded | Check YAML syntax; restart container |
| Very slow responses | CPU inference | Expected; upgrade to GPU for faster inference |
| Out of memory | Model too large for RAM | Use a smaller or more quantized model |
| Docker container won't start | Port conflict | Check lsof -i :8080; change port if needed |
| Empty responses | Context too short | Increase context_size in model config |
| Gallery model download fails | Network issue | Retry; check server has internet access |
| Embeddings return wrong dimensions | Wrong embedding model | Use a model specifically configured for embeddings |
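For the slow-response case, it helps to put a number on "slow" before deciding whether a GPU is worth it. The sketch below times one request and reports generated tokens per second. It assumes the response is an OpenAI-style dict that includes `usage.completion_tokens`; `measure_throughput` is a hypothetical helper, and the request itself is passed in as a callable so the timing logic stays independent of the HTTP client.

```python
import time
from typing import Callable


def measure_throughput(run_request: Callable[[], dict]) -> float:
    """Time one request and return generated tokens per second.

    run_request must return an OpenAI-style response dict that
    includes usage.completion_tokens.
    """
    start = time.perf_counter()
    response = run_request()
    elapsed = time.perf_counter() - start
    tokens = response["usage"]["completion_tokens"]
    return tokens / elapsed if elapsed > 0 else float("inf")
```

The figure includes prompt processing time, so use the same prompt when comparing models or instance sizes.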

Summary {#verdict}

What you built:

  • LocalAI running as a Docker service with auto-restart
  • At least one GGUF language model configured and responding
  • OpenAI-compatible API at https://localai.yourdomain.com/v1
  • Embeddings endpoint for vector search
  • Optional: Whisper speech-to-text endpoint
  • Existing OpenAI SDK applications work with a single base_url change

LocalAI's main value proposition is seamless compatibility with the entire OpenAI API surface. For teams or projects already built on OpenAI, it's the lowest-friction path to local/private inference.

Frequently Asked Questions {#faq}

How much RAM do I need to run LocalAI on a VPS?
It depends on the model size. 3B parameter models need ~3–4 GB RAM; 7B models need ~5–6 GB; 13B+ models need 12+ GB. Check the requirements section for specific recommendations.

Can LocalAI run on a CPU-only server without a GPU?
Yes, but inference speed varies significantly. 3B models are responsive on CPU. 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.

Is my data private when using self-hosted AI models?
Yes — data is processed entirely on your server with no external API calls. Conversations, documents, and prompts never leave your infrastructure. This is a key advantage of self-hosting AI.

What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.

Can I use LocalAI as a drop-in replacement for the OpenAI API?
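Yes, for the endpoints it implements. LocalAI exposes OpenAI-compatible routes for chat, completions, embeddings, audio, and images, so you can usually switch an application by changing base_url to your server address and using a model name that exists on your LocalAI instance.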
