I'd been using AI chatbots for writing assistance and code review, but I was increasingly conscious of what I was sending to external APIs — sometimes client code, sometimes sensitive project details. Setting up a private alternative seemed worth trying.
Ollama makes it surprisingly practical. Download a model, run it, get a local API that responds like OpenAI's. Open WebUI wraps it in a proper chat interface with conversation history, multiple model support, and a system prompt editor.
Running it on a cloud server instead of my laptop means it's available from any device and I don't have to keep my computer on. For CPU-only inference, a 3B model responds in seconds. For better quality, a 7B model on 8 GB RAM is workable.
This guide deploys Ollama with Open WebUI on Ubuntu 22.04 using Docker Compose, with Nginx as the reverse proxy and HTTPS.
I run this on Tencent Cloud Lighthouse. For 7B parameter models (like Qwen 2.5-7B or Llama 3.1-8B), the 4 vCPU / 8 GB RAM plan is the recommended minimum. Smaller 3B models run on the 4 GB RAM plan. A practical advantage of Lighthouse for AI workloads: you can start with a smaller plan and upgrade the spec from the control panel as you figure out which models you actually use — no need to re-provision. The OrcaTerm browser terminal also lets you pull new models and check Ollama's status without a local SSH client, which is useful when managing the server from different machines.
Key Takeaways
| Model Size | RAM Required | Recommended Plan | Quality |
|---|---|---|---|
| 3B (e.g., Qwen2.5-3B) | 4 GB | Basic | Good for simple tasks |
| 7B (e.g., Llama3.1-8B) | 8 GB | Standard | Good general use |
| 13B | 16 GB | Pro | Better reasoning |
| 70B | 64 GB | Large instance | Excellent, near GPT-4 |
CPU inference is slower than GPU but functional. A 7B model on a modern CPU generates about 5-10 tokens per second — slow but usable for non-interactive tasks.
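If you'd rather measure this on your own instance than trust a ballpark figure, ollama run can print timing statistics once Ollama is installed (installation steps below); the model name and prompt here are just examples:

```bash
# --verbose prints timing stats after the answer; the "eval rate" line is tokens per second
ollama run qwen2.5:7b "Explain what a reverse proxy does in two sentences." --verbose
```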
| Requirement | Notes |
|---|---|
| Cloud server | Tencent Cloud Lighthouse Ubuntu 22.04 |
| 8 GB+ RAM | For 7B models |
| 20 GB+ free disk | Models are 4-8 GB each |
| Docker + Compose | Installed |
💡 If you selected the Lighthouse Docker CE application image, Docker is already installed and running. Skip the Docker install lines below and start from sudo apt install -y nginx.
ssh ubuntu@YOUR_SERVER_IP
sudo apt update && sudo apt upgrade -y
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER
newgrp docker
sudo apt install -y nginx
sudo ufw allow ssh
sudo ufw allow 'Nginx Full'
sudo ufw enable
# Official Ollama installation script
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Check service status
sudo systemctl status ollama
Ollama installs as a systemd service and starts automatically. It listens on localhost:11434 by default.
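You can confirm the API is answering locally before changing any settings; both endpoints are part of Ollama's standard HTTP API:

```bash
# The root path returns a short liveness message; /api/tags lists downloaded models
curl http://localhost:11434/
curl http://localhost:11434/api/tags
```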
By default, Ollama only accepts connections from localhost. Open WebUI (running in Docker) needs to connect to it:
sudo systemctl edit ollama
Add in the editor:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama
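To check that the new bind address took effect, look at what the service is listening on. Note that the firewall rules above only open SSH and HTTP/HTTPS, so port 11434 is still not reachable from the internet:

```bash
# Should show ollama bound to 0.0.0.0:11434 instead of 127.0.0.1:11434
sudo ss -ltnp | grep 11434
```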
# Download a 7B model (takes 5-15 minutes depending on connection)
# Good general-purpose English model:
ollama pull llama3.1:8b
# Good Chinese + English model:
ollama pull qwen2.5:7b
# Lightweight 3B model for limited RAM:
ollama pull qwen2.5:3b
# Code-focused model:
ollama pull codellama:7b
# Check downloaded models
ollama list
Test the model in the terminal:
ollama run qwen2.5:7b
>>> Hello! What can you help me with?
I'm a helpful AI assistant...
>>> /bye
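ollama run also accepts a prompt as an argument and exits after answering, which is handy for quick scripted checks:

```bash
# One-shot, non-interactive invocation
ollama run qwen2.5:7b "Summarize what a Dockerfile is in one sentence."
```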
mkdir -p ~/apps/open-webui && cd ~/apps/open-webui
Create docker-compose.yml:
version: '3.8'
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open-webui_data:/app/backend/data
    environment:
      # Connect to Ollama running on the host
      OLLAMA_BASE_URL: http://host.docker.internal:11434
      # Security settings
      WEBUI_SECRET_KEY: generate_a_long_random_string
      WEBUI_AUTH: "true"
      # Allow registration on the first visit so you can create the admin account;
      # set to "false" after setup to block new sign-ups
      ENABLE_SIGNUP: "true"
      DEFAULT_MODELS: qwen2.5:7b
      DEFAULT_USER_ROLE: user
    extra_hosts:
      - "host.docker.internal:host-gateway" # Allows container to reach host network
volumes:
  open-webui_data:
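WEBUI_SECRET_KEY should be a genuinely random value, since it is used to sign session tokens. One simple way to generate one (openssl ships with Ubuntu):

```bash
# Generate a 64-character random hex string and paste it into docker-compose.yml
openssl rand -hex 32
```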
docker compose up -d
docker compose logs -f open-webui
# Wait for: Application startup complete
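Before wiring up Nginx, it's worth confirming the container answers on the published port. Assuming the 3000:8080 mapping above:

```bash
# Container status plus a quick HTTP check; expect a 200 status code
docker compose ps
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:3000
```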
sudo nano /etc/nginx/sites-available/open-webui
server {
    listen 80;
    server_name ai.yourdomain.com;

    client_max_body_size 100m; # For document uploads

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;

        # WebSocket support (required for streaming responses)
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;

        # Long timeout for AI responses
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
sudo ln -s /etc/nginx/sites-available/open-webui /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
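At this point Nginx should already proxy plain HTTP to the container. A quick check from the server itself, sending the Host header so the right server block matches (substitute your own domain):

```bash
# Expect a 200 from Open WebUI via Nginx on port 80
curl -s -o /dev/null -w "%{http_code}\n" -H "Host: ai.yourdomain.com" http://127.0.0.1/
```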
sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d ai.yourdomain.com
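Certbot installs a systemd timer that renews certificates automatically; a dry run confirms the renewal will succeed when the time comes:

```bash
# Simulate renewal without touching the real certificate
sudo certbot renew --dry-run
```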
Visit https://ai.yourdomain.com.
Register your account right away: the first registered user automatically becomes the administrator.
After login:
- Pick a model from the dropdown at the top of the chat (qwen2.5:7b, or whatever you set in DEFAULT_MODELS).
- Once your account exists, set ENABLE_SIGNUP to "false" in docker-compose.yml and run docker compose up -d again so strangers can't register accounts.

Key features:
- Per-user conversation history
- Switching between any of the models you've pulled with Ollama
- A system prompt editor for customizing the assistant's behavior
- Document uploads for asking questions about your own files
# General conversation (English)
ollama pull llama3.1:8b
# General conversation (Chinese + English, very good)
ollama pull qwen2.5:7b
# Reasoning and math
ollama pull deepseek-r1:7b
# Code generation
ollama pull codellama:7b
ollama pull qwen2.5-coder:7b
# Very small models (4 GB RAM)
ollama pull qwen2.5:3b
ollama pull phi3:mini
# Embedding models (for RAG)
ollama pull nomic-embed-text
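The embedding model isn't something you chat with; it turns text into vectors for retrieval. If you want to sanity-check it from the command line, Ollama exposes an embeddings endpoint (the prompt text is arbitrary):

```bash
# Returns a JSON object containing an "embedding" array of floats
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "self-hosted AI on a VPS"}'
```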
Ollama exposes an API compatible with OpenAI's format. Use it from any code:
# Direct Ollama API
curl http://localhost:11434/api/chat \
-d '{"model": "qwen2.5:7b", "messages": [{"role": "user", "content": "Hello"}]}'
# OpenAI-compatible API (works with any OpenAI SDK)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen2.5:7b", "messages": [{"role": "user", "content": "Hello"}]}'
Python example using the OpenAI SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Explain Docker in simple terms"}]
)
print(response.choices[0].message.content)
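The OpenAI-compatible endpoint can also stream tokens as they are generated, which is what makes the chat UI feel responsive. A minimal curl sketch with streaming enabled:

```bash
# "stream": true returns server-sent events, one chunk at a time; -N disables curl's buffering
curl -N http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:7b", "stream": true, "messages": [{"role": "user", "content": "Count to five"}]}'
```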
Ollama loads the model into RAM when the first request comes in and keeps it there for a short idle period (five minutes by default) before unloading it. If your prompt plus the conversation history exceeds the model's context window, the oldest messages are silently dropped.
Practical issue: you notice the AI "forgets" earlier parts of long conversations. This isn't a bug — it's the context window limit.
Workarounds:
- Start a new chat when you switch topics so the history stays short.
- Ask the model to summarize the conversation so far, then continue from that summary.
- Use a model with a larger context window, or raise the context size explicitly (at the cost of more RAM), as shown in the sketch after this list.
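If you do raise the context window, you can pass it per request through Ollama's native API. A sketch, where 8192 is just an example value:

```bash
# options.num_ctx sets the context window for this request; larger values need more RAM
curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Hello"}],
    "options": {"num_ctx": 8192}
  }'
```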
Check available RAM before loading large models:
free -h
# Make sure available RAM > model size + 2 GB overhead
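free -h shows overall headroom; ollama ps shows which models are currently loaded and how much memory each one is using:

```bash
# Lists loaded models with their size, CPU/GPU placement, and how long they stay loaded
ollama ps
```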
If Ollama crashes or returns errors, check memory:
sudo journalctl -u ollama -f
# List downloaded models
ollama list
# Download a model
ollama pull MODEL_NAME
# Remove a model
ollama rm MODEL_NAME
# Show model details
ollama show MODEL_NAME
# Run model in terminal (interactive)
ollama run MODEL_NAME
# Check Ollama service
sudo systemctl status ollama
sudo journalctl -u ollama -f
# View model storage location (the install script's systemd service stores models here)
sudo ls /usr/share/ollama/.ollama/models/
sudo du -sh /usr/share/ollama/.ollama/models/  # Check total disk usage
| Issue | Likely Cause | Fix |
|---|---|---|
| Connection refused | Service not running or wrong port | Check systemctl status SERVICE and verify firewall rules |
| Permission denied | Wrong file ownership or permissions | Check file ownership with ls -la and use chown/chmod to fix |
| 502 Bad Gateway | Backend service not running | Restart the backend service; check logs with journalctl -u SERVICE |
| SSL certificate error | Certificate expired or domain mismatch | Run sudo certbot renew and verify domain DNS points to server IP |
| Service not starting | Config error or missing dependency | Check logs with journalctl -u SERVICE -n 50 for specific error |
| Out of disk space | Logs or data accumulation | Run df -h to identify usage; clean logs or attach CBS storage |
| High memory usage | Too many processes or memory leak | Check with htop; consider upgrading instance plan if consistently high |
| Firewall blocking traffic | Port not open in UFW or Lighthouse console | Open port in Lighthouse console firewall AND sudo ufw allow PORT |
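Most problems with this stack come down to one of three layers: Ollama, the Open WebUI container, or Nginx. A quick triage pass using only commands already covered in this guide (run the docker compose commands from ~/apps/open-webui):

```bash
# 1. Is Ollama up and answering?
sudo systemctl status ollama
curl http://localhost:11434/api/tags

# 2. Is the Open WebUI container healthy?
docker compose ps
docker compose logs --tail=50 open-webui

# 3. Is Nginx configured correctly and running?
sudo nginx -t
sudo systemctl status nginx
```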
How much RAM do I need to run Ollama on a VPS?
It depends on the model size. 3B parameter models need ~3–4 GB RAM; 7B models need ~5–6 GB; 13B+ models need 12+ GB. Check the requirements section for specific recommendations.
Can Ollama run on a CPU-only server without a GPU?
Yes, but inference speed varies significantly. 3B models are responsive on CPU. 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.
Is my data private when using self-hosted AI models?
Yes — data is processed entirely on your server with no external API calls. Conversations, documents, and prompts never leave your infrastructure. This is a key advantage of self-hosting AI.
What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.
To call the API from your own code on another machine, point the client's base_url at your server's address instead of localhost.
Run your own AI today:
👉 Tencent Cloud Lighthouse — High-RAM instances for AI workloads
👉 View current pricing and promotions
👉 Explore all active deals and offers