How to Run Ollama + Open WebUI on a Cloud Server — Your Own Private AI Assistant

I'd been using AI chatbots for writing assistance and code review, but I was increasingly conscious of what I was sending to external APIs — sometimes client code, sometimes sensitive project details. Setting up a private alternative seemed worth trying.

Ollama makes it surprisingly practical. Download a model, run it, get a local API that responds like OpenAI's. Open WebUI wraps it in a proper chat interface with conversation history, multiple model support, and a system prompt editor.

Running it on a cloud server instead of my laptop means it's available from any device and I don't have to keep my computer on. For CPU-only inference, a 3B model responds in seconds. For better quality, a 7B model on 8 GB RAM is workable.

This guide deploys Ollama with Open WebUI on Ubuntu 22.04 using Docker Compose, with Nginx as the reverse proxy and HTTPS.

I run this on Tencent Cloud Lighthouse. For 7B parameter models (like Qwen 2.5-7B or Llama 3.1-8B), the 4 vCPU / 8 GB RAM plan is the recommended minimum. Smaller 3B models run on the 4 GB RAM plan. A practical advantage of Lighthouse for AI workloads: you can start with a smaller plan and upgrade the spec from the control panel as you figure out which models you actually use — no need to re-provision. The OrcaTerm browser terminal also lets you pull new models and check Ollama's status without a local SSH client, which is useful when managing the server from different machines.


Table of Contents

  1. What This Setup Gives You
  2. Server Requirements for Different Models
  3. Prerequisites
  4. Part 1 — Server Setup
  5. Part 2 — Install Ollama
  6. Part 3 — Download Your First Model
  7. Part 4 — Deploy Open WebUI
  8. Part 5 — Configure Nginx
  9. Part 6 — Enable HTTPS
  10. Part 7 — First Login and Using the Chat Interface
  11. Part 8 — Useful Models to Try
  12. Part 9 — Use Ollama as an API
  13. The Gotcha: RAM and Context Length
  14. Model Management Commands
  15. Troubleshooting
  16. Frequently Asked Questions

Key Takeaways

  • Use the appropriate Lighthouse application image to skip manual installation steps where available
  • Lighthouse snapshots provide one-click full-server backup before major changes
  • OrcaTerm browser terminal lets you manage the server from any device
  • CBS cloud disk expansion handles growing storage needs without server migration
  • Console-level firewall + UFW = two independent protection layers

What This Setup Gives You {#what}

  • Private AI chat — conversations never leave your server
  • No API costs — no per-token billing
  • Multiple models — switch between Llama, Qwen, Mistral, CodeLlama, and more
  • Document Q&A — upload PDFs and ask questions about them
  • API-compatible — the Ollama API is compatible with OpenAI's API format
  • Multi-user — Open WebUI supports multiple user accounts

Server Requirements for Different Models {#requirements}

| Model Size | RAM Required | Recommended Plan | Quality |
|---|---|---|---|
| 3B (e.g., Qwen2.5-3B) | 4 GB | Basic | Good for simple tasks |
| 7B–8B (e.g., Qwen2.5-7B, Llama3.1-8B) | 8 GB | Standard | Good general use |
| 13B | 16 GB | Pro | Better reasoning |
| 70B | 64 GB | Large instance | Excellent, approaching GPT-4 quality |

CPU inference is slower than GPU but functional. A 7B model on a modern CPU generates roughly 5–10 tokens per second; at 7 tokens per second a 500-token answer takes over a minute, so it is sluggish for interactive chat but fine for non-interactive tasks.


Prerequisites {#prerequisites}

| Requirement | Notes |
|---|---|
| Cloud server | Tencent Cloud Lighthouse, Ubuntu 22.04 |
| 8 GB+ RAM | For 7B models |
| 20 GB+ free disk | Models are 4–8 GB each |
| Docker + Compose | Installed in Part 1 (or preinstalled via the Docker CE application image) |

Part 1 — Server Setup {#part-1}

💡 If you selected the Lighthouse Docker CE application image, Docker is already installed and running. Skip the Docker install lines below and start from sudo apt install -y nginx.

# Connect to the server and update the system
ssh ubuntu@YOUR_SERVER_IP
sudo apt update && sudo apt upgrade -y

# Install Docker via the official convenience script and let the current user run it
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER
newgrp docker

# Install Nginx and open only SSH + HTTP/HTTPS in the firewall
sudo apt install -y nginx
sudo ufw allow ssh
sudo ufw allow 'Nginx Full'
sudo ufw enable
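
Before moving on, it's worth a quick sanity check that the tooling installed cleanly (exact versions will vary):

# Verify Docker, Compose, Nginx, and the firewall state
docker --version
docker compose version
nginx -v
sudo ufw status verbose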

Part 2 — Install Ollama {#part-2}

# Official Ollama installation script
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Check service status
sudo systemctl status ollama

Ollama installs as a systemd service and starts automatically. It listens on localhost:11434 by default.
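
A quick way to confirm the API is up is to hit its version endpoint from the server itself:

# Should return a small JSON payload with the installed Ollama version
curl http://localhost:11434/api/version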

Configure Ollama to allow external connections

By default, Ollama only listens on the loopback interface. Open WebUI runs inside a Docker container and reaches the host over the Docker bridge network rather than loopback, so Ollama needs to listen more broadly:

sudo systemctl edit ollama

Add in the editor:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Save and exit the editor, then restart Ollama:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Note that 0.0.0.0 makes Ollama listen on every interface. Keep port 11434 closed in both UFW and the Lighthouse console firewall (the Part 1 rules already allow only SSH and Nginx) so that only local processes and containers can reach it.
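
To confirm the new binding took effect, check which address the service is listening on:

# Should show ollama listening on 0.0.0.0:11434 rather than 127.0.0.1:11434
sudo ss -ltnp | grep 11434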

Part 3 — Download Your First Model {#part-3}

# Download a 7B model (takes 5-15 minutes depending on connection)
# Good general-purpose English model:
ollama pull llama3.1:8b

# Good Chinese + English model:
ollama pull qwen2.5:7b

# Lightweight 3B model for limited RAM:
ollama pull qwen2.5:3b

# Code-focused model:
ollama pull codellama:7b

# Check downloaded models
ollama list

Test the model in the terminal:

ollama run qwen2.5:7b
>>> Hello! What can you help me with?
I'm a helpful AI assistant...

>>> /bye
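
For a non-interactive sanity check, ollama run also accepts a prompt as an argument and exits after answering:

ollama run qwen2.5:7b "Summarize what Docker does in one sentence."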

Part 4 — Deploy Open WebUI {#part-4}

mkdir -p ~/apps/open-webui && cd ~/apps/open-webui

Create docker-compose.yml:

version: '3.8'

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open-webui_data:/app/backend/data
    environment:
      # Connect to Ollama running on the host
      OLLAMA_BASE_URL: http://host.docker.internal:11434
      # Security settings
      WEBUI_SECRET_KEY: generate_a_long_random_string
      WEBUI_AUTH: "true"
      # Registration: leave enabled for the first visit so you can create the admin account
      ENABLE_SIGNUP: "true"    # Set to "false" after setup to stop further sign-ups
      DEFAULT_MODELS: qwen2.5:7b
      DEFAULT_USER_ROLE: user
    extra_hosts:
      - "host.docker.internal:host-gateway"  # Allows container to reach host network

volumes:
  open-webui_data:
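
WEBUI_SECRET_KEY signs session tokens, so replace the placeholder with a real random value before starting. One way to generate one (openssl ships with Ubuntu):

# Paste the output into WEBUI_SECRET_KEY in docker-compose.yml
openssl rand -hex 32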
Then start the stack and follow the logs:

docker compose up -d
docker compose logs -f open-webui
# Wait for: Application startup complete
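
Once the logs show the startup message, the UI should answer on port 3000 of the host:

# Expect an HTTP 200 (or a redirect to the login page) before putting Nginx in front of it
curl -I http://127.0.0.1:3000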

Part 5 — Configure Nginx {#part-5}

sudo nano /etc/nginx/sites-available/open-webui

server {
    listen 80;
    server_name ai.yourdomain.com;

    client_max_body_size 100m;   # For document uploads

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;

        # WebSocket support (required for streaming responses)
        proxy_set_header Upgrade    $http_upgrade;
        proxy_set_header Connection 'upgrade';

        proxy_set_header Host              $host;
        proxy_set_header X-Real-IP         $remote_addr;
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        proxy_cache_bypass $http_upgrade;

        # Long timeout for AI responses
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}

sudo ln -s /etc/nginx/sites-available/open-webui /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
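
Before moving on to HTTPS, you can confirm the proxy path works by asking Nginx for the site directly, substituting your own domain in the Host header:

# Expect an HTTP 200 (or a redirect) proxied through to Open WebUI
curl -I -H "Host: ai.yourdomain.com" http://127.0.0.1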

Part 6 — Enable HTTPS {#part-6}

sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d ai.yourdomain.com
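
Certbot installs a systemd timer that renews certificates automatically; a dry run confirms renewal will succeed when the time comes:

sudo certbot renew --dry-run
systemctl list-timers | grep certbot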

Part 7 — First Login and Using the Chat Interface {#part-7}

Visit https://ai.yourdomain.com.

First user = admin: the first account you register automatically gets administrator access, so create yours before sharing the URL with anyone else.

After login:

  1. Select a model from the dropdown in the top bar
  2. Type a message and press Enter
  3. The AI responds with streaming output

Key features:

  • Model switching — change models mid-conversation
  • Document Q&A — upload PDFs, Word docs, text files and chat with them
  • System prompts — customize the AI's personality and behavior
  • Conversation history — all chats are saved and searchable
  • Multi-user — add users under Admin → Users

Part 8 — Useful Models to Try {#part-8}

# General conversation (English)
ollama pull llama3.1:8b

# General conversation (Chinese + English, very good)
ollama pull qwen2.5:7b

# Reasoning and math
ollama pull deepseek-r1:7b

# Code generation
ollama pull codellama:7b
ollama pull qwen2.5-coder:7b

# Very small models (4 GB RAM)
ollama pull qwen2.5:3b
ollama pull phi3:mini

# Embedding models (for RAG)
ollama pull nomic-embed-text

Part 9 — Use Ollama as an API {#part-9}

Ollama exposes both its own native API and an OpenAI-compatible endpoint, so you can call it from scripts or point existing OpenAI SDKs at it:

# Direct Ollama API
curl http://localhost:11434/api/chat \
  -d '{"model": "qwen2.5:7b", "messages": [{"role": "user", "content": "Hello"}]}'

# OpenAI-compatible API (works with any OpenAI SDK)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:7b", "messages": [{"role": "user", "content": "Hello"}]}'

Python example using the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Explain Docker in simple terms"}]
)
print(response.choices[0].message.content)
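
When pointing existing tooling at the server, it helps to confirm which models the OpenAI-compatible endpoint exposes; Ollama answers the standard model-listing route:

# Lists the locally pulled models in OpenAI's response format
curl http://localhost:11434/v1/models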

The Gotcha: RAM and Context Length {#gotcha}

Ollama loads the model into RAM when the first request comes in and keeps it loaded for a period (about five minutes by default). If your prompt plus the conversation history exceeds the model's context window, old messages are silently dropped.

Practical issue: you notice the AI "forgets" earlier parts of long conversations. This isn't a bug — it's the context window limit.

Workarounds:

  • Use models with larger context windows (Qwen2.5 and Llama3.1 support 128K tokens)
  • Start new conversations for new topics
  • Keep conversations focused
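
You can also ask Ollama for a larger context window per request through the options field of its native API; keep in mind that a bigger num_ctx increases RAM usage, so it trades memory for conversation length. A sketch against the chat endpoint:

# Request an 8K context window for this call (raises RAM use accordingly)
curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Hello"}],
    "options": {"num_ctx": 8192}
  }'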

Check available RAM before loading large models:

free -h
# Make sure available RAM > model size + 2 GB overhead

If Ollama crashes or returns errors, check memory:

sudo journalctl -u ollama -f
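
ollama ps shows which models are currently loaded and roughly how much memory each occupies, which makes it easier to tell whether a model simply doesn't fit:

# Lists loaded models, their size, and when they will be unloaded
ollama ps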

Model Management Commands {#commands}

# List downloaded models
ollama list

# Download a model
ollama pull MODEL_NAME

# Remove a model
ollama rm MODEL_NAME

# Show model details
ollama show MODEL_NAME

# Run model in terminal (interactive)
ollama run MODEL_NAME

# Check Ollama service
sudo systemctl status ollama
sudo journalctl -u ollama -f

# View model storage location
ls ~/.ollama/models/
du -sh ~/.ollama/models/  # Check total disk usage

Troubleshooting {#troubleshooting}

| Issue | Likely Cause | Fix |
|---|---|---|
| Connection refused | Service not running or wrong port | Check systemctl status SERVICE and verify firewall rules |
| Permission denied | Wrong file ownership or permissions | Check ownership with ls -la and fix with chown/chmod |
| 502 Bad Gateway | Backend service not running | Restart the backend; check logs with journalctl -u SERVICE |
| SSL certificate error | Certificate expired or domain mismatch | Run sudo certbot renew and verify domain DNS points to the server IP |
| Service not starting | Config error or missing dependency | Check journalctl -u SERVICE -n 50 for the specific error |
| Out of disk space | Logs or data accumulation | Run df -h to identify usage; clean logs or attach CBS storage |
| High memory usage | Too many processes or memory leak | Check with htop; consider upgrading the instance plan if consistently high |
| Firewall blocking traffic | Port not open in UFW or Lighthouse console | Open the port in the Lighthouse console firewall AND sudo ufw allow PORT |

Frequently Asked Questions {#faq}

How much RAM do I need to run Ollama on a VPS?
It depends on the model size. 3B parameter models need ~3–4 GB RAM; 7B models need ~5–6 GB; 13B+ models need 12+ GB. Check the requirements section for specific recommendations.

Can Ollama run on a CPU-only server without a GPU?
Yes, but inference speed varies significantly. 3B models are responsive on CPU. 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.

Is my data private when using self-hosted AI models?
Yes — data is processed entirely on your server with no external API calls. Conversations, documents, and prompts never leave your infrastructure. This is a key advantage of self-hosting AI.

What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.

Can I use Ollama as a drop-in replacement for the OpenAI API?
Largely yes: Ollama serves an OpenAI-compatible endpoint under /v1, so you can usually switch an application by changing its base_url to your server address and supplying any placeholder API key.

Run your own AI today:
👉 Tencent Cloud Lighthouse — High-RAM instances for AI workloads
👉 View current pricing and promotions
👉 Explore all active deals and offers