
Run a Private AI Chatbot on a Cloud Server with Ollama — Your Own ChatGPT, No API Fees

I got curious about running AI models locally after noticing how much I was spending on API calls for personal projects. The math was simple: at a certain usage level, a cloud server running Ollama costs less per month than the API fees, and you get unlimited calls.

More importantly, the data stays on your server. Conversations aren't logged by a third party, and you can deploy models that are fine-tuned for specific use cases without sending sensitive information anywhere.

Ollama makes running large language models on a server surprisingly straightforward. Combined with Open WebUI (which I covered in the Docker guide earlier), you get a ChatGPT-like interface connected to models you control.

I run this on Tencent Cloud Lighthouse. For CPU-only inference, a 4 GB RAM instance runs 3B models at reasonable speed; 8 GB for 7B+ models. For GPU-accelerated inference, Lighthouse offers a TencentOS AI application image that comes pre-installed with Python 3, Node.js, Docker, Git, and major AI frameworks (PyTorch, TensorFlow, PaddlePaddle) along with GPU drivers — eliminating the complex CUDA setup that typically takes hours. The primary advantage for private AI: your conversations and documents are processed entirely on your own server — nothing sent to any third-party API. The spec upgrade path lets you start with a CPU plan and move to a GPU plan as needed.


Table of Contents

  1. What Ollama Does
  2. Choosing the Right Model for Your Server
  3. Part 1: Install Ollama
  4. Part 2: Pull and Run Models
  5. Part 3: Deploy Open WebUI
  6. Part 4: Expose via Nginx with Authentication
  7. Part 5: Use the Ollama API
  8. Part 6: Model Management
  9. The Thing That Tripped Me Up
  10. Troubleshooting
  11. Summary

Key Takeaways

  • Use the appropriate Lighthouse application image to skip manual installation steps where available
  • Lighthouse snapshots provide one-click full-server backup before major changes
  • OrcaTerm browser terminal lets you manage the server from any device
  • CBS cloud disk expansion handles growing storage needs without server migration
  • Console-level firewall + UFW = two independent protection layers

What Ollama Does {#what-ollama-does}

Ollama is a tool for running open-source large language models locally. It:

  • Downloads and manages model files (GGUF format)
  • Provides an OpenAI-compatible REST API at localhost:11434
  • Handles model loading/unloading automatically
  • Supports CPU and GPU inference
  • Runs as a system service that starts on boot

Models you can run include Llama 3, Mistral, Phi-3, Gemma, Qwen, and many others — all the major open-source families.


Choosing the Right Model for Your Server {#choosing-models}

Match the model size to your available RAM:

| Server RAM | Recommended Models | Notes |
|---|---|---|
| 4 GB | Llama 3.2:3b, Phi-3:mini, Gemma2:2b | Fast responses, good for general chat |
| 8 GB | Llama 3.1:8b, Mistral:7b, Qwen2.5:7b | Better quality, still responsive |
| 16 GB | Qwen2.5:14b, Phi-3:medium | High quality, good for coding tasks |
| 32 GB+ | Llama 3.1:70b (quantized), Mixtral:8x7b, CodeLlama:34b | Near GPT-4 quality for many tasks |

Rule of thumb: at 4-bit quantization (Ollama's default), budget roughly 0.5–0.75 GB of RAM per billion parameters, plus some headroom for the runtime and context. A 7B model needs ~4-5 GB RAM.
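
As a quick sanity check, you can estimate the footprint yourself. This is a back-of-the-envelope sketch: the ~0.55 bytes-per-parameter figure for 4-bit weights and the ~1 GB runtime overhead are rough assumptions, not exact values.

# Rough RAM estimate for a 4-bit quantized model (assumed figures, not exact)
def estimate_ram_gb(params_billions: float) -> float:
    bytes_per_param = 0.55  # roughly 4.4 bits per parameter including quantization metadata
    overhead_gb = 1.0       # KV cache, runtime buffers, OS headroom (rough allowance)
    return params_billions * bytes_per_param + overhead_gb

for size in (3, 7, 14, 70):
    print(f"{size}B model: ~{estimate_ram_gb(size):.1f} GB RAM")
# 7B -> ~4.9 GB, consistent with the ~4-5 GB figure above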


Part 1: Install Ollama {#part-1}

For GPU-accelerated inference: Select the TencentOS AI application image when creating your Lighthouse instance. It comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers — you still need to install Ollama itself (one command below), but the CUDA/driver setup that usually takes hours is already done.

1.1 — Install

curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama and sets it up as a systemd service.

1.2 — Verify the Service is Running

sudo systemctl status ollama

Check the API responds:

curl http://localhost:11434/api/tags
# Returns: {"models":[]}  (empty until you pull a model)

1.3 — Configure Ollama to Listen on All Interfaces (Optional)

By default, Ollama only listens on localhost. If you want to access it from outside the server (e.g., from your laptop), configure it to listen on all interfaces — but only do this behind Nginx with authentication:

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Save and exit, then restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Part 2: Pull and Run Models {#part-2}

2.1 — Pull a Model

# Download Llama 3.2 3B (good for 4 GB RAM servers)
ollama pull llama3.2:3b

# Or Phi-3 Mini (very lightweight, 2.3 GB download)
ollama pull phi3:mini

# Or Mistral 7B (better quality, needs 8 GB RAM)
ollama pull mistral:7b

Downloads go to ~/.ollama/models/ (or /usr/share/ollama/.ollama/models/ when running as a service).

2.2 — Chat in the Terminal

ollama run llama3.2:3b

You're now chatting with the model directly in your terminal. Type /bye to exit.

2.3 — Run a One-Shot Query

ollama run llama3.2:3b "Explain how HTTPS works in 3 sentences"

2.4 — List Downloaded Models

ollama list

2.5 — Delete a Model

ollama rm llama3.2:3b

Part 3: Deploy Open WebUI {#part-3}

Open WebUI gives you a ChatGPT-like browser interface connected to Ollama.

3.1 — Install Docker (if not already installed)

curl -fsSL https://get.docker.com | sh

3.2 — Run Open WebUI

docker run -d \
  --name open-webui \
  --restart=always \
  --network=host \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  ghcr.io/open-webui/open-webui:main

--network=host lets the container reach Ollama at localhost:11434.

Open WebUI runs on port 8080. Access it via SSH tunnel first:

ssh -L 8080:localhost:8080 ubuntu@YOUR_SERVER_IP

Open http://localhost:8080 in your browser. Create an account (first user becomes admin). Select a model from the dropdown and start chatting.


Part 4: Expose via Nginx with Authentication {#part-4}

To access Open WebUI from anywhere (not just via SSH tunnel), set up Nginx with HTTPS.

4.1 — Install Nginx and Certbot

sudo apt install -y nginx certbot python3-certbot-nginx

4.2 — Create Nginx Config

sudo nano /etc/nginx/sites-available/ai

Paste the following:

server {
    listen 80;
    server_name ai.yourdomain.com;

    location / {
        proxy_pass http://localhost:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        
        # Increase timeout for long AI responses
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
Enable the site, test the config, and obtain a certificate:

sudo ln -s /etc/nginx/sites-available/ai /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
sudo certbot --nginx -d ai.yourdomain.com

Open WebUI has its own user authentication built in — the first user to register becomes the admin. You can disable public registration in Open WebUI settings after creating your account.


Part 5: Use the Ollama API {#part-5}

Ollama provides an OpenAI-compatible API, so any OpenAI SDK works with minimal changes.

Direct API Call

curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [
      {"role": "user", "content": "Write a Python function to check if a number is prime"}
    ],
    "stream": false
  }'

Python Integration

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't require a real key
)

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker in simple terms"}
    ]
)

print(response.choices[0].message.content)
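
Streaming works through the same endpoint. Here is a minimal sketch, assuming the same client setup as above: pass stream=True and print the deltas as they arrive.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Explain Docker in simple terms"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()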

Node.js Integration

import OpenAI from "openai";

const ollama = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",
});

const response = await ollama.chat.completions.create({
  model: "llama3.2:3b",
  messages: [{ role: "user", content: "What is a REST API?" }],
});

console.log(response.choices[0].message.content);

Part 6: Model Management {#part-6}

Check Model Storage Usage

du -sh ~/.ollama/models/
# or
du -sh /usr/share/ollama/.ollama/models/

Models are large (2–20 GB each). Plan your storage accordingly.

Create a Custom Modelfile

Customize a model's system prompt and parameters:

Create Modelfile:

FROM llama3.2:3b

SYSTEM """
You are a helpful assistant for a tech support team. 
Respond concisely and technically. 
When asked about code, always provide working examples.
"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9

Build and use the custom model:

ollama create myassistant -f Modelfile
ollama run myassistant
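
The custom model is then available through the API like any other. A short sketch using the Python client from Part 5 (the question text is just an example); the system prompt and parameters baked into the Modelfile apply automatically, so only the user message is sent.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

# "myassistant" already carries the Modelfile's system prompt and parameters,
# so no system message is needed here
response = client.chat.completions.create(
    model="myassistant",
    messages=[{"role": "user", "content": "Why would a service fail to bind to port 80?"}],
)

print(response.choices[0].message.content)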

Automatic Model Unloading

Ollama unloads models from RAM after 5 minutes of inactivity by default. To change:

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"

The Thing That Tripped Me Up {#gotcha}

My first attempts with the Mistral 7B model on a 4 GB RAM server resulted in extremely slow responses — 30-40 seconds per token. The model was technically running, but practically unusable.

The cause: 7B models at full precision need about 14 GB RAM. Even quantized to 4-bit (which Ollama does by default), they need ~5 GB. My 4 GB server was swapping to disk.

What I learned about model selection:

# Check if Ollama is swapping
free -h
# If 'used' under Swap is non-zero during inference, the model is too large

# Check which quantization level you have
ollama show mistral:7b | grep -i quantization

The fix: Switch to a smaller model that fits comfortably in RAM:

# For 4 GB RAM: use 3B models
ollama pull llama3.2:3b    # 2.0 GB download, ~2.5 GB RAM usage
ollama pull phi3:mini       # 2.3 GB download, ~2.5 GB RAM usage

# For 8 GB RAM: 7B models work well
ollama pull llama3.1:8b    # 4.7 GB download, ~5 GB RAM usage

The 3B models are genuinely useful for most tasks — coding help, writing assistance, Q&A. For serious technical work or nuanced reasoning, upgrade to an 8 GB RAM instance and run a 7B model.


Troubleshooting {#troubleshooting}

| Issue | Likely Cause | Fix |
|---|---|---|
| Slow responses | Model too large for available RAM | Use a smaller model or upgrade RAM |
| "connection refused" on port 11434 | Ollama not running | sudo systemctl start ollama |
| Out of disk space | Model files are large | ollama rm modelname; check with ollama list |
| Open WebUI can't reach Ollama | Network mismatch | Use --network=host or set OLLAMA_BASE_URL correctly |
| Model download fails | Network timeout | Retry; models are large and downloads can take time |
| High memory usage after chat | Model still loaded | Ollama keeps the model in RAM for 5 minutes; expected behavior |
| API returns 404 | Wrong model name | Check the exact name with ollama list |

Summary {#verdict}

What you built:

  • Ollama running as a system service with auto-start
  • At least one open-source LLM downloaded and running
  • Open WebUI providing a ChatGPT-like browser interface
  • HTTPS access via Nginx with Let's Encrypt
  • OpenAI-compatible API for integrating into your own apps
  • Custom model with system prompt via Modelfile

Total monthly cost: the price of your cloud server (starting from ~$5–10/month). No per-token fees, no usage limits, no data leaving your server.

Frequently Asked Questions {#faq}

How much RAM do I need to run Ollama AI on a VPS?
It depends on the model size. 3B parameter models need ~3–4 GB RAM; 7B models need ~5–6 GB; 13B+ models need 12+ GB. Check the requirements section for specific recommendations.

Can Ollama AI run on a CPU-only server without a GPU?
Yes, but inference speed varies significantly. 3B models are responsive on CPU. 7B+ models are noticeably slower without GPU acceleration. For production AI workloads, consider a GPU instance.

Is my data private when using self-hosted AI models?
Yes — data is processed entirely on your server with no external API calls. Conversations, documents, and prompts never leave your infrastructure. This is a key advantage of self-hosting AI.

What is the TencentOS AI image and should I use it?
The TencentOS AI application image comes pre-installed with Python 3, Docker, PyTorch, TensorFlow, PaddlePaddle, and GPU drivers. It eliminates hours of manual CUDA and AI framework setup. Strongly recommended for GPU-accelerated AI workloads.

Can I use Ollama AI as a drop-in replacement for the OpenAI API?
Many self-hosted AI tools provide OpenAI-compatible API endpoints. You can often switch your application by just changing the base_url to your server address.

👉 Get started with Tencent Cloud Lighthouse
👉 View current pricing and launch promotions
👉 Explore all active deals and offers