ai · November 16, 2025 · 12 min read

Open WebUI: Running Your Own AI Interface Locally and in the Cloud

Setting up Open WebUI for local development and cloud deployment — comparing self-hosted AI interfaces with commercial alternatives.

open-webui · self-hosted · ai

Every developer I know has at least three AI chat tabs open at any given time. ChatGPT for one thing, Claude for another, maybe a Gemini window for multimodal tasks. Each has its own conversation history, its own context, its own billing. Switching between them is friction you stop noticing until it is gone.

Open WebUI eliminates that friction. It is a self-hosted interface that connects to multiple AI backends — local models via Ollama, cloud APIs like OpenAI and Anthropic, or any OpenAI-compatible endpoint. One interface, one conversation history, one place to manage everything. And because you host it yourself, your data never leaves your infrastructure unless you explicitly send it to a cloud API.

What Open WebUI Is

Open WebUI (formerly Ollama WebUI) is an open-source, self-hosted web interface for interacting with large language models. It started as a frontend for Ollama — the tool that runs LLMs locally — but has evolved into a full-featured AI platform that supports:

  • Multiple model backends (Ollama, OpenAI, Anthropic, any OpenAI-compatible API)
  • Conversation history with search and organization
  • RAG (Retrieval-Augmented Generation) with document uploads
  • Custom model presets and system prompts
  • User management and team access controls
  • Function calling and tool use
  • Image generation integration
  • Voice input and output

It is not a toy project. The interface is polished, the feature set is comprehensive, and the community is active. At the time of writing, the GitHub repository has over 75,000 stars and a release cadence of roughly every two weeks.

The real value proposition is control. You decide where it runs, what models it connects to, who has access, and where the data goes. For developers working with sensitive code, proprietary business data, or regulated industries, this matters more than any feature comparison.

Local Setup with Docker

The fastest way to get Open WebUI running locally is Docker. One command, no dependencies to install, no configuration files to write:

docker run -d \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

That is it. Open http://localhost:3000 in your browser, create an admin account, and you have a working AI interface. The -v open-webui:/app/backend/data flag creates a Docker volume for persistent storage — your conversations, settings, and uploaded documents survive container restarts.

Connecting to Ollama

To use local models, you need Ollama running on your machine. Install it from ollama.com, then pull a model:

ollama pull llama3.1
ollama pull codellama
ollama pull mistral

If Open WebUI and Ollama are both running on the same machine, Open WebUI automatically detects Ollama at http://host.docker.internal:11434 (on macOS and Windows) or http://localhost:11434 (on Linux with --network host).
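If the connection does not show up automatically, it is worth confirming that Ollama is actually listening on its default port before digging into Docker networking. Ollama's API exposes /api/tags, which lists the models you have pulled:

```shell
# Should return a JSON object listing your installed models.
# If this fails, Open WebUI cannot reach Ollama either.
curl http://localhost:11434/api/tags
```

Run this from the host first, then (if it works there) from inside the container, to narrow down whether the problem is Ollama itself or Docker's network isolation.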

On Linux, run the container in host network mode so it can reach Ollama on localhost. Note that port mappings are ignored in host mode, so the interface is served directly on port 8080 rather than 3000:

docker run -d \
  --network host \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Alternatively, keep the default bridge network and add --add-host=host.docker.internal:host-gateway to the original docker run command so the container can resolve the host.

Once connected, every model you have pulled in Ollama appears in the model dropdown in Open WebUI. You can switch between them mid-conversation, compare outputs, and set per-model system prompts.

Connecting to Cloud APIs

Open WebUI also connects to cloud model providers. In the admin settings under "Connections," add your API keys:

  • OpenAI: Add your API key and all GPT models become available
  • Anthropic: Add your API key for Claude models (via OpenAI-compatible proxy or direct integration depending on the version)
  • Custom endpoints: Any service that exposes an OpenAI-compatible API — Azure OpenAI, Together AI, Groq, local vLLM instances

This is where Open WebUI becomes genuinely useful as a daily driver. You get one interface for local Llama models (free, private, good for experimentation) and cloud models (more capable, usage-based billing). The context switching cost drops to zero — you just change the model in the dropdown.
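Connections can also be preconfigured at container start instead of through the admin UI — useful for reproducible deployments. The OPENAI_API_BASE_URL and OPENAI_API_KEY environment variables are documented by Open WebUI at the time of writing; the key below is a placeholder:

```shell
docker run -d \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL=https://api.openai.com/v1 \
  -e OPENAI_API_KEY=sk-your-key-here \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

Because the connection only needs an OpenAI-compatible endpoint, the same variable can point at Groq, Together AI, Azure OpenAI, or a local vLLM server instead.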

Docker Compose for a Full Stack

For a more robust local setup, use Docker Compose to run Open WebUI and Ollama together:

# docker-compose.yml
version: "3.8"

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama-data:
  open-webui-data:

Then bring the stack up:

docker compose up -d

The GPU reservation section is optional — remove it if you are running on CPU only. On Apple Silicon Macs, Ollama uses the Metal GPU framework automatically without Docker GPU passthrough (run Ollama natively on macOS rather than in Docker for best performance).

Cloud Deployment Options

Running Open WebUI locally is great for individual use. But when you want your team to access it, or when you want to use it from any device, you need a cloud deployment.

VPS Deployment (Hetzner, DigitalOcean, etc.)

The simplest cloud path is a VPS with Docker installed. A $20/month server from Hetzner or DigitalOcean is sufficient for the Open WebUI interface itself. If you want to run models on the server too, you will need a GPU instance ($50-150/month depending on the GPU).

# On your VPS
apt update && apt install docker.io docker-compose-plugin -y

# Create docker-compose.yml (same as above, minus GPU reservation)
docker compose up -d

# Set up a reverse proxy with SSL
apt install nginx certbot python3-certbot-nginx -y

Nginx reverse proxy configuration:

server {
    server_name ai.yourdomain.com;

    location / {
        proxy_pass http://localhost:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Then certbot --nginx -d ai.yourdomain.com for free SSL from Let's Encrypt. You now have a private, SSL-secured AI interface accessible from anywhere.

Railway / Fly.io Deployment

For a managed deployment without server administration:

# Using Railway
railway login
railway init
railway up

Or with Fly.io:

fly launch --image ghcr.io/open-webui/open-webui:main
fly secrets set OLLAMA_BASE_URL=http://your-ollama-server:11434
fly deploy

These platforms handle SSL, scaling, and restarts automatically. The trade-off is less control over the infrastructure and higher cost at scale compared to a VPS.

Architecture Decision: Where to Run Models

The key architectural decision for cloud deployments is whether to run models on the same server as Open WebUI or connect to external API providers.

Same server: Lower latency, no API costs per token, full data privacy. But you need a GPU-equipped server ($100-300/month for a decent GPU instance), and the model selection is limited by your hardware.

External APIs only: No GPU needed, access to the best models from every provider, pay per token. The Open WebUI server is lightweight and cheap to host. But every conversation goes through a third-party API, and costs scale with usage.

Hybrid: Run a local model for routine tasks and privacy-sensitive work, connect to cloud APIs for tasks that require more capable models. This is what I use — Llama 3.1 for quick questions and code completion where privacy matters, Claude or GPT-4o for complex reasoning and generation tasks.

Comparing with Commercial Interfaces

ChatGPT Plus ($20/month)

ChatGPT gives you GPT-4o, DALL-E integration, browsing, code interpreter, and custom GPTs. The interface is polished and the mobile app is excellent.

Where Open WebUI wins: model flexibility (you are not locked to OpenAI), data privacy (conversations stay on your server), customization (custom system prompts, RAG pipelines, function calling). No usage caps on local models.

Where ChatGPT wins: native tool integrations, code interpreter sandbox, mobile experience, zero setup. The plugins and GPT Store ecosystem has no equivalent in Open WebUI.

Verdict: ChatGPT Plus is hard to beat for non-technical users who want a ready-to-use experience. Open WebUI is better for developers who want control and flexibility.

Claude Pro ($20/month)

Claude Pro gives you Claude Sonnet and Opus models with extended context windows, artifact creation, and projects with persistent context. Claude's instruction-following and long document handling are best-in-class.

Where Open WebUI wins: you can still use Claude models through the API while also having access to every other model. You own your conversation history. You can add RAG and custom tools.

Where Claude wins: the Projects feature with persistent context, the artifact system for code and documents, the native thinking/reasoning display. These are deeply integrated into Claude's interface and have no equivalent in Open WebUI.

Verdict: If Claude is your primary model, the native interface is hard to leave. Open WebUI is better as a unified interface when you use multiple providers.

For Teams

The most compelling case for Open WebUI is team use. Commercial AI interfaces offer team plans, but they typically charge per seat ($25-30/user/month), conversations are stored on the provider's infrastructure, and you cannot customize the experience.

Open WebUI with team access gives you: one hosting cost regardless of user count, conversations stored on your infrastructure, custom model presets per team or project, shared RAG document collections, and admin controls over which models and features are available.

For a team of 10 developers, commercial AI chat subscriptions cost $200-300/month. A self-hosted Open WebUI instance with cloud API access costs $20-40/month for hosting plus pay-as-you-go API usage. The economics shift further in your favor as the team grows.
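The break-even arithmetic is simple enough to sketch. The figures below are assumptions consistent with the ranges above, not real quotes — adjust them to your own seat price and API spend:

```shell
# Hypothetical monthly cost comparison for a 10-person team.
SEATS=10
SAAS_PER_SEAT=25      # $/user/month, typical commercial team plan
HOSTING=30            # $/month, small VPS (assumed)
API_USAGE=120         # $/month, pay-as-you-go API spend (assumed)

SAAS_TOTAL=$((SEATS * SAAS_PER_SEAT))
SELF_TOTAL=$((HOSTING + API_USAGE))

echo "Commercial seats: \$${SAAS_TOTAL}/month"
echo "Self-hosted:      \$${SELF_TOTAL}/month"
```

Note that the self-hosted total is dominated by API usage, which scales with activity rather than headcount — adding an eleventh user costs nothing in seats, only whatever tokens they consume.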

RAG Pipeline Setup

Retrieval-Augmented Generation is one of Open WebUI's most powerful features. Upload documents, and the AI can reference them when answering questions. This turns Open WebUI from a general-purpose chat into a knowledge base that understands your specific documentation, codebase, or business data.

Document Upload

Open WebUI supports direct document upload through the interface. Drag a PDF, markdown file, or text file into the chat, and it is automatically chunked, embedded, and indexed for retrieval. The AI references the uploaded documents when answering questions.

For bulk document ingestion, use the admin panel's document management section. You can organize documents into collections and control which collections are available in which chats.

Embedding Configuration

Open WebUI uses embedding models to convert documents into vectors for semantic search. By default, it uses a local embedding model, but you can configure it to use OpenAI's embedding API for better quality:

In the admin settings under "Documents," configure:

  • Embedding model: text-embedding-3-small (OpenAI) or a local model via Ollama
  • Chunk size: 1000 tokens (default, adjust based on your documents)
  • Chunk overlap: 100 tokens (helps maintain context across chunks)
  • Top K: 5 (number of relevant chunks to retrieve per query)
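To see how chunk size and overlap interact, here is a sketch of the splitting arithmetic. The exact chunking Open WebUI performs internally may differ; this only illustrates the overlap mechanics:

```shell
# A 2500-token document split into 1000-token chunks with 100-token overlap.
DOC=2500
SIZE=1000
OVERLAP=100
START=0
while [ "$START" -lt "$DOC" ]; do
  END=$((START + SIZE))
  [ "$END" -gt "$DOC" ] && END=$DOC
  echo "chunk: tokens ${START}-${END}"
  # Each new chunk starts OVERLAP tokens before the previous one ended.
  START=$((START + SIZE - OVERLAP))
done
```

Each chunk repeats the last 100 tokens of the previous one, which is what keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.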

Practical RAG Use Cases

Codebase documentation: Upload your project's README, architecture docs, and API documentation. Ask the AI questions about your own project and get answers grounded in your actual docs rather than the model's general training data.

Meeting notes and decisions: Upload meeting transcripts and decision logs. "What was decided about the database migration timeline in last week's meeting?" gets an accurate answer instead of a hallucinated one.

Research papers and technical specs: Upload PDFs of papers or specifications you are working with. The AI can summarize, compare, and answer questions about content that is not in its training data.

The quality of RAG answers depends heavily on document chunking and embedding quality. If answers seem to miss relevant information, try reducing the chunk size (so each chunk is more focused) or increasing Top K (so more chunks are retrieved).

Custom Model Configurations

Open WebUI lets you create model presets — saved configurations with custom system prompts, temperature settings, and model selections. This is useful for creating purpose-specific assistants without modifying the underlying models.

Creating a Code Review Preset

In the Open WebUI settings, create a new model preset:

  • Name: Code Reviewer
  • Base model: Claude Sonnet (or your preferred model)
  • System prompt: "You are a senior code reviewer. Analyze the provided code for bugs, security issues, performance problems, and style violations. Be specific about line numbers and provide corrected code snippets. Prioritize issues by severity."
  • Temperature: 0.3 (lower for more consistent, focused output)

Creating a Writing Assistant Preset

  • Name: Technical Writer
  • Base model: GPT-4o
  • System prompt: "You are a technical writing assistant. Help draft clear, concise technical documentation. Use active voice. Avoid jargon unless the audience is technical. Structure content with headers, lists, and code examples where appropriate."
  • Temperature: 0.7 (higher for more creative output)

These presets appear in the model dropdown alongside your regular models. You switch to the "Code Reviewer" preset when reviewing code and to the "Technical Writer" when drafting docs. The system prompt is applied automatically.
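For local models, a similar preset can be baked into Ollama itself with a Modelfile. FROM, SYSTEM, and PARAMETER are standard Modelfile directives; the preset name code-reviewer below is just an example:

```
# Modelfile — a local equivalent of the Code Reviewer preset
FROM llama3.1
SYSTEM "You are a senior code reviewer. Analyze the provided code for bugs, security issues, performance problems, and style violations. Prioritize issues by severity."
PARAMETER temperature 0.3
```

Build it with ollama create code-reviewer -f Modelfile, and it appears as a distinct model in both ollama list and Open WebUI's dropdown. The difference from an Open WebUI preset is portability: the Modelfile version works from any Ollama client, not just Open WebUI.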

Privacy Advantages

The privacy argument for self-hosted AI is not theoretical. It has concrete implications for different use cases.

Sensitive Code

When you paste proprietary code into ChatGPT or Claude, you are sending it to a third-party server. The providers have data handling policies — OpenAI and Anthropic both state they do not train on API data — but the data still leaves your infrastructure. For companies with strict IP policies, regulated industries, or government contracts, this can be a non-starter.

With Open WebUI connected to Ollama, your code never leaves your machine. The model runs locally, inference happens locally, and conversation history is stored locally. For cloud deployments, the data stays on your server.

Client Data

If you are a consultant or agency working with client data, using commercial AI interfaces creates a data handling question you need to answer for every client. Self-hosted Open WebUI gives you a clear answer: the data stays in your infrastructure, under your control, subject to your security policies.

Compliance

HIPAA, SOC 2, GDPR — these frameworks care about where data is processed and stored. A self-hosted AI interface on your compliant infrastructure is inherently easier to include in your compliance scope than a third-party SaaS tool.

This does not mean local is always better. Cloud AI APIs have their own compliance certifications, and for many use cases, the compliance posture of OpenAI or Anthropic's enterprise offerings is stronger than what you can build yourself. The point is that self-hosting gives you the option when you need it.

Performance Considerations

Local Model Performance

Running models locally means your hardware determines the experience. Here are rough benchmarks for Llama 3.1 8B:

  • Apple M1 Pro (16GB RAM): ~15 tokens/second — usable for short interactions
  • Apple M2 Ultra (64GB RAM): ~40 tokens/second — comfortable for extended conversations
  • NVIDIA RTX 4090: ~80 tokens/second — near-instant responses
  • CPU only (no GPU): ~2-5 tokens/second — painfully slow, not recommended

For larger models like Llama 3.1 70B, you need at least 48GB of RAM (M2 Max or better on Mac, or a server-grade GPU). The quality improvement is significant, but the hardware requirement is steep.
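Throughput numbers translate directly into wait time. For a 400-token answer (an assumed typical response length), the wall-clock generation time at each speed above works out roughly to:

```shell
# Approximate seconds to generate a 400-token response at various throughputs.
TOKENS=400
for TPS in 5 15 40 80; do
  echo "${TPS} tok/s: ~$((TOKENS / TPS))s"
done
```

Even 15 tok/s means waiting close to half a minute for a complete answer, which is why streaming output matters so much for local models — you start reading while the rest generates.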

Latency Comparison

For cloud API connections, Open WebUI adds minimal overhead — typically 10-30ms on top of the API's native latency. The bottleneck is always the model inference, not the interface.

For local models, latency is a function of model size and hardware. First-token latency (the time before the model starts generating) ranges from 100ms for small models on fast hardware to several seconds for large models on limited hardware. Streaming display hides most of this from the user.

When to Self-Host vs. Use Commercial

Self-host when:

  • You work with sensitive or proprietary data
  • You want to use local models for privacy or cost reasons
  • You need a team AI interface without per-seat licensing
  • You want to customize the RAG pipeline or add custom tools
  • You use multiple model providers and want a unified interface

Use commercial interfaces when:

  • You need the tightest possible integration with a specific model (Claude Projects, ChatGPT plugins)
  • You value mobile apps and cross-device sync
  • You do not want to manage infrastructure
  • Your use case does not involve sensitive data
  • You need the most polished, latest-feature experience

For most developers, the answer is both. I use Claude's native interface for deep work that benefits from Claude Projects, ChatGPT for tasks that need code interpreter, and Open WebUI for everything else — especially when I want to use local models, compare outputs across providers, or work with documents I do not want to send to a cloud API.

Open WebUI is not about replacing commercial interfaces. It is about having the option to control your AI infrastructure when that control matters. The setup takes 30 minutes. The privacy and flexibility benefits are permanent.


Danil Ulmashev

Full Stack Developer
