If you want to run AI models on your own server without paying per-token fees, the easiest way to do it is to install Ollama on Linux. Ollama wraps popular open-source models like Llama 3, Mistral, Phi, and DeepSeek behind a simple command-line tool and a local REST API. Your data never leaves the machine. There are no rate limits. And it works on any Linux server with at least 8 GB of RAM.

What Is Ollama and Why Use It?
Ollama is an open-source runtime for large language models. Think of it like Docker, but for AI models. You pull a model with one command, and Ollama downloads a pre-quantized build, then handles memory management and GPU offloading automatically.
The case for running AI locally is stronger than ever in 2026. Hosted APIs bill by the token, and some OpenAI models have been priced at $15 per million input tokens. For developers building chatbots, processing documents, or experimenting with prompts, those costs add up fast. A local Llama 3 8B model running on Ollama costs nothing per token and runs completely offline.
Ollama supports over 50 models including Llama 3, Mistral, Gemma 2, Phi-4, DeepSeek Coder, and Qwen. You can browse them all on the Ollama model library.
System Requirements
Ollama runs on CPU alone, but a GPU makes responses much faster. Check that your server meets these requirements before starting:
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB (for 7B models) | 16 GB or more |
| Storage | 10 GB free | 50 GB for multiple models |
| CPU | x86-64 or ARM64 | Modern multi-core |
| GPU (optional) | NVIDIA, 6 GB VRAM | RTX 3090 or A100 |
| OS | Ubuntu 20.04+, RHEL 8+ | Ubuntu 22.04 or AlmaLinux 9 |
No GPU? No problem. A 7B model runs fine on CPU with 16 GB of RAM. Responses are slower, but perfectly usable for development work.
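The RAM and storage rows of the table can be checked with a few lines of shell. The following is a minimal sketch, not an official Ollama tool: the function names are made up for illustration, the 8 GB and 16 GB thresholds are copied from the table, and it assumes a Linux system that exposes /proc/meminfo.

```shell
#!/bin/sh
# Hypothetical helper: classify a RAM size (whole GB) against the table above.
# Thresholds (8 GB minimum, 16 GB recommended) come from the RAM row.
ram_tier() {
    if [ "$1" -ge 16 ]; then
        echo "recommended"
    elif [ "$1" -ge 8 ]; then
        echo "minimum"
    else
        echo "below-minimum"
    fi
}

# Total RAM in GB, rounded to the nearest GB (/proc/meminfo reports kB).
total_ram_gb() {
    awk '/^MemTotal:/ { printf "%d\n", ($2 + 524288) / 1048576 }' /proc/meminfo
}

# Free disk space in whole GB on the filesystem that holds the given path.
free_disk_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}

ram_gb=$(total_ram_gb)
echo "RAM: ${ram_gb} GB ($(ram_tier "$ram_gb"))"
echo "Free disk on /: $(free_disk_gb /) GB"
```

Rounding to the nearest GB is a deliberate choice here, since a machine sold as 8 GB usually reports slightly less than that to the operating system.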
How to Install Ollama on Linux

The official install script works on Ubuntu, Debian, RHEL, AlmaLinux, Rocky Linux, and most other systemd-based distributions. Run it as root, or as a user with sudo rights (the script calls sudo itself when it needs to):
curl -fsSL https://ollama.com/install.sh | sh
The script downloads the binary, creates a system user, and registers a systemd service. Once it finishes, check the install:
ollama --version
systemctl status ollama
You should see the service running and version output like ollama version is 0.6.x. If the service is not active, enable and start it manually:
systemctl enable --now ollama
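In a provisioning script it helps to wait until the service actually answers before moving on. The sketch below is one way to do that, under stated assumptions: wait_for_ollama is a hypothetical helper name, the default address is Ollama's standard local port 11434, and the check uses the /api/version endpoint, which returns a small JSON document with the server version.

```shell
#!/bin/sh
# Hypothetical helper: poll the Ollama HTTP API until it answers or we give up.
# Arguments: base URL, number of attempts, seconds to sleep between attempts.
wait_for_ollama() {
    url=${1:-http://localhost:11434}
    attempts=${2:-10}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Succeeds only when the server returns an HTTP success status.
        if curl -fsS --max-time 2 "$url/api/version" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example:
#   wait_for_ollama && echo "Ollama is answering"
```

With the defaults, the function gives up after roughly ten seconds and returns a non-zero status, so it can gate later steps with && or an if statement.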
Manual Install Without the Script
If you prefer not to pipe a script directly to your shell, use the manual method. The commands below are for x86-64 systems and must be run as root. On ARM64, download ollama-linux-arm64.tgz instead:
curl -LO https://ollama.com/download/ollama-linux-amd64.tgz
tar -C /usr/local -xzf ollama-linux-amd64.tgz
useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama
cat > /etc/systemd/system/ollama.service << 'EOF'
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
[Install]
WantedBy=default.target
EOF
systemctl daemon-reload
systemctl enable --now ollama
Pull and Run Your First Model
Once Ollama is running, pull a model. Llama 3.2 3B is a good first choice. It is fast, only about 2 GB in size, and capable enough for most tasks:
ollama pull llama3.2
After the download, start a chat session:
ollama run llama3.2
Type a message and press Enter. To quit, type /bye. You can also send a one-off prompt without entering interactive mode:
ollama run llama3.2 "Explain what a Linux inode is in plain English"
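Because the prompt is just a command-line argument, shell command substitution can feed file contents to the model. The following is a small sketch of that pattern. The function name summarize_file and the OLLAMA_BIN variable are hypothetical conveniences (the variable only exists so the function can be dry-run with a stand-in command), and the instruction text is an arbitrary example.

```shell
#!/bin/sh
# OLLAMA_BIN is a hypothetical override used for dry runs;
# it defaults to the real ollama binary.
OLLAMA_BIN=${OLLAMA_BIN:-ollama}

# Send the contents of a text file to a model with a short instruction in front.
# Arguments: model name, path to a text file.
summarize_file() {
    "$OLLAMA_BIN" run "$1" "Summarize this file in three bullet points: $(cat "$2")"
}

# Example:
#   summarize_file llama3.2 /etc/os-release
```

This suits small files only, since the whole file has to fit in both the command line and the model's context window.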
Other Useful Models
ollama pull phi4-mini # Lightweight, good for low-RAM servers
ollama pull llama3.1:8b # Strong general-purpose model
ollama pull deepseek-coder-v2 # Best for code generation and debugging
ollama pull gemma2 # Google model, great for reasoning
ollama pull mistral # Fast, multilingual
To list downloaded models, run ollama list. To delete one and free disk space, run ollama rm model-name.
Using the Ollama REST API

Ollama runs a local REST API on port 11434. By default, it only listens on localhost. Query it with curl:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "What is the Linux kernel?",
"stream": false
}'
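With "stream": false the API returns a single JSON document, and the generated text sits in its response field. The sketch below wraps the call shown above so that only the text is printed. It assumes jq is installed, and the function names are hypothetical. Building the request body with jq rather than by hand keeps quotes and newlines in the prompt correctly escaped.

```shell
#!/bin/sh
# Build the request body for /api/generate as compact JSON.
# Arguments: model name, prompt text.
build_payload() {
    jq -nc --arg model "$1" --arg prompt "$2" \
        '{model: $model, prompt: $prompt, stream: false}'
}

# Print only the generated text from a non-streaming reply read on stdin.
extract_response() {
    jq -r '.response'
}

# Hypothetical wrapper: send a prompt to the local API and print the answer.
generate() {
    build_payload "$1" "$2" |
        curl -fsS http://localhost:11434/api/generate -d @- |
        extract_response
}

# Example:
#   generate llama3.2 'What is the Linux kernel?'
```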
Ollama also exposes an OpenAI-compatible endpoint, so you can use it as a drop-in replacement in tools already built for the OpenAI API:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "What does Ollama do?"}]
}'
This means tools like LangChain, LlamaIndex, and Continue work with your local Ollama instance after a small config change.
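For many clients the config change amounts to two environment variables. The sketch below is an assumption to verify against each tool's documentation: OPENAI_BASE_URL and OPENAI_API_KEY are the names read by the official OpenAI SDKs, but other tools may use their own settings. The chat_content helper is a hypothetical name and needs jq.

```shell
#!/bin/sh
# Point OpenAI-style clients at the local Ollama server.
# These variable names are the ones the official OpenAI SDKs read;
# other tools may expect different settings.
export OPENAI_BASE_URL="http://localhost:11434/v1"
# Ollama does not check the key, but many clients refuse to start without one.
export OPENAI_API_KEY="ollama"

# Pull the assistant's reply out of a /v1/chat/completions response on stdin.
chat_content() {
    jq -r '.choices[0].message.content'
}

# Example: append "| chat_content" to the curl command above
# to print only the reply text.
```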
Expose Ollama to Your Local Network
By default, Ollama only accepts local connections. To reach it from another machine, set the OLLAMA_HOST variable via the systemd override:
systemctl edit ollama
Add these lines:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Then reload and restart:
systemctl daemon-reload && systemctl restart ollama
Only do this on a private network or behind a firewall. Ollama has no built-in authentication, so do not expose it to the public internet unless a reverse proxy that enforces authentication (and ideally TLS) sits in front of it.
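Once the server listens on the network, the ollama command on another machine can talk to it, because the CLI reads the same OLLAMA_HOST variable to decide which server to contact. The following is a rough sketch, assuming the ollama binary is also installed on the client. The function name remote_ollama, the OLLAMA_BIN override, and the address 192.168.1.50 are placeholders for illustration.

```shell
#!/bin/sh
# OLLAMA_BIN is a hypothetical override for dry runs; it defaults to ollama.
OLLAMA_BIN=${OLLAMA_BIN:-ollama}

# Run an ollama CLI command against a server on another machine.
# Arguments: server address, then the usual ollama arguments.
remote_ollama() {
    server=$1
    shift
    # The variable is set for this one command only.
    OLLAMA_HOST="$server:11434" "$OLLAMA_BIN" "$@"
}

# Example (192.168.1.50 is a placeholder address):
#   remote_ollama 192.168.1.50 list
#   remote_ollama 192.168.1.50 run llama3.2 "Hello from another machine"
```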
Add a Web Interface with Open WebUI
If you want a browser-based chat interface similar to ChatGPT, Open WebUI is the best option. It connects to your local Ollama instance and gives you a clean interface for switching models and managing conversations. Run it with Docker:
docker run -d \
--name open-webui \
--network=host \
-v open-webui:/app/backend/data \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
--restart always \
ghcr.io/open-webui/open-webui:main
After the container starts, open http://your-server-ip:8080 in a browser. Create an admin account on the first visit, and you are ready to chat.
Enable NVIDIA GPU Acceleration
If your server has an NVIDIA GPU, Ollama picks it up automatically after you install the drivers and CUDA toolkit. First, confirm the GPU is visible:
nvidia-smi
If that command shows your GPU details, Ollama will use it on the next model run. Check the logs to confirm GPU offloading is active:
journalctl -u ollama -f
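Look for a line such as n_gpu_layers = 33 in the output. The exact number depends on the model: a value equal to the model's full layer count (33 for an 8B Llama model) means every layer is running on the GPU. If you see 0, inference is on CPU only. In that case, check that the NVIDIA driver and CUDA toolkit are installed correctly, then restart the service. You can also run ollama ps while a model is loaded; its PROCESSOR column shows how the model is split between CPU and GPU.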
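Before pulling a larger model, it can be worth comparing the card's memory with the 6 GB minimum from the requirements table. The sketch below uses nvidia-smi's query options. The function names are hypothetical, and the 6000 MiB threshold is an assumption chosen to leave a little slack for cards that report slightly under 6144 MiB.

```shell
#!/bin/sh
# Total memory of the first GPU in MiB, using nvidia-smi's query interface.
gpu_vram_mib() {
    nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1
}

# Hypothetical check against the 6 GB minimum from the requirements table.
# Argument: VRAM size in MiB.
meets_min_vram() {
    [ "$1" -ge 6000 ]
}

# Example:
#   if meets_min_vram "$(gpu_vram_mib)"; then echo "GPU has enough VRAM"; fi
```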
Quick Reference: Useful Ollama Commands
ollama list # List downloaded models
ollama show llama3.2 # Show model info and parameters
ollama ps # Show models loaded in memory right now
ollama rm llama3.2 # Delete a model from disk
ollama pull llama3.2 # Download or update a model
Conclusion
Once you install Ollama on Linux, you have a private AI server with no per-token fees running on your own hardware. The whole setup takes under 10 minutes. Start with Phi-4 Mini or Llama 3.2 3B, then move to larger models as you need them. The full model list is in the Ollama model library. If you manage the same Linux server, also check our guide on patching the Copy Fail Linux kernel vulnerability before opening new ports for Ollama.