How to Install Ollama on Linux and Run AI Models Locally

If you want to run AI models on your own server without paying per-token fees, the easiest way to do it is to install Ollama on Linux. Ollama wraps popular open-source models like Llama 3, Mistral, Phi, and DeepSeek behind a simple command-line tool and a local REST API. Your data never leaves the machine. There are no rate limits. And it works on any Linux server with at least 8 GB of RAM.

Ollama reached 95,000 GitHub stars in early 2026 and is now the most widely used runtime for running LLMs locally on Linux.

What Is Ollama and Why Use It?

Ollama is an open-source runtime for large language models. Think of it like Docker, but for AI models. You pull a model with one command, and Ollama downloads a pre-quantized build and handles memory management and GPU offloading automatically.

The case for running AI locally is stronger than ever in 2026. Hosted APIs bill per token, and OpenAI charges as much as $15 per million input tokens, depending on the model. For developers building chatbots, processing documents, or experimenting with prompts, those costs add up fast. A local Llama 3 8B model running on Ollama has no per-token cost and runs completely offline.

Ollama supports over 50 models including Llama 3, Mistral, Gemma 2, Phi-4, DeepSeek Coder, and Qwen. You can browse them all on the Ollama model library.

System Requirements

Ollama runs on CPU alone, but a GPU makes responses much faster. Check that your server meets these requirements before starting:

Component        Minimum                   Recommended
RAM              8 GB (for 7B models)      16 GB or more
Storage          10 GB free                50 GB for multiple models
CPU              x86-64 or ARM64           Modern multi-core
GPU (optional)   NVIDIA, 6 GB VRAM         RTX 3090 or A100
OS               Ubuntu 20.04+, RHEL 8+    Ubuntu 22.04 or AlmaLinux 9

No GPU? No problem. A 7B model runs fine on CPU with 16 GB of RAM. Responses are slower, but perfectly usable for development work.
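
To compare a server against the minimums in the table, a short pre-flight script can help. The following is only a sketch: the helper names are made up for this example, the 8 GB and 10 GB thresholds are taken from the table, and checking the root filesystem is an assumption (adjust the path if your models will live on another volume). It reads /proc/meminfo, so it is Linux only.

```shell
#!/bin/sh
# Pre-flight sketch: compare this machine with the minimums in the table above.
# Helper names are hypothetical; the 8 GB and 10 GB thresholds come from the table.

# Convert kB (the unit used by /proc/meminfo and df -k) to GB, rounded to nearest.
kb_to_gb() {
  echo $(( ($1 + 524288) / 1048576 ))
}

# Print "ok" if value $1 is at least minimum $2, otherwise "low".
meets_minimum() {
  if [ "$1" -ge "$2" ]; then echo "ok"; else echo "low"; fi
}

# Total RAM in kB; falls back to 0 if /proc/meminfo is unreadable.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
mem_gb=$(kb_to_gb "${mem_kb:-0}")

# Free space in kB on the root filesystem (column 4 of POSIX df output).
disk_kb=$(df -Pk / | awk 'NR==2 {print $4}')
disk_gb=$(kb_to_gb "${disk_kb:-0}")

echo "Arch: $(uname -m)"
echo "RAM:  ${mem_gb} GB ($(meets_minimum "$mem_gb" 8))"
echo "Disk: ${disk_gb} GB free on / ($(meets_minimum "$disk_gb" 10))"
```

Rounding to the nearest GB is deliberate: a machine sold as 8 GB usually reports slightly less than that in /proc/meminfo, and rounding down would flag it as too small.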

How to Install Ollama on Linux

Ollama installs as a systemd service and starts automatically on boot.

The official install script works on Ubuntu, Debian, RHEL, AlmaLinux, Rocky Linux, and most other systemd-based distributions. Run this as root or with sudo:

curl -fsSL https://ollama.com/install.sh | sh

The script downloads the binary, creates a system user, and registers a systemd service. Once it finishes, check the install:

ollama --version
systemctl status ollama

You should see the service reported as active (running) and version output like ollama version is 0.6.x. If the service is not active, enable it manually:

systemctl enable --now ollama
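
systemd reporting the unit as active does not prove the HTTP API is answering yet. One way to check is to query the version endpoint, GET /api/version, which returns a small JSON object. The sketch below wraps that in two helpers whose names are hypothetical (they are not part of Ollama); it assumes curl is installed and uses sed rather than a JSON parser, which is good enough for this one flat field.

```shell
#!/bin/sh
# Sketch: confirm the Ollama HTTP API answers, independent of systemd's view.
# Function names are hypothetical helpers, not part of Ollama.

# Pull the version string out of a JSON body like {"version":"0.6.2"}.
extract_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Query GET /api/version on the given base URL (default: local instance).
# Prints the version and returns 0 on success, returns non-zero otherwise.
ollama_api_version() {
  base_url=${1:-http://localhost:11434}
  body=$(curl -fsS --max-time 5 "$base_url/api/version") || return 1
  printf '%s\n' "$body" | extract_version
}
```

After defining the functions, run ollama_api_version || echo "Ollama API is not reachable" to get either the version string or a clear failure message.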

Manual Install Without the Script

If you prefer not to pipe a script directly to your shell, use the manual method. This works on all x86-64 Linux systems:

curl -LO https://ollama.com/download/ollama-linux-amd64.tgz
tar -C /usr/local -xzf ollama-linux-amd64.tgz

useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama

cat > /etc/systemd/system/ollama.service << 'EOF'
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target
EOF

systemctl daemon-reload
systemctl enable --now ollama

Pull and Run Your First Model

Once Ollama is running, pull a model. Llama 3.2 3B is a good first choice. It is fast, only about 2 GB in size, and capable enough for most tasks:

ollama pull llama3.2

After the download, start a chat session:

ollama run llama3.2

Type a message and press Enter. To quit, type /bye. You can also send a one-off prompt without entering interactive mode:

ollama run llama3.2 "Explain what a Linux inode is in plain English"

Other Useful Models

ollama pull phi4-mini          # Lightweight, good for low-RAM servers
ollama pull llama3.1:8b        # Strong general-purpose model
ollama pull deepseek-coder-v2  # Best for code generation and debugging
ollama pull gemma2             # Google model, great for reasoning
ollama pull mistral            # Fast, multilingual

To list downloaded models, run ollama list. To delete one and free disk space, run ollama rm model-name.
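
If you provision servers with a script, it can be handy to pull only the models that are missing. The sketch below shows one way to do that. has_model and ensure_models are hypothetical helpers, and the parsing assumes that the model name is the first column of ollama list output and that a name without a tag is stored as name:latest.

```shell
#!/bin/sh
# Sketch: pull a model only when `ollama list` does not already show it.
# has_model and ensure_models are hypothetical helpers; the parsing assumes
# the model name is the first column of `ollama list` and that untagged
# names are stored as ":latest".

# Read `ollama list` output on stdin; succeed if model $1 is present.
has_model() {
  awk -v want="$1" '
    NR > 1 && ($1 == want || $1 == want ":latest") { found = 1 }
    END { exit !found }
  '
}

# Pull each named model unless it is already on disk.
ensure_models() {
  for model in "$@"; do
    if ollama list | has_model "$model"; then
      echo "$model already present"
    else
      ollama pull "$model"
    fi
  done
}
```

For example, ensure_models llama3.2 mistral skips anything already downloaded and pulls the rest.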

Using the Ollama REST API

Ollama exposes an OpenAI-compatible REST API on port 11434, making it easy to plug into existing tools.

Ollama runs a local REST API on port 11434. By default, it only listens on localhost. Query it with curl:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What is the Linux kernel?",
  "stream": false
}'
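
With "stream": false the API returns a single JSON object, and the generated text is in its response field. Hand-written JSON breaks as soon as a prompt contains a double quote, so in scripts it is safer to let a tool build the request body. The following sketch does that with jq, which is an assumption (it is not installed by Ollama). generate_body and ask are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: build the request body with jq so quotes in the prompt are escaped,
# then print only the generated text. Assumes jq and curl are installed.
# Function names are hypothetical helpers, not part of Ollama.

# Build a non-streaming /api/generate body from a model name and a prompt.
generate_body() {
  jq -cn --arg model "$1" --arg prompt "$2" \
    '{model: $model, prompt: $prompt, stream: false}'
}

# Send the prompt to the local API and print the "response" field.
ask() {
  generate_body "$1" "$2" |
    curl -fsS http://localhost:11434/api/generate -d @- |
    jq -r '.response'
}
```

After defining the functions, ask llama3.2 'What does "inode" mean?' prints just the answer text. If curl fails, it prints its own error and jq receives no input, so nothing else is printed.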

Ollama also supports an OpenAI-compatible endpoint. So you can use it as a drop-in replacement in tools already built for the OpenAI API:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What does Ollama do?"}]
  }'

This means tools like LangChain, LlamaIndex, and Continue work with your local Ollama instance after a small config change.

Expose Ollama to Your Local Network

By default, Ollama only accepts local connections. To reach it from another machine, set the OLLAMA_HOST variable via the systemd override:

systemctl edit ollama

Add these lines:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Then reload and restart:

systemctl daemon-reload && systemctl restart ollama

Only do this on a private network or behind a firewall. Ollama has no built-in authentication, so do not expose it to the public internet without a reverse proxy in front of it.
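
After the restart, it is worth confirming which address Ollama is actually bound to. The sketch below classifies the listening socket from the output of ss -ltn (part of iproute2). listen_scope is a hypothetical helper, and it assumes the usual ss column layout, where the fourth column is the local address and port.

```shell
#!/bin/sh
# Sketch: report whether a port is bound to loopback only or to the network.
# listen_scope is a hypothetical helper. It assumes `ss -ltn` output, where
# column 1 is the socket state and column 4 is "local-address:port".

# Read `ss -ltn` output on stdin; $1 is the port (default 11434).
listen_scope() {
  awk -v port="${1:-11434}" '
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      found = 1
      # Anything other than 127.x.x.x or [::1] is reachable from outside.
      if ($4 !~ /^(127\.|\[::1\])/) exposed = 1
    }
    END {
      if (!found) print "not listening"
      else if (exposed) print "network"
      else print "loopback only"
    }'
}
```

Run ss -ltn | listen_scope on the server. A result of network means other machines can reach the port, so the firewall or reverse proxy needs to be in place.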

Add a Web Interface with Open WebUI

If you want a browser-based chat interface similar to ChatGPT, Open WebUI is the best option. It connects to your local Ollama instance and gives you a clean interface for switching models and managing conversations. Run it with Docker:

docker run -d \
  --name open-webui \
  --network=host \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

After the container starts, open http://your-server-ip:8080 in a browser. Create an admin account on the first visit, and you are ready to chat.

Enable NVIDIA GPU Acceleration

If your server has an NVIDIA GPU, Ollama picks it up automatically after you install the drivers and CUDA toolkit. First, confirm the GPU is visible:

nvidia-smi

If that command shows your GPU details, Ollama will use it on the next model run. Check the logs to confirm GPU offloading is active:

journalctl -u ollama -f

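Look for n_gpu_layers = 33 in the output. The exact number depends on the model: 7B and 8B Llama-family models typically report 33 when every layer is on the GPU, and smaller models report fewer. If you see 0, inference is on CPU only. In that case, check that the NVIDIA driver and CUDA toolkit are installed, then restart the service.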
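
A quicker check than reading logs is ollama ps, which lists each loaded model together with the processor it is running on. The sketch below trims that output to the two fields that matter here. processor_summary is a hypothetical helper, and the column positions are an assumption based on the NAME, ID, SIZE, PROCESSOR, UNTIL layout, where SIZE and PROCESSOR are two words each.

```shell
#!/bin/sh
# Sketch: summarize where each loaded model is running.
# processor_summary is a hypothetical helper. It assumes `ollama ps` prints
# the columns NAME, ID, SIZE, PROCESSOR, UNTIL, with SIZE and PROCESSOR each
# made of two words (for example "4.0 GB" and "100% GPU").

# Read `ollama ps` output on stdin; print "<model>: <processor>" per row.
processor_summary() {
  awk 'NR > 1 { print $1 ": " $5 " " $6 }'
}
```

Run ollama ps | processor_summary while a model is loaded. To see how much VRAM is in use at the same time, nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv prints one line per GPU.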

Quick Reference: Useful Ollama Commands

ollama list               # List downloaded models
ollama show llama3.2      # Show model info and parameters
ollama ps                 # Show models loaded in memory right now
ollama rm llama3.2        # Delete a model from disk
ollama pull llama3.2      # Download or update a model

Conclusion

Once you install Ollama on Linux, you have a private AI server with no per-token fees running on your own hardware. The whole setup takes under 10 minutes. Start with Phi-4 Mini or Llama 3.2 3B, then move to larger models as you need them. The full model list is on the Ollama GitHub page. If you administer the server yourself, also check our guide on patching the Copy Fail Linux kernel vulnerability before opening new ports for Ollama.
