Retrieval-Augmented Generation (RAG) has emerged as one of the most practical and powerful architectures for enabling large language models (LLMs) to use custom domain knowledge. Traditional LLMs are trained on general-purpose corpora and cannot access your private data unless you provide it at inference time. RAG solves this problem by allowing the model to “look up” relevant documents from an external knowledge base and use them as context for generating accurate responses.

While many cloud-based tools support RAG, some scenarios require an offline, fully private setup—working with sensitive business files, proprietary research, or air-gapped environments. This is where Ollama and FAISS pair well: Ollama runs powerful open-source LLMs entirely locally, and FAISS provides efficient vector search on your machine.

This article covers everything you need to build an offline RAG pipeline using Ollama for generation, FAISS for vector search, and Python as the glue. By the end, you’ll have a working system that can ingest documents, embed them, store vectors locally, and answer questions using retrieval-augmented prompts—all without touching the cloud.

Introduction to Offline RAG

Traditional RAG involves three major components:

  1. Embedding model – Converts text into high-dimensional numerical vectors.

  2. Vector store – Holds the embeddings and provides fast similarity search.

  3. LLM generator – Accepts retrieved context and generates the final answer.

In an offline environment, you cannot rely on cloud-hosted APIs for any of these steps. Fortunately:

  • Ollama can run both embedding models and LLMs on your machine.

  • FAISS (Facebook AI Similarity Search) can store embeddings and perform high-performance nearest-neighbor searches.

  • Python ties all components together in a clean and extendable workflow.

We will walk through the entire architecture—from installation to querying.

Why Use Ollama for Offline LLM and Embedding?

Ollama is a local model runner that makes it easy to run modern LLMs and embedding models on your own hardware. Several qualities make it a natural fit for offline RAG:

  • Runs Llama, Mistral, Phi, Gemma, and many more models locally.

  • Provides both text generation and embedding endpoints.

  • Zero cloud dependencies.

  • Easy installation and single-line model pulls.

  • Script-friendly HTTP API for Python integrations.

You will use Ollama in two ways:

  1. To generate embeddings using a model such as nomic-embed-text or mxbai-embed-large.

  2. To generate RAG-augmented answers using a model such as llama3 or mistral.

Setting Up the Environment

To get started, install both Ollama and the required Python dependencies.

Install Ollama

On macOS or Linux:

curl -fsSL https://ollama.com/install.sh | sh

Then pull your required models:

ollama pull llama3
ollama pull nomic-embed-text

You can replace these with any compatible models you prefer.

Install Python Packages

Create a new project environment and install dependencies:

pip install faiss-cpu numpy requests python-dotenv

That’s all you need to run the system.

Data Preparation: Loading and Chunking Documents

RAG systems typically work better when documents are chunked into small, semantically meaningful pieces (e.g., 200–500 tokens). This improves the chances that the retriever will find the most relevant part.

Here is a simple chunking function:

def chunk_text(text, chunk_size=300, overlap=50):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap
    return chunks

Load your file and chunk it:

with open("my_documents/company_guide.txt", "r") as f:
full_text = f.read()
chunks = chunk_text(full_text)
print(“Total chunks:”, len(chunks))

Generating Embeddings Locally With Ollama

Ollama exposes an easy HTTP endpoint for generating embeddings.

Embedding Function

import requests

def embed_text(text, model="nomic-embed-text"):
    url = "http://localhost:11434/api/embeddings"
    payload = {
        "model": model,
        "prompt": text
    }
    response = requests.post(url, json=payload)
    data = response.json()
    return data["embedding"]

Test embedding:

vec = embed_text("Hello world")
print(len(vec)) # dimensionality

Building a FAISS Vector Store

Once you have embeddings, you need to store them for fast retrieval.

Create the Index

import faiss
import numpy as np
dimension = len(vec)
index = faiss.IndexFlatL2(dimension)
embedding_store = []

Add Chunk Embeddings to the Index

for chunk in chunks:
    vector = embed_text(chunk)
    embedding_store.append((chunk, vector))
    index.add(np.array([vector], dtype="float32"))

Save the Index for Later Use

faiss.write_index(index, "faiss_index.bin")

import json
with open("vector_store.json", "w") as f:
    json.dump(
        [{"text": t, "vector": v} for t, v in embedding_store],
        f
    )

Your offline knowledge base is now built.
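
When you restart the application later, you can reload both artifacts instead of re-embedding every document. A minimal sketch, assuming the file names used above:

import json
import faiss

# Reload the FAISS index and the chunk texts saved earlier
index = faiss.read_index("faiss_index.bin")
with open("vector_store.json", "r") as f:
    embedding_store = [(item["text"], item["vector"]) for item in json.load(f)]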

Searching FAISS for the Most Relevant Chunks

When a user asks a question, you embed the query using Ollama and then search for the nearest chunks.

Search Function

def search_index(query, k=4):
    query_vec = embed_text(query)
    query_vec_np = np.array([query_vec], dtype="float32")
    distances, indices = index.search(query_vec_np, k)
    results = [embedding_store[i][0] for i in indices[0]]
    return results

Creating a RAG Prompt With Retrieved Context

After retrieval, construct a context-augmented prompt:

def build_rag_prompt(query, retrieved_chunks):
    context_text = "\n\n".join(retrieved_chunks)
    prompt = f"""
You are an offline assistant with access to private documents.

CONTEXT:

{context_text}

QUESTION:

{query}

Provide a helpful and accurate answer based only on the context above.
"""
    return prompt

Running the Final Query Through an Ollama LLM

Create a function to send the RAG prompt to Ollama’s text generation API:

def ask_ollama(prompt, model="llama3"):
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=payload)
    return response.json()["response"]

Putting It All Together: Full RAG Query Pipeline

def rag_query(question):
    retrieved = search_index(question)
    prompt = build_rag_prompt(question, retrieved)
    answer = ask_ollama(prompt)
    return answer

Example usage:

response = rag_query("What procedures do we follow for onboarding new employees?")
print(response)

You now have a fully offline private RAG system.

Enhancing the Offline RAG System

Once the fundamentals work, you can improve the system in several ways:

Use More Advanced Embedding Models

Better embeddings can significantly improve retrieval accuracy. Options include:

  • mxbai-embed-large

  • nomic-embed-text

  • bge-large

Just pull and switch models in the code.
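
One caveat when switching: different embedding models usually produce vectors of different dimensionality, so the FAISS index must be rebuilt with the new model. A short sketch, reusing the embed_text function and chunks from earlier (the model name is just an example):

# Rebuild the index after switching the embedding model
EMBED_MODEL = "mxbai-embed-large"   # previously "nomic-embed-text"

vectors = [embed_text(c, model=EMBED_MODEL) for c in chunks]
index = faiss.IndexFlatL2(len(vectors[0]))   # dimension changes with the model
index.add(np.array(vectors, dtype="float32"))
embedding_store = list(zip(chunks, vectors))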

Improve Chunking With Semantic Breakpoints

Instead of splitting blindly on word counts, use:

  • Sentence Transformers paragraph segmentation

  • PDF section extraction

  • Markdown heading-aware chunking

Better chunks mean cleaner retrieval.
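
As a concrete example, here is a minimal sketch of Markdown heading-aware chunking. It simply splits the text before each heading line and is a starting point rather than a full parser:

import re

def chunk_markdown(text):
    # Split before every Markdown heading so each chunk covers one section
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)
    return [s.strip() for s in sections if s.strip()]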

Add Metadata and Filtering

Enhance your vector store by storing:

  • document titles

  • timestamps

  • categories

  • tags

Then filter search results by metadata.
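
One lightweight approach, sketched below, is to keep a metadata dictionary alongside each chunk in embedding_store and filter after the FAISS search. This assumes you extend the store entries to (text, vector, metadata) tuples at ingestion time; the category field is illustrative:

def search_with_filter(query, k=4, category=None):
    # Assumes embedding_store entries are (text, vector, metadata) tuples,
    # e.g. metadata = {"title": "...", "category": "hr", "tags": [...]}
    query_vec = np.array([embed_text(query)], dtype="float32")
    distances, indices = index.search(query_vec, k * 3)   # over-fetch, then filter
    results = []
    for i in indices[0]:
        text, _, meta = embedding_store[i]
        if category is None or meta.get("category") == category:
            results.append(text)
        if len(results) == k:
            break
    return results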

Use FAISS Index Variants for Larger Databases

For huge knowledge bases:

  • IndexIVFFlat

  • IndexHNSWFlat

These approximate indexes trade a small amount of recall for sub-linear search time.
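
A brief sketch of both variants, reusing dimension and the vectors already kept in embedding_store (the parameter values are illustrative starting points, not tuned defaults):

# Matrix of all chunk vectors from the store built earlier
xb = np.array([v for _, v in embedding_store], dtype="float32")

# IVF: clusters vectors into nlist cells and probes only a few of them per query
nlist = 100                                       # number of clusters; tune for your data size
quantizer = faiss.IndexFlatL2(dimension)
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
ivf_index.train(xb)                               # IVF requires a training pass
ivf_index.add(xb)
ivf_index.nprobe = 10                             # cells searched per query

# HNSW: graph-based index, no training step required
hnsw_index = faiss.IndexHNSWFlat(dimension, 32)   # 32 = graph neighbors per node
hnsw_index.add(xb)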

Add Caching for Frequent Queries

Since your system runs offline, caching results for repeated questions improves user experience.
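
A minimal sketch using an in-memory dictionary keyed by the question text:

# Naive cache for repeated questions; reset it whenever the index changes
_answer_cache = {}

def rag_query_cached(question):
    if question not in _answer_cache:
        _answer_cache[question] = rag_query(question)
    return _answer_cache[question]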

Full Minimal Example Script

Here is a simplified end-to-end script:

import requests, faiss, numpy as np

def embed_text(text):
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def ask_ollama(prompt):
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "llama3", "prompt": prompt, "stream": False})
    return r.json()["response"]

def chunk_text(t, size=300, overlap=50):
    w = t.split()
    chunks, i = [], 0
    while i < len(w):
        chunks.append(" ".join(w[i:i + size]))
        i += size - overlap
    return chunks

# Load document
text = open("docs.txt").read()
chunks = chunk_text(text)

# Build vector store
vectors = [embed_text(c) for c in chunks]
dim = len(vectors[0])
index = faiss.IndexFlatL2(dim)
index.add(np.array(vectors).astype("float32"))

def search(query, k=4):
    qv = np.array([embed_text(query)], dtype="float32")
    _, idx = index.search(qv, k)
    return [chunks[i] for i in idx[0]]

def rag_query(q):
    retrieved = search(q)
    context = "\n\n".join(retrieved)
    prompt = f"CONTEXT:\n\n{context}\n\nQUESTION: {q}"
    return ask_ollama(prompt)

print(rag_query("Explain our company mission."))

This script forms the foundation of a reproducible and extendable offline RAG workflow.

Troubleshooting Tips

Ollama refuses connections

Ensure the service is running:

ollama serve
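
From Python, you can also check that the API is reachable before running the pipeline. A small sketch, assuming the default port 11434 and the /api/tags model-listing endpoint:

import requests

try:
    r = requests.get("http://localhost:11434/api/tags", timeout=5)
    print("Ollama is up. Local models:", [m["name"] for m in r.json()["models"]])
except requests.ConnectionError:
    print("Ollama is not reachable. Start it with `ollama serve`.")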

Embeddings produce inconsistent shapes

Confirm the model dimension:

print(len(embed_text("test")))

FAISS index returns irrelevant chunks

Check:

  • chunk size (too large or too small)

  • embedding quality

  • number of retrieved results (k)
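
To see how close the matches actually are, you can print the raw L2 distances next to each retrieved chunk. A quick diagnostic sketch based on the search function defined earlier:

def debug_search(query, k=4):
    query_vec = np.array([embed_text(query)], dtype="float32")
    distances, indices = index.search(query_vec, k)
    for dist, i in zip(distances[0], indices[0]):
        # Smaller L2 distance means a closer match; large values hint at weak retrieval
        print(f"{dist:8.2f}  {embedding_store[i][0][:80]}...")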

Conclusion

Building an offline RAG system with Ollama and FAISS offers a powerful alternative to cloud-based solutions—providing full privacy, zero data leakage, and complete control over your local environment. With Ollama handling both embedding and LLM generation, and FAISS delivering fast similarity search, you have a fully self-contained architecture capable of answering complex questions from your custom dataset.

You started by understanding the role of chunking, embeddings, vector storage, and prompt construction. From there, you implemented each part step by step: preparing documents, generating embeddings locally, building a FAISS index, retrieving relevant chunks, composing context-rich prompts, and generating answers from a local LLM. The resulting pipeline is efficient, maintainable, and fully extensible, allowing you to scale up with more documents, better models, or more advanced FAISS indexes.

This offline RAG architecture is suitable for secure enterprises, personal research archives, proprietary codebases, or any environment where cloud dependency is not acceptable. With the flexibility of Python and the increasing capabilities of local LLMs, you now have all the tools necessary to create an advanced retrieval-based assistant that runs entirely on your machine.

If you continue expanding this system—adding GUIs, PDF parsing, metadata filtering, or LLM-powered summarization—you can transform it into a complete private knowledge system. The best part is that your data never leaves your device, ensuring maximum confidentiality while still giving you the full power of modern AI.