The rise of large language models (LLMs) has opened doors to new possibilities in AI-driven applications. Retrieval-Augmented Generation (RAG) has become a popular technique for grounding LLM outputs in reliable, domain-specific data. Traditionally, developers rely on cloud-hosted LLMs and vector databases to implement RAG. However, recent advancements make it possible to run LLMs locally with Ollama while leveraging Azure Cosmos DB as the knowledge store. This approach lowers costs, keeps sensitive data under your control, and removes the dependency on external inference APIs.

This article will walk you through how to run local LLMs using Ollama, integrate them with Azure Cosmos DB, and implement a RAG pipeline with coding examples. By the end, you will have a clear understanding of how to deploy a scalable and efficient solution.

What is Retrieval-Augmented Generation (RAG)?

RAG is a method for improving the reliability of LLMs by grounding their responses in external, factual data. Instead of relying solely on a model’s pre-trained knowledge, RAG fetches relevant documents from a data store (e.g., Cosmos DB) and passes them as context to the LLM. This approach offers:

  • Accuracy: Reduces hallucinations by grounding answers in retrieved documents.
  • Flexibility: Allows domain-specific customization without retraining the model.
  • Efficiency: Supports dynamic updates to the knowledge base without fine-tuning.

Why Use Ollama + Cosmos DB?

  • Ollama: An open-source runtime for deploying and running LLMs locally. It supports popular models such as LLaMA 2, Mistral, and other open models optimized for local inference. Ollama allows you to:
    • Run models on local GPUs.
    • Control costs by avoiding API fees.
    • Keep sensitive data private.
  • Azure Cosmos DB: A globally distributed, multi-model NoSQL database. With features like vector search support, Cosmos DB can:
    • Store embeddings for fast similarity search.
    • Scale to massive workloads.
    • Integrate seamlessly with other Azure services.

By combining these two, you get a local-first AI application that uses Cosmos DB as a knowledge store for retrieval.

Architecture Overview

  1. Data Ingestion: Documents (e.g., PDFs, web pages, manuals) are processed, chunked, and embedded into vectors.
  2. Storage: Embeddings are stored in Cosmos DB, along with metadata.
  3. Query Handling: A user query is embedded and used to search Cosmos DB.
  4. Context Injection: Retrieved documents are passed to Ollama.
  5. Response Generation: The LLM generates a grounded response using the provided context.

Setting Up the Environment

Before coding, install the following:

  • Ollama: https://ollama.ai
  • Python 3.10+
  • Azure Cosmos DB account with vector search enabled.
  • Required Python libraries:
pip install requests azure-cosmos sentence-transformers

Running Local LLMs with Ollama

First, pull a model locally with Ollama. For example, to download LLaMA 2:

ollama pull llama2

Then, you can start interacting with the model:

ollama run llama2 "Explain retrieval-augmented generation in simple terms."

In Python, you can call Ollama's local REST API directly:

import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama2",
    "prompt": "Explain RAG in 2 sentences.",
    "stream": False  # return a single JSON object instead of a stream of chunks
})
print(response.json()["response"])
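
The request above sets "stream": False so that response.json() returns a single object. If you would rather display tokens as they are generated, Ollama streams newline-delimited JSON by default; here is a minimal sketch of consuming that stream:

import json
import requests

# Each streamed line is a JSON object with a partial "response" and a "done" flag
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Explain RAG in 2 sentences."},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break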

Preparing Data for RAG

  1. Load documents and split them into chunks (see the chunking sketch after this list).
  2. Generate embeddings using a sentence transformer model.
  3. Store embeddings in Cosmos DB.
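
Step 1 only hints at chunking. Here is a minimal sketch of a fixed-size splitter; the chunk_text helper, chunk size, and overlap values are illustrative assumptions rather than part of the original example:

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows (sizes are illustrative)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Hypothetical usage: turn one long document into multiple embeddable records
long_doc = {"id": "3", "content": "..."}
records = [
    {"id": f"3-{i}", "content": chunk}
    for i, chunk in enumerate(chunk_text(long_doc["content"]))
]

In practice, sentence- or token-aware splitters usually produce better retrieval quality than raw character windows.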

Embedding and storage example:

from sentence_transformers import SentenceTransformer
from azure.cosmos import CosmosClient

# Initialize the embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Example documents
documents = [
    {"id": "1", "content": "Azure Cosmos DB is a fully managed NoSQL database."},
    {"id": "2", "content": "Ollama runs large language models locally."}
]

# Generate embeddings
for doc in documents:
    doc["embedding"] = embedder.encode(doc["content"]).tolist()

# Connect to Cosmos DB
client = CosmosClient.from_connection_string("<COSMOS_CONNECTION_STRING>")
db = client.get_database_client("ragdb")
container = db.get_container_client("docs")

# Insert documents
for doc in documents:
    container.upsert_item(doc)
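
One caveat: before inserting documents, the docs container needs to be created with a vector embedding policy and a vector index (and vector search enabled on the account) for the similarity query in the next section to work. A hedged sketch of the one-time setup, assuming a recent azure-cosmos SDK that accepts these policies; parameter names and index types may vary by SDK and API version:

from azure.cosmos import PartitionKey

# One-time setup (run before the ingestion code above)
vector_embedding_policy = {
    "vectorEmbeddings": [{
        "path": "/embedding",
        "dataType": "float32",
        "distanceFunction": "cosine",
        "dimensions": 384  # all-MiniLM-L6-v2 produces 384-dimensional vectors
    }]
}
indexing_policy = {
    "vectorIndexes": [{"path": "/embedding", "type": "quantizedFlat"}]
}
container = db.create_container_if_not_exists(
    id="docs",
    partition_key=PartitionKey(path="/id"),
    vector_embedding_policy=vector_embedding_policy,
    indexing_policy=indexing_policy
)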

Querying with Vector Search

To perform similarity search, embed the query and run a vector similarity query in Cosmos DB:

query = "What is Cosmos DB?"
query_embedding = embedder.encode(query).tolist()

# Vector similarity search using the VectorDistance system function
results = container.query_items(
    query="SELECT TOP 2 c.id, c.content FROM c ORDER BY VectorDistance(c.embedding, @vector)",
    parameters=[{"name": "@vector", "value": query_embedding}],
    enable_cross_partition_query=True
)
retrieved_docs = [item["content"] for item in results]
print("Retrieved Docs:", retrieved_docs)
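
If vector search is not available on your account, one fallback for small collections is to fetch items and rank them client-side with cosine similarity. A rough sketch, not suitable for large datasets:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Fallback: ranking happens in the application instead of in Cosmos DB
candidates = list(container.read_all_items())
ranked = sorted(candidates,
                key=lambda d: cosine_similarity(d["embedding"], query_embedding),
                reverse=True)
retrieved_docs = [d["content"] for d in ranked[:2]]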

Passing Context to Ollama

Now that we have retrieved relevant documents, we can inject them into the prompt:

context = "\n".join(retrieved_docs)
prompt = f"Answer the following question using the context provided.\n\nContext: {context}\n\nQuestion: {query}\nAnswer:"
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama2",
    "prompt": prompt,
    "stream": False
})
print("LLM Answer:", response.json()["response"])
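
If you prefer a chat-style prompt with an explicit system message, recent Ollama versions also expose an /api/chat endpoint; a sketch under that assumption:

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama2",
    "messages": [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
    ],
    "stream": False
})
print("LLM Answer:", response.json()["message"]["content"])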

Building a Full RAG Pipeline

Putting it all together:

def rag_query(user_query):
    # Embed the query
    query_embedding = embedder.encode(user_query).tolist()

    # Retrieve the most similar documents from Cosmos DB
    results = container.query_items(
        query="SELECT TOP 3 c.content FROM c ORDER BY VectorDistance(c.embedding, @vector)",
        parameters=[{"name": "@vector", "value": query_embedding}],
        enable_cross_partition_query=True
    )
    retrieved_docs = [item["content"] for item in results]

    # Build the prompt with the retrieved context
    context = "\n".join(retrieved_docs)
    prompt = f"Use the following context to answer the question.\n\nContext: {context}\n\nQuestion: {user_query}\nAnswer:"

    # Query Ollama
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama2",
        "prompt": prompt,
        "stream": False
    })
    return response.json()["response"]

# Example usage
print(rag_query("Explain how Ollama and Cosmos DB can be used together."))

Scaling Considerations

When deploying RAG with Ollama and Cosmos DB in production, consider:

  • Indexing: Use vector indexes in Cosmos DB for faster similarity search.
  • Sharding: Partition large datasets efficiently.
  • Caching: Store frequently accessed embeddings and responses (a minimal sketch follows this list).
  • Security: Use Azure Managed Identities and Private Endpoints.
  • Monitoring: Leverage Azure Monitor for performance insights.
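
As an illustration of the caching point, here is a minimal in-process sketch that reuses the embedder from the earlier examples; a production deployment would more likely rely on an external cache such as Redis:

from functools import lru_cache

# Illustrative only: cache query embeddings in-process so repeated questions
# are not re-encoded. Assumes the `embedder` defined earlier in the article.
@lru_cache(maxsize=1024)
def cached_query_embedding(query: str) -> tuple:
    # lru_cache requires hashable values, so return a tuple instead of a list
    return tuple(embedder.encode(query).tolist())

The cached tuple can be converted back with list(...) wherever a plain list is required, for example when passing it as the @vector parameter.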

Advantages of This Approach

  • Local-first AI: Keeps sensitive queries private.
  • Cost-efficiency: Avoids cloud API charges for inference.
  • Flexibility: Swap models locally as needed.
  • Scalability: Cosmos DB can handle large datasets and workloads.

Challenges & Limitations

  • Hardware Requirements: Running LLMs locally requires significant GPU resources.
  • Latency: Local inference may be slower than optimized cloud-hosted models.
  • Maintenance: Keeping models updated and fine-tuned requires effort.

Conclusion

The integration of Ollama and Azure Cosmos DB for Retrieval-Augmented Generation (RAG) offers a practical balance between privacy, scalability, and cost efficiency. Running models locally with Ollama allows sensitive data to remain secure while avoiding recurring API costs, and Cosmos DB provides a reliable, cloud-native vector store to power efficient similarity search.

This hybrid setup gives developers the flexibility to build AI systems that are both grounded in relevant data and adaptable across industries. From customer support to internal knowledge assistants, the same pipeline of embedding, retrieval, and context injection can be reused and tailored to specific domains. While hardware requirements and model management pose challenges, the overall benefits of local-first AI combined with cloud-scale storage far outweigh these limitations.

In short, Ollama plus Cosmos DB provides the best of both worlds: the control and privacy of local inference with the scalability and performance of cloud infrastructure. For organizations looking to create secure, adaptable, and future-ready AI applications, this architecture is a strong and forward-looking choice.