Retrieval-Augmented Generation (RAG) is a powerful technique that enhances generative AI models by grounding them in an external knowledge base. In this article, we will explore how to build a RAG application using Apache Cassandra, Python, and Ollama, a lightweight tool for running large language models (LLMs) locally.
What is Retrieval-Augmented Generation (RAG)?
RAG combines two essential AI components:
- Retrieval: Extracting relevant information from a knowledge base.
- Generation: Using a generative model to answer queries based on the retrieved information.
This hybrid approach improves response accuracy by grounding AI-generated answers in external factual data.
Why Use Apache Cassandra for RAG?
Apache Cassandra is a highly scalable NoSQL database ideal for handling large datasets. It supports:
- High write and read throughput.
- Distributed architecture for fault tolerance.
- Fast queries using CQL (Cassandra Query Language).
Starting with version 5.0, Cassandra also offers a native VECTOR column type and storage-attached indexes for similarity search, which makes it a strong fit for storing and retrieving embeddings in RAG applications.
Setting Up Apache Cassandra
1. Installing Apache Cassandra
You can install Cassandra using Docker:
docker run --name cassandra -d -p 9042:9042 cassandra
Or install it natively. On Debian/Ubuntu, after adding the official Apache Cassandra APT repository, run:
sudo apt update && sudo apt install cassandra
(On macOS, Homebrew's brew install cassandra works as well.)
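Either way, you can confirm the node is up by opening a CQL shell; the official Docker image bundles cqlsh:
docker exec -it cassandra cqlsh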
2. Configuring a Keyspace and Table
Once Cassandra is running, create a keyspace and table:
CREATE KEYSPACE rag_app WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE rag_app;
CREATE TABLE documents (
    id UUID PRIMARY KEY,
    text TEXT,
    embedding VECTOR<FLOAT, 1536>
);
The embedding column stores the vector representation of each document for similarity search. The declared dimension (1536 here) must match the output size of whichever embedding model you use, so check the length of a generated embedding and adjust the table definition if needed.
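If you are running Cassandra 5.0 or newer, you can also add a storage-attached index (SAI) on the embedding column so the database can answer approximate-nearest-neighbor queries natively. The statement below is a sketch of that optional setup (ann_index is an arbitrary name); we will come back to it when retrieving similar documents:
CREATE CUSTOM INDEX ann_index ON rag_app.documents (embedding) USING 'StorageAttachedIndex';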
Generating Embeddings With Ollama
1. Installing Ollama
First, install the Ollama Python client. Note that the Ollama server itself is a separate install (available from ollama.com) and must be running locally; the pip package only provides the client library used to talk to it:
pip install ollama
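You also need the model weights available locally. With the Ollama CLI installed, pulling Mistral looks like this:
ollama pull mistral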
2. Creating Embeddings for Text
We use a model served by Ollama (e.g., Mistral or Llama) to generate embeddings:
import ollama

def generate_embedding(text):
    model = "mistral"
    # Request an embedding from the locally running Ollama server
    response = ollama.embeddings(model=model, prompt=text)
    return response["embedding"]

text = "Apache Cassandra is a highly scalable database."
embedding = generate_embedding(text)
print(embedding)
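It is worth confirming that the embedding length matches the dimension declared in the documents table; if it differs, adjust the VECTOR<FLOAT, N> declaration accordingly:
print(len(embedding))  # should equal the dimension declared in the documents table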
Storing and Retrieving Data From Cassandra
1. Connecting Python to Cassandra
Install the DataStax Python driver for Cassandra (use a recent release; older versions do not support the VECTOR type):
pip install cassandra-driver
Then, establish a connection:
from cassandra.cluster import Cluster
import uuid

def connect_cassandra():
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("rag_app")
    return session
2. Inserting Documents With Embeddings
def insert_document(session, text, embedding):
    doc_id = uuid.uuid4()
    query = "INSERT INTO documents (id, text, embedding) VALUES (%s, %s, %s)"
    session.execute(query, (doc_id, text, embedding))
    print("Document inserted successfully.")
3. Retrieving Similar Documents
To retrieve relevant documents, we compute cosine similarity in Python, scanning every stored row and keeping the top matches. This brute-force scan is fine for small collections; a native alternative is sketched after the code below:
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def retrieve_documents(session, query_embedding):
    rows = session.execute("SELECT id, text, embedding FROM documents")
    results = []
    for row in rows:
        stored_embedding = np.array(row.embedding)
        similarity = cosine_similarity(query_embedding, stored_embedding)
        results.append((row.text, similarity))
    results.sort(key=lambda x: x[1], reverse=True)
    return results[:5]
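If you created the storage-attached index shown earlier (Cassandra 5.0+), you can push the similarity search into the database instead of scanning rows in Python. The function below is a sketch that assumes that setup and uses Cassandra 5.0's ORDER BY ... ANN OF vector-search syntax:
def retrieve_documents_ann(session, query_embedding, limit=5):
    # Let the SAI index perform the approximate nearest-neighbor search server-side
    query = f"SELECT text FROM documents ORDER BY embedding ANN OF %s LIMIT {limit}"
    rows = session.execute(query, (query_embedding,))
    return [row.text for row in rows]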
Implementing the RAG Pipeline
Finally, we integrate the retrieval step with a generative model to answer user queries:
def generate_answer(query):
    session = connect_cassandra()
    query_embedding = generate_embedding(query)
    retrieved_docs = retrieve_documents(session, query_embedding)
    # Concatenate the retrieved passages into a context block for the prompt
    context = "\n".join([doc[0] for doc in retrieved_docs])
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    response = ollama.generate(model="mistral", prompt=prompt)
    return response["response"]

query = "What is Apache Cassandra?"
answer = generate_answer(query)
print(answer)
Conclusion
By leveraging Apache Cassandra, Python, and Ollama, we can build efficient and scalable RAG applications that:
- Store and retrieve large-scale text embeddings.
- Use fast similarity search for relevant document retrieval.
- Enhance AI responses with factual grounding.
This approach provides a robust foundation for AI-powered knowledge systems, chatbots, and search applications. With the ability to efficiently store and retrieve vectorized knowledge, RAG models help bridge the gap between traditional database search and generative AI models.
Future improvements could include fine-tuning the embedding models, optimizing indexing strategies, and integrating more advanced retrieval mechanisms like hierarchical clustering or approximate nearest neighbor search. Additionally, ensuring data freshness and incorporating real-time updates can make RAG-based applications even more powerful.
By experimenting with different vectorization techniques and database optimizations, developers can tailor RAG applications to various use cases, from enterprise knowledge management to AI-driven customer support. As the AI ecosystem evolves, tools like Apache Cassandra and Ollama will continue to play a crucial role in enabling scalable, intelligent applications.