In today’s fast-paced digital landscape, organizations are constantly seeking ways to leverage real-time data to make timely decisions. Real-time data ingestion enables systems to capture, process, and store data as it’s generated, empowering businesses with up-to-the-minute insights. Snowflake, as a cloud-based data warehousing solution, excels at managing massive volumes of data, but how do we seamlessly integrate real-time data streams into Snowflake from a source like CockroachDB? The answer lies in using Kafka Connect together with a Kafka-compatible streaming platform such as Redpanda.

In this article, we will explore the process of ingesting real-time data from CockroachDB into Snowflake using Kafka Connect and Redpanda. We will provide detailed coding examples to help you understand how to set up each component and ensure smooth data flow.

Introduction to Real-Time Data Ingestion

Real-time data ingestion involves the continuous import of data from source systems into a target destination in near real-time. This process is crucial for applications that need to react to events as they occur, such as fraud detection systems, IoT applications, and e-commerce platforms. In this scenario, CockroachDB will act as the data source, Redpanda as the message broker (Kafka alternative), and Snowflake as the data warehouse.

We will use Kafka Connect to link CockroachDB to Redpanda and Redpanda to Snowflake. Kafka Connect simplifies integration by handling the transfer of data between systems, providing connectors to a wide variety of data sources and sinks.

Architecture Overview

Before diving into the code, let’s first discuss the architecture that powers this real-time data ingestion pipeline.

  • CockroachDB: A distributed SQL database that serves as the source of our real-time data.
  • Redpanda: A Kafka-compatible, high-performance streaming platform.
  • Kafka Connect: Acts as the bridge between data sources and targets, providing connectors to capture and deliver data.
  • Snowflake: A cloud-native data warehouse that will store and manage the ingested data.

Data flows from CockroachDB into Redpanda topics, and from those topics into Snowflake. Kafka Connect drives both sides of the pipeline, using Redpanda as the message broker in between.

Setting Up CockroachDB

CockroachDB is a distributed SQL database that pairs standard SQL and transactional guarantees with the horizontal scalability usually associated with NoSQL systems. To begin, we’ll set up CockroachDB locally (or on a cloud provider). Assuming you have CockroachDB installed, here’s how you can create a sample database and table for this use case:

sql
CREATE DATABASE realtime_data;
USE realtime_data;

CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name STRING,
    email STRING,
    created_at TIMESTAMPTZ DEFAULT current_timestamp()
);

INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com'), ('Bob', 'bob@example.com');

This table will serve as the source of real-time data, and any new rows added to it will be streamed to Redpanda.
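
If you want to confirm the seed rows are in place before wiring up the pipeline, a quick query from the CockroachDB shell is enough (the --insecure flag assumes a local, insecure single-node cluster; adjust the connection flags for your deployment):

bash
cockroach sql --insecure --database=realtime_data \
  --execute="SELECT id, name, email, created_at FROM users;"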

Setting Up Redpanda

Redpanda is a Kafka-compatible streaming platform designed for high throughput and low latency. It can be a drop-in replacement for Kafka in most use cases. Follow these steps to set up Redpanda on your machine:

  1. Install Redpanda using the following command (for RPM-based Linux distributions such as RHEL or CentOS):
    bash
    curl -1sLf https://packages.vectorized.io/smash/setup.rpm.sh | sudo -E bash
    sudo yum install -y redpanda
  2. Start Redpanda:
    bash
    sudo systemctl start redpanda
  3. Create a Topic for streaming the data:
    bash
    rpk topic create cockroach-users

    This topic will hold the user data from CockroachDB as it gets ingested. The name matches what the JDBC Source connector configured later will write to (its topic.prefix value followed by the source table name). A quick health check of the broker and topic follows below.
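
Before moving on, it is worth confirming that the broker is healthy and that the topic exists. rpk covers both checks:

bash
rpk cluster info
rpk topic list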

Kafka Connect Setup

Kafka Connect is the critical component that connects CockroachDB with Redpanda and Snowflake. We will need the Kafka Connect JDBC Source connector for CockroachDB and the Snowflake Sink connector to load the data into Snowflake.

Install Kafka Connect

To set up Kafka Connect, download and extract Kafka from the official website. Kafka Connect comes bundled with Kafka.

bash
wget https://archive.apache.org/dist/kafka/2.8.0/kafka_2.13-2.8.0.tgz
tar -xvf kafka_2.13-2.8.0.tgz
cd kafka_2.13-2.8.0
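
Kafka Connect also needs to know where the brokers are. Redpanda’s Kafka API listens on localhost:9092 by default, so point config/connect-distributed.properties at it; the plugin.path value below is an assumption and should match wherever your connectors end up installed:

properties
bootstrap.servers=localhost:9092
group.id=connect-cluster
plugin.path=/usr/share/confluent-hub-components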

Install JDBC Source Connector

Next, install the JDBC Source connector. Because CockroachDB speaks the PostgreSQL wire protocol, the connector’s bundled PostgreSQL JDBC driver can talk to it directly.

bash
confluent-hub install confluentinc/kafka-connect-jdbc:latest
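
The confluent-hub CLI ships with Confluent Platform; if it is not available on your machine, an equivalent manual route is to download the connector zip from Confluent Hub in a browser and unpack it into the plugin path (the directory and archive name below are assumptions matching the plugin.path configured earlier):

bash
sudo mkdir -p /usr/share/confluent-hub-components
sudo unzip confluentinc-kafka-connect-jdbc-*.zip -d /usr/share/confluent-hub-components/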

Configure Kafka Connect for CockroachDB

Create a configuration file cockroachdb-source.properties for the JDBC Source connector:

properties
name=cockroachdb-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:postgresql://localhost:26257/realtime_data?user=root
table.whitelist=users
mode=timestamp
timestamp.column.name=created_at
topic.prefix=cockroach-

This configuration connects to the users table in CockroachDB, detects new rows via the created_at timestamp column (the table’s UUID primary key cannot be used with the connector’s incrementing mode), and publishes them to the cockroach-users topic in Redpanda.

Install Snowflake Sink Connector

Similarly, install the Snowflake Sink connector:

bash
confluent-hub install snowflakeinc/snowflake-kafka-connector:latest

Configure Kafka Connect for Snowflake

Now, create a configuration file snowflake-sink.properties to load data into Snowflake:

properties
name=snowflake-sink
connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
tasks.max=1
topics=cockroach-users
snowflake.url.name=https://<account>.snowflakecomputing.com
snowflake.user.name=<username>
snowflake.private.key=<private_key>
snowflake.database.name=<database_name>
snowflake.schema.name=<schema_name>
snowflake.topic2table.map=cockroach-users:<target_table>
buffer.count.records=1000
buffer.size.bytes=5000000
buffer.flush.time=60

Replace <account>, <username>, <private_key>, <database_name>, <schema_name>, and <target_table> with your Snowflake details.
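
The Snowflake connector authenticates with key-pair authentication rather than a password. If you do not already have a key pair registered for the user, the usual steps look like this; the value you paste into snowflake.private.key is the body of rsa_key.p8 with the BEGIN/END header lines and line breaks removed:

bash
# Generate an unencrypted private key and the matching public key
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub

# Then, in a Snowflake worksheet, register the public key for the connector user
# (paste the contents of rsa_key.pub without the BEGIN/END lines):
#   ALTER USER <username> SET RSA_PUBLIC_KEY='MIIBIjANBgkq...';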

Ingesting Data into Snowflake

With Kafka Connect configured, we can now run the connectors and start ingesting data. First, start Kafka Connect:

bash
bin/connect-distributed.sh config/connect-distributed.properties
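
Before registering the connectors, it is worth confirming that the Connect worker is up and that it can see the newly installed plugins:

bash
curl http://localhost:8083/
curl http://localhost:8083/connector-plugins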

Once Kafka Connect is running, you can register the cockroachdb-source and snowflake-sink connectors. The Connect REST API accepts JSON rather than raw .properties files, so each configuration needs to be wrapped in a small JSON payload (a sketch of the source payload follows after these commands) and posted to the connectors endpoint:

bash
curl -X POST -H "Content-Type: application/json" --data @cockroachdb-source.json http://localhost:8083/connectors
curl -X POST -H "Content-Type: application/json" --data @snowflake-sink.json http://localhost:8083/connectors
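
As a rough sketch, the source payload simply nests the settings from cockroachdb-source.properties under a config key; snowflake-sink.json follows the same pattern with the sink settings:

bash
# Wrap the JDBC source configuration as a JSON payload for the Connect REST API
cat > cockroachdb-source.json <<'EOF'
{
  "name": "cockroachdb-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:postgresql://localhost:26257/realtime_data?user=root",
    "table.whitelist": "users",
    "mode": "timestamp",
    "timestamp.column.name": "created_at",
    "topic.prefix": "cockroach-"
  }
}
EOF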

Now, whenever new data is inserted into the users table in CockroachDB, it will flow through Redpanda into Snowflake in near real-time.

Testing the Pipeline

To test the pipeline, insert new records into the users table in CockroachDB:

sql
INSERT INTO users (name, email) VALUES ('Charlie', 'charlie@example.com');

Check the topic in Redpanda:

bash
rpk topic consume cockroach-users

You should see the new record streamed to the topic. Finally, verify that the data has arrived in Snowflake by querying the target table:

sql
SELECT * FROM <schema_name>.<target_table>;

The newly inserted record should now appear in Snowflake.
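
If a record does not show up, the Connect REST API’s status endpoints are the first place to look; a failing task reports its error and stack trace there:

bash
curl http://localhost:8083/connectors/cockroachdb-source/status
curl http://localhost:8083/connectors/snowflake-sink/status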

Conclusion

Real-time data ingestion is a vital process for modern data architectures, enabling businesses to stay competitive by acting on the freshest data. By integrating CockroachDB with Snowflake through Redpanda and Kafka Connect, we have built a resilient, high-performance pipeline that allows for seamless data flow from source to target in real time.

This setup, using open-source technologies like CockroachDB and Redpanda along with cloud-native Snowflake, proves to be both scalable and efficient for ingesting real-time data. The use of Kafka Connect simplifies the process, reducing the complexity of writing custom ingestion pipelines while providing a flexible architecture that can be extended to other data sources and sinks.

Whether you are dealing with large-scale transactional data or real-time event streams, this approach offers a robust solution for modern data-driven applications.