In today’s fast-paced digital landscape, organizations are constantly seeking ways to leverage real-time data for making timely decisions. Real-time data ingestion enables systems to capture, process, and store data as it’s generated, empowering businesses with up-to-the-minute insights. Snowflake, as a cloud-based data warehousing solution, excels in managing massive volumes of data, but how do we seamlessly integrate real-time data streams into Snowflake from a source like CockroachDB? The answer lies in utilizing Kafka Connect alongside a Kafka-compatible streaming platform like Redpanda.
In this article, we will explore the process of ingesting real-time data from CockroachDB into Snowflake using Kafka Connect and Redpanda. We will provide detailed coding examples to help you understand how to set up each component and ensure smooth data flow.
Introduction to Real-Time Data Ingestion
Real-time data ingestion involves the continuous import of data from source systems into a target destination in near real-time. This process is crucial for applications that need to react to events as they occur, such as fraud detection systems, IoT applications, and e-commerce platforms. In this scenario, CockroachDB will act as the data source, Redpanda as the message broker (Kafka alternative), and Snowflake as the data warehouse.
We will use Kafka Connect to link CockroachDB to Redpanda and Redpanda to Snowflake. Kafka Connect simplifies integration by handling the transfer of data between systems, providing connectors to a wide variety of data sources and sinks.
Architecture Overview
Before diving into the code, let’s first discuss the architecture that powers this real-time data ingestion pipeline.
- CockroachDB: A distributed SQL database that serves as the source of our real-time data.
- Redpanda: A Kafka-compatible, high-performance streaming platform.
- Kafka Connect: Acts as the bridge between data sources and targets, providing connectors to capture and deliver data.
- Snowflake: A cloud-native data warehouse that will store and manage the ingested data.
Data flows from CockroachDB into Redpanda in real time; Kafka Connect, using Redpanda as its message broker, moves the data onward into Snowflake.
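At a glance, the pipeline looks like this:

```
CockroachDB  (source table: users)
      |
      |  Kafka Connect JDBC Source connector
      v
Redpanda  (topic: cockroach-users)
      |
      |  Kafka Connect Snowflake Sink connector
      v
Snowflake  (target table)
```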
Setting Up CockroachDB
CockroachDB is a distributed SQL database that combines the best of relational databases with the advantages of NoSQL. To begin, we’ll set up CockroachDB locally (or on a cloud provider). Assuming you have CockroachDB installed, here’s how you can create a sample database and table for this use case:
```sql
CREATE DATABASE realtime_data;
USE realtime_data;

CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name STRING,
    email STRING,
    created_at TIMESTAMPTZ DEFAULT current_timestamp()
);

INSERT INTO users (name, email)
VALUES ('Alice', 'alice@example.com'), ('Bob', 'bob@example.com');
```

This table will serve as the source of real-time data; any new entries or updates to it will be streamed to Redpanda.
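To confirm the seed rows are in place before wiring up the pipeline, query the table:

```sql
SELECT id, name, email, created_at FROM users;
```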
Setting Up Redpanda
Redpanda is a Kafka-compatible streaming platform designed for high throughput and low latency. It can be a drop-in replacement for Kafka in most use cases. Follow these steps to set up Redpanda on your machine:
- Install Redpanda using the following command (for Linux):

```bash
curl -1sLf https://packages.vectorized.io/smash/setup.rpm.sh | sudo -E bash
sudo yum install -y redpanda
```

- Start Redpanda:

```bash
sudo systemctl start redpanda
```

- Create a topic for the streamed data. The name must match the topic the source connector will produce to, which is the topic.prefix plus the table name (cockroach-users, as configured below):

```bash
rpk topic create cockroach-users
```

This topic will hold the user data from CockroachDB as it gets ingested.
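As a quick sanity check on a default local install, confirm the broker is up and the topic exists:

```bash
rpk cluster info
rpk topic list
```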
Kafka Connect Setup
Kafka Connect is the critical component that connects CockroachDB with Redpanda and Snowflake. We will need the Kafka Connect JDBC Source connector for CockroachDB and the Snowflake Sink connector to load the data into Snowflake.
Install Kafka Connect
To set up Kafka Connect, download and extract Kafka; Kafka Connect ships with the Kafka distribution. (Version 2.8.0 is an older release, so it is hosted on the Apache archive rather than the main downloads site.)

```bash
wget https://archive.apache.org/dist/kafka/2.8.0/kafka_2.13-2.8.0.tgz
tar -xvf kafka_2.13-2.8.0.tgz
cd kafka_2.13-2.8.0
```
Install JDBC Source Connector
Next, install the JDBC Source connector for CockroachDB. (The confluent-hub CLI ships with Confluent Platform; alternatively, download the connector ZIP from Confluent Hub and place it on the worker's plugin.path.)

```bash
confluent-hub install confluentinc/kafka-connect-jdbc:latest
```
Configure Kafka Connect for CockroachDB
Create a configuration file cockroachdb-source.properties for the JDBC Source connector:
```properties
name=cockroachdb-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:postgresql://localhost:26257/realtime_data?user=root&sslmode=disable
table.whitelist=users
mode=timestamp
timestamp.column.name=created_at
topic.prefix=cockroach-
```
This configuration connects to the users table in CockroachDB and sends each row to the cockroach-users topic in Redpanda. Two details are worth noting: the connector's incrementing mode requires a monotonically increasing numeric column, which a UUID primary key is not, so we use timestamp mode on the created_at column instead; and sslmode=disable assumes an insecure local cluster, so drop it (and configure certificates) for a secure deployment.
Install Snowflake Sink Connector
Similarly, install the Snowflake Sink connector:

```bash
confluent-hub install snowflakeinc/snowflake-kafka-connector:latest
```
Configure Kafka Connect for Snowflake
Now, create a configuration file snowflake-sink.properties to load data into Snowflake:
```properties
name=snowflake-sink
connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
tasks.max=1
topics=cockroach-users
snowflake.url.name=https://<account>.snowflakecomputing.com
snowflake.user.name=<username>
snowflake.private.key=<private_key>
snowflake.database.name=<database_name>
snowflake.schema.name=<schema_name>
snowflake.topic2table.map=cockroach-users:<target_table>
key.converter=com.snowflake.kafka.connector.records.SnowflakeJsonConverter
value.converter=com.snowflake.kafka.connector.records.SnowflakeJsonConverter
buffer.count.records=1000
buffer.size.bytes=5000000
buffer.flush.time=60
```
Replace <account>, <username>, <private_key>, <database_name>, <schema_name>, and <target_table> with your Snowflake details. Note that the connector expects its documented property names (snowflake.user.name, snowflake.database.name, and so on, rather than snowflake.user or snowflake.database), maps topics to tables via snowflake.topic2table.map, and ships its own converters for landing records as JSON.
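The private key must belong to an RSA key pair registered with your Snowflake user, since the connector authenticates with key-pair authentication rather than a password. A minimal sketch of the standard setup (the rsa_key.* file names are arbitrary):

```bash
# Generate an unencrypted PKCS#8 private key and derive its public key
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
```

Register the public key in Snowflake with `ALTER USER <username> SET RSA_PUBLIC_KEY='<key body>';`, then paste the private key's body (header, footer, and line breaks stripped) as the snowflake.private.key value.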
Ingesting Data into Snowflake
With Kafka Connect configured, we can now run the connectors and start ingesting data. First, start Kafka Connect:
```bash
bin/connect-distributed.sh config/connect-distributed.properties
```
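Since Redpanda speaks the Kafka protocol, the main change the worker needs is to point at Redpanda instead of a Kafka broker. The relevant lines of config/connect-distributed.properties might look like the sketch below (9092 is Redpanda's default Kafka API port; the plugin.path assumes confluent-hub's default install location):

```properties
# Point Kafka Connect at Redpanda's Kafka API (default port 9092)
bootstrap.servers=localhost:9092
group.id=connect-cluster

# Plain JSON without embedded schemas keeps the payload flat for Snowflake
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false

# Where confluent-hub installed the connector plugins
plugin.path=/usr/share/confluent-hub-components
```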
Once Kafka Connect is running, you can load the cockroachdb-source and snowflake-sink connectors. Note that the Connect REST API accepts JSON rather than .properties files, so each configuration must first be wrapped in a JSON payload (shown after these commands):

```bash
curl -X POST -H "Content-Type: application/json" --data @cockroachdb-source.json http://localhost:8083/connectors
curl -X POST -H "Content-Type: application/json" --data @snowflake-sink.json http://localhost:8083/connectors
```
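For reference, cockroachdb-source.json is just the properties file re-expressed as a name plus a config object (snowflake-sink.json follows the same shape):

```json
{
  "name": "cockroachdb-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:postgresql://localhost:26257/realtime_data?user=root&sslmode=disable",
    "table.whitelist": "users",
    "mode": "timestamp",
    "timestamp.column.name": "created_at",
    "topic.prefix": "cockroach-"
  }
}
```

Once posted, `curl http://localhost:8083/connectors/cockroachdb-source/status` should report the connector and its task as RUNNING.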
Now, whenever new data is inserted into the users table in CockroachDB, it will flow through Redpanda into Snowflake in near real-time.
Testing the Pipeline
To test the pipeline, insert a new record into the users table in CockroachDB:

```sql
INSERT INTO users (name, email) VALUES ('Charlie', 'charlie@example.com');
```
Check the topic in Redpanda:

```bash
rpk topic consume cockroach-users
```
You should see the new record streamed to the topic. Finally, verify that the data has arrived in Snowflake by querying the target table:

```sql
SELECT * FROM <schema_name>.<target_table>;
```
The newly inserted record should now appear in Snowflake. Note that the connector lands each message as JSON in the table's RECORD_CONTENT VARIANT column, with Kafka metadata alongside it in RECORD_METADATA.
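Because the payload sits in a VARIANT column, you can project individual fields with Snowflake's path syntax. A sketch, assuming the flat-JSON converter settings shown earlier (with schema-bearing JSON, the row fields would instead be nested under a payload key):

```sql
SELECT record_content:name::STRING  AS name,
       record_content:email::STRING AS email
FROM <schema_name>.<target_table>;
```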
Conclusion
Real-time data ingestion is a vital process for modern data architectures, enabling businesses to stay competitive by acting on the freshest data. By integrating CockroachDB with Snowflake through Redpanda and Kafka Connect, we have built a resilient, high-performance pipeline that allows for seamless data flow from source to target in real time.
This setup, using open-source technologies like CockroachDB and Redpanda along with cloud-native Snowflake, proves to be both scalable and efficient for ingesting real-time data. The use of Kafka Connect simplifies the process, reducing the complexity of writing custom ingestion pipelines while providing a flexible architecture that can be extended to other data sources and sinks.
Whether you are dealing with large-scale transactional data or real-time event streams, this approach offers a robust solution for modern data-driven applications.