Change Data Capture (CDC) is an essential technique for real-time data processing: it captures insert, update, and delete operations in a persistent, time-ordered stream. Apache Phoenix, a relational SQL layer on top of HBase, has recently introduced CDC as a streaming feature, offering near-real-time row-level change visibility. This article explores the architecture, implementation, coding patterns, and practical usage of Phoenix Stream CDC, complete with code samples.
What Is Change Data Capture (CDC)? Context and Fundamentals
CDC refers to a set of design patterns focused on identifying, capturing, and delivering data changes—deltas—for downstream systems and analytics. It’s widely used in data warehousing, ETL, real-time processing, auditing, and synchronization workflows.
Key methodologies include:
- Timestamps: Rows include a LAST_MODIFIED column; changed rows are those beyond a checkpoint timestamp.
- Versioning: Rows are tagged with increasing version numbers; a newer version implies a change.
- Triggers: Database triggers log changes into a queue or log table.
- Transaction log reading: Changes are captured by scanning the DBMS's commit logs.
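The timestamp methodology above can be sketched in plain Java. This is a minimal in-memory illustration under stated assumptions (the Row record and the checkpoint variable are hypothetical, not Phoenix APIs): rows whose LAST_MODIFIED exceeds the checkpoint are treated as changed, and the checkpoint then advances to the largest timestamp seen.

```java
import java.util.ArrayList;
import java.util.List;

public class TimestampCdcSketch {
    // Hypothetical row: a key plus a LAST_MODIFIED epoch-millis column.
    record Row(String key, long lastModified) {}

    // Return rows changed since the checkpoint; in a real system this
    // would be a "SELECT ... WHERE LAST_MODIFIED > ?" query.
    static List<Row> changedSince(List<Row> table, long checkpoint) {
        List<Row> delta = new ArrayList<>();
        for (Row r : table) {
            if (r.lastModified() > checkpoint) {
                delta.add(r);
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        List<Row> table = List.of(
            new Row("a", 100L), new Row("b", 250L), new Row("c", 300L));
        long checkpoint = 200L;  // last successfully processed timestamp
        List<Row> delta = changedSince(table, checkpoint);
        for (Row r : delta) {
            checkpoint = Math.max(checkpoint, r.lastModified());
        }
        System.out.println(delta.size() + " " + checkpoint); // prints "2 300"
    }
}
```

The same checkpoint-and-advance pattern reappears later with Phoenix's PHOENIX_ROW_TIMESTAMP() in place of an application-managed column.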
Phoenix’s CDC mechanism relies on a combination of uncovered, time-based indexing and Max Lookback to capture and serve changes efficiently.
Phoenix CDC Architecture: Indexes, Streams, and Metadata
Phoenix’s CDC design introduces new constructs:
- Uncovered Index + Max Lookback: Controls history retention and provides an index to scan changes by mutation time.
- CDC Streams and Partitions: When CDC is enabled on a table, Phoenix creates a stream storing changes in time order, within a TTL (default: 24 hours).
- Stream Partitions: The stream is partitioned by HBase data table regions.
  - Open partitions correspond to active regions.
  - Closed partitions are archived when HBase regions are split or merged.
- SYSTEM Tables for Metadata:
  - SYSTEM.CDC_STREAM_STATUS ensures one active stream per table and records stream names.
  - SYSTEM.CDC_STREAM holds partition metadata: partition IDs (encoded region names), parent-child relationships, time windows, and key boundaries.
- When regions split, a coprocessor updates partition metadata to maintain lineage for ordered consumption.
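The parent-child relationships in that metadata are what let a consumer keep changes ordered across region splits. A hedged sketch of the idea (PartitionMeta and its fields are illustrative, not the actual SYSTEM.CDC_STREAM schema): order partitions so that every closed parent is drained before its daughter partitions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PartitionLineageSketch {
    // Illustrative partition metadata; the real SYSTEM.CDC_STREAM columns differ.
    record PartitionMeta(String partitionId, String parentId, boolean closed) {}

    // Order partitions so every parent precedes its daughters.
    static List<PartitionMeta> lineageOrder(List<PartitionMeta> parts) {
        Map<String, PartitionMeta> byId = new HashMap<>();
        for (PartitionMeta p : parts) byId.put(p.partitionId(), p);
        List<PartitionMeta> ordered = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (PartitionMeta p : parts) visit(p, byId, visited, ordered);
        return ordered;
    }

    private static void visit(PartitionMeta p, Map<String, PartitionMeta> byId,
                              Set<String> visited, List<PartitionMeta> out) {
        if (!visited.add(p.partitionId())) return;     // already placed
        PartitionMeta parent = p.parentId() == null ? null : byId.get(p.parentId());
        if (parent != null) visit(parent, byId, visited, out); // parent first
        out.add(p);
    }

    public static void main(String[] args) {
        // A parent region "p0" split into daughters "d1" and "d2".
        List<PartitionMeta> parts = List.of(
            new PartitionMeta("d1", "p0", false),
            new PartitionMeta("p0", null, true),
            new PartitionMeta("d2", "p0", false));
        System.out.println(lineageOrder(parts).get(0).partitionId()); // prints "p0"
    }
}
```

Reading the closed parent to completion before its daughters preserves global time order for the key range that was split.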
Enabling CDC: SQL and CLI
Before using Phoenix CDC, you must enable it on the target table.
Syntax:
CREATE CDC <cdc_object_name> ON <table_name>
This operation:
- Creates an uncovered global index on (PARTITION_ID(), PHOENIX_ROW_TIMESTAMP()).
- Registers metadata entries in the SYSTEM.CDC_* tables.
Example:
CREATE CDC STREAM_SALES ON SALES_TABLE;
This command:
- Establishes the CDC stream STREAM_SALES.
- Generates the index and metadata capturing changes for SALES_TABLE.
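From an application, the DDL above can be issued through the standard Phoenix JDBC driver. A minimal sketch, assuming placeholder names (the ZooKeeper quorum in the JDBC URL is hypothetical, and the execution lines are commented out because they require a live cluster):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EnableCdcSketch {
    // Build the CREATE CDC statement for a given stream and table name.
    static String createCdcDdl(String cdcName, String tableName) {
        return "CREATE CDC " + cdcName + " ON " + tableName;
    }

    public static void main(String[] args) throws Exception {
        String ddl = createCdcDdl("STREAM_SALES", "SALES_TABLE");
        System.out.println(ddl); // prints "CREATE CDC STREAM_SALES ON SALES_TABLE"

        // Against a live cluster (placeholder ZooKeeper quorum):
        // try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
        //      Statement stmt = conn.createStatement()) {
        //     stmt.execute(ddl);
        // }
    }
}
```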
Understanding Key Functions: PARTITION_ID() and PHOENIX_ROW_TIMESTAMP()
Two critical server-side functions fuel Phoenix CDC:
- PARTITION_ID(): Returns a 32-byte encoded string representing the HBase region name where a modification occurs; used as the index partition key.
- PHOENIX_ROW_TIMESTAMP(): Retrieves the mutation timestamp from the empty cell; used to control ordering and resumption of scans.
Partitions ensure that records are laid out sequentially per region. When region splits happen, new daughter partitions receive mutations, while parent partitions are archived (marked closed) but still consumable until TTL expiry.
Consuming CDC Streams: Coding Patterns
To read CDC streams, applications typically:
- Track per-partition offsets using PHOENIX_ROW_TIMESTAMP().
- Query open partitions first, then closed ones.
- Use time-range filtering to resume scans.
- Handle optional pre-image and post-image support.
A simplified Java usage example:
// Pseudocode illustrating the pattern
String cdcQuery = "SELECT * FROM STREAM_SALES " +
    "WHERE PARTITION_ID() = ? AND PHOENIX_ROW_TIMESTAMP() > ? " +
    "ORDER BY PHOENIX_ROW_TIMESTAMP()";
PreparedStatement stmt = connection.prepareStatement(cdcQuery);
stmt.setString(1, partitionId);
stmt.setLong(2, lastTimestamp);
ResultSet rs = stmt.executeQuery();
while (rs.next()) {
    byte[] preImage = rs.getBytes("PRE_IMAGE");   // optional
    byte[] postImage = rs.getBytes("POST_IMAGE"); // optional
    // Process the mutation
    lastTimestamp = rs.getLong("PHOENIX_ROW_TIMESTAMP()");
}
Key steps:
- Set the initial timestamp to 0 or a persisted offset.
- Consume changes with ordering guaranteed via timestamps.
- Persist the last timestamp per partition to support fault-tolerant resumption.
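The persistence step can be sketched with a simple key-value store. Here java.util.Properties backed by a local file stands in for whatever durable store (ZooKeeper, a database table, etc.) a real consumer would use; the partition IDs and offsets are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class OffsetStoreSketch {
    private final Path file;

    OffsetStoreSketch(Path file) { this.file = file; }

    // Load the last committed timestamp for a partition, defaulting to 0.
    long load(String partitionId) throws IOException {
        Properties props = readAll();
        return Long.parseLong(props.getProperty(partitionId, "0"));
    }

    // Persist the latest processed timestamp for a partition.
    void commit(String partitionId, long timestamp) throws IOException {
        Properties props = readAll();
        props.setProperty(partitionId, Long.toString(timestamp));
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "CDC partition offsets");
        }
    }

    private Properties readAll() throws IOException {
        Properties props = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                props.load(in);
            }
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("offsets", ".properties");
        OffsetStoreSketch store = new OffsetStoreSketch(tmp);
        System.out.println(store.load("region-1"));  // prints 0 (no offset yet)
        store.commit("region-1", 1700000000000L);
        System.out.println(store.load("region-1"));  // prints 1700000000000
        Files.delete(tmp);
    }
}
```

On restart, the consumer calls load() per partition and resumes its PHOENIX_ROW_TIMESTAMP() range scan from that value.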
This approach seamlessly handles region splits because partitions themselves reflect region lineage.
Advanced Features: Pre-Image / Post-Image, Fault-Tolerance, and TTL
Phoenix CDC offers:
- Optional Pre-Image & Post-Image: You can capture how a row looked before and after the mutation, which is critical for auditing or complex synchronization.
- Fault-Tolerance & Resumability: Consumers resume scanning per partition using PHOENIX_ROW_TIMESTAMP() offsets.
- Time-to-Live (TTL): Stream data persists only for a limited duration (typically 24 hours) and auto-expires, cleaning up stale partitions and avoiding storage bloat.
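TTL-based retention behaves like a sliding time window: anything older than "now minus TTL" is no longer visible. A small sketch of that visibility rule (the Event record is illustrative; the 24-hour value matches the default mentioned above):

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class TtlWindowSketch {
    record Event(String key, long timestampMillis) {}

    // Keep only events still inside the retention window.
    static List<Event> withinTtl(List<Event> events, long nowMillis, Duration ttl) {
        long cutoff = nowMillis - ttl.toMillis();
        List<Event> live = new ArrayList<>();
        for (Event e : events) {
            if (e.timestampMillis() > cutoff) {
                live.add(e);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        long now = 100_000_000L;
        Duration ttl = Duration.ofHours(24); // default retention per the article
        List<Event> events = List.of(
            new Event("old", now - ttl.toMillis() - 1), // expired
            new Event("recent", now - 60_000L));        // still visible
        System.out.println(withinTtl(events, now, ttl).size()); // prints 1
    }
}
```

The practical consequence: a consumer whose last committed offset falls behind the cutoff has permanently lost the intervening changes, which is why read frequency must comfortably beat the TTL.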
Example End-to-End Flow
Let’s illustrate a full scenario:
- Enable CDC on the table: CREATE CDC STREAM_ORDERS ON ORDERS;
- Phoenix creates an uncovered index on (PARTITION_ID(), PHOENIX_ROW_TIMESTAMP()) and populates the metadata tables.
- HBase regions receive mutations (INSERT/UPDATE/DELETE), and the index logs changes grouped by partition.
- The consumer application:
  - Reads partition metadata from SYSTEM.CDC_STREAM.
  - For each partition, queries new changes where the timestamp is greater than the last-known offset.
  - Handles region splits gracefully: new partitions appear with new IDs, and lineage is maintained via metadata.
- The consumer optionally stores pre/post images for each change.
- Records older than the TTL are purged automatically.
Putting It All Together: Pseudocode with Metadata
// Pseudocode
List<PartitionMeta> partitions = query("SELECT * FROM SYSTEM.CDC_STREAM WHERE TABLE_NAME='ORDERS'");
for (PartitionMeta pm : partitions) {
    String pid = pm.getPartitionId();
    long offset = resumeOffsets.getOrDefault(pid, 0L);
    String cdcSql = "SELECT *, PRE_IMAGE, POST_IMAGE FROM ORDERS$CDC_STREAM " +
        "WHERE PARTITION_ID() = '" + pid + "' " +
        "AND PHOENIX_ROW_TIMESTAMP() > " + offset +
        " ORDER BY PHOENIX_ROW_TIMESTAMP()";
    ResultSet rs = connection.createStatement().executeQuery(cdcSql);
    while (rs.next()) {
        // parse pre/post images
        // transform or process the change
        offset = rs.getLong("PHOENIX_ROW_TIMESTAMP()");
    }
    resumeOffsets.put(pid, offset);
}
// persist resumeOffsets externally for fault tolerance
Benefits & Use Cases
Benefits:
- SQL-Based Streaming: Works side-by-side with regular queries.
- Time-Ordered Consistency: Guaranteed order via timestamps.
- Efficient Partitioned Consumption: Scalable streaming per region.
- Low Operational Overhead: TTL-based retention and auto-cleanup.
- Auditing Capability: Optional row "before"/"after" images.
Use Cases:
- Real-time data syncing to search indexes (e.g., Elasticsearch).
- Audit trails and change logs.
- Event-driven microservices based on DB changes.
- Replication to external systems or data warehouses.
- Streaming analytics or dashboards.
Considerations & Caveats
- TTL window constraints: By default, only 24 hours of history is retained; clients must read frequently to avoid missing data.
- Region split mechanics: Consumers must track new partitions and offsets to avoid duplication or omission, though metadata addresses lineage.
- Schema changes: If a CDC-enabled table is transformed or renamed, external schema registry integration may break; take care to export schema changes.
- Resource footprint: Indexing and CDC metadata consume space and I/O; plan capacity accordingly.
- Single active stream per table: Attempting to create another active CDC stream on the same table fails.
Summary Table
Feature | Description
---|---
CDC Enablement | CREATE CDC ... ON <table>
Index Mechanism | Uncovered index on (PARTITION_ID(), PHOENIX_ROW_TIMESTAMP())
Metadata Tables | SYSTEM.CDC_STREAM_STATUS, SYSTEM.CDC_STREAM
Partitions | Open (active) and closed (archived) partitions
Split Handling | Master coprocessor maintains lineage on splits
Consumption Strategy | SQL query + timestamp offset per partition
History Retention (TTL) | Default 24 hours (configurable)
Pre/Post Image Support | Optional detailed change tracking
Fault Tolerance | Resume via stored offsets within partitions
Schema Change Caveats | Must re-export schema to registry if changed
Conclusion
Apache Phoenix’s introduction of streaming Change Data Capture fundamentally enhances how real-time data changes are captured, stored, and consumed within HBase-backed environments. By combining:
- Uncovered time-based indexing for ordered change retrieval
- Partition-aware streams aligned with HBase region management
- Robust metadata tracking via dedicated system tables
- Optional pre/post image support for comprehensive auditing
- Built-in TTL retention for automation and storage hygiene
Phoenix CDC delivers a powerful, SQL-accessible tool for downstream applications, microservices, analytics platforms, and integration systems requiring near-real-time data flows. Its resilience to region splits, fault recovery, and generous configuration options make it an excellent fit for cloud-native architectures and event-driven systems.
That said, success with Phoenix CDC hinges on disciplined offset management, attention to TTL expiration, and awareness of potential schema or region management pitfalls. Yet with these obligations met, users can unlock streaming capabilities without introducing external CDC frameworks—purely within the Phoenix/HBase ecosystem.
In sum, Apache Phoenix Stream CDC offers a seamless, scalable, and integrated mechanism for capturing and streaming database changes, providing SQL-accessible, time-ordered, partition-safe updates that are ready for a wide array of real-time use cases.