Modern applications generate massive volumes of logs that are invaluable for debugging, monitoring, auditing, and security analysis. However, logs often contain sensitive information such as email addresses, phone numbers, API keys, authentication tokens, credit card numbers, or personally identifiable information (PII). Persisting such data in plain text logs introduces serious compliance, privacy, and security risks.
In Spring Boot–based systems, logs are typically emitted at very high throughput and across many threads. This makes it impractical to sanitize logs using naive string replacement or ad hoc regular expressions. What we need instead is a deterministic, high-performance, real-time log interception mechanism that can reliably detect sensitive values and replace them with consistent, irreversible tokens.
This article presents a comprehensive approach to solving this problem by combining the Aho-Corasick string-matching algorithm with deterministic tokenization, implemented directly in a Spring Boot logging pipeline. The result is a fast, scalable, and secure log-sanitization solution suitable for production systems.
The Problem of Sensitive Data in Logs
Logging frameworks are intentionally designed to be simple and expressive. Developers frequently log request payloads, headers, database parameters, and error objects. Over time, this convenience leads to accidental exposure of:
- Passwords and secrets
- Session identifiers and JWTs
- Credit card and banking data
- Email addresses and phone numbers
- National identifiers
Once logged, this data may be shipped to centralized log stores, third-party monitoring platforms, or long-term archives. Even if access controls are strong, the mere existence of sensitive values in logs is often a compliance violation.
A robust solution must meet several requirements:
- Operate in real time, before logs are written
- Support high throughput with minimal latency
- Detect many sensitive patterns efficiently
- Replace values deterministically for correlation
- Avoid false negatives and excessive false positives
This is where Aho-Corasick and deterministic tokenization excel.
Overview of the Aho-Corasick Algorithm
The Aho-Corasick algorithm is a multi-pattern string matching algorithm that allows simultaneous searching for many keywords in a single pass over the input text. Unlike naive approaches that scan once per pattern, Aho-Corasick builds a finite automaton (a trie with failure links) that matches all patterns in linear time.
Key characteristics of Aho-Corasick:
- Time complexity: O(n + m + z), where n is input length, m is total pattern length, and z is number of matches
- Supports hundreds or thousands of patterns efficiently
- Deterministic and predictable performance
- Ideal for streaming or real-time text processing
For log sanitization, this makes it possible to scan each log message once while detecting all sensitive keywords or markers.
Deterministic Tokenization Explained
Once a sensitive value is detected, it must be removed or replaced. Simply masking it with “****” is often insufficient because it breaks debugging and traceability. Deterministic tokenization offers a better alternative.
Deterministic tokenization replaces a sensitive value with a token that:
- Is consistent for the same input value
- Cannot be reversed without a secret key
- Preserves uniqueness and correlation
For example, every occurrence of the same value maps to the same token:
user@example.com → TOK_8f3a12c9
user@example.com → TOK_8f3a12c9
This allows developers to correlate log entries without revealing the original sensitive data.
A common approach uses a keyed cryptographic hash (such as HMAC-SHA256) and truncates the result to a manageable token length.
Architecture for Real-Time Log Interception in Spring Boot
In Spring Boot applications, logs typically flow through a logging framework such as Logback or Log4j2. To intercept logs in real time, we insert a custom component into the logging pipeline.
The high-level architecture looks like this:
- Application emits a log message
- Logging framework invokes a custom converter or appender
- Log message is scanned using Aho-Corasick
- Sensitive values are deterministically tokenized
- Sanitized log message is written to output
This ensures that sensitive data never reaches disk or external systems.
Defining Sensitive Patterns
Before building the matcher, we must define what constitutes sensitive data. This typically includes both static markers and dynamic values.
Examples include:
- Field names such as `password=`, `token=`, `authorization:`
- JSON keys like `"email"`, `"ssn"`, `"creditCard"`
- Header names such as `Authorization`, `X-API-KEY`
These markers act as anchors. Once found, we extract the associated value using deterministic parsing logic.
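As a minimal illustration of that parsing step, the helper below reads the characters following a matched marker until a common delimiter is hit. The class name and delimiter set are assumptions for this sketch; real systems need per-marker rules (JSON quoting, header syntax, and so on). The Logback converter shown later reuses this helper.
public final class SensitiveValueParser {

    private SensitiveValueParser() {}

    // Given the offset just past a matched marker such as "password=",
    // capture characters until a delimiter (comma, semicolon, ampersand,
    // whitespace, or quote) ends the value. Delimiters are illustrative.
    public static String extractValue(String message, int valueStart) {
        int end = valueStart;
        while (end < message.length() && ",;& \"".indexOf(message.charAt(end)) < 0) {
            end++;
        }
        return message.substring(valueStart, end);
    }
}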
Implementing Aho-Corasick in Java
Below is a simplified example of building an Aho-Corasick automaton in Java. In production systems you may prefer a well-tested open-source implementation (for example, the `org.ahocorasick` library), but understanding the mechanics is important.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class AhoCorasickMatcher {

    // A match records where the pattern starts in the scanned text.
    public record Match(int start, String pattern) {}

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Node failure;
        boolean isTerminal;
        String pattern;
    }

    private final Node root = new Node();

    public void addPattern(String pattern) {
        Node current = root;
        for (char c : pattern.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.isTerminal = true;
        current.pattern = pattern;
    }

    // Breadth-first construction of failure links, the classic second phase.
    public void buildFailureLinks() {
        Queue<Node> queue = new ArrayDeque<>();
        root.failure = root;
        queue.add(root);
        while (!queue.isEmpty()) {
            Node current = queue.poll();
            for (Map.Entry<Character, Node> entry : current.children.entrySet()) {
                char c = entry.getKey();
                Node child = entry.getValue();
                Node failure = current.failure;
                while (failure != root && !failure.children.containsKey(c)) {
                    failure = failure.failure;
                }
                if (failure.children.containsKey(c) && failure.children.get(c) != child) {
                    child.failure = failure.children.get(c);
                } else {
                    child.failure = root;
                }
                queue.add(child);
            }
        }
    }

    public List<Match> match(String text) {
        List<Match> matches = new ArrayList<>();
        Node current = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (current != root && !current.children.containsKey(c)) {
                current = current.failure;
            }
            current = current.children.getOrDefault(c, root);
            // Follow failure links so patterns that end here as suffixes
            // (e.g. "he" inside "she") are also reported.
            for (Node node = current; node != root; node = node.failure) {
                if (node.isTerminal) {
                    matches.add(new Match(i - node.pattern.length() + 1, node.pattern));
                }
            }
        }
        return matches;
    }
}
This matcher can detect multiple sensitive markers in a single pass over the log message.
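A quick usage sketch, with illustrative marker patterns:
import java.util.List;

public class MatcherDemo {
    public static void main(String[] args) {
        AhoCorasickMatcher matcher = new AhoCorasickMatcher();
        matcher.addPattern("password=");
        matcher.addPattern("token=");
        matcher.buildFailureLinks(); // must run before matching

        // One pass over the message reports both markers.
        List<AhoCorasickMatcher.Match> matches =
                matcher.match("login failed: password=hunter2&token=abc123");
        matches.forEach(m -> System.out.println(m.start() + " -> " + m.pattern()));
    }
}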
Deterministic Tokenization Implementation
Next, we implement a tokenizer that produces consistent, irreversible tokens using a secret key.
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class DeterministicTokenizer {

    private final Mac mac;

    public DeterministicTokenizer(String secretKey) {
        try {
            mac = Mac.getInstance("HmacSHA256");
            // Pin the charset so tokens are identical across platforms.
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Unable to initialize HMAC", e);
        }
    }

    // Note: Mac is stateful and not thread-safe; see the thread-safety
    // discussion later in the article.
    public String tokenize(String value) {
        byte[] hash = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
        return "TOK_" + HexFormat.of().formatHex(hash).substring(0, 12);
    }
}
The same input value will always produce the same token, enabling reliable correlation across log entries.
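A quick check of that property (the key is a placeholder; real deployments would load it from secure configuration):
public class TokenizerDemo {
    public static void main(String[] args) {
        DeterministicTokenizer tokenizer = new DeterministicTokenizer("demo-secret-key");
        System.out.println(tokenizer.tokenize("user@example.com"));
        System.out.println(tokenizer.tokenize("user@example.com")); // same token both times
    }
}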
Intercepting Logs with Logback
In Spring Boot, Logback is the default logging backend. We can create a custom converter that sanitizes messages before they are written. One caveat shapes the design: Logback instantiates pattern converters reflectively through a no-arg constructor, so the matcher and tokenizer are built once as shared statics rather than passed in through the constructor.
import java.util.List;

import ch.qos.logback.classic.pattern.MessageConverter;
import ch.qos.logback.classic.spi.ILoggingEvent;

public class SanitizingMessageConverter extends MessageConverter {

    // Built once and shared; the patterns and env var name are illustrative.
    private static final AhoCorasickMatcher MATCHER = buildMatcher();
    private static final DeterministicTokenizer TOKENIZER = new DeterministicTokenizer(
            System.getenv().getOrDefault("LOG_TOKEN_KEY", "dev-only-fallback"));

    private static AhoCorasickMatcher buildMatcher() {
        AhoCorasickMatcher matcher = new AhoCorasickMatcher();
        matcher.addPattern("password=");
        matcher.addPattern("token=");
        matcher.buildFailureLinks();
        return matcher;
    }

    @Override
    public String convert(ILoggingEvent event) {
        return sanitize(event.getFormattedMessage());
    }

    private String sanitize(String message) {
        List<AhoCorasickMatcher.Match> matches = MATCHER.match(message);
        String sanitized = message;
        for (AhoCorasickMatcher.Match match : matches) {
            // Extract from the original message so that match offsets are
            // not shifted by replacements already applied to 'sanitized'.
            String value = SensitiveValueParser.extractValue(
                    message, match.start() + match.pattern().length());
            String token = TOKENIZER.tokenize(value);
            sanitized = sanitized.replace(value, token);
        }
        return sanitized;
    }
}
This converter ensures that every log message is sanitized synchronously and deterministically.
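To wire the converter into the pipeline, register it as a conversion word and use that word in the encoder pattern. A minimal sketch of a logback-spring.xml, assuming Logback's standard conversionRule mechanism; the package name and conversion word are placeholders:
<configuration>
  <conversionRule conversionWord="sanitizedMsg"
                  converterClass="com.example.logging.SanitizingMessageConverter"/>

  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <!-- %sanitizedMsg replaces the usual %msg so every message is scanned -->
      <pattern>%d{ISO8601} %-5level [%thread] %logger{36} - %sanitizedMsg%n</pattern>
    </encoder>
  </appender>

  <root level="INFO">
    <appender-ref ref="CONSOLE"/>
  </root>
</configuration>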
Performance and Thread Safety Considerations
High-throughput logging demands extreme efficiency. The Aho-Corasick matcher should be built once at application startup and reused across threads. The matcher itself is read-only after construction, making it inherently thread-safe.
For tokenization, care must be taken because cryptographic primitives such as Mac are not thread-safe. Solutions include:
- Using a `ThreadLocal<Mac>` (sketched below)
- Synchronizing access to the tokenizer
- Creating per-thread tokenizer instances
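A minimal sketch of the ThreadLocal approach, mirroring the HMAC scheme of the tokenizer above (the class name is illustrative):
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class ThreadSafeTokenizer {

    private final byte[] keyBytes;
    // Each thread lazily gets its own Mac, since Mac instances are stateful.
    private final ThreadLocal<Mac> mac = ThreadLocal.withInitial(this::newMac);

    public ThreadSafeTokenizer(String secretKey) {
        this.keyBytes = secretKey.getBytes(StandardCharsets.UTF_8);
    }

    private Mac newMac() {
        try {
            Mac m = Mac.getInstance("HmacSHA256");
            m.init(new SecretKeySpec(keyBytes, "HmacSHA256"));
            return m;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public String tokenize(String value) {
        byte[] hash = mac.get().doFinal(value.getBytes(StandardCharsets.UTF_8));
        return "TOK_" + HexFormat.of().formatHex(hash).substring(0, 12);
    }
}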
Memory allocations should be minimized, and string replacement logic should avoid repeated copying where possible.
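For instance, instead of calling String.replace once per match, which copies the whole string each time, the message can be rebuilt in a single pass. A sketch, assuming the matches have already been resolved into sorted, non-overlapping replacement spans (the Replacement record is hypothetical):
import java.util.List;

public final class SinglePassRewriter {

    // A resolved replacement: the [start, end) span of the sensitive value
    // in the original message, and the token that should replace it.
    public record Replacement(int start, int end, String token) {}

    // Rebuilds the message once, appending untouched segments and tokens,
    // rather than copying the whole string for every replacement.
    public static String rewrite(String message, List<Replacement> replacements) {
        StringBuilder out = new StringBuilder(message.length());
        int pos = 0;
        for (Replacement r : replacements) {
            out.append(message, pos, r.start()); // untouched prefix
            out.append(r.token());               // token instead of the value
            pos = r.end();
        }
        out.append(message, pos, message.length());
        return out.toString();
    }
}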
Handling Structured Logs and JSON Payloads
Many Spring Boot applications emit structured JSON logs. In such cases, sanitization can operate either on the raw string or after parsing the JSON.
A common hybrid approach is:
- Use Aho-Corasick to detect sensitive field names
- Parse only the affected portions
- Tokenize values deterministically
This avoids the overhead of fully parsing every log entry while preserving accuracy.
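A rough sketch of this hybrid flow, assuming Jackson on the classpath (a Spring Boot default) and the matcher and tokenizer defined earlier; the key set is illustrative and only top-level fields are handled:
import java.util.Set;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class JsonLogSanitizer {

    private static final Set<String> SENSITIVE_KEYS = Set.of("email", "ssn", "creditCard");

    private final ObjectMapper mapper = new ObjectMapper();
    private final AhoCorasickMatcher matcher;          // patterns: "email", "ssn", ...
    private final DeterministicTokenizer tokenizer;

    public JsonLogSanitizer(AhoCorasickMatcher matcher, DeterministicTokenizer tokenizer) {
        this.matcher = matcher;
        this.tokenizer = tokenizer;
    }

    public String sanitize(String line) {
        // Fast path: skip JSON parsing entirely when no sensitive key occurs.
        if (matcher.match(line).isEmpty()) {
            return line;
        }
        try {
            JsonNode root = mapper.readTree(line);
            if (root instanceof ObjectNode node) {
                // Top-level keys only; nested objects would need recursion.
                for (String key : SENSITIVE_KEYS) {
                    JsonNode value = node.get(key);
                    if (value != null && value.isTextual()) {
                        node.put(key, tokenizer.tokenize(value.asText()));
                    }
                }
                return node.toString();
            }
            return line;
        } catch (JsonProcessingException e) {
            return line; // not valid JSON; fall back to the string-based path
        }
    }
}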
Testing and Validation
Thorough testing is essential. Test cases should cover:
- High-volume concurrent logging
- Repeated values producing identical tokens
- Mixed sensitive and non-sensitive fields
- Partial matches and overlapping patterns
Performance testing under load ensures that the logging pipeline does not become a bottleneck.
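For example, the determinism requirement can be pinned down with a few JUnit 5 assertions (the key and values are illustrative):
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertNotEquals;

import org.junit.jupiter.api.Test;

class DeterministicTokenizerTest {

    private final DeterministicTokenizer tokenizer = new DeterministicTokenizer("test-key");

    @Test
    void repeatedValuesProduceIdenticalTokens() {
        assertEquals(tokenizer.tokenize("user@example.com"),
                     tokenizer.tokenize("user@example.com"));
    }

    @Test
    void differentValuesProduceDifferentTokens() {
        assertNotEquals(tokenizer.tokenize("user@example.com"),
                        tokenizer.tokenize("admin@example.com"));
    }

    @Test
    void tokenDoesNotLeakTheOriginalValue() {
        assertFalse(tokenizer.tokenize("user@example.com").contains("user@example.com"));
    }
}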
Conclusion
Sanitizing logs in real time is no longer optional in modern, security-conscious systems. As applications grow in complexity and regulatory requirements tighten, logging strategies must evolve beyond simple masking and manual discipline.
By combining the Aho-Corasick algorithm with deterministic tokenization, Spring Boot applications can achieve a powerful balance of performance, security, and observability. Aho-Corasick provides linear-time, multi-pattern detection that scales effortlessly with log volume, while deterministic tokenization ensures that sensitive values are replaced consistently without sacrificing traceability.
Embedding this logic directly into the logging pipeline guarantees that sensitive data is never written, stored, or transmitted in plain form. The approach is flexible enough to support unstructured text, structured JSON logs, and evolving sets of sensitive patterns. With careful attention to thread safety, performance optimization, and testing, this solution can operate transparently at scale.
Ultimately, this architecture transforms logging from a potential liability into a secure, compliant, and trustworthy diagnostic tool. By treating log sanitization as a first-class concern and leveraging proven algorithms, development teams can confidently meet both operational and regulatory demands without compromising on insight or performance.