Modern applications generate massive volumes of logs that are invaluable for debugging, monitoring, auditing, and security analysis. However, logs often contain sensitive information such as email addresses, phone numbers, API keys, authentication tokens, credit card numbers, or personally identifiable information (PII). Persisting such data in plain text logs introduces serious compliance, privacy, and security risks.
In Spring Boot–based systems, logs are typically emitted at very high throughput and across many threads. This makes it impractical to sanitize logs using naive string replacement or ad hoc regular expressions. What we need instead is a deterministic, high-performance, real-time log interception mechanism that can reliably detect sensitive values and replace them with consistent, irreversible tokens.
This article presents a comprehensive approach to solving this problem by combining the Aho-Corasick string-matching algorithm with deterministic tokenization, implemented directly in a Spring Boot logging pipeline. The result is a fast, scalable, and secure log-sanitization solution suitable for production systems.
The Problem of Sensitive Data in Logs
Logging frameworks are intentionally designed to be simple and expressive. Developers frequently log request payloads, headers, database parameters, and error objects. Over time, this convenience leads to accidental exposure of:
- Passwords and secrets
- Session identifiers and JWTs
- Credit card and banking data
- Email addresses and phone numbers
- National identifiers
Once logged, this data may be shipped to centralized log stores, third-party monitoring platforms, or long-term archives. Even if access controls are strong, the mere existence of sensitive values in logs is often a compliance violation.
A robust solution must meet several requirements:
- Operate in real time, before logs are written
- Support high throughput with minimal latency
- Detect many sensitive patterns efficiently
- Replace values deterministically for correlation
- Avoid false negatives and excessive false positives
This is where Aho-Corasick and deterministic tokenization excel.
Overview of the Aho-Corasick Algorithm
The Aho-Corasick algorithm is a multi-pattern string matching algorithm that allows simultaneous searching for many keywords in a single pass over the input text. Unlike naive approaches that scan once per pattern, Aho-Corasick builds a finite automaton (a trie with failure links) that matches all patterns in linear time.
Key characteristics of Aho-Corasick:
- Time complexity: O(n + m + z), where n is input length, m is total pattern length, and z is number of matches
- Supports hundreds or thousands of patterns efficiently
- Deterministic and predictable performance
- Ideal for streaming or real-time text processing
For log sanitization, this makes it possible to scan each log message once while detecting all sensitive keywords or markers.
Deterministic Tokenization Explained
Once a sensitive value is detected, it must be removed or replaced. Simply masking it with “****” is often insufficient because it breaks debugging and traceability. Deterministic tokenization offers a better alternative.
Deterministic tokenization replaces a sensitive value with a token that:
- Is consistent for the same input value
- Cannot be reversed without a secret key
- Preserves uniqueness and correlation
For example, every occurrence of the same value maps to the same token:
user@example.com → TOK_8f3a12c9
user@example.com → TOK_8f3a12c9
This allows developers to correlate log entries without revealing the original sensitive data.
A common approach uses a keyed cryptographic hash (such as HMAC-SHA256) and truncates the result to a manageable token length.
Architecture for Real-Time Log Interception in Spring Boot
In Spring Boot applications, logs typically flow through a logging framework such as Logback or Log4j2. To intercept logs in real time, we insert a custom component into the logging pipeline.
The high-level architecture looks like this:
- Application emits a log message
- Logging framework invokes a custom converter or appender
- Log message is scanned using Aho-Corasick
- Sensitive values are deterministically tokenized
- Sanitized log message is written to output
This ensures that sensitive data never reaches disk or external systems.
Defining Sensitive Patterns
Before building the matcher, we must define what constitutes sensitive data. This typically includes both static markers and dynamic values.
Examples include:
- Field names such as `password=`, `token=`, `authorization:`
- JSON keys like `"email"`, `"ssn"`, `"creditCard"`
- Header names such as `Authorization`, `X-API-KEY`
These markers act as anchors. Once found, we extract the associated value using deterministic parsing logic.
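As a minimal illustration of that parsing step, the helper below reads the characters following a matched marker until a common delimiter is hit. The class name and delimiter set are assumptions for this sketch; real systems need per-marker rules (JSON quoting, header syntax, and so on). The Logback converter shown later reuses this helper.
public final class SensitiveValueParser {

    private SensitiveValueParser() {}

    // Given the offset just past a matched marker such as "password=",
    // capture characters until a delimiter (comma, semicolon, ampersand,
    // whitespace, or quote) ends the value. Delimiters are illustrative.
    public static String extractValue(String message, int valueStart) {
        int end = valueStart;
        while (end < message.length() && ",;& \"".indexOf(message.charAt(end)) < 0) {
            end++;
        }
        return message.substring(valueStart, end);
    }
}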
Implementing Aho-Corasick in Java
Below is a simplified example of building an Aho-Corasick automaton in Java. In production systems you may prefer a well-tested open-source implementation (for example, the `org.ahocorasick` library), but understanding the mechanics is important.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class AhoCorasickMatcher {

    // A match records where the pattern starts in the scanned text.
    public record Match(int start, String pattern) {}

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Node failure;
        boolean isTerminal;
        String pattern;
    }

    private final Node root = new Node();

    public void addPattern(String pattern) {
        Node current = root;
        for (char c : pattern.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.isTerminal = true;
        current.pattern = pattern;
    }

    // Breadth-first construction of failure links, the classic second phase.
    public void buildFailureLinks() {
        Queue<Node> queue = new ArrayDeque<>();
        root.failure = root;
        queue.add(root);
        while (!queue.isEmpty()) {
            Node current = queue.poll();
            for (Map.Entry<Character, Node> entry : current.children.entrySet()) {
                char c = entry.getKey();
                Node child = entry.getValue();
                Node failure = current.failure;
                while (failure != root && !failure.children.containsKey(c)) {
                    failure = failure.failure;
                }
                if (failure.children.containsKey(c) && failure.children.get(c) != child) {
                    child.failure = failure.children.get(c);
                } else {
                    child.failure = root;
                }
                queue.add(child);
            }
        }
    }

    public List<Match> match(String text) {
        List<Match> matches = new ArrayList<>();
        Node current = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (current != root && !current.children.containsKey(c)) {
                current = current.failure;
            }
            current = current.children.getOrDefault(c, root);
            // Follow failure links so patterns that end here as suffixes
            // (e.g. "he" inside "she") are also reported.
            for (Node node = current; node != root; node = node.failure) {
                if (node.isTerminal) {
                    matches.add(new Match(i - node.pattern.length() + 1, node.pattern));
                }
            }
        }
        return matches;
    }
}
This matcher can detect multiple sensitive markers in a single pass over the log message.
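A quick usage sketch, with illustrative marker patterns:
import java.util.List;

public class MatcherDemo {
    public static void main(String[] args) {
        AhoCorasickMatcher matcher = new AhoCorasickMatcher();
        matcher.addPattern("password=");
        matcher.addPattern("token=");
        matcher.buildFailureLinks(); // must run before matching

        // One pass over the message reports both markers.
        List<AhoCorasickMatcher.Match> matches =
                matcher.match("login failed: password=hunter2&token=abc123");
        matches.forEach(m -> System.out.println(m.start() + " -> " + m.pattern()));
    }
}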
Deterministic Tokenization Implementation
Next, we implement a tokenizer that produces consistent, irreversible tokens using a secret key.
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class DeterministicTokenizer {

    private final Mac mac;

    public DeterministicTokenizer(String secretKey) {
        try {
            mac = Mac.getInstance("HmacSHA256");
            // Pin the charset so tokens are identical across platforms.
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Unable to initialize HMAC", e);
        }
    }

    // Note: Mac is stateful and not thread-safe; see the thread-safety
    // discussion later in the article.
    public String tokenize(String value) {
        byte[] hash = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
        return "TOK_" + HexFormat.of().formatHex(hash).substring(0, 12);
    }
}
The same input value will always produce the same token, enabling reliable correlation across log entries.
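A quick check of that property (the key is a placeholder; real deployments would load it from secure configuration):
public class TokenizerDemo {
    public static void main(String[] args) {
        DeterministicTokenizer tokenizer = new DeterministicTokenizer("demo-secret-key");
        System.out.println(tokenizer.tokenize("user@example.com"));
        System.out.println(tokenizer.tokenize("user@example.com")); // same token both times
    }
}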
Intercepting Logs with Logback
In Spring Boot, Logback is the default logging backend. We can create a custom converter that sanitizes messages before they are written. One caveat shapes the design: Logback instantiates pattern converters reflectively through a no-arg constructor, so the matcher and tokenizer are built once as shared statics rather than passed in through the constructor.
import java.util.List;

import ch.qos.logback.classic.pattern.MessageConverter;
import ch.qos.logback.classic.spi.ILoggingEvent;

public class SanitizingMessageConverter extends MessageConverter {

    // Built once and shared; the patterns and env var name are illustrative.
    private static final AhoCorasickMatcher MATCHER = buildMatcher();
    private static final DeterministicTokenizer TOKENIZER = new DeterministicTokenizer(
            System.getenv().getOrDefault("LOG_TOKEN_KEY", "dev-only-fallback"));

    private static AhoCorasickMatcher buildMatcher() {
        AhoCorasickMatcher matcher = new AhoCorasickMatcher();
        matcher.addPattern("password=");
        matcher.addPattern("token=");
        matcher.buildFailureLinks();
        return matcher;
    }

    @Override
    public String convert(ILoggingEvent event) {
        return sanitize(event.getFormattedMessage());
    }

    private String sanitize(String message) {
        List<AhoCorasickMatcher.Match> matches = MATCHER.match(message);
        String sanitized = message;
        for (AhoCorasickMatcher.Match match : matches) {
            // Extract from the original message so that match offsets are
            // not shifted by replacements already applied to 'sanitized'.
            String value = SensitiveValueParser.extractValue(
                    message, match.start() + match.pattern().length());
            String token = TOKENIZER.tokenize(value);
            sanitized = sanitized.replace(value, token);
        }
        return sanitized;
    }
}
This converter ensures that every log message is sanitized synchronously and deterministically.
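To wire the converter into the pipeline, register it as a conversion word and use that word in the encoder pattern. A minimal sketch of a logback-spring.xml, assuming Logback's standard conversionRule mechanism; the package name and conversion word are placeholders:
<configuration>
  <conversionRule conversionWord="sanitizedMsg"
                  converterClass="com.example.logging.SanitizingMessageConverter"/>

  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <!-- %sanitizedMsg replaces the usual %msg so every message is scanned -->
      <pattern>%d{ISO8601} %-5level [%thread] %logger{36} - %sanitizedMsg%n</pattern>
    </encoder>
  </appender>

  <root level="INFO">
    <appender-ref ref="CONSOLE"/>
  </root>
</configuration>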
Performance and Thread Safety Considerations
High-throughput logging demands extreme efficiency. The Aho-Corasick matcher should be built once at application startup and reused across threads. The matcher itself is read-only after construction, making it inherently thread-safe.
For tokenization, care must be taken because cryptographic primitives such as Mac are not thread-safe. Solutions include:
- Using a `ThreadLocal<Mac>` (sketched below)
- Synchronizing access to the tokenizer
- Creating per-thread tokenizer instances
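A minimal sketch of the ThreadLocal approach, mirroring the HMAC scheme of the tokenizer above (the class name is illustrative):
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class ThreadSafeTokenizer {

    private final byte[] keyBytes;
    // Each thread lazily gets its own Mac, since Mac instances are stateful.
    private final ThreadLocal<Mac> mac = ThreadLocal.withInitial(this::newMac);

    public ThreadSafeTokenizer(String secretKey) {
        this.keyBytes = secretKey.getBytes(StandardCharsets.UTF_8);
    }

    private Mac newMac() {
        try {
            Mac m = Mac.getInstance("HmacSHA256");
            m.init(new SecretKeySpec(keyBytes, "HmacSHA256"));
            return m;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public String tokenize(String value) {
        byte[] hash = mac.get().doFinal(value.getBytes(StandardCharsets.UTF_8));
        return "TOK_" + HexFormat.of().formatHex(hash).substring(0, 12);
    }
}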
Memory allocations should be minimized, and string replacement logic should avoid repeated copying where possible.
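For instance, instead of calling String.replace once per match, which copies the whole string each time, the message can be rebuilt in a single pass. A sketch, assuming the matches have already been resolved into sorted, non-overlapping replacement spans (the Replacement record is hypothetical):
import java.util.List;

public final class SinglePassRewriter {

    // A resolved replacement: the [start, end) span of the sensitive value
    // in the original message, and the token that should replace it.
    public record Replacement(int start, int end, String token) {}

    // Rebuilds the message once, appending untouched segments and tokens,
    // rather than copying the whole string for every replacement.
    public static String rewrite(String message, List<Replacement> replacements) {
        StringBuilder out = new StringBuilder(message.length());
        int pos = 0;
        for (Replacement r : replacements) {
            out.append(message, pos, r.start()); // untouched prefix
            out.append(r.token());               // token instead of the value
            pos = r.end();
        }
        out.append(message, pos, message.length());
        return out.toString();
    }
}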
Handling Structured Logs and JSON Payloads
Many Spring Boot applications emit structured JSON logs. In such cases, sanitization can operate either on the raw string or after parsing the JSON.
A common hybrid approach is:
- Use Aho-Corasick to detect sensitive field names
- Parse only the affected portions
- Tokenize values deterministically
This avoids the overhead of fully parsing every log entry while preserving accuracy.
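A rough sketch of this hybrid flow, assuming Jackson on the classpath (a Spring Boot default) and the matcher and tokenizer defined earlier; the key set is illustrative and only top-level fields are handled:
import java.util.Set;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class JsonLogSanitizer {

    private static final Set<String> SENSITIVE_KEYS = Set.of("email", "ssn", "creditCard");

    private final ObjectMapper mapper = new ObjectMapper();
    private final AhoCorasickMatcher matcher;          // patterns: "email", "ssn", ...
    private final DeterministicTokenizer tokenizer;

    public JsonLogSanitizer(AhoCorasickMatcher matcher, DeterministicTokenizer tokenizer) {
        this.matcher = matcher;
        this.tokenizer = tokenizer;
    }

    public String sanitize(String line) {
        // Fast path: skip JSON parsing entirely when no sensitive key occurs.
        if (matcher.match(line).isEmpty()) {
            return line;
        }
        try {
            JsonNode root = mapper.readTree(line);
            if (root instanceof ObjectNode node) {
                // Top-level keys only; nested objects would need recursion.
                for (String key : SENSITIVE_KEYS) {
                    JsonNode value = node.get(key);
                    if (value != null && value.isTextual()) {
                        node.put(key, tokenizer.tokenize(value.asText()));
                    }
                }
                return node.toString();
            }
            return line;
        } catch (JsonProcessingException e) {
            return line; // not valid JSON; fall back to the string-based path
        }
    }
}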
Testing and Validation
Thorough testing is essential. Test cases should cover:
- High-volume concurrent logging
- Repeated values producing identical tokens
- Mixed sensitive and non-sensitive fields
- Partial matches and overlapping patterns
Performance testing under load ensures that the logging pipeline does not become a bottleneck.
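For example, the determinism requirement can be pinned down with a few JUnit 5 assertions (the key and values are illustrative):
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertNotEquals;

import org.junit.jupiter.api.Test;

class DeterministicTokenizerTest {

    private final DeterministicTokenizer tokenizer = new DeterministicTokenizer("test-key");

    @Test
    void repeatedValuesProduceIdenticalTokens() {
        assertEquals(tokenizer.tokenize("user@example.com"),
                     tokenizer.tokenize("user@example.com"));
    }

    @Test
    void differentValuesProduceDifferentTokens() {
        assertNotEquals(tokenizer.tokenize("user@example.com"),
                        tokenizer.tokenize("admin@example.com"));
    }

    @Test
    void tokenDoesNotLeakTheOriginalValue() {
        assertFalse(tokenizer.tokenize("user@example.com").contains("user@example.com"));
    }
}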
Conclusion
Sanitizing logs in real time is no longer optional in modern, security-conscious systems. As applications grow in complexity and regulatory requirements tighten, logging strategies must evolve beyond simple masking and manual discipline.
By combining the Aho-Corasick algorithm with deterministic tokenization, Spring Boot applications can achieve a powerful balance of performance, security, and observability. Aho-Corasick provides linear-time, multi-pattern detection that scales effortlessly with log volume, while deterministic tokenization ensures that sensitive values are replaced consistently without sacrificing traceability.
Embedding this logic directly into the logging pipeline guarantees that sensitive data is never written, stored, or transmitted in plain form. The approach is flexible enough to support unstructured text, structured JSON logs, and evolving sets of sensitive patterns. With careful attention to thread safety, performance optimization, and testing, this solution can operate transparently at scale.
Ultimately, this architecture transforms logging from a potential liability into a secure, compliant, and trustworthy diagnostic tool. By treating log sanitization as a first-class concern and leveraging proven algorithms, development teams can confidently meet both operational and regulatory demands without compromising on insight or performance.