Processing a million+ recording files (audio, video, binary blobs, or JSON/XML records extracted from recordings) is not just “more work” — it is a completely different class of engineering problem. When you try to treat the dataset like a single in-memory collection, you quickly hit physical memory limits, GC pressure, and unpredictable latency. Systems that load all file contents or build a huge in-memory array of metadata or messages will run out of heap, stall, or crash.
Streaming is the pattern that lets you process arbitrarily large datasets using constant (or bounded) memory. Streamed processing reads, transforms, and writes data piecewise, often combined with batching, backpressure and idempotent checkpoints. Below we compare an in-memory DataWeave-style approach (which will OOM at this scale) to streaming solutions and provide code examples that show how to scale safely.
Why an in-memory DataWeave approach fails for 1M+ files
DataWeave (MuleSoft) is a powerful transformation language. Many examples and tutorials show transforming lists using map/filter/reduce. Those idioms assume the list being transformed fits into memory. A common pattern that becomes problematic at scale:
%dw 2.0
output application/json
var files = payload.files // suppose payload.files contains file metadata for 1,200,000 files
---
files map ((f) ->
    {
        name: f.name,
        // reading each file's content into memory
        content: readUrl(f.path, "application/octet-stream")
    }
)
That map builds a large array in memory. Even if each element is small, 1,200,000 elements quickly exceed the available heap. Worse, every readUrl(...) call materializes the file contents, pulling the binary into memory for each element. DataWeave will attempt to create a full in-memory representation of the output array, triggering an OutOfMemoryError when the JVM heap cannot hold all the objects.
Key reasons this approach fails:
- Building a single large list of transformed objects consumes Java heap for list structure and object graph.
- Temporary copies during mapping and serialization increase memory pressure.
- The garbage collector must trace millions of objects, increasing pause time and overall latency.
- Many DataWeave operators are eager (create full outputs) unless explicitly designed to use streaming features.
The remedy is to stop building the entire dataset in memory: adopt streaming patterns.
Streaming principles that prevent OutOfMemory
Streaming processing relies on four core techniques:
- Incremental reading — read one file (or chunk) at a time, process it, then discard it.
- Backpressure — only read as fast as downstream can accept (so producers don’t overwhelm consumers).
- Batching and windowing — group items into bounded batches for efficiency while keeping bounds small enough to fit memory.
- Use of streaming parsers — for XML/JSON use SAX/StAX or Jackson Streaming API (token stream) instead of loading entire DOM/Tree.
By combining these you can maintain constant memory usage regardless of the number of files.
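To make the first and third principles concrete, here is a minimal Java sketch (the directory path, the batch size of 500, and the processBatch(...) sink are illustrative assumptions): it walks a directory lazily and holds at most one bounded batch of Path references at a time.
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
public class BatchedScan {
    private static final int BATCH_SIZE = 500; // bounded window: tune to your memory budget
    public static void main(String[] args) throws IOException {
        List<Path> batch = new ArrayList<>(BATCH_SIZE);
        // Files.walk is lazy: paths are produced as the traversal advances, not all at once
        try (Stream<Path> paths = Files.walk(Paths.get("/path/to/recordings"))) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                batch.add(p);
                if (batch.size() == BATCH_SIZE) {
                    processBatch(batch); // transform the batch and flush results to a sink
                    batch.clear();       // drop references before reading any further
                }
            });
        }
        if (!batch.isEmpty()) {
            processBatch(batch);         // final partial batch
        }
    }
    private static void processBatch(List<Path> batch) {
        // placeholder: open each file, stream its content, write results immediately
    }
}
Memory usage is bounded by BATCH_SIZE regardless of how many files the directory contains.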
Architecture patterns for processing millions of recordings
Several deployment patterns are common:
- Producer/Consumer pipeline: a small fixed thread pool produces file references; consumers process contents; an internal bounded queue enforces backpressure and bounded memory (see the sketch after this list).
- Reactive streams: Use an implementation (Project Reactor, RxJava, Node.js streams) that natively supports backpressure and nonblocking IO.
- Streaming connectors: platform file connectors that offer InputStream access rather than returning whole file content.
- MapReduce-like stage processing: Stage 1: scan filenames and partition; Stage 2: process partitions sequentially or with bounded parallelism; Stage 3: write results.
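As a minimal sketch of the producer/consumer pattern (queue capacity, thread count, and the process(...) body are illustrative assumptions), a bounded ArrayBlockingQueue supplies the backpressure: put() blocks the producer whenever consumers fall behind, so in-flight work never exceeds the queue capacity plus the number of workers.
import java.nio.file.*;
import java.util.concurrent.*;
public class ProducerConsumerPipeline {
    private static final Path POISON = Paths.get("");    // sentinel that tells a consumer to stop
    private static final int QUEUE_CAPACITY = 1_000;     // bounds in-flight file references
    private static final int CONSUMERS = 8;
    public static void main(String[] args) throws Exception {
        BlockingQueue<Path> queue = new ArrayBlockingQueue<>(QUEUE_CAPACITY);
        ExecutorService consumers = Executors.newFixedThreadPool(CONSUMERS);
        for (int i = 0; i < CONSUMERS; i++) {
            consumers.submit(() -> {
                try {
                    for (Path p = queue.take(); p != POISON; p = queue.take()) {
                        process(p);                       // stream-read, transform, write to sink
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // Producer: put() blocks when the queue is full, giving natural backpressure
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(Paths.get("/path/to/recordings"))) {
            for (Path p : ds) {
                queue.put(p);
            }
        }
        for (int i = 0; i < CONSUMERS; i++) {
            queue.put(POISON);                            // one sentinel per consumer
        }
        consumers.shutdown();
        consumers.awaitTermination(1, TimeUnit.DAYS);
    }
    private static void process(Path p) {
        // placeholder: open an InputStream, parse incrementally, write the result immediately
    }
}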
Next, concrete code examples illustrate how to implement these patterns.
Why the DataWeave in-memory pattern OOMs (illustrative)
A fuller DataWeave-style example that aggregates metadata into a single JSON array allocates an array of size N:
%dw 2.0
output application/json
var allMeta = payload.files map ((file) -> {
    name: file.name,
    // readUrl materializes each metadata document just to pick one field
    duration: readUrl(file.metadataPath, "application/json").duration
    // other fields
})
---
{ recordings: allMeta } // returns a huge JSON array — will OOM at 1M+
Even with streaming file reads, the assembly into allMeta is eager. The safe approach is to avoid returning a single massive array to the runtime — instead stream items out (for example writing to an output stream, a message queue, or a paged file), or use a streaming connector that writes each transformed record to a sink immediately.
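On the JVM, "streaming items out" can be as simple as writing each transformed record as one NDJSON line the moment it is produced, so no aggregate array ever exists. A minimal sketch, assuming Jackson Databind is on the classpath and using illustrative paths and field names:
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.*;
import java.util.Map;
import java.util.stream.Stream;
public class NdjsonSink {
    private static final ObjectMapper MAPPER = new ObjectMapper();
    public static void main(String[] args) throws IOException {
        Path out = Paths.get("recordings.ndjson");
        try (BufferedWriter writer = Files.newBufferedWriter(out);
             Stream<Path> files = Files.list(Paths.get("/path/to/recordings"))) {
            for (Path p : (Iterable<Path>) files::iterator) {
                // Transform one file into one small record and write it immediately.
                Map<String, Object> rec = Map.of(
                        "name", p.getFileName().toString(),
                        "sizeBytes", Files.size(p));
                writer.write(MAPPER.writeValueAsString(rec));
                writer.newLine();
                // Nothing accumulates: the record goes out of scope after each iteration.
            }
        }
    }
}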
Java example: stream files with bounded memory using NIO + Jackson streaming
This Java example demonstrates a robust pattern: iterate over files using DirectoryStream, process each file sequentially (or with bounded thread pool), and use Jackson streaming to parse JSON content without loading it into memory.
import java.nio.file.*;
import java.io.*;
import java.util.concurrent.*;
import com.fasterxml.jackson.core.*;
public class StreamFileProcessor {
    private static final int THREADS = 8;
    // Bound the number of in-flight tasks so the submission loop cannot outrun the workers
    // and fill the executor's queue with a million pending Runnables.
    private static final Semaphore IN_FLIGHT = new Semaphore(THREADS * 4);
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/path/to/recordings");
        ExecutorService exec = Executors.newFixedThreadPool(THREADS);
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir)) {
            for (Path p : ds) {
                IN_FLIGHT.acquire();            // backpressure: wait while all workers are busy
                exec.submit(() -> {
                    try {
                        processFile(p);
                    } finally {
                        IN_FLIGHT.release();
                    }
                });
            }
        }
        exec.shutdown();
        exec.awaitTermination(1, TimeUnit.DAYS);
    }
    private static void processFile(Path p) {
        try (InputStream in = Files.newInputStream(p);
             JsonParser parser = new JsonFactory().createParser(in)) {
            // Use streaming tokens instead of building an object tree
            while (parser.nextToken() != null) {
                // inspect fields as tokens arrive and write results immediately
            }
            // write output to a sink (file, DB, queue) and then return — no large in-memory state retained
        } catch (IOException e) {
            // handle errors, retries
        }
    }
}
This approach keeps per-file memory usage bounded (the parser holds only a small token buffer), and the semaphore caps the number of queued tasks so the directory scan cannot race ahead of the workers. Tune the permit count to how fast your downstream sink can absorb results.
Node.js example: pipeline and backpressure with streams
Node.js streams handle backpressure for you. The example below processes a list of file paths one at a time and pipes the transformed output to a writable sink:
const fs = require('fs');
const { pipeline, Transform, Readable } = require('stream');
const pLimit = require('p-limit'); // for bounded concurrency across files, if needed
const processStream = new Transform({
  objectMode: true,
  transform(filePath, encoding, cb) {
    const read = fs.createReadStream(filePath);
    // do a streaming transformation per file
    read.on('data', chunk => {
      // transformChunk is a placeholder for your per-chunk logic; avoid buffering the entire file
      this.push(transformChunk(chunk));
    });
    read.on('end', () => cb());
    read.on('error', cb);
  }
});
// Suppose `filePaths` is an array created by a directory scan (names only).
const filePaths = getFilePaths(); // placeholder: must return paths only, never file contents
const source = Readable.from(filePaths, { objectMode: true });
// pipe with backpressure handled by Node streams
pipeline(source, processStream, fs.createWriteStream('output.ndjson'), (err) => {
  if (err) console.error('Pipeline failed', err);
  else console.log('Pipeline finished');
});
This uses streaming reads and the pipeline will ensure memory remains bounded — Node automatically applies backpressure when the writable cannot accept more data.
Python example: generator + incremental processing
Python generators make streaming simple and readable. This pattern scans a directory, yields file paths, and processes each file sequentially in small chunks:
import os

def file_paths(directory):
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_file():
                yield entry.path

def process_chunk(chunk):
    # placeholder: transform the chunk and write the result to a sink
    pass

def process_file(path):
    with open(path, 'rb') as f:
        # read in small chunks, or use a streaming JSON parser such as ijson
        for chunk in iter(lambda: f.read(8192), b''):
            process_chunk(chunk)
    # write the per-file result to a DB or append it to an output file

def main(directory):
    for fp in file_paths(directory):
        process_file(fp)

if __name__ == "__main__":
    main("/path/to/recordings")
This code never holds more than a small chunk buffer per file in memory.
Practical strategies to combine with streaming
- Bounded concurrency: use a fixed-size thread pool or concurrency limiter (e.g., Reactor flatMap with a concurrency argument, p-limit in Node, or an ExecutorService in Java). This limits the number of files processed simultaneously and prevents memory spikes (a Project Reactor sketch follows this list).
- Use streaming connectors: Mule's File or SFTP connectors can provide an InputStream to DataWeave. When used correctly (and when DataWeave is configured for streaming, or when you avoid creating a single array), you can process each file and write out results incrementally.
- Externalize large state: write intermediate results to disk, a database, or a queue instead of collecting them in memory. For example, append NDJSON lines to a file or publish messages to Kafka/SQS.
- Use streaming parsers: Jackson Streaming for JSON, SAX/StAX for XML — avoid DOM/Tree APIs at this scale.
- Avoid eager collect/aggregate: any collect/toList/array operations will convert streaming sequences into in-memory structures — avoid them.
- Tune GC and heap if necessary: streaming reduces memory usage, but you still need reasonable heap for buffers and JVM overhead. Size the heap based on typical active concurrency rather than total file count.
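If you use Project Reactor, bounded concurrency is a single argument to flatMap. A minimal sketch (the process(...) body, the directory path, and the concurrency of 8 are assumptions, not recommendations):
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
public class ReactiveFileProcessor {
    public static void main(String[] args) {
        Flux.using(
                () -> Files.list(Paths.get("/path/to/recordings")), // lazy directory listing
                Flux::fromStream,
                Stream::close)
            // At most 8 files are in flight at any moment; the rest wait upstream.
            .flatMap(p -> Mono.fromCallable(() -> process(p))
                              .subscribeOn(Schedulers.boundedElastic()), 8)
            .doOnError(e -> System.err.println("processing failed: " + e))
            .blockLast(); // acceptable in a batch job; never block inside a reactive service
    }
    private static String process(Path p) {
        // placeholder: stream-read the file, transform it, write the result to a sink
        return p.getFileName().toString();
    }
}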
When DataWeave can be safe — and how to design it
DataWeave itself can be used in streaming-safe ways if you avoid assembling huge outputs. Design patterns:
- Transform each file independently and write the output record to a sink (file, DB, queue) rather than returning an array of all transformed records.
- Use connectors that return streams and map transformations that operate on streaming payload segments. Many Mule connectors and the Mule runtime allow streaming flows where DataWeave operates on a streaming payload rather than a materialized list.
- Use pagination in the control flow: process N files at a time, commit results, then continue with the next page.
The essential rule is: never ask DataWeave (or any transformer) to hold the entire million-file result in memory.
Monitoring, error handling and operational considerations
- Backpressure metrics: measure queue lengths, active worker counts, and downstream consumption rates. If queue depth grows, throttle producers.
- Retries and idempotence: make processing idempotent (so reprocessing after failure is safe).
- Checkpointing / resume: persist processed file ids so you can resume after a crash without reprocessing everything (a minimal sketch follows this list).
- Logging: avoid logging entire file contents; log identifiers, sizes, and timings.
- Resource limits: cap file descriptor counts and thread pool sizes. Use streaming reads to avoid exhausting descriptors.
- Testing: simulate large scale with generated files or a test harness that emits file events at scale.
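Checkpointing can be as simple as an append-only file of processed identifiers. A minimal sketch, assuming a local processed.ids file and a placeholder process(...) call; a database table or queue offsets would work equally well:
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.*;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;
public class CheckpointedProcessor {
    public static void main(String[] args) throws IOException {
        Path checkpoint = Paths.get("processed.ids");
        // Load previously processed ids (one short line per file, small relative to file contents)
        Set<String> done = Files.exists(checkpoint)
                ? new HashSet<>(Files.readAllLines(checkpoint))
                : new HashSet<>();
        try (BufferedWriter ckpt = Files.newBufferedWriter(checkpoint,
                     StandardOpenOption.CREATE, StandardOpenOption.APPEND);
             Stream<Path> files = Files.list(Paths.get("/path/to/recordings"))) {
            for (Path p : (Iterable<Path>) files::iterator) {
                String id = p.getFileName().toString();
                if (done.contains(id)) {
                    continue;             // already handled before the crash/restart
                }
                process(p);               // must be idempotent in case we crash mid-write
                ckpt.write(id);
                ckpt.newLine();
                ckpt.flush();             // persist the checkpoint before moving on
            }
        }
    }
    private static void process(Path p) {
        // placeholder: stream-read, transform, write the result to a durable sink
    }
}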
Conclusion
Processing over a million recording files requires a streaming-first design: incremental reads, streaming parsers, bounded concurrency, and immediate writes to durable sinks. An in-memory approach—like mapping a million file records into a single DataWeave array—will exhaust JVM heap, amplify GC pressure, cause long pauses, and ultimately produce OutOfMemory errors. The correct remedy is to treat the dataset as an unbounded stream, not a single finite collection.
Practical code patterns in Java, Node.js, and Python show how to implement streaming file processors that keep memory usage bounded: iterate filenames rather than file contents, use InputStream/streaming parsers to consume content token by token, apply bounded thread pools or reactive concurrency limits to enforce backpressure, and write outputs immediately to an external sink rather than creating big in-memory aggregates.
When using DataWeave in a Mule environment, avoid building giant arrays or materializing all transformed objects. Instead, transform each file individually and persist results incrementally or use Mule connectors and streaming flows that let DataWeave operate on streaming payloads. When combined with proper monitoring, idempotent processing, and checkpointing, streaming systems will reliably process millions of recordings without OOMs — and with far more predictable latency and throughput — whereas purely in-memory approaches will fail at scale.