Automating ETL (Extract, Transform, Load) processes for settlement files—such as bank statements, payment reconciliations, and transaction logs—is crucial for timely and error-free financial operations. Traditional manual data processing is time-consuming, error-prone, and lacks scalability. AWS offers a powerful combination of services like Amazon S3, AWS Glue, AWS Lambda, and AWS Step Functions to build scalable, serverless data pipelines that automate these ETL workflows end-to-end.

This article delves into how you can use these AWS services to automate the processing of settlement files efficiently, including detailed implementation strategies and code examples.

Understanding the Business Use Case

Settlement files are typically:

  • Dropped in a secure location (like S3) daily

  • Structured in CSV, XML, or JSON

  • In need of parsing, validation, cleaning, enrichment, and loading into data warehouses such as Redshift or S3-based data lakes

The goal is to automate this pipeline so that:

  • New files trigger processing automatically

  • The data is cleaned and transformed according to business rules

  • Errors are logged and optionally retried

  • Transformed data is available for reporting and reconciliation

Architecture Overview

The AWS ETL pipeline for settlement files consists of the following components:

  1. Amazon S3 – Stores raw, intermediate, and transformed files.

  2. AWS Lambda – Triggers processing logic when a file lands.

  3. AWS Glue – Performs data transformation and cataloging.

  4. AWS Step Functions – Orchestrates the entire ETL workflow with retries and error handling.

plaintext
[Settlement File Upload to S3]
          ↓
[S3 Event Notification]
          ↓
[Lambda Trigger]
          ↓
[Step Functions Start]
          ↓
[Glue Job for Transformation] → [Validation Step] → [Store Results]

Setting Up S3 Buckets for Raw and Processed Files

Create two S3 buckets:

  • settlement-files-raw

  • settlement-files-processed

These buckets are used to separate unprocessed files from cleaned and transformed data.

Sample AWS CLI Commands:

bash
aws s3 mb s3://settlement-files-raw
aws s3 mb s3://settlement-files-processed
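
You may also want to enable default encryption and block public access on both buckets at creation time. A minimal sketch using the same CLI (shown for the raw bucket; repeat for the processed bucket, and swap in SSE-KMS if you manage your own keys):

bash
# Enable default server-side encryption (SSE-S3)
aws s3api put-bucket-encryption \
  --bucket settlement-files-raw \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# Block all public access
aws s3api put-public-access-block \
  --bucket settlement-files-raw \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true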

Folder Structure:

plaintext
settlement-files-raw/
└── daily/
    └── bank_settlement_20250701.csv

settlement-files-processed/
└── year=2025/month=07/day=01/
    └── transformed.parquet

Creating an S3 Event Notification

Configure S3 to trigger a Lambda function when a new file is uploaded to the settlement-files-raw/daily/ prefix.

json
{
  "LambdaFunctionConfigurations": [
    {
      "Id": "StartETLOnFileUpload",
      "LambdaFunctionArn": "arn:aws:lambda:region:account-id:function:StartSettlementETL",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "daily/" },
            { "Name": "suffix", "Value": ".csv" }
          ]
        }
      }
    }
  ]
}
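
Assuming the configuration above is saved locally as notification.json, it can be applied with the CLI. S3 also needs permission to invoke the function, so grant that first (the function name, bucket, and ARNs follow the placeholders used above):

bash
# Allow S3 (from the raw bucket only) to invoke the Lambda function
aws lambda add-permission \
  --function-name StartSettlementETL \
  --statement-id AllowS3Invoke \
  --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::settlement-files-raw

# Attach the event notification configuration to the raw bucket
aws s3api put-bucket-notification-configuration \
  --bucket settlement-files-raw \
  --notification-configuration file://notification.json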

Lambda Function to Trigger Step Functions

This Lambda function initiates the ETL workflow via Step Functions.

python
import json
import urllib.parse

import boto3

sfn = boto3.client('stepfunctions')

def lambda_handler(event, context):
    # Extract the bucket and object key from the S3 event notification
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote(event['Records'][0]['s3']['object']['key'])

    # Start the Step Functions state machine with the file location as input
    response = sfn.start_execution(
        stateMachineArn='arn:aws:states:region:account-id:stateMachine:SettlementETL',
        input=json.dumps({
            'bucket': bucket,
            'key': key
        })
    )
    return {'executionArn': response['executionArn']}
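
For this to work, the function's execution role must be allowed to start the state machine. A minimal IAM policy sketch (the state machine ARN mirrors the placeholder above):

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "states:StartExecution",
      "Resource": "arn:aws:states:region:account-id:stateMachine:SettlementETL"
    }
  ]
}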

Define a Step Function Workflow

Use AWS Step Functions to orchestrate the ETL:

json
{
  "Comment": "ETL for Settlement Files",
  "StartAt": "TransformDataWithGlue",
  "States": {
    "TransformDataWithGlue": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "transform-settlement-data",
        "Arguments": {
          "--bucket.$": "$.bucket",
          "--key.$": "$.key"
        }
      },
      "Next": "ValidateOutput"
    },
    "ValidateOutput": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:region:account-id:function:ValidateTransformedData",
      "Next": "Success"
    },
    "Success": {
      "Type": "Succeed"
    }
  }
}

Glue Job for Data Transformation

Write a Glue job in Python to clean, transform, and convert the CSV into Parquet.

python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from datetime import datetime

# Resolve the job arguments passed in by Step Functions
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'bucket', 'key'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Load the raw CSV file that triggered the pipeline
input_path = f"s3://{args['bucket']}/{args['key']}"
df = spark.read.option("header", True).csv(input_path)

# Clean & transform: drop rows without a TransactionID and standardize column names
df_clean = df.dropna(subset=["TransactionID"]).withColumnRenamed("Amount", "TransactionAmount")

# Write to S3 in Parquet under a date-partitioned prefix
now = datetime.utcnow()
output_path = f"s3://settlement-files-processed/year={now.year}/month={now.month:02}/day={now.day:02}/"
df_clean.write.mode("overwrite").parquet(output_path)

job.commit()
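
With the script uploaded to S3, the job referenced by the state machine (transform-settlement-data) can be created from the CLI. A sketch with assumed values for the IAM role, script location, and Glue version:

bash
aws glue create-job \
  --name transform-settlement-data \
  --role GlueSettlementETLRole \
  --command '{"Name":"glueetl","ScriptLocation":"s3://settlement-etl-scripts/transform_settlement_data.py","PythonVersion":"3"}' \
  --glue-version "4.0" \
  --default-arguments '{"--job-language":"python"}'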

Lambda for Validation (Optional)

Post-processing validation can be done using another Lambda:

python
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket = "settlement-files-processed"
    prefix = "year=2025/month=07/day=01/"

    # Confirm the Glue job produced output under the expected partition
    result = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    if 'Contents' not in result:
        raise Exception("No output found!")

    print("Validation Passed.")
    return {"status": "success"}

Automating Error Notifications and Retries

To make the pipeline production-ready:

  • Configure Step Functions with Retry and Catch blocks to retry transient failures or alert via SNS (see the sketch after this list)

  • Use CloudWatch Alarms to monitor job success rates

  • Set Glue job timeouts to prevent long-running jobs
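
For example, the Glue task from the state machine above can be extended with Retry and Catch blocks so transient failures are retried and persistent ones publish to an SNS topic. A sketch (the topic ARN and the NotifyFailure state name are assumptions):

json
"TransformDataWithGlue": {
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName": "transform-settlement-data",
    "Arguments": { "--bucket.$": "$.bucket", "--key.$": "$.key" }
  },
  "Retry": [
    { "ErrorEquals": ["States.ALL"], "IntervalSeconds": 60, "MaxAttempts": 2, "BackoffRate": 2.0 }
  ],
  "Catch": [
    { "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" }
  ],
  "Next": "ValidateOutput"
},
"NotifyFailure": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish",
  "Parameters": {
    "TopicArn": "arn:aws:sns:region:account-id:settlement-etl-alerts",
    "Message.$": "$"
  },
  "End": true
}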

Cataloging with AWS Glue Data Catalog

After writing Parquet files, update the Glue Data Catalog for query support:

python
glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [output_path]},
    format="parquet"
).toDF().write.saveAsTable("settlement_db.cleaned_transactions")

This enables seamless querying using Amazon Athena or Redshift Spectrum.
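
Once the table is in the Data Catalog, the cleaned data can also be queried programmatically. A sketch using boto3 and Athena (the results bucket is a placeholder; the database and column names follow the examples above):

python
import boto3

athena = boto3.client('athena')

# Run a simple reconciliation query against the cataloged table
response = athena.start_query_execution(
    QueryString="""
        SELECT COUNT(*) AS transaction_count,
               SUM(CAST(TransactionAmount AS DECIMAL(18, 2))) AS total_settled
        FROM settlement_db.cleaned_transactions
    """,
    QueryExecutionContext={'Database': 'settlement_db'},
    ResultConfiguration={'OutputLocation': 's3://settlement-athena-results/'}
)
print(response['QueryExecutionId'])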

Advantages of This Architecture

  • Serverless: No servers to manage; fully managed by AWS.

  • Scalable: Automatically handles spikes in file uploads.

  • Auditable: Every step is logged and tracked in CloudWatch.

  • Cost-efficient: Pay-per-use pricing for Lambda, Glue, and S3.

  • Modular: Easily extendable to support XML, JSON, or different banks.

Use Cases Beyond Financial Settlements

While this architecture is tailored for settlement files, it can be adapted to:

  • Insurance claim processing

  • Healthcare data ingestion (HL7/FHIR)

  • Logistics tracking

  • Retail sales reconciliation

  • Subscription billing records

Security Considerations

  • Enable S3 encryption (SSE-S3 or SSE-KMS)

  • Use IAM roles with least privilege for Lambda, Glue, and Step Functions

  • Implement VPC endpoints and S3 bucket policies to prevent public access (a sample policy sketch follows this list)

  • Log access and changes using AWS CloudTrail
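
As a sketch of the bucket-policy point above, the following policy denies access to the raw bucket unless requests arrive through a specific VPC endpoint (the endpoint ID is a placeholder; make sure the pipeline's Glue and Lambda roles reach S3 through that endpoint before applying it):

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAccessOnlyFromVpcEndpoint",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::settlement-files-raw",
        "arn:aws:s3:::settlement-files-raw/*"
      ],
      "Condition": {
        "StringNotEquals": { "aws:SourceVpce": "vpce-0123456789abcdef0" }
      }
    }
  ]
}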

Conclusion

Automating ETL for settlement files using AWS services like S3, Glue, Lambda, and Step Functions transforms a historically manual, error-prone process into a modern, scalable, and secure pipeline. This architecture enables organizations to ingest, clean, validate, and analyze large volumes of financial data in real time or near real time without investing in heavy infrastructure.

By leveraging S3 for durable and cost-effective storage, Lambda for lightweight compute tasks, Glue for transformation and schema management, and Step Functions for orchestration, teams can design a robust pipeline that meets compliance requirements, speeds up reporting, and eliminates operational bottlenecks.

In today’s data-driven economy, timely access to clean and accurate settlement data is a competitive advantage. Whether you’re a fintech startup, a bank, or a billing platform, implementing this kind of serverless ETL pipeline ensures you are not just automating a process—but enabling faster decisions, deeper insights, and better financial accountability.