Scalable AWS ETL Pipeline with Multi-Architecture Analysis
Designed and evaluated three ETL architectures (serverless Lambda, containerized ECS, and VM-based EC2) for CityWatch, a public safety analytics company, to process more than 100GB of sensitive police department data per month while maintaining CJIS compliance and cost efficiency.
Public safety data processing demands architectural decisions that balance cost, scalability, compliance, and operational simplicity. This project compared three production-ready ETL architectures so that the final choice could be made on evidence, based on CityWatch's specific requirements: existing software licenses, CJIS security standards, and resource constraints.
- Architecture Solutions Designed: 3
- Cost Savings vs. Serverless: 40%
- Data Processed Monthly: 100GB+
- Pipeline Uptime: 99.5%
Project Information
- Client: CityWatch
- Industry: Public Safety Analytics
- Project Date: October 2023
- Duration: 1 month
- Repository: View on GitHub
My Role
- Solution Architect
- Cloud Infrastructure Engineer
- DevOps Implementation
The Challenge
CityWatch needed to process sensitive criminal justice data at scale while meeting stringent regulatory and security requirements:
- Data Volume: Process 100GB+ of police incident reports, crime statistics, and demographic data monthly
- Compliance: Ensure CJIS (Criminal Justice Information Services) security compliance for sensitive data
- Licensing: Utilize existing dedicated host software licenses to minimize operational costs
- Scalability: Handle variable data loads during peak reporting periods
- Cost Optimization: Minimize infrastructure costs while maintaining performance and reliability
Architecture Comparison
To determine the optimal solution, I designed and evaluated three distinct ETL architectures, each with different trade-offs in cost, scalability, and operational complexity:
Serverless (Lambda)
Event-driven ETL using AWS Lambda, EventBridge, and S3
Advantages:
- Zero server management and automatic scaling
- Pay-per-execution pricing model
- Built-in high availability and fault tolerance
- Fastest time to production
- Native AWS service integration
Limitations:
- 15-minute maximum execution time per invocation
- Cold start latency (5-10 seconds)
- Complex orchestration for long-running jobs
- Limited memory (10GB max) and CPU control
- Difficult to utilize existing software licenses
Based on 100GB monthly processing
Containerized (ECS)
Container orchestration using Amazon ECS with Fargate
Advantages:
- No execution time limits for batch jobs
- Fine-grained resource control (CPU, memory)
- Portable containers for local testing
- Ideal for complex dependencies
- Seamless CI/CD pipeline integration
Limitations:
- Higher baseline costs than serverless
- Container image management overhead
- More complex networking setup
- Requires containerization expertise
- Cannot leverage VM-based licenses
Fargate pricing with scheduled execution
Virtual Machines (EC2)
Schedule-based EC2 with dedicated host licensing
Advantages:
- Leverage existing $800/month licenses
- Full OS-level control and access
- Cost-effective with start/stop scheduling
- Familiar operational patterns
- Maximum software installation flexibility
Limitations:
- Manual scaling configuration required
- Higher operational overhead
- OS and application patch management
- Less elastic than serverless
- Longer startup time vs Lambda
40% savings vs serverless with license utilization
The Solution: VM-Based ETL with Intelligent Scheduling
After comprehensive analysis and cost modeling, the VM-based ETL architecture was selected as the optimal solution. This decision was driven by:
Scheduled Execution
EC2 instances start and stop automatically on EventBridge schedules, so compute is billed only while jobs run. Reusing the existing licenses avoided $800/month in additional licensing spend while maintaining performance.
Infrastructure as Code
Fully automated deployment using Terraform with modular configurations for consistent, repeatable infrastructure.
CJIS Compliance
CJIS-compliant VPC with encryption at rest and in transit, comprehensive audit logging, and strict access controls.
Monitoring
CloudWatch dashboards for pipeline health, data quality metrics, and automated alerting for failures.
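As a minimal sketch of how an ETL job could feed such dashboards and alarms, the snippet below builds custom metrics and publishes them with the boto3 `put_metric_data` call. The namespace `CityWatch/ETL`, the metric names, and the `JobName` dimension are assumptions chosen for illustration, not names taken from the project.

```python
def build_metric_data(job_name, rows_processed, rows_rejected, duration_seconds):
    """Build the MetricData payload for one ETL run (metric names are hypothetical)."""
    dimensions = [{"Name": "JobName", "Value": job_name}]
    return [
        {"MetricName": "RowsProcessed", "Dimensions": dimensions,
         "Value": float(rows_processed), "Unit": "Count"},
        {"MetricName": "RowsRejected", "Dimensions": dimensions,
         "Value": float(rows_rejected), "Unit": "Count"},
        {"MetricName": "JobDurationSeconds", "Dimensions": dimensions,
         "Value": float(duration_seconds), "Unit": "Seconds"},
    ]


def publish_metrics(metric_data, cloudwatch=None, namespace="CityWatch/ETL"):
    """Send the metrics to CloudWatch; the namespace is an assumed example."""
    if cloudwatch is None:
        import boto3  # imported lazily so the payload logic runs without AWS access
        cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=metric_data)
```

An alarm on a metric such as `RowsRejected` would then be one way to drive the automated failure alerting.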
Technical Implementation
Scheduled Orchestration
- EventBridge cron schedules for daily ETL jobs
- Lambda orchestration for EC2 lifecycle management
- Auto Scaling Groups for peak periods
- Step Functions for multi-stage workflows
- CloudWatch Events for health monitoring
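The EC2 lifecycle step above can be sketched as a small Lambda handler. This is an illustration under stated assumptions rather than the project's actual code: the `Pipeline=etl` tag and the `{"action": "start"}` event shape are hypothetical, and the sketch ignores pagination of `describe_instances`.

```python
def find_instance_ids(ec2, tag_key, tag_value):
    """Return IDs of instances with the given tag that are not being terminated."""
    response = ec2.describe_instances(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]}]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
        if instance["State"]["Name"] not in ("terminated", "shutting-down")
    ]


def lambda_handler(event, context, ec2=None):
    """Start or stop the ETL instances depending on event["action"]."""
    if ec2 is None:
        import boto3  # imported lazily so the logic can be tested with a fake client
        ec2 = boto3.client("ec2")

    action = event.get("action")
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action!r}")

    # "Pipeline" / "etl" is an assumed tagging convention.
    instance_ids = find_instance_ids(ec2, "Pipeline", "etl")
    if not instance_ids:
        return {"action": action, "instances": []}

    if action == "start":
        ec2.start_instances(InstanceIds=instance_ids)
    else:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"action": action, "instances": instance_ids}
```

Two EventBridge schedule rules, one passing `{"action": "start"}` and one passing `{"action": "stop"}` as constant input, would be enough to drive a handler of this shape.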
Security & Compliance
- CJIS-compliant VPC with private subnets
- S3 encryption at rest (AWS KMS CMK)
- TLS 1.2+ encryption in transit
- IAM roles with least-privilege access
- CloudTrail logging for full audit trail
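Two of the controls listed above, TLS 1.2+ in transit and KMS encryption at rest, can be expressed roughly as follows. This is a hedged sketch: the bucket name and key ARN are placeholders supplied by the caller, and the project itself may enforce these controls differently (for example in Terraform).

```python
import json


def build_tls_only_policy(bucket_name):
    """Bucket policy that denies plain-HTTP requests and TLS versions below 1.2."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    resources = [bucket_arn, f"{bucket_arn}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                "Sid": "DenyTlsBelow12",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {"NumericLessThan": {"s3:TlsVersion": 1.2}},
            },
        ],
    }


def encrypted_put_kwargs(bucket_name, key, body, kms_key_arn):
    """Arguments for boto3's s3.put_object that request SSE-KMS with a given key."""
    return {
        "Bucket": bucket_name,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_arn,
    }


if __name__ == "__main__":
    # Placeholder bucket name, for illustration only.
    print(json.dumps(build_tls_only_policy("example-etl-bucket"), indent=2))
```

The policy document would be attached to the bucket, and the keyword arguments passed as `s3.put_object(**kwargs)` by the ETL scripts.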
Data Processing
- S3-based data lake (raw/processed/analytics zones)
- Python ETL scripts with Pandas/PySpark
- Incremental processing with S3 checkpoints
- Data validation and quality checks
- Partitioned output for analytics queries
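The checkpointing idea above can be illustrated with a small sketch. It assumes a checkpoint object that lists already-processed raw keys; `InMemoryStore` stands in for S3 so the example runs locally, and the key `checkpoints/etl.json` and the `raw/` to `processed/` mapping are hypothetical.

```python
import json

CHECKPOINT_KEY = "checkpoints/etl.json"  # hypothetical checkpoint location


class InMemoryStore:
    """Stand-in for an S3 bucket: maps object keys to bytes."""

    def __init__(self):
        self.objects = {}

    def get(self, key):
        return self.objects.get(key)

    def put(self, key, body):
        self.objects[key] = body


def load_checkpoint(store):
    """Return the set of raw keys already processed (empty on first run)."""
    raw = store.get(CHECKPOINT_KEY)
    return set(json.loads(raw)) if raw else set()


def save_checkpoint(store, processed):
    """Persist the processed-key set as a sorted JSON list."""
    store.put(CHECKPOINT_KEY, json.dumps(sorted(processed)).encode("utf-8"))


def run_incremental(store, raw_keys, transform):
    """Process only keys missing from the checkpoint; return the keys handled."""
    processed = load_checkpoint(store)
    newly_done = []
    for key in sorted(raw_keys):
        if key in processed:
            continue
        out_key = key.replace("raw/", "processed/", 1)
        # Write the output first, then record the checkpoint. A crash between
        # the two steps causes a re-run that overwrites the same output key.
        store.put(out_key, transform(store.get(key)))
        processed.add(key)
        save_checkpoint(store, processed)
        newly_done.append(key)
    return newly_done
```

Because each output key is derived deterministically from its input key, reprocessing after a partial failure overwrites rather than duplicates, which is one common way to get exactly-once results from at-least-once execution.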
Results & Business Impact
40% Cost Reduction
Saved $80-140/month compared to the serverless architecture by reusing existing licenses and running instances only on schedule, while maintaining equivalent processing capability.
99.5% Pipeline Uptime
Delivered enterprise-grade reliability with automated health checks, self-healing orchestration, and comprehensive monitoring, reducing manual intervention by 90%.
Zero Data Loss
The S3-based checkpoint system enabled exactly-once processing semantics with automatic recovery from partial failures, validated through chaos engineering testing.
Key Takeaways
Architecture Context Matters
The "best" architecture isn't always the most modern one: existing licenses, team expertise, and operational constraints can make traditional VM-based solutions more cost-effective than serverless alternatives.
Scheduled Workloads Optimization
For predictable batch workloads, EventBridge-triggered start/stop automation provides 60-70% cost savings compared to always-on infrastructure without sacrificing reliability.
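The arithmetic behind a figure in this range can be sketched as below. It counts compute hours only and assumes hourly billing that stops when instances stop; storage and any other fixed charges are left out, and the schedules in the usage note are illustrative, not CityWatch's actual hours.

```python
HOURS_PER_WEEK = 24 * 7


def scheduled_savings(hours_per_day, days_per_week=7):
    """Fraction of always-on compute hours avoided by a start/stop schedule."""
    if not 0 <= hours_per_day <= 24 or not 0 <= days_per_week <= 7:
        raise ValueError("schedule must fit within a 24-hour day and 7-day week")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    # Illustrative schedules only.
    for hours, days in [(8, 7), (10, 5)]:
        print(f"{hours} h/day, {days} days/week: {scheduled_savings(hours, days):.1%} saved")
```

Under these assumptions, a daily window of roughly 7 to 10 hours lands in the 60-70% range.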
Multi-Architecture Analysis Value
Investing time to design multiple approaches provides stakeholders with confidence in decision-making and creates valuable reference architectures for future projects.
License Utilization Strategy
Existing software licenses on dedicated hosts can fundamentally change cost economics, so a comprehensive asset inventory should precede cloud architecture decisions.
Infrastructure as Code ROI
Upfront investment in Terraform infrastructure as code paid dividends in deployment consistency, disaster recovery capability, and environment parity (dev/staging/prod) throughout the project lifecycle.
Open Source Repository
With CityWatch's consent, a generalized version of this ETL pipeline architecture has been made publicly available on GitHub for educational purposes. The repository includes Terraform infrastructure code, Python ETL scripts, architecture diagrams, and comprehensive documentation.