Team members: Carlos Byrne, Liam Biggar, Nicolas Tolksdorf, Shay Doherty, Girish Joshi and Ethan Labouchardiere.
This is a three-week group project completed as part of the Northcoders Data Engineering Bootcamp. The project implements a data pipeline to extract, transform, and load (ETL) data from an operational database (`totesys`) into an AWS-based data lake and data warehouse. The goal is to create a robust, automated, and scalable data platform that supports analytical reporting and business intelligence.
The architecture consists of the following key components:
- Source Database (`totesys`): A simulated operational database containing transactional data.
- Data Lake (S3): Two S3 buckets:
  - `ingestion-bucket`: Stores raw data extracted from the `totesys` database.
  - `processed-bucket`: Stores transformed data in Parquet format, ready for loading into the data warehouse.
- Data Warehouse (AWS): A relational data warehouse hosted in AWS, designed with three star schemas (Sales, Purchases, Payments). Our project focussed on implementing the Sales star schema as a Minimum Viable Product (MVP).
- AWS Lambda Functions: Python-based Lambda functions for:
  - Data extraction from the `totesys` database.
  - Data transformation and remodeling.
  - Data loading into the data warehouse.
- AWS EventBridge: Used for scheduling and orchestrating the data pipeline.
- AWS CloudWatch: Used for logging, monitoring, and alerting.
- GitHub Actions: For continuous integration and continuous deployment (CI/CD).
- Terraform: For infrastructure as code (IaC).
- AWS account with appropriate permissions.
- Python 3.12
- Terraform
- GitHub account
- Clone the Repository:

      git clone https://github.com/nicolas-tolksdorf/tote-bag-data-transformation.git
      cd tote-bag-data-transformation
- Create a Virtual Environment (Recommended):

      python3 -m venv venv
      source venv/bin/activate
- Install Dependencies:

      pip install -r requirements.txt
- Configure AWS Credentials:
  - The AWS region should be configured as `eu-west-2`.
  - Set up AWS credentials in your environment or configure them using the AWS CLI.
  - Store AWS credentials as GitHub Secrets (`AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) for CI/CD.
  - Add database credentials for the `totesys` database to AWS Secrets Manager (see the Secrets Manager sketch after this setup list):
    - Replace the ARN in `aws_iam_policy_document.read_secretsmanager.resources` in `iam.tf`.
    - Replace the ARN in `connect_to_database()` in `utils/lambda_utils.py`.
  - Add database credentials for the project warehouse to AWS Secrets Manager:
    - Replace the ARN in `connect_to_warehouse()` in `utils/lambda_utils.py`.
- Terraform Initialization:
  - Log into the AWS console and create `data_squid_tf_bucket` in S3 for Terraform to store its state metadata.
  - Initialize Terraform:

        cd terraform
        terraform init
- Deploy Infrastructure:

      cd terraform
      ../bin/install-dep.sh
      ../bin/install-utils.sh
      terraform apply -auto-approve
- Configure GitHub Actions:
  - Set up GitHub Secrets in your repository settings.
  - Push code to the `main` branch to trigger the CI/CD pipeline.
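For reference, the sketch below shows one way the database credentials added to Secrets Manager during setup might be retrieved at runtime. It is a minimal illustration, not the project's actual `connect_to_database()` implementation: the `pg8000` driver, the secret name `totesys-db-credentials`, and the key names are assumptions.

```python
# Minimal sketch: fetching database credentials from AWS Secrets Manager.
# The secret name and its JSON keys are illustrative assumptions -- replace
# them with the ARN/keys configured in your own account.
import json

import boto3
import pg8000.native  # assumed driver; not confirmed by this README


def get_db_credentials(secret_name="totesys-db-credentials", region="eu-west-2"):
    """Return the secret's JSON payload as a dict."""
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


def connect_to_database():
    """Open a connection to the totesys database using the stored credentials."""
    creds = get_db_credentials()
    return pg8000.native.Connection(
        user=creds["user"],
        password=creds["password"],
        host=creds["host"],
        port=int(creds.get("port", 5432)),
        database=creds["database"],
    )
```

The same pattern applies to `connect_to_warehouse()`, pointed at the warehouse secret instead.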
./
├── bin/
│ ├── install-dep.sh # Creates an archive of dependencies for the Lambdas.
│ └── install-utils.sh # Creates an archive of utility functions for the Lambdas.
├── Makefile # Automation file for build, test, and deployment tasks.
├── mvp.png # Minimum Viable Product diagram.
├── README.md # Project documentation with instructions and information.
├── requirements-lambda.txt # Python dependencies specifically for Lambda functions.
├── requirements.txt # General Python dependencies for the project.
├── src/ # Source code directory.
│ ├── extraction_lambda/
│ │ └── main.py # Python code for the extraction Lambda function.
│ ├── load_lambda/
│ │ └── main.py # Python code for the loading Lambda function.
│ └── transform_lambda/
│ └── main.py # Python code for the transformation Lambda function.
├── terraform/ # Infrastructure as Code (IaC) directory containing Terraform configurations for:
│ ├── cloudwatch.tf # AWS CloudWatch resources (logging, monitoring).
│ ├── events.tf # AWS EventBridge resources (scheduling).
│ ├── iam.tf # AWS IAM resources (permissions).
│ ├── lambda_layers.tf # AWS Lambda layers (dependencies).
│ ├── lambda.tf # AWS Lambda functions.
│ ├── provider.tf # AWS provider configuration.
│ ├── s3.tf # AWS S3 buckets (data lake).
│ ├── sns.tf # AWS SNS resources (notifications).
│ └── vars.tf # Terraform variable definitions.
├── tests/ # Test directory for unit and integration tests.
│ ├── extraction_tests/
│ │ └── test_extraction.py # Python unit tests for the extraction Lambda.
│ ├── load_tests/
│ │ └── test_load_utils.py # Python unit tests for the loading Lambda utilities.
│ └── transform_tests/
│ └── test_transform_utils.py # Python unit tests for the transformation Lambda utilities.
└── utils/ # Utility functions directory
└── lambda_utils.py # Python utility functions used across Lambda functions.
The following command creates a Python virtual environment and runs the checks listed below:

    make all
It performs the following code checks:
- Linting
- Unit tests
- Security scans
- Code formatting
The CI/CD pipeline is implemented using GitHub Actions. It automates code checks and infrastructure builds using Terraform.
The data ingestion process extracts data from the `totesys` database and uploads it to the `ingestion-bucket` in S3. It supports both initial and continuous data extraction.
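A minimal sketch of this step is shown below, under stated assumptions: `connect_to_database()` is assumed to return a `pg8000` connection, and the table names, bucket name, and key layout are illustrative rather than the project's actual values (the real code lives in `src/extraction_lambda/main.py`).

```python
# Minimal sketch of the extraction Lambda: query totesys tables and upload the
# rows to the ingestion bucket as timestamped JSON. Table names, bucket name,
# and the shape of connect_to_database() are illustrative assumptions.
import json
from datetime import datetime, timezone

import boto3

from utils.lambda_utils import connect_to_database  # assumed to return a pg8000 connection

INGESTION_BUCKET = "ingestion-bucket"  # replace with the deployed bucket name


def extract_table(conn, table_name):
    """Read every row from one table and return a list of dicts."""
    rows = conn.run(f"SELECT * FROM {table_name};")
    columns = [col["name"] for col in conn.columns]
    return [dict(zip(columns, row)) for row in rows]


def lambda_handler(event, context):
    """Write each extracted table to S3 under a timestamped key."""
    s3 = boto3.client("s3")
    conn = connect_to_database()
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    tables = ["sales_order", "staff", "currency"]  # illustrative subset
    for table in tables:
        data = extract_table(conn, table)
        s3.put_object(
            Bucket=INGESTION_BUCKET,
            Key=f"{table}/{timestamp}.json",
            Body=json.dumps(data, default=str),
        )
    conn.close()
    return {"tables_extracted": len(tables)}
```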
The data transformation process remodels the data into the data warehouse schema and stores it in Parquet format in the `processed-bucket` in S3.
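As a minimal sketch of this step (assuming pandas with a Parquet engine such as pyarrow, and the bucket and key layout from the extraction sketch above), the Lambda reads a raw JSON extract, remodels it into a dimension table, and writes Parquet to the processed bucket. The column names are illustrative, not the project's full schema.

```python
# Minimal sketch of the transformation Lambda: read one raw JSON extract,
# remodel it into a dimension table, and write Parquet to the processed
# bucket. Bucket, key, and column names are illustrative; pandas/pyarrow assumed.
import io
import json

import boto3
import pandas as pd

INGESTION_BUCKET = "ingestion-bucket"
PROCESSED_BUCKET = "processed-bucket"


def transform_dim_staff(raw_key):
    """Build a simplified dim_staff table from one raw staff extract in S3."""
    s3 = boto3.client("s3")
    raw = s3.get_object(Bucket=INGESTION_BUCKET, Key=raw_key)["Body"].read()
    staff = pd.DataFrame(json.loads(raw))

    # Keep only the columns needed by the warehouse dimension.
    dim_staff = staff[["staff_id", "first_name", "last_name", "email_address"]]

    # Serialise to Parquet in memory and upload to the processed bucket.
    buffer = io.BytesIO()
    dim_staff.to_parquet(buffer, index=False)
    s3.put_object(
        Bucket=PROCESSED_BUCKET,
        Key="dim_staff/dim_staff.parquet",
        Body=buffer.getvalue(),
    )
    return dim_staff
```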
The data loading process loads the transformed data from S3 into the data warehouse.
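A minimal sketch of this step is shown below, assuming the `connect_to_warehouse()` helper mentioned in the setup steps returns a `pg8000` connection and using the Parquet layout from the transformation sketch; the table and column handling is illustrative only.

```python
# Minimal sketch of the load Lambda: read a Parquet file from the processed
# bucket and insert its rows into a warehouse table. Table, bucket, and the
# shape of connect_to_warehouse() are illustrative assumptions.
import io

import boto3
import pandas as pd

from utils.lambda_utils import connect_to_warehouse  # assumed to return a pg8000 connection

PROCESSED_BUCKET = "processed-bucket"


def load_table(parquet_key="dim_staff/dim_staff.parquet", table_name="dim_staff"):
    """Insert every row of one Parquet file into the given warehouse table."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=PROCESSED_BUCKET, Key=parquet_key)["Body"].read()
    frame = pd.read_parquet(io.BytesIO(body))

    conn = connect_to_warehouse()
    columns = ", ".join(frame.columns)
    placeholders = ", ".join(f":{col}" for col in frame.columns)
    for row in frame.to_dict(orient="records"):
        conn.run(
            f"INSERT INTO {table_name} ({columns}) VALUES ({placeholders});",
            **row,
        )
    conn.close()
    return len(frame)
```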
Data visualization is performed using BI tools such as AWS QuickSight.
AWS CloudWatch is used for monitoring the data pipeline and logging events.
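The snippet below is a minimal illustration of how a Lambda might emit those log events, assuming the standard Python `logging` module; the logger name and messages are illustrative.

```python
# Minimal sketch of Lambda logging: messages emitted via the standard logging
# module at INFO level or above appear in the function's CloudWatch log group.
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    logger.info("Pipeline run started")
    try:
        # ... extraction / transformation / loading work goes here ...
        logger.info("Pipeline run completed")
    except Exception:
        logger.exception("Pipeline run failed")
        raise
```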
AWS IAM roles and policies are used to control access to AWS resources. GitHub Secrets are used to store sensitive information. Run `bandit` and `pip-audit` for security vulnerability checks.
- Add additional data quality checks.
- Enhance monitoring and alerting.
- Add more data sources.
- Improve data visualization capabilities.
- Implement data lineage tracking.