This project implements a modern Extract-Load (EL) pipeline for British Airways that processes and analyzes customer review data from AirlineQuality.com. It leverages Apache Airflow, Snowflake, AWS S3, and Docker to load the data into Snowflake, where it is subsequently transformed with dbt.
```
.
├── airflow/              # Airflow configuration and DAGs
│   ├── dags/             # Airflow DAG definitions
│   ├── tasks/            # Custom task implementations
│   ├── plugins/          # Custom Airflow plugins
│   └── logs/             # Airflow execution logs
├── data/                 # Data files
│   └── raw_data.csv      # Source data file
├── docker/               # Docker configuration
│   ├── docker-compose.yaml
│   └── Dockerfile
├── .env                  # Environment variables
├── requirements.txt      # Python dependencies
└── Makefile              # Project automation commands
```
- Data Processing: Python 3.12 with Pandas
- Workflow Orchestration: Apache Airflow
- Data Warehouse: Snowflake
- Data Lake: AWS S3 for staging
- Containerization: Docker
The project processes customer review data scraped from AirlineQuality.com, which contains detailed information about customer flight experiences.
1. **Data Crawling**
   - Crawl customer reviews from AirlineQuality.com
   - Store the raw data as `raw_data.csv`
2. **Data Cleaning & Transformation**
   - Process and clean the raw data (see the British Airways Data Cleaning Repository)
   - Standardize formats and handle missing values
   - Generate the cleaned dataset
3. **Staging in S3**
   - Upload the cleaned data to an AWS S3 bucket (`upload_cleaned_data_to_s3`; see the sketch after this list)
   - Store it in a staging area for Snowflake ingestion
   - Maintain data versioning and an audit trail
4. **Snowflake Loading**
   - Use the Snowflake COPY operator to load data from S3
   - Transform and load into target tables
   - Implement an incremental loading strategy
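The staging and loading steps might be wired together as follows. This is a minimal sketch, assuming Airflow 2.x with the Amazon and Snowflake providers installed; the connection IDs (`aws_default`, `snowflake_default`), bucket (`ba-reviews-staging`), stage (`REVIEWS_STAGE`), table (`REVIEWS_RAW`), and file paths are illustrative placeholders, not the project's actual values:

```python
# Illustrative sketch of steps 3 and 4; all names here are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


def upload_cleaned_data_to_s3(**context):
    """Stage the cleaned CSV in S3, keyed by run date for versioning/auditing."""
    hook = S3Hook(aws_conn_id="aws_default")
    hook.load_file(
        filename="/opt/airflow/data/cleaned_data.csv",  # hypothetical path
        key=f"staging/{context['ds']}/cleaned_data.csv",
        bucket_name="ba-reviews-staging",               # hypothetical bucket
        replace=True,
    )


with DAG(
    dag_id="ba_reviews_el",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    stage_to_s3 = PythonOperator(
        task_id="upload_cleaned_data_to_s3",
        python_callable=upload_cleaned_data_to_s3,
    )

    # COPY INTO pulls the staged file from an external S3 stage into the target table.
    load_to_snowflake = SnowflakeOperator(
        task_id="copy_into_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="""
            COPY INTO REVIEWS_RAW
            FROM @REVIEWS_STAGE/staging/{{ ds }}/
            FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
            ON_ERROR = 'ABORT_STATEMENT';
        """,
    )

    stage_to_s3 >> load_to_snowflake
```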
The pipeline also provides:

- Data validation checks
- Error handling and logging
- Pipeline monitoring and alerting
- Snowflake data quality monitoring
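As an illustration, a validation check can fail its task when the loaded table looks wrong, letting Airflow's retry and alerting machinery take over. A minimal sketch, assuming a hypothetical `REVIEWS_RAW` table with a `review_id` column:

```python
# Hypothetical post-load validation; table and column names are assumptions.
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


def validate_snowflake_load(**context):
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    total, distinct = hook.get_first(
        "SELECT COUNT(*), COUNT(DISTINCT review_id) FROM REVIEWS_RAW"
    )
    if total == 0:
        raise ValueError("Validation failed: REVIEWS_RAW is empty after load")
    if total != distinct:
        raise ValueError(f"Validation failed: {total - distinct} duplicate review_id rows")
```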
Located in `airflow/dags/`:
- DAG definitions for data processing workflows
- Task scheduling and dependency management
- Error handling and retry logic
- Snowflake data loading and transformation tasks
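The retry and scheduling setup presumably follows Airflow's standard `default_args` pattern; a minimal sketch with assumed values:

```python
# Typical DAG-level retry/error-handling configuration; values are assumptions.
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "airflow",
    "retries": 3,                         # re-run tasks on transient S3/Snowflake errors
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,             # surface failures for alerting
}

with DAG(
    dag_id="ba_reviews_el",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ...  # task definitions go here
```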
Located in `airflow/tasks/`:
- Data processing and transformation logic
- S3 upload operations
- Snowflake data loading and unloading operations
- Custom operators for specific business requirements
- Utility functions for data handling
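As an example of the unloading side, a task can export a table back to S3 through a Snowflake stage. A sketch, with the stage, table, and connection names assumed:

```python
# Hypothetical unload task: export a Snowflake table back to the S3 stage.
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


def unload_reviews_to_s3(**context):
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    hook.run("""
        COPY INTO @REVIEWS_STAGE/exports/reviews/
        FROM REVIEWS_RAW
        FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
        OVERWRITE = TRUE;
    """)
```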
Located in `airflow/plugins/`:
- Custom hooks and operators
- Extended Airflow functionality
- Integration with Snowflake and S3 services
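A custom operator kept under `airflow/plugins/` might take a shape like the hypothetical example below (Airflow adds the plugins folder to the Python path, so such classes can be imported directly in DAGs):

```python
# Hypothetical custom operator; names and check logic are illustrative only.
from airflow.models.baseoperator import BaseOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


class SnowflakeTableCheckOperator(BaseOperator):
    """Fail the task if a Snowflake table holds fewer rows than expected."""

    def __init__(self, table: str, min_rows: int = 1,
                 snowflake_conn_id: str = "snowflake_default", **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.min_rows = min_rows
        self.snowflake_conn_id = snowflake_conn_id

    def execute(self, context):
        hook = SnowflakeHook(snowflake_conn_id=self.snowflake_conn_id)
        count = hook.get_first(f"SELECT COUNT(*) FROM {self.table}")[0]
        if count < self.min_rows:
            raise ValueError(f"{self.table} has {count} rows, expected >= {self.min_rows}")
        self.log.info("%s passed the row-count check with %s rows", self.table, count)
```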
```
pandas==1.5.3
apache-airflow-providers-snowflake
snowflake-connector-python
boto3==1.35.0
apache-airflow-providers-amazon
```