British Airways Extract-Load

This project implements a modern Extract-Load pipeline for British Airways, designed to process and analyze customer review data from AirlineQuality.com. It uses Apache Airflow, AWS S3, and Docker to ingest the reviews and load them into Snowflake, where downstream transformations are handled with dbt.


🗂 Project Structure

.
├── airflow/             # Airflow configuration and DAGs
│   ├── dags/            # Airflow DAG definitions
│   ├── tasks/           # Custom task implementations
│   ├── plugins/         # Custom Airflow plugins
│   └── logs/            # Airflow execution logs
├── data/                # Data files
│   └── raw_data.csv     # Source data file
├── docker/              # Docker configuration
│   ├── docker-compose.yaml
│   └── Dockerfile
├── .env                 # Environment variables
├── requirements.txt     # Python dependencies
└── Makefile             # Project automation commands

⚙️ Technology Stack

(Architecture diagram: ba_architecture)

  • Data Processing: Python 3.12 with Pandas
  • Workflow Orchestration: Apache Airflow
  • Data Warehouse: Snowflake
  • Data Lake: AWS S3 for staging
  • Containerization: Docker

🧱 Data Architecture

1. Data Source

The project processes customer review data scraped from AirlineQuality.com, which contains detailed information about customer flight experiences.
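For illustration, a minimal crawler for the British Airways review pages might look like the sketch below; the URL pattern, query parameters, and the text_content selector are assumptions about AirlineQuality.com's markup, not code taken from this repository.

import csv

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"  # assumed listing URL

def crawl_reviews(pages: int = 10, page_size: int = 100) -> list[dict]:
    """Fetch review bodies from the paginated listing, one dict per review."""
    reviews = []
    for page in range(1, pages + 1):
        url = f"{BASE_URL}/page/{page}/?pagesize={page_size}"
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # "text_content" is the review-body class commonly seen on the site; adjust if the markup differs
        for item in soup.select("div.text_content"):
            reviews.append({"review": item.get_text(strip=True)})
    return reviews

if __name__ == "__main__":
    with open("data/raw_data.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["review"])
        writer.writeheader()
        writer.writerows(crawl_reviews(pages=1))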

2. Data Processing Pipeline

  1. Data Crawling

    • Crawl customer reviews from AirlineQuality.com
    • Store raw data as raw_data.csv
  2. Data Cleaning & Transformation

    • Clean and standardize the raw reviews with Pandas (an example cleaning task is sketched under Custom Tasks below)
    • Write the cleaned file that is staged in the next step
  3. Staging in S3

    • Upload cleaned data to AWS S3 bucket (upload_cleaned_data_to_s3)
    • Store in staging area for Snowflake ingestion
    • Maintain data versioning and audit trail
  4. Snowflake Loading

    • Use Snowflake COPY operator to load data from S3
    • Transform and load into target tables
    • Implement an incremental loading strategy (a minimal staging-and-COPY sketch follows this list)
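The two loading steps above can be sketched with the Amazon and Snowflake providers; the connection IDs, bucket, stage, and table names below are placeholders rather than values from this repository.

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def upload_cleaned_data_to_s3(local_path: str = "data/cleaned_data.csv") -> str:
    """Stage the cleaned CSV in the S3 bucket that Snowflake reads from."""
    key = "staging/british_airways/cleaned_data.csv"       # assumed key layout
    S3Hook(aws_conn_id="aws_default").load_file(
        filename=local_path,
        key=key,
        bucket_name="ba-reviews-staging",                  # assumed bucket name
        replace=True,
    )
    return key

def copy_into_snowflake() -> None:
    """Run COPY INTO against an external stage that points at the staging prefix."""
    SnowflakeHook(snowflake_conn_id="snowflake_default").run("""
        COPY INTO RAW.BA_REVIEWS                            -- assumed target table
        FROM @RAW.BA_S3_STAGE/staging/british_airways/      -- assumed external stage
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
        ON_ERROR = 'ABORT_STATEMENT'
    """)

Because Snowflake keeps per-file load metadata on the stage, re-running COPY INTO skips files that have already been ingested, which is one simple way to get the incremental behaviour described above.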

3. Data Quality Framework

  • Data validation checks (a simple row-count check is sketched below)
  • Error handling and logging
  • Pipeline monitoring and alerting
  • Snowflake data quality monitoring
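As one example of a post-load validation check, a task can assert that the target table actually received rows; the table name and connection ID are assumptions.

from airflow.exceptions import AirflowFailException
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def check_row_count(min_rows: int = 1) -> None:
    """Fail the run, and so trigger alerting, if the load produced fewer rows than expected."""
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    count = hook.get_first("SELECT COUNT(*) FROM RAW.BA_REVIEWS")[0]   # assumed target table
    if count < min_rows:
        raise AirflowFailException(f"Expected at least {min_rows} rows, found {count}")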

🧩 Project Components

📊 Airflow DAGs

Located in airflow/dags/:

  • DAG definitions for data processing workflows
  • Task scheduling and dependency management
  • Error handling and retry logic
  • Snowflake data loading and transformation tasks (a skeleton DAG is sketched below)
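A skeleton of such a DAG, with retries and a linear dependency chain, might look like the following; the schedule, default_args, and the imported task callables are illustrative placeholders.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables of the kind kept under airflow/tasks/
from tasks.british_airways import (
    crawl_reviews, clean_reviews, upload_cleaned_data_to_s3, copy_into_snowflake,
)

default_args = {
    "owner": "airflow",
    "retries": 2,                        # retry transient scrape/S3/Snowflake failures
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="british_airways_extract_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    crawl = PythonOperator(task_id="crawl_reviews", python_callable=crawl_reviews)
    clean = PythonOperator(task_id="clean_reviews", python_callable=clean_reviews)
    upload = PythonOperator(task_id="upload_cleaned_data_to_s3", python_callable=upload_cleaned_data_to_s3)
    load = PythonOperator(task_id="copy_into_snowflake", python_callable=copy_into_snowflake)

    crawl >> clean >> upload >> load     # linear dependency chain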

🛠 Custom Tasks

Located in airflow/tasks/:

  • Data processing and transformation logic
  • S3 upload operations
  • Snowflake data loading and unloading operations
  • Custom operators for specific business requirements
  • Utility functions for data handling (an example cleaning utility follows)
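An example of the kind of utility that fits here is a Pandas cleaning step between the crawl and the S3 upload; the column name and cleaning rules below are assumptions.

import pandas as pd

def clean_reviews(input_path: str = "data/raw_data.csv",
                  output_path: str = "data/cleaned_data.csv") -> str:
    """Tidy the scraped reviews before they are staged in S3."""
    df = pd.read_csv(input_path)
    df = df.drop_duplicates()
    # scraped review bodies often carry a "Trip Verified |" style prefix; keep the text after it
    df["review"] = df["review"].astype(str).str.split("|").str[-1].str.strip()
    df.to_csv(output_path, index=False)
    return output_path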

🔌 Airflow Plugins

Located in airflow/plugins/:

  • Custom hooks and operators
  • Extended Airflow functionality
  • Integration with Snowflake and S3 services (a minimal custom operator is sketched below)
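A custom operator can be as small as a thin wrapper around an existing hook; the sketch below wraps the S3 upload, and the operator and its argument names are illustrative rather than this repo's actual plugins.

from airflow.models.baseoperator import BaseOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

class LocalFileToS3Operator(BaseOperator):
    """Illustrative custom operator: push a local file to the staging bucket."""

    def __init__(self, filename: str, key: str, bucket_name: str,
                 aws_conn_id: str = "aws_default", **kwargs):
        super().__init__(**kwargs)
        self.filename = filename
        self.key = key
        self.bucket_name = bucket_name
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        S3Hook(aws_conn_id=self.aws_conn_id).load_file(
            filename=self.filename, key=self.key,
            bucket_name=self.bucket_name, replace=True,
        )
        self.log.info("Uploaded %s to s3://%s/%s", self.filename, self.bucket_name, self.key)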

📦 Key Dependencies

  • pandas==1.5.3
  • apache-airflow-providers-snowflake
  • snowflake-connector-python
  • boto3==1.35.0
  • apache-airflow-providers-amazon
