Access our Dashboard: British Airways Dashboard
- british_airways_data_cleaning — Owner: DucLe-2005
  Cleans raw scraped data and standardizes formats using modular Python functions.
- british_airways_extract_load — Owner: vietlam2002
  Scrapes customer reviews, stages them in S3, then loads them to Snowflake via Airflow-compatible ETL scripts.
- british_airways_transformation — Owner: MarkPhamm
  Handles dbt-based data transformation on Snowflake with CI/CD workflows via GitHub Actions.
- british_airways_dashboard_website — Owner: nguyentienTCU
  A dashboard website for visualizing insights from processed airline reviews.
- bristish_airways_analysis — Owner: trungdam512
  Focuses on EDA, ML models, and sentiment analysis.
- Data Engineers: Leonard Dau, Thieu Nguyen, Viet Lam Nguyen
- Software Engineers: Tien Nguyen, Anh Duc Le
- Data Scientists: Robin Tran, Trung Dam
- Scrum Master: Hien Dinh
This end-to-end analytics project implements a modern data pipeline for British Airways, covering extraction, transformation, loading, and visualization of customer review data from AirlineQuality.com. The architecture leverages industry-standard tools and cloud services to create a robust, scalable analytics solution.
Self-selection bias: While analyzing reviews of British Airways, it's crucial to acknowledge the presence of self-selection sampling bias. Similar to social media platforms like Yelp, individuals who voluntarily submit reviews may have had extreme experiences, affiliations with the airline, or simply different motivations compared to those who do not provide feedback. Because of this bias, KPIs and review scores will likely skew more negative than the experience of the general customer population. However, it's important to clarify that our aim is not to generalize findings about the entire population. Instead, we focus on identifying specific areas for improvement that British Airways can address.
The extraction layer for the British Airways data pipeline handles the acquisition of customer review data from AirlineQuality.com. This module is responsible for scraping raw review data, storing it appropriately, and preparing it for subsequent processing stages.
- Repository: british_airways_extract_load
- Python 3.12 with Pandas: Core data processing and manipulation
- Apache Airflow: Workflow orchestration and scheduling
- AWS S3: Data lake storage for both raw and processed data
- Docker: Containerization for consistent development and deployment
- Snowflake: Target data warehouse for storing processed data
The primary data source is AirlineQuality.com, which provides:
- Customer ratings
- Detailed review text
- Flight information
- Customer metadata
- Service quality assessments
The extraction process begins with the `scrape_british_data` task in the Airflow DAG, which executes a Python script (a minimal sketch follows the list below) to:
- Connect to AirlineQuality.com
- Navigate through review pages
- Extract structured data from HTML content
- Compile results into a raw dataset
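The scraper script itself is not reproduced in this document. As a rough, hedged sketch of what tasks/scraper_extract/scraper.py might do (the page URL pattern and CSS selectors are assumptions for illustration, not the repository's actual code):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumed review-listing URL pattern for AirlineQuality.com; the real script may differ.
BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways/page/{page}/"


def scrape_reviews(pages: int = 2) -> pd.DataFrame:
    """Collect review titles and bodies across a few listing pages."""
    rows = []
    for page in range(1, pages + 1):
        response = requests.get(BASE_URL.format(page=page), timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # "article.review" and the child tags are hypothetical selectors.
        for article in soup.select("article.review"):
            title_tag = article.select_one("h2")
            rows.append({
                "title": title_tag.get_text(strip=True) if title_tag else None,
                "review_text": article.get_text(" ", strip=True),
            })
    return pd.DataFrame(rows)


if __name__ == "__main__":
    # The Airflow task expects the raw output as raw_data.csv in the data directory.
    scrape_reviews(pages=1).to_csv("raw_data.csv", index=False)
```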
# From main_dag.py - Extract task definition
scrape_british_data = BashOperator(
    task_id="scrape_british_data",
    bash_command="chmod -R 777 /opt/***/data && python /opt/airflow/tasks/scraper_extract/scraper.py"
)
- Raw data is initially stored as `raw_data.csv` in the project's data directory
- After successful extraction, a notification task confirms completion:
note = BashOperator(
    task_id="note",
    bash_command="echo 'Succesfull extract data to raw_data.csv'"
)
Following extraction, the data undergoes initial cleaning:
clean_data = BashOperator(
    task_id='clean_data',
    bash_command="python /opt/airflow/tasks/transform/transform.py"
)
The cleaning process:
- Standardizes formats
- Handles missing values
- Removes duplicates
- Ensures data type consistency
- Prepares data for staging
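As a rough illustration of this step (the actual logic lives in /opt/airflow/tasks/transform/transform.py and is not shown here), a minimal pandas sketch with placeholder file paths might look like:

```python
import pandas as pd

# Placeholder paths; the real script reads from and writes to the Airflow data directory.
df = pd.read_csv("raw_data.csv")

df = df.drop_duplicates()            # remove duplicate reviews
df = df.dropna(how="all")            # drop rows with no usable content
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]  # consistent column names

df.to_csv("cleaned_data.csv", index=False)
```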
The cleaned data is uploaded to AWS S3 for staging:
upload_cleaned_data_to_s3 = BashOperator(
    task_id='upload_cleaned_data_to_s3',
    bash_command="chmod -R 777 /opt/airflow/data && python /opt/airflow/tasks/upload_to_s3.py"
)
- IAM roles and permissions are configured to restrict access
- Encrypted transmission ensures data security
- Versioning is enabled to maintain an audit trail
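A minimal sketch of what a script like tasks/upload_to_s3.py might do with boto3; the bucket name and object key are placeholders, and credentials are assumed to come from environment variables or an attached IAM role:

```python
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/opt/airflow/data/cleaned_data.csv",   # local file staged by the cleaning task
    Bucket="british-airways-staging",                # placeholder bucket name
    Key="staging/cleaned_data.csv",                  # placeholder object key
)
```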
The entire extraction process is orchestrated via Apache Airflow:
with DAG(
    dag_id="british_pipeline",
    schedule_interval=schedule_interval,
    default_args=default_args,
    start_date=start_date,
    catchup=True,
    max_active_runs=1
) as dag:
    # Task definitions here
- Schedule: Daily execution (`timedelta(days=1)`)
- Retry Logic: Configured to retry failed tasks after 5 minutes
- Dependencies: Tasks are chained to ensure proper execution order
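Illustrative values for the variables referenced in the DAG definition above; the exact settings in main_dag.py may differ, but they would follow this shape:

```python
from datetime import datetime, timedelta

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),   # retry failed tasks after 5 minutes
}
schedule_interval = timedelta(days=1)      # daily execution
start_date = datetime(2024, 1, 1)          # placeholder start date
```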
The extraction portion of the pipeline follows this sequence:
scrape_british_data >> note >> clean_data >> note_clean_data >> upload_cleaned_data_to_s3
After staging in S3, data is loaded to Snowflake:
snowflake_copy_operator = BashOperator(
    task_id='snowflake_copy_from_s3',
    bash_command="pip install snowflake-connector-python python-dotenv && python /opt/airflow/tasks/snowflake_load.py"
)
- Uses Snowflake's COPY command for efficient data loading
- Configures column mapping and data type conversion
- Implements error handling and validation
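A hedged sketch of the load step in tasks/snowflake_load.py, assuming the snowflake-connector-python client, environment-variable credentials, and placeholder names for the target table, external stage, and file format:

```python
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse=os.environ.get("SNOWFLAKE_WAREHOUSE"),
    database=os.environ.get("SNOWFLAKE_DATABASE"),
    schema=os.environ.get("SNOWFLAKE_SCHEMA"),
)
try:
    conn.cursor().execute(
        """
        COPY INTO raw_reviews                 -- placeholder target table
        FROM @british_airways_stage           -- placeholder external stage backed by the S3 bucket
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
        """
    )
finally:
    conn.close()
```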
This layer transforms raw British Airways customer review data into a standardized, analysis-ready format.
- Repository: british_airways_data_cleaning
- Technology Stack: Python 3.12.5 with Pandas, NumPy, Matplotlib, and Seaborn
- Converts column names to snake_case
- Standardizes special characters and naming conventions
- Renames specific columns for clarity (e.g., "country" → "nationality")
- Standardizes dates to ISO 8601 format ("YYYY-MM-DD")
- Handles both submission dates and flight dates
- Example: "19th March 2025" → "2025-03-19"
- Extracts verification status from review text
- Creates boolean "verify" column
- Removes verification prefix from review content
- Cleans nationality field (e.g., "United Kingdom (UK)" → "United Kingdom")
- Parses route information into structured components:
- Origin city and airport
- Destination city and airport
- Transit points (if applicable)
- Handles both direct and connecting flight formats
- Normalizes aircraft naming conventions
- Standardizes Boeing and Airbus nomenclature
- Example: "Aircraft: B777-300" → "Boeing 777-300"
- Converts all rating fields to numeric format
- Uses Int64 data type to properly handle missing values
- Standardizes rating scales across all categories
- Null Value Handling: Preserves legitimate nulls while ensuring consistent types
- Type Consistency: Enforces appropriate data types for each column
- Format Standardization: Ensures consistent formats for dates, aircraft names, and locations
- Edge Case Management: Handles various non-standard inputs and formats
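A condensed sketch of a few of these rules (date standardization, aircraft normalization, and Int64 ratings); the actual repository organizes this logic into modular functions, and the regular expressions below are simplified assumptions:

```python
import re

import pandas as pd


def standardize_date(value: str) -> str:
    """'19th March 2025' -> '2025-03-19' (ISO 8601)."""
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", value)
    return pd.to_datetime(cleaned, format="%d %B %Y").strftime("%Y-%m-%d")


def normalize_aircraft(value: str) -> str:
    """'Aircraft: B777-300' -> 'Boeing 777-300' (simplified rule set)."""
    value = value.replace("Aircraft:", "").strip()
    value = re.sub(r"^B(?=7\d{2})", "Boeing ", value)
    value = re.sub(r"^A(?=3\d{2})", "Airbus A", value)
    return value


df = pd.DataFrame({
    "date_submitted": ["19th March 2025"],
    "aircraft": ["Aircraft: B777-300"],
    "seat_comfort": ["4"],
})
df["date_submitted"] = df["date_submitted"].apply(standardize_date)
df["aircraft"] = df["aircraft"].apply(normalize_aircraft)
df["seat_comfort"] = pd.to_numeric(df["seat_comfort"], errors="coerce").astype("Int64")  # nullable integer ratings
print(df)
```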
The cleaned dataset includes standardized fields for:
- Customer information (name, nationality)
- Review metadata (submission date, verification status)
- Flight details (aircraft, route, date flown, traveler type)
- Ratings across multiple service categories
- Derived fields (origin/destination cities and airports)
- Input: Receives raw data from the extraction layer
- Processing: Triggered by the Airflow DAG task `clean_data`
- Output: Produces cleaned data for S3 upload and subsequent Snowflake loading
This layer implements a dimensional modeling approach to transform cleaned British Airways customer review data into analytics-ready structures.
- Repository: british_airways_transformation
- Technology Stack:
- dbt (data build tool) for transformations
- Snowflake as the data warehouse
- Apache Airflow with Astronomer for orchestration
- GitHub Actions for CI/CD automation
The project implements a dimensional star schema:
- `fct_review`: Core fact table with one row per customer review per flight
  - Contains quantitative metrics (ratings)
  - Boolean indicators (verified, recommended)
  - Foreign keys to dimension tables
- `dim_customer`: Customer identity and demographic information
- `dim_aircraft`: Aircraft details including manufacturer and model
- `dim_location`: Airport and city information for origin, destination, and transit points
- `dim_date`: Calendar and fiscal date tracking for both submission and flight dates
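Although the actual transformations are implemented as dbt SQL models, a small pandas sketch with toy data can illustrate the dimensional pattern of surrogate keys on dimensions and foreign keys on the fact table:

```python
import pandas as pd

# Toy cleaned-review rows standing in for the Snowflake source table; illustrative only.
reviews = pd.DataFrame({
    "customer_name": ["A. Smith", "B. Jones"],
    "aircraft": ["Boeing 777-300", "Airbus A320"],
    "overall_rating": [8, 3],
    "recommended": [True, False],
})

# Build a dimension with a surrogate key, then join it back to form the fact table.
dim_aircraft = (
    reviews[["aircraft"]]
    .drop_duplicates()
    .reset_index(drop=True)
    .rename_axis("aircraft_key")
    .reset_index()
)

fct_review = reviews.merge(dim_aircraft, on="aircraft").drop(columns=["aircraft"])
print(fct_review[["aircraft_key", "overall_rating", "recommended"]])
```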
- Source data ingestion from cleaned dataset in Snowflake
- Staging models to prepare data for dimensional modeling
- Core dimension table creation and enrichment
- Fact table construction with foreign key relationships
- Final views and aggregations for business users
- Models organized in layers (staging → dimensions → facts → reporting)
- Incremental loading strategy for efficiency
- Documentation and tests integrated into models
- Version control with Git
- Schema Tests: Column constraints, uniqueness, relationships
- Custom Tests: Business logic validation
- Freshness Checks: Data recency monitoring
- Completeness Validation: Coverage of expected data points
- Triggers:
  - Code pushes to main branch
  - Pull requests
  - Scheduled runs at 00:00 UTC every Monday
  - Manual execution option
- Workflow Steps:
  - Environment setup with Python 3.12
  - Dependencies installation
  - dbt model execution against Snowflake
  - Status notifications via email
- Pipeline Status: Tracked via GitHub Actions badges
- Upstream: Receives data from data cleaning layer via Snowflake
- Downstream: Produces analytics-ready dimensional tables for BI tools
- Orchestration: Executed as part of the overall data pipeline via Airflow
This layer delivers interactive visualizations and analytical insights derived from the transformed British Airways customer review data.
- Repository: british_airways_dashboard_website
- Live Dashboard: British Airways Analytics Dashboard
- Technology Stack:
- Next.js for frontend framework
- TailwindCSS for styling
- Chart.js for data visualization
- LangChain for RAG implementation
- ChromaDB for vector database storage
- Interactive KPI Cards: Real-time metrics and performance indicators
- Multi-dimensional Filtering: Analysis by route, aircraft type, and customer segment
- Responsive Design: Mobile and desktop optimized interface
- Data Explorer: Custom query builder for ad-hoc analysis
- Time Series Analysis: Time series charts for tracking rating trends
- Aircraft Analysis: Aircraft model performance analysis
- Route Analysis: Route map with performance overlays
- RAG Chatbot: LangChain-powered question answering system
- Query Interface: Natural language processing for data exploration
- Context-Aware Responses: Leveraging ChromaDB for semantic retrieval (see the retrieval sketch after the integration notes below)
- Sentiment Analysis: Visual representation of review sentiment
- Feature Importance: Key drivers of customer satisfaction
- Executive Summary: High-level overview of performance metrics
- Service Quality Tracker: Detailed breakdown of rating categories
- Customer Segment Analysis: Demographic and preference-based insights
- Aircraft Performance Comparison: Rating variations across fleet types
- Route Analysis: Performance metrics by origin-destination pairs
- Temporal Patterns: Seasonal and trend-based visualizations
- Data Source: Connects to transformed data in Snowflake
- API Layer: RESTful endpoints for dynamic data retrieval
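As a rough illustration of the ChromaDB-backed semantic retrieval behind the Context-Aware Responses feature above (the dashboard wires this through LangChain; the collection name and review snippets here are placeholders):

```python
import chromadb

# Placeholder review snippets; in the real dashboard these come from the Snowflake tables.
reviews = [
    "Cabin crew were attentive and the seat was comfortable.",
    "Long queue at Heathrow and ground staff were unhelpful.",
    "Food in Business Class was disappointing for the price.",
]

client = chromadb.Client()  # in-memory client; a deployed app would persist its collection
collection = client.create_collection(name="ba_reviews")
collection.add(documents=reviews, ids=[f"review_{i}" for i in range(len(reviews))])

# Retrieve the reviews most relevant to a natural-language question.
results = collection.query(
    query_texts=["How do passengers feel about ground staff?"],
    n_results=2,
)
print(results["documents"])
```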
- Economy class customers prioritize staff services
- Negative experiences concentrated at London airports (Heathrow and Gatwick)
- 95% of complaints relate to insufficient ground staff support and staff attitude
- Enhance staff training programs
- Increase staff presence at London airports
- Implement prompt feedback mechanisms
- Non-economy customers prioritize food quality and seat comfort
- Business travelers expect premium service commensurate with pricing
- Complaints focus on cramped Business Class seating and poor food quality
- Redesign Business Class seating arrangement
- Elevate in-flight dining quality and presentation