✈️ Flight Fare Prediction
A comprehensive machine learning analysis predicting flight fares using the Bangladesh Flight Price Dataset (57,000+ records)
| View Full Analysis | Dataset Source |
📋 Table of Contents
- Project Overview
- Key Features
- Dataset
- Technical Approach
- Results
- Installation & Usage
- Technologies Used
- Contributors
🎯 Project Overview
This project explores the key drivers of flight fare variability within the Bangladesh aviation ecosystem through comprehensive data analysis and machine learning. We built predictive models to forecast ticket prices based on 17 distinct features including airline operations, route characteristics, booking timing, and seasonal patterns.
Business Problem
Understanding what influences airline ticket pricing is crucial for both travelers seeking the best deals and airlines optimizing their revenue strategies. This analysis identifies which factors, from cabin class to departure timing, have the most significant impact on fare costs.
Objectives
- Predict total flight fare (BDT) with high accuracy
- Identify the most influential pricing factors
- Compare linear vs. non-linear modeling approaches
- Provide actionable insights for travelers and industry stakeholders
✨ Key Features
Data Processing & Engineering
- ✅ Cleaned and validated 57,000+ flight records
- ✅ Engineered 15+ predictive features including:
- Temporal patterns (month, day, hour, weekend indicators)
- Route frequency metrics
- Numeric stopover conversions
- Calendar-based seasonality signals
Exploratory Data Analysis
- 📊 Comprehensive visualization suite examining:
- Price distributions and outlier patterns
- Seasonal fare trends (Regular, Eid, Hajj, Winter Holidays)
- Class-based pricing tiers (Economy → Business → First)
- Route-specific characteristics
- Correlation analysis across numerical features
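Comparisons like the seasonal fare trend above reduce to a groupby over the season column. A toy sketch follows; the fare values are illustrative and the column names are assumptions about the dataset's schema:

```python
import pandas as pd

# Toy rows standing in for the 57,000-record dataset
df = pd.DataFrame({
    "Seasonality": ["Regular", "Eid", "Regular", "Hajj"],
    "Total Fare (BDT)": [50000, 70000, 52000, 75000],
})

# Mean fare per season, the quantity behind the seasonal-trend plots
seasonal_mean = df.groupby("Seasonality")["Total Fare (BDT)"].mean()
```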
Machine Learning Pipeline
- 🔧 Production-ready sklearn pipeline with:
- Automated preprocessing (imputation, scaling, encoding)
- Multiple model architectures tested
- Robust train-test validation
- Hyperparameter optimization
📊 Dataset
Source: Flight Price Dataset of Bangladesh
Size: 57,000 simulated flight records
Features: 17 original columns including:
| Feature | Description |
|---|---|
| Airline | Carrier operating the flight |
| Aircraft Type | Model of aircraft |
| Source/Destination | Airport codes for departure/arrival |
| Class | Economy, Business, or First Class |
| Duration | Flight time in hours |
| Stopovers | Direct, 1 Stop, or 2 Stops |
| Days Before Departure | Booking advance window (1-90 days) |
| Seasonality | Regular, Eid, Hajj, or Winter Holidays |
| Total Fare (BDT) | Target variable |
Note: Base Fare and Tax & Surcharge were excluded from modeling to prevent data leakage, since the two sum to the target.
🔬 Technical Approach
1. Data Preprocessing
Key preprocessing steps:
- Removed redundant columns (Source Name, Destination Name)
- Converted datetime strings → numeric features (month, hour, day of week)
- Mapped categorical stopovers → numeric values (Direct = 0, 1 Stop = 1, 2 Stops = 2)
- Created a route frequency feature (flight popularity metric)
- One-hot encoded categorical variables
- Standardized numeric features
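The datetime and stopover conversions can be sketched as below; the raw column names ("Departure Date & Time", "Stopovers") are assumptions about the dataset's schema, and the weekend definition (Sat/Sun) may differ from the notebook's.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Datetime string -> numeric temporal features
    dep = pd.to_datetime(out["Departure Date & Time"])
    out["dep_month"] = dep.dt.month
    out["dep_hour"] = dep.dt.hour
    out["dep_dayofweek"] = dep.dt.dayofweek  # Monday = 0
    # Weekend flag (Sat/Sun here; the notebook's definition may differ)
    out["dep_is_weekend"] = (dep.dt.dayofweek >= 5).astype(int)
    # Ordered categorical stopovers -> numeric values
    stop_map = {"Direct": 0, "1 Stop": 1, "2 Stops": 2}
    out["stopovers_num"] = out["Stopovers"].map(stop_map)
    return out.drop(columns=["Departure Date & Time", "Stopovers"])
```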
2. Feature Engineering Highlights
- Temporal Features: Extracted dep_month, dep_hour, dep_dayofweek, and dep_is_weekend
- Route Analysis: Computed route_frequency to capture demand patterns
- Stopover Encoding: Converted text labels to a numeric progression
- Leakage Prevention: Dropped Base Fare and Tax columns (they sum to the target)
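One way to compute route_frequency is a groupby transform over the origin-destination pair; the Source/Destination column names here are assumptions about the schema:

```python
import pandas as pd

def add_route_frequency(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Count how often each Source->Destination pair occurs,
    # as a proxy for route popularity/demand
    out["route"] = out["Source"] + "->" + out["Destination"]
    out["route_frequency"] = out.groupby("route")["route"].transform("count")
    return out
```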
3. Modeling Strategy
We evaluated 6 model configurations across two target representations:
Linear Baselines
| Model | Target | R² | RMSE (BDT) | Notes |
|---|---|---|---|---|
| Linear Regression | Raw | 0.570 | 53,538 | Baseline OLS |
| Ridge (α = 1) | Raw | 0.570 | 53,538 | No improvement — features well-conditioned |
| Linear Regression | log1p | 0.651 | 48,265 | ~10% RMSE reduction |
| Ridge (α = 1) | log1p | 0.651 | 48,269 | Matches log OLS performance |
Insight: Log-transform helped linear models; L2 regularization showed no benefit.
Tree Ensembles
| Model | Target | R² | RMSE (BDT) | Configuration |
|---|---|---|---|---|
| RandomForest | Raw | 0.663 | 47,400 | 200 trees, min_samples_leaf = 2 |
| RandomForest | log1p | 0.638 | 49,093 | Log target underperforms |
| HistGradientBoosting | Raw | 0.677 | 46,437 | 50 iterations, lr = 0.1, l2 = 0.5 ✅ |
| HistGradientBoosting | log1p | 0.651 | 48,226 | Log target slightly inferior |
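The log1p variants in the tables above can be reproduced with scikit-learn's TransformedTargetRegressor, which fits on log1p(y) and inverts predictions back to the BDT scale with expm1. This sketch uses synthetic stand-in data, not the real fares:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the engineered features and fares
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = np.expm1(X @ np.array([0.5, 0.3, 0.2]) + 8 + rng.normal(scale=0.1, size=200))

# Fit on log1p(y); predictions come back on the raw scale via expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1)
log_model.fit(X, y)
pred = log_model.predict(X)
```

Because the inverse transform is applied automatically at predict time, RMSE can be compared against the raw-target models on the same BDT scale.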
🏆 Results
Best Model: HistGradientBoosting (Raw Target)
✅ R² Score: 0.677 (explains 67.7% of fare variance)
✅ RMSE: 46,437 BDT (≈ USD 386)
✅ 13% improvement over baseline Linear Regression
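Both headline metrics are straightforward to compute with scikit-learn; the numbers below are toy values purely to illustrate the definitions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy predictions (BDT), not the project's actual outputs
y_true = np.array([50000.0, 120000.0, 30000.0, 80000.0])
y_pred = np.array([52000.0, 115000.0, 33000.0, 78000.0])

r2 = r2_score(y_true, y_pred)                       # fraction of variance explained
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # typical error, in BDT units
```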
Key Findings
- Class is the dominant predictor
- First Class costs ~300% more than Economy
- Business costs ~100% more than Economy
- Clear pricing tiers with high separation
- Seasonality drives major price swings
- Hajj period: +42% vs. Regular season
- Eid period: +35% vs. Regular season
- Winter Holidays: +18% vs. Regular season
- Duration shows weak correlation with price (r ≈ 0.33)
- Suggests other factors (demand, competition) dominate pricing
- Airline choice has minimal impact
- Only ~12% fare difference between carriers
- Route and class matter far more
- Booking timing shows no clear pattern
- Days before departure: r ≈ -0.07 (nearly zero correlation)
- Challenges the “book early = cheaper” conventional wisdom
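The near-zero correlation can be verified with pandas' Pearson r. On synthetic data where lead time and fare are independent, r likewise hovers near zero (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: booking lead time generated independently of fare,
# mirroring the r ≈ -0.07 observed on the real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "days_before_departure": rng.integers(1, 91, size=500),
    "total_fare": rng.normal(70000, 20000, size=500),
})
r = df["days_before_departure"].corr(df["total_fare"])
```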
Model Comparison Visualization
Model Performance (RMSE in BDT):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Baseline OLS ████████████████████ 53,538
Log OLS ████████████████ 48,265
RandomForest ███████████████ 47,400
HistGradientBoosting ██████████████ 46,437 ⭐ BEST
🚀 Installation & Usage
Prerequisites
Python 3.8+
pip install -r requirements.txt
Dependencies
pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=1.0.0
matplotlib>=3.4.0
seaborn>=0.11.0
Quick Start
# Clone the repository
git clone https://github.com/ROYBRUNO81/flightprice.github.io.git
cd flight-fare-prediction
# Install dependencies
pip install -r requirements.txt
# Launch Jupyter Notebook
jupyter notebook Flight_Fare_Prediction_Final.ipynb
Running the Analysis
- Load Data: Execute cells in Section 2.1
- Preprocess: Run through Section 2.3
- EDA: Explore visualizations in Section 3
- Feature Engineering: Execute Section 4
- Model Training: Run all model cells in final section
- Evaluate: Review performance metrics and comparisons
🛠️ Technologies Used
| Category | Tools |
|---|---|
| Languages | Python 3.8+ |
| Data Processing | Pandas, NumPy |
| Visualization | Matplotlib, Seaborn |
| Machine Learning | scikit-learn (LinearRegression, Ridge, RandomForest, HistGradientBoosting) |
| Development | Jupyter Notebook, Google Colab |
| Version Control | Git, GitHub |
👥 Contributors
- Ange Christa Dushime
- Christian Ishimwe
- Bruno Ndiba Mbwaye Roy
CIS 5450 Final Project — University of Pennsylvania
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.