Analysis and Prediction of Airline Passenger Satisfaction During COVID-19

In 2020, airlines suddenly found themselves at the mercy of passengers. A massive number of flights were affected because people did not want to travel while COVID-19 was spreading worldwide. In addition, more seats had to be left unoccupied to comply with government social-distancing mandates. Airlines could no longer fill flights and had to cut ticket prices drastically to lure customers back. Under these conditions, declining passenger satisfaction became an even more important concern. This capstone project uses an Airline Passenger Satisfaction dataset to answer the key question raised by this situation: how can I help airline companies cope with declining passenger satisfaction levels?

Project Information:
Duration: 3 months
129,880 observations, 23 independent variables
Data source: https://www.kaggle.com/datasets/mysarahmadbhat/airline-passenger-satisfaction?select=data_dictionary.csv
Initial Dataset Management:

Use MySQL to split the data into flat tables, then join them back together with one additional column: total score.

Load the new tables into Tableau and compare the total score across different groups.
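The split-and-recombine step and the group comparison can be sketched as follows. This is only an illustration, not the project's actual queries: SQLite (from the Python standard library) stands in for MySQL, the four rows and all table and column names are made up, and the total score is assumed to be a simple sum of the rating columns.

```python
import sqlite3

# Hypothetical miniature of the survey data; names and values are illustrative only.
rows = [
    # id, customer_type, travel_class, online_boarding, checkin_service, wifi
    (1, "Returning", "Business", 5, 4, 4),
    (2, "Returning", "Economy", 4, 4, 3),
    (3, "First-time", "Business", 3, 2, 2),
    (4, "First-time", "Economy", 2, 3, 1),
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch

# Split the data into two flat tables that share the passenger id.
conn.executescript("""
    CREATE TABLE passenger (id INTEGER PRIMARY KEY, customer_type TEXT, travel_class TEXT);
    CREATE TABLE rating (id INTEGER PRIMARY KEY, online_boarding INTEGER,
                         checkin_service INTEGER, wifi INTEGER);
""")
conn.executemany("INSERT INTO passenger VALUES (?, ?, ?)", [r[:3] for r in rows])
conn.executemany("INSERT INTO rating VALUES (?, ?, ?, ?)", [(r[0],) + r[3:] for r in rows])

# Join the tables back together and derive the assumed total score.
conn.execute("""
    CREATE TABLE scored AS
    SELECT p.id, p.customer_type, p.travel_class,
           r.online_boarding + r.checkin_service + r.wifi AS total_score
    FROM passenger AS p
    JOIN rating AS r ON r.id = p.id
""")

# Compare the average total score across customer types.
group_means = conn.execute("""
    SELECT customer_type, AVG(total_score)
    FROM scored
    GROUP BY customer_type
    ORDER BY customer_type
""").fetchall()
print(group_means)
```

The same GROUP BY comparison is what a Tableau bar chart of average total score by customer type would display.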

Logistic Regression and Classification Tree:

The departure delay and arrival delay variables have very similar values (a correlation coefficient of 0.96), so only one of the two columns is kept to avoid multicollinearity.
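A minimal sketch of this correlation check, assuming Pearson's coefficient is the measure behind the 0.96 figure. The delay values below are invented for illustration, and the 0.9 cutoff for dropping a column is an assumed rule of thumb, not something stated in the project.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical delays in minutes; the real columns have 129,880 rows.
dep_delay = [0, 5, 12, 30, 45, 90]
arr_delay = [0, 3, 15, 28, 50, 85]

r = pearson_r(dep_delay, arr_delay)
drop_one_column = abs(r) > 0.9  # assumed threshold for "too similar"
print(round(r, 3), drop_one_column)
```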

The Variance Inflation Factor (VIF) is also used to detect multicollinearity; it confirms that multicollinearity exists between arr_delay and dep_delay, with a VIF above 12.
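The two numbers are consistent with each other. VIF is defined as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. If dep_delay were the only other predictor, R² would equal the squared correlation, so the pairwise value gives a lower bound on the VIF in the full model. A quick check, assuming nothing beyond the reported 0.96:

```python
def vif_from_r_squared(r_squared):
    """Variance Inflation Factor for a predictor whose auxiliary regression has this R^2."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)

r = 0.96  # correlation between arr_delay and dep_delay reported above
vif_lower_bound = vif_from_r_squared(r ** 2)
print(round(vif_lower_bound, 2))  # 12.76
```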

Split the data into 80% training and 20% test sets, then perform variable selection with AIC, BIC, and LASSO, comparing the candidate models on mean residual variance.

The best model has a mean residual variance = 0.668.
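The split and the two information criteria can be sketched in plain Python. The log-likelihoods and parameter counts below are invented placeholders rather than results from this project, the seed is arbitrary, and LASSO is left out because it requires fitting a penalized model.

```python
import math
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

train, test = train_test_split(range(100))
n = len(train)

# Hypothetical candidates: model name -> (log-likelihood on the training set, parameter count)
candidates = {"full": (-50.0, 10), "reduced": (-52.0, 5)}

best_by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_by_bic = min(candidates, key=lambda m: bic(*candidates[m], n))
print(len(train), len(test), best_by_aic, best_by_bic)
```

Both criteria reward fit and penalize the number of parameters; BIC penalizes extra parameters more heavily as the sample grows, so it tends to choose smaller models than AIC.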

Goals

Discover relationships among passenger types, rating metrics, and overall satisfaction level. Help airlines address the factors that most influence passenger feedback, and predict overall satisfaction with models.

Additional Analysis on Passenger Types:

Most respondents are returning passengers; they rated about 3 points higher than first-time passengers regardless of flight class.

The average scores of these three variables across customer types and classes almost all fall below average for “Neutral or Dissatisfied” passengers; similarly, for “Satisfied” passengers, most group scores are above average.

Results and Conclusion:
  • Online boarding, check-in service, and in-flight Wi-Fi are the most significant factors, and each is positively correlated with the final satisfaction level.

  • Returning customers are more likely to give a higher score than first-time customers.

  • Female passengers tend to give a lower score regardless of whether they are “Satisfied” or which class they are in. 

  • The classification tree has a better out-of-sample AUC than the logistic regression model. However, because the asymmetric cost of the classification tree is almost 4 times that of the logistic regression model, the logistic regression model is recommended for future predictions.
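The two metrics in the last point can pull in different directions, which a small sketch makes concrete. The labels and scores below are made up. The 5:1 penalty for false negatives versus false positives and the matching cutoff of 1/6 are assumptions chosen for illustration; the weights actually used in the project are not stated here.

```python
def auc(labels, scores):
    """Area under the ROC curve: the chance a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def asymmetric_cost(labels, scores, cutoff, fn_weight=5.0, fp_weight=1.0):
    """Average misclassification cost with assumed, unequal error weights."""
    total = 0.0
    for y, s in zip(labels, scores):
        predicted = 1 if s > cutoff else 0
        if y == 1 and predicted == 0:
            total += fn_weight  # missed a positive case
        elif y == 0 and predicted == 1:
            total += fp_weight  # flagged a negative case
    return total / len(labels)

# Hypothetical test-set labels (1 = satisfied) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.1, 0.7, 0.3, 0.2]

model_auc = auc(labels, scores)
model_cost = asymmetric_cost(labels, scores, cutoff=1 / 6)
print(round(model_auc, 3), round(model_cost, 3))
```

AUC summarizes ranking quality over all cutoffs, while the asymmetric cost is evaluated at one cutoff with unequal penalties, so a model can lead on AUC and still cost more at the chosen operating point.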

Supplementary Works:

Kexin Wang

©2022 by Kate's Portfolio.