Each group's data are drawn from a large dataset of domestic flights operated by a particular airline.
- If the data file name starts with “DL”, the flights were operated by Delta Air Lines. If the file name starts with “AA”, the flights were operated by American Airlines.
- If the data file name contains “JFK”, the origin airport of all the flights is John F. Kennedy International Airport. If the file name contains “LGA”, the origin airport is LaGuardia Airport.
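The naming convention above can be checked programmatically. The following is a minimal sketch, not part of the project materials: the function name `describe_data_file` and the example file name `DL_JFK_train.csv` are hypothetical, since the exact file names are not specified here.

```python
from pathlib import Path

# Prefix and code tables taken from the naming convention described above.
AIRLINE_PREFIXES = {"DL": "Delta Air Lines", "AA": "American Airlines"}
ORIGIN_CODES = {
    "JFK": "John F. Kennedy International Airport",
    "LGA": "LaGuardia Airport",
}


def describe_data_file(file_name):
    """Return (airline, origin) inferred from a data file name.

    Either element is None when the name does not match the convention.
    This helper is hypothetical and only illustrates the rules above.
    """
    name = Path(file_name).name  # ignore any directory part
    airline = next(
        (label for prefix, label in AIRLINE_PREFIXES.items() if name.startswith(prefix)),
        None,
    )
    origin = next(
        (label for code, label in ORIGIN_CODES.items() if code in name),
        None,
    )
    return airline, origin
```

For example, calling `describe_data_file("DL_JFK_train.csv")` on this hypothetical file name would identify Delta Air Lines flights departing from John F. Kennedy International Airport.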
The objective of the project is to build a regression model that predicts the arrival delay. Each group has two data sets: a training set for building the regression model and a validation set for evaluating it. Each data set includes the following variables:
- dep_time: actual departure time
- sched_dep_time: scheduled departure time
- dep_delay: departure delay
- arr_time: actual arrival time
- sched_arr_time: scheduled arrival time
- arr_delay: arrival delay (the response variable)
- flight: flight number
- tailnum: tail number
- dest: destination airport
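The build-then-validate workflow can be sketched as follows. This is only an illustration under stated assumptions, not the required model: it assumes the data are CSV files with the column names listed above, uses `dep_delay` as the single predictor (an arbitrary choice made here for brevity), skips rows whose values are missing or non-numeric, and measures validation error with root mean squared error. All function names are hypothetical.

```python
import csv
import math


def load_rows(path):
    """Read a CSV file into a list of dicts, one per flight (assumes CSV input)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def to_xy(rows, predictor="dep_delay", response="arr_delay"):
    """Extract numeric (x, y) pairs, skipping rows with missing or non-numeric values."""
    pairs = []
    for row in rows:
        try:
            x = float(row[predictor])
            y = float(row[response])
        except (KeyError, TypeError, ValueError):
            continue  # e.g. an empty cell or "NA"
        if math.isnan(x) or math.isnan(y):
            continue
        pairs.append((x, y))
    return pairs


def fit_simple_ols(pairs):
    """Fit y = intercept + slope * x by ordinary least squares (closed form)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(pairs, intercept, slope):
    """Root mean squared prediction error on a set of (x, y) pairs."""
    squared_errors = ((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    return math.sqrt(sum(squared_errors) / len(pairs))


# Intended usage (file names are placeholders, not the real ones):
#   train = to_xy(load_rows("training.csv"))
#   valid = to_xy(load_rows("validation.csv"))
#   intercept, slope = fit_simple_ols(train)   # build on the training set
#   print(rmse(valid, intercept, slope))       # evaluate on the validation set
```

The model is fitted on the training set only, and the validation set is used solely to measure prediction error, so the reported error reflects performance on flights the model has not seen. A fuller model would likely draw on more of the listed variables.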