Predicting Delayed Flights. The file contains information
on all commercial flights departing the Washington, DC area and arriving at New
York during January 2004. For each flight there is information on the departure and
arrival airports, the distance of the route, the scheduled time and date of the flight, and
so on. The variable that we are trying to predict is whether or not a flight is delayed.
A delay is defined as an arrival that is at least 15 minutes later than scheduled.
Data preprocessing. Bin the scheduled departure time (CRS.DEP TIME) into 8
bins. This will avoid treating the departure time as a continuous predictor, because
it is reasonable to expect delays to be related to rush-hour times. (Note that these data are
not stored in JMP with a time format, so you’ll need to explore the best way to bin
them; two options are (1) the formula editor and (2) the Make Binning
Formula column utility.) Partition the data into training and validation sets.
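For readers who want to see the preprocessing logic outside JMP, the following is a minimal Python sketch of the binning and partitioning steps. It assumes, purely for illustration, that scheduled departure times are stored as HHMM integers (for example, 1455 for 2:55 PM) and that 8 equal-width 3-hour bins are acceptable. The function names `bin_departure_time` and `partition`, the 60/40 split, the seed, and the sample times are all made up here and are not taken from the exercise data.

```python
import random


def bin_departure_time(hhmm, n_bins=8):
    """Map an HHMM integer (assumed format, e.g. 1455) to a bin index 0..n_bins-1."""
    hour, minute = divmod(hhmm, 100)
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"not a valid HHMM time: {hhmm}")
    minutes_since_midnight = hour * 60 + minute
    bin_width = 24 * 60 / n_bins  # 180 minutes when n_bins == 8
    return int(minutes_since_midnight // bin_width)


def partition(rows, train_fraction=0.6, seed=1):
    """Shuffle rows reproducibly and split them into (training, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical scheduled departure times, not real flight records.
sample_times = [600, 730, 1245, 1455, 1700, 2120]
binned = [bin_departure_time(t) for t in sample_times]
print(binned)  # -> [2, 2, 4, 4, 5, 7]

train, valid = partition(list(zip(sample_times, binned)))
print(len(train), len(valid))  # -> 4 2
```

Converting to minutes since midnight before binning matters under the HHMM assumption: the raw integers jump from 1259 to 1300 for a one-minute difference, so equal-width bins on the raw values would not correspond to equal stretches of time.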
a. Fit a classification tree to the flight delay variable using all the relevant predictors
(use the binned version of the departure time) and the validation column. Do not
include DEP TIME (actual departure time) in the model because it is unknown at
the time of prediction (unless we are predicting delays after the plane
takes off, which is unlikely).
i. How many splits are in the final model?
ii. How many variables are involved in the splits?
iii. Which variables contribute the most to the model?
iv. Which variables were not involved in any of the splits?
v. Express the resulting tree as a set of rules.
vi. If you needed to fly between DCA and EWR on a Monday at 7 AM, would
you be able to use this tree to predict whether the flight will be delayed? What
other information would you need? Is this information available in practice?
What information is redundant?
b. Fit another tree, this time using the original scheduled departure time rather than
the binned version. Save the formula for this model to the data table (we’ll return
to this in a future exercise).
i. Compare this tree to the original, in terms of the number of splits and the
number of variables involved. What are the key differences?
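Several of the questions above (counting splits, listing the variables involved, and expressing a tree as rules) amount to reading the structure of a fitted tree. The sketch below shows that reading process in Python on a hand-built toy tree. The nested-dict representation, the helper names `tree_to_rules` and `split_features`, and the splits on `dep_bin` and `day_of_week` are invented for illustration only; they are not the tree JMP will produce for the flight data and say nothing about which variables actually matter.

```python
def tree_to_rules(node, conditions=()):
    """Return one 'IF ... THEN ...' rule per leaf of a binary tree of nested dicts."""
    if "predict" in node:  # leaf node: emit the accumulated path as a rule
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN {node['predict']}"]
    below = f"{node['feature']} < {node['threshold']}"
    at_or_above = f"{node['feature']} >= {node['threshold']}"
    return (tree_to_rules(node["left"], conditions + (below,))
            + tree_to_rules(node["right"], conditions + (at_or_above,)))


def split_features(node):
    """Return the feature used at every split, one list entry per split."""
    if "predict" in node:
        return []
    return ([node["feature"]]
            + split_features(node["left"])
            + split_features(node["right"]))


# Hypothetical toy tree; thresholds and outcomes are made up.
toy_tree = {
    "feature": "dep_bin", "threshold": 5,
    "left": {"predict": "ontime"},
    "right": {
        "feature": "day_of_week", "threshold": 6,
        "left": {"predict": "delayed"},
        "right": {"predict": "ontime"},
    },
}

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)

used = split_features(toy_tree)
print("splits:", len(used), "variables:", sorted(set(used)))
```

Each leaf yields exactly one rule, so a tree with k splits produces k + 1 rules. The number of distinct variables can be smaller than the number of splits when a variable is split on more than once, which is worth keeping in mind when comparing the binned and unbinned trees.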