Predicting Delayed Flights

Predicting Delayed Flights. The file contains information

on all commercial flights departing the Washington, DC area and arriving at New

York during January 2004. For each flight there is information on the departure and

arrival airports, the distance of the route, the scheduled time and date of the flight, and

so on. The variable that we are trying to predict is whether or not a flight is delayed.

A delay is defined as an arrival that is at least 15 minutes later than scheduled.
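
The delay definition above can be written as a simple threshold rule. The following is a minimal sketch, assuming that both arrival times are given as minutes since midnight on the same day; the function and constant names are hypothetical and do not come from the data file.

```python
# Threshold taken from the definition above: at least 15 minutes late.
DELAY_THRESHOLD_MINUTES = 15


def is_delayed(scheduled_arrival_min, actual_arrival_min):
    """Return True if the arrival is at least 15 minutes later than scheduled.

    Assumption: both arguments are minutes since midnight on the same day.
    """
    lateness = actual_arrival_min - scheduled_arrival_min
    return lateness >= DELAY_THRESHOLD_MINUTES
```

Note that the comparison is inclusive: an arrival exactly 15 minutes late counts as delayed, while one that is 14 minutes late does not.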

Data preprocessing. Bin the scheduled departure time (CRS_DEP_TIME) into 8

bins. This will avoid treating the departure time as a continuous predictor, because

it is reasonable that delays are related to rush-hour times. (Note that these data are

not stored in JMP with a time format, so you’ll need to explore the best way to bin

this data – two options are (1) via the formula editor and (2) using the Make Binning

Formula column utility.) Partition the data into training and validation sets.
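
Outside JMP, the same preprocessing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scheduled departure time is assumed to be stored as an integer in HHMM form (for example, 1455 for 2:55 PM), the 8 bins are assumed to be equal-width 3-hour blocks, and the 60/40 split and the function names are hypothetical choices. The exercise does not prescribe any of these.

```python
import random


def bin_departure_time(hhmm, n_bins=8):
    """Map an integer HHMM clock time (e.g. 1455) to a bin index 0..n_bins-1.

    Assumption: equal-width bins over the 24-hour day (3 hours each for 8 bins).
    """
    hours, minutes = divmod(hhmm, 100)
    if not (0 <= hours < 24 and 0 <= minutes < 60):
        raise ValueError(f"not a valid HHMM time: {hhmm}")
    minutes_since_midnight = hours * 60 + minutes
    bin_width = 24 * 60 / n_bins  # 180 minutes when n_bins is 8
    return int(minutes_since_midnight // bin_width)


def partition(rows, train_fraction=0.6, seed=1):
    """Randomly split rows into (training, validation) lists.

    The 60/40 default is an assumed split, not one taken from the exercise.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With these assumptions, a 7:00 AM departure (700) falls in bin 2, which covers 6:00 AM to 8:59 AM. Equal-width bins are only one option; bins aligned with rush-hour periods may be more informative.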

a. Fit a classification tree to the flight delay variable using all the relevant predictors

(use the binned version of the departure time) and the validation column. Do not

include DEP_TIME (actual departure time) in the model because it is unknown at

the time of prediction (unless we are doing our predicting of delays after the plane

takes off, which is unlikely).

i. How many splits are in the final model?

ii. How many variables are involved in the splits?

iii. Which variables contribute the most to the model?

iv. Which variables were not involved in any of the splits?

v. Express the resulting tree as a set of rules.

vi. If you needed to fly between DCA and EWR on a Monday at 7 AM, would

you be able to use this tree to predict whether the flight will be delayed? What

other information would you need? Is this information available in practice?

What information is redundant?

b. Fit another tree, this time using the original scheduled departure time rather than

the binned version. Save the formula for this model to the data table (we’ll return

to this in a future exercise).

i. Compare this tree to the original, in terms of the number of splits and the

number of variables involved. What are the key differences?
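
For the questions on the number of splits, the variables involved, and expressing the tree as rules (a.i, a.ii, a.v, and b.i), it may help to see how these quantities are read from a tree. The sketch below uses a small hand-built tree stored as nested dictionaries. The tree, its variable names (DEP_TIME_BIN, DISTANCE), and its thresholds are all made up for illustration; they are not the output of any fitted model and not the answer to the exercise.

```python
# Hypothetical toy tree: internal nodes have "var", "threshold", "left", "right";
# leaves have only "label". All values here are invented for illustration.
toy_tree = {
    "var": "DEP_TIME_BIN",
    "threshold": 5,
    "left": {"label": "ontime"},
    "right": {
        "var": "DISTANCE",
        "threshold": 200,
        "left": {"label": "ontime"},
        "right": {"label": "delayed"},
    },
}


def tree_to_rules(node, conditions=()):
    """Turn every root-to-leaf path into one IF ... THEN ... rule."""
    if "label" in node:
        antecedent = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {antecedent} THEN {node['label']}"]
    var, thr = node["var"], node["threshold"]
    left_rules = tree_to_rules(node["left"], conditions + (f"{var} < {thr}",))
    right_rules = tree_to_rules(node["right"], conditions + (f"{var} >= {thr}",))
    return left_rules + right_rules


def count_splits(node):
    """Number of internal (splitting) nodes in the tree."""
    if "label" in node:
        return 0
    return 1 + count_splits(node["left"]) + count_splits(node["right"])


def split_variables(node):
    """Set of distinct variables used in at least one split."""
    if "label" in node:
        return set()
    return {node["var"]} | split_variables(node["left"]) | split_variables(node["right"])


rules = tree_to_rules(toy_tree)
```

Each leaf yields exactly one rule, so a tree with k splits produces k + 1 rules. Comparing `count_splits` and `split_variables` across two trees is one way to organize the comparison asked for in part b.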
