Detecting Spam Email (from the UCI Machine Learning Repository). A team at
Hewlett-Packard collected data on a large number of email messages from their postmaster
and personal email for the purpose of finding a classifier that can separate email
messages that are spam vs. nonspam (aka “ham”). The spam concept is diverse: it
includes advertisements for products or websites, “make money fast” schemes, chain
letters, pornography, and so on. The definition used here is “unsolicited commercial
email.”
The file contains information on 4601 email messages, among
which 1813 are tagged “spam.” The predictors include 57 attributes, most of which
are the average number of times a certain word (e.g., mail, George) or symbol (e.g.,
#, !) appears in the email. A few predictors are related to the number and length of
capitalized words.
a. To reduce the number of predictors to a manageable size, examine how each
predictor differs between the spam and nonspam emails by comparing the spam-class
average and nonspam-class average. (Hint: Use the Graph Builder with
the Column Switcher.) Which are the 11 predictors that appear to vary the most
between spam and nonspam emails? From these 11, which words or signs occur
more often in spam?
b. Partition the data into training and validation sets, then perform a discriminant
analysis on the training data using only the 11 predictors.
c. If we are interested mainly in detecting spam messages, is this model useful?
Use the confusion matrix, ROC curve and lift curve for the validation set for the
evaluation.
d. In the sample, almost 40% of the email messages were tagged as spam. However,
suppose that the actual proportion of spam messages in these email accounts
is 10%. How does this information change the distance scores, the predicted
probabilities, and the misclassifications?
e. A spam filter that is based on your model is used, so that only messages that are
classified as nonspam are delivered while messages that are classified as spam
are quarantined. Consequently, misclassifying a nonspam email (as spam) has
much heavier consequences. Suppose that the cost of quarantining a nonspam email is 20
times that of not detecting a spam message. Assume that the proportion of spam is
reflected correctly by the sample proportion. To explore costs of misclassification,
save the discriminant formula to the data table, then create a formula in the data
table to calculate the costs of misclassification. Summarize these costs: how costly
is it to quarantine a nonspam email vs. not detecting a spam message?
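
The arithmetic behind parts d and e can be sketched outside JMP. The Python below is only an illustration, under the assumption that the fitted model yields a posterior probability of spam for each message. The function names and the four toy messages are hypothetical and are not part of the exercise data; only the counts 1813 and 4601, the 10% proportion, and the 20-to-1 cost ratio come from the text above. Changing the prior re-weights each posterior by the ratio of new to old class proportions, and unequal costs move the classification cutoff away from 0.5.

```python
def adjust_posterior(p_spam, old_prior, new_prior):
    """Re-weight a posterior probability of spam for a different prior.

    Each class's posterior is scaled by (new prior / old prior) for that
    class, then the two weights are renormalized to sum to 1.
    """
    w_spam = p_spam * new_prior / old_prior
    w_ham = (1.0 - p_spam) * (1.0 - new_prior) / (1.0 - old_prior)
    return w_spam / (w_spam + w_ham)


def cost_cutoff(cost_false_spam, cost_missed_spam):
    """Posterior cutoff that minimizes expected misclassification cost.

    Flagging a message as spam is cheaper in expectation only when
    p_spam * cost_missed_spam > (1 - p_spam) * cost_false_spam.
    """
    return cost_false_spam / (cost_false_spam + cost_missed_spam)


def total_cost(is_spam, p_spam, cutoff, cost_false_spam, cost_missed_spam):
    """Sum the misclassification costs over a set of scored messages."""
    cost = 0.0
    for truth, p in zip(is_spam, p_spam):
        flagged = p > cutoff
        if flagged and not truth:
            cost += cost_false_spam      # nonspam quarantined
        elif truth and not flagged:
            cost += cost_missed_spam     # spam delivered
    return cost


# Values taken from the exercise text.
SAMPLE_PRIOR = 1813 / 4601    # share of spam in the sample (part d)
ASSUMED_PRIOR = 0.10          # supposed true share of spam (part d)
COST_FALSE_SPAM = 20.0        # quarantining a nonspam email (part e)
COST_MISSED_SPAM = 1.0        # not detecting a spam message (part e)

# Hypothetical toy scores, NOT taken from the spam data set.
toy_is_spam = [True, False, True, False]
toy_p_spam = [0.97, 0.60, 0.70, 0.10]

# Part d: a message scored 0.5 under the sample prior looks much less
# like spam once the prior drops to 10%.
print(adjust_posterior(0.5, SAMPLE_PRIOR, ASSUMED_PRIOR))

# Part e: compare total cost at the default cutoff and the cost-based one.
cutoff = cost_cutoff(COST_FALSE_SPAM, COST_MISSED_SPAM)
print(cutoff)
print(total_cost(toy_is_spam, toy_p_spam, 0.5, COST_FALSE_SPAM, COST_MISSED_SPAM))
print(total_cost(toy_is_spam, toy_p_spam, cutoff, COST_FALSE_SPAM, COST_MISSED_SPAM))
```

With a 20-to-1 cost ratio the cutoff rises to 20/21, so a message is quarantined only when the model is very confident it is spam. The same per-row cost logic can be written as a formula column in the data table, as the exercise asks.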