
Detecting Spam Email (from the UCI Machine Learning Repository). A team at Hewlett-Packard collected data on a large number of email messages from their postmaster and personal email for the purpose of finding a classifier that can separate email messages that are spam vs. nonspam (aka “ham”). The spam concept is diverse: it includes advertisements for products or websites, “make money fast” schemes, chain letters, pornography, and so on. The definition used here is “unsolicited commercial email.”

The file contains information on 4601 email messages, among which 1813 are tagged “spam.” The predictors include 57 attributes, most of which are the average number of times a certain word (e.g., mail, George) or symbol (e.g., #, !) appears in the email. A few predictors are related to the number and length of capitalized words.

a. To reduce the number of predictors to a manageable size, examine how each predictor differs between the spam and nonspam emails by comparing the spam-class average and nonspam-class average. (Hint: Use the Graph Builder with the Column Switcher.) Which are the 11 predictors that appear to vary the most between spam and nonspam emails? From these 11, which words or signs occur more often in spam?
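For readers working outside JMP, the comparison in part (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_by_class_gap`, the `spam` label column, and the toy rows are hypothetical and are not taken from the actual data file. It ranks predictors by the absolute gap between the spam-class and nonspam-class averages; the sign of the gap shows which class uses the word or symbol more.

```python
from statistics import mean


def rank_by_class_gap(rows, label_key="spam"):
    """Rank predictors by |spam-class mean - nonspam-class mean|, largest first.

    Each row is a dict of predictor values plus a 0/1 label under `label_key`.
    Returns a list of (predictor_name, spam_mean - nonspam_mean) tuples.
    """
    predictors = [k for k in rows[0] if k != label_key]
    spam = [r for r in rows if r[label_key] == 1]
    ham = [r for r in rows if r[label_key] == 0]
    gaps = []
    for name in predictors:
        gap = mean(r[name] for r in spam) - mean(r[name] for r in ham)
        gaps.append((name, gap))
    # A positive gap means the predictor is more frequent in spam.
    return sorted(gaps, key=lambda g: abs(g[1]), reverse=True)


# Toy rows with made-up values, for illustration only (not the real dataset).
toy_rows = [
    {"free": 0.8, "george": 0.0, "!": 0.6, "spam": 1},
    {"free": 0.4, "george": 0.0, "!": 0.4, "spam": 1},
    {"free": 0.0, "george": 1.2, "!": 0.1, "spam": 0},
    {"free": 0.2, "george": 1.8, "!": 0.1, "spam": 0},
]
ranking = rank_by_class_gap(toy_rows)
for name, gap in ranking:
    print(name, "more common in", "spam" if gap > 0 else "nonspam")
```

Raw mean gaps favor predictors measured on larger scales (such as the capital-letter run lengths), so dividing each gap by the predictor's standard deviation before ranking may be a reasonable variation.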

b. Partition the data into training and validation sets, then perform a discriminant analysis on the training data using only the 11 predictors.

c. If we are interested mainly in detecting spam messages, is this model useful? Use the confusion matrix, ROC curve, and lift curve for the validation set for the evaluation.
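When spam detection is the main goal, as in part (c), the most relevant summaries of the confusion matrix are the ones tied to the spam class. The sketch below is a minimal illustration, assuming labels coded 1 for spam and 0 for nonspam; the function names and the toy label vectors are hypothetical and do not represent results from the actual model.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the spam class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def _ratio(numerator, denominator):
    """Safe division: NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def spam_metrics(actual, predicted):
    """Summaries of the confusion matrix that focus on the spam class."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # share of spam that is caught
        "specificity": _ratio(tn, tn + fp),  # share of nonspam that is delivered
        "precision": _ratio(tp, tp + fp),    # share of flagged mail that is spam
        "error_rate": _ratio(fp + fn, tp + fp + fn + tn),
    }


# Made-up validation labels, for illustration only.
toy_actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = spam_metrics(toy_actual, toy_predicted)
```

A model can have a low overall error rate and still miss much of the spam, so sensitivity and precision are worth reading alongside the ROC and lift curves.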

d. In the sample, almost 40% of the email messages were tagged as spam. However, suppose that the actual proportion of spam messages in these email accounts is 10%. How does this information change the distance scores, the predicted probabilities, and the misclassifications?
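One way to reason about part (d): in discriminant analysis the prior enters each class score as an added log-prior term, so changing the prior shifts the log-odds of spam by a constant and leaves the predictor coefficients unchanged. The sketch below re-weights a predicted probability from one prior to another. It assumes the fitted model used priors equal to the sample proportion (1813/4601); if the software used a different prior (for example, equal priors), pass that value as `model_prior` instead. The function name is hypothetical.

```python
def adjust_for_prior(p_model, model_prior, true_prior):
    """Re-weight a predicted spam probability from one prior to another.

    p_model      probability of spam produced under `model_prior`
    model_prior  spam proportion the model assumed (an assumption here)
    true_prior   spam proportion believed to hold in practice
    """
    w_spam = true_prior / model_prior
    w_ham = (1 - true_prior) / (1 - model_prior)
    numerator = p_model * w_spam
    return numerator / (numerator + (1 - p_model) * w_ham)


SAMPLE_PRIOR = 1813 / 4601  # spam share in the sample, just under 40%
TRUE_PRIOR = 0.10           # spam share supposed in part (d)

# A message the model scores as a coin flip becomes much less likely to be spam.
adjusted = adjust_for_prior(0.5, SAMPLE_PRIOR, TRUE_PRIOR)
```

With a smaller spam prior, every predicted spam probability drops, so fewer messages cross a fixed cutoff: fewer nonspam messages are flagged, and more spam messages slip through.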

e. A spam filter that is based on your model is used, so that only messages that are classified as nonspam are delivered while messages that are classified as spam are quarantined. Consequently, misclassifying a nonspam email (as spam) has much heavier consequences. Suppose that the cost of quarantining a nonspam email is 20 times that of not detecting a spam message. Assume that the proportion of spam is reflected correctly by the sample proportion. To explore costs of misclassification, save the discriminant formula to the data table, then create a formula in the data table to calculate the costs of misclassification. Summarize these costs: how costly is it to quarantine nonspam email vs. not detecting a spam message?
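The cost calculation in part (e) can be illustrated with a short Python sketch. It is not the JMP formula itself. It assumes a cost of 1 unit for an undetected spam message and 20 units for a quarantined nonspam message (only the 20:1 ratio comes from the exercise), and all names and toy labels are hypothetical.

```python
COST_MISSED_SPAM = 1.0       # spam delivered to the inbox (assumed unit cost)
COST_QUARANTINED_HAM = 20.0  # nonspam quarantined: 20 times as costly


def expected_costs(p_spam):
    """Expected cost of each action for a message with spam probability p_spam."""
    return {
        "deliver": p_spam * COST_MISSED_SPAM,
        "quarantine": (1 - p_spam) * COST_QUARANTINED_HAM,
    }


def min_cost_action(p_spam):
    """Pick the action with the lower expected cost (ties go to delivery)."""
    costs = expected_costs(p_spam)
    return "quarantine" if costs["quarantine"] < costs["deliver"] else "deliver"


def total_misclassification_cost(actual, predicted):
    """Sum the costs over labeled messages (1 = spam, 0 = nonspam)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0 and p == 1:
            total += COST_QUARANTINED_HAM
        elif a == 1 and p == 0:
            total += COST_MISSED_SPAM
    return total


# Made-up labels: one missed spam and one quarantined nonspam message.
toy_cost = total_misclassification_cost([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```

Under these costs, quarantining is cheaper only when (1 - p) × 20 < p, that is, when the predicted spam probability exceeds 20/21 (about 0.95), which is a far stricter cutoff than 0.5.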
