We provide one IPython notebook, SIT742Task2.ipynb, at https://github.com/tulip-lab/sit742/tree/master/Assessment/2019, together with a csv file, bank.csv, in the data subfolder. You are required to analyse this dataset in the IPython notebook using the Spark packages you have learnt in SIT742, including spark.sql and pyspark.ml.
2.2. TASK DESCRIPTION
Attribute | Meaning
age | age of the customer
job | type of job
marital | marital status
education | education level
default | has credit in default?
balance | the balance of the customer
housing | has housing loan?
loan | has personal loan?
contact | contact communication type
day | last contact day of the week
month | last contact month of year
duration | last contact duration, in seconds
campaign | number of contacts performed during this campaign
pdays | number of days that passed by after a previous campaign
previous | number of contacts performed before this campaign
poutcome | outcome of the previous marketing campaign
deposit | has the client subscribed to a term deposit?
2.2.1 IPython Notebook
To systematically investigate this dataset, your IPython notebook should follow the six basic procedures below (an illustrative PySpark sketch is given after the list):
(1) Import the csv file, “bank.csv”, as a Spark dataframe and name it df, then check and explore each of its attributes.
(2) Select important attributes from df as a new dataframe df2 for further investigation. You are required to select 13 important attributes from df: 'age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'campaign', 'pdays', 'previous', 'poutcome' and 'deposit'.
(3) Remove all invalid rows from the dataframe df2 using spark.sql. A row is invalid if at least one of its attributes contains 'unknown'. For the attribute 'poutcome', the only valid values are 'failure' and 'success'.
(4) Convert all categorical attributes in df2 to numerical attributes using one-hot encoding, then apply Min-Max normalisation to each attribute.
(5) Perform unsupervised learning on df2, including k-means and PCA. For k-means, you can use the whole of df2 as both training and testing data, and evaluate the clustering result using accuracy. For PCA, you can generate a scatter plot of the first two principal components to investigate the data distribution.
(6) Perform supervised learning on df2, including Logistic Regression, Decision Tree and Naive Bayes. For each of the three classification methods, use 70% of df2 as training data and the remaining 30% as testing data, and evaluate the prediction performance using accuracy.
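The sketch below shows one possible way to wire these six steps together with pyspark.ml. It is illustrative only, not a model solution: the file path, the Spark session setup, the choice of k = 2 for k-means, the variable names and the seed are all assumptions, and the multi-column OneHotEncoder API assumes Spark 3.x (older releases use OneHotEncoderEstimator). Adapt the names and parameters to your own notebook.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import (StringIndexer, OneHotEncoder, VectorAssembler,
                                MinMaxScaler, PCA)
from pyspark.ml.clustering import KMeans
from pyspark.ml.classification import (LogisticRegression, DecisionTreeClassifier,
                                        NaiveBayes)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("SIT742Task2").getOrCreate()

# (1) Import bank.csv as a Spark dataframe (the path is an assumption).
df = spark.read.csv("bank.csv", header=True, inferSchema=True)
df.printSchema()

# (2) Keep the 13 required attributes.
cols = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
        'loan', 'campaign', 'pdays', 'previous', 'poutcome', 'deposit']
df2 = df.select(cols)

# (3) Drop invalid rows with spark.sql: no attribute may be 'unknown',
#     and 'poutcome' must be 'failure' or 'success'.
df2.createOrReplaceTempView("bank")
df2 = spark.sql("""
    SELECT * FROM bank
    WHERE job <> 'unknown' AND marital <> 'unknown' AND education <> 'unknown'
      AND `default` <> 'unknown' AND housing <> 'unknown' AND loan <> 'unknown'
      AND poutcome IN ('failure', 'success')
""")

# (4) One-hot encode the categorical attributes, assemble them with the
#     numeric ones, then Min-Max normalise every feature to [0, 1].
categorical = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'poutcome']
numeric = ['age', 'balance', 'campaign', 'pdays', 'previous']
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical]
encoder = OneHotEncoder(inputCols=[c + "_idx" for c in categorical],
                        outputCols=[c + "_vec" for c in categorical])
label_indexer = StringIndexer(inputCol="deposit", outputCol="label")
assembler = VectorAssembler(inputCols=numeric + [c + "_vec" for c in categorical],
                            outputCol="raw_features")
scaler = MinMaxScaler(inputCol="raw_features", outputCol="features")
pipeline = Pipeline(stages=indexers + [encoder, label_indexer, assembler, scaler])
prepared = pipeline.fit(df2).transform(df2)

# (5) Unsupervised learning: k-means (k = 2 assumed, matching yes/no deposit)
#     and a 2-component PCA for a scatter plot of the data distribution.
clusters = KMeans(k=2, featuresCol="features", predictionCol="cluster",
                  seed=742).fit(prepared).transform(prepared)
pca_df = PCA(k=2, inputCol="features",
             outputCol="pca_features").fit(prepared).transform(prepared)

# (6) Supervised learning: 70/30 split, three classifiers, accuracy.
train, test = prepared.randomSplit([0.7, 0.3], seed=742)
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
for clf in [LogisticRegression(featuresCol="features", labelCol="label"),
            DecisionTreeClassifier(featuresCol="features", labelCol="label"),
            NaiveBayes(featuresCol="features", labelCol="label")]:
    model = clf.fit(train)
    print(type(clf).__name__, "accuracy:",
          evaluator.evaluate(model.transform(test)))
```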
2.2.2 Case Study Report
Based on your IPython notebook results, you are required to write a case study report which should include the following information:
(1) The data attribute distribution
(2) The methods/algorithms you used for data wrangling and processing
(3) The performance of both unsupervised and supervised learning on the data
(4) The important features which affect the objective (‘yes’ in ‘deposit’) [Hint: you can refer to the coefficients generated by the Logistic Regression; see the sketch after this list]
(5) A discussion of the possible reasons for these analysis results and how they could be improved
(6) A description of the group activities, such as the task distribution among group members and what you learnt during this project.
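For point (4), one way to see which attributes push a customer towards 'yes' is to read off the Logistic Regression coefficients. The fragment below is a hedged sketch: it assumes the variable names from the pipeline sketch in Section 2.2.1 (`prepared` and its VectorAssembler output column `raw_features`), and `lr_model` is a hypothetical name for your fitted LogisticRegressionModel. Whether a positive coefficient favours 'yes' depends on how StringIndexer encoded the 'deposit' labels, so check that encoding before interpreting the signs.

```python
# Hypothetical names: `prepared` and `lr_model` come from the notebook sketch
# in Section 2.2.1. VectorAssembler stores per-position feature names in the
# column metadata; MinMaxScaler keeps the feature order, so the metadata of
# "raw_features" still matches the coefficient positions.
meta = prepared.schema["raw_features"].metadata["ml_attr"]["attrs"]
names = {a["idx"]: a["name"] for group in meta.values() for a in group}
coefs = lr_model.coefficients.toArray()

# Rank features by absolute coefficient; verify the label encoding
# (StringIndexer orders labels by frequency) before reading a sign as
# 'towards yes' or 'towards no'.
for idx, coef in sorted(enumerate(coefs), key=lambda t: -abs(t[1])):
    print(f"{names.get(idx, idx):35} {coef: .4f}")
```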
More information about report writing can be found at: https://www.deakin.edu.au/students/studying/study-support/academic-skills/report-writing.
Requirements:
SIT742Task2.ipynb
Your IPython notebook solution source file for the data exploration of the bank marketing data. You can fill in your group information in the relevant place in the first markdown cell. Please follow the PEP 8 guidelines (Section 3.1) for source code style.
Report.pdf
A report describing and discussing your analysis results.