College of Computing and Informatics
Project Datasets:
1- https://www.kaggle.com/austinreese/craigslist-carstrucks-data
2- https://www.kaggle.com/currie32/crimes-in-chicago
3- https://www.kaggle.com/hm-land-registry/uk-housing-prices-paid
Choose any one of the datasets above and apply all of the following tasks to it.
Project Required Steps:
Task 1: (2 Marks)
Topic 1: Sentiment analysis identifies public opinion through text analytics. Big data tools can aid in the storage and processing of data for sentiment analysis. Through such analysis, companies can better plan their processes and sales.
Topic 2: Machine learning algorithms are very important in the field of data science. As the volume of data grows, applying those algorithms to Big Data becomes increasingly important and advantageous.
Write a short literature review and discussion of Topic 1 or Topic 2, explaining how the topic can be implemented and used in Big Data applications, in no more than one page. You must use at least six references and cite them in the literature review. The references must be added to the template (try using any referencing software).
Task 2: (1 Mark)
Load the dataset into the Hadoop Distributed File System (HDFS). Discuss and explain the type and structure of the data. Show the steps that you followed during the import process.
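For the "type and structure of the data" part of Task 2, a first pass over the CSV file can be sketched in Python as below. This is only an illustrative sketch: the helper names `infer_type` and `describe` are hypothetical, the sample rows are made up, and the column names are assumptions that should be checked against the dataset you chose.

```python
import csv
import io


def infer_type(values):
    """Guess 'int', 'float', or 'string' from sample values ('empty' if none)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name  # every sampled value parsed with this caster
        except ValueError:
            continue  # at least one value failed, try the next type
    return "string"


def describe(csv_file, sample_size=1000):
    """Map each column name to a type guessed from the first sample_size rows."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in reader.fieldnames}
    for i, row in enumerate(reader):
        if i >= sample_size:
            break
        for name in columns:
            # Short rows yield None for missing fields; treat them as empty.
            columns[name].append(row[name] or "")
    return {name: infer_type(values) for name, values in columns.items()}


# Made-up sample standing in for the real file (column names are assumptions).
SAMPLE = """id,price,odometer,manufacturer
1,10000,55000.5,ford
2,15000,,toyota
"""

schema = describe(io.StringIO(SAMPLE))
for column, guessed in schema.items():
    print(f"{column}: {guessed}")
```

On the real data, the same function could be called on a file opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the downloaded CSV.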
Task 3: (2 Marks)
Apply a MapReduce algorithm to produce useful statistical results. Discuss the statistical results in detail and explain their meaning based on the dataset you have chosen.
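As an illustration of the kind of statistic Task 3 asks for, the sketch below computes a count and a mean per key in the MapReduce style, simulated locally in plain Python. It assumes a CSV with `manufacturer` and `price` columns (an assumption about the car listings dataset; substitute columns from your own choice), and the sample rows are made up. On a cluster, the same mapper and reducer logic would typically be split into two scripts that read standard input, for example with Hadoop Streaming.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def mapper(rows):
    """Map phase: emit (manufacturer, price) for every row with a usable price."""
    for row in rows:
        key = (row.get("manufacturer") or "").strip().lower()
        try:
            price = float(row.get("price") or "")
        except ValueError:
            continue  # skip rows whose price is missing or not numeric
        if key and price > 0:
            yield key, price


def reducer(pairs):
    """Reduce phase: for each key, yield (key, count, mean) of its values."""
    # Hadoop sorts mapper output by key before reducing; sorted() mimics that here.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        prices = [price for _, price in group]
        yield key, len(prices), sum(prices) / len(prices)


# Made-up sample standing in for the real CSV file.
SAMPLE = """manufacturer,price
ford,10000
ford,20000
toyota,15000
toyota,
,5000
"""

rows = csv.DictReader(io.StringIO(SAMPLE))
results = list(reducer(mapper(rows)))
for key, count, mean in results:
    print(f"{key}\t{count}\t{mean:.2f}")
```

Rows with an empty key or a non-numeric price are dropped in the map phase, so the reducer only sees clean pairs. State this kind of filtering when you discuss what the statistics mean.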
Task 4: (1 Mark)
Import the data into MongoDB. Show the steps you followed to import the dataset into this NoSQL system.
Task 5: (2 Marks)
Execute at least three queries on the data in MongoDB. Describe your queries and the results. Discuss the meaning of the results based on the dataset.
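To give a sense of what three MongoDB queries might look like, the sketch below writes them as plain Python documents and runs them through a collection object such as the one PyMongo provides. The field names (`manufacturer`, `price`, `year`, `model`) are assumptions about the car listings dataset, and the function name `run_queries` is hypothetical; adapt both to your own data.

```python
# Query 1: filter document for listings from one manufacturer under a price cap.
cheap_fords = {"manufacturer": "ford", "price": {"$gt": 0, "$lt": 5000}}

# Query 2: aggregation pipeline giving the average price and number of
# listings per manufacturer, highest average first, top ten only.
avg_price_by_manufacturer = [
    {"$match": {"price": {"$gt": 0}}},
    {"$group": {
        "_id": "$manufacturer",
        "avgPrice": {"$avg": "$price"},
        "listings": {"$sum": 1},
    }},
    {"$sort": {"avgPrice": -1}},
    {"$limit": 10},
]

# Query 3: filter and projection for the listings with the latest model year.
has_year = {"year": {"$gt": 0}}
year_projection = {"_id": 0, "manufacturer": 1, "model": 1, "year": 1}


def run_queries(collection):
    """Run the three queries against a PyMongo-style collection object."""
    newest = collection.find(has_year, year_projection).sort("year", -1).limit(5)
    return {
        "cheap_ford_count": collection.count_documents(cheap_fords),
        "top_manufacturers": list(collection.aggregate(avg_price_by_manufacturer)),
        "newest": list(newest),
    }
```

With PyMongo installed and a MongoDB server running, `run_queries` could be called with a collection obtained from `pymongo.MongoClient`. The database and collection names are whatever you chose when importing the data.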
Task 6: (1 Mark)
Using Hive or Pig, execute at least three queries on the dataset. Describe your queries and the results. Discuss the meaning of the results based on the data.
Task 7: (1 Mark)
Using Spark, run two Spark SQL statements on the dataset, and visualize the results in a chart of your choice (Hint: you can use Zeppelin directly).
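A minimal PySpark sketch of Task 7 is given below, assuming the dataset is a CSV file with `manufacturer`, `price`, and `year` columns (assumptions about the car listings dataset). The view name `vehicles`, the function name `run_spark_queries`, and the query names are hypothetical. The function expects an existing `SparkSession`, which Zeppelin and Databricks notebooks normally expose as `spark`.

```python
# Two Spark SQL statements, keyed by a short descriptive name.
QUERIES = {
    "avg_price_by_manufacturer": """
        SELECT manufacturer, ROUND(AVG(price), 2) AS avg_price
        FROM vehicles
        WHERE price > 0
        GROUP BY manufacturer
        ORDER BY avg_price DESC
        LIMIT 10
    """,
    "listings_per_year": """
        SELECT year, COUNT(*) AS listings
        FROM vehicles
        WHERE year IS NOT NULL
        GROUP BY year
        ORDER BY year
    """,
}


def run_spark_queries(spark, csv_path):
    """Load the CSV, register it as a temporary view, and run each query."""
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.createOrReplaceTempView("vehicles")
    # Each value is a Spark DataFrame holding one query's result.
    return {name: spark.sql(sql) for name, sql in QUERIES.items()}
```

Each returned DataFrame can then be charted with the notebook's built-in tools, for instance `z.show(df)` in a Zeppelin PySpark paragraph or `display(df)` in Databricks.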
Task 8 (Optional): (1 Mark as Bonus)
Using MLlib in Spark, build a suitable machine learning model and run it on the data. Discuss your results.
Note:
· You can use the Hortonworks HDP sandbox with only one node. For the Spark part, you can use the same sandbox or a Databricks cluster.
· All the tasks must be described in detail with the code written for each part.
· You can add screenshots of your steps to the project template.