· 1 Aim
· 2 Method
· 4 Due Date and Submission
· 5 Report Format
· 6 Marks
· 7 Declaration
· 8 Project Description
1 Aim
This project provides us with a chance to analyse the Social Web using knowledge obtained from this unit, with assistance from a computer-based statistical package. For this project, we will focus on identifying a chosen company’s Twitter image.
2 Method
To complete this project:
1. Read through this specification.
2. Choose a company that is active on Twitter and check that it is not already on the list of Group Project Twitter Handles. Then submit the Twitter handle of the company using the same link. Note that a given company cannot be allocated to more than one group. If duplicate company names are found on the list, the group with the later time stamp will be asked to find a new company.
3. Complete the data analysis required by the specification.
4. Write up your analysis using your favourite word processing/typesetting program, making sure that all of the working is shown and presented well. Include all the R code along with its output in your assignment.
6. Include the student declaration text on the front page of your report. Please make sure that the names and student numbers of each group member are clearly displayed on the front page. If a group member did not contribute to any part of the project, do not put their name on the cover (no contribution means 0 marks).
7. Submit the report as a PDF by the due date using the Submit Group Project link. More detailed screenshots of your code should be placed in the Appendix of the assignment. Include comments in the code to explain what you tried to do.
4 Due Date and Submission
Part A of the project report is due by 11:59 p.m. on the Friday of week 10. The report must be submitted as a PDF file using the assignment submission facilities in the Project section of 300958 in vUWS. Only one student from each group needs to submit the assignment.
5 Report Format
Once the required analysis is performed by the group, the members of the group are to write up the analysis as a report. Remember that the assessor will only see the group’s report and will mark the group’s analysis based on that report. Therefore, the report should contain a clear and concise description of the procedures carried out, comments on the code, explanations of what you tried to do, the analysis of results and any conclusions reached from the analysis.
The required analysis in this specification covers the material presented in lectures and labs. Students should use the computer software R to carry out the required analysis and then present the results from the analysis in the report.
6 Marks
This project is worth 30% (Part A 19% + Part B 11%) of your final grade, and so the project will be marked out of 30. The project consists of four investigations (3 sections in Part A, 1 section in Part B) and will be marked using the following criteria:
Marks (Part A) | Criteria Satisfied |
7 marks | First section completed correctly. |
5 marks | Second section completed correctly. |
6 marks | Third section completed correctly. |
There is also one mark allocated to presentation (based on the report formatting, style, grammar, clarity and mathematical notation). If the report looks like something that would be submitted to an employer, then the 1 mark will be awarded.
If a report is submitted late, the maximum mark it can achieve will be reduced by 10% per day.
7 Declaration
The following declaration must be included in a clearly visible and readable place on the first page of the report.
By including this statement, we the authors of this work, verify that:
· We hold a copy of this assignment that we can produce if the original is lost or damaged.
· We hereby certify that no part of this assignment/product has been copied from any other student’s work or from any other source except where due acknowledgement is made in the assignment.
· No part of this assignment/product has been written/produced for us by another person except where such collaboration has been authorised by the subject lecturer/tutor concerned.
· We are aware that this work may be reproduced and submitted to plagiarism detection software programs for the purpose of detecting possible plagiarism (which may retain a copy on its database for future plagiarism checking).
· We hereby certify that we have read and understand what the School of Computing and Mathematics defines as minor and substantial breaches of misconduct as outlined in the learning guide for this unit.
Note: An examiner or lecturer/tutor has the right not to mark this project report if the above declaration has not been added to the cover of the report.
8 Project Description PART A (due Week 10, Friday 11:59 pm)
A company is investigating its public image and has approached your team to identify what the public associates with the company name. The company wants the following three pieces of analysis to be performed in your first report.
8.1 Analysing the source of the tweets
In this section, we want to find out which sources people use when tweeting about the company.
1. Use the search_tweets function from the rtweet library to search for 750 tweets about the company you selected. Save these tweets as “tweets.about”.
2. Examine the source column to see the source of the tweets. Find out how many different levels of source exist in your tweets.
3. Obtain a vector of frequencies of each different source.
4. Create a data frame to save this information, where the first column holds the source names and the second column holds the source counts.
5. List the 10 most frequent tweet sources and draw a bar plot of the frequencies of these top ten sources. Make sure each bar is labelled with the name of its source.
6. Comment on the bar plot.
7. The company owner claims that Twitter users are equally likely to use ‘Twitter for iPhone’, ‘Twitter for Android’ and ‘Twitter Web Client’ when they post a tweet about the company. Use your tweet sample to test at a 5% level of significance whether this claim is true (Hint: First find the frequencies of these sources in your data frame and save these counts in a vector, then apply the appropriate statistical test).
7. Comment on your findings.
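As a rough illustration of steps 3 to 7, the sketch below uses a small made-up vector in place of the source column of tweets.about, so the counts are hypothetical and are not real Twitter data. It tabulates the sources, builds the data frame, draws a bar plot of the most frequent sources, and applies a chi-squared goodness-of-fit test, which is one reasonable choice for checking whether three categories are equally likely. The names toy.source, source.df, top.sources, claimed and observed are illustrative only.

```r
# Made-up stand-in for tweets.about$source (hypothetical counts, not real data)
toy.source <- c(rep("Twitter for iPhone", 40), rep("Twitter for Android", 35),
                rep("Twitter Web Client", 25), rep("TweetDeck", 5))

# Frequency of each source, sorted from most to least frequent
source.counts <- sort(table(toy.source), decreasing = TRUE)

# Data frame: first column source names, second column source counts
source.df <- data.frame(source = names(source.counts),
                        count = as.vector(source.counts),
                        stringsAsFactors = FALSE)

# Keep at most the ten most frequent sources and plot them with labelled bars
top.sources <- head(source.df, 10)
barplot(top.sources$count, names.arg = top.sources$source,
        las = 2, cex.names = 0.7, ylab = "Number of tweets")

# Counts for the three sources named in the claim, in a fixed order
claimed  <- c("Twitter for iPhone", "Twitter for Android", "Twitter Web Client")
observed <- source.df$count[match(claimed, source.df$source)]

# Goodness-of-fit test against equal probabilities (1/3 each)
test.result <- chisq.test(observed, p = rep(1 / 3, 3))
print(test.result)
```

With real data, a p-value below 0.05 would count as evidence against the claim of equal usage at the 5% level. A source that is missing from the sample would give NA in observed and would need to be handled before testing.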
8.2 Word-cloud of the company tweets and public tweets
In this section, we want to visualize the similarity between the company tweets and public tweets, as well as the language used in the tweets.
8. Download the last 750 tweets from the chosen Twitter handle’s timeline, and save them as tweets.company.
9. After doing pre-processing,
a. Construct a document term matrix of TFIDF weights of tweets.company.
b. Construct a document term matrix of TFIDF weights of tweets.about.
10. Construct word clouds of the words in tweets.about and tweets.company. Comment on both word clouds.
11. Combine (merge) tweets.about with tweets.company and construct the document term matrix of the merged tweets using TFIDF weighting.
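To make the TFIDF weighting in steps 9 and 11 concrete, here is a minimal base R sketch on four made-up, already pre-processed texts (hypothetical, not real tweets). It assumes one common TFIDF variant, the raw term count multiplied by log(N / document frequency). The weighting taught in the unit or applied by a text-mining package may differ in its details, so treat the formula and the names toy.about, toy.company, toy.merged and origin as assumptions.

```r
# Made-up, already pre-processed tweet texts (hypothetical, not real data)
toy.about   <- c("great service great coffee", "coffee was cold")
toy.company <- c("love the new store", "new coffee blend")

# Merge the two sets and remember which set each document came from
toy.merged <- c(toy.about, toy.company)
origin <- rep(c("about", "company"),
              times = c(length(toy.about), length(toy.company)))

# Split each document into words and collect the vocabulary
tokens <- strsplit(toy.merged, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Document term matrix of raw counts: one row per document, one column per term
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = terms))))

# TFIDF (assumed variant): term count times log(N / document frequency)
N        <- nrow(dtm)
doc.freq <- colSums(dtm > 0)
idf      <- log(N / doc.freq)
tfidf    <- sweep(dtm, 2, idf, `*`)

print(round(tfidf, 3))
```

Keeping the origin vector alongside the merged matrix is a design choice that makes it easy to tell public and company tweets apart later.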
8.3 Connection between public and the company
In this section, we want to categorize (cluster) all the tweets and want to determine which topics are dominated by public tweets.
12. Compute the most appropriate number of clusters using the elbow method for the merged tweets you constructed in question 11.
13. Cluster the merged tweets using the most appropriate clustering method.
14. Visualize your clustering in 2-dimensional vector space. Show each cluster in a different colour and show the tweets in tweets.about and tweets.company with different symbols in your visualization.
15. Comment on your visualization.
16. Compute the proportion of tweets.company in each cluster. Print these proportions for all clusters.
17. Which cluster is dominated by tweets.about? Print the top 20 words in that cluster and comment on the theme of this cluster.
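The steps in this section can be sketched on synthetic data as follows. The matrix X is a random stand-in for the merged TFIDF matrix, and origin is a made-up label vector. k-means on Euclidean distance and classical multidimensional scaling (cmdscale) are used here only as familiar defaults, and they are assumptions rather than the required methods: choosing the most appropriate clustering method and distance for real tweets is part of the task.

```r
set.seed(1)

# Synthetic stand-in for the merged TFIDF matrix: 20 documents, 5 terms,
# drawn as two well-separated groups (hypothetical, not real data)
X <- rbind(matrix(rnorm(50, mean = 0), ncol = 5),
           matrix(rnorm(50, mean = 4), ncol = 5))
origin <- rep(c("about", "company"), times = c(12, 8))

# Elbow method: total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Number of clusters read off the toy elbow plot (an assumption for this data)
k   <- 2
fit <- kmeans(X, centers = k, nstart = 10)

# Project to 2 dimensions; colour shows the cluster, symbol shows the origin
coords <- cmdscale(dist(X), k = 2)
plot(coords, col = fit$cluster, pch = ifelse(origin == "company", 17, 1),
     xlab = "Dimension 1", ylab = "Dimension 2")

# Proportion of company tweets in each cluster
prop.company <- tapply(origin == "company", fit$cluster, mean)
print(prop.company)

# The cluster with the smallest company proportion is dominated by public tweets
dominated <- names(which.min(prop.company))
print(dominated)
```

With a real TFIDF matrix that has terms as column names, the top words of the dominated cluster could then be found by summing the columns over the rows assigned to that cluster and sorting the totals.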
The company wants the above three parts of the analysis to be written up as a professional report in the first deliverable. Each part should have its own section of the report and all questions should have thoughtful answers. Include all the code along with its output in your assignment.