Twitter is making it possible for developers and researchers to study the public conversation
around COVID-19 in real time. This dataset includes a CSV file which contains tweets
extracted from the Twitter website in March 2020. The dataset is large and thus you are
initially required to manipulate it using shell scripting. Once you have reduced the data to a
reasonable size, you are then asked to use the R programming language to further analyse and
visualise the results.
Task A: Investigating Twitter Data using shell commands
Download the file covid-data.zip from the link provided above. Use the Unix shell to
manipulate the file and answer the following questions.
1. Decompress the file. How big is it?
2. What delimiter is used to separate the columns in the file? Write the code that shows
how many columns there are.
3. How many tweets are there in total in the file?
4. Assuming that the data is sorted by date, what is the date range of the tweets? (Give the
dates of the first and last tweets.)
5. When is the term “COVID-19” first mentioned in your dataset? (Note that the search is
case-sensitive: only the capitalised form COVID-19 counts.) What are the user_id, text and
post date of this tweet?
6. How many times does the hashtag #coronavirus or #COVID-19 appear in the file in the
given form? (If either hashtag appears more than once in a line, you need to count
all of its occurrences to answer this question properly.)
7. According to the dataset, how many unique users (user_ids) have tweeted? List the top 10
most frequent Twitter users (user_ids) whose tweets are in English (lang = ‘en’).
8. How many times does the word “Advertisers” appear in the source column? What is
the full name of the source which contains the text “Advertisers”? Print the text and
post date of the first and last English tweet posted from this source.
9. Filter all the tweets with lang = ‘en’ which contain the term ‘Corona’ or ‘Covid’.
Export the tweets to a new file named “covid19Final.csv”. Ensure that you restrict the
tweets only to verified users who have retweeted at least 20 times. Ensure that the file
“covid19Final.csv” contains the column names as well.
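
The kinds of shell pipelines these questions call for can be sketched on a tiny made-up file. This is only an illustration under stated assumptions: the tab delimiter, the column names and order (user_id, created_at, lang, text) and the three sample rows are all hypothetical and are not taken from covid-data.zip, so check the real header first and adjust the column numbers. The sketch also assumes one tweet per line and a grep that supports the -o option (GNU and BSD grep both do).

```shell
#!/bin/sh
# Build a small hypothetical sample file (assumed: tab-delimited, header row first).
sample=$(mktemp)
printf 'user_id\tcreated_at\tlang\ttext\n' > "$sample"
printf '11\t2020-03-01\ten\tStay safe #coronavirus #coronavirus\n' >> "$sample"
printf '22\t2020-03-02\tes\tNoticias sobre COVID-19\n' >> "$sample"
printf '11\t2020-03-03\ten\tUpdate on #COVID-19\n' >> "$sample"

# Number of columns: split the header row on the assumed delimiter and count fields.
ncols=$(head -n 1 "$sample" | awk -F '\t' '{ print NF }')

# Number of tweets: every line except the header.
ntweets=$(tail -n +2 "$sample" | wc -l | tr -d ' ')

# Hashtag occurrences: grep -o prints each match on its own line,
# so repeated hashtags within one tweet are all counted.
nhash=$(grep -o -e '#coronavirus' -e '#COVID-19' "$sample" | wc -l | tr -d ' ')

# Unique users: distinct values in column 1 (assumed to hold user_id).
nusers=$(tail -n +2 "$sample" | cut -f 1 | sort -u | wc -l | tr -d ' ')

# Most frequent English-language users: filter on column 3 (assumed to hold lang),
# count each user_id, then sort by count in descending order.
top_en=$(awk -F '\t' 'NR > 1 && $3 == "en" { print $1 }' "$sample" \
  | sort | uniq -c | sort -rn | head -n 10)

echo "columns: $ncols"
echo "tweets: $ntweets"
echo "hashtag occurrences: $nhash"
echo "unique users: $nusers"
echo "top English users (count user_id):"
echo "$top_en"

rm -f "$sample"
```

On the real file, the same pattern applies once the actual delimiter and column positions are known. Note that a plain line count is only reliable if no tweet text contains an embedded newline.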