Develop an Apache Spark application that meets the specifications below, using PySpark in Google Colab and the Crunchbase Open Data Map (ODM) organizations dataset download.
Details
Use the Week 11 Class Exercise downloads as a reference:
- Create a new notebook in Google Colab
- Download the Crunchbase ODM Orgs CSV file and upload it to the “Files” section of your Colab notebook (the upload may take a few minutes)
- Read the Crunchbase Orgs dataset into a Spark DataFrame
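The setup steps above can be sketched as follows. This is a minimal sketch, not the required solution: the file path, the reader options, and the helper name `read_orgs` are assumptions, and it presumes `pyspark` is available in the Colab runtime (for example after `pip install pyspark`).

```python
# Hypothetical path: replace with the name of the CSV you uploaded to Colab.
ORGS_CSV_PATH = "/content/organizations.csv"

# Assumed reader options: the first row is a header, column types are inferred,
# and quoted fields may contain embedded quotes or line breaks.
CSV_OPTIONS = {
    "header": True,
    "inferSchema": True,
    "multiLine": True,
    "escape": '"',
}


def read_orgs(path=ORGS_CSV_PATH):
    """Start (or reuse) a Spark session and load the CSV into a DataFrame."""
    # Imported inside the function so the constants above can be inspected
    # even when Spark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("crunchbase-odm-orgs").getOrCreate()
    df = spark.read.options(**CSV_OPTIONS).csv(path)
    return spark, df


# Usage in a notebook cell:
#   spark, df = read_orgs()
#   df.printSchema()   # check the actual column names before writing queries
```

Calling `df.printSchema()` first is worthwhile because the later tasks depend on the exact column names in the download.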
Implement PySpark code using DataFrames, RDDs, or Spark UDFs:
- Find all entities whose name starts with the letter “F” (e.g. Facebook):
  - print the count and show() the resulting Spark DataFrame
- Find all entities located in New York City:
  - print the count and show() the resulting Spark DataFrame
- Add a “Blog” column to the DataFrame, with each row set to 1 if the “domain” field contains “blogspot.com” and 0 otherwise:
  - show() only the records with the “Blog” field marked as 1
- Find all entities whose names are palindromes (the name reads the same forward and backward, e.g. “madam”):
  - print the count and show() the resulting Spark DataFrame
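One possible way to approach the four tasks with the DataFrame API and a UDF is sketched below. Several things here are assumptions rather than part of the specification: the column names `name` and `city`, the value `"New York"` for New York City, the case-insensitive palindrome comparison, and the helper names `is_palindrome` and `run_queries`. Check them against the schema of the actual download.

```python
def is_palindrome(name):
    """Return True if name reads the same forward and backward.

    Assumption: the comparison ignores case but keeps spaces and punctuation.
    Missing (None) or empty names are treated as non-palindromes.
    """
    if not name:
        return False
    text = name.lower()
    return text == text[::-1]


def run_queries(df):
    """Run the four tasks on a DataFrame with name, city, and domain columns."""
    # Imported here so is_palindrome can be tested without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    # Task 1: names starting with "F" (startswith is case-sensitive).
    f_names = df.filter(F.col("name").startswith("F"))
    print(f_names.count())
    f_names.show()

    # Task 2: entities in New York City (assumed to be stored as "New York").
    nyc = df.filter(F.col("city") == "New York")
    print(nyc.count())
    nyc.show()

    # Task 3: Blog = 1 if domain contains "blogspot.com", else 0.
    # A null domain makes the condition null, so the row falls to otherwise(0).
    with_blog = df.withColumn(
        "Blog",
        F.when(F.col("domain").contains("blogspot.com"), 1).otherwise(0),
    )
    with_blog.filter(F.col("Blog") == 1).show()

    # Task 4: palindrome names, using the plain Python helper as a Spark UDF.
    palindrome_udf = F.udf(is_palindrome, BooleanType())
    palindromes = df.filter(palindrome_udf(F.col("name")))
    print(palindromes.count())
    palindromes.show()

    return f_names, nyc, with_blog, palindromes
```

Keeping the palindrome check as an ordinary Python function makes it easy to test on a few strings before wrapping it in a UDF. Note that single-character names count as palindromes under this definition; filter on name length as well if that is not the intended reading.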