1. Consider a text corpus with 10^6 documents, a lexicon of size 10^5, and 100 distinct words per document, where each document is represented as a bag of words with frequencies. (a) How much space is required to store the entire data matrix without any optimization? (b) Suggest a sparse data format for storing the matrix and compute the space it requires.
2. In Exercise 1, let us represent the documents in 0-1 format depending on whether or not a word is present in the document. Compute the expected dot product between a pair of documents, each containing 100 words chosen completely at random from the lexicon. What is the expected dot product between a pair of documents with 50,000 words each? What does this tell you about the effect of document length on the computation of the dot product?
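One way to check the arithmetic in Exercises 1 and 2 is sketched below. The sketch rests on assumptions that the exercises do not state: each matrix entry, word index, and row pointer occupies 4 bytes, and the sparse format is a compressed row layout that stores one (word index, frequency) pair per nonzero entry plus one pointer per document. The function names are hypothetical. For the dot product, a word appears in a random document with probability k/d, where k is the number of words per document and d is the lexicon size, so by linearity of expectation the expected number of shared words is d * (k/d)^2 = k^2/d.

```python
def dense_bytes(n_docs, lexicon_size, bytes_per_entry=4):
    """Space for the full n_docs x lexicon_size matrix (assumed 4 bytes per entry)."""
    return n_docs * lexicon_size * bytes_per_entry


def sparse_bytes(n_docs, words_per_doc, bytes_per_index=4, bytes_per_value=4):
    """Space for an assumed compressed row layout.

    Each nonzero entry stores a word index and a frequency; one row pointer
    per document (plus a final sentinel) marks where each document starts.
    """
    nonzeros = n_docs * words_per_doc
    row_pointers = (n_docs + 1) * bytes_per_index
    return nonzeros * (bytes_per_index + bytes_per_value) + row_pointers


def expected_dot(words_per_doc, lexicon_size):
    """Expected dot product of two random 0-1 documents: k^2 / d."""
    return words_per_doc ** 2 / lexicon_size


if __name__ == "__main__":
    n_docs, lexicon_size = 10**6, 10**5
    print("dense bytes: ", dense_bytes(n_docs, lexicon_size))
    print("sparse bytes:", sparse_bytes(n_docs, 100))
    print("E[dot], 100 words:   ", expected_dot(100, lexicon_size))
    print("E[dot], 50,000 words:", expected_dot(50_000, lexicon_size))
```

Under these assumptions the dense matrix needs 4 x 10^11 bytes, while the sparse layout needs roughly 8 x 10^8 bytes. The expected dot product grows quadratically with document length, from 0.1 at 100 words to 25,000 at 50,000 words, which suggests why raw dot products favor long documents unless they are normalized.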