Consider the term “elections,” which is present in only 50 documents in a corpus of 1000 documents. Furthermore, assume that the corpus contains 100 documents belonging to the Politics category, and 900 documents belonging to the Not-Politics category. The term “elections” is contained in 25 documents belonging to the Politics category. (a) Compute the unnormalized Gini index and the normalized Gini index Gn(·) of the term “elections.” (b) Compute the entropy of the class distribution with respect to the entire data set. (c) Compute the conditional entropy of the class distribution with respect to the term “elections.” (d) Compute the mutual information of the term “elections” according to Eq. 5.6. How are your answers to (b), (c), and (d) related? (e) Compute the information gain of the term….
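As a numerical check on parts (b)–(d), the quantities can be computed directly from the stated counts. This is a sketch using the standard definitions of entropy and mutual information; part (a) is omitted because the Gini normalization follows the book's specific convention.

```python
from math import log2

# Counts from the exercise statement: 1000 documents, 50 contain
# "elections", 100 are Politics, and 25 of the 50 term-bearing
# documents are Politics.
N, n_term, n_pol, n_pol_term = 1000, 50, 100, 25

def entropy(ps):
    """Shannon entropy (base 2) of a probability distribution."""
    return -sum(p * log2(p) for p in ps if p > 0)

# (b) Entropy of the class distribution over the whole corpus.
p_pol = n_pol / N
H_class = entropy([p_pol, 1 - p_pol])             # ~0.469 bits

# (c) Conditional entropy given presence/absence of the term.
p_term = n_term / N
p_pol_given_term = n_pol_term / n_term            # 0.5
p_pol_given_no_term = (n_pol - n_pol_term) / (N - n_term)
H_cond = (p_term * entropy([p_pol_given_term, 1 - p_pol_given_term])
          + (1 - p_term) * entropy([p_pol_given_no_term,
                                    1 - p_pol_given_no_term]))

# (d) Mutual information = H(class) - H(class | term); this difference
# is exactly the information gain asked for in part (e).
MI = H_class - H_cond

print(round(H_class, 3), round(H_cond, 3), round(MI, 3))
```

The relationship asked about in (d) is visible in the last step: mutual information is the drop in class entropy once the term's presence is known.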
1. Predict the probabilities of the categories Cat and Car for Test2 on the toy corpus example in the chapter. You can use the multinomial naïve Bayes model with the same level of smoothing as used in the example in the book. Return normalized probabilities that sum to 1 over the two categories.
2. Naïve Bayes is a generative model in which each class corresponds to one mixture component. Design a fully supervised generalization of the naïve Bayes model in which each of the k classes contains exactly b > 1 mixture components, for a total of b · k mixture components. How would you perform parameter estimation in this model?
1. Naïve Bayes is a generative model in which each class corresponds to one mixture component. Design a semi-supervised generalization of the naïve Bayes model in which each of the k classes contains exactly b > 1 mixture components, for a total of b · k mixture components. How would you perform parameter estimation in this model?
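One plausible estimation scheme for the fully supervised variant runs EM separately within each class, fitting b multinomial components to that class's documents; the semi-supervised variant additionally infers class membership of unlabeled documents in the E-step. A minimal per-class EM sketch, with illustrative names and smoothing (not the book's exact scheme):

```python
import numpy as np

def fit_class_mixture(X, b, iters=50, alpha=1.0, seed=0):
    """Fit a mixture of b multinomial components to X, the term-count
    matrix of the documents of ONE class (rows = documents).
    Returns component priors pi and word distributions theta."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(b, 1.0 / b)                       # component priors
    theta = rng.dirichlet(np.ones(d), size=b)      # per-component word dists
    for _ in range(iters):
        # E-step: responsibilities from (unnormalized) log-likelihoods.
        logp = X @ np.log(theta).T + np.log(pi)    # shape (n, b)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step with Laplace smoothing on both priors and word counts.
        pi = (r.sum(axis=0) + alpha) / (n + b * alpha)
        counts = r.T @ X + alpha
        theta = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta

# Toy usage: 20 documents of one class over a 6-word lexicon, b = 2.
X_toy = np.random.default_rng(1).integers(0, 5, size=(20, 6))
pi_c, theta_c = fit_class_mixture(X_toy, b=2)
```

A test document would then be scored per class by summing the b component likelihoods weighted by pi, times the class prior, and assigned to the class with the highest total.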
2. The adaptive nearest-neighbor method discussed in the chapter uses a single distortion metric over the entire data space in order to compute the nearest neighbor of a point. Propose a training algorithm to make this metric locally adaptive, so that an optimized distortion metric is used for each test instance based on the local class patterns in the data. What are the possible advantages and disadvantages of using such an….
Imagine a document data set in which the class label is generated by the following hidden function (which is unknown to the analyst and therefore has to be learned by a supervised learner): if a term has an odd number of consonants, then the term is of type 1; otherwise, the term is of type 2. The class label of a document is of type 1 if the majority of the tokens in it are of type 1; otherwise, the class label is of type 2. For a document collection of this type, would you prefer to use (1) a Bernoulli naïve Bayes classifier, (2) a multinomial naïve Bayes classifier, (3) a nearest-neighbor classifier, or (4) a univariate decision tree? What is the impact of the lexicon size….
1. Discuss the advantages of rule-based learners over decision trees, when the amount of data is limited.
2. Discuss how one might integrate domain knowledge with rule-based learners.
3. The bias variable is often addressed in least-squares classification and regression by adding an additional column of 1s to the data. Discuss the differences with the use of an explicit bias term when regularized forms of the model are used.
4. Write the optimization formulation for least-squares regression of the form y = W · X + b with a bias term b. Do not use regularization. Show that the optimal value of the bias term b always evaluates to 0 when the data matrix D and response variable vector y are both mean-centered.
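The claim in the exercise can be sanity-checked numerically: on mean-centered data, appending a column of 1s and solving the unregularized least-squares problem yields a bias coefficient that is zero up to floating-point error. A quick sketch with illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data matrix D and response vector y.
D = rng.normal(size=(200, 5))
y = rng.normal(size=200)

# Mean-center both the features and the response.
Dc = D - D.mean(axis=0)
yc = y - y.mean()

# Append a column of 1s so the last coefficient plays the role of b.
A = np.hstack([Dc, np.ones((Dc.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, yc, rcond=None)

b = coef[-1]
print(b)  # numerically ~0
```

The reason is visible in the normal equations: after centering, the all-ones column is orthogonal to every feature column and to the response, so the bias decouples and its optimal value is zero.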
1. Show that the effect of the bias term can be accounted for by adding a constant amount to each entry of the n × n kernel similarity matrix when using kernels with linear models.
2. Formulate a variation of regularized least-squares classification in which L1-loss is used instead of L2-loss. How would you expect each of these methods to behave in the presence of outliers? Which of these methods is more similar to SVMs with hinge loss? Discuss the challenges of using gradient-descent with this problem as compared to the regularized least-squares formulation.
1. Derive stochastic gradient-descent steps for the variation of L1-loss classification introduced in Exercise 6. You can use a constant step size.
2. Derive stochastic gradient-descent steps for SVMs with quadratic loss instead of hinge loss. You can use a constant step size.
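As an illustration of the kind of update the second exercise asks for, here is a hedged sketch of a single stochastic gradient step for an L2-regularized SVM with squared hinge loss; the step size, regularization placement, and names are illustrative and may differ from the book's convention:

```python
import numpy as np

def sgd_step_squared_hinge(w, x, y, lr=0.01, lam=0.1):
    """One SGD step for the per-sample objective
    (lam/2)*||w||^2 + max(0, 1 - y * (w.x))^2, with label y in {-1, +1}.
    The squared hinge is differentiable, so no subgradient choice is
    needed at the margin boundary (the gradient is 0 there)."""
    margin = 1.0 - y * np.dot(w, x)
    grad = lam * w
    if margin > 0:
        grad = grad - 2.0 * margin * y * x   # gradient of the loss term
    return w - lr * grad

# Toy usage: repeated passes over two separable points should drive
# both functional margins positive.
w = np.zeros(2)
data = [(np.array([1.0, 2.0]), 1), (np.array([-1.0, -1.5]), -1)]
for _ in range(200):
    for x, y in data:
        w = sgd_step_squared_hinge(w, x, y)
print(w, [y * np.dot(w, x) for x, y in data])
```

The L1-loss variant of the first exercise has the same skeleton, but the loss term's derivative is a sign function rather than a linear function of the residual, so a subgradient must be picked at the kink.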
3. Provide an algorithm to perform classification with explicit kernel feature transformation and the Nyström approximation. How would you use ensembles to make the algorithm efficient and accurate?
4. Consider an SVM with properly optimized parameters. Provide an intuitive argument as to why the out-of-sample error rate of the SVM will usually be less than the fraction of support vectors in the training data.
1. Suppose that you perform least-squares regression without regularization with the loss function ∑_{i=1}^{n} (y_i − W · X_i)², but you add spherical Gaussian noise with variance λ to each feature. Show that the expected loss with the perturbed features provides a loss function that is identical to that of L2-regularization. Use this result to provide an intuitive explanation of the connection between regularization and noise addition.
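The identity behind this exercise is E[(y − W·(X+ε))²] = (y − W·X)² + λ‖W‖² when ε is spherical Gaussian noise with variance λ per feature. It can be checked by Monte Carlo for a single training point; all values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.5
w = rng.normal(size=4)    # illustrative weight vector W
x = rng.normal(size=4)    # illustrative feature vector X
y = 1.3                   # illustrative response

base = (y - w @ x) ** 2

# Monte Carlo estimate of the expected loss when spherical Gaussian
# noise of variance lam is added to the features.
eps = rng.normal(scale=np.sqrt(lam), size=(200_000, 4))
noisy = (y - (x + eps) @ w) ** 2
mc = noisy.mean()

# Analytic value: the cross term vanishes in expectation, leaving
# (y - W.x)^2 + lam * ||W||^2, i.e., the L2-regularized loss.
analytic = base + lam * (w @ w)
print(mc, analytic)
```

Summing over all n points gives the regularized objective, which is the intuition the exercise asks for: training on noise-perturbed features penalizes large weights exactly like an L2 penalty.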
2. Show how to use the representer theorem to derive the closed-form solution of kernel least-squares regression.
3. Discuss the effect on the bias and variance of making the following changes to a classification algorithm: (a) increasing the regularization parameter in a support vector machine, (b) increasing the Laplacian smoothing parameter in a naïve Bayes classifier, (c) increasing the depth of….
1. Implement a subsampling ensemble in combination with a 1-nearest-neighbor classifier.
2. Suppose that you use a 1-nearest-neighbor classifier in combination with an ensemble method, and you are guaranteed that each application gets the correct answer for a test instance with 60% probability. What is the probability that you get the correct answer using a majority vote of three tries, each of which is independent and identically distributed?
3. Suppose that a data set is sampled with replacement to create a bagged sample of the same size as the original data set. Show that the probability that a given point is not included in the resampled data set approaches 1/e for large data sets, where e is the base of the natural logarithm.
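Both probabilities above can be checked quickly; note that bootstrap (bagged) resampling draws with replacement, which is why a point can be left out at all. A short sketch:

```python
import numpy as np

# Majority vote of three i.i.d. classifiers, each correct with p = 0.6:
# P(correct) = P(3 right) + P(exactly 2 right) = p^3 + 3 p^2 (1 - p).
p = 0.6
p_majority = p**3 + 3 * p**2 * (1 - p)   # 0.648

# Bootstrap resampling with replacement: a fixed point is missed by one
# draw with probability (1 - 1/n), so it is left out of the whole
# resample with probability (1 - 1/n)^n -> 1/e as n grows.
rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)       # resample of size n
left_out = n - np.unique(sample).size
print(p_majority, left_out / n)           # second value close to 1/e
```

So the three-vote ensemble is correct with probability 0.648, a modest improvement over 0.6, and roughly 36.8% of the training points are absent from any given bagged sample.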
1. Show how to use a factorization machine to perform undirected link prediction in a social network with content.
2. Show how to convert a link prediction problem with structure and content into a link prediction problem on a derived graph.
3. Suppose that you have a user-item ratings matrix with numerical/missing values. Furthermore, users have rated each other’s trustworthiness with binary/missing values. (a) Show how you can use shared matrix factorization for estimating the rating of a user on an item that they have not already rated. (b) Show how you can use factorization machines to achieve similar goals as (a).
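For part (a), one concrete realization is to share the user factor matrix between the two factorizations: ratings R ≈ U Vᵀ and trust T ≈ U Wᵀ, trained by SGD over the observed entries only. The sketch below is illustrative (variable names, hyperparameters, and the squared-loss choice are assumptions, not the book's prescription):

```python
import numpy as np

def shared_mf(R, T, k=2, lr=0.03, lam=0.05, epochs=500, seed=0):
    """Shared matrix factorization: ratings R ~ U @ V.T and trust
    T ~ U @ W.T share the user factors U. np.nan marks missing entries;
    SGD runs only over observed cells."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.normal(size=(n_users, k))
    V = 0.1 * rng.normal(size=(n_items, k))
    W = 0.1 * rng.normal(size=(n_users, k))
    obs_R = list(zip(*np.where(~np.isnan(R))))
    obs_T = list(zip(*np.where(~np.isnan(T))))
    for _ in range(epochs):
        for i, j in obs_R:                       # rating factorization
            e = R[i, j] - U[i] @ V[j]
            U[i] += lr * (e * V[j] - lam * U[i])
            V[j] += lr * (e * U[i] - lam * V[j])
        for i, j in obs_T:                       # trust factorization
            e = T[i, j] - U[i] @ W[j]
            U[i] += lr * (e * W[j] - lam * U[i])
            W[j] += lr * (e * U[i] - lam * W[j])
    return U, V, W

# Toy usage: 3 users, 3 items; np.nan entries are the ones to estimate.
nan = np.nan
R = np.array([[5.0, 4.0, nan], [nan, 1.0, 2.0], [4.0, nan, 5.0]])
T = np.array([[nan, 1.0, 1.0], [0.0, nan, nan], [1.0, nan, nan]])
U, V, W = shared_mf(R, T)
pred = U @ V.T   # estimated ratings, including the missing cells
```

For part (b), the analogous factorization-machine encoding would represent each rating as a feature vector with one-hot user and item blocks plus a block of the user's trusted users, so that the pairwise interaction terms play the role of the shared factors.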