1. Show that, when kernel methods are used with linear models, the effect of the bias term can be accounted for by adding a constant amount to each entry of the n × n kernel similarity matrix. (A numerical sanity check of this fact is sketched after this list.)
2. Formulate a variation of regularized least-squares classification in which the L1-loss is used instead of the L2-loss. How would you expect each of these methods to behave in the presence of outliers? Which of the two is more similar to the SVM with hinge loss? Discuss the challenges of using gradient descent for the L1-loss formulation as compared to the (smooth) L2-loss regularized least-squares formulation; see the subgradient sketch after this list.
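
For Exercise 1, one way to sanity-check the claim is the standard bias-absorption trick: a linear model with bias, f(x) = w · x + b, is equivalent to a bias-free model on points augmented with a constant feature, since (x_i, 1) · (x_j, 1) = x_i · x_j + 1. The sketch below (an assumed setup with a linear kernel, not a full solution) verifies numerically that the augmented Gram matrix equals the original one plus a constant in every entry:

```python
import numpy as np

# Sketch for Exercise 1: appending a constant feature 1 to every point
# adds the constant 1 to each entry of the linear kernel matrix, which
# is how the bias term is absorbed into the kernel.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))          # 5 points in 3 dimensions

K = X @ X.T                              # original n x n linear kernel matrix

X_aug = np.hstack([X, np.ones((5, 1))])  # append the constant feature 1
K_aug = X_aug @ X_aug.T                  # kernel of the augmented points

# The two Gram matrices differ by the constant 1 in each entry.
print("K_aug == K + 1 entrywise:", np.allclose(K_aug, K + 1.0))
```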
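
For Exercise 2, the following sketch contrasts gradient descent on the smooth L2-loss objective with subgradient descent on an assumed L1-loss variant, sum_i |y_i − w · x_i| + λ‖w‖². The function names, λ, step sizes, and synthetic data are all illustrative assumptions; the point is that the L1 loss is non-differentiable wherever a residual is exactly zero, so one must fall back on subgradients and a decaying step size, which is the main difficulty the exercise asks about:

```python
import numpy as np

def rls_l2_gradient(w, X, y, lam):
    """Gradient of the standard regularized least-squares objective
    sum_i (y_i - w.x_i)^2 + lam * ||w||^2 (smooth everywhere)."""
    residual = y - X @ w
    return -2.0 * X.T @ residual + 2.0 * lam * w

def rls_l1_subgradient(w, X, y, lam):
    """A subgradient of the L1-loss variant
    sum_i |y_i - w.x_i| + lam * ||w||^2. The loss is non-differentiable
    at zero residuals; np.sign supplies a valid subgradient there."""
    residual = y - X @ w
    return -X.T @ np.sign(residual) + 2.0 * lam * w

def descend(grad_fn, X, y, lam=0.1, iters=200):
    w = np.zeros(X.shape[1])
    for t in range(iters):
        # Decaying step size: needed for the subgradient method to
        # converge, whereas the smooth L2 objective tolerates a fixed step.
        eta = 0.01 / np.sqrt(t + 1)
        w -= eta * grad_fn(w, X, y, lam)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100))

print("L2-loss w:", descend(rls_l2_gradient, X, y))
print("L1-loss w:", descend(rls_l1_subgradient, X, y))
```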