When Gradient Descent Is a Kernel Method

This post discusses fitting a regression problem with a linear combination of random functions: the coefficients are initialized to 1/sqrt(N) and trained by gradient descent on a loss function. Testing the approach empirically, the author finds that the resulting fits are reasonable and seem to concentrate around piecewise linear interpolations of the data. They relate this to Bayesian inference, suggesting that running gradient descent to convergence from a random initialization approximately samples from the posterior given the data points.

The mathematical analysis connects the behavior of gradient descent to the statistical properties of the random functions and of the initialization. Through the tangent kernel, gradient descent can be viewed as a process in function space: it modifies the initial function by a linear combination of kernel functions, which for this model leads to piecewise affine interpolations of the data. The author then relates the covariance kernel of the Wiener process to the dynamics of gradient descent, and closes with a discussion of reproducing kernel Hilbert spaces (RKHS), their connection to the Gaussian process, and the regularization of optimization problems.
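Below is a minimal sketch (not the author's code) of the setup described in the summary: a model built from random functions, coefficients initialized to 1/sqrt(N), and plain gradient descent on squared error. The choice of random step functions, the toy data, the feature count, and the learning rate are all illustrative assumptions; step functions are used here only because their covariance is the Wiener-process kernel min(x, y), which is what makes the fitted corrections roughly piecewise linear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data on [0, 1] (illustrative assumption).
X = np.array([0.1, 0.4, 0.6, 0.9])
Y = np.array([0.0, 1.0, -0.5, 0.3])

# Random step functions f_n(x) = s_n * 1[x > u_n].  Their covariance is
# E[f_n(x) f_n(y)] = min(x, y), the Wiener-process kernel.
N = 2000
u = rng.uniform(size=N)
s = rng.choice([-1.0, 1.0], size=N)

def design(x):
    # Matrix of feature values f_n(x_i), shape (len(x), N).
    return s * (np.asarray(x)[:, None] > u)

# Coefficients initialized to 1/sqrt(N), so the initial model is an
# approximate sample from the corresponding Gaussian process.
c = np.full(N, 1.0 / np.sqrt(N))
c0 = c.copy()

# Plain gradient descent on squared error over the coefficients.
Phi = design(X)
lr = 0.5 / N
for _ in range(20_000):
    resid = Phi @ c - Y
    c -= lr * (Phi.T @ resid)

# The learned correction to the initial function is a linear combination
# of tangent-kernel sections k(., x_i) = sum_n f_n(.) f_n(x_i), which for
# this feature choice are (roughly) piecewise linear ramps.
x_grid = np.linspace(0.0, 1.0, 11)
print("correction:", np.round(design(x_grid) @ (c - c0), 3))
print("fit at data points:", np.round(Phi @ c, 3))
```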

https://cgad.ski/blog/when-gradient-descent-is-a-kernel-method.html
