Kernel Least-Squares
Motivations

Consider a linear auto-regressive model for time-series, where the predicted value at time $t+1$ is a linear function of the past $m$ values:
$$\hat{y}_{t+1} = w_1 y_t + w_2 y_{t-1} + \cdots + w_m y_{t-m+1}.$$
This writes $\hat{y}_{t+1} = w^\top x_t$, where $x_t := (y_t, \ldots, y_{t-m+1})$ is a feature vector of past values, and $w \in \mathbb{R}^m$ is a vector of model parameters.
We can fit this model based on historical data $y_1, \ldots, y_T$ via least-squares:
$$\min_w \: \sum_{t=m}^{T-1} \left(w^\top x_t - y_{t+1}\right)^2.$$
The associated prediction rule is $\hat{y}_{T+1} = w^\top x_T$.

We can introduce a non-linear version, where the prediction is a non-linear function of the past values. This writes $\hat{y}_{t+1} = w^\top \phi(x_t)$, where $\phi(x_t)$ is a feature vector containing non-linear functions (for example, all monomials up to a given degree) of the past values.
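A minimal numerical sketch of the fitting procedure, here for the linear model (the series, noise level, and window length $m$ are made up for the example):

```python
import numpy as np

# Hypothetical time-series data and window length (for illustration only).
rng = np.random.default_rng(0)
y = np.sin(0.3 * np.arange(200)) + 0.05 * rng.standard_normal(200)
m = 4  # number of past values used as features

# Build the feature matrix: row for time t holds x_t = (y_t, ..., y_{t-m+1}).
X = np.column_stack([y[m - 1 - k : len(y) - 1 - k] for k in range(m)])
targets = y[m:]  # the value y_{t+1} associated with each row

# Least-squares fit: min_w ||X w - targets||_2^2.
w, *_ = np.linalg.lstsq(X, targets, rcond=None)

# One-step-ahead prediction from the last m observed values.
x_last = y[-1 : -m - 1 : -1]  # (y_T, y_{T-1}, ..., y_{T-m+1})
y_next = w @ x_last
```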
Everything is the same as before, with $x_t$ replaced by $\phi(x_t)$. It appears that the size of the least-squares problem grows quickly with the degree of the feature vectors. How do we do it in a computationally efficient manner?

The kernel trick

We exploit a simple fact: in the (possibly regularized) least-squares problem
$$\min_w \: \|X^\top w - y\|_2^2 + \lambda \|w\|_2^2,$$
where $\lambda \ge 0$ and the columns of $X$ contain the feature vectors, the optimal $w$ lies in the span of those columns:
$$w = X v$$
for some vector $v$. (Indeed, any component of $w$ orthogonal to that span leaves $X^\top w$ unchanged and can only increase $\|w\|_2$.) Substituting $w = Xv$, the objective becomes
$$\|K v - y\|_2^2 + \lambda\, v^\top K v,$$
where $K := X^\top X$ is the kernel (Gram) matrix, with entries $K_{ij} = x_i^\top x_j$. Hence the least-squares problem depends only on the scalar products between the data points.

The prediction rule depends on the scalar products between a new point $x$ and the data points $x_i$:
$$\hat{y} = w^\top x = v^\top X^\top x = \sum_{i} v_i \,(x_i^\top x).$$
Once $v$ is computed, neither training nor prediction ever requires forming $w$ or the feature vectors themselves.
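To make the equivalence concrete, here is a small numerical check (with made-up data and a made-up regularization level $\lambda$) that the kernel form recovers the same solution and predictions as the original ridge problem:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, lam = 5, 30, 0.1           # feature dim, #samples, regularization (made up)
X = rng.standard_normal((n, T))  # columns are feature vectors x_1, ..., x_T
y = rng.standard_normal(T)

# Primal ridge solution: w = (X X^T + lam I)^{-1} X y.
w_primal = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)

# Kernel (dual) solution: depends on the data only through K = X^T X.
K = X.T @ X
v = np.linalg.solve(K + lam * np.eye(T), y)
w_dual = X @ v                   # the optimal w lies in the span of the data

# Prediction at a new point uses only scalar products x_i^T x_new.
x_new = rng.standard_normal(n)
pred_dual = v @ (X.T @ x_new)    # sum_i v_i (x_i^T x_new)
pred_primal = w_primal @ x_new
```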
Nonlinear case

In the nonlinear case, we simply replace the feature vectors $x_i$ with their non-linear counterparts $\phi(x_i)$. This leads to the modified kernel matrix
$$K_{ij} = \phi(x_i)^\top \phi(x_j).$$
The kernel function associated with the mapping $\phi$ is
$$k(x, z) := \phi(x)^\top \phi(z).$$
It provides information about the metric in the feature space, e.g.:
$$\|\phi(x) - \phi(z)\|_2^2 = k(x, x) - 2\,k(x, z) + k(z, z).$$
The computational effort involved in solving the least-squares problem and in making predictions depends only on our ability to quickly evaluate such scalar products. We can't choose the function $k$ arbitrarily: it must satisfy $k(x, z) = \phi(x)^\top \phi(z)$ for some feature mapping $\phi$.

Examples of kernels

A variety of kernels are available. Some are adapted to the structure of data, for example text or images. Here are a few popular choices.

Polynomial kernels

Regression with quadratic functions involves feature vectors containing all monomials of degree up to two; for $x \in \mathbb{R}^2$, a suitably scaled choice is
$$\phi(x) = \left(1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\right).$$
In fact, given two vectors $x, z \in \mathbb{R}^2$, we have
$$\phi(x)^\top \phi(z) = (1 + x^\top z)^2.$$
More generally, when $\phi(x)$ contains all (suitably scaled) monomials of degree up to $d$ in $x \in \mathbb{R}^n$, we have
$$\phi(x)^\top \phi(z) = (1 + x^\top z)^d.$$
The computational effort of one kernel evaluation grows linearly in $n$. This represents a dramatic reduction in speed over the ‘‘brute force’’ approach: forming the feature vectors explicitly and taking their scalar product would cost on the order of $n^d$ operations.
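A quick numerical check of the degree-2 identity, using the scaled monomial features from above (the particular test vectors are arbitrary):

```python
import numpy as np

def phi2(x):
    """Scaled degree-2 monomial features of x in R^2."""
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1],
                     x[0] ** 2, s * x[0] * x[1], x[1] ** 2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

lhs = phi2(x) @ phi2(z)   # explicit feature vectors: O(n^d) work in general
rhs = (1.0 + x @ z) ** 2  # kernel trick: O(n) work
```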
Gaussian kernels

The Gaussian kernel function is
$$k(x, z) = \exp\left(-\frac{\|x - z\|_2^2}{2\sigma^2}\right),$$
where $\sigma > 0$ is a scale parameter. It corresponds to an infinite-dimensional feature mapping, yet each evaluation costs only $O(n)$ operations.

Other kernels

There is a large variety (a zoo?) of other kernels, some adapted to the structure of data (text, images, etc.).

Kernels in practice
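As an illustration of the full pipeline, here is a sketch of kernel least-squares with a Gaussian kernel (the data, width $\sigma$, and regularization $\lambda$ are made up, not a recipe for choosing them): form $K$, solve for $v$, and predict via kernel evaluations, without ever forming a feature vector.

```python
import numpy as np

def gauss_kernel(A, B, sigma):
    """Gaussian kernel matrix: K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(2)
Xtr = rng.uniform(-3, 3, size=(80, 1))               # training inputs
ytr = np.sin(Xtr[:, 0]) + 0.1 * rng.standard_normal(80)
sigma, lam = 0.5, 1e-3                               # made-up hyperparameters

# Training: solve (K + lam I) v = y.
K = gauss_kernel(Xtr, Xtr, sigma)
v = np.linalg.solve(K + lam * np.eye(len(Xtr)), ytr)

# Prediction at new points: yhat_j = sum_i v_i k(x_i, x_new_j).
Xte = np.linspace(-3, 3, 50)[:, None]
yhat = gauss_kernel(Xte, Xtr, sigma) @ v
```

Note that the hyperparameters $\sigma$ and $\lambda$ strongly affect the fit and are typically chosen by cross-validation.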