Goal
Accelerate the computation of kernel matrix by directly approximating the kernel function.
Map x into a low-dimension random feature space, $z: \mathcal { R } ^ { d } \rightarrow \mathcal { R } ^ { D }$
\[k ( \mathbf { x } , \mathbf { y } ) = \langle \phi ( \mathbf { x } ) , \phi ( \mathbf { y } ) \rangle \approx \mathbf { z } ( \mathbf { x } ) ^ { \prime } \mathbf { z } ( \mathbf { y } )\]Basic Thought
Rewrite k(x, y) as an expectation over a certain probability space, and hence we can use sample mean to approximate it.
Method: Random Fourier Features
Some kernels depend only on $\Delta = abs(x-y)$, for example, Gaussian kernel,
\[e ^ { - \frac { \| \Delta \| _ { 2 } ^ { 2 } } { 2 } }\]There is a theorem by Bochner: A continuous kernel $k ( \mathrm { x } , \mathrm { y } ) = k ( \mathrm { x } - \mathrm { y } )$ on $\mathcal { R } ^ { d }$ is positive definite if and only if $k ( \delta )$ is the Fourier transform of a non-negative measure.
And recall the Fourier Transform, $X ( j \omega ) = \int _ { - \infty } ^ { + \infty } x ( t ) e ^ { - j \omega t } d t$, we let
\[p ( \omega ) = \frac { 1 } { 2 \pi } \int e ^ { - j \omega ^ { \prime } \delta } k ( \delta ) d \delta\]And use the inverse formula:
\[k ( \mathbf { x } - \mathbf { y } ) = \int _ { \mathcal { R } ^ { d } } p ( \omega ) e ^ { j \omega ^ { \prime } ( \mathbf { x } - \mathbf { y } ) } d \omega = E _ { \omega } \left[ \zeta _ { \omega } ( \mathbf { x } ) \zeta _ { \omega } ( \mathbf { y } ) ^ { * } \right]\]By sampling different $\omega$ from $p(\omega)$, z(x) is a vector of D $\zeta _ { \omega } ( \mathbf { x } )$, and
\[k ( \mathbf { x } , \mathbf { y } ) = k ( \mathbf { x } - \mathbf { y } ) \approx \mathbf { z } ( \mathbf { x } ) ^ { \prime } \mathbf { z } ( \mathbf { y } )\]Plus, note that both $k()$ and $p(\omega)$ are real-valued, so we could ignore the imaginary part,
\[k ( x - y ) = \int _ { R ^ { d } } p ( \omega ) \cos \left( \omega ^ { T } ( x - y ) \right) d \omega\]After adding a random variable uniformly drawn from $[0, 2\pi]$, (to avoid additionally calculating $sin(\omega^{ T } x)$, since $cos(\omega^{T} (x-y)) = cos(\omega^{T} x) cos(\omega^{T} y) + sin(\omega^{T} x) sin(\omega^{T} y)$)
\[= \int _ { 0 } ^ { 2 \pi } \frac { 1 } { 2 \pi } d b \int _ { R ^ { d } } p ( \omega ) 2 \cos \left( \omega ^ { T } x + b \right) \cos \left( \omega ^ { T } y + b \right) d \omega = E \left[ z _ { \omega } ( x ) z _ { \omega } ( y ) \right]\]So the final map is $z _ { \omega } ( x ) = \sqrt { 2 } \cos \left( \omega ^ { T } x + b \right)$
How close is the approximation to the original kernel function?
By Hoeffding inequality, (since the rv, i.e. the integral, is bounded)
\[\operatorname { Pr } \left[ \left| z ( x ) ^ { T } z ( y ) - k ( x , y ) \right| \geq \epsilon \right] \leq 2 e ^ { - D \left( \frac { \epsilon } { 2 } \right) ^ { 2 } }\]Method: Random binning Features
First try to approximate a special “hat” kernel
\[k _ { h a t } ( x , y ; \delta ) = \max ( 0, 1 - \frac { | x - y | } { \delta } )\]Partition the real number line with a grid of pitch δ, and shift this grid randomly by an amount u drawn uniformly at random from [0,δ]. This grid partitions the real number line into intervals [u + nδ,u + (n + 1)δ] for all integers n.
Extend it to more general kernel function
And we can rewrite a kernel function depending only on the L1 distance between pairs of points,
\[\int _ { 0 } ^ { \infty } k _ { h a t } ( x , y ; \delta ) p ( \delta ) d \delta\]the distribution p could be set as, $p ( \delta ) = \delta \ddot { k } ( \delta )$, which is guarenteed by the lemma 1 in this paper.
Random maps for separable multivariate shift-invariant kernels of the form can also utilize this method.
\[k(x − y) = \prod _ { m = 1 } ^ { d } k _ { m } ( | x ^ { m } - y ^ { m } | )\]