Polynomial Feature Extraction

2021-07-15

Polynomial regression adds a new kind of feature to ordinary linear regression. It is something very simple and requires only one change from classic linear regression. Polynomial regression is useful if the explanatory variable and the dependent variable are related in a non-linear way. For example: we need food to get energy, but if we eat a lot we get tired, so a parabolic pattern is formed; when we eat food in a healthy amount, we get the maximum energy, but if the opposite happens we get lazy and might go to bed faster. By the way, the last sentence is absolutely, scientifically correct and is a perfect representation of a polynomial relationship. Polynomial regression is still linear in the sense that the model is linear in the parameters $\theta$. So polynomial regression is linear regression with polynomial features of the data. (By the way, I get the ideas for the blogs from here.)

Little Maths

Yup, I am blatantly ripping off the math from Wikipedia and some books for this explanation (pardon me!), with a little of my own contributions (which I keep to a minimum, as I am most probably wrong). All of the steps remain the same as in linear regression. The only thing that changes is the features that are used to estimate $\theta$.

Let us assume that the matrix $X$, in the simplest case, is of size $n \times 1$. Here, $n$ is the number of entries, or number of cases, and $1$ denotes that we have only a single feature; a more complex example will be shown later. After extracting polynomial features up to degree $m$, the matrix $X$ can be represented as:

$$X = \begin{pmatrix} x_1 & x^2_1 & \cdots & x^m_1 \\ x_2 & x^2_2 & \cdots & x^m_2 \\ x_3 & x^2_3 & \cdots & x^m_3 \\ \vdots & \vdots & \ddots & \vdots \\ x_n & x^2_n & \cdots & x^m_n \end{pmatrix}, \qquad y = X\theta$$

We will be leaving out the $\varepsilon$ term to keep the discussion simpler. So the final equation can be given as,

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_1 & x^2_1 & \cdots & x^m_1 \\ x_2 & x^2_2 & \cdots & x^m_2 \\ x_3 & x^2_3 & \cdots & x^m_3 \\ \vdots & \vdots & \ddots & \vdots \\ x_n & x^2_n & \cdots & x^m_n \end{pmatrix} \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_m \end{pmatrix}$$

The value $m$ denotes that the explanatory variable $X$ and the dependent variable $y$ are related at the $m^{th}$ polynomial degree (I cannot seem to word it better). For example: if $X$ and $y$ were related linearly, the polynomial degree $m$ would be 1. The example about the food would be related at polynomial degree 2.
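As a quick sketch of the single-feature case above, the design matrix $[x, x^2, \ldots, x^m]$ can be built like this (plain Rust with `Vec`s rather than the `ndarray` types used later; `poly_design` is a hypothetical helper, not part of the original code):

```rust
// Build the polynomial design matrix for a single feature:
// row i holds [x_i, x_i^2, ..., x_i^m].
fn poly_design(x: &[f64], m: u32) -> Vec<Vec<f64>> {
    x.iter()
        .map(|&xi| (1..=m).map(|k| xi.powi(k as i32)).collect())
        .collect()
}

fn main() {
    // Three entries (n = 3), expanded up to degree m = 3.
    let design = poly_design(&[1.0, 2.0, 3.0], 3);
    // Row for x = 2 is [2, 4, 8].
    assert_eq!(design[1], vec![2.0, 4.0, 8.0]);
    println!("{:?}", design);
}
```

Each row corresponds to one entry, and each column to one power of the feature, exactly mirroring the matrix above.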

Now, when we have an explanatory variable $X$ with $n$ entries and $p$ features, the matrix can be given as:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

So, if this new $X$ were to be written in the same form as before, it would look something like this:

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} & x^2_{11} & x^2_{12} & \cdots & x^2_{1p} & \cdots & x^m_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} & x^2_{21} & x^2_{22} & \cdots & x^2_{2p} & \cdots & x^m_{2p} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} & x^2_{n1} & x^2_{n2} & \cdots & x^2_{np} & \cdots & x^m_{np} \end{pmatrix} \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_{p \times m} \end{pmatrix}$$

That looks like an awfully long matrix, doesn't it? Here we first raise each element of the matrix $X$ to every power from 1 to $m$ and concatenate each of those blocks to the original matrix. The shape of the parameter vector changes as well: the shape of $\theta$ becomes $(m \cdot p) \times 1$. This is still linear regression, as the model remains linear in the parameters $\theta$.
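The raise-and-concatenate step can be sketched as follows (again plain Rust with nested `Vec`s; `powers_concat` is a made-up name for illustration):

```rust
// For an n×p matrix, append every column raised to each power 1..=m,
// giving an n×(p·m) matrix laid out as [X, X^2, ..., X^m].
fn powers_concat(x: &[Vec<f64>], m: u32) -> Vec<Vec<f64>> {
    x.iter()
        .map(|row| {
            (1..=m)
                .flat_map(|k| row.iter().map(move |&v| v.powi(k as i32)))
                .collect()
        })
        .collect()
}

fn main() {
    // One entry with p = 2 features, expanded to degree m = 2:
    // [x1, x2] becomes [x1, x2, x1^2, x2^2].
    let out = powers_concat(&[vec![2.0, 3.0]], 2);
    println!("{:?}", out); // [[2.0, 3.0, 4.0, 9.0]]
}
```

Note that this version only raises individual columns to powers; it produces no cross terms between features, which is what the binomial-expansion approach below adds.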

Now, let's look at another way of making that same $X$ matrix even longer. This one requires us to write it in a more algorithmic style.

To make this simpler, let's assume that the number of features $p$ is even, so that we can divide the matrix $X$ into two equal halves as such,

$$X_1 = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1\frac{p}{2}} \\ x_{21} & x_{22} & \cdots & x_{2\frac{p}{2}} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{n\frac{p}{2}} \end{pmatrix} \quad \text{and} \quad X_2 = \begin{pmatrix} x_{1(\frac{p}{2}+1)} & x_{1(\frac{p}{2}+2)} & \cdots & x_{1p} \\ x_{2(\frac{p}{2}+1)} & x_{2(\frac{p}{2}+2)} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n(\frac{p}{2}+1)} & x_{n(\frac{p}{2}+2)} & \cdots & x_{np} \end{pmatrix}$$

Now we perform the binomial expansion of $(X_1 + X_2)^m$ and concatenate each term of the expansion, without its coefficient, to the original $X$.

So for a polynomial degree $m$, the full feature matrix can be given as:

$$\text{Final } X = \begin{pmatrix} X & \cdots & X_1^m & X_1^{m-1}X_2^1 & \cdots & X_2^m \end{pmatrix}$$

Here, $X_1^{m-1}X_2^1$ means the element-wise exponentiation of $X_1$ to the power $m-1$ and of $X_2$ to the power $1$, followed by the element-wise multiplication of the two results. Each such term is then concatenated to the original $X$. This is the same idea as how polynomial feature extraction works in scikit-learn.
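A single coefficient-free binomial term can be sketched like this (plain Rust; `binom_term` is a hypothetical helper computing the element-wise $X_1^a \cdot X_2^b$ for the two halves):

```rust
// Element-wise X1^a * X2^b, where X1 and X2 are the two halves of the
// data (same shape). This is one coefficient-free binomial term.
fn binom_term(x1: &[Vec<f64>], x2: &[Vec<f64>], a: i32, b: i32) -> Vec<Vec<f64>> {
    x1.iter()
        .zip(x2)
        .map(|(r1, r2)| {
            r1.iter()
                .zip(r2)
                .map(|(&u, &v)| u.powi(a) * v.powi(b))
                .collect()
        })
        .collect()
}

fn main() {
    // One row with halves X1 = [2] and X2 = [3]:
    // the term X1^2 * X2^1 is 2^2 * 3 = 12.
    let t = binom_term(&[vec![2.0]], &[vec![3.0]], 2, 1);
    println!("{:?}", t); // [[12.0]]
}
```

Looping `a` from `m` down to `0` with `b = m - a` produces every term $X_1^m, X_1^{m-1}X_2, \ldots, X_2^m$ of the expansion for a given degree.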

Coding

Programming the aforementioned process is pretty simple. We expand the data into its binomial expansion without the coefficients, and the rest is the same as ordinary linear regression. The feature extraction can be written as follows in Rust. (This is a follow-up to my previous blog.) The implementation can be written inside the impl block of either a Poly struct (if you want to treat it as a different type of machine learning model) or the Linear struct (if you treat it as feature extraction only).

fn make_poly(data: &Array2<f64>, poly: i32) -> Array2<f64> {
    let split = data.ncols() / 2;
    let data_1 = data.slice(s![.., 0..split]);
    let data_2 = data.slice(s![.., split..split * 2]);
    // Start from the original data and append one binomial term per step.
    let mut temp = data.to_owned(); // This clone is ugly!
    // Degree 1 is already present in `data`, so start from degree 2.
    for i in 2..poly + 1 {
        for j in 0..i + 1 {
            // Element-wise data_1^(i - j) * data_2^j, appended column-wise.
            temp = concatenate![
                Axis(1),
                temp,
                (data_1.mapv(|a| a.powi(i - j)) * data_2.mapv(|a| a.powi(j)))
            ];
        }
    }
    temp
    // TODO: if ncols() is odd, the last column is left out of the split
    // and would need to be handled separately.
}
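To sanity-check the logic without pulling in ndarray, here is a hypothetical stdlib-only mirror of the same loop (`make_poly_vec` is not part of the original code, just a sketch over nested `Vec`s):

```rust
// A stdlib-only mirror of make_poly: split the columns in half, then
// for each degree i = 2..=m append every element-wise term X1^(i-j) * X2^j.
fn make_poly_vec(data: &[Vec<f64>], m: i32) -> Vec<Vec<f64>> {
    let split = data[0].len() / 2;
    data.iter()
        .map(|row| {
            let (x1, x2) = (&row[..split], &row[split..split * 2]);
            let mut out = row.clone();
            for i in 2..=m {
                for j in 0..=i {
                    for k in 0..split {
                        out.push(x1[k].powi(i - j) * x2[k].powi(j));
                    }
                }
            }
            out
        })
        .collect()
}

fn main() {
    // One entry with p = 4 features, degree m = 2.
    let out = make_poly_vec(&[vec![1.0, 2.0, 3.0, 4.0]], 2);
    // Original 4 columns + 3 terms (X1^2, X1·X2, X2^2) × 2 columns = 10.
    println!("{}", out[0].len()); // 10
}
```

Checking the column count against what the math predicts is a cheap way to catch off-by-one mistakes in the degree loop.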

The make_poly function looks rather inefficient, so I am looking into a better way of implementing it. We first slice the data into two halves and expand the two halves in binomial form, as shown before. We could also check whether the number of columns is even; if it is, we don't do anything special, and if it isn't, we could do all sorts of things, like concatenating the remaining columns at the end, squaring them and concatenating, etc. This is basically what polynomial regression is. Next up: Logistic Regression :)