Approach#

Linear regression is a classical mathematical method for constructing functions that are close, in a suitable sense, to a given set of points. In data science, linear regression does a pretty good job, even though it is relatively simple.

Linear Regression with Linear Functions#

The Principle#

We first consider the simplest regression model. The parameter dependent hypothesis is of the form

\[\begin{equation*} f_{\mathrm{approx}}(x)=a_0+a_1\,x^{(1)}+a_2\,x^{(2)}+\cdots+a_m\,x^{(m)}=a_0+\sum_{k=1}^m a_k\,x^{(k)}, \end{equation*}\]

where \(x=\bigl(x^{(1)},\ldots,x^{(m)}\bigr)\) is a feature vector and \(a_0,a_1,\ldots,a_m\) are the parameters of the hypothesis \(f_{\mathrm{approx}}\).

Given \(n\) training samples \((x_1,y_1),\ldots,(x_n,y_n)\) we want to choose \(a_0,a_1,\ldots,a_m\) such that

\[\begin{equation*} f_{\mathrm{approx}}(x_l)\approx y_l\qquad\text{for $l=1,\ldots,n$}. \end{equation*}\]

To achieve this we solve the minimization problem

\[\begin{equation*} \boxed{ \frac{1}{n}\sum_{l=1}^n\bigl(f_{\mathrm{approx}}(x_l)-y_l\bigr)^2\to\min_{a_0,\ldots,a_m} }. \end{equation*}\]

The function \(g\) defined by

\[\begin{equation*} g(u,v)=\frac{1}{n}\sum_{l=1}^n\bigl(u_l-v_l\bigr)^2 \end{equation*}\]

is a so-called loss function expressing the distance between two vectors \(u=(u_1,\ldots,u_n)\) and \(v=(v_1,\ldots,v_n)\). There are several other loss functions in machine learning, such as the mean absolute error and the Huber loss; see Quality Measures. For the moment we content ourselves with the simplest one, the mean squared error.
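In code the mean squared error is just a one-liner. Here is a tiny NumPy sketch with made-up vectors \(u\) and \(v\):

import numpy as np

u = np.array([1.0, 2.5, 3.0])   # made-up predictions
v = np.array([1.2, 2.0, 3.1])   # made-up labels

mse = np.mean((u - v) ** 2)     # (1/n) * sum of squared differences
print(mse)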

Note that we cannot expect \(f_{\mathrm{approx}}(x_l)=y_l\), because our model, that is, our assumption of a linear function, is likely too simplistic. In addition, the observations \(y_l\) in regression problems are often corrupted by noise.

Solving the Minimization Problem Analytically#

The minimization problem

\[\begin{equation*} h(a_0,\ldots,a_m):=\frac{1}{n}\sum_{l=1}^n\bigl(f_{\mathrm{approx}}(x_l)-y_l\bigr)^2 =\frac{1}{n}\sum_{l=1}^n\left(a_0+\sum_{k=1}^m a_k\,x_l^{(k)}-y_l\right)^2 \to\min_{a_0,\ldots,a_m} \end{equation*}\]

is known to have only global minima (no other stationary points), which can be found via differential calculus. We simply have to find \(a_0,\ldots,a_m\) where the gradient of \(h\) is the zero vector. To keep formulas as simple as possible we introduce an additional artificial zeroth feature \(x_l^{(0)}:=1\) for \(l=1,\ldots,n\). This allows us to write

\[\begin{equation*} h(a_0,\ldots,a_m)=\frac{1}{n}\sum_{l=1}^n\left(\sum_{k=0}^m a_k\,x_l^{(k)}-y_l\right)^2. \end{equation*}\]

Taking the derivative with respect to \(a_i\) for \(i=0,1,\ldots,m\) we obtain

\[\begin{align*} \frac{\partial}{\partial a_i}h(a_0,\ldots,a_m) &=\frac{1}{n}\sum_{l=1}^n 2\,\left(\sum_{k=0}^m a_k\,x_l^{(k)}-y_l\right)\,x_l^{(i)}\\ &=\frac{2}{n}\sum_{l=1}^n\sum_{k=0}^m a_k\,x_l^{(k)}\,x_l^{(i)}-\frac{2}{n}\sum_{l=1}^n y_l\,x_l^{(i)}\\ &=\frac{2}{n}\sum_{k=0}^m a_k\,\sum_{l=1}^n x_l^{(k)}\,x_l^{(i)}-\frac{2}{n}\sum_{l=1}^n y_l\,x_l^{(i)}. \end{align*}\]

Thus, the gradient is zero if and only if

\[\begin{equation*} \sum_{k=0}^m a_k\,\sum_{l=1}^n x_l^{(k)}\,x_l^{(i)}=\sum_{l=1}^n y_l\,x_l^{(i)}\qquad\text{for $i=0,1,\ldots,m$}. \end{equation*}\]

This is a system of linear equations for the unknowns \(a_0,a_1,\ldots,a_m\). Sometimes these equations are called the normal equations of the minimization problem. Taking the feature vectors \(x_1,\ldots,x_n\), each prepended with the artificial zeroth feature \(1\), as rows of a matrix \(B\in\mathbb{R}^{n\times(m+1)}\), that is,

\[\begin{equation*} B:=\begin{bmatrix}1&x_1^{(1)}&\cdots&x_1^{(m)}\\\vdots&\vdots&&\vdots\\1&x_n^{(1)}&\cdots&x_n^{(m)}\end{bmatrix} \end{equation*}\]

we see that the gradient of \(h\) is zero if and only if

\[\begin{equation*} \boxed{ B^{\mathrm{T}}\,B\,a=B^{\mathrm{T}}\,y. } \end{equation*}\]

Here \(a\) and \(y\) are the vectors of the unknowns and of the labels, respectively. Note that the system matrix \(B^{\mathrm{T}}\,B\) is a square matrix of size \((m+1)\times(m+1)\), where \(m\) is the number of features in our data.
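In NumPy, setting up \(B\) and solving the normal equations takes only a few lines. The following sketch uses made-up training data just to illustrate the procedure:

import numpy as np

# made-up training data: n = 4 samples, m = 2 features
X_demo = np.array([[1.0, 2.0],
                   [2.0, 0.5],
                   [3.0, 1.5],
                   [4.0, 3.0]])
y_demo = np.array([3.0, 2.5, 5.0, 7.5])

# design matrix B with the artificial zeroth feature 1 in the first column
B_demo = np.ones((X_demo.shape[0], X_demo.shape[1] + 1))
B_demo[:, 1:] = X_demo

# solve the normal equations B^T B a = B^T y
a_demo = np.linalg.solve(np.matmul(B_demo.T, B_demo), np.matmul(B_demo.T, y_demo))
print(a_demo)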

Example: Used Cars Prices#

We look at a data set containing information about used cars offered on the online platform CarDekho. The data set contains age, selling price, kilometers driven and several other parameters of about 4000 cars.

The data set has been obtained via Kaggle and is licensed under the Database Contents License (DbCL) v1.0, including the Open Database License (ODbL) v1.0.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

cars = pd.read_csv('cars.csv')

cars
|      | name | year | selling_price | km_driven | fuel | seller_type | transmission | owner |
|------|------|------|---------------|-----------|------|-------------|--------------|-------|
| 0    | Maruti 800 AC | 2007 | 60000 | 70000 | Petrol | Individual | Manual | First Owner |
| 1    | Maruti Wagon R LXI Minor | 2007 | 135000 | 50000 | Petrol | Individual | Manual | First Owner |
| 2    | Hyundai Verna 1.6 SX | 2012 | 600000 | 100000 | Diesel | Individual | Manual | First Owner |
| 3    | Datsun RediGO T Option | 2017 | 250000 | 46000 | Petrol | Individual | Manual | First Owner |
| 4    | Honda Amaze VX i-DTEC | 2014 | 450000 | 141000 | Diesel | Individual | Manual | Second Owner |
| ...  | ... | ... | ... | ... | ... | ... | ... | ... |
| 4335 | Hyundai i20 Magna 1.4 CRDi (Diesel) | 2014 | 409999 | 80000 | Diesel | Individual | Manual | Second Owner |
| 4336 | Hyundai i20 Magna 1.4 CRDi | 2014 | 409999 | 80000 | Diesel | Individual | Manual | Second Owner |
| 4337 | Maruti 800 AC BSIII | 2009 | 110000 | 83000 | Petrol | Individual | Manual | Second Owner |
| 4338 | Hyundai Creta 1.6 CRDi SX Option | 2016 | 865000 | 90000 | Diesel | Individual | Manual | First Owner |
| 4339 | Renault KWID RXT | 2016 | 225000 | 40000 | Petrol | Individual | Manual | First Owner |

4340 rows × 8 columns

We concentrate on one car model. Let’s take the most frequent one.

grouped_by_name = cars.groupby('name') 
name = grouped_by_name.size().idxmax()
cars_subset = grouped_by_name.get_group(name)

cars_subset
|      | name | year | selling_price | km_driven | fuel | seller_type | transmission | owner |
|------|------|------|---------------|-----------|------|-------------|--------------|-------|
| 117  | Maruti Swift Dzire VDI | 2014 | 434000 | 79350 | Diesel | Individual | Manual | Second Owner |
| 327  | Maruti Swift Dzire VDI | 2018 | 800000 | 25000 | Diesel | Individual | Manual | First Owner |
| 338  | Maruti Swift Dzire VDI | 2015 | 490000 | 60000 | Diesel | Individual | Manual | Second Owner |
| 365  | Maruti Swift Dzire VDI | 2014 | 400000 | 90000 | Diesel | Individual | Manual | First Owner |
| 409  | Maruti Swift Dzire VDI | 2012 | 215000 | 80000 | Diesel | Individual | Manual | Third Owner |
| ...  | ... | ... | ... | ... | ... | ... | ... | ... |
| 4189 | Maruti Swift Dzire VDI | 2013 | 425000 | 61083 | Diesel | Dealer | Manual | First Owner |
| 4245 | Maruti Swift Dzire VDI | 2014 | 480000 | 101000 | Diesel | Dealer | Manual | First Owner |
| 4265 | Maruti Swift Dzire VDI | 2014 | 480000 | 101000 | Diesel | Dealer | Manual | First Owner |
| 4289 | Maruti Swift Dzire VDI | 2019 | 680000 | 40000 | Diesel | Individual | Manual | First Owner |
| 4314 | Maruti Swift Dzire VDI | 2015 | 470000 | 170000 | Diesel | Individual | Manual | First Owner |

69 rows × 8 columns

Before we start any computation we should have a look at the data.

fig, ax = plt.subplots()

ax.scatter(cars_subset['km_driven'].to_numpy(), cars_subset['selling_price'].to_numpy(), s=10, c='b')
ax.set_xlabel('km')
ax.set_ylabel('price')

plt.show()
[Figure: scatter plot of selling price vs. kilometers driven]

This does not look like a linear dependence between price and kilometers driven. In fact, there seems to be no systematic dependence between the two quantities at all. But intuition suggests that there should be one, so we should take other features into account. Perhaps the dependence between price and kilometers driven itself depends on the age of the car?

Let’s visualize different ages.

fig, ax = plt.subplots()

ax.scatter(cars_subset['km_driven'].to_numpy(), cars_subset['selling_price'].to_numpy(),
           s=10, c=2020 - cars_subset['year'].to_numpy(), cmap='jet')
ax.set_xlabel('km')
ax.set_ylabel('price')

plt.show()
[Figure: scatter plot of selling price vs. kilometers driven, colored by car age]

The blue region shows some linear structure, and so does the reddish region. Thus, age should be taken into account when predicting prices. Now let's do two regressions: one for young cars and one for old ones.

from matplotlib import colors

def simple_linear_regression(X, y):
    # design matrix B with a column of ones and the feature column
    B = np.ones((X.size, 2))
    B[:, 1] = X
    # solve the normal equations B^T B a = B^T y
    return np.linalg.solve(np.matmul(B.T, B), np.matmul(B.T, y))

year = 2014   # where to split the data set into young and old

mask_young = cars_subset['year'] > year
X_young = cars_subset.loc[mask_young, 'km_driven'].to_numpy()
y_young = cars_subset.loc[mask_young, 'selling_price'].to_numpy()

mask_old = cars_subset['year'] <= year
X_old = cars_subset.loc[mask_old, 'km_driven'].to_numpy()
y_old = cars_subset.loc[mask_old, 'selling_price'].to_numpy()

a_young = simple_linear_regression(X_young, y_young)
a_old = simple_linear_regression(X_old, y_old)

print(a_young)
print(a_old)

fig, ax = plt.subplots()

ax.scatter(cars_subset['km_driven'].to_numpy(), cars_subset['selling_price'].to_numpy(),
           s=10, c=(mask_young.to_numpy()), cmap=colors.ListedColormap(['r', 'b']))

xmin = 0
xmax = 260000
ax.plot([xmin, xmax], [a_young[0] + a_young[1] * xmin, a_young[0] + a_young[1] * xmax], '-b', label='young')
ax.plot([xmin, xmax], [a_old[0] + a_old[1] * xmin, a_old[0] + a_old[1] * xmax], '-r', label='old')

ax.set_xlabel('km')
ax.set_ylabel('price')
ax.legend()
plt.show()
[ 6.74201542e+05 -1.68902485e+00]
[3.64966780e+05 1.81825528e-01]
[Figure: data points for young (blue) and old (red) cars with the two regression lines]

Given the kilometers driven, the simple formula

\[\begin{equation*} \boxed{ y = 674202 - 1.69\,x } \end{equation*}\]

yields a prediction for the price we could get for a young car, where \(x\) is the kilometers driven and \(y\) is the price. For old cars we have the model

\[\begin{equation*} \boxed{ y = 364967 + 0.18\,x } \end{equation*}\]

Note that these formulas hold only for the selected car model. In addition, we considered only very simple hypotheses with two parameters each.
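For instance, evaluating the young-car model for a few mileages (a small sketch reusing a_young from the regression above):

# a_young from the regression above
for km in [20000, 60000, 100000]:
    print(km, a_young[0] + a_young[1] * km)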

Instead of doing two one-dimensional regressions for young and old cars, we could do a two-dimensional regression including both features, km driven and age.

First, we have a look at the data.

import plotly.graph_objects as go

xcoords = cars_subset['km_driven'].to_numpy()
ycoords = 2020 - cars_subset['year'].to_numpy()
zcoords = cars_subset['selling_price'].to_numpy()

fig = go.Figure()
fig.layout.width = 800
fig.layout.height = 600

fig.add_trace(go.Scatter3d(
    x=xcoords, y=ycoords, z=zcoords,
    mode='markers',
    marker={'color': '#0000ff', 'size': 2}
))
fig.update_scenes(
    xaxis_title_text='km',
    yaxis_title_text='age',
    zaxis_title_text='price',
)

fig.show()

Then we do the regression.

X = np.stack((cars_subset['km_driven'].to_numpy(),
              2020 - cars_subset['year'].to_numpy()), axis=1)
y = cars_subset['selling_price'].to_numpy()

B = np.ones((X.shape[0], 3))   # design matrix with columns 1, km, age
B[:, 1:] = X
a = np.linalg.solve(np.matmul(B.T, B), np.matmul(B.T, y))   # solve the normal equations

print(a)

fig = go.Figure()
fig.layout.width = 800
fig.layout.height = 600

fig.add_trace(go.Scatter3d(
    x=X[:, 0], y=X[:, 1], z=y,
    mode='markers',
    marker={'color': '#0000ff', 'size': 2}
))

xmin = 0
xmax = 250000
ymin = 0
ymax = 10
xcoords, ycoords = np.meshgrid([xmin, xmax], [ymin, ymax])
zcoords = a[0] + a[1] * xcoords + a[2] * ycoords
fig.add_trace(
    go.Surface(x=xcoords, y=ycoords, z=zcoords, showscale=False,
               colorscale=[[0, 'rgba(255, 0, 0, 0.8)'], [1, 'rgba(255, 0, 0, 0.8)']])
)

fig.update_scenes(
    xaxis_title_text='km',
    yaxis_title_text='age',
    zaxis_title_text='price',
)

fig.show()
[ 7.69448384e+05 -2.07046783e-01 -5.47861786e+04]

Now our model for the price \(y\) depending on kilometers \(x_1\) and age \(x_2\) is

\[\begin{equation*} \boxed{ y = 769448 - 0.21\,x_1 - 54786\,x_2. } \end{equation*}\]
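For example, a five-year-old car with 100000 kilometers on the clock would be priced at

\[\begin{equation*} y = 769448 - 0.21\cdot 100000 - 54786\cdot 5 = 474518. \end{equation*}\]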

Limitations#

Linear regression with linear functions yields inaccurate results if the data set does not show a linear structure. Outliers may also distort the results. Thus, linear regression has to be used with care, and a preceding visual analysis is mandatory.

Have a look at Anscombe's quartet for very different data sets that all yield the same result if linear regression with linear functions is used.
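Anscombe's quartet ships with Seaborn as an example data set, so the effect is easy to reproduce. A minimal sketch, assuming the data set name 'anscombe' and its columns dataset, x, y as documented by Seaborn:

import seaborn as sns

anscombe = sns.load_dataset('anscombe')   # columns: dataset, x, y

# one regression line per sub data set; all four lines are nearly identical
sns.lmplot(data=anscombe, x='x', y='y', col='dataset', col_wrap=2)
plt.show()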

Linear Regression with Seaborn#

Seaborn offers pairplot for getting a first overview of the pairwise relationships between all variables. See Plotting pairwise data relationships in the Seaborn documentation for more details.

import seaborn as sns

sns.pairplot(cars_subset)
plt.show()
[Figure: Seaborn pairplot of the cars_subset data]

Seaborn also provides linear regression directly. See Estimating regression fits for more details.

sns.regplot(x='year', y='selling_price', data=cars_subset)
<Axes: xlabel='year', ylabel='selling_price'>
[Figure: Seaborn regression fit of selling_price vs. year]

Linear Regression with Scikit-Learn#

Linear regression is implemented by almost all machine learning libraries. Scikit-Learn is a good choice due to its simple API. It offers a module linear_model.

We reimplement linear regression with two features for the above example of used car prices.

The workflow is as follows:

  • create a LinearRegression object,

  • fit the object to the data (that is, do the regression),

  • extract the coefficients from the object or use the object for prediction.

# X, y from above

import sklearn.linear_model as lm

regression = lm.LinearRegression()

regression.fit(X, y)

print(regression.intercept_)    # a[0]
print(regression.coef_)         # a[1], a[2]
769448.3841336307
[-2.07046783e-01 -5.47861786e+04]

From Scikit-Learn we obtain the same coefficients as above. To evaluate the regression function, call predict.

age = 3.5
km = 40000

price = regression.predict([[km, age]])

print(price)
[569414.88787459]

For comparison with more complex models below, we calculate the root mean squared error on the training set.

import sklearn.metrics as metrics

metrics.mean_squared_error(y, regression.predict(X), squared=False)
89976.37031916517

Linear Regression with Nonlinear Functions#

Idea#

We consider more general parameter dependent hypotheses of the form

\[\begin{equation*} f_{\mathrm{approx}}(x)=a_1\,\varphi_1\bigl(x^{(1)},\ldots,x^{(m)}\bigr)+\cdots+a_\mu\,\varphi_\mu\bigl(x^{(1)},\ldots,x^{(m)}\bigr)=\sum_{\kappa=1}^\mu a_\kappa\,\varphi_\kappa\bigl(x^{(1)},\ldots,x^{(m)}\bigr), \end{equation*}\]

where \(x=\bigl(x^{(1)},\ldots,x^{(m)}\bigr)\) is a feature vector, \(a_1,\ldots,a_\mu\) are the parameters of the hypothesis \(f_{\mathrm{approx}}\), and \(\varphi_1,\ldots,\varphi_\mu\) are prescribed real-valued functions on \(\mathbb{R}^m\).

Note that \(f_{\mathrm{approx}}\) depends linearly on the parameters \(a_1,\ldots,a_\mu\); hence the name linear regression. In contrast, in nonlinear regression we would consider hypotheses that contain the sought parameters in a nonlinear fashion. This is the most general case, but it is rarely needed in practice.

Example: Fitting a Parabola#

Consider data with only one feature, that is, \(m=1\). We would like to fit a parabola to such a data set.

rng = np.random.default_rng(0)

def f(x):
    return (x - 1) ** 2 + 1

fig, ax = plt.subplots()

xgrid = np.linspace(-1, 3, 100)
ax.plot(xgrid, f(xgrid), '-r')

x = rng.uniform(-0.5, 2.7, 20)
yy = f(x) + rng.normal(0, 0.3, 20)
ax.plot(x, yy, 'ob', markersize=5)

ax.set_xticklabels([])
ax.set_yticklabels([])

plt.show()
[Figure: parabola (red) and noisy samples (blue)]

In mathematical notation our hypothesis has the form

\[\begin{equation*} f_{\mathrm{approx}}(x)=a_1\,x^2+a_2\,x+a_3, \end{equation*}\]

where we replaced \(x^{(1)}\) by \(x\) since we only have one feature. Thus, fitting a parabola is a special case of our general linear regression approach with

\[\begin{equation*} \mu=3,\qquad\varphi_1(x)=x^2,\qquad\varphi_2(x)=x,\qquad\varphi_3(x)=1. \end{equation*}\]
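With the noisy samples generated above, the parabola fit again reduces to solving the normal equations. A short sketch reusing x, yy, and f from the code cell above:

# x, yy, f from above
B = np.stack((x ** 2, x, np.ones_like(x)), axis=1)   # columns: x^2, x, 1
a = np.linalg.solve(np.matmul(B.T, B), np.matmul(B.T, yy))
print(a)   # a[0], a[1], a[2]

fig, ax = plt.subplots()
xgrid = np.linspace(-1, 3, 100)
ax.plot(xgrid, f(xgrid), '-r', label='true')
ax.plot(xgrid, a[0] * xgrid ** 2 + a[1] * xgrid + a[2], '--g', label='fitted')
ax.plot(x, yy, 'ob', markersize=5)
ax.legend()
plt.show()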

Example: Fitting Piecewise Linear Functions#

Again consider data with only one feature. In addition, assume that the feature takes values in \([0,1]\). Then it might be a good idea to divide \([0,1]\) into \(\mu-1\) equally spaced subintervals for some fixed \(\mu\). On each subinterval we could do a linear regression, but require that the resulting lines connect at the interval boundaries.

rng = np.random.default_rng(0)

def hat(x):
    return np.maximum(0, 1 - np.abs(x))

def f(x):
    return 0.5 * hat(5*x) + 2 * hat(5*x-1) + hat(5*x-2) + 1.5 * hat(5*x-3) + 3 * hat(5*x-4)

fig, ax = plt.subplots()

xgrid = np.linspace(0, 1, 6)
ax.plot(xgrid, f(xgrid), '-r')

x = rng.uniform(0, 1, 20)
yy = f(x) + rng.normal(0, 0.3, 20)
ax.plot(x, yy, 'ob', markersize=5)

for k in range(0, 6):
    ax.plot([k/5, k/5], [0, 3.3], 'k:', lw=1)   # dotted lines at the interval boundaries

ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_title(r'$\mu=6$')

plt.show()
[Figure: piecewise linear function (red) and noisy samples (blue), $\mu=6$]

In our linear regression framework we could choose \(\varphi_1,\ldots,\varphi_\mu\) to be hat functions of width \(\frac{2}{\mu-1}\) centered at the interval boundaries \(\frac{\kappa-1}{\mu-1}\) for \(\kappa=1,\ldots,\mu\):

\[\begin{equation*} \varphi_\kappa(x)=\begin{cases} (\mu-1)\,x-\kappa+2,&\text{if }x\in\bigl[\frac{\kappa-2}{\mu-1},\frac{\kappa-1}{\mu-1}\bigr],\\ -(\mu-1)\,x+\kappa,&\text{if }x\in\bigl[\frac{\kappa-1}{\mu-1},\frac{\kappa}{\mu-1}\bigr],\\ 0,&\text{else}. \end{cases} \end{equation*}\]
def hat(x):
    return np.maximum(0, 1 - np.abs(x))

fig, ax = plt.subplots()

xgrid = np.array([-1, 0, 1])

for k in range(0, 6):
    ax.plot(xgrid + k, hat(xgrid))

ax.set_xlim(0, 5)
ax.set_ylim(0, 1.5)
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_title(r'$\mu=6$')

plt.show()
[Figure: the six hat functions for $\mu=6$]

The parameters \(a_1,\ldots,a_\mu\) describe the heights of the hats, that is, the function values of \(f_{\mathrm{approx}}\) at the interval boundaries.
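The corresponding least squares fit can be computed just like before. A minimal sketch, regenerating the noisy samples from above and using the hat basis with \(\mu=6\):

# hat and f as defined above
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 20)
yy = f(x) + rng.normal(0, 0.3, 20)

mu = 6
# design matrix with columns phi_1(x_l), ..., phi_mu(x_l)
B = np.stack([hat((mu - 1) * x - k) for k in range(mu)], axis=1)
a, *_ = np.linalg.lstsq(B, yy, rcond=None)   # least squares solution (equivalent to the normal equations)

fig, ax = plt.subplots()
xgrid = np.linspace(0, 1, 200)
Bgrid = np.stack([hat((mu - 1) * xgrid - k) for k in range(mu)], axis=1)
ax.plot(xgrid, np.matmul(Bgrid, a), '-g', label='piecewise linear fit')
ax.plot(x, yy, 'ob', markersize=5)
ax.legend()
plt.show()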

Example: RBF Regression#

Using several copies of a hat function and placing them at fixed grid points is a special case of RBF regression, where RBF abbreviates radial basis function. An RBF is a function that depends only on the length \(|x|\) of a feature vector \(x\) and not on the individual features. Thus, it is symmetric for one-dimensional feature vectors, cone-shaped for two-dimensional feature vectors, and so on. RBFs typically have their maximum at 0 (the center) and decay as the distance from the center increases.

Next to hat functions

\[\begin{equation*} \varphi(x)=\begin{cases}1-\frac{1}{c}\,|x|,&\text{if }|x|\leq c,\\0,&\text{else},\end{cases} \end{equation*}\]

where \(c>0\) controls the width of the hat (\(c=\frac{1}{\mu-1}\) in the previous example), Gaussian RBFs

\[\begin{equation*} \varphi(x)=\mathrm{e}^{-c\,|x|^2} \end{equation*}\]

are widely used. Again, \(c>0\) controls the width of the bell-shaped curve.

def hat(x):
    return np.maximum(0, 1 - np.abs(x))

def bell(x):
    return np.exp(-x**2)

fig, ax = plt.subplots()

xgrid = np.linspace(-2.5, 2.5, 101)
ax.plot(xgrid, hat(xgrid), '-r')
ax.plot(xgrid, bell(xgrid), '-b')

ax.set_ylim(0, 1.3)
ax.set_xticks([0])
ax.set_xticklabels([0])
ax.set_yticks([0, 1])
ax.set_yticklabels([0, 1])
ax.grid()
ax.set_title('1d RBFs')

plt.show()
[Figure: 1d RBFs, hat function (red) and Gaussian (blue)]

def hat(x, y):
    return np.maximum(0, 1 - np.sqrt(x**2 + y**2))

fig = go.Figure()
fig.layout.width = 800
fig.layout.height = 600

xcoords, ycoords = np.meshgrid(np.linspace(-2.5, 2.5, 51), np.linspace(-2.5, 2.5, 51))
zcoords = hat(xcoords, ycoords)
fig.add_trace(
    go.Surface(x=xcoords, y=ycoords, z=zcoords, showscale=False,
               colorscale=[[0, 'rgba(255, 0, 0, 1)'], [1, 'rgba(255, 0, 0, 1)']])
)

fig.show()
def bell(x, y):
    return np.exp(-x**2-y**2)

fig = go.Figure()
fig.layout.width = 800
fig.layout.height = 600

xcoords, ycoords = np.meshgrid(np.linspace(-2.5, 2.5, 51), np.linspace(-2.5, 2.5, 51))
zcoords = bell(xcoords, ycoords)+1
fig.add_trace(
    go.Surface(x=xcoords, y=ycoords, z=zcoords, showscale=False,
               colorscale=[[0, 'rgba(0, 0, 255, 1)'], [1, 'rgba(0, 0, 255, 1)']])
)

fig.show()

Given fixed grid points \(u_1,\ldots,u_\mu\) in the feature space we choose

\[\begin{equation*} \varphi_1(x)=\varphi(x-u_1),\quad\ldots,\quad\varphi_\mu(x)=\varphi(x-u_\mu). \end{equation*}\]
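As an illustration, here is a minimal RBF regression sketch for one feature with made-up data, Gaussian RBFs on an equally spaced grid of centers, and a hand-picked width parameter c:

rng = np.random.default_rng(0)
x_data = rng.uniform(0, 1, 30)                                  # made-up feature values
y_data = np.sin(2 * np.pi * x_data) + rng.normal(0, 0.1, 30)    # made-up noisy labels

mu = 8                               # number of RBFs
c = 30                               # width parameter
centers = np.linspace(0, 1, mu)      # grid points u_1, ..., u_mu

def gauss_rbf(x, u):
    return np.exp(-c * (x - u) ** 2)

# design matrix with columns phi_1(x_l), ..., phi_mu(x_l)
B = np.stack([gauss_rbf(x_data, u) for u in centers], axis=1)
a, *_ = np.linalg.lstsq(B, y_data, rcond=None)   # least squares solution

fig, ax = plt.subplots()
xgrid = np.linspace(0, 1, 200)
Bgrid = np.stack([gauss_rbf(xgrid, u) for u in centers], axis=1)
ax.plot(xgrid, np.matmul(Bgrid, a), '-g')
ax.plot(x_data, y_data, 'ob', markersize=5)
plt.show()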

Relation to Linear Regression with Linear Functions#

Linear regression with nonlinear functions can be reduced to linear regression with linear functions studied above. We simply have to transform all feature vectors by applying the functions \(\varphi_1,\ldots,\varphi_\mu\).

Given feature vectors \(x_1,\ldots,x_n\) with \(m\) features each we define new feature vectors \(\tilde{x}_1,\ldots,\tilde{x}_n\) with \(\mu\) features by

\[\begin{equation*} \tilde{x}_1^{(\kappa)}:=\varphi_\kappa(x_1),\quad\ldots,\quad\tilde{x}_n^{(\kappa)}:=\varphi_\kappa(x_n)\qquad\text{for }\kappa=1,\ldots,\mu. \end{equation*}\]

Now our hypothesis reads

\[\begin{equation*} f_{\mathrm{approx}}(\tilde{x})=\sum_{\kappa=1}^\mu a_\kappa\,\tilde{x}^{(\kappa)} \end{equation*}\]

with \(\tilde{x}\) from \(\mathbb{R}^\mu\).

Implementation with Scikit-Learn#

Scikit-Learn provides the module preprocessing for transforming data. This module contains the PolynomialFeatures class for creating transformer objects. The transformer object then provides the method fit_transform, which does the actual transformation.

Care has to be taken at one point: Scikit-Learn's polynomial feature transform adds a constant feature which is always one; it corresponds to \(x^0\). But the linear regression model adds such a constant term (the intercept) again, so we end up with too many parameters. From a mathematical point of view this is not a problem, but it may cause confusion when inspecting the coefficients. Either set include_bias to False when creating the transformer object or set fit_intercept to False when creating the LinearRegression object.
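Here is a brief sketch of the first variant with made-up toy data (the constant feature is dropped in the transformer and the intercept is handled by the regression object):

from sklearn.preprocessing import PolynomialFeatures
import sklearn.linear_model as lm

toy_X = [[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 4.0], [5.0, 7.0], [6.0, 6.0]]   # made-up data
toy_y = [1.0, 2.0, 2.0, 3.0, 4.0, 4.0]

toy_transformer = PolynomialFeatures(degree=2, include_bias=False)   # no constant feature
toy_regression = lm.LinearRegression()                               # fit_intercept=True by default

toy_regression.fit(toy_transformer.fit_transform(toy_X), toy_y)
print(toy_regression.intercept_, toy_regression.coef_)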

Again we use the used car price example with its two-dimensional feature space. Now we want to fit a second-degree polynomial to the data.

# X, y from above

from sklearn.preprocessing import PolynomialFeatures

transformer = PolynomialFeatures(degree=2)
transformed_X = transformer.fit_transform(X)

print('shape of original data:   ', X.shape)
print('shape of transformed data:', transformed_X.shape)
shape of original data:    (69, 2)
shape of transformed data: (69, 6)

Starting with two features \(x^{(1)}\) and \(x^{(2)}\), we now have six features:

\[\begin{equation*} \tilde{x}^{(1)}=1,\quad \tilde{x}^{(2)}=x^{(1)},\quad \tilde{x}^{(3)}=x^{(2)},\quad \tilde{x}^{(4)}=\bigl(x^{(1)}\bigr)^2,\quad \tilde{x}^{(5)}=x^{(1)}\,x^{(2)},\quad \tilde{x}^{(6)}=\bigl(x^{(2)}\bigr)^2. \end{equation*}\]
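If in doubt about the ordering, the fitted transformer can report the generated feature names (assuming a scikit-learn version that provides get_feature_names_out):

print(transformer.get_feature_names_out(['km', 'age']))
# names like '1', 'km', 'age', 'km^2', 'km age', 'age^2'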

From here on regression works as usual.

regression = lm.LinearRegression(fit_intercept=False)
regression.fit(transformed_X, y)

# predict price
age = 3.5
km = 40000
price = regression.predict(transformer.transform([[km, age]]))   # transformer is already fitted, so transform suffices
print(price)

fig = go.Figure()
fig.layout.width = 800
fig.layout.height = 600

fig.add_trace(go.Scatter3d(
    x=X[:, 0], y=X[:, 1], z=y,
    mode='markers',
    marker={'color': '#0000ff', 'size': 2}
))

xmin = 0
xmax = 250000
ymin = 0
ymax = 10
grid_size = 20

xcoords, ycoords = np.meshgrid(np.linspace(xmin, xmax, grid_size), np.linspace(ymin, ymax, grid_size))
xyfeatures = np.stack((xcoords.reshape(-1), ycoords.reshape(-1)), axis=1)
zcoords = regression.predict(transformer.transform(xyfeatures)).reshape(grid_size, grid_size)

fig.add_trace(
    go.Surface(x=xcoords, y=ycoords, z=zcoords, showscale=False,
               colorscale=[[0, 'rgba(255, 0, 0, 0.8)'], [1, 'rgba(255, 0, 0, 0.8)']])
)

fig.update_scenes(
    xaxis_title_text='km',
    yaxis_title_text='age',
    zaxis_title_text='price',
)

fig.show()

metrics.mean_squared_error(y, regression.predict(transformed_X), squared=False)
[569498.41773257]
88734.00776114204