Approach#
Linear regression is a classical mathematical method for constructing functions that are, in some sense, close to a given set of points. In Data Science linear regression does a pretty good job, even though it is relatively simple.
Linear Regression with Linear Functions#
The Principle#
We first consider the simplest regression model. The parameter-dependent hypothesis is of the form
\[f_{\mathrm{approx}}(x) = a_0 + a_1\,x^{(1)} + \cdots + a_m\,x^{(m)},\]
where \(x=\bigl(x^{(1)},\ldots,x^{(m)}\bigr)\) is a feature vector and \(a_0,a_1,\ldots,a_m\) are the parameters of the hypothesis \(f_{\mathrm{approx}}\).
Given \(n\) training samples \((x_1,y_1),\ldots,(x_n,y_n)\) we want to choose \(a_0,a_1,\ldots,a_m\) such that
\[f_{\mathrm{approx}}(x_l) \approx y_l \qquad\text{for } l=1,\ldots,n.\]
To achieve this we solve the minimization problem
\[\min_{a_0,\ldots,a_m} \frac{1}{n}\sum_{l=1}^{n}\bigl(f_{\mathrm{approx}}(x_l)-y_l\bigr)^2.\]
The function \(g\) defined by
\[g(u,v) := \frac{1}{n}\sum_{l=1}^{n}(u_l-v_l)^2\]
is a so-called loss function expressing the distance between two vectors \(u=(u_1,\ldots,u_n)\) and \(v=(v_1,\ldots,v_n)\). There are several other loss functions in machine learning, like mean absolute error and Huber loss, see Quality Measures. For the moment we content ourselves with the simplest one, the mean squared error.
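As a quick illustration, the mean squared error can be computed in a few lines of NumPy. This is a minimal sketch; the helper name mse is ours and not part of any library.

import numpy as np

def mse(u, v):
    # mean squared error between two vectors of equal length
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.mean((u - v) ** 2)

print(mse([1, 2, 3], [1.1, 1.9, 3.2]))   # approximately 0.02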
Note that we cannot expect \(f_{\mathrm{approx}}(x_l)=y_l\), because our model, that is, our assumption of a linear function, is likely to be too simplistic. In addition, observations \(y_l\) in regression problems often are corrupted by noise.
Solving the Minimization Problem Analytically#
The minimization problem
\[\min_{a_0,\ldots,a_m} h(a_0,\ldots,a_m) \qquad\text{with}\qquad h(a_0,\ldots,a_m) := \frac{1}{n}\sum_{l=1}^{n}\Bigl(a_0+\sum_{i=1}^{m} a_i\,x_l^{(i)}-y_l\Bigr)^2\]
is known to have only global minima (no other stationary points), which can be found via differential calculus. We simply have to find \(a_0,\ldots,a_m\) where the gradient of \(h\) is the zero vector. To keep formulas as simple as possible we introduce an additional artificial zeroth feature \(x_l^{(0)}:=1\) for \(l=1,\ldots,n\). This allows us to write
\[h(a_0,\ldots,a_m) = \frac{1}{n}\sum_{l=1}^{n}\Bigl(\sum_{i=0}^{m} a_i\,x_l^{(i)}-y_l\Bigr)^2.\]
Taking the derivative with respect to \(a_i\) for \(i=0,1,\ldots,m\) we obtain
\[\frac{\partial h}{\partial a_i}(a_0,\ldots,a_m) = \frac{2}{n}\sum_{l=1}^{n}\Bigl(\sum_{k=0}^{m} a_k\,x_l^{(k)}-y_l\Bigr)\,x_l^{(i)}.\]
Thus, the gradient is zero if and only if
\[\sum_{l=1}^{n}\Bigl(\sum_{k=0}^{m} a_k\,x_l^{(k)}\Bigr)\,x_l^{(i)} = \sum_{l=1}^{n} y_l\,x_l^{(i)} \qquad\text{for } i=0,1,\ldots,m.\]
This is a system of linear equations for the unknowns \(a_0,a_1,\ldots,a_m\). Sometimes these equations are called the normal equations of the minimization problem. Taking the feature vectors \(x_1,\ldots,x_n\) as rows of a matrix \(B\in\mathbb{R}^{n\times(m+1)}\) (including the artificial zeroth feature \(1\)), that is,
\[B=\begin{pmatrix}1 & x_1^{(1)} & \cdots & x_1^{(m)}\\ \vdots & \vdots & & \vdots\\ 1 & x_n^{(1)} & \cdots & x_n^{(m)}\end{pmatrix},\]
we see that the gradient of \(h\) is zero if and only if
\[B^{\mathrm{T}}\,B\,a = B^{\mathrm{T}}\,y.\]
Here \(a\) and \(y\) are the vectors of the unknowns and of the labels, respectively. Note that the system matrix \(B^{\mathrm{T}}\,B\) is a square matrix of size \(m+1\) with \(m\) being the number of features in our data.
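Before applying this to real data, here is a minimal sketch (synthetic data, our own variable names) of setting up \(B\) and solving the normal equations \(B^{\mathrm{T}}\,B\,a=B^{\mathrm{T}}\,y\) with NumPy:

import numpy as np

rng = np.random.default_rng(0)

# synthetic data with one feature: y is roughly 2 + 3*x plus noise
x = rng.uniform(0, 1, 50)
y = 2 + 3 * x + rng.normal(0, 0.1, 50)

# matrix B with the artificial zeroth feature (column of ones) and the feature column
B = np.column_stack((np.ones_like(x), x))

# solve the normal equations
a = np.linalg.solve(B.T @ B, B.T @ y)
print(a)   # close to [2, 3]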
Example: Used Cars Prices#
We look at a data set containing information about used cars offered on the online platform CarDekho. The data set contains age, selling price, kilometers driven and several other parameters of about 4000 cars.
The dataset has been obtained via Kaggle and is licensed under the Database Contents License (DbCL) v1.0, including the Open Database License (ODbL) v1.0.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
cars = pd.read_csv('cars.csv')
cars
| | name | year | selling_price | km_driven | fuel | seller_type | transmission | owner |
---|---|---|---|---|---|---|---|---|
0 | Maruti 800 AC | 2007 | 60000 | 70000 | Petrol | Individual | Manual | First Owner |
1 | Maruti Wagon R LXI Minor | 2007 | 135000 | 50000 | Petrol | Individual | Manual | First Owner |
2 | Hyundai Verna 1.6 SX | 2012 | 600000 | 100000 | Diesel | Individual | Manual | First Owner |
3 | Datsun RediGO T Option | 2017 | 250000 | 46000 | Petrol | Individual | Manual | First Owner |
4 | Honda Amaze VX i-DTEC | 2014 | 450000 | 141000 | Diesel | Individual | Manual | Second Owner |
... | ... | ... | ... | ... | ... | ... | ... | ... |
4335 | Hyundai i20 Magna 1.4 CRDi (Diesel) | 2014 | 409999 | 80000 | Diesel | Individual | Manual | Second Owner |
4336 | Hyundai i20 Magna 1.4 CRDi | 2014 | 409999 | 80000 | Diesel | Individual | Manual | Second Owner |
4337 | Maruti 800 AC BSIII | 2009 | 110000 | 83000 | Petrol | Individual | Manual | Second Owner |
4338 | Hyundai Creta 1.6 CRDi SX Option | 2016 | 865000 | 90000 | Diesel | Individual | Manual | First Owner |
4339 | Renault KWID RXT | 2016 | 225000 | 40000 | Petrol | Individual | Manual | First Owner |
4340 rows × 8 columns
We concentrate on one car model. Let’s take the most frequent one.
grouped_by_name = cars.groupby('name')
name = grouped_by_name.size().idxmax()
cars_subset = grouped_by_name.get_group(name)
cars_subset
| | name | year | selling_price | km_driven | fuel | seller_type | transmission | owner |
---|---|---|---|---|---|---|---|---|
117 | Maruti Swift Dzire VDI | 2014 | 434000 | 79350 | Diesel | Individual | Manual | Second Owner |
327 | Maruti Swift Dzire VDI | 2018 | 800000 | 25000 | Diesel | Individual | Manual | First Owner |
338 | Maruti Swift Dzire VDI | 2015 | 490000 | 60000 | Diesel | Individual | Manual | Second Owner |
365 | Maruti Swift Dzire VDI | 2014 | 400000 | 90000 | Diesel | Individual | Manual | First Owner |
409 | Maruti Swift Dzire VDI | 2012 | 215000 | 80000 | Diesel | Individual | Manual | Third Owner |
... | ... | ... | ... | ... | ... | ... | ... | ... |
4189 | Maruti Swift Dzire VDI | 2013 | 425000 | 61083 | Diesel | Dealer | Manual | First Owner |
4245 | Maruti Swift Dzire VDI | 2014 | 480000 | 101000 | Diesel | Dealer | Manual | First Owner |
4265 | Maruti Swift Dzire VDI | 2014 | 480000 | 101000 | Diesel | Dealer | Manual | First Owner |
4289 | Maruti Swift Dzire VDI | 2019 | 680000 | 40000 | Diesel | Individual | Manual | First Owner |
4314 | Maruti Swift Dzire VDI | 2015 | 470000 | 170000 | Diesel | Individual | Manual | First Owner |
69 rows × 8 columns
Before we start any computation we should have a look at the data.
fig, ax = plt.subplots()
ax.scatter(cars_subset['km_driven'].to_numpy(), cars_subset['selling_price'].to_numpy(), s=10, c='b')
ax.set_xlabel('km')
ax.set_ylabel('price')
plt.show()
This does not look like a linear dependence between price and kilometers driven. There seems to be no systematic dependence between the two quantities at all. But intuition suggests that there should be one. Thus, we should take other features into account. Isn't it possible that the dependence between price and kilometers driven varies with the age of the car?
Let’s visualize different ages.
fig, ax = plt.subplots()
ax.scatter(cars_subset['km_driven'].to_numpy(), cars_subset['selling_price'].to_numpy(),
s=10, c=2020 - cars_subset['year'].to_numpy(), cmap='jet')
ax.set_xlabel('km')
ax.set_ylabel('price')
plt.show()
The blue region shows some linear structure, and so does the reddish region. Thus, age should be taken into account when predicting prices. Now let's do two regressions: one for young cars and one for old ones.
from matplotlib import colors
def simple_linear_regression(X, y):
    # design matrix with a column of ones (the artificial zeroth feature) and the feature column
    B = np.ones((X.size, 2))
    B[:, 1] = X
    # solve the normal equations B^T B a = B^T y
    return np.linalg.solve(np.matmul(B.T, B), np.matmul(B.T, y))
year = 2014 # where to split the data set into young and old
mask_young = cars_subset['year'] > year
X_young = cars_subset.loc[mask_young, 'km_driven'].to_numpy()
y_young = cars_subset.loc[mask_young, 'selling_price'].to_numpy()
mask_old = cars_subset['year'] <= year
X_old = cars_subset.loc[mask_old, 'km_driven'].to_numpy()
y_old = cars_subset.loc[mask_old, 'selling_price'].to_numpy()
a_young = simple_linear_regression(X_young, y_young)
a_old = simple_linear_regression(X_old, y_old)
print(a_young)
print(a_old)
fig, ax = plt.subplots()
ax.scatter(cars_subset['km_driven'].to_numpy(), cars_subset['selling_price'].to_numpy(),
s=10, c=(mask_young.to_numpy()), cmap=colors.ListedColormap(['r', 'b']))
xmin = 0
xmax = 260000
ax.plot([xmin, xmax], [a_young[0] + a_young[1] * xmin, a_young[0] + a_young[1] * xmax], '-b', label='young')
ax.plot([xmin, xmax], [a_old[0] + a_old[1] * xmin, a_old[0] + a_old[1] * xmax], '-r', label='old')
ax.set_xlabel('km')
ax.set_ylabel('price')
ax.legend()
plt.show()
[ 6.74201542e+05 -1.68902485e+00]
[3.64966780e+05 1.81825528e-01]
Given the kilometers driven, the simple formula
\[y = 674202 - 1.689\,x\]
yields a prediction for the price we could get for a young car, where \(x\) is the kilometers driven and \(y\) is the price. For old cars we have the model
\[y = 364967 + 0.182\,x.\]
Note that this holds only for the selected car model. In addition, we only considered very simple hypotheses with only two parameters.
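For instance, with the coefficient vectors a_young and a_old computed above we can compare both predictions for a car with, say, 60000 km on the clock (a small sketch, not part of the original analysis):

km = 60000
print('young car:', a_young[0] + a_young[1] * km)
print('old car:  ', a_old[0] + a_old[1] * km)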
Instead of doing two one-dimensional regressions for young and old cars, we could do a two-dimensional regression including both features, km driven and age.
At first we look at the data.
import plotly.graph_objects as go
xcoords = cars_subset['km_driven'].to_numpy()
ycoords = 2020 - cars_subset['year'].to_numpy()
zcoords = cars_subset['selling_price'].to_numpy()
fig = go.Figure()
fig.layout.width = 800
fig.layout.height = 600
fig.add_trace(go.Scatter3d(
x=xcoords, y=ycoords, z=zcoords,
mode='markers',
marker={'color': '#0000ff', 'size': 2}
))
fig.update_scenes(
xaxis_title_text='km',
yaxis_title_text='age',
zaxis_title_text='price',
)
fig.show()
Then we do the regression.
X = np.stack((cars_subset['km_driven'].to_numpy(),
2020 - cars_subset['year'].to_numpy()), axis=1)
y = cars_subset['selling_price'].to_numpy()
B = np.ones((X.shape[0], 3))
B[:, 1:] = X
a = np.linalg.solve(np.matmul(B.T, B), np.matmul(B.T, y))
print(a)
fig = go.Figure()
fig.layout.width = 800
fig.layout.height = 600
fig.add_trace(go.Scatter3d(
x=X[:, 0], y=X[:, 1], z=y,
mode='markers',
marker={'color': '#0000ff', 'size': 2}
))
xmin = 0
xmax = 250000
ymin = 0
ymax = 10
xcoords, ycoords = np.meshgrid([xmin, xmax], [ymin, ymax])
zcoords = a[0] + a[1] * xcoords + a[2] * ycoords
fig.add_trace(
go.Surface(x=xcoords, y=ycoords, z=zcoords, showscale=False,
colorscale=[[0, 'rgba(255, 0, 0, 0.8)'], [1, 'rgba(255, 0, 0, 0.8)']])
)
fig.update_scenes(
xaxis_title_text='km',
yaxis_title_text='age',
zaxis_title_text='price',
)
fig.show()
[ 7.69448384e+05 -2.07046783e-01 -5.47861786e+04]
Now our model for the price \(y\) depending on kilometers \(x_1\) and age \(x_2\) is
\[y = 769448 - 0.207\,x_1 - 54786\,x_2.\]
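With the coefficient vector a from above, a prediction can be computed by hand (a small sketch; below we will see how Scikit-Learn does the same):

km = 40000
age = 3.5
print(a[0] + a[1] * km + a[2] * age)   # predicted price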
Limitations#
Linear regression with linear functions yields inaccurate results if the data set does not show a linear structure. Outliers, too, may distort the results. Thus, linear regression has to be used with care, and a preceding visual analysis is mandatory.
Have a look at Anscombe's quartet for very different data sets all yielding the same result if linear regression with linear functions is used.
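To see how strongly a single outlier can distort the fit, here is a small sketch with synthetic data (our own variable names), reusing the simple_linear_regression helper from above:

rng = np.random.default_rng(1)
xs = rng.uniform(0, 1, 30)
ys = 1 + 2 * xs + rng.normal(0, 0.05, 30)
print(simple_linear_regression(xs, ys))            # close to [1, 2]

# add one extreme outlier and fit again
xs_out = np.append(xs, 0.9)
ys_out = np.append(ys, 20)
print(simple_linear_regression(xs_out, ys_out))    # intercept and slope change noticeably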
Linear Regression with Seaborn#
Seaborn offers pairplot for getting a first overview of the mutual dependence of all variables. See Plotting pairwise data relationships in the Seaborn documentation for more details.
import seaborn as sns
sns.pairplot(cars_subset)
plt.show()
Seaborn also provides linear regression directly. See Estimating regression fits for more details.
sns.regplot(x='year', y='selling_price', data=cars_subset)
<Axes: xlabel='year', ylabel='selling_price'>
Linear Regression with Scikit-Learn#
Linear regression is implemented by almost all machine learning libraries. Scikit-Learn is a good choice due to its simple API. It offers a module linear_model.
We reimplement linear regression with two features for the above example of used car prices.
The workflow is as follows:
1. Create a LinearRegression object.
2. Fit the object to the data (that is, do the regression).
3. Extract the coefficients from the object or use the object for prediction.
# X, y from above
import sklearn.linear_model as lm
regression = lm.LinearRegression()
regression.fit(X, y)
print(regression.intercept_) # a[0]
print(regression.coef_) # a[1], a[2]
769448.3841336307
[-2.07046783e-01 -5.47861786e+04]
From Scikit-Learn we obtain the same coefficients as above. To evaluate the regression function, call predict.
age = 3.5
km = 40000
price = regression.predict([[km, age]])
print(price)
[569414.88787459]
For comparison with more complex models below we calculate (root) mean squared error on the training set.
import sklearn.metrics as metrics
metrics.mean_squared_error(y, regression.predict(X), squared=False)
89976.37031916517
Linear Regression with Nonlinear Functions#
Idea#
We consider more general parameter-dependent hypotheses of the form
\[f_{\mathrm{approx}}(x) = a_1\,\varphi_1(x) + a_2\,\varphi_2(x) + \cdots + a_\mu\,\varphi_\mu(x),\]
where \(x=\bigl(x^{(1)},\ldots,x^{(m)}\bigr)\) is a feature vector, \(a_1,\ldots,a_\mu\) are the parameters of the hypothesis \(f_{\mathrm{approx}}\), and \(\varphi_1,\ldots,\varphi_\mu\) are prescribed real-valued functions on \(\mathbb{R}^m\).
Note that \(f_{\mathrm{approx}}\) depends linearly on the parameters \(a_1,\ldots,a_\mu\); hence the name linear regression. In contrast, in nonlinear regression we would consider hypotheses that contain the sought parameters in a nonlinear fashion. This is the most general case, but it is rarely needed in practice.
Example: Fitting a Parabola#
Consider data with only one feature, that is, \(m=1\). We would like to fit a parabola to such a data set.
rng = np.random.default_rng(0)
def f(x):
return (x - 1) ** 2 + 1
fig, ax = plt.subplots()
xgrid = np.linspace(-1, 3, 100)
ax.plot(xgrid, f(xgrid), '-r')
x = rng.uniform(-0.5, 2.7, 20)
yy = f(x) + rng.normal(0, 0.3, 20)
ax.plot(x, yy, 'ob', markersize=5)
ax.set_xticklabels([])
ax.set_yticklabels([])
plt.show()
In mathematical notation our hypothesis shall have the form
\[f_{\mathrm{approx}}(x) = a_1 + a_2\,x + a_3\,x^2,\]
where we replaced \(x^{(1)}\) by \(x\) since we only have one feature. Thus, fitting a parabola is a special case of our general linear regression approach with \(\mu=3\) and
\[\varphi_1(x)=1,\qquad \varphi_2(x)=x,\qquad \varphi_3(x)=x^2.\]
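Under these assumptions the fit can be carried out directly via the normal equations. A minimal sketch, reusing x, yy and f from the cell above:

# design matrix with columns phi_1 = 1, phi_2 = x, phi_3 = x^2
B = np.column_stack((np.ones_like(x), x, x ** 2))
a = np.linalg.solve(B.T @ B, B.T @ yy)
print(a)   # roughly [2, -2, 1], since (x - 1)^2 + 1 = 2 - 2x + x^2

fig, ax = plt.subplots()
xgrid = np.linspace(-1, 3, 100)
ax.plot(xgrid, f(xgrid), '-r', label='true')
ax.plot(xgrid, a[0] + a[1] * xgrid + a[2] * xgrid ** 2, '--g', label='fitted')
ax.plot(x, yy, 'ob', markersize=5)
ax.legend()
plt.show()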
Example: Fitting Piecewise Linear Functions#
Again consider data with only one feature. In addition, assume that the feature takes values in \([0,1]\). Then it might be a good idea to divide \([0,1]\) into \(\mu-1\) equally spaced subintervals for some fixed \(\mu\). On each subinterval we could do a linear regression, but require that the resulting lines are connected at the interval boundaries.
rng = np.random.default_rng(0)
def hat(x):
return np.maximum(0, 1 - np.abs(x))
def f(x):
return 0.5 * hat(5*x) + 2 * hat(5*x-1) + hat(5*x-2) + 1.5 * hat(5*x-3) + 3 * hat(5*x-4)
fig, ax = plt.subplots()
xgrid = np.linspace(0, 1, 6)
ax.plot(xgrid, f(xgrid), '-r')
x = rng.uniform(0, 1, 20)
yy = f(x) + rng.normal(0, 0.3, 20)
ax.plot(x, yy, 'ob', markersize=5)
for x in range(0, 6):
ax.plot([x/5, x/5], [0, 3.3], 'k:', lw=1)
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_title(r'$\mu=6$')
plt.show()
In our linear regression framework we could choose \(\varphi_1,\ldots,\varphi_\mu\) to be hat functions of width \(\frac{2}{\mu-1}\) centered at the interval boundaries \(\frac{\kappa-1}{\mu-1}\) for \(\kappa=1,\ldots,\mu\):
\[\varphi_\kappa(x) = \max\Bigl(0,\,1-(\mu-1)\,\Bigl|x-\frac{\kappa-1}{\mu-1}\Bigr|\Bigr).\]
def hat(x):
return np.maximum(0, 1 - np.abs(x))
fig, ax = plt.subplots()
xgrid = np.array([-1, 0, 1])
for x in range(0, 6):
ax.plot(xgrid + x, hat(xgrid))
ax.set_xlim(0, 5)
ax.set_ylim(0, 1.5)
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_title(r'$\mu=6$')
plt.show()
The parameters \(a_1,\ldots,a_\mu\) describe the heights of the hats, that is, the function values of \(f_{\mathrm{approx}}\) at the interval boundaries.
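Putting the pieces together, here is a sketch of the full piecewise-linear fit via the normal equations. We regenerate noisy samples of f with our own variable names and use somewhat more samples (50 instead of 20) so that every hat function sees some data:

rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, 50)
ys = f(xs) + rng.normal(0, 0.3, 50)

mu = 6
# design matrix: columns contain the hat functions phi_1, ..., phi_mu evaluated at the samples
B = np.column_stack([hat((mu - 1) * xs - kappa) for kappa in range(mu)])
a = np.linalg.solve(B.T @ B, B.T @ ys)

fig, ax = plt.subplots()
xgrid = np.linspace(0, 1, 200)
Bgrid = np.column_stack([hat((mu - 1) * xgrid - kappa) for kappa in range(mu)])
ax.plot(xgrid, Bgrid @ a, '-g', label='piecewise linear fit')
ax.plot(xs, ys, 'ob', markersize=5)
ax.legend()
plt.show()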
Example: RBF Regression#
Using several copies of a hat function and placing them at fixed grid points is a special case of RBF regression. Here RBF is the abbreviation of radial basis function. An RBF is a function that depends only on the length \(|x|\) of a feature vector \(x\) and not on the individual features. Thus, it is symmetric for one-dimensional feature vectors, cone shaped for two-dimensional feature vectors, and so on. RBFs typically have their maximum at 0 (the center) and then decay when distance from the center increases.
Next to hat functions
\[\varphi(x) = \max\Bigl(0,\,1-\frac{|x|}{c}\Bigr),\]
where \(c>0\) controls the width of the hat (\(c=\frac{1}{\mu-1}\) in the previous example), Gaussian RBFs
\[\varphi(x) = \mathrm{e}^{-\frac{|x|^2}{c^2}}\]
are widely used. Again, \(c>0\) controls the width of the bell-shaped curve.
def hat(x):
return np.maximum(0, 1 - np.abs(x))
def bell(x):
return np.exp(-x**2)
fig, ax = plt.subplots()
xgrid = np.linspace(-2.5, 2.5, 101)
ax.plot(xgrid, hat(xgrid), '-r')
ax.plot(xgrid, bell(xgrid), '-b')
ax.set_ylim(0, 1.3)
ax.set_xticks([0])
ax.set_xticklabels([0])
ax.set_yticks([0, 1])
ax.set_yticklabels([0, 1])
ax.grid()
ax.set_title('1d RBFs')
plt.show()
def hat(x, y):
return np.maximum(0, 1 - np.sqrt(x**2 + y**2))
fig = go.Figure()
fig.layout.width = 800
fig.layout.height = 600
xcoords, ycoords = np.meshgrid(np.linspace(-2.5, 2.5, 51), np.linspace(-2.5, 2.5, 51))
zcoords = hat(xcoords, ycoords)
fig.add_trace(
go.Surface(x=xcoords, y=ycoords, z=zcoords, showscale=False,
colorscale=[[0, 'rgba(255, 0, 0, 1)'], [1, 'rgba(255, 0, 0, 1)']])
)
fig.show()
def bell(x, y):
return np.exp(-x**2-y**2)
fig = go.Figure()
fig.layout.width = 800
fig.layout.height = 600
xcoords, ycoords = np.meshgrid(np.linspace(-2.5, 2.5, 51), np.linspace(-2.5, 2.5, 51))
zcoords = bell(xcoords, ycoords)
fig.add_trace(
go.Surface(x=xcoords, y=ycoords, z=zcoords, showscale=False,
colorscale=[[0, 'rgba(0, 0, 255, 1)'], [1, 'rgba(0, 0, 255, 1)']])
)
fig.show()
Given fixed grid points \(u_1,\ldots,u_\mu\) in the feature space we choose
\[\varphi_\kappa(x) := \varphi\bigl(x-u_\kappa\bigr),\qquad \kappa=1,\ldots,\mu,\]
where \(\varphi\) is one of the RBFs above.
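As a small sketch (one feature, with our own choice of grid points, width c and synthetic data), RBF regression with Gaussian RBFs again boils down to building a design matrix and solving the normal equations:

rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, 50)
ys = np.sin(2 * np.pi * xs) + rng.normal(0, 0.1, 50)

u = np.linspace(0, 1, 8)    # grid points u_1, ..., u_mu
c = 0.2                     # width of the Gaussian RBFs

def rbf_matrix(x):
    # column kappa contains the Gaussian RBF centered at u[kappa], evaluated at x
    return np.exp(-((x[:, None] - u[None, :]) / c) ** 2)

B = rbf_matrix(xs)
a = np.linalg.solve(B.T @ B, B.T @ ys)

fig, ax = plt.subplots()
xgrid = np.linspace(0, 1, 200)
ax.plot(xgrid, rbf_matrix(xgrid) @ a, '-g', label='RBF fit')
ax.plot(xs, ys, 'ob', markersize=5)
ax.legend()
plt.show()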
Relation to Linear Regression with Linear Functions#
Linear regression with nonlinear functions can be reduced to linear regression with linear functions studied above. We simply have to transform all feature vectors by applying the functions \(\varphi_1,\ldots,\varphi_\mu\).
Given feature vectors \(x_1,\ldots,x_n\) with \(m\) features each we define new feature vectors \(\tilde{x}_1,\ldots,\tilde{x}_n\) with \(\mu\) features by
\[\tilde{x}_l := \bigl(\varphi_1(x_l),\ldots,\varphi_\mu(x_l)\bigr),\qquad l=1,\ldots,n.\]
Now our hypothesis reads
\[f_{\mathrm{approx}}(\tilde{x}) = a_1\,\tilde{x}^{(1)} + \cdots + a_\mu\,\tilde{x}^{(\mu)}\]
with \(\tilde{x}\) from \(\mathbb{R}^\mu\).
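In code the reduction is nothing but a feature transform. A generic sketch (the list phis of basis functions and the helper transform are our own hypothetical names, here with the parabola basis 1, x, x² for a single feature):

# basis functions phi_1, ..., phi_mu
phis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]

def transform(x):
    # map a 1d feature array to the transformed features (phi_1(x), ..., phi_mu(x))
    return np.column_stack([phi(x) for phi in phis])

x_demo = np.linspace(0, 1, 10)
y_demo = 3 - 2 * x_demo + x_demo ** 2
B = transform(x_demo)
print(np.linalg.solve(B.T @ B, B.T @ y_demo))   # recovers [3, -2, 1]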
Implementation with Scikit-Learn#
Scikit-Learn provides a module for preprocessing data. This module contains a PolynomialFeatures class for creating transformer objects. The transformer object then provides a method fit_transform doing the actual transformation.
Care has to be taken at the following point: Scikit-Learn's polynomial feature transform adds a feature which is always one. This feature corresponds to \(x^0\). But the linear regression model adds such a constant feature again, so we end up with too many parameters. From the mathematical point of view this is not a problem, but it might cause confusion when the coefficients are inspected for some reason. Either set include_bias to False when creating the transformer object, or set fit_intercept to False when creating the LinearRegression object.
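For instance, the first option could look like this (a small sketch with our own variable names, reusing X and y from the car example above):

from sklearn.preprocessing import PolynomialFeatures
import sklearn.linear_model as lm

# constant feature omitted here; LinearRegression adds the intercept itself
transformer_no_bias = PolynomialFeatures(degree=2, include_bias=False)
regression_alt = lm.LinearRegression()   # fit_intercept=True is the default
regression_alt.fit(transformer_no_bias.fit_transform(X), y)
print(regression_alt.intercept_, regression_alt.coef_)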
Again we use the used car price example with the two-dimensional feature space. Now we want to fit a second-order polynomial to the data.
# X, y from above
from sklearn.preprocessing import PolynomialFeatures
transformer = PolynomialFeatures(degree=2)
transformed_X = transformer.fit_transform(X)
print('shape of original data: ', X.shape)
print('shape of transformed data:', transformed_X.shape)
shape of original data: (69, 2)
shape of transformed data: (69, 6)
Starting with two features \(x^{(1)}\) and \(x^{(2)}\), we now have six features:
\[1,\quad x^{(1)},\quad x^{(2)},\quad \bigl(x^{(1)}\bigr)^2,\quad x^{(1)}x^{(2)},\quad \bigl(x^{(2)}\bigr)^2.\]
From here on regression works as usual.
regression = lm.LinearRegression(fit_intercept=False)
regression.fit(transformed_X, y)
# predict price
age = 3.5
km = 40000
price = regression.predict(transformer.fit_transform([[km, age]]))
print(price)
fig = go.Figure()
fig.layout.width = 800
fig.layout.height = 600
fig.add_trace(go.Scatter3d(
x=X[:, 0], y=X[:, 1], z=y,
mode='markers',
marker={'color': '#0000ff', 'size': 2}
))
xmin = 0
xmax = 250000
ymin = 0
ymax = 10
grid_size = 20
xcoords, ycoords = np.meshgrid(np.linspace(xmin, xmax, grid_size), np.linspace(ymin, ymax, grid_size))
xyfeatures = np.stack((xcoords.reshape(-1), ycoords.reshape(-1)), axis=1)
zcoords = regression.predict(transformer.fit_transform(xyfeatures)).reshape(grid_size, grid_size)
fig.add_trace(
go.Surface(x=xcoords, y=ycoords, z=zcoords, showscale=False,
colorscale=[[0, 'rgba(255, 0, 0, 0.8)'], [1, 'rgba(255, 0, 0, 0.8)']])
)
fig.update_scenes(
xaxis_title_text='km',
yaxis_title_text='age',
zaxis_title_text='price',
)
fig.show()
metrics.mean_squared_error(y, regression.predict(transformed_X), squared=False)
[569498.41773257]
88734.00776114204