Worked Example: House Prices II#

We try to improve the prediction of house prices based on Erdogan Seref's (unreachable in 2023) German housing dataset from www.immobilienscout24.de, published at www.kaggle.com (unreachable in 2023) under an Attribution-NonCommercial-ShareAlike 4.0 International License.

We load preprocessed data and adjust data types.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn.linear_model as linear_model
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.preprocessing as preprocessing
import sklearn.pipeline as pipeline

data_path = 'german_housing_preprocessed.csv'
regions_path = 'regions.csv'
data = pd.read_csv(data_path, index_col=0)

data['Type'] = data['Type'].astype('category')
data['Condition'] = data['Condition'].astype('category')
data['Garage_type'] = data['Garage_type'].astype('category')

data['Condition'] = data['Condition'].cat.reorder_categories([
    'first occupation',
    'first occupation after refurbishment',
    'maintained',
    'renovated',
    'modernized',
    'refurbished',
    'fixer-upper',
    'dilapidated'
])
data.head()
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition State City Place Garages Garage_type
0 13.118355 Multiple dwelling 4.663439 5.433722 5.5 1.0 2.0 2.772589 2.772589 modernized Baden-Württemberg Bodenseekreis Bermatingen 2.0 Outside parking lot
2 13.526494 Farmhouse 5.093075 4.406719 5.0 2.0 4.0 2.079442 2.079442 dilapidated Baden-Württemberg Esslingen (Kreis) Ostfildern 1.0 Garage
3 12.464583 Farmhouse 4.941642 6.701960 4.0 2.0 2.0 4.795791 3.044522 fixer-upper Baden-Württemberg Waldshut (Kreis) Bonndorf im Schwarzwald 1.0 Garage
8 12.804909 Duplex 5.424950 6.880384 10.0 4.0 2.0 5.356586 1.791759 modernized Baden-Württemberg Enzkreis Neuenbürg 8.0 Outside parking lot
10 14.375126 Mid-terrace house 5.347108 7.286192 6.0 2.0 3.0 4.406719 1.945910 modernized Baden-Württemberg Stuttgart Schönberg 2.0 Garage

More Data#

Results obtained from linear regression showed that the input variables do not suffice to explain the targets. Thus, we should add more input variables. When preprocessing the data we dropped several columns. Keeping them could have increased prediction quality slightly, but there were good reasons to drop them, the main one being the large number of missing values in those columns.

A far better idea is to collect additional data. What features of a house influence the selling price? Of course, its location! Up to now we have not used location information at all, although it is available in the columns State, City, and Place. But city names alone do not help. We need something like proximity to big cities or attractive landscape. Adding a layer of abstraction, we might ask about the demand for houses and the wealth of potential buyers. So we should look for statistical information about local real estate markets and about the economic power of different regions in Germany.

Everything we need is publicly available at www.regionalstatistik.de, provided by the Statistische Ämter des Bundes und der Länder under the license Datenlizenz Deutschland – Namensnennung – Version 2.0. Browsing the site we find two interesting tables: one with average incomes and one with average land prices for each region.

From those tables we compile a table with four columns: region id, region name, income, and land prices.
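A minimal sketch of this compilation step could look as follows; the downloaded file names and their column names are assumptions made for illustration, only the structure of the resulting regions.csv corresponds to the file used below.

import pandas as pd

# hypothetical file and column names for the two downloaded tables
income = pd.read_csv('income_per_region.csv')       # assumed columns: id, region, income
prices = pd.read_csv('land_prices_per_region.csv')  # assumed columns: id, prices

# join on the region id and keep the four columns needed later
regions = income.merge(prices, on='id', how='left')
regions = regions[['id', 'region', 'income', 'prices']]
regions.to_csv('regions.csv', index=False)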

The difficult part is matching the region names in the German housing dataset with the region names in the regions table. Here is some code doing the job:

# remove rows with missing location information
data = data.dropna(subset=['State', 'City'])

# reindex to remove gaps in the index
data.index = pd.RangeIndex(0, len(data))
regions = pd.read_csv(regions_path)

regions.head(5)
id region income prices
0 0 Deutschland 22623 137.67
1 1 Schleswig-Holstein 22864 85.30
2 1001 Flensburg, Kreisfreie Stadt 19296 85.30
3 1002 Kiel, Landeshauptstadt, Kreisfreie Stadt 19263 85.30
4 1003 Lübeck, Hansestadt, Kreisfreie Stadt 20363 110.84
# strip the ' (Kreis)' suffix so that city names can be matched against region names
data['city_short'] = data['City'].str.replace(' (Kreis)', '', regex=False)
# default region index 0 points at the 'Deutschland' row (national averages)
data['region_idx'] = 0

for (idx, city_short) in enumerate(data['city_short']):
    find_results = regions['region'].str.find(str(city_short))
    if not (find_results > -1).any():
        #print(city_short)
        if data.loc[idx, 'State'] == 'Hamburg':
            find_results = regions['region'].str.find('Hamburg')
            data.loc[idx, 'region_idx'] = regions.index[find_results > -1][-1]
        elif data.loc[idx, 'State'] == 'Bremen':
            find_results = regions['region'].str.find('Bremen')
            data.loc[idx, 'region_idx'] = regions.index[find_results > -1][-1]
        elif data.loc[idx, 'State'] == 'Berlin':
            district = city_short.split('(')[-1][0:-1]
            if district == 'Weißensee':
                district = 'Pankow'
            if district == 'Prenzlauer Berg':
                district = 'Pankow'
            if district == 'Hohenschönhausen':
                district = 'Lichtenberg'
            if district == 'Wedding':
                district = 'Berlin-Mitte'
            find_results = regions['region'].str.find(district)
            #print('***', city_short, ':', district, '-->', regions['region'][find_results > -1])
            data.loc[idx, 'region_idx'] = regions.index[find_results > -1][-1]
        elif city_short == 'Neuss (Rhein-Kreis)':
            find_results = regions['region'].str.find('Neuss')
            data.loc[idx, 'region_idx'] = regions.index[find_results > -1][-1]
        elif city_short == 'Sankt Wendel':
            find_results = regions['region'].str.find('Wendel')
            data.loc[idx, 'region_idx'] = regions.index[find_results > -1][-1]
        elif city_short == 'Stadtverband Saarbrücken':
            find_results = regions['region'].str.find('Saarbrücken, Regionalverband')
            data.loc[idx, 'region_idx'] = regions.index[find_results > -1][-1]
        elif city_short.split()[1] in ('in', 'im', 'an', 'am'):
            if city_short.split()[0] == 'Neustadt':
                find_results = regions['region'].str.find(city_short.split()[-1])
            else:
                find_results = regions['region'].str.find(city_short.split()[0])
            #print('***', city_short, '-->', regions.loc[find_results > -1, 'region'])
            data.loc[idx, 'region_idx'] = regions.index[find_results > -1][-1]
        else:
            print('NOT FOUND:', city_short)
            
    else:
        # take last match (smallest region)
        data.loc[idx, 'region_idx'] = regions.index[find_results > -1][-1]
data['Region_id'] = regions.loc[data['region_idx'], 'id'].values
data['Income'] = regions.loc[data['region_idx'], 'income'].values
data['Land_prices'] = regions.loc[data['region_idx'], 'prices'].values
data = data.drop(columns=['city_short', 'region_idx'])
data.head(5)
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition State City Place Garages Garage_type Region_id Income Land_prices
0 13.118355 Multiple dwelling 4.663439 5.433722 5.5 1.0 2.0 2.772589 2.772589 modernized Baden-Württemberg Bodenseekreis Bermatingen 2.0 Outside parking lot 8435 26548 172.95
1 13.526494 Farmhouse 5.093075 4.406719 5.0 2.0 4.0 2.079442 2.079442 dilapidated Baden-Württemberg Esslingen (Kreis) Ostfildern 1.0 Garage 8116 25449 387.06
2 12.464583 Farmhouse 4.941642 6.701960 4.0 2.0 2.0 4.795791 3.044522 fixer-upper Baden-Württemberg Waldshut (Kreis) Bonndorf im Schwarzwald 1.0 Garage 8337 25304 111.64
3 12.804909 Duplex 5.424950 6.880384 10.0 4.0 2.0 5.356586 1.791759 modernized Baden-Württemberg Enzkreis Neuenbürg 8.0 Outside parking lot 8236 25496 192.18
4 14.375126 Mid-terrace house 5.347108 7.286192 6.0 2.0 3.0 4.406719 1.945910 modernized Baden-Württemberg Stuttgart Schönberg 2.0 Garage 8111 25559 1500.34
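
Rows for which no region could be matched keep the default region index 0 and therefore receive the values of the 'Deutschland' row, that is, national averages. A small sanity check (a sketch, not part of the original preprocessing) counts such rows:

# rows with Region_id 0 fell back to the national averages
unmatched = (data['Region_id'] == 0).sum()
print(unmatched, 'of', len(data), 'rows use national averages')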

Preprocessing New Columns#

Now that we have two new columns (Income and Land_prices) we should look at their values.

Income Column#

sns.histplot(data['Income'])
plt.show()
../../../_images/example2_13_0.png

Log-scaling is not mandatory here, but it brings the values into a range similar to that of the other columns. Without log-scaling we would have a column with a very large range, which may cause problems when using regularization (see Regularization). Alternatively, we could standardize all columns before doing linear regression.
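
The standardization alternative could look like the following sketch, with a StandardScaler placed in front of the pipeline used below (we do not pursue this variant here):

# alternative to log-scaling individual columns: standardize all features
# (zero mean, unit variance per column) inside the pipeline
scaled_pipe = pipeline.Pipeline([
    ('scale', preprocessing.StandardScaler()),
    ('poly', preprocessing.PolynomialFeatures()),
    ('ridge', linear_model.Ridge())
])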

sns.histplot(np.log(data['Income'].to_numpy()))
plt.show()
../../../_images/example2_15_0.png
data['Income'] = np.log(data['Income'].to_numpy())

Land_prices Column#

sns.histplot(data['Land_prices'])
plt.show()
../../../_images/example2_18_0.png
sns.histplot(np.log(data['Land_prices'].to_numpy()))
plt.show()
../../../_images/example2_19_0.png
data['Land_prices'] = np.log(data['Land_prices'].to_numpy())

Save Data#

data.to_csv(data_path.replace('preprocessed', 'extended'))

Linear Regression#

Now we do linear regression as before, but with two additional columns.

# ordinal encoding of Condition relies on the category order defined above
data['Condition_codes'] = data['Condition'].cat.codes
# one-hot encoding for the nominal columns Type and Garage_type
data = pd.get_dummies(data, columns=['Type', 'Garage_type'], drop_first=True)
y = data['Price'].to_numpy()
X = data.drop(columns=['Price', 'Condition', 'State', 'City', 'Place', 'Region_id']).to_numpy()

print(X.shape, y.shape)
(4772, 23) (4772,)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
print(y_train.size, y_test.size)
3817 955
steps = [('poly', preprocessing.PolynomialFeatures()),
         ('ridge', linear_model.Ridge())]

pipe = pipeline.Pipeline(steps)

param_grid = {'poly__degree': [1, 2, 3],
              'ridge__alpha': [0] + [2 ** k for k in range(5, 15)]}

gs = model_selection.GridSearchCV(pipe, param_grid=param_grid,
                                  scoring='neg_mean_squared_error', n_jobs=-1, cv=5)

gs.fit(X_train, y_train)
best_params = gs.best_params_

Evaluation#

Now comes the interesting part: do we see an increase in prediction quality?

print(best_params)

pipe.set_params(**best_params)
pipe.fit(X_train, y_train)

y_test_pred = pipe.predict(X_test)
{'poly__degree': 2, 'ridge__alpha': 512}
rmse = metrics.mean_squared_error(y_test, y_test_pred, squared=False)
sigma = np.std(y_test)
print('RMSE:', rmse)
print('standard deviation:', sigma)
print('ratio:', rmse / sigma)
RMSE: 0.41784051720522897
standard deviation: 0.7969255717596182
ratio: 0.5243156099039885
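Because sigma is the standard deviation of the test targets, this ratio is directly related to the coefficient of determination on the test set. As a small side computation (not part of the original evaluation):

# R^2 on the test set follows from the ratio: R^2 = 1 - (rmse / sigma)^2
print('R squared:', 1 - (rmse / sigma) ** 2)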
fig, ax = plt.subplots()
ax.plot(y_test, y_test_pred, 'or', markersize=3)
ax.plot([9, 17], [9, 17], '-b')
ax.set_xlabel('true targets')
ax.set_ylabel('predictions')
ax.set_aspect('equal')
plt.show()
../../../_images/example2_31_0.png

Looks much better!

Feature Importance#

With a trained model we can look at feature importances to see which features have the greatest influence on the selling price.

import sklearn.inspection as inspection
result = inspection.permutation_importance(pipe, X, y, n_jobs=-1)
cols = data.drop(columns=['Price', 'Condition', 'State', 'City', 'Place', 'Region_id']).columns
imp = pd.Series(result.importances_mean, index=cols)
imp = imp.sort_values(ascending=False)
imp
Land_prices                            0.402403
Living_space                           0.313045
Year_built                             0.169028
Lot                                    0.064657
Income                                 0.029009
Rooms                                  0.019429
Bathrooms                              0.019242
Type_Duplex                            0.017057
Type_Villa                             0.015990
Year_renovated                         0.011070
Condition_codes                        0.008705
Type_Mid-terrace house                 0.007865
Garage_type_Outside parking lot        0.007603
Floors                                 0.005987
Garages                                0.004356
Garage_type_Garage                     0.003244
Type_Corner house                      0.002825
Type_Multiple dwelling                 0.002599
Type_Single dwelling                   0.002530
Type_Farmhouse                         0.000998
Type_Special property                  0.000996
Garage_type_Underground parking lot    0.000359
Type_Residential property              0.000302
dtype: float64
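
A horizontal bar plot of these importances (a sketch using the Series computed above) makes the ranking easier to read:

# visualize permutation importances, most important feature at the top
fig, ax = plt.subplots(figsize=(6, 8))
imp.sort_values().plot.barh(ax=ax)
ax.set_xlabel('mean decrease in score')
plt.show()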