Worked Example: House Prices I#

To put the supervised learning techniques discussed so far to the test, we train a model for predicting house prices in Germany. The inputs are properties of a house and of the plot of land it has been built on; the output is the selling price.

Training data exists in the form of advertisements on specialized websites for finding a buyer for a house. In principle we could scrape data from such a website, but this usually is not allowed by the website operator, and we would have to write lots of code. Erdogan Seref (unreachable in 2023) already did this job at www.immobilienscout24.de and published the data set at www.kaggle.com (unreachable in 2023) under an Attribution-NonCommercial-ShareAlike 4.0 International License.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn.linear_model as linear_model
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.preprocessing as preprocessing
import sklearn.pipeline as pipeline

data_path = 'german_housing.csv'

The Data Set#

First we load the data set and try to get an overview of the features and the data quality.

data = pd.read_csv(data_path)

If a data frame has many columns, Pandas by default does not show all of them. But we want to see all columns. Thus, we adjust the number of columns shown by setting the corresponding option to None (that is, unlimited).

pd.set_option('display.max_columns', None)
data.head(10)
Unnamed: 0 Price Type Living_space Lot Usable_area Free_of_Relation Rooms Bedrooms Bathrooms Floors Year_built Furnishing_quality Year_renovated Condition Heating Energy_source Energy_certificate Energy_certificate_type Energy_consumption Energy_efficiency_class State City Place Garages Garagetype
0 0 498000.0 Multiple dwelling 106.00 229.0 NaN 01.10.2020 5.5 3.0 1.0 2.0 2005.0 normal NaN modernized central heating Gas available demand certificate NaN D Baden-Württemberg Bodenseekreis Bermatingen 2.0 Parking lot
1 1 495000.0 Mid-terrace house 140.93 517.0 20.00 01.01.2021 6.0 3.0 2.0 NaN 1994.0 basic NaN modernized stove heating NaN not required by law NaN NaN NaN Baden-Württemberg Konstanz (Kreis) Engen 7.0 Parking lot
2 2 749000.0 Farmhouse 162.89 82.0 37.62 01.07.2020 5.0 3.0 2.0 4.0 2013.0 NaN NaN dilapidated stove heating Fernwärme, Bioenergie available demand certificate NaN B Baden-Württemberg Esslingen (Kreis) Ostfildern 1.0 Garage
3 3 259000.0 Farmhouse 140.00 814.0 NaN nach Vereinbarung 4.0 NaN 2.0 2.0 1900.0 basic 2000.0 fixer-upper central heating Strom available demand certificate NaN G Baden-Württemberg Waldshut (Kreis) Bonndorf im Schwarzwald 1.0 Garage
4 4 469000.0 Multiple dwelling 115.00 244.0 NaN sofort 4.5 2.0 1.0 NaN 1968.0 refined 2019.0 refurbished central heating Öl available demand certificate NaN F Baden-Württemberg Esslingen (Kreis) Leinfelden-Echterdingen 1.0 Garage
5 5 1400000.0 Mid-terrace house 310.00 860.0 100.00 sofort 8.0 NaN NaN 3.0 1969.0 basic NaN maintained NaN Öl available consumption certificate NaN NaN Baden-Württemberg Stuttgart Süd 2.0 Garage
6 6 3500000.0 Duplex 502.00 5300.0 163.16 nach Absprache 13.0 NaN 4.0 NaN 2004.0 basic NaN dilapidated stove heating Erdwärme, Holzpellets available consumption certificate 35.9 A Baden-Württemberg Göppingen (Kreis) Wangen 7.0 Parking lot
7 7 630000.0 Duplex 263.00 406.0 118.00 01.04.2020 10.0 NaN NaN 3.0 1989.0 basic NaN modernized stove heating Gas available demand certificate NaN E Baden-Württemberg Freiburg im Breisgau Munzingen 2.0 Garage
8 8 364000.0 Duplex 227.00 973.0 83.00 nach Absprache 10.0 4.0 4.0 2.0 1809.0 normal 2015.0 modernized central heating Strom available consumption certificate 183.1 F Baden-Württemberg Enzkreis Neuenbürg 8.0 Parking lot
9 9 1900000.0 Duplex 787.00 933.0 NaN NaN 30.0 NaN NaN 3.0 1920.0 basic NaN modernized stove heating Gas, Fernwärme-Dampf available demand certificate NaN D Baden-Württemberg Mannheim Rheinau 12.0 Parking lot
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10552 entries, 0 to 10551
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               10552 non-null  int64  
 1   Price                    10552 non-null  float64
 2   Type                     10150 non-null  object 
 3   Living_space             10552 non-null  float64
 4   Lot                      10552 non-null  float64
 5   Usable_area              5568 non-null   float64
 6   Free_of_Relation         6983 non-null   object 
 7   Rooms                    10552 non-null  float64
 8   Bedrooms                 6878 non-null   float64
 9   Bathrooms                8751 non-null   float64
 10  Floors                   7888 non-null   float64
 11  Year_built               9858 non-null   float64
 12  Furnishing_quality       7826 non-null   object 
 13  Year_renovated           5349 non-null   float64
 14  Condition                10229 non-null  object 
 15  Heating                  9968 non-null   object 
 16  Energy_source            9325 non-null   object 
 17  Energy_certificate       9797 non-null   object 
 18  Energy_certificate_type  7026 non-null   object 
 19  Energy_consumption       2433 non-null   float64
 20  Energy_efficiency_class  5733 non-null   object 
 21  State                    10551 non-null  object 
 22  City                     10551 non-null  object 
 23  Place                    10262 non-null  object 
 24  Garages                  8592 non-null   float64
 25  Garagetype               8592 non-null   object 
dtypes: float64(12), int64(1), object(13)
memory usage: 2.1+ MB

We should drop irrelevant columns and adjust data types.

  • Unnamed: 0: Seems to be an integer index. We don’t need it, so drop it.

  • Price: This is our target variable.

  • Type: An important column, because house prices are likely to depend on the type of house. We should convert this to categorical type.

  • Living_space and Lot: Important features, keep them.

  • Usable_area: Likely to have influence on the selling price, but available only for half the samples. If we wanted to use this column for regression, we would have to drop half the training samples. Alternatively, we could impute values, but it is very hard to guess the usable area from other features. We should drop the column.

  • Free_of_Relation: Not related to the selling price. Drop it.

  • Rooms, Bedrooms, Bathrooms: Should have influence on prices, but not available for all samples. For the moment we keep all three columns. Later we should have a look at the correlations between the three columns and possibly keep only the first one, which is available for all samples.

  • Floors: Important feature, keep it.

  • Year_built: Important feature, keep it.

  • Furnishing_quality: Important, convert to categorical and keep.

  • Year_renovated: Important, but half the data is missing. There is a good chance that a missing value indicates that the house has not been renovated so far. Thus, a reasonable fill value is the year of construction.

  • Condition: Important, convert to categorical and keep.

  • Heating and Energy_source: Could be important, convert to categorical and keep.

  • Energy_certificate, Energy_certificate_type, Energy_consumption: The first contains more or less only the value 'available' (since energy certificates are required by law). The second is irrelevant and the third is missing for most samples. Drop them all.

  • Energy_efficiency_class: Likely to have influence on the selling price, although the classification procedure is rather unreliable in practice. Keep it and convert to categorical.

  • State, City, Place: Geolocation surely influences selling prices. But it’s hard to use location data for regression. For the moment we keep these columns.

  • Garages: Could be important, keep.

  • Garagetype: If we keep Garages, then we also have to keep this column. Convert to categorical and rename it to Garage_type to fit the naming convention used for the other columns.

data = data.drop(columns=['Unnamed: 0', 'Usable_area', 'Free_of_Relation',
                          'Energy_certificate', 'Energy_certificate_type', 'Energy_consumption'])

data['Type'] = data['Type'].astype('category')
data['Furnishing_quality'] = data['Furnishing_quality'].astype('category')
data['Condition'] = data['Condition'].astype('category')
data['Heating'] = data['Heating'].astype('category')
data['Energy_source'] = data['Energy_source'].astype('category')
data['Energy_efficiency_class'] = data['Energy_efficiency_class'].astype('category')
data['Garagetype'] = data['Garagetype'].astype('category')

data = data.rename(columns={'Garagetype': 'Garage_type'})

nan_mask = data['Year_renovated'].isna()
data.loc[nan_mask, 'Year_renovated'] = data.loc[nan_mask, 'Year_built']
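The same fill can be written more compactly with fillna, which aligns the two columns by their index. A sketch equivalent to the mask-based assignment above:

# Equivalent one-liner: fill missing renovation years with the construction year.
data['Year_renovated'] = data['Year_renovated'].fillna(data['Year_built'])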

The categorical columns Furnishing_quality, Condition, and Energy_efficiency_class have a natural ordering, which should be represented by the data type.

print(data['Furnishing_quality'].cat.categories)
print(data['Condition'].cat.categories)
print(data['Energy_efficiency_class'].cat.categories)
Index(['basic', 'luxus', 'normal', 'refined'], dtype='object')
Index(['as new', 'by arrangement', 'dilapidated', 'first occupation',
       'first occupation after refurbishment', 'fixer-upper', 'maintained',
       'modernized', 'refurbished', 'renovated'],
      dtype='object')
Index([' A ', ' A+ ', ' B ', ' C ', ' D ', ' E ', ' F ', ' G ', ' H '], dtype='object')

We should rename some categories and order them as well as possible.

data['Furnishing_quality'] = data['Furnishing_quality'] \
    .cat.rename_categories({'luxus': 'luxury'}) \
    .cat.reorder_categories(['basic', 'normal', 'refined', 'luxury'])

data['Condition'] = data['Condition'].cat.reorder_categories([
    'first occupation',
     'first occupation after refurbishment',
     'as new',
     'maintained',
     'renovated',
     'modernized',
     'refurbished',
     'by arrangement',
     'fixer-upper',
     'dilapidated'
])

data['Energy_efficiency_class'] = data['Energy_efficiency_class'] \
    .cat.rename_categories({
        ' A ': 'A',
        ' A+ ': 'A+',
        ' B ': 'B',
        ' C ': 'C',
        ' D ': 'D',
        ' E ': 'E',
        ' F ': 'F',
        ' G ': 'G',
        ' H ': 'H'
    }) \
    .cat.reorder_categories(['A+', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])
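As an aside, reorder_categories on its own does not mark the dtype as ordered, so comparisons like < still raise an error. If we wanted such comparisons to respect the category order, the dtype could be declared as ordered explicitly; a minimal sketch on toy data:

# Toy example: an explicitly ordered categorical supports order comparisons.
quality = pd.Series(['basic', 'luxury', 'normal'],
                    dtype=pd.CategoricalDtype(['basic', 'normal', 'refined', 'luxury'],
                                              ordered=True))
print(quality < 'refined')    # elementwise comparison respecting the declared order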

Now let’s see how many complete samples we have.

len(data.dropna())
1591

That’s very few. So we should drop some columns with many missing values.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10552 entries, 0 to 10551
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   Price                    10552 non-null  float64 
 1   Type                     10150 non-null  category
 2   Living_space             10552 non-null  float64 
 3   Lot                      10552 non-null  float64 
 4   Rooms                    10552 non-null  float64 
 5   Bedrooms                 6878 non-null   float64 
 6   Bathrooms                8751 non-null   float64 
 7   Floors                   7888 non-null   float64 
 8   Year_built               9858 non-null   float64 
 9   Furnishing_quality       7826 non-null   category
 10  Year_renovated           10211 non-null  float64 
 11  Condition                10229 non-null  category
 12  Heating                  9968 non-null   category
 13  Energy_source            9325 non-null   category
 14  Energy_efficiency_class  5733 non-null   category
 15  State                    10551 non-null  object  
 16  City                     10551 non-null  object  
 17  Place                    10262 non-null  object  
 18  Garages                  8592 non-null   float64 
 19  Garage_type              8592 non-null   category
dtypes: category(7), float64(10), object(3)
memory usage: 1.1+ MB

Energy_efficiency_class is relatively unreliable and not too important for selling prices, so we check how many complete samples dropping it would yield.

len(data.drop(columns=['Energy_efficiency_class']).dropna())
2615

Better, but not good. The Bedrooms column has many missing values, too, and it is likely to be correlated with Rooms. So let's look at the correlations between Rooms, Bedrooms, Bathrooms, and Floors.

sns.pairplot(data[['Rooms', 'Bedrooms', 'Bathrooms', 'Floors']], plot_kws={"s": 5})
plt.show()
[Figure: pair plot of Rooms, Bedrooms, Bathrooms, and Floors]
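The visual impression can be checked numerically, too. Pandas computes pairwise correlation coefficients and ignores missing values; a short sketch:

# Pearson correlation coefficients for the four count-like columns.
print(data[['Rooms', 'Bedrooms', 'Bathrooms', 'Floors']].corr())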

Floors is not correlated with the other columns, so keep it. Bedrooms shows correlation with Rooms and Bathrooms, so drop Bedrooms. Bathrooms shows some correlation with Rooms. Whether to drop Bathrooms should be decided by the resulting increase in sample count.

len(data.drop(columns=['Energy_efficiency_class', 'Bedrooms']).dropna())
3174
len(data.drop(columns=['Energy_efficiency_class', 'Bedrooms', 'Bathrooms']).dropna())
3479

We should keep Bathrooms, because dropping it yields only about 300 more samples while neglecting possibly important information. Note that the number of bathrooms can be regarded as a measure of overall furnishing quality. Thus, there should be some correlation with Furnishing_quality.

data.groupby('Furnishing_quality')['Bathrooms'].mean()
Furnishing_quality
basic      2.207496
normal     2.363720
refined    1.826739
luxury     2.778761
Name: Bathrooms, dtype: float64

The group means do not show a clear relation, however, and judging the furnishing quality of a house is highly subjective anyway. Thus, we should drop the column to get more samples without missing data.

len(data.drop(columns=['Energy_efficiency_class', 'Bedrooms', 'Furnishing_quality']).dropna())
4526

Energy_source is another candidate for dropping, because it has more than 1000 missing values and its influence on selling prices should be rather low.

for cat in data['Energy_source'].cat.categories:
    print(cat)
 Bioenergie 
 Erdgas leicht 
 Erdgas leicht, Erdgas schwer 
 Erdgas schwer 
 Erdgas schwer, Bioenergie 
 Erdgas schwer, Holz 
 Erdwärme 
 Erdwärme, Fernwärme 
 Erdwärme, Gas 
 Erdwärme, Holzpellets 
 Erdwärme, Solar 
 Erdwärme, Solar, Holzpellets, Holz 
 Erdwärme, Solar, Umweltwärme 
 Erdwärme, Strom 
 Erdwärme, Umweltwärme 
 Fernwärme 
 Fernwärme, Bioenergie 
 Fernwärme, Flüssiggas 
 Fernwärme, Nahwärme, KWK fossil 
 Fernwärme-Dampf 
 Flüssiggas 
 Flüssiggas, Holz 
 Gas 
 Gas, Bioenergie 
 Gas, Fernwärme 
 Gas, Fernwärme-Dampf 
 Gas, Holz 
 Gas, Holz-Hackschnitzel 
 Gas, KWK fossil 
 Gas, Kohle, Holz 
 Gas, Strom 
 Gas, Strom, Holz 
 Gas, Strom, Kohle, Holz 
 Gas, Wasserenergie 
 Gas, Öl 
 Gas, Öl, Holz 
 Gas, Öl, Kohle 
 Gas, Öl, Kohle, Holz 
 Gas, Öl, Strom 
 Holz 
 Holz, Bioenergie 
 Holz-Hackschnitzel 
 Holzpellets 
 Holzpellets, Gas 
 Holzpellets, Gas, Öl 
 Holzpellets, Holz 
 Holzpellets, Holz-Hackschnitzel 
 Holzpellets, Kohle, Holz 
 Holzpellets, Strom 
 Holzpellets, Öl 
 KWK erneuerbar 
 KWK fossil 
 KWK regenerativ 
 Kohle 
 Kohle, Holz 
 Kohle/Koks 
 Nahwärme 
 Solar 
 Solar, Bioenergie 
 Solar, Erdgas schwer 
 Solar, Gas 
 Solar, Gas, Holz 
 Solar, Gas, Strom 
 Solar, Gas, Strom, Holz 
 Solar, Gas, Wasserenergie 
 Solar, Gas, Öl 
 Solar, Gas, Öl, Holz 
 Solar, Holz 
 Solar, Holz-Hackschnitzel 
 Solar, Holzpellets 
 Solar, Holzpellets, Holz 
 Solar, Holzpellets, Strom 
 Solar, Holzpellets, Öl 
 Solar, Strom 
 Solar, Strom, Bioenergie 
 Solar, Umweltwärme 
 Solar, Öl 
 Solar, Öl, Bioenergie 
 Solar, Öl, Holz 
 Solar, Öl, Holz-Hackschnitzel 
 Solar, Öl, Strom 
 Solar, Öl, Strom, KWK fossil 
 Strom 
 Strom, Bioenergie 
 Strom, Flüssiggas 
 Strom, Holz 
 Strom, Holz-Hackschnitzel 
 Strom, Kohle 
 Strom, Kohle, Holz 
 Strom, Umweltwärme 
 Umweltwärme 
 Wasserenergie 
 Windenergie 
 Wärmelieferung 
 Öl 
 Öl, Bioenergie 
 Öl, Fernwärme 
 Öl, Holz 
 Öl, Kohle 
 Öl, Kohle, Holz 
 Öl, Strom 
 Öl, Strom, Holz 
 Öl, Strom, Kohle, Holz 
 Öl, Umweltwärme 

The values are very diverse and hard to preprocess for regression. We would have to convert the column to several boolean columns, one per energy source; a sketch follows below. In addition, some grouping would be necessary (Holz is a subcategory of Bioenergie, and so on).
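For illustration only (we are going to drop the column instead), such a conversion could start from Pandas' string methods; note that the raw categories carry surrounding spaces:

# Sketch only: one boolean indicator column per energy source (grouping omitted).
sources = data['Energy_source'].astype('string').str.strip()
indicators = sources.str.get_dummies(sep=', ')
print(indicators.shape)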

len(data.drop(columns=['Energy_efficiency_class', 'Bedrooms', 'Furnishing_quality', 'Energy_source']).dropna())
4854

Now we have almost 5000 complete samples. That should be a good compromise between completeness and level of detail.

data = data.drop(columns=['Energy_efficiency_class', 'Bedrooms', 'Furnishing_quality', 'Energy_source'])
data = data.dropna()

Outliers and Further Preprocessing#

Now that we have a cleaned data set, we should remove outliers. The simplest method of detecting outliers is to look at the ranges of all features. With describe we get a first overview of the numerical features.

data.describe()
Price Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Garages
count 4.854000e+03 4854.000000 4854.000000 4854.000000 4854.000000 4854.000000 4854.000000 4854.000000 4854.000000
mean 5.739566e+05 209.305740 1240.636904 7.051504 2.316028 2.256696 1964.252369 1995.626700 2.518541
std 5.880211e+05 118.252688 3806.518099 3.834865 1.595327 0.776769 49.065052 35.389067 2.719901
min 0.000000e+00 0.000000 0.000000 1.000000 0.000000 0.000000 1430.000000 1430.000000 1.000000
25% 2.800000e+05 135.000000 401.000000 5.000000 1.000000 2.000000 1950.000000 1991.000000 1.000000
50% 4.400000e+05 180.000000 675.000000 6.000000 2.000000 2.000000 1974.000000 2008.000000 2.000000
75% 6.850000e+05 248.000000 1042.000000 8.000000 3.000000 3.000000 1997.750000 2016.000000 3.000000
max 1.300000e+07 1742.240000 143432.000000 84.000000 26.000000 8.000000 2021.000000 2206.000000 65.000000

Price Column#

sns.histplot(data['Price'])
plt.show()
[Figure: histogram of Price]

There are only very few high prices, and the price distribution concentrates on low prices. If the target variable has a wide range, but most samples concentrate on a small portion of that range, then ‘learning’ the target is much more difficult than for more uniformly distributed data.

A common trick is to use nonlinear scaling. Especially for market prices it is known from experience that they follow a log-normal distribution, that is, after applying the logarithm we see a normal distribution. Before applying the logarithm we should drop samples with zeros in the Price column to avoid undefined results. A price of zero indicates that the seller did not provide a price in the advertisement. Thus, dropping such samples is a good idea even if we did not want to apply the logarithm.

data = data.loc[data['Price'] > 0, :]
sns.histplot(np.log(data['Price'].to_numpy()))
plt.show()
[Figure: histogram of log-transformed Price]

There seem to be some very small values.

data.loc[data['Price'] <= np.exp(7), :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
6987 1.0 Duplex 459.0 2742.0 23.0 8.0 3.0 1957.0 1957.0 refurbished stove heating Nordrhein-Westfalen Märkischer Kreis Altena 1.0 Garage

Those samples should be dropped because house prices below \(\mathrm{e}^7\approx 1000\) EUR are very uncommon.

data['Price'] = np.log(data['Price'].to_numpy())
data = data.loc[data['Price'] > 7, :]

Living_space Column#

sns.histplot(data['Living_space'])
plt.show()
[Figure: histogram of Living_space]

Same here as for Price.

data = data.loc[data['Living_space'] > 0, :]
sns.histplot(np.log(data['Living_space'].to_numpy()))
plt.show()
[Figure: histogram of log-transformed Living_space]
data['Living_space'] = np.log(data['Living_space'].to_numpy())

Lot Column#

sns.histplot(data['Lot'])
plt.show()
[Figure: histogram of Lot]

Same here as for Price again.

data = data.loc[data['Lot'] > 0, :]
sns.histplot(np.log(data['Lot'].to_numpy()))
plt.show()
[Figure: histogram of log-transformed Lot]
data.loc[data['Lot'] <= np.exp(2), :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
722 11.492723 Mid-terrace house 4.491441 0.25 4.0 1.0 3.0 1900.0 1900.0 modernized wood-pellet heating Baden-Württemberg Main-Tauber-Kreis Ahorn 1.0 Garage
1876 12.860362 Residential property 5.342334 1.00 5.0 1.0 1.0 2019.0 2019.0 first occupation stove heating Bayern Regensburg (Kreis) Hemau 1.0 Outside parking lot
6435 11.707834 Duplex 5.056246 1.96 5.0 2.0 3.0 1905.0 1981.0 refurbished stove heating Nordrhein-Westfalen Höxter (Kreis) Höxter 3.0 Outside parking lot
7887 12.594731 Duplex 5.857933 1.00 8.0 5.0 3.0 1965.0 2019.0 modernized stove heating Rheinland-Pfalz Neuwied (Kreis) Dierdorf 1.0 Outside parking lot

Lot size below \(\mathrm{e}^2<8\) m² is very unlikely.

data['Lot'] = np.log(data['Lot'].to_numpy())
data = data.loc[data['Lot'] > 2, :]
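By now the pattern ‘drop nonpositive values, take the logarithm, drop implausibly small values’ has repeated three times, so it could be factored into a small helper. A sketch with hypothetical names; the following sections keep the explicit steps for clarity:

def log_scale(df, column, min_log_value=None):
    # Drop nonpositive entries, replace the column by its logarithm,
    # and optionally drop rows below a minimal log value.
    df = df.loc[df[column] > 0, :].copy()
    df[column] = np.log(df[column].to_numpy())
    if min_log_value is not None:
        df = df.loc[df[column] > min_log_value, :]
    return df

# Usage would be, e.g.: data = log_scale(data, 'Lot', min_log_value=2)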

Rooms Column#

sns.histplot(data['Rooms'])
plt.show()
[Figure: histogram of Rooms]
data.loc[data['Rooms'] >= 30, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
1863 14.430696 Residential property 6.371612 7.578145 41.0 21.0 3.0 1907.0 2019.0 maintained stove heating Bayern Eichstätt (Kreis) Dollnstein 16.0 Outside parking lot
4322 14.346139 Single dwelling 6.745236 7.783224 30.0 6.0 3.0 2009.0 2009.0 dilapidated stove heating Hessen Limburg-Weilburg (Kreis) Hadamar 4.0 Outside parking lot
4346 15.150512 Duplex 6.690842 6.311735 36.0 13.0 5.0 1961.0 2016.0 modernized stove heating Hessen Frankfurt am Main Rödelheim 3.0 Garage
4557 12.896717 Duplex 6.653534 7.970395 36.0 15.0 3.0 1970.0 2000.0 maintained stove heating Mecklenburg-Vorpommern Demmin (Kreis) Rosenow 1.0 Outside parking lot
4614 13.906265 Duplex 7.306531 8.234034 84.0 24.0 5.0 1991.0 2018.0 modernized stove heating Mecklenburg-Vorpommern Demmin (Kreis) Kruckow 16.0 Outside parking lot
4615 13.906265 Duplex 7.306531 8.234034 84.0 24.0 5.0 1989.0 2018.0 modernized stove heating Mecklenburg-Vorpommern Demmin (Kreis) Kruckow 16.0 Outside parking lot
6895 13.639966 Duplex 6.516193 7.090077 32.0 8.0 3.0 2003.0 2003.0 dilapidated electric heating Nordrhein-Westfalen Gelsenkirchen Hassel 8.0 Parking lot
7397 13.963931 Duplex 7.057898 6.952729 45.0 20.0 4.0 1950.0 1950.0 maintained heat pump Nordrhein-Westfalen Gelsenkirchen Horst 12.0 Outside parking lot
7904 13.805460 Duplex 6.682109 8.505121 35.0 13.0 3.0 2004.0 2019.0 first occupation after refurbishment stove heating Rheinland-Pfalz Trier-Saarburg (Kreis) Saarburg 15.0 Outside parking lot
8705 14.506155 Bungalow 7.003065 8.575462 40.0 0.0 3.0 2012.0 2012.0 fixer-upper heat pump Rheinland-Pfalz Vulkaneifel (Kreis) Gerolstein 40.0 Parking lot

There are only very few samples with a high number of rooms. There is no chance to get good predictions for such houses from those few samples.

data = data.loc[data['Rooms'] < 30, :]

Bathrooms Column#

sns.histplot(data['Bathrooms'])
plt.show()
[Figure: histogram of Bathrooms]
data.loc[data['Bathrooms'] >= 15, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
3270 15.264780 Duplex 7.462927 7.693605 26.0 26.0 4.0 1994.0 1994.0 modernized heat pump Brandenburg Teltow-Fläming (Kreis) Blankenfelde-Mahlow 26.0 Duplex lot
6006 13.457406 Bungalow 6.514713 8.750366 25.0 25.0 3.0 1874.0 2016.0 renovated cogeneration units Niedersachsen Lüchow-Dannenberg (Kreis) Küsten 30.0 Outside parking lot
7224 14.038654 Duplex 6.476972 6.063785 18.0 18.0 4.0 1962.0 1962.0 modernized stove heating Nordrhein-Westfalen Herford (Kreis) Herford 5.0 Garage
10088 12.037654 Multiple dwelling 4.499810 5.777652 4.0 22.0 2.0 1961.0 2000.0 modernized heat pump Schleswig-Holstein Schleswig-Flensburg (Kreis) Tarp 1.0 Outside parking lot
data = data.loc[data['Bathrooms'] < 15, :]

Floors Column#

sns.histplot(data['Floors'])
plt.show()
[Figure: histogram of Floors]

Nothing to do here.

Year_built Column#

sns.histplot(data['Year_built'])
plt.show()
[Figure: histogram of Year_built]
data.loc[data['Year_built'] <= 1500, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
1311 14.580978 Corner house 6.182085 11.522876 13.0 7.0 3.0 1430.0 1430.0 fixer-upper stove heating Bayern Berchtesgadener Land (Kreis) Marktschellenberg 2.0 Carport
1458 14.077875 Duplex 6.522093 6.361302 19.0 9.0 3.0 1500.0 2014.0 modernized underfloor heating Bayern Ansbach Stadt 9.0 Carport
8174 12.203570 Bungalow 5.703782 5.634790 10.0 2.0 4.0 1492.0 1492.0 refurbished stove heating Rheinland-Pfalz Bad Kreuznach (Kreis) Bad Kreuznach 2.0 Garage
data = data.loc[data['Year_built'] > 1500, :]

Values above 2020 are obviously wrong (the data set is from 2020).

data.loc[data['Year_built'] > 2020, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
2437 13.623139 Single dwelling 4.897840 5.609472 4.5 1.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Rosenheim (Kreis) Kolbermoor 2.0 Parking lot
2515 13.704579 Single dwelling 4.875197 5.796058 4.0 1.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Ebersberg (Kreis) Zorneding 2.0 Parking lot
2516 13.704579 Single dwelling 4.875197 5.963579 4.0 1.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Ebersberg (Kreis) Zorneding 2.0 Parking lot
2519 13.928839 Mid-terrace house 4.962845 6.061457 4.5 2.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Rosenheim (Kreis) Kolbermoor 2.0 Parking lot
2520 13.981025 Mid-terrace house 4.962845 6.165418 4.5 2.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Rosenheim (Kreis) Kolbermoor 2.0 Parking lot
2521 13.652992 Single dwelling 4.897840 5.700444 4.5 1.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Rosenheim (Kreis) Kolbermoor 2.0 Parking lot
data = data.loc[data['Year_built'] <= 2020, :]

To get a better distribution of the samples over the range, we again apply a logarithmic transform, this time to the age of the house (2021 minus the year of construction).

data.loc[:, 'Year_built'] = np.log(2021 - data['Year_built'].to_numpy())
sns.histplot(data['Year_built'])
plt.show()
[Figure: histogram of log-transformed house age]

Year_renovated Column#

sns.histplot(data['Year_renovated'])
plt.show()
[Figure: histogram of Year_renovated]

There seem to be renovations before 1900, which looks somewhat strange. But remember that we filled missing values with values from Year_built. Values above 2020 are obviously wrong.

data.loc[data['Year_renovated'] > 2020, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
7803 12.992255 Duplex 6.969791 6.818924 26.0 13.0 2.0 4.025352 2026.0 by arrangement heat pump Rheinland-Pfalz Trier-Saarburg (Kreis) Reinsfeld 15.0 Parking lot
9324 12.971540 Mid-terrace house 5.075174 6.486161 6.0 2.0 3.0 4.727388 2206.0 modernized heat pump Sachsen Zwickau Nordvorstadt 2.0 Outside parking lot
data = data.loc[data['Year_renovated'] <= 2020, :]
data.loc[:, 'Year_renovated'] = np.log(2021 - data['Year_renovated'].to_numpy())
sns.histplot(data['Year_renovated'])
plt.show()
[Figure: histogram of log-transformed years since renovation]

Garages Column#

sns.histplot(data['Garages'])
plt.show()
[Figure: histogram of Garages]
data.loc[data['Garages'] >= 20, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
461 14.343193 Bungalow 5.347108 8.932609 8.0 2.0 3.0 3.332205 3.332205 modernized stove heating Baden-Württemberg Ludwigsburg (Kreis) Vaihingen an der Enz 30.0 Outside parking lot
2579 14.220976 Duplex 6.278521 8.027150 15.0 4.0 2.0 3.761200 2.397895 modernized underfloor heating Bayern Rottal-Inn (Kreis) Eggenfelden 22.0 Parking lot
3761 15.009130 Bungalow 6.476972 8.116716 10.0 2.0 3.0 5.293305 1.609438 renovated heat pump Hessen Frankfurt am Main Niederursel 20.0 Outside parking lot
3932 13.384728 Duplex 5.375278 8.809714 15.0 7.0 4.0 4.510860 4.510860 maintained stove heating Hessen Darmstadt-Dieburg (Kreis) Otzberg 60.0 Outside parking lot
4175 14.066269 Duplex 5.777652 6.884487 11.0 5.0 2.0 3.496508 0.000000 modernized stove heating Hessen Limburg-Weilburg (Kreis) Bad Camberg 21.0 Outside parking lot
4243 11.407565 Duplex 3.058707 7.342132 1.0 1.0 4.0 3.912023 3.912023 modernized heat pump Hessen Main-Taunus-Kreis Hattersheim am Main 30.0 Parking lot
7166 13.102161 Bungalow 7.003065 9.239899 12.0 3.0 1.0 3.970292 2.639057 modernized gas heating Nordrhein-Westfalen Minden-Lübbecke (Kreis) Espelkamp 65.0 Parking lot
7870 12.959844 Corner house 5.828946 10.596635 14.0 2.0 3.0 4.795791 1.609438 dilapidated stove heating Rheinland-Pfalz Altenkirchen (Westerwald) (Kreis) Birnbach 50.0 Outside parking lot
9535 13.910821 Duplex 6.415097 8.097122 22.0 8.0 3.0 3.091042 3.091042 modernized stove heating Sachsen-Anhalt Salzlandkreis Bernburg (Saale) 58.0 Outside parking lot
data = data.loc[data['Garages'] < 20, :]

Type Column#

data['Type'].value_counts()
Mid-terrace house       2198
Duplex                   868
Single dwelling          565
Farmhouse                265
Villa                    213
Multiple dwelling        209
Special property         168
Residential property     124
Bungalow                 113
Corner house              78
Castle                     2
Name: Type, dtype: int64
data = data.loc[data['Type'] != 'Castle', :]
data['Type'] = data['Type'].cat.remove_categories('Castle')
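Equivalently, remove_unused_categories drops all categories without remaining samples, which saves us from naming them explicitly:

# Equivalent alternative: drop every category that no longer occurs.
data['Type'] = data['Type'].cat.remove_unused_categories()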

Condition Column#

data['Condition'].value_counts()
modernized                              2146
dilapidated                              663
refurbished                              565
renovated                                509
maintained                               361
fixer-upper                              268
first occupation after refurbishment     210
first occupation                          50
by arrangement                            26
as new                                     3
Name: Condition, dtype: int64

We should remove 'as new' and 'by arrangement' because only a few samples use these categories and both are somewhat dubious.

data = data.loc[~data['Condition'].isin(['as new', 'by arrangement']), :]
data['Condition'] = data['Condition'].cat.remove_categories(['as new', 'by arrangement'])

Heating Column#

data['Heating'].value_counts()
stove heating           2900
heat pump                563
oil heating              420
central heating          248
underfloor heating       162
night storage heater     138
district heating         117
wood-pellet heating       62
floor heating             55
electric heating          52
gas heating               32
cogeneration units        13
solar heating             10
Name: Heating, dtype: int64

Something is wrong here! More than every second house sold in 2020 has stove heating? And what about 'floor heating'? Is it gas powered, oil powered, or something else? What is the difference between 'floor heating' and 'underfloor heating'? It's better to drop this column.

data = data.drop(columns=['Heating'])

Garage_type Column#

data['Garage_type'].value_counts()
Garage                     2647
Outside parking lot         897
Parking lot                 739
Carport                     409
Underground parking lot      53
Duplex lot                   26
Car park lot                  1
Name: Garage_type, dtype: int64

There are several similar categories. We should merge some of them.

data.loc[data['Garage_type'] == 'Car park lot', 'Garage_type'] = 'Outside parking lot'
data.loc[data['Garage_type'] == 'Duplex lot', 'Garage_type'] = 'Outside parking lot'
data.loc[data['Garage_type'] == 'Parking lot', 'Garage_type'] = 'Outside parking lot'
data['Garage_type'] = data['Garage_type'].cat.remove_categories(['Car park lot', 'Duplex lot', 'Parking lot'])

data['Garage_type'].value_counts()
Garage                     2647
Outside parking lot        1663
Carport                     409
Underground parking lot      53
Name: Garage_type, dtype: int64

Save Cleaned Data#

We save the cleaned data for future use.

data.to_csv(data_path.replace('.csv', '_preprocessed.csv'))
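Note that to_csv writes the row index as an additional column by default, which presumably is how the Unnamed: 0 column in the raw file came about. Passing index=False avoids this:

# Variant without writing the integer index as an extra column.
data.to_csv(data_path.replace('.csv', '_preprocessed.csv'), index=False)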

Linear Regression#

Now the data is almost ready for training a model. It remains to convert categorical data to numerical data. Condition is ordered, and its numeric representation is accessible with Series.cat.codes. The columns Type and Garage_type should be one-hot encoded; drop_first=True drops one redundant indicator column per feature.

data['Condition_codes'] = data['Condition'].cat.codes
data = pd.get_dummies(data, columns=['Type', 'Garage_type'], drop_first=True)
data.head()
Price Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition State City Place Garages Condition_codes Type_Corner house Type_Duplex Type_Farmhouse Type_Mid-terrace house Type_Multiple dwelling Type_Residential property Type_Single dwelling Type_Special property Type_Villa Garage_type_Garage Garage_type_Outside parking lot Garage_type_Underground parking lot
0 13.118355 4.663439 5.433722 5.5 1.0 2.0 2.772589 2.772589 modernized Baden-Württemberg Bodenseekreis Bermatingen 2.0 4 0 0 0 0 1 0 0 0 0 0 1 0
2 13.526494 5.093075 4.406719 5.0 2.0 4.0 2.079442 2.079442 dilapidated Baden-Württemberg Esslingen (Kreis) Ostfildern 1.0 7 0 0 1 0 0 0 0 0 0 1 0 0
3 12.464583 4.941642 6.701960 4.0 2.0 2.0 4.795791 3.044522 fixer-upper Baden-Württemberg Waldshut (Kreis) Bonndorf im Schwarzwald 1.0 6 0 0 1 0 0 0 0 0 0 1 0 0
8 12.804909 5.424950 6.880384 10.0 4.0 2.0 5.356586 1.791759 modernized Baden-Württemberg Enzkreis Neuenbürg 8.0 4 0 1 0 0 0 0 0 0 0 0 1 0
10 14.375126 5.347108 7.286192 6.0 2.0 3.0 4.406719 1.945910 modernized Baden-Württemberg Stuttgart Schönberg 2.0 4 0 0 0 1 0 0 0 0 0 1 0 0

We drop columns not used for regression and convert the data frame to NumPy arrays suitable for Scikit-Learn.

y = data['Price'].to_numpy()
X = data.drop(columns=['Price', 'Condition', 'State', 'City', 'Place']).to_numpy()

print(X.shape, y.shape)
(4772, 21) (4772,)

We have relatively little data. Thus, the test set should be small so that more samples remain for training.

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
print(y_train.size, y_test.size)
3817 955
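Note that the split is random, so repeated runs yield different training and test sets. For reproducible experiments a seed can be fixed; a sketch:

# Reproducible variant of the split above.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0)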

We use polynomial regression with ridge regularization and choose the polynomial degree and the regularization parameter by a cross-validated grid search. Note that alpha equal to 0 in the grid corresponds to plain (unregularized) least squares.

steps = [('poly', preprocessing.PolynomialFeatures()),
         ('ridge', linear_model.Ridge())]

pipe = pipeline.Pipeline(steps)

param_grid = {'poly__degree': [1, 2, 3],
              'ridge__alpha': [0] + [2 ** k for k in range(5, 15)]}

gs = model_selection.GridSearchCV(pipe, param_grid=param_grid,
                                  scoring='neg_mean_squared_error', n_jobs=-1, cv=5)

gs.fit(X_train, y_train)
best_params = gs.best_params_
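The cross-validated score of the best parameter combination is available as well. Since scoring is the negative mean squared error, flipping the sign and taking the square root gives a rough RMSE estimate on the validation folds:

# Approximate cross-validation RMSE of the best parameter combination.
print('CV RMSE:', np.sqrt(-gs.best_score_))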

Evaluating the Model#

Now we use the test set to evaluate the prediction quality of the model. Since GridSearchCV by default refits the best model on the whole training set (refit=True), gs.predict would work as well; here we refit the pipeline explicitly with the best parameters.

print(best_params)

pipe.set_params(**best_params)
pipe.fit(X_train, y_train)

y_test_pred = pipe.predict(X_test)
{'poly__degree': 2, 'ridge__alpha': 64}

The root mean squared error between predicted and exact targets on its own does not tell much about the fitting quality. We have to compare the value to the standard deviation of the targets. The standard deviation is the root mean squared error between the exact targets and their mean. In other words, the standard deviation tells us the prediction error if we used the same constant prediction for all inputs. Obviously the constant should be the mean of the training (!) targets, but the mean of the training targets should be very close to the mean of the test targets if the test samples have been selected randomly.

rmse = metrics.mean_squared_error(y_test, y_test_pred, squared=False)
sigma = np.std(y_test)
print('RMSE:', rmse)
print('standard deviation:', sigma)
print('ratio:', rmse / sigma)
RMSE: 0.549529269010328
standard deviation: 0.7553775997302192
ratio: 0.7274894955934497

We see that the model's predictions are better than a constant prediction, but not by much.

We should have a closer look at the predictions. Since there is no natural ordering in the set of samples, plotting y_test and y_test_pred over the sample index with plot does not help much.

fig, ax = plt.subplots()
ax.plot(y_test, '-b', label='true targets')
ax.plot(y_test_pred, '-r', label='predictions')
ax.legend()
plt.show()
[Figure: true targets and predictions plotted over the sample index]

A better idea is to plot y_test versus y_test_pred. If true and predicted targets are close, the points concentrate along the diagonal; otherwise they lie far away from it.

fig, ax = plt.subplots()
ax.plot(y_test, y_test_pred, 'or', markersize=3)
ax.plot([9, 17], [9, 17], '-b')
ax.set_xlabel('true targets')
ax.set_ylabel('predictions')
ax.set_aspect('equal')
plt.show()
[Figure: predicted versus true targets on the test set]

The red cloud is slightly rotated compared to the blue line: small target values get too high predictions, and high target values get too low predictions. In other words, predictions tend to be too close to the targets' mean. Such behavior is typically observed if the training data contains many similar samples with different targets. Then there is no clear functional dependence of the targets on the inputs, and models tend to predict the mean target; the sketch below quantifies this.
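The flattening can be quantified by fitting a straight line to the point cloud; a slope clearly below one confirms the shrinkage toward the mean. A sketch using NumPy:

# Least-squares line through (true, predicted) pairs;
# a slope below 1 indicates predictions shrunk toward the mean.
slope, intercept = np.polyfit(y_test, y_test_pred, deg=1)
print('slope:', slope)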

To further investigate this issue we should look at the predictions on the training set. If we are right, then predictions on the training set should show similar behavior (predictions close to mean).

y_train_pred = pipe.predict(X_train)

fig, ax = plt.subplots()
ax.plot(y_train, y_train_pred, 'or', markersize=3)
ax.plot([9, 17], [9, 17], '-b')
ax.set_xlabel('true targets')
ax.set_ylabel('predictions')
ax.set_aspect('equal')
plt.show()
[Figure: predicted versus true targets on the training set]

Again we see a slight rotation. To summarize: our input data carries too little detail to explain the targets. There are similar inputs with different targets, leading to underestimation of high values and overestimation of low values. The only way out is gathering more data, either by dropping fewer columns or by getting relevant data from additional sources. We will come back to this issue soon.

Predictions#

When using our model for predicting house prices, we have to keep in mind that we transformed some of the input data. All those transforms have to be applied to new inputs, too, and the prediction has to be transformed back with the exponential function.

living_space = 80
lot = 3600
rooms = 5
bathrooms = 0
floors = 2
year_built = 1948
year_renovated = 1948
garages = 2
condition_codes = 7    # 0 = 'first occupation', 7 = 'dilapidated'
type_corner_house = 0
type_duplex = 0
type_farmhouse = 1
type_midterrace_house = 0
type_multiple_dwelling = 0
type_residential_property = 0
type_single_dwelling = 0
type_special_property = 0
type_villa = 0
garage_type_garage = 0
garage_type_outside_parking_lot = 1
garage_type_underground_parking_lot = 0

X = np.asarray([np.log(living_space), np.log(lot), rooms, bathrooms, floors,
                np.log(2021 - year_built), np.log(2021 - year_renovated),
                garages, condition_codes, type_corner_house, type_duplex, type_farmhouse,
                type_midterrace_house, type_multiple_dwelling, type_residential_property,
                type_single_dwelling, type_special_property, type_villa,
                garage_type_garage, garage_type_outside_parking_lot,
                garage_type_underground_parking_lot]).reshape(1, -1)

y = np.exp(pipe.predict(X))

print('predicted price: {:.0f} EUR'.format(y[0]))
predicted price: 115455 EUR
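The entries of X have to appear in exactly the column order that was used for building the training matrix. When in doubt, that order can be read off from the preprocessed data frame:

# Column order expected by the model (target and unused columns removed).
print(list(data.drop(columns=['Price', 'Condition', 'State', 'City', 'Place']).columns))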