Worked Example: House Prices I#

To put the supervised learning techniques discussed so far to the test, we train a model for predicting house prices in Germany. The inputs are properties of a house and of the plot of land it has been built on; the output is the selling price.

Training data exists in the form of advertisements on specialized websites for finding a buyer for a house. In principle we could scrape data from such a website, but this usually is not allowed by the website operator, and we would have to write lots of code. Erdogan Seref (unreachable in 2023) already did this job at www.immobilienscout24.de and published the data set at www.kaggle.com (unreachable in 2023) under an Attribution-NonCommercial-ShareAlike 4.0 International License.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn.linear_model as linear_model
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.preprocessing as preprocessing
import sklearn.pipeline as pipeline

data_path = 'german_housing.csv'

The Data Set#

First we load the data set and try to get an overview of the features and the data quality.

data = pd.read_csv(data_path)

If a data frame has many columns, Pandas by default does not show all of them. But we want to see all columns. Thus, we adjust the number of columns shown by setting the corresponding option to None (that is, unlimited).

pd.set_option('display.max_columns', None)
data.head(10)
Unnamed: 0 Price Type Living_space Lot Usable_area Free_of_Relation Rooms Bedrooms Bathrooms Floors Year_built Furnishing_quality Year_renovated Condition Heating Energy_source Energy_certificate Energy_certificate_type Energy_consumption Energy_efficiency_class State City Place Garages Garagetype
0 0 498000.0 Multiple dwelling 106.00 229.0 NaN 01.10.2020 5.5 3.0 1.0 2.0 2005.0 normal NaN modernized central heating Gas available demand certificate NaN D Baden-Württemberg Bodenseekreis Bermatingen 2.0 Parking lot
1 1 495000.0 Mid-terrace house 140.93 517.0 20.00 01.01.2021 6.0 3.0 2.0 NaN 1994.0 basic NaN modernized stove heating NaN not required by law NaN NaN NaN Baden-Württemberg Konstanz (Kreis) Engen 7.0 Parking lot
2 2 749000.0 Farmhouse 162.89 82.0 37.62 01.07.2020 5.0 3.0 2.0 4.0 2013.0 NaN NaN dilapidated stove heating Fernwärme, Bioenergie available demand certificate NaN B Baden-Württemberg Esslingen (Kreis) Ostfildern 1.0 Garage
3 3 259000.0 Farmhouse 140.00 814.0 NaN nach Vereinbarung 4.0 NaN 2.0 2.0 1900.0 basic 2000.0 fixer-upper central heating Strom available demand certificate NaN G Baden-Württemberg Waldshut (Kreis) Bonndorf im Schwarzwald 1.0 Garage
4 4 469000.0 Multiple dwelling 115.00 244.0 NaN sofort 4.5 2.0 1.0 NaN 1968.0 refined 2019.0 refurbished central heating Öl available demand certificate NaN F Baden-Württemberg Esslingen (Kreis) Leinfelden-Echterdingen 1.0 Garage
5 5 1400000.0 Mid-terrace house 310.00 860.0 100.00 sofort 8.0 NaN NaN 3.0 1969.0 basic NaN maintained NaN Öl available consumption certificate NaN NaN Baden-Württemberg Stuttgart Süd 2.0 Garage
6 6 3500000.0 Duplex 502.00 5300.0 163.16 nach Absprache 13.0 NaN 4.0 NaN 2004.0 basic NaN dilapidated stove heating Erdwärme, Holzpellets available consumption certificate 35.9 A Baden-Württemberg Göppingen (Kreis) Wangen 7.0 Parking lot
7 7 630000.0 Duplex 263.00 406.0 118.00 01.04.2020 10.0 NaN NaN 3.0 1989.0 basic NaN modernized stove heating Gas available demand certificate NaN E Baden-Württemberg Freiburg im Breisgau Munzingen 2.0 Garage
8 8 364000.0 Duplex 227.00 973.0 83.00 nach Absprache 10.0 4.0 4.0 2.0 1809.0 normal 2015.0 modernized central heating Strom available consumption certificate 183.1 F Baden-Württemberg Enzkreis Neuenbürg 8.0 Parking lot
9 9 1900000.0 Duplex 787.00 933.0 NaN NaN 30.0 NaN NaN 3.0 1920.0 basic NaN modernized stove heating Gas, Fernwärme-Dampf available demand certificate NaN D Baden-Württemberg Mannheim Rheinau 12.0 Parking lot
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10552 entries, 0 to 10551
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               10552 non-null  int64  
 1   Price                    10552 non-null  float64
 2   Type                     10150 non-null  object 
 3   Living_space             10552 non-null  float64
 4   Lot                      10552 non-null  float64
 5   Usable_area              5568 non-null   float64
 6   Free_of_Relation         6983 non-null   object 
 7   Rooms                    10552 non-null  float64
 8   Bedrooms                 6878 non-null   float64
 9   Bathrooms                8751 non-null   float64
 10  Floors                   7888 non-null   float64
 11  Year_built               9858 non-null   float64
 12  Furnishing_quality       7826 non-null   object 
 13  Year_renovated           5349 non-null   float64
 14  Condition                10229 non-null  object 
 15  Heating                  9968 non-null   object 
 16  Energy_source            9325 non-null   object 
 17  Energy_certificate       9797 non-null   object 
 18  Energy_certificate_type  7026 non-null   object 
 19  Energy_consumption       2433 non-null   float64
 20  Energy_efficiency_class  5733 non-null   object 
 21  State                    10551 non-null  object 
 22  City                     10551 non-null  object 
 23  Place                    10262 non-null  object 
 24  Garages                  8592 non-null   float64
 25  Garagetype               8592 non-null   object 
dtypes: float64(12), int64(1), object(13)
memory usage: 2.1+ MB

We should drop irrelevant columns and adjust data types.

  • Unnamed: 0: Seems to be an integer index. We don’t need it, so drop it.

  • Price: This is our target variable.

  • Type: An important column, because house prices are likely to depend on the type of house. We should convert this to categorical type.

  • Living_space and Lot: Important features, keep them.

  • Usable_area: Likely to have influence on the selling price, but available only for half the samples. If we wanted to use this column for regression, we would have to drop half the training samples. Alternatively, we could impute values, but it is very hard to guess the usable area from other features. We should drop the column.

  • Free_of_Relation: Not related to the selling price. Drop it.

  • Rooms, Bedrooms, Bathrooms: Should have influence on prices, but not available for all samples. For the moment we keep all three columns. Later we should have a look at the correlations between the three columns and possibly keep only the first one, which is available for all samples.

  • Floors: Important feature, keep it.

  • Year_built: Important feature, keep it.

  • Furnishing_quality: Important, convert to categorical and keep.

  • Year_renovated: Important, but half the data is missing. There is a good chance that a missing value indicates that the house has not been renovated so far. Thus, a reasonable fill value is the year of construction.

  • Condition: Important, convert to categorical and keep.

  • Heating and Energy_source: Could be important, convert to categorical and keep.

  • Energy_certificate, Energy_certificate_type, Energy_consumption: The first contains more or less only the value 'available' (since energy certificates are required by law). The second is irrelevant and the third is missing for most samples. Drop them all.

  • Energy_efficiency_class: Likely to have influence on the selling price, although the classification procedure is rather unreliable in practice. Keep it and convert to categorical.

  • State, City, Place: Geolocation surely influences selling prices. But it’s hard to use location data for regression. For the moment we keep these columns.

  • Garages: Could be important, keep.

  • Garagetype: If we keep Garages, then we also have to keep this column. Convert to categorical and rename it to Garage_type to fit the naming convention used for the other columns.

data = data.drop(columns=['Unnamed: 0', 'Usable_area', 'Free_of_Relation',
                          'Energy_certificate', 'Energy_certificate_type', 'Energy_consumption'])

data['Type'] = data['Type'].astype('category')
data['Furnishing_quality'] = data['Furnishing_quality'].astype('category')
data['Condition'] = data['Condition'].astype('category')
data['Heating'] = data['Heating'].astype('category')
data['Energy_source'] = data['Energy_source'].astype('category')
data['Energy_efficiency_class'] = data['Energy_efficiency_class'].astype('category')
data['Garagetype'] = data['Garagetype'].astype('category')

data = data.rename(columns={'Garagetype': 'Garage_type'})

nan_mask = data['Year_renovated'].isna()
data.loc[nan_mask, 'Year_renovated'] = data.loc[nan_mask, 'Year_built']
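The same fill can be written more compactly with fillna, which aligns the two columns by their index. A sketch equivalent to the mask-based assignment above:

# Equivalent one-liner: fill missing renovation years with the construction year.
data['Year_renovated'] = data['Year_renovated'].fillna(data['Year_built'])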

The categorical columns Furnishing_quality, Condition, and Energy_efficiency_class have a natural ordering, which should be represented by the data type.

print(data['Furnishing_quality'].cat.categories)
print(data['Condition'].cat.categories)
print(data['Energy_efficiency_class'].cat.categories)
Index(['basic', 'luxus', 'normal', 'refined'], dtype='object')
Index(['as new', 'by arrangement', 'dilapidated', 'first occupation',
       'first occupation after refurbishment', 'fixer-upper', 'maintained',
       'modernized', 'refurbished', 'renovated'],
      dtype='object')
Index([' A ', ' A+ ', ' B ', ' C ', ' D ', ' E ', ' F ', ' G ', ' H '], dtype='object')

We should rename some categories and order them as well as possible.

data['Furnishing_quality'] = data['Furnishing_quality'] \
    .cat.rename_categories({'luxus': 'luxury'}) \
    .cat.reorder_categories(['basic', 'normal', 'refined', 'luxury'])

data['Condition'] = data['Condition'].cat.reorder_categories([
    'first occupation',
     'first occupation after refurbishment',
     'as new',
     'maintained',
     'renovated',
     'modernized',
     'refurbished',
     'by arrangement',
     'fixer-upper',
     'dilapidated'
])

data['Energy_efficiency_class'] = data['Energy_efficiency_class'] \
    .cat.rename_categories({
        ' A ': 'A',
        ' A+ ': 'A+',
        ' B ': 'B',
        ' C ': 'C',
        ' D ': 'D',
        ' E ': 'E',
        ' F ': 'F',
        ' G ': 'G',
        ' H ': 'H'
    }) \
    .cat.reorder_categories(['A+', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])
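As an aside, reorder_categories on its own does not mark the dtype as ordered, so comparisons like < still raise an error. If we wanted such comparisons to respect the category order, the dtype could be declared as ordered explicitly; a minimal sketch on toy data:

# Toy example: an explicitly ordered categorical supports order comparisons.
quality = pd.Series(['basic', 'luxury', 'normal'],
                    dtype=pd.CategoricalDtype(['basic', 'normal', 'refined', 'luxury'],
                                              ordered=True))
print(quality < 'refined')    # elementwise comparison respecting the declared order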

Now let’s see how many complete samples we have.

len(data.dropna())
1591

That’s very few. So we should drop some columns with many missing values.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10552 entries, 0 to 10551
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   Price                    10552 non-null  float64 
 1   Type                     10150 non-null  category
 2   Living_space             10552 non-null  float64 
 3   Lot                      10552 non-null  float64 
 4   Rooms                    10552 non-null  float64 
 5   Bedrooms                 6878 non-null   float64 
 6   Bathrooms                8751 non-null   float64 
 7   Floors                   7888 non-null   float64 
 8   Year_built               9858 non-null   float64 
 9   Furnishing_quality       7826 non-null   category
 10  Year_renovated           10211 non-null  float64 
 11  Condition                10229 non-null  category
 12  Heating                  9968 non-null   category
 13  Energy_source            9325 non-null   category
 14  Energy_efficiency_class  5733 non-null   category
 15  State                    10551 non-null  object  
 16  City                     10551 non-null  object  
 17  Place                    10262 non-null  object  
 18  Garages                  8592 non-null   float64 
 19  Garage_type              8592 non-null   category
dtypes: category(7), float64(10), object(3)
memory usage: 1.1+ MB

Energy_efficiency_class is relatively unreliable and not too important for selling prices, so we check how many complete samples dropping it would yield.

len(data.drop(columns=['Energy_efficiency_class']).dropna())
2615

Better, but not good. The Bedrooms column has many missing values, too, and it is likely to be correlated with Rooms. So let's look at the correlations between Rooms, Bedrooms, Bathrooms, and Floors.

sns.pairplot(data[['Rooms', 'Bedrooms', 'Bathrooms', 'Floors']], plot_kws={"s": 5})
plt.show()
[Figure: pair plot of Rooms, Bedrooms, Bathrooms, and Floors]
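The visual impression can be checked numerically, too. Pandas computes pairwise correlation coefficients and ignores missing values; a short sketch:

# Pearson correlation coefficients for the four count-like columns.
print(data[['Rooms', 'Bedrooms', 'Bathrooms', 'Floors']].corr())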

Floors is not correlated with the other columns, so keep it. Bedrooms shows correlation with Rooms and Bathrooms, so drop Bedrooms. Bathrooms shows some correlation with Rooms. Whether to drop Bathrooms should be decided by the resulting increase in sample count.

len(data.drop(columns=['Energy_efficiency_class', 'Bedrooms']).dropna())
3174
len(data.drop(columns=['Energy_efficiency_class', 'Bedrooms', 'Bathrooms']).dropna())
3479

We should keep Bathrooms, because dropping it yields only about 300 more samples while neglecting possibly important information. Note that the number of bathrooms can be regarded as a measure of overall furnishing quality. Thus, there should be some correlation with Furnishing_quality.

data.groupby('Furnishing_quality')['Bathrooms'].mean()
Furnishing_quality
basic      2.207496
normal     2.363720
refined    1.826739
luxury     2.778761
Name: Bathrooms, dtype: float64

The group means do not show a clear relation, however, and judging the furnishing quality of a house is highly subjective anyway. Thus, we should drop the column to get more samples without missing data.

len(data.drop(columns=['Energy_efficiency_class', 'Bedrooms', 'Furnishing_quality']).dropna())
4526

Energy_source is another candidate for dropping, because it has more than 1000 missing values and its influence on selling prices should be rather low.

for cat in data['Energy_source'].cat.categories:
    print(cat)
 Bioenergie 
 Erdgas leicht 
 Erdgas leicht, Erdgas schwer 
 Erdgas schwer 
 Erdgas schwer, Bioenergie 
 Erdgas schwer, Holz 
 Erdwärme 
 Erdwärme, Fernwärme 
 Erdwärme, Gas 
 Erdwärme, Holzpellets 
 Erdwärme, Solar 
 Erdwärme, Solar, Holzpellets, Holz 
 Erdwärme, Solar, Umweltwärme 
 Erdwärme, Strom 
 Erdwärme, Umweltwärme 
 Fernwärme 
 Fernwärme, Bioenergie 
 Fernwärme, Flüssiggas 
 Fernwärme, Nahwärme, KWK fossil 
 Fernwärme-Dampf 
 Flüssiggas 
 Flüssiggas, Holz 
 Gas 
 Gas, Bioenergie 
 Gas, Fernwärme 
 Gas, Fernwärme-Dampf 
 Gas, Holz 
 Gas, Holz-Hackschnitzel 
 Gas, KWK fossil 
 Gas, Kohle, Holz 
 Gas, Strom 
 Gas, Strom, Holz 
 Gas, Strom, Kohle, Holz 
 Gas, Wasserenergie 
 Gas, Öl 
 Gas, Öl, Holz 
 Gas, Öl, Kohle 
 Gas, Öl, Kohle, Holz 
 Gas, Öl, Strom 
 Holz 
 Holz, Bioenergie 
 Holz-Hackschnitzel 
 Holzpellets 
 Holzpellets, Gas 
 Holzpellets, Gas, Öl 
 Holzpellets, Holz 
 Holzpellets, Holz-Hackschnitzel 
 Holzpellets, Kohle, Holz 
 Holzpellets, Strom 
 Holzpellets, Öl 
 KWK erneuerbar 
 KWK fossil 
 KWK regenerativ 
 Kohle 
 Kohle, Holz 
 Kohle/Koks 
 Nahwärme 
 Solar 
 Solar, Bioenergie 
 Solar, Erdgas schwer 
 Solar, Gas 
 Solar, Gas, Holz 
 Solar, Gas, Strom 
 Solar, Gas, Strom, Holz 
 Solar, Gas, Wasserenergie 
 Solar, Gas, Öl 
 Solar, Gas, Öl, Holz 
 Solar, Holz 
 Solar, Holz-Hackschnitzel 
 Solar, Holzpellets 
 Solar, Holzpellets, Holz 
 Solar, Holzpellets, Strom 
 Solar, Holzpellets, Öl 
 Solar, Strom 
 Solar, Strom, Bioenergie 
 Solar, Umweltwärme 
 Solar, Öl 
 Solar, Öl, Bioenergie 
 Solar, Öl, Holz 
 Solar, Öl, Holz-Hackschnitzel 
 Solar, Öl, Strom 
 Solar, Öl, Strom, KWK fossil 
 Strom 
 Strom, Bioenergie 
 Strom, Flüssiggas 
 Strom, Holz 
 Strom, Holz-Hackschnitzel 
 Strom, Kohle 
 Strom, Kohle, Holz 
 Strom, Umweltwärme 
 Umweltwärme 
 Wasserenergie 
 Windenergie 
 Wärmelieferung 
 Öl 
 Öl, Bioenergie 
 Öl, Fernwärme 
 Öl, Holz 
 Öl, Kohle 
 Öl, Kohle, Holz 
 Öl, Strom 
 Öl, Strom, Holz 
 Öl, Strom, Kohle, Holz 
 Öl, Umweltwärme 

The values are very diverse and hard to preprocess for regression. We would have to convert the column to several boolean columns, one per energy source; a sketch follows below. In addition, some grouping would be necessary (Holz is a subcategory of Bioenergie, and so on).
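For illustration only (we are going to drop the column instead), such a conversion could start from Pandas' string methods; note that the raw categories carry surrounding spaces:

# Sketch only: one boolean indicator column per energy source (grouping omitted).
sources = data['Energy_source'].astype('string').str.strip()
indicators = sources.str.get_dummies(sep=', ')
print(indicators.shape)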

len(data.drop(columns=['Energy_efficiency_class', 'Bedrooms', 'Furnishing_quality', 'Energy_source']).dropna())
4854

Now we have almost 5000 complete samples. That should be a good compromise between completeness and level of detail.

data = data.drop(columns=['Energy_efficiency_class', 'Bedrooms', 'Furnishing_quality', 'Energy_source'])
data = data.dropna()

Outliers and Further Preprocessing#

Now that we have a cleaned data set, we should remove outliers. The simplest method of detecting outliers is to look at the ranges of all features. With describe we get a first overview of the numerical features.

data.describe()
Price Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Garages
count 4.854000e+03 4854.000000 4854.000000 4854.000000 4854.000000 4854.000000 4854.000000 4854.000000 4854.000000
mean 5.739566e+05 209.305740 1240.636904 7.051504 2.316028 2.256696 1964.252369 1995.626700 2.518541
std 5.880211e+05 118.252688 3806.518099 3.834865 1.595327 0.776769 49.065052 35.389067 2.719901
min 0.000000e+00 0.000000 0.000000 1.000000 0.000000 0.000000 1430.000000 1430.000000 1.000000
25% 2.800000e+05 135.000000 401.000000 5.000000 1.000000 2.000000 1950.000000 1991.000000 1.000000
50% 4.400000e+05 180.000000 675.000000 6.000000 2.000000 2.000000 1974.000000 2008.000000 2.000000
75% 6.850000e+05 248.000000 1042.000000 8.000000 3.000000 3.000000 1997.750000 2016.000000 3.000000
max 1.300000e+07 1742.240000 143432.000000 84.000000 26.000000 8.000000 2021.000000 2206.000000 65.000000

Price Column#

sns.histplot(data['Price'])
plt.show()
[Figure: histogram of Price]

There are only very few high prices, and the price distribution concentrates on low prices. If the target variable has a wide range, but most samples concentrate on a small portion of that range, then ‘learning’ the target is much more difficult than for more uniformly distributed data.

A common trick is to use nonlinear scaling. Especially for market prices it is known from experience that they follow a log-normal distribution, that is, after applying the logarithm we see a normal distribution. Before applying the logarithm we should drop samples with zeros in the Price column to avoid undefined results. A price of zero indicates that the seller did not provide a price in the advertisement. Thus, dropping such samples is a good idea even if we did not want to apply the logarithm.

data = data.loc[data['Price'] > 0, :]
sns.histplot(np.log(data['Price'].to_numpy()))
plt.show()
[Figure: histogram of log-transformed Price]

There seem to be some very small values.

data.loc[data['Price'] <= np.exp(7), :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
6987 1.0 Duplex 459.0 2742.0 23.0 8.0 3.0 1957.0 1957.0 refurbished stove heating Nordrhein-Westfalen Märkischer Kreis Altena 1.0 Garage

Those samples should be dropped because house prices below \(\mathrm{e}^7\approx 1000\) EUR are very uncommon.

data['Price'] = np.log(data['Price'].to_numpy())
data = data.loc[data['Price'] > 7, :]

Living_space Column#

sns.histplot(data['Living_space'])
plt.show()
[Figure: histogram of Living_space]

Same here as for Price.

data = data.loc[data['Living_space'] > 0, :]
sns.histplot(np.log(data['Living_space'].to_numpy()))
plt.show()
[Figure: histogram of log-transformed Living_space]
data['Living_space'] = np.log(data['Living_space'].to_numpy())

Lot Column#

sns.histplot(data['Lot'])
plt.show()
[Figure: histogram of Lot]

Same here as for Price again.

data = data.loc[data['Lot'] > 0, :]
sns.histplot(np.log(data['Lot'].to_numpy()))
plt.show()
[Figure: histogram of log-transformed Lot]
data.loc[data['Lot'] <= np.exp(2), :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
722 11.492723 Mid-terrace house 4.491441 0.25 4.0 1.0 3.0 1900.0 1900.0 modernized wood-pellet heating Baden-Württemberg Main-Tauber-Kreis Ahorn 1.0 Garage
1876 12.860362 Residential property 5.342334 1.00 5.0 1.0 1.0 2019.0 2019.0 first occupation stove heating Bayern Regensburg (Kreis) Hemau 1.0 Outside parking lot
6435 11.707834 Duplex 5.056246 1.96 5.0 2.0 3.0 1905.0 1981.0 refurbished stove heating Nordrhein-Westfalen Höxter (Kreis) Höxter 3.0 Outside parking lot
7887 12.594731 Duplex 5.857933 1.00 8.0 5.0 3.0 1965.0 2019.0 modernized stove heating Rheinland-Pfalz Neuwied (Kreis) Dierdorf 1.0 Outside parking lot

Lot size below \(\mathrm{e}^2<8\) m² is very unlikely.

data['Lot'] = np.log(data['Lot'].to_numpy())
data = data.loc[data['Lot'] > 2, :]
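By now the pattern ‘drop nonpositive values, take the logarithm, drop implausibly small values’ has repeated three times, so it could be factored into a small helper. A sketch with hypothetical names; the following sections keep the explicit steps for clarity:

def log_scale(df, column, min_log_value=None):
    # Drop nonpositive entries, replace the column by its logarithm,
    # and optionally drop rows below a minimal log value.
    df = df.loc[df[column] > 0, :].copy()
    df[column] = np.log(df[column].to_numpy())
    if min_log_value is not None:
        df = df.loc[df[column] > min_log_value, :]
    return df

# Usage would be, e.g.: data = log_scale(data, 'Lot', min_log_value=2)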

Rooms Column#

sns.histplot(data['Rooms'])
plt.show()
[Figure: histogram of Rooms]
data.loc[data['Rooms'] >= 30, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
1863 14.430696 Residential property 6.371612 7.578145 41.0 21.0 3.0 1907.0 2019.0 maintained stove heating Bayern Eichstätt (Kreis) Dollnstein 16.0 Outside parking lot
4322 14.346139 Single dwelling 6.745236 7.783224 30.0 6.0 3.0 2009.0 2009.0 dilapidated stove heating Hessen Limburg-Weilburg (Kreis) Hadamar 4.0 Outside parking lot
4346 15.150512 Duplex 6.690842 6.311735 36.0 13.0 5.0 1961.0 2016.0 modernized stove heating Hessen Frankfurt am Main Rödelheim 3.0 Garage
4557 12.896717 Duplex 6.653534 7.970395 36.0 15.0 3.0 1970.0 2000.0 maintained stove heating Mecklenburg-Vorpommern Demmin (Kreis) Rosenow 1.0 Outside parking lot
4614 13.906265 Duplex 7.306531 8.234034 84.0 24.0 5.0 1991.0 2018.0 modernized stove heating Mecklenburg-Vorpommern Demmin (Kreis) Kruckow 16.0 Outside parking lot
4615 13.906265 Duplex 7.306531 8.234034 84.0 24.0 5.0 1989.0 2018.0 modernized stove heating Mecklenburg-Vorpommern Demmin (Kreis) Kruckow 16.0 Outside parking lot
6895 13.639966 Duplex 6.516193 7.090077 32.0 8.0 3.0 2003.0 2003.0 dilapidated electric heating Nordrhein-Westfalen Gelsenkirchen Hassel 8.0 Parking lot
7397 13.963931 Duplex 7.057898 6.952729 45.0 20.0 4.0 1950.0 1950.0 maintained heat pump Nordrhein-Westfalen Gelsenkirchen Horst 12.0 Outside parking lot
7904 13.805460 Duplex 6.682109 8.505121 35.0 13.0 3.0 2004.0 2019.0 first occupation after refurbishment stove heating Rheinland-Pfalz Trier-Saarburg (Kreis) Saarburg 15.0 Outside parking lot
8705 14.506155 Bungalow 7.003065 8.575462 40.0 0.0 3.0 2012.0 2012.0 fixer-upper heat pump Rheinland-Pfalz Vulkaneifel (Kreis) Gerolstein 40.0 Parking lot

There are only very few samples with a high number of rooms. There is no chance to get good predictions for such houses from those few samples.

data = data.loc[data['Rooms'] < 30, :]

Bathrooms Column#

sns.histplot(data['Bathrooms'])
plt.show()
[Figure: histogram of Bathrooms]
data.loc[data['Bathrooms'] >= 15, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
3270 15.264780 Duplex 7.462927 7.693605 26.0 26.0 4.0 1994.0 1994.0 modernized heat pump Brandenburg Teltow-Fläming (Kreis) Blankenfelde-Mahlow 26.0 Duplex lot
6006 13.457406 Bungalow 6.514713 8.750366 25.0 25.0 3.0 1874.0 2016.0 renovated cogeneration units Niedersachsen Lüchow-Dannenberg (Kreis) Küsten 30.0 Outside parking lot
7224 14.038654 Duplex 6.476972 6.063785 18.0 18.0 4.0 1962.0 1962.0 modernized stove heating Nordrhein-Westfalen Herford (Kreis) Herford 5.0 Garage
10088 12.037654 Multiple dwelling 4.499810 5.777652 4.0 22.0 2.0 1961.0 2000.0 modernized heat pump Schleswig-Holstein Schleswig-Flensburg (Kreis) Tarp 1.0 Outside parking lot
data = data.loc[data['Bathrooms'] < 15, :]

Floors Column#

sns.histplot(data['Floors'])
plt.show()
[Figure: histogram of Floors]

Nothing to do here.

Year_built Column#

sns.histplot(data['Year_built'])
plt.show()
[Figure: histogram of Year_built]
data.loc[data['Year_built'] <= 1500, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
1311 14.580978 Corner house 6.182085 11.522876 13.0 7.0 3.0 1430.0 1430.0 fixer-upper stove heating Bayern Berchtesgadener Land (Kreis) Marktschellenberg 2.0 Carport
1458 14.077875 Duplex 6.522093 6.361302 19.0 9.0 3.0 1500.0 2014.0 modernized underfloor heating Bayern Ansbach Stadt 9.0 Carport
8174 12.203570 Bungalow 5.703782 5.634790 10.0 2.0 4.0 1492.0 1492.0 refurbished stove heating Rheinland-Pfalz Bad Kreuznach (Kreis) Bad Kreuznach 2.0 Garage
data = data.loc[data['Year_built'] > 1500, :]

Values above 2020 are obviously wrong (the data set is from 2020).

data.loc[data['Year_built'] > 2020, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
2437 13.623139 Single dwelling 4.897840 5.609472 4.5 1.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Rosenheim (Kreis) Kolbermoor 2.0 Parking lot
2515 13.704579 Single dwelling 4.875197 5.796058 4.0 1.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Ebersberg (Kreis) Zorneding 2.0 Parking lot
2516 13.704579 Single dwelling 4.875197 5.963579 4.0 1.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Ebersberg (Kreis) Zorneding 2.0 Parking lot
2519 13.928839 Mid-terrace house 4.962845 6.061457 4.5 2.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Rosenheim (Kreis) Kolbermoor 2.0 Parking lot
2520 13.981025 Mid-terrace house 4.962845 6.165418 4.5 2.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Rosenheim (Kreis) Kolbermoor 2.0 Parking lot
2521 13.652992 Single dwelling 4.897840 5.700444 4.5 1.0 3.0 2021.0 2021.0 dilapidated stove heating Bayern Rosenheim (Kreis) Kolbermoor 2.0 Parking lot
data = data.loc[data['Year_built'] <= 2020, :]

To get a better distribution of the samples over the range, we again apply a logarithmic transform, this time to the age of the house (2021 minus the year of construction).

data.loc[:, 'Year_built'] = np.log(2021 - data['Year_built'].to_numpy())
sns.histplot(data['Year_built'])
plt.show()
[Figure: histogram of log-transformed house age]

Year_renovated Column#

sns.histplot(data['Year_renovated'])
plt.show()
[Figure: histogram of Year_renovated]

There seem to be renovations before 1900, which looks somewhat strange. But remember that we filled missing values with values from Year_built. Values above 2020 are obviously wrong.

data.loc[data['Year_renovated'] > 2020, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
7803 12.992255 Duplex 6.969791 6.818924 26.0 13.0 2.0 4.025352 2026.0 by arrangement heat pump Rheinland-Pfalz Trier-Saarburg (Kreis) Reinsfeld 15.0 Parking lot
9324 12.971540 Mid-terrace house 5.075174 6.486161 6.0 2.0 3.0 4.727388 2206.0 modernized heat pump Sachsen Zwickau Nordvorstadt 2.0 Outside parking lot
data = data.loc[data['Year_renovated'] <= 2020, :]
data.loc[:, 'Year_renovated'] = np.log(2021 - data['Year_renovated'].to_numpy())
sns.histplot(data['Year_renovated'])
plt.show()
[Figure: histogram of log-transformed years since renovation]

Garages Column#

sns.histplot(data['Garages'])
plt.show()
[Figure: histogram of Garages]
data.loc[data['Garages'] >= 20, :]
Price Type Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition Heating State City Place Garages Garage_type
461 14.343193 Bungalow 5.347108 8.932609 8.0 2.0 3.0 3.332205 3.332205 modernized stove heating Baden-Württemberg Ludwigsburg (Kreis) Vaihingen an der Enz 30.0 Outside parking lot
2579 14.220976 Duplex 6.278521 8.027150 15.0 4.0 2.0 3.761200 2.397895 modernized underfloor heating Bayern Rottal-Inn (Kreis) Eggenfelden 22.0 Parking lot
3761 15.009130 Bungalow 6.476972 8.116716 10.0 2.0 3.0 5.293305 1.609438 renovated heat pump Hessen Frankfurt am Main Niederursel 20.0 Outside parking lot
3932 13.384728 Duplex 5.375278 8.809714 15.0 7.0 4.0 4.510860 4.510860 maintained stove heating Hessen Darmstadt-Dieburg (Kreis) Otzberg 60.0 Outside parking lot
4175 14.066269 Duplex 5.777652 6.884487 11.0 5.0 2.0 3.496508 0.000000 modernized stove heating Hessen Limburg-Weilburg (Kreis) Bad Camberg 21.0 Outside parking lot
4243 11.407565 Duplex 3.058707 7.342132 1.0 1.0 4.0 3.912023 3.912023 modernized heat pump Hessen Main-Taunus-Kreis Hattersheim am Main 30.0 Parking lot
7166 13.102161 Bungalow 7.003065 9.239899 12.0 3.0 1.0 3.970292 2.639057 modernized gas heating Nordrhein-Westfalen Minden-Lübbecke (Kreis) Espelkamp 65.0 Parking lot
7870 12.959844 Corner house 5.828946 10.596635 14.0 2.0 3.0 4.795791 1.609438 dilapidated stove heating Rheinland-Pfalz Altenkirchen (Westerwald) (Kreis) Birnbach 50.0 Outside parking lot
9535 13.910821 Duplex 6.415097 8.097122 22.0 8.0 3.0 3.091042 3.091042 modernized stove heating Sachsen-Anhalt Salzlandkreis Bernburg (Saale) 58.0 Outside parking lot
data = data.loc[data['Garages'] < 20, :]

Type Column#

data['Type'].value_counts()
Mid-terrace house       2198
Duplex                   868
Single dwelling          565
Farmhouse                265
Villa                    213
Multiple dwelling        209
Special property         168
Residential property     124
Bungalow                 113
Corner house              78
Castle                     2
Name: Type, dtype: int64
data = data.loc[data['Type'] != 'Castle', :]
data['Type'] = data['Type'].cat.remove_categories('Castle')
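Equivalently, remove_unused_categories drops all categories without remaining samples, which saves us from naming them explicitly:

# Equivalent alternative: drop every category that no longer occurs.
data['Type'] = data['Type'].cat.remove_unused_categories()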

Condition Column#

data['Condition'].value_counts()
modernized                              2146
dilapidated                              663
refurbished                              565
renovated                                509
maintained                               361
fixer-upper                              268
first occupation after refurbishment     210
first occupation                          50
by arrangement                            26
as new                                     3
Name: Condition, dtype: int64

We should remove 'as new' and 'by arrangement' because only a few samples use these categories and both are somewhat dubious.

data = data.loc[~data['Condition'].isin(['as new', 'by arrangement']), :]
data['Condition'] = data['Condition'].cat.remove_categories(['as new', 'by arrangement'])

Heating Column#

data['Heating'].value_counts()
stove heating           2900
heat pump                563
oil heating              420
central heating          248
underfloor heating       162
night storage heater     138
district heating         117
wood-pellet heating       62
floor heating             55
electric heating          52
gas heating               32
cogeneration units        13
solar heating             10
Name: Heating, dtype: int64

Something is wrong here! More than every second house sold in 2020 has stove heating? And what about 'floor heating'? Is it gas powered, oil powered, or something else? What is the difference between 'floor heating' and 'underfloor heating'? It's better to drop this column.

data = data.drop(columns=['Heating'])

Garage_type Column#

data['Garage_type'].value_counts()
Garage                     2647
Outside parking lot         897
Parking lot                 739
Carport                     409
Underground parking lot      53
Duplex lot                   26
Car park lot                  1
Name: Garage_type, dtype: int64

There are several similar categories. We should merge some of them.

data.loc[data['Garage_type'] == 'Car park lot', 'Garage_type'] = 'Outside parking lot'
data.loc[data['Garage_type'] == 'Duplex lot', 'Garage_type'] = 'Outside parking lot'
data.loc[data['Garage_type'] == 'Parking lot', 'Garage_type'] = 'Outside parking lot'
data['Garage_type'] = data['Garage_type'].cat.remove_categories(['Car park lot', 'Duplex lot', 'Parking lot'])

data['Garage_type'].value_counts()
Garage                     2647
Outside parking lot        1663
Carport                     409
Underground parking lot      53
Name: Garage_type, dtype: int64

Save Cleaned Data#

We save the cleaned data for future use.

data.to_csv(data_path.replace('.csv', '_preprocessed.csv'))
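Note that to_csv writes the row index as an additional column by default, which presumably is how the Unnamed: 0 column in the raw file came about. Passing index=False avoids this:

# Variant without writing the integer index as an extra column.
data.to_csv(data_path.replace('.csv', '_preprocessed.csv'), index=False)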

Linear Regression#

Now the data is almost ready for training a model. It remains to convert categorical data to numerical data. Condition is ordered, and its numeric representation is accessible with Series.cat.codes. The columns Type and Garage_type should be one-hot encoded; drop_first=True drops one redundant indicator column per feature.

data['Condition_codes'] = data['Condition'].cat.codes
data = pd.get_dummies(data, columns=['Type', 'Garage_type'], drop_first=True)
data.head()
Price Living_space Lot Rooms Bathrooms Floors Year_built Year_renovated Condition State City Place Garages Condition_codes Type_Corner house Type_Duplex Type_Farmhouse Type_Mid-terrace house Type_Multiple dwelling Type_Residential property Type_Single dwelling Type_Special property Type_Villa Garage_type_Garage Garage_type_Outside parking lot Garage_type_Underground parking lot
0 13.118355 4.663439 5.433722 5.5 1.0 2.0 2.772589 2.772589 modernized Baden-Württemberg Bodenseekreis Bermatingen 2.0 4 0 0 0 0 1 0 0 0 0 0 1 0
2 13.526494 5.093075 4.406719 5.0 2.0 4.0 2.079442 2.079442 dilapidated Baden-Württemberg Esslingen (Kreis) Ostfildern 1.0 7 0 0 1 0 0 0 0 0 0 1 0 0
3 12.464583 4.941642 6.701960 4.0 2.0 2.0 4.795791 3.044522 fixer-upper Baden-Württemberg Waldshut (Kreis) Bonndorf im Schwarzwald 1.0 6 0 0 1 0 0 0 0 0 0 1 0 0
8 12.804909 5.424950 6.880384 10.0 4.0 2.0 5.356586 1.791759 modernized Baden-Württemberg Enzkreis Neuenbürg 8.0 4 0 1 0 0 0 0 0 0 0 0 1 0
10 14.375126 5.347108 7.286192 6.0 2.0 3.0 4.406719 1.945910 modernized Baden-Württemberg Stuttgart Schönberg 2.0 4 0 0 0 1 0 0 0 0 0 1 0 0

We drop columns not used for regression and convert the data frame to NumPy arrays suitable for Scikit-Learn.

y = data['Price'].to_numpy()
X = data.drop(columns=['Price', 'Condition', 'State', 'City', 'Place']).to_numpy()

print(X.shape, y.shape)
(4772, 21) (4772,)

We have relatively little data. Thus, the test set should be small so that more samples remain for training.

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
print(y_train.size, y_test.size)
3817 955
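Note that the split is random, so repeated runs yield different training and test sets. For reproducible experiments a seed can be fixed; a sketch:

# Reproducible variant of the split above.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0)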

We use polynomial regression with ridge regularization and choose the polynomial degree and the regularization parameter by a cross-validated grid search. Note that alpha equal to 0 in the grid corresponds to plain (unregularized) least squares.

steps = [('poly', preprocessing.PolynomialFeatures()),
         ('ridge', linear_model.Ridge())]

pipe = pipeline.Pipeline(steps)

param_grid = {'poly__degree': [1, 2, 3],
              'ridge__alpha': [0] + [2 ** k for k in range(5, 15)]}

gs = model_selection.GridSearchCV(pipe, param_grid=param_grid,
                                  scoring='neg_mean_squared_error', n_jobs=-1, cv=5)

gs.fit(X_train, y_train)
best_params = gs.best_params_
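The cross-validated score of the best parameter combination is available as well. Since scoring is the negative mean squared error, flipping the sign and taking the square root gives a rough RMSE estimate on the validation folds:

# Approximate cross-validation RMSE of the best parameter combination.
print('CV RMSE:', np.sqrt(-gs.best_score_))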

Evaluating the Model#

Now we use the test set to evaluate the prediction quality of the model. Since GridSearchCV by default refits the best model on the whole training set (refit=True), gs.predict would work as well; here we refit the pipeline explicitly with the best parameters.

print(best_params)

pipe.set_params(**best_params)
pipe.fit(X_train, y_train)

y_test_pred = pipe.predict(X_test)
{'poly__degree': 2, 'ridge__alpha': 64}

The root mean squared error between predicted and exact targets on its own does not tell much about the fitting quality. We have to compare the value to the standard deviation of the targets. The standard deviation is the root mean squared error between the exact targets and their mean. In other words, the standard deviation tells us the prediction error if we used the same constant prediction for all inputs. Obviously the constant should be the mean of the training (!) targets, but the mean of the training targets should be very close to the mean of the test targets if the test samples have been selected randomly.

rmse = metrics.mean_squared_error(y_test, y_test_pred, squared=False)
sigma = np.std(y_test)
print('RMSE:', rmse)
print('standard deviation:', sigma)
print('ratio:', rmse / sigma)
RMSE: 0.549529269010328
standard deviation: 0.7553775997302192
ratio: 0.7274894955934497

We see that the model's predictions are better than a constant prediction, but not by much.

We should have a closer look at the predictions. Since there is no natural ordering in the set of samples, plotting y_test and y_test_pred over the sample index with plot does not help much.

fig, ax = plt.subplots()
ax.plot(y_test, '-b', label='true targets')
ax.plot(y_test_pred, '-r', label='predictions')
ax.legend()
plt.show()
[Figure: true targets and predictions plotted over the sample index]

A better idea is to plot y_test versus y_test_pred. If true and predicted targets are close, the points concentrate along the diagonal; otherwise they lie far away from it.

fig, ax = plt.subplots()
ax.plot(y_test, y_test_pred, 'or', markersize=3)
ax.plot([9, 17], [9, 17], '-b')
ax.set_xlabel('true targets')
ax.set_ylabel('predictions')
ax.set_aspect('equal')
plt.show()
[Figure: predicted versus true targets on the test set]

The red cloud is slightly rotated compared to the blue line: small target values get too high predictions, and high target values get too low predictions. In other words, predictions tend to be too close to the targets' mean. Such behavior is typically observed if the training data contains many similar samples with different targets. Then there is no clear functional dependence of the targets on the inputs, and models tend to predict the mean target; the sketch below quantifies this.
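The flattening can be quantified by fitting a straight line to the point cloud; a slope clearly below one confirms the shrinkage toward the mean. A sketch using NumPy:

# Least-squares line through (true, predicted) pairs;
# a slope below 1 indicates predictions shrunk toward the mean.
slope, intercept = np.polyfit(y_test, y_test_pred, deg=1)
print('slope:', slope)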

To further investigate this issue we should look at the predictions on the training set. If we are right, then predictions on the training set should show similar behavior (predictions close to mean).

y_train_pred = pipe.predict(X_train)

fig, ax = plt.subplots()
ax.plot(y_train, y_train_pred, 'or', markersize=3)
ax.plot([9, 17], [9, 17], '-b')
ax.set_xlabel('true targets')
ax.set_ylabel('predictions')
ax.set_aspect('equal')
plt.show()
[Figure: predicted versus true targets on the training set]

Again we see a slight rotation. To summarize: our input data carries too little detail to explain the targets. There are similar inputs with different targets, leading to underestimation of high values and overestimation of low values. The only way out is gathering more data, either by dropping fewer columns or by getting relevant data from additional sources. We will come back to this issue soon.

Predictions#

When using our model for predicting house prices, we have to keep in mind that we transformed some of the input data. All those transforms have to be applied to new inputs, too, and the prediction has to be transformed back with the exponential function.

living_space = 80
lot = 3600
rooms = 5
bathrooms = 0
floors = 2
year_built = 1948
year_renovated = 1948
garages = 2
condition_codes = 7    # 0 = 'first occupation', 7 = 'dilapidated'
type_corner_house = 0
type_duplex = 0
type_farmhouse = 1
type_midterrace_house = 0
type_multiple_dwelling = 0
type_residential_property = 0
type_single_dwelling = 0
type_special_property = 0
type_villa = 0
garage_type_garage = 0
garage_type_outside_parking_lot = 1
garage_type_underground_parking_lot = 0

X = np.asarray([np.log(living_space), np.log(lot), rooms, bathrooms, floors,
                np.log(2021 - year_built), np.log(2021 - year_renovated),
                garages, condition_codes, type_corner_house, type_duplex, type_farmhouse,
                type_midterrace_house, type_multiple_dwelling, type_residential_property,
                type_single_dwelling, type_special_property, type_villa,
                garage_type_garage, garage_type_outside_parking_lot,
                garage_type_underground_parking_lot]).reshape(1, -1)

y = np.exp(pipe.predict(X))

print('predicted price: {:.0f} EUR'.format(y[0]))
predicted price: 115455 EUR
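The entries of X have to appear in exactly the column order that was used for building the training matrix. When in doubt, that order can be read off from the preprocessed data frame:

# Column order expected by the model (target and unused columns removed).
print(list(data.drop(columns=['Price', 'Condition', 'State', 'City', 'Place']).columns))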