Blog Author Classification (Training)#

Instead of deriving author information from single blog posts, as in Text Classification, we now want to use all posts of a blog to derive the author’s age, gender, and industry from text data. We train three independent models, one for each of the three output variables.

Working with text data requires heavy preprocessing. If we want to apply a machine learning model to new data (see the project Blog Author Classification (Test)), we have to preprocess the new data in the same way as the training data. This means that not only does the model have to be saved for later use, but the parameters of all preprocessing steps also have to be accessible to the user of the trained model. This issue is addressed in this project, too.

Getting the Data#

We have to load the blog author data and the posts. All posts of a blog have to be joined into one long text.
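The following sketch outlines these steps with pandas. The file names and the column labels blog_id and lemmas are assumptions, not the actual names used in the data set.

import pandas as pd

# hypothetical file and column names, adapt to the actual data files
blogs = pd.read_csv('blog_authors.csv')
posts = pd.read_csv('posts_lemmatized.csv', usecols=['blog_id', 'lemmas'])

# join all posts of a blog into one long string per blog ID
texts = posts.groupby('blog_id')['lemmas'].apply(' '.join)
blogs['text'] = blogs['blog_id'].map(texts)

del posts  # free memory once the joined texts are in place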

Task: Load blog author data.

Solution:

# your solution

Task: Load the lemmatized blog posts. We only need blog IDs and lemmatized texts.

Solution:

# your solution

Task: Join all posts of one blog into a long string. Add a new column text to the blogs data frame containing the blog texts. Then remove the posts data frame from memory to free some 100 MB of memory.

Solution:

# your solution

Model Inputs and Outputs#

Gender, age, and industry values have to be converted to integers. The conversion rules will be needed again for turning model outputs back into human readable labels. Thus, we should create some data structure holding the conversion rules. If we use integers 0, 1, 2, …, plain lists do the job. For unknown industry we should use the highest integer, because samples with unknown industry will be excluded from training the industry model.
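The idea is sketched below. The column names gender, age_group, industry and the label unknown are assumptions about the data set, not taken from it.

# lists mapping integer codes back to human readable labels (list index = code)
genders = sorted(blogs['gender'].unique())
ages = sorted(blogs['age_group'].unique())
industries = sorted(blogs['industry'].unique())

# move the 'unknown' label to the end so that it gets the highest integer code
industries.remove('unknown')
industries.append('unknown')

# integer codes via the lists' index method
blogs['gender_code'] = blogs['gender'].map(genders.index)
blogs['age_code'] = blogs['age_group'].map(ages.index)
blogs['industry_code'] = blogs['industry'].map(industries.index)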

Task: Convert gender (2 classes), age (3 classes) and industry (many classes) to integer values. Create 3 lists for converting integers to human readable strings.

Solution:

# your solution

Task: Create a NumPy array with all outputs (3 columns).

Solution:

# your solution

Task: Create a NumPy array with all texts (1 column of type object).

Solution:

# your solution

Train-Test Split#

Task: Split the data set into a training set (80 per cent) and a test set (20 per cent).

Solution:

# your solution

For training the industry model we will drop all samples with unknown industry. Here we have to take care that this removal has a similar influence on the training and test sets. Otherwise, we would have to remove the samples first and then split the data separately for the industry model, which would yield more complicated code than one split for all three models.
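A simple check is sketched below. It assumes output arrays named y_train and y_test with the industry codes in the third column, and the label list industries from the sketch above, where unknown carries the highest integer code.

# proportion of samples with unknown industry in each split (should be similar)
unknown_code = len(industries) - 1
print('train:', (y_train[:, 2] == unknown_code).mean())
print('test: ', (y_test[:, 2] == unknown_code).mean())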

Task: Check that samples with unknown industry are distributed similarly between the training and test sets.

Solution:

# your solution

Text to Numbers#

Task: Use Scikit-Learn’s TfidfVectorizer to convert the text data to numerical data.

Solution:

# your solution

We have to save the mapping from words to numbers if we want to use some model trained on the preprocessed data. The vocabulary (maps words to indices) is accessible through tfidf_vect.vocabulary_ and can be passed to a fresh TfidfVectorizer object via the vocabulary argument. But vectorization also requires knowledge of the inverse document frequencies. These are accessible through tfidf_vect.idf_, but there is no way to pass them to a fresh TfidfVectorizer object. Thus, we have to save the whole object.
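The following toy example illustrates the two attributes discussed above and the resulting need to pickle the whole fitted object. The corpus and the file name are made up for illustration.

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# tiny toy corpus, just to show the fitted attributes
tfidf = TfidfVectorizer().fit(['a small toy corpus', 'another toy text'])
print(tfidf.vocabulary_)   # dict mapping words to column indices
print(tfidf.idf_)          # inverse document frequencies, no constructor argument for these

# hence, persist the whole fitted object
with open('tfidf_toy.pickle', 'wb') as f:
    pickle.dump(tfidf, f)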

Task: Save the three lists with human readable labels and the vectorizer object to a file. Use the pickle module.

Solution:

# your solution

Gender Model#

Task: Train and evaluate a multinomial naive Bayes classifier for predicting blog authors’ gender with Scikit-Learn.

Solution:

# your solution

Task: Try a linear SVM for gender prediction.

Solution:

# your solution

Age Model#

Task: Train naive Bayes and SVM models for age prediction.

Solution:

# your solution

Industry Model#

Task: Select all training and test samples with known industry.

Solution:

# your solution

Task: Train naive Bayes and SVM models for industry prediction.

Solution:

# your solution

Task: Calculate the accuracy for a model which always predicts ‘Student’ as industry.

Solution:

# your solution

Saving Models#

Scikit-Learn does not provide functions for saving trained models (in contrast to Keras). But pickling Scikit-Learn objects should work.
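As a quick sanity check (toy data only, names made up), a Scikit-Learn estimator survives a pickle round trip:

import pickle
from sklearn.svm import LinearSVC

# fit a tiny toy model, pickle it to a byte string, restore it, and predict
model = LinearSVC().fit([[0.0], [1.0]], [0, 1])
restored = pickle.loads(pickle.dumps(model))
print(restored.predict([[0.2]]))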

Task: Save the three SVM models to a file.

Solution:

# your solution

Task: Why is the file containing the three SVM models so small? Or: What has to be saved to fully specify an SVM model?

Solution:

# your answer

Task: What’s the expected file size for a \(k\)-NN model?

Solution:

# your answer