Blog Author Classification (Test)

We want to write a script that takes a list of URLs of blog posts and predicts the gender, age and industry of the blog's author. To do this, we have to load the trained models from the Blog Author Classification (Training) project and apply to the downloaded posts the same preprocessing steps that were used for training.

Getting some Blog Posts

Task: Collect the URLs of several posts from one blog in a list. Choose a blog whose author's gender, age and industry you know, so that we can check whether our models yield good predictions.

Solution:

# your solution
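One possible starting point, as a minimal sketch: the URLs below are hypothetical placeholders on example.com and have to be replaced by real post URLs of the blog you chose. The helper `is_http_url` is not part of the exercise; it only checks that each entry looks like an HTTP(S) URL before anything is downloaded.

```python
from urllib.parse import urlparse

# Hypothetical placeholder URLs; replace them with posts of a real blog.
urls = [
    'https://blog.example.com/2023/01/first-post',
    'https://blog.example.com/2023/02/second-post',
    'https://blog.example.com/2023/03/third-post',
]


def is_http_url(url):
    """Return True if url has an http(s) scheme and a host name."""
    parts = urlparse(url)
    return parts.scheme in ('http', 'https') and bool(parts.netloc)


# Collect entries that do not look like web addresses (typos, missing scheme).
invalid = [url for url in urls if not is_http_url(url)]
print(f'{len(urls)} URLs collected, {len(invalid)} invalid')
```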

Task: Download all webpages in the list. Strip the HTML tags with Beautiful Soup's get_text and join all posts into one string.

Solution:

# your solution
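The download step could look roughly like the following sketch. It assumes the standard library's `urlopen` for fetching and Beautiful Soup's built-in `'html.parser'` backend; the function names `html_to_text` and `download_posts` are chosen here for illustration only. Some sites reject requests without a browser-like User-Agent header, which this sketch does not handle.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def html_to_text(html):
    """Strip all HTML tags from a page given as str or bytes."""
    soup = BeautifulSoup(html, 'html.parser')
    # A separator keeps words from adjacent tags from running together.
    return soup.get_text(separator=' ')


def download_posts(urls, timeout=10):
    """Download every URL and return the tag-free text of all pages as one string."""
    texts = []
    for url in urls:
        with urlopen(url, timeout=timeout) as response:
            texts.append(html_to_text(response.read()))
    return '\n'.join(texts)
```

With the URL list from the first task, `text = download_posts(urls)` would then hold the text of all posts. Note that get_text returns the text of the whole page, including menus and footers, not only the post body.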

Preprocessing

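Task: Repeat all preprocessing steps from part 1 of the text processing chapter (remove punctuation, tokenize, lemmatize). The steps must match those used for the training data, because the vectorizer and the models only know the vocabulary produced by that preprocessing.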

Solution:

# your solution
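As a hedged sketch of the three steps: the function below replaces ASCII punctuation by spaces, tokenizes on whitespace and lemmatizes each token. Whitespace tokenization and NLTK's `WordNetLemmatizer` are assumptions made here; if the training project used a different tokenizer or lemmatizer, use that one instead. The `lemmatize` parameter is a hypothetical hook for plugging in such a replacement.

```python
import string


def preprocess(text, lemmatize=None):
    """Remove punctuation, tokenize on whitespace and lemmatize each token.

    lemmatize: function mapping one token to its lemma. If None, NLTK's
    WordNetLemmatizer is used (an assumption; it requires the 'wordnet'
    corpus, e.g. via nltk.download('wordnet')).
    """
    # Replace every ASCII punctuation character by a space.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    cleaned = text.translate(table)

    # Simple whitespace tokenization.
    tokens = cleaned.split()

    if lemmatize is None:
        from nltk.stem import WordNetLemmatizer
        lemmatize = WordNetLemmatizer().lemmatize

    # Join the lemmas again because vectorizers expect one string per document.
    return ' '.join(lemmatize(token) for token in tokens)
```

Calling `preprocess(text)` on the downloaded string yields the lemmatized text. Typographic characters such as curly quotes are not contained in `string.punctuation` and would need separate handling.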

Task: Load the three label lists and the vectorizer.

Solution:

# your solution
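How the label lists and the vectorizer are loaded depends on how the training project saved them. The sketch below assumes they were written with `pickle`, one file per object; the directory and file names in the usage note are hypothetical.

```python
import pickle
from pathlib import Path


def load_object(path):
    """Load one pickled Python object from path."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def load_artifacts(directory, names):
    """Load '<name>.pkl' for each name and return the objects in a dict."""
    directory = Path(directory)
    return {name: load_object(directory / f'{name}.pkl') for name in names}
```

Usage under these assumptions: `artifacts = load_artifacts('models', ['gender_labels', 'age_labels', 'industry_labels', 'vectorizer'])`. Only unpickle files you created yourself, because loading a pickle can execute arbitrary code.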

Task: Vectorize the lemmatized text.

Solution:

# your solution
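The important detail in this step is to call `transform` and not `fit_transform`, so that the vocabulary learned during training stays unchanged. The sketch below illustrates this with a stand-in `CountVectorizer` fitted on two toy sentences; in the exercise, the vectorizer loaded from disk takes its place, and it may be of a different type.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the vectorizer saved by the training project (assumption).
vectorizer = CountVectorizer().fit([
    'love my job at the bank',
    'school be fun today',
])

# Stand-in for the lemmatized text of the downloaded posts.
lemmatized_text = 'my job at the school be fun'

# transform expects an iterable of documents, so wrap the single string in a list.
X = vectorizer.transform([lemmatized_text])

# One row for the one author, one column per word of the training vocabulary.
print(X.shape)
```

Words that did not occur in the training vocabulary are silently ignored by `transform`.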

Prediction

Task: Load the three saved SVC models.

Solution:

# your solution
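A possible way to load the models, assuming they were saved with `joblib.dump` under the hypothetical file names `svc_gender.joblib`, `svc_age.joblib` and `svc_industry.joblib`. If the training project used plain `pickle` or other names, adapt accordingly.

```python
from pathlib import Path

import joblib


def load_models(directory, targets=('gender', 'age', 'industry')):
    """Load one saved model per target and return them in a dict keyed by target."""
    directory = Path(directory)
    # The file name pattern is an assumption made for this sketch.
    return {
        target: joblib.load(directory / f'svc_{target}.joblib')
        for target in targets
    }
```

scikit-learn warns when a model is loaded with a different library version than the one it was saved with, so it is safest to use the same scikit-learn version for training and prediction.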

Task: Predict the blog author's gender, age and industry. Present the result in human-readable form.

Solution:

# your solution
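The final step can be sketched as follows, under the assumption that each model predicts an integer class index and that the matching label list maps this index to a readable name. If the models were trained directly on string labels, the lookup in the label list is not needed. The function names and the dict layout are illustrative choices, not part of the training project.

```python
def describe_author(X, models, labels):
    """Predict one label per target for the single row in X.

    models: dict mapping target name to a fitted model with a predict method.
    labels: dict mapping target name to the list of label names
            (assumption: predictions are indices into this list).
    """
    result = {}
    for target, model in models.items():
        index = int(model.predict(X)[0])
        result[target] = labels[target][index]
    return result


def format_prediction(result):
    """Return the predictions as one 'target: label' line per target."""
    return '\n'.join(f'{target}: {label}' for target, label in result.items())
```

With the vectorized text `X`, the loaded models and the label lists, `print(format_prediction(describe_author(X, models, labels)))` prints one line per predicted attribute, which can then be compared with what you know about the author.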