Preprocessing Text Data#

We want to classify blog posts by the age and gender of the post’s author. Training data is available from The Blog Authorship Corpus, which contains about 650,000 posts from about 19,000 blogs. The data may be freely used for non-commercial research purposes. It was collected for the research published in Effects of Age and Gender on Blogging (J. Schler, M. Koppel, S. Argamon, J. Pennebaker, Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs).

In addition to the usual preprocessing and model training we will discuss how to convert text data to numerical features and how to cope with very high-dimensional feature spaces. Models will be based on word counts. This first chapter contains everything we need to do before counting words. The second chapter discusses how to count words and how to use word counts for training machine learning models.

data_path = '/data/datasets/blogs/'

Getting and Restructuring the Data Set#

The data comes as a ZIP file from the above-mentioned website (313 MB). The ZIP file contains one XML file per blog (uncompressed size: 808 MB). An XML file is a text file containing some markup code (similar to HTML). Information about a blog’s author is encoded in the file name.

Extracting Blog Information#

File names have the format blog_id.gender.age.industry.astronomical_sign.xml. We create a data frame containing all the information but astronomical signs and save it to a CSV file.

import pandas as pd
import numpy as np
import zipfile
import re
import langdetect
with zipfile.ZipFile(data_path + 'blogs.zip') as zf:
    file_names = zf.namelist()
    
print(file_names[:5])
['blogs/', 'blogs/1000331.female.37.indUnk.Leo.xml', 'blogs/1000866.female.17.Student.Libra.xml', 'blogs/1004904.male.23.Arts.Capricorn.xml', 'blogs/1005076.female.25.Arts.Cancer.xml']

The XML files are in a subdirectory, and the subdirectory itself is listed by zf.namelist(), too.

blog_ids = []
genders = []    # 'm' if male, 'f' if female
ages = []
industries = []

for file_name in file_names:
    
    if file_name.split('.')[-1] != 'xml':
        print('skipping', file_name)
        continue
    
    blog_id, gender, age, industry, astro = file_name.split('/')[-1].split('.')[0:-1]

    blog_ids.append(int(blog_id))
    
    if gender == 'male':
        genders.append('m')
    elif gender == 'female':
        genders.append('f')
    else:
        print('unknown gender:', gender)
        
    ages.append(int(age))
    
    industries.append(industry)

blogs = pd.DataFrame({'gender': genders, 'age': ages, 'industry': industries}, index=blog_ids)
skipping blogs/
blogs
gender age industry
1000331 f 37 indUnk
1000866 f 17 Student
1004904 m 23 Arts
1005076 f 25 Arts
1005545 m 25 Engineering
... ... ... ...
996147 f 36 Telecommunications
997488 m 25 indUnk
998237 f 16 indUnk
998966 m 27 indUnk
999503 m 25 Internet

19320 rows × 3 columns

blogs['industry'] = blogs['industry'].str.replace('indUnk', 'unknown')
blogs
gender age industry
1000331 f 37 unknown
1000866 f 17 Student
1004904 m 23 Arts
1005076 f 25 Arts
1005545 m 25 Engineering
... ... ... ...
996147 f 36 Telecommunications
997488 m 25 unknown
998237 f 16 unknown
998966 m 27 unknown
999503 m 25 Internet

19320 rows × 3 columns

blogs.to_csv(data_path + 'blogs.csv')

Converting XML Files to One CSV File#

We would like to have a data frame containing all blog posts. This data frame can easily be modified (data preprocessing!) and saved to a CSV file for future use.

Reading XML files can be done with the module xml.etree.ElementTree from the standard Python library. Usage is relatively simple, but the parser fails on almost all of the files. Although the file extension is .xml, the files do not contain valid XML. Most files contain characters not allowed in XML, and some files even contain HTML fragments, which makes the parser fail. Thus, we have to parse the files manually.
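To see why manual parsing is necessary, the following sketch (not part of the processing pipeline, just an illustration) tries the standard parser on a single file from the archive; for most files this raises a ParseError.

import xml.etree.ElementTree as ET

with zipfile.ZipFile(data_path + 'blogs.zip') as zf:
    xml_names = [name for name in zf.namelist() if name.endswith('.xml')]
    with zf.open(xml_names[0]) as f:
        try:
            ET.parse(f)    # standard parser on one blog file
            print('parsed without errors')
        except ET.ParseError as err:
            print('parse error:', err)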

The structure of the files is as follows:

<Blog>
<date>DAY,MONTH_NAME,YEAR</date>
<post>TEXT_OF_POST</post>
...
<date>DAY,MONTH_NAME,YEAR</date>
<post>TEXT_OF_POST</post>
</Blog>

When parsing the files we have to take into account the following observations:

  • Files are littered with random white space characters.

  • Encoding is unknown and may vary from file to file.

  • Files contain non-printable characters (00h-1Fh) other than line breaks (could be an encoding issue).

  • Links in original posts are marked by urlLink (not urllink as indicated in the data set description).

White space can be removed with str.strip(). At least some files are not Unicode encoded, and interpreting non-Unicode files as Unicode may lead to errors. Using a 1-byte encoding like ISO 8859-1 works for all files, because each byte is interpreted as some character. Applying a 1-byte encoding to UTF-8 files may, however, result in non-printable characters, which have to be removed before further processing of the text data. Removing all non-printable characters also removes line breaks, but line breaks do not matter for our classification task, so that’s not a problem. The link marker urlLink can be safely removed. Possible URLs following the marker are hard to remove; for the moment we keep them.
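The following toy example (the word 'Café' is made up for illustration) shows the effect: decoding UTF-8 bytes with the 1-byte encoding ISO 8859-1 never raises an error, but multi-byte characters end up as several odd 1-byte characters.

raw = 'Café'.encode('utf-8')         # b'Caf\xc3\xa9'
print(raw.decode('iso-8859-1'))      # CafÃ© (no error, but mangled character)
print(raw.decode('utf-8'))           # Café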

Months in the date field are given by name, mostly in English but also in some other languages. To translate month names to numbers we use a dictionary. To build the dictionary we may start with English month names and then add more names one by one whenever the program stops with a KeyError. Here is a list of languages and one XML file per language:

language     file
French       1022086.female.17.Student.Cancer.xml
Spanish      1162265.male.17.Student.Aries.xml
Portuguese   1253219.female.27.indUnk.Sagittarius.xml
German       1366984.female.25.Technology.Aries.xml
Estonian     1405766.male.24.HumanResources.Scorpio.xml
Italian      1847277.female.24.Student.Gemini.xml
Finnish      2042296.female.25.Student.Sagittarius.xml
Dutch        3032299.female.33.Non-Profit.Scorpio.xml
Polish       3340219.male.45.Technology.Virgo.xml
Romanian     3559973.female.36.Manufacturing.Aries.xml
Swedish      4145017.male.23.BusinessServices.Libra.xml
Russian      4230660.female.13.Student.Virgo.xml
Croatian     817097.female.26.Student.Taurus.xml
Norwegian    887044.female.23.indUnk.Pisces.xml

Some dates are missing; the date field then contains only ,,. We set such dates to 0/0/0.

Looking at some of the XML files with non-English month names, we see that some posts are written in languages other than English. The data set providers only checked whether a blog (not each post) contains at least 200 common English words. Thus, we have to remove some posts. Language detection can be done with the langdetect module.
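A quick illustration of langdetect on two made-up sentences (detection is statistical, so very short texts may be classified unreliably):

print(langdetect.detect('This is an English sentence about blogs.'))    # en
print(langdetect.detect('Dies ist ein deutscher Satz über Blogs.'))     # de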

# cell execution may take several hours due to language detection

month_num = {'january': 1, 'february': 2, 'march': 3, 'april': 4, 'may': 5, 'june': 6,
             'july': 7, 'august': 8, 'september': 9, 'october': 10, 'november': 11, 'december': 12,
             # French
             'janvier': 1, 'mars': 3, 'avril': 4, 'mai': 5, 'juin': 6,
             'juillet': 7, 'septembre': 9, 'octobre': 10, 'novembre': 11,
             # Spanish
             'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4, 'mayo': 5, 'junio': 6,
             'julio': 7, 'agosto': 8, 'septiembre': 9, 'octubre': 10, 'noviembre': 11, 'diciembre': 12,
             # Portuguese
             'janeiro': 1, 'fevereiro': 2, 'maio': 5, 'junho': 6,
             'julho': 7, 'agosto': 8, 'setembro': 9, 'outubro': 10, 'novembro': 11, 'dezembro': 12,
             # German
             'januar': 1, 'februar': 2, 'märz': 3, 'april': 4, 'mai': 5, 'juni': 6,
             'juli': 7, 'august': 8, 'september': 9, 'oktober': 10, 'november': 11, 'dezember': 12,
             # Estonian
             'jaanuar': 1, 'aprill': 4, 'juuni': 6, 'juuli': 7,
             # Italian
             'giugno': 6, 'luglio': 7, 'ottobre': 10,
             # Finnish
             'toukokuu': 5, 'elokuu': 8,
             # Dutch
             'maart': 3, 'mei': 5, 'augustus': 8,
             # Polish
             'maj': 5, 'czerwiec': 6, 'lipiec': 7,
             # Romanian
             'ianuarie': 1, 'februarie': 2, 'iulie': 7, 'septembrie': 9, 'noiembrie': 11,
             # Swedish
             'augusti': 8,
             # Russian
             'avgust': 8,
             # Croatian
             'lipanj': 6, 'kolovoz': 8,
             # Norwegian
             'mars': 3, 'desember': 12,
             # unknown
             'unknown': 0}


blog_ids = []
days = []
months = []
years = []
texts = []
langs = []

with zipfile.ZipFile(data_path + 'blogs.zip') as zf:

    for file_name in zf.namelist():

        if file_name.split('.')[-1] != 'xml':
            print('skipping', file_name)
            continue

        #print(file_name)

        blog_id = int(file_name.split('/')[-1].split('.')[0])

        with zf.open(file_name) as f:
            xml = f.read().decode(encoding='iso-8859-1')

        xml_posts = xml.split('<date>')[1:]

        for xml_post in xml_posts:

            day, month, year = xml_post[:(xml_post.find('</date>'))].split(',')
            if len(day) == 0:
                day = '0'
            if len(month) == 0:
                month = 'unknown'
            if len(year) == 0:
                year = '0'

            text = xml_post[(xml_post.find('<post>') + 6):(xml_post.find('</post>'))]
            text = re.sub(r'[\x00-\x1F]+', ' ', text)    # non-printable characters
            text = text.replace('&nbsp;', ' ')    # HTML entity for protected spaces
            text = text.replace('urlLink', '')    # link marker
            text = text.strip()

            try:
                lang = langdetect.detect(text)                
            except langdetect.LangDetectException:
                lang = ''
                
            if len(text) > 0:
                blog_ids.append(blog_id)
                days.append(int(day))
                months.append(month_num[month.lower()])
                years.append(int(year))
                texts.append(text)
                langs.append(lang)
        
posts = pd.DataFrame(data={'blog_id': blog_ids, 'day': days, 'month': months, 'year': years,
                           'text': texts, 'lang': langs})
posts
skipping blogs/
blog_id day month year text lang
0 1000331 31 5 2004 Well, everyone got up and going this morning. ... en
1 1000331 29 5 2004 My four-year old never stops talking. She'll ... en
2 1000331 28 5 2004 Actually it's not raining yet, but I bought 15... en
3 1000331 28 5 2004 Ha! Just set up my RSS feed - that is so easy!... en
4 1000331 28 5 2004 Oh, which just reminded me, we were talking ab... en
... ... ... ... ... ... ...
676023 999503 4 7 2004 Today we celebrate our independence day. In... en
676024 999503 3 7 2004 Ugh, I think I have allergies... My nose has ... en
676025 999503 2 7 2004 "Science is like sex; occasionally something p... en
676026 999503 1 7 2004 Dog toy or marital aid I managed 10/14 on th... en
676027 999503 1 7 2004 I had a dream last night about a fight when I ... en

676028 rows × 6 columns

print(len(posts))

posts = posts.loc[posts['lang'] == 'en', :]
posts = posts.drop(columns=['lang'])

print(len(posts))
676028
653764
posts.to_csv(data_path + 'posts.csv')
posts = pd.read_csv(data_path + 'posts.csv', index_col=0)

posts
blog_id day month year text
0 1000331 31 5 2004 Well, everyone got up and going this morning. ...
1 1000331 29 5 2004 My four-year old never stops talking. She'll ...
2 1000331 28 5 2004 Actually it's not raining yet, but I bought 15...
3 1000331 28 5 2004 Ha! Just set up my RSS feed - that is so easy!...
4 1000331 28 5 2004 Oh, which just reminded me, we were talking ab...
... ... ... ... ... ...
676023 999503 4 7 2004 Today we celebrate our independence day. In...
676024 999503 3 7 2004 Ugh, I think I have allergies... My nose has ...
676025 999503 2 7 2004 "Science is like sex; occasionally something p...
676026 999503 1 7 2004 Dog toy or marital aid I managed 10/14 on th...
676027 999503 1 7 2004 I had a dream last night about a fight when I ...

653702 rows × 5 columns

Exploring the Data Set#

Now we have two data frames: blogs and posts. We should have a look at the data before tackling the learning task.

print('blogs:', len(blogs))
print('posts:', len(posts))
blogs: 19320
posts: 653702

Exploring Blog Authors#

blogs.groupby('gender').count()
age industry
gender
f 9660 9660
m 9660 9660

The data set is well balanced with respect to the blog authors’ gender.

blogs.groupby('age').count()
gender industry
age
13 690 690
14 1246 1246
15 1771 1771
16 2152 2152
17 2381 2381
23 2026 2026
24 1895 1895
25 1620 1620
26 1340 1340
27 1205 1205
33 464 464
34 378 378
35 338 338
36 288 288
37 259 259
38 171 171
39 152 152
40 145 145
41 139 139
42 127 127
43 116 116
44 94 94
45 103 103
46 72 72
47 71 71
48 77 77
blogs['age'].hist(bins=np.arange(13, 49)-0.5)
<Axes: >
[figure: histogram of blog authors' ages]

We have three age groups of different sizes:

print(np.count_nonzero((blogs['age'] < 20).to_numpy()))
print(np.count_nonzero(((blogs['age'] > 20) & (blogs['age'] < 30)).to_numpy()))
print(np.count_nonzero((blogs['age'] > 30).to_numpy()))
8240
8086
2994

According to the data set description gender should be balanced in each age group.

blogs['age_group'] = pd.cut(blogs['age'], bins=[0, 20, 30, 100])

blogs.groupby(['age_group', 'gender']).count()
/tmp/ipykernel_19434/2441672544.py:3: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  blogs.groupby(['age_group', 'gender']).count()
age industry
age_group gender
(0, 20] f 4120 4120
m 4120 4120
(20, 30] f 4043 4043
m 4043 4043
(30, 100] f 1497 1497
m 1497 1497
blogs.groupby('industry').count()['age'].sort_values(ascending=False)
industry
unknown                    6827
Student                    5120
Education                   980
Technology                  943
Arts                        721
Communications-Media        479
Internet                    397
Non-Profit                  372
Engineering                 312
Government                  236
Law                         197
Consulting                  191
Science                     184
Marketing                   180
BusinessServices            163
Publishing                  150
Advertising                 145
Religion                    139
Telecommunications          119
Military                    116
Banking                     112
Accounting                  105
Fashion                      98
Tourism                      94
HumanResources               94
Transportation               91
Sports-Recreation            90
Manufacturing                87
Architecture                 69
Chemicals                    62
Biotech                      57
LawEnforcement-Security      57
RealEstate                   55
Museums-Libraries            55
Construction                 55
Automotive                   54
Agriculture                  36
InvestmentBanking            33
Environment                  28
Maritime                     17
Name: age, dtype: int64

Industry information is available for about 65 per cent of the blog authors.

Exploring Blog Posts#

Although the dates of posts are irrelevant for our classification task, we have a look at them. Looking at such columns gives a better feeling for the data set and its reliability.

posts.groupby('year').count()
blog_id day month text
year
0 23 23 23 23
1999 68 68 68 68
2000 866 866 866 866
2001 5427 5427 5427 5427
2002 21772 21772 21772 21772
2003 100199 100199 100199 100199
2004 525318 525318 525318 525318
2005 6 6 6 6
2006 23 23 23 23

The data set providers scraped the data in August 2004. Thus, there should be no newer posts. Since we are only interested in the post texts, we do not care about this inconsistency here.

posts.groupby('month').count()
blog_id day year text
month
0 23 23 23 23
1 21812 21812 21812 21812
2 24663 24663 24663 24663
3 29099 29099 29099 29099
4 33166 33166 33166 33166
5 75275 75275 75275 75275
6 125797 125797 125797 125797
7 154842 154842 154842 154842
8 125919 125919 125919 125919
9 13051 13051 13051 13051
10 16296 16296 16296 16296
11 16525 16525 16525 16525
12 17234 17234 17234 17234

The maximum in the summer months is not because people write more blog posts in summer. Blogging became more and more popular from month to month, and posts were collected only until August 2004. Had posts from September 2004 been included, September counts would presumably be higher than August counts. The slight drop from July to August could be caused by incomplete data for August 2004.

posts.groupby('day').count()['blog_id'].plot()
<Axes: xlabel='day'>
[figure: number of posts per day of month]

There are more posts at the beginning of a month than at its end. Counts for the 31st are much lower because not every month has a 31st.

We should have a look at class balancing. We already know that gender is well balanced and age is not if we count on a per-blog basis. But since we want to classify blog posts (not complete blogs) by gender and age of the author, we have to consider class sizes on a per-post basis.

posts_per_blog = posts.groupby('blog_id')['day'].count()

blogs['posts'] = 0
blogs.loc[posts_per_blog.index, 'posts'] = posts_per_blog

blogs.groupby(['gender', 'age_group'])['posts'].sum()
/tmp/ipykernel_19434/1016594775.py:6: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  blogs.groupby(['gender', 'age_group'])['posts'].sum()
gender  age_group
f       (0, 20]      110387
        (20, 30]     155425
        (30, 100]     56786
m       (0, 20]      115255
        (20, 30]     152739
        (30, 100]     63110
Name: posts, dtype: int64

In the highest age group gender is somewhat unbalanced, but not much. An even more accurate measure of class size (data per class) is the cumulative text length per class.

posts['length'] = posts['text'].str.len()
chars_per_blog = posts.groupby('blog_id')['length'].sum()

blogs['chars'] = 0
blogs.loc[chars_per_blog.index, 'chars'] = chars_per_blog

blogs.groupby(['gender', 'age_group'])['chars'].sum()
/tmp/ipykernel_19434/1431541158.py:7: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  blogs.groupby(['gender', 'age_group'])['chars'].sum()
gender  age_group
f       (0, 20]      117123221
        (20, 30]     179750611
        (30, 100]     74878617
m       (0, 20]      119793206
        (20, 30]     175919412
        (30, 100]     72894990
Name: chars, dtype: int64

Here the gender balance looks even better, but again the highest age group is much smaller than the other two age groups.

Preprocessing Text for Counting Words#

Machine learning algorithms expect numbers as inputs. So we have to convert strings to vectors of numbers. There exist different conversion techniques, some advanced ones like Word2vec and some simpler ones like the bag of words approach. The latter assigns each word in a corpus a position in a vector and represents a string by counting the occurrences of each word. The vector representation of a string is the vector containing all word counts.
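The following toy example (plain Python, made-up corpus; proper feature extraction follows in the next chapter) illustrates the bag-of-words idea:

from collections import Counter

corpus = ['the cat sat on the mat', 'the dog sat']

# one vector position per distinct word in the corpus
vocabulary = sorted({word for text in corpus for word in text.split()})
vectors = [[Counter(text.split())[word] for word in vocabulary] for text in corpus]

print(vocabulary)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)      # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]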

Input features are word counts, and the length of a feature vector equals the number of different words in the dictionary. Thus, feature vectors are extremely long and contain zeros almost everywhere. Vectors containing zeros almost everywhere are called sparse vectors. A sparse vector is not stored as an array, but as a list of index-value pairs for the non-zero components only. Memory consumption is then governed not by the vector length but by the number of non-zero components. Scikit-Learn and SciPy support sparse vectors (and matrices) and automatically choose a suitable data type where appropriate.
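A small sketch with SciPy's sparse matrix types (an illustration only) shows that only the non-zero entries are stored:

import numpy as np
from scipy import sparse

dense = np.zeros((1, 100000))      # a very long feature vector, almost all zeros
dense[0, 42] = 3

sparse_vec = sparse.csr_matrix(dense)
print(sparse_vec.nnz)              # 1 non-zero component stored
print(sparse_vec)                  # lists only the non-zero entry, e.g. (0, 42)  3.0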

The dictionary (and, thus, the feature space dimension) is determined from the training set. All words contained in the training set form the dictionary. Usually one leaves out words occurring in only very few training samples as well as words occurring in almost all training samples. From the former a model cannot learn anything useful due to lack of samples. The latter do not contain useful information to discriminate between different classes.
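For illustration, Scikit-Learn's CountVectorizer implements this filtering via its min_df and max_df parameters (a toy example with a made-up corpus; details of word counting follow in the next chapter):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the cat sat on the mat',
          'the dog sat on the log',
          'the bird']

# keep words occurring in at least 2 documents,
# drop words occurring in more than 90 per cent of the documents
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # ['on' 'sat'] ('the' is too frequent)
print(counts.toarray())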

Before converting strings to vectors some preprocessing is necessary. At least punctuation and other special characters should be removed. Other preprocessing steps may include:

  • Stop word removal: Remove common words like ‘and’, ‘or’, ‘have’. There exist lists of stop words for most languages. Stop word removal has to be used with care, because some common words may carry important information, like ‘not’ for instance.

  • Stemming: Remove word endings like plural ‘s’ or ‘ing’ to get word stems. There exist many different stemming algorithms. Results are sometimes incorrect. For instance, ‘thus’ is usually stemmed to ‘thu’.

  • Lemmatization: Get the base form of a word. It’s a more intelligent form of stemming, but requires lots of computation time. Again, there exist many different algorithms.

Stop word lists, stemming, and lemmatization are implemented, for instance, in the nltk Python package (Natural Language Toolkit). The subject is known as natural language processing.
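For instance, NLTK's PorterStemmer reproduces the ‘thus’ example mentioned above (just an illustration; below we will use lemmatization, not stemming):

import nltk

stemmer = nltk.stem.PorterStemmer()
print(stemmer.stem('cats'))      # cat
print(stemmer.stem('running'))   # run
print(stemmer.stem('thus'))      # thu (incorrect stem)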

Removing Punctuation and Other Special Characters#

We remove all characters except letters, digits, spaces, and basic punctuation (!, ?, dot, comma, apostrophe) and then collapse repeated white space into single spaces. Regular expressions allow for efficient removal.

# keep only word characters, spaces, and basic punctuation
posts['text'] = posts['text'].str.replace(r"[^\w ,\.\?!']", '', regex=True)

# collapse repeated white space into single spaces
posts['text'] = posts['text'].str.replace(r'\s+', ' ', regex=True)
posts['text']
0         Well, everyone got up and going this morning. ...
1         My fouryear old never stops talking. She'll sa...
2         Actually it's not raining yet, but I bought 15...
3         Ha! Just set up my RSS feed that is so easy! W...
4         Oh, which just reminded me, we were talking ab...
                                ...                        
676023    Today we celebrate our independence day. In ho...
676024    Ugh, I think I have allergies... My nose has b...
676025    Science is like sex occasionally something pra...
676026    Dog toy or marital aid I managed 1014 on this ...
676027    I had a dream last night about a fight when I ...
Name: text, Length: 653702, dtype: object

Maybe some texts are empty now. We should remove them.

print(len(posts))

posts = posts.loc[posts.loc[:, 'text'] != '', :]

print(len(posts))
653702
653702

Lemmatization#

To reduce dictionary size and increase chances for good classification results we use lemmatization. For instance we want to count ‘child’ and ‘children’ as one and the same word. We choose the WordNetLemmatizer of NLTK. WordNet is a database provided by Princeton University which contains relations between English words.

The WordNetLemmatizer takes a word and looks it up in the database. If the word is found there, its base form is returned; else the original word is returned. The WordNet database can be searched online, too. Searching the WordNet database with NLTK or online includes some stemming-like preprocessing steps.

import nltk

Before first use of WordNetLemmatizer we have to download the database.

nltk.download('wordnet')
[nltk_data] Downloading package wordnet to
[nltk_data]     /var/lib/u21302575108/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
True

We have to create a WordNetLemmatizer object and then call its lemmatize method.

lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatizer.lemmatize('child')
'child'
lemmatizer.lemmatize('children')
'child'

Note that the WordNet lemmatizer only works for lower-case words (a fact that is not well documented).

lemmatizer.lemmatize('Children')
'Children'

Simply calling WordNetLemmatizer with some word may yield unexpected results. Given the sentence ‘He is killing him.’ we would expect ‘killing’ to be lemmatized to ‘kill’.

lemmatizer.lemmatize('killing')
'killing'

The problem here is that ‘killing’ is the base form of a noun (‘That resulted in a killing.’) and WordNetLemmatizer by default looks for nouns. A second argument to lemmatize modifies the default behavior.

lemmatizer.lemmatize('killing', pos=nltk.corpus.reader.wordnet.VERB)
'kill'

The abbreviation ‘pos’ stands for ‘part of speech’. The module nltk.corpus.reader.wordnet contains some WordNet related functionality. It defines some constants, for instance. Passing nltk.corpus.reader.wordnet.NOUN to pos (the default) tells the lemmatizer that the word is a noun. Passing nltk.corpus.reader.wordnet.VERB tells it that the word is a verb. Further options are nltk.corpus.reader.wordnet.ADJ (adjectives) and nltk.corpus.reader.wordnet.ADV (adverbs).

print(nltk.corpus.reader.wordnet.NOUN)
print(nltk.corpus.reader.wordnet.VERB)
print(nltk.corpus.reader.wordnet.ADJ)
print(nltk.corpus.reader.wordnet.ADV)
n
v
a
r

Although these are simple strings, we should use the constants. If the implementation of NLTK or WordNet changes, our code is then more likely to keep working.

The question now is: how to obtain POS information? NLTK implements several POS taggers. If you do not want to decide which one to choose, use the one recommended by NLTK by simply calling pos_tag(). This function takes a list of words and punctuation symbols as its argument. Such a list can be generated by calling word_tokenize(). Again, several tokenization algorithms are available, and word_tokenize() uses the recommended one.

To use tokenization and tagging we have to download some NLTK data. The data to download may change if NLTK recommends other algorithms in the future, but the corresponding functions show a warning if required data is not available, and the warning message contains the code for downloading it.

nltk.download('punkt')    # for tokenization
nltk.download('averaged_perceptron_tagger')    # for POS tagging
[nltk_data] Downloading package punkt to
[nltk_data]     /var/lib/u21302575108/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /var/lib/u21302575108/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
True
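With the required data downloaded, a quick check on a single sentence (the ‘killing’ example from above) shows what word_tokenize() and pos_tag() return:

tokens = nltk.tokenize.word_tokenize('He is killing him.')
print(tokens)                    # ['He', 'is', 'killing', 'him', '.']
print(nltk.tag.pos_tag(tokens))  # e.g. [('He', 'PRP'), ('is', 'VBZ'), ('killing', 'VBG'), ...]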

Tokenization and tagging (in theory) could be implemented as follows.

posts['tokenized'] = None
posts['tagged'] = None

for idx in posts.index:
    posts.loc[idx, 'tokenized'] = nltk.tokenize.word_tokenize(posts.loc[idx, 'text'])
    posts.loc[idx, 'tagged'] = nltk.tag.pos_tag(posts.loc[idx, 'tokenized'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[39], line 5
      2 posts['tagged'] = None
      4 for idx in posts.index:
----> 5     posts.loc[idx, 'tokenized'] = nltk.tokenize.word_tokenize(posts.loc[idx, 'text'])
      6     posts.loc[idx, 'tagged'] = nltk.tag.pos_tag(posts.loc[idx, 'tokenized'])

File /opt/conda/envs/python3/lib/python3.11/site-packages/pandas/core/indexing.py:849, in _LocationIndexer.__setitem__(self, key, value)
    846 self._has_valid_setitem_indexer(key)
    848 iloc = self if self.name == "iloc" else self.obj.iloc
--> 849 iloc._setitem_with_indexer(indexer, value, self.name)

File /opt/conda/envs/python3/lib/python3.11/site-packages/pandas/core/indexing.py:1835, in _iLocIndexer._setitem_with_indexer(self, indexer, value, name)
   1832 # align and set the values
   1833 if take_split_path:
   1834     # We have to operate column-wise
-> 1835     self._setitem_with_indexer_split_path(indexer, value, name)
   1836 else:
   1837     self._setitem_single_block(indexer, value, name)

File /opt/conda/envs/python3/lib/python3.11/site-packages/pandas/core/indexing.py:1891, in _iLocIndexer._setitem_with_indexer_split_path(self, indexer, value, name)
   1886     if len(value) == 1 and not is_integer(info_axis):
   1887         # This is a case like df.iloc[:3, [1]] = [0]
   1888         #  where we treat as df.iloc[:3, 1] = 0
   1889         return self._setitem_with_indexer((pi, info_axis[0]), value[0])
-> 1891     raise ValueError(
   1892         "Must have equal len keys and value "
   1893         "when setting with an iterable"
   1894     )
   1896 elif lplane_indexer == 0 and len(value) == len(self.obj.index):
   1897     # We get here in one case via .loc with a all-False mask
   1898     pass

ValueError: Must have equal len keys and value when setting with an iterable

This code results in an error whose cause is somewhat hard to figure out. If the right-hand side of an assignment to a Pandas object is an iterable, then Pandas expects the left-hand side to select the same number of elements. Thus, it is not possible to assign, for instance, a list to a single cell of a data frame this way. There exist several more or less complicated workarounds, all of which are rather inefficient. Thus, we use the following code, which requires two for loops (list comprehensions) instead of one.

# cell execution takes several minutes

posts['tokenized'] = [nltk.tokenize.word_tokenize(text) for text in posts['text']]
posts['tokenized']
0         [Well, ,, everyone, got, up, and, going, this,...
1         [My, fouryear, old, never, stops, talking, ., ...
2         [Actually, it, 's, not, raining, yet, ,, but, ...
3         [Ha, !, Just, set, up, my, RSS, feed, that, is...
4         [Oh, ,, which, just, reminded, me, ,, we, were...
                                ...                        
676023    [Today, we, celebrate, our, independence, day,...
676024    [Ugh, ,, I, think, I, have, allergies, ..., My...
676025    [Science, is, like, sex, occasionally, somethi...
676026    [Dog, toy, or, marital, aid, I, managed, 1014,...
676027    [I, had, a, dream, last, night, about, a, figh...
Name: tokenized, Length: 653702, dtype: object
# cell execution takes an hour and requires 25 GB of memory

posts['tagged'] = [nltk.tag.pos_tag(tokens) for tokens in posts['tokenized']]
posts['tagged']
0         [(Well, RB), (,, ,), (everyone, NN), (got, VBD...
1         [(My, PRP$), (fouryear, JJ), (old, JJ), (never...
2         [(Actually, RB), (it, PRP), ('s, VBZ), (not, R...
3         [(Ha, NNP), (!, .), (Just, RB), (set, VBN), (u...
4         [(Oh, UH), (,, ,), (which, WDT), (just, RB), (...
                                ...                        
676023    [(Today, NN), (we, PRP), (celebrate, VBP), (ou...
676024    [(Ugh, NNP), (,, ,), (I, PRP), (think, VBP), (...
676025    [(Science, NN), (is, VBZ), (like, IN), (sex, N...
676026    [(Dog, NNP), (toy, NN), (or, CC), (marital, JJ...
676027    [(I, PRP), (had, VBD), (a, DT), (dream, NN), (...
Name: tagged, Length: 653702, dtype: object

The problem now is to translate NLTK POS tags to WordNet POS tags. Have a look at the list of NLTK POS tags. There we see the following relations:

NLTK POS tag    WordNet POS tag
JJ…             ADJ
RB…             ADV
NN…             NOUN
VB…             VERB

All NLTK POS tags have at least two characters. So we may use the following conversion function.

def NLTKPOS_to_WordNetPOS(tag):
    
    if tag[0:2] == 'JJ':
        return nltk.corpus.reader.wordnet.ADJ
    elif tag[0:2] == 'RB':
        return nltk.corpus.reader.wordnet.ADV
    elif tag[0:2] == 'NN':
        return nltk.corpus.reader.wordnet.NOUN
    elif tag[0:2] == 'VB':
        return nltk.corpus.reader.wordnet.VERB
    else:
        return None

Tokens with NLTK POS tags not present in WordNet can be removed, because they do not carry much information about the text. Here we have to keep in mind that our model will be based on word counts, so no relations between words are considered: for our model the sentence ‘John is in the house.’ is the same as ‘The house is in John.’. For more advanced models, relations between words, and thus word classes other than adjectives, adverbs, verbs, and nouns, may be of importance.

Note that the lemmatize function always returns some word. If a word is not found in the WordNet database, then the original word is returned. If we want to sort out words not contained in WordNet, we have to use a trick. Looking at the source code of lemmatize we see that the function calls nltk.corpus.wordnet._morphy. The _morphy function returns a (possibly empty) list of lemmas found in WordNet. If _morphy returns an empty list, we know that the word under consideration is something unusual (it contains typos, for instance) and should be ignored. Else we use the first lemma in the list.
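A quick look at _morphy's return value (keep in mind that _morphy is an internal helper, so its interface may change between NLTK versions; 'blargh' is just a made-up token):

print(nltk.corpus.wordnet._morphy('children', nltk.corpus.reader.wordnet.NOUN))   # ['child']
print(nltk.corpus.wordnet._morphy('blargh', nltk.corpus.reader.wordnet.NOUN))     # []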

In principle, this approach is good, but there’s a snag to it: If we pass the wrong POS tag to _morphy we won’t get a result.

nltk.corpus.wordnet._morphy('children', nltk.corpus.reader.wordnet.VERB)
[]

So we have to refine our strategy. If _morphy returns an empty list, we call _morphy with all possible POS tags. If all returned lists are empty, then we can be relatively sure that the word is something unusual and, thus, irrelevant for our classification task.

Another issue is that WordNet does not lemmatize words containing apostrophes. For she’s or haven’t that’s not a real problem, because such words are of little importance for our classification task. But what about mama’s, for instance? We should remove all occurrences of ‘s.

# cell execution may take several hours

print_max = 1000
printed = 0

posts['lemmatized'] = None

for idx in posts.index:
    lemmas = []    # list of all lemmatized words of current post
    for token, tag in posts.loc[idx, 'tagged']:
        modified_token = token.lower().replace("'s", '')
        
        wordnet_tag = NLTKPOS_to_WordNetPOS(tag)
        if not (wordnet_tag is None):
            
            morphy_result = nltk.corpus.wordnet._morphy(modified_token, wordnet_tag)
            if len(morphy_result) > 0:
                lemmas.append(morphy_result[0])
            else:
                morphy_result_all = nltk.corpus.wordnet._morphy(modified_token, nltk.corpus.reader.wordnet.NOUN) \
                                    + nltk.corpus.wordnet._morphy(modified_token, nltk.corpus.reader.wordnet.VERB) \
                                    + nltk.corpus.wordnet._morphy(modified_token, nltk.corpus.reader.wordnet.ADJ) \
                                    + nltk.corpus.wordnet._morphy(modified_token, nltk.corpus.reader.wordnet.ADV)
                if len(morphy_result_all) > 0:
                    lemmas.append(morphy_result_all[0])
                else:
                    if printed < print_max:
                        print(modified_token, end=', ')
                        printed += 1
                    
    posts.loc[idx, 'lemmatized'] = ' '.join(lemmas)
    
posts['lemmatized']
everyone, , , everything, .., fouryear, ummm, ...., ummm, anything, , goldeyes, n't, skydome, occassionally, goldeyes, n't, 'm, , n't, everyone, gameboy, n't, n't, n't, everything, 've, hmm, something, else, mcnally, something, anyone, 've, 've, breadmaker, n't, n't, 've, imdb, everything, something, something, 'm, ineveitable, n't, , n't, 'm, 've, kel, 'm, ya, di, amore, 'm, .., n't, n't, heh, , everyone, sleepyland, n't, 'm, ya, di, amore, whoopie, 've, umm, n't, epvm, n't, poopeyhead, nathan, 're, n't, nathan, nathan, alstadt, we, we, dan, dave, we, dan, , , .., n't, 'm, thnk, 'm, epvm, umm, heh, n't, arg, , 'm, 'm, madsen, eek, umm, .., brett, alex, joanne, now.they, 're, alex, , gosh, darnit, improv, 'm, 'm, 're, n't, ...., ya, di, amore, 'm, 'm, n't, ya, di, amore, , something, grrr, , 'm, enjoyig, hey, catie, momly, n't, duper, n't, 'm, anyone, 'm, , thingamabobber, 're, 're, 're, 've, di, amore, n't, 'm, 'm, , everyone, brandon, jake, n't, meh, , ok., my, n't, a., 'm, kane, ummm, something, else, psh, n't, n't, heh, anything, 'm, ya, di, amore, 'm, superduper, ...., , 're, contraversial, blahdeblahdeblah, nequa, naperville, 're, 're, 're, n't, heck, kathryn, how, poopy, 'm, sleepyland, goodnight, everyone, ya, di, amore, 'm, superduper, ...., , 're, contraversial, blahdeblahdeblah, nequa, naperville, 're, 're, 're, n't, heck, kathryn, how, poopy, 'm, sleepyland, goodnight, everyone, ya, di, amore, hey, alex, 'm, baaack, 'm, something, everyone, hink, genevieve, n't, grr, pooey, , tootsie, , catie, , n't, steve, chem, , , ah, passon, .., kast, hmmm, , anyone, everything, 've, heh, spontanious, catie, rar, ah, n't, something, everything, n't, hmmm, ah, , 'm, 're, 've, n't, righthand, ya, di, amore, n't, n't, , 've, n't, n't, 've, finaly, 'm, 'm, , , ...., , , , 'm, ya, di, amore, hehe, 'm, n't, 'm, 'm, 'm, umm, kristen, chris, alex, general.oh, umm, epvm, 'm, and, arg, 'm, 'm, jeez, owwwwwwwwwwwwww, jeez, n't, , n't, , alex, emily, chris, alex, chem, , , 'm, 'm, something, jusat, arg, n't, ouch, , 'm, ...., 've, n't, , , .., something, umm, him, eek, n't, kathryn, joanne, 'm, if, him, could, n't, something, s., umm, n't, n't, meh, 'm, n't, n't, .., coyle, , , borring, .., sooooooooooooooooooooooooooo, mehgan, n't, arg, , my, eek, n't, n't, arg, 've, , eek, welp, haha, n't, 've, kane, n't, n't, 'm, argh, vnted, 'm, alstadt, n't, n't, , 'm, aww, selfseeking, , , n't, n't, minddr, ahh, ya, di, amore, brandon, 'm, 're, jumpin, 'm, 've, 'm, .., heh, arg, 'm, ya, di, amore, rar, heh, hoo, 'm, umm, n't, n't, s'all, umm, n't, anything, chris, , 're, n't, 'm, n't, n't, , n't, 'm, 'm, n't, arg, , , anyone, heh, goodnight, 'm, 'm, 've, 'm, everyone, everyone, rar, ...., rar, mixedup, , n't, that, ah, 'm, anyone, friggin, business.and, .., 'm, weieners, arg, 're, 're, 're, .., 're, 're, gon, haha, gianopolis, something, 'm, heck, whew, ...., n't, arg, naperville, pooey, something, 're, , psh, naperville, 're, minigang, n't, jeez, fricking, 'm, naperville, , , haha, whoa, ...., opur, ummm, ...., , n't, rotton, , 're, diane, n't, chem, , n't, , something, n't, eek, julia, , n't, n't, kathryn, yikes, ...., 'm, .., alstadt, n't, anything, , nathan, steve, n't, arg, 'm, errr, something, everyone, xoxo, everything, kathryn, n't, , 'm, amanda, 've, 'm, , anytime, 'm, n't, , exb, ooo, dave, im'ing, drat, n't, , 'm, eek, n't, dan, n't, n't, everything, 'm, , something, 'm, 'm, , arg, chem, , kordalewski, rar, , , 'm, chem, soo, diane, ahhhh, diane, , 'm, pez, n't, , n't, kathryn, n't, nikki, n't, n't, , 
nonband, nikki, 're, malevalent, nikki, ...., n't, 'm, everything, , joanne, nikki, ...., ...., n't, anyone, perkyness, vdiddy, , doman, , 'm, reallly, , , something, kari, , ...., christy, ryan, her, , anything, anything, everything, , dave, n't, 've, n't, , 'm, scool, , 'm, , something, soo, ...., ...., eek, yikes, stevey, steve, zimnie, 've, , , , haha, n't, steve, , n't, hmmm, ...., ummm, , , , gov., andre, haha, ah, kathryn, n't, anyone, , , albinak, , c., 'm, 'm, 'm, n't, 'm, n't, n't, 'm, 're, kathryn, joanne, 'm, 'm, everyone, bucs, 'm, everyone, 'm, wated, 're, neil, something, neil, n't, kristen, alex, 'm, kathryn, joanne, 'm, n't, n't, superbowl, n't, something, n't, n't, ...., , hehe, kazba, catie, 're, 'm, myself, evereryone, else, superbowl, .so, , else, 've, everything, 'm, 'm, 've, , n't, something, emily, 'm, 'm, joanne, kristen, 're, 'm, 'm, superbowl, n't, jso, n't, alex, jso, jso, n't, jso, 're, n't, .., 'm, , andy, diane, ...., diane, , 'm, , haha, n't, something, , rollercoaster, satuday, 've, , duper, diane, arg, c., 've, emily, n't, n't, ...., n't, n't, phew, 'm, 'm, 'm, 'm, something, phew, rollercoaster, 'm, emily, n't, 'm, n't, anything, kath, oprah, phil, , bulemic, joanne, , joanne, versa, else, , 'm, 'm, goodnight, xoxo, emily, 're, anything, 're, n't, 'm, n't, zach, richa, jeez, , everyone, n't, everyone, n't, jeez, n't, n't, n't, , 're, ...., arg, heck, , , , something, something, n't, ...., n't, 're, n't, , 're, n't, 'm, 're, n't, kari, richa, zach, 're, .., , n't, 'm, richa, zach, 'm, 'm, , calfornia, , 'm, n't, .., n't, bleh, zach, everyone, 've, richa, anyone, to.do, n't, sooo, n't, anyone, rotton, , jso, 'm, should, awesomely, diane, hehe, , possitively, snippity, emily, , emily, eric, eek, n't, im, dave, justin, dan, 'm, joanne, kathryn, nikki, rar, chris, jim, haha, shpeal, klos, diane, , diane, diane, n't, n't, eric, chris, jim, 're, catie, , n't, n't, n't, anything, 've, .., , n't, lookis, 've, musto, , prefference, jso, 've, , n't, muah, 've, n't, chem, diane, 'm, 'm, diane, 'm, , , 've, , usuaully, n't, , andy, n't, diane, n't, 'm, n't, n't, emily, diane, christine, , n't, kathryn, 've, kathryn, n't, realllly, kath, n't, lockin, n't, 've, joanne, 'm, 
0         well get up go morning still raining okay sort...
1         old never stop talk say mom say say oh yeah do...
2         actually not raining yet buy ticket game mom b...
3         ha just set r feed be so easy do do enough tod...
4         just remind be talk can food coffee break morn...
                                ...                        
676023    today celebrate independence day honor event g...
676024    think have allergy nose have be stuff week mak...
676025    science be sex occasionally practical come not...
676026    dog toy marital aid manage little quiz see wel...
676027    have dream last night fight be younger dad hea...
Name: lemmatized, Length: 653702, dtype: object
posts = posts[['blog_id', 'lemmatized']]

posts.to_csv(data_path + 'posts_lemmatized.csv')
posts
blog_id lemmatized
0 1000331 well get up go morning still raining okay sort...
1 1000331 old never stop talk say mom say say oh yeah do...
2 1000331 actually not raining yet buy ticket game mom b...
3 1000331 ha just set r feed be so easy do do enough tod...
4 1000331 just remind be talk can food coffee break morn...
... ... ...
676023 999503 today celebrate independence day honor event g...
676024 999503 think have allergy nose have be stuff week mak...
676025 999503 science be sex occasionally practical come not...
676026 999503 dog toy marital aid manage little quiz see wel...
676027 999503 have dream last night fight be younger dad hea...

653702 rows × 2 columns

# quick check: read back only the first 100 rows of the saved file
posts_lemmatized = pd.read_csv(data_path + 'posts_lemmatized.csv', index_col=0, nrows=100)

posts['lemmatized'] = posts_lemmatized['lemmatized']

print(posts.loc[0, 'lemmatized'])
well get up go morning still raining okay sort suit mood easily have stay home bed book cat have be lot rain people have wet basement be lake be golf course fields be green green green be suppose be degree friday be deal mosquito next week hear winnipeg describe old testament city cbc radio one last week sort rings true flood infestation

We could improve preprocessing further by tagging geographical locations, names of persons, and so on. But at some point one has to stop. Let’s see what a machine learning model can learn from our (not perfectly) preprocessed data…