January 10, 2017 Text Classification

Natural Language Processing using Python


Let’s walk through a short demo of Natural Language Processing for Machine Learning, using the 20 Newsgroups data set.

What we will do:

1. Read the newsgroup data.
2. Use TfidfVectorizer to convert a collection of raw documents to a matrix of TF-IDF features.
3. Fit a random forest model and a multinomial naive Bayes model (no cross-validation is used here).
4. Check how well both models perform on the test data set, using the F1 score.
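Step 3 skips cross-validation. For readers who want it anyway, here is a minimal sketch, not part of the original demo, of how the vectorizer and a model could be wrapped in a pipeline and scored with 2-fold cross-validation. The tiny in-memory corpus, its labels, and the variable names are illustrative assumptions, chosen so the snippet runs without downloading anything.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: label 0 = space, label 1 = graphics.
texts = [
    "the rocket reached orbit",
    "nasa launched a satellite into orbit",
    "the shuttle docked with the space station",
    "astronauts repaired the satellite in orbit",
    "render the image with opengl",
    "the graphics card draws pixels",
    "shaders render pixels on the graphics card",
    "opengl draws the image on screen",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline refits the vectorizer inside every fold, so the held-out
# documents never influence the vocabulary or the idf weights.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.01))

# One macro-averaged F1 score per fold.
scores = cross_val_score(pipeline, texts, labels, cv=2, scoring="f1_macro")
print(scores, scores.mean())
```

On the real data, the same pipeline could be passed the training documents and targets in place of the toy lists.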

Importing libraries

Let’s import the libraries and the loader for the newsgroup data set.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

Reading the training and testing data

Let’s choose four of the 20 categories in this data set (just to keep the demonstration simple).

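categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
# The original calls were cut off after subset=...; categories=categories is
# assumed here, the minimal completion that uses the list defined above.
trainDS = fetch_20newsgroups(subset='train', categories=categories)
testDS = fetch_20newsgroups(subset='test', categories=categories)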

Tf-idf initialization & fitting

Let’s initialize TfidfVectorizer and fit it to the training data. It converts each document into a vector of TF-IDF features, one feature per term in the vocabulary.

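vectorizer = TfidfVectorizer()
# Learn the vocabulary and idf weights from the training documents only
trainVectors = vectorizer.fit_transform(trainDS.data)
# Reuse the same vocabulary and weights for the test documents
testVectors = vectorizer.transform(testDS.data)
print(trainVectors.shape)

(2034, 34118)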
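To make the entries of this matrix concrete, here is a small pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper names `tokenize` and `tfidf` and the three toy documents are illustrative assumptions, and the tokenizer only approximates the library's default.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # which approximates TfidfVectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf(docs):
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # Scale each document vector to unit Euclidean length.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


docs = [
    "the rocket reached orbit",
    "the graphics card renders the image",
    "orbit of the satellite",
]
vocab, matrix = tfidf(docs)
print(len(matrix), len(vocab))  # rows = documents, columns = distinct terms
```

A term that appears in every document, such as "the", gets a lower weight than a term unique to one document, such as "rocket". Every distinct term becomes its own column, which is why the column count grows quickly on a real corpus.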

The shape (2034, 34118) tells us the following:

1. Only 2034 training documents are used.
2. TfidfVectorizer created 34118 features (variables).
3. The number of features is quite large, far more than the number of observations.
4. Such high dimensionality of the feature set is generally undesirable.
5. There are many ways to handle dimensionality issues (dimensionality reduction techniques, adding a regularization term to the model, etc.).
6. We do not address any of these here. This is just a starter.
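Although the demo leaves the feature set as it is, a rough sketch of two common ways to shrink a TF-IDF matrix may help: capping the vocabulary with the `max_features` parameter of TfidfVectorizer, and projecting onto a few components with TruncatedSVD. The toy documents and the chosen sizes (5 features, 2 components) are assumptions for illustration only, not tuned values.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the newsgroup posts.
docs = [
    "the rocket reached orbit around the moon",
    "nasa launched a satellite into orbit",
    "render the image with opengl shaders",
    "the graphics card draws pixels on screen",
]

# Baseline: one column per distinct term.
full = TfidfVectorizer().fit_transform(docs)

# Option 1: keep only the 5 most frequent terms across the corpus.
capped = TfidfVectorizer(max_features=5).fit_transform(docs)

# Option 2: project the full sparse matrix onto 2 latent components.
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(full)

print(full.shape, capped.shape, reduced.shape)
```

Whether either option helps or hurts the scores depends on the data and the model, so it should be checked on held-out data.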

Model Fitting & F1-score Metric

Let’s fit a multinomial naive Bayes model and a random forest model on the training data set, then compute the macro-averaged F1 score of each model on the test data set.

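# Multinomial naive Bayes with light smoothing, and a 100-tree random forest
modelMultinomial = MultinomialNB(alpha=.01)
modelForest = RandomForestClassifier(n_estimators=100)

# Fit both models on the TF-IDF training matrix
modelMultinomial.fit(trainVectors, trainDS.target)
modelForest.fit(trainVectors, trainDS.target)

# Predict the categories of the test documents
predictionsMultinomial = modelMultinomial.predict(testVectors)
predictionsRandom = modelForest.predict(testVectors)

# Macro-averaged F1 score: naive Bayes first, then random forest
# (no random_state is set, so the random forest score varies between runs)
print(metrics.f1_score(testDS.target, predictionsMultinomial, average='macro'),
      metrics.f1_score(testDS.target, predictionsRandom, average='macro'))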

0.882135924027 0.792904050831
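The two numbers are macro-averaged F1 scores: the F1 score is computed separately for each category and the per-category scores are then averaged with equal weight. As a small sketch of that calculation, the pure-Python function below reproduces it on made-up labels. The name `macro_f1` and the example labels are assumptions for illustration and are not taken from the newsgroup results.

```python
def macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    # Unweighted mean over classes: every class counts equally.
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.6556 (per-class F1: 0.5, 0.8, 0.6667)
```

Because each category counts equally, a model that does poorly on one category is penalized even if that category has few test documents.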