Natural Language Processing.

Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data (wikipedia). This rapidly improving area of artificial intelligence covers tasks such as speech recognition, natural-language understanding, and natural language generation. In the following projects, we're going to be building a strong NLP foundation by practicing:

  • Tokenizing - Splitting sentences and words from the body of text.
  • Part of Speech tagging
  • Chunking

So, Let's begin.

Importing the Libraries.

In [2]:
import nltk
import sys
import sklearn
import random

print(sys.version)
print(nltk.__version__)
print(sklearn.__version__)

# nltk.download()
3.6.7 |Anaconda custom (64-bit)| (default, Oct 23 2018, 14:01:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
3.4
0.20.2

Now that we have the nltk package installed, lets go over some basic natural language processing vocabulary:

Corpus -

Body of text, singular. Corpora is the plural of this. Example: A collection of medical journals.

Lexicon -

Words and their meanings. Example: English dictionary. Consider, however, that various fields will have different lexicons.

Token -

Each "entity" that is a part of whatever was split up based on rules. For examples, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenized the sentences out of a paragraph.

Stop Words with NLTK:

When using Natural Language Processing, our goal is to perform some analysis or processing so that a computer can respond to text appropriately.

The process of converting data to something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is going to be filtering out useless data. In natural language processing, useless words (data), are referred to as stop words.

Stemming Words with NLTK:

Stemming, which attempts to normalize sentences, is another preprocessing step that we can perform. In the english language, different variations of words and sentences often having the same meaning. Stemming is a way to account for these variations; furthermore, it will help us shorten the sentences and shorten our lookup. For example, consider the following sentence:

  • I was taking a ride on my horse.
  • I was riding my horse.

These sentences mean the same thing, as noted by the same tense (-ing) in each sentence; however, that isn't intuitively understood by the computer. To account for all the variations of words in the english language, we can use the Porter stemmer, which has been around since 1979.

In [6]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

Lets import some sample and training text - George Bush's 2005 and 2006 state of the union addresses.

In [7]:
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

Now that we have some text, we can train the PunktSentenceTokenizer

In [8]:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

Now lets tokenize the sample_text using our trained tokenizer

In [9]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

Let's define a function that will tag each tokenized word with a part of speech

In [10]:
def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))
        
process_content()
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'), ('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]
[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'JJ'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]

Chunking with NLTK

Now that each word has been tagged with a part of speech, we can move onto chunking: grouping the words into meaningful clusters. The main goal of chunking is to group words into "noun phrases", which is a noun with any associated verbs, adjectives, or adverbs.

The part of speech tags that were generated in the previous step will be combined with regular expressions, such as the following:

In [11]:
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # combine the part-of-speech tag with a regular expression
            
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # draw the chunks with nltk
            # chunked.draw()     

    except Exception as e:
        print(str(e))

        
process_content()

Chinking with NLTK

Sometimes there are words in the chunks that we don't won't, we can remove them using a process called chinking.

In [ ]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # The main difference here is the }{, vs. the {}. This means we're removing 
            # from the chink one or more verbs, prepositions, determiners, or the word 'to'.

            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

            # chunked.draw()

    except Exception as e:
        print(str(e))

        
process_content()

Named Entity Recognition with NLTK

One of the most common forms of chunking in natural language processing is called "Named Entity Recognition." NLTK is able to identify people, places, things, locations, monetary figures, and more.

There are two major options with NLTK's named entity recognition: either recognize all named entities, or recognize named entities as their respective type, like people, places, locations, etc.

Here, with the option of binary = True, this means either something is a named entity, or not. There will be no further detail.

In [13]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            # namedEnt.draw()
            
    except Exception as e:
        print(str(e))

        
process_content()

Text Classification

Text classification using NLTK

Now that we have covered the basics of preprocessing for Natural Language Processing, we can move on to text classification using simple machine learning classification algorithms.

In [3]:
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# shuffle the documents
random.shuffle(documents)

print('Number of Documents: {}'.format(len(documents)))
print('First Review: {}'.format(documents[1]))

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

print('Most common words: {}'.format(all_words.most_common(15)))
print('The word happy: {}'.format(all_words["happy"]))
Number of Documents: 2000
First Review: (['"', 'jack', 'frost', ',', '"', 'is', 'one', 'of', 'those', 'dumb', ',', 'corny', 'concoctions', 'that', 'attempts', 'to', 'be', 'a', 'heartwarming', 'family', 'film', ',', 'but', 'is', 'too', 'muddled', 'in', 'its', 'own', 'cliches', 'and', 'predictability', 'to', 'be', 'the', 'least', 'bit', 'touching', '.', 'this', 'does', 'not', 'come', 'as', 'a', 'surprise', ',', 'since', 'the', 'studio', 'that', 'made', 'it', 'is', 'warner', 'brothers', ',', 'who', 'is', 'on', 'a', 'current', 'streak', 'of', 'one', 'bad', 'film', 'after', 'the', 'other', '.', 'jack', 'frost', '(', 'michael', 'keaton', ')', 'is', 'a', 'struggling', 'middle', '-', 'aged', 'rock', 'musician', 'who', 'loves', 'his', 'wife', ',', 'gabby', '(', 'kelly', 'preston', ')', ',', 'and', '11', '-', 'year', '-', 'old', 'son', ',', 'charlie', '(', 'joseph', 'cross', ')', ',', 'but', 'doesn', "'", 't', 'spend', 'nearly', 'enough', 'time', 'with', 'them', '.', 'when', 'he', 'receives', 'a', 'call', 'from', 'a', 'music', 'label', 'that', 'wants', 'to', 'hear', 'him', 'play', ',', 'he', 'has', 'to', 'cancel', 'his', 'planned', 'family', 'outing', 'up', 'in', 'the', 'mountains', 'for', 'christmas', '.', 'halfway', 'there', ',', 'jack', 'has', 'second', 'thoughts', ',', 'but', 'on', 'his', 'way', 'back', 'home', ',', 'is', 'in', 'a', 'car', 'accident', 'and', 'dies', '.', 'switch', 'forward', 'a', 'year', ',', 'christmas', 'is', 'approaching', 'once', 'again', ',', 'and', 'charlie', 'and', 'gabby', 'are', 'still', 'having', 'a', 'difficult', 'time', 'coming', 'to', 'terms', 'with', 'jack', "'", 's', 'death', '.', 'when', 'charlie', 'begins', 'to', 'play', 'the', 'harmonica', 'his', 'father', 'gave', 'him', 'the', 'night', 'before', 'he', 'died', ',', 'the', 'snowman', 'outside', 'the', 'house', 'is', 'taken', 'over', 'by', 'jack', "'", 's', 'spirit', '.', 'jack', 'wants', 'to', 'spend', 'some', 'time', 'with', 'his', 'son', 'before', 'the', 'upcoming', 'warm', 'front', 'melts', 'him', ',', 'but', 'charlie', 'desperately', 'tries', 'to', 'prevent', 'his', 'melting', 'demise', '.', '"', 'frosty', 'the', 'snowman', ',', '"', 'is', 'a', 'classic', 'cartoon', ',', 'and', 'the', 'idea', 'of', 'a', 'snowman', 'that', 'is', 'alive', 'works', 'splendidly', 'when', 'animated', ',', 'but', 'as', 'a', 'live', '-', 'action', 'film', ',', 'it', 'doesn', "'", 't', 'work', 'at', 'all', '.', 'after', 'a', 'somewhat', 'promising', 'prologue', 'in', 'which', 'the', 'frost', 'family', 'is', 'established', ',', '"', 'jack', 'frost', ',', '"', 'quickly', 'goes', 'downhill', ',', 'especially', 'once', 'the', 'snowman', 'comes', 'into', 'play', '.', 'since', 'jack', 'has', 'been', 'deceased', 'for', 'a', 'whole', 'year', ',', 'you', 'would', 'think', 'there', 'would', 'be', 'many', 'questions', 'to', 'ask', 'him', ',', 'such', 'as', ',', '"', 'what', 'happens', 'after', 'you', 'die', '?', '"', 'or', ',', '"', 'how', 'does', 'it', 'feel', 'to', 'be', 'a', 'snowman', '?', '"', 'but', 'instead', ',', 'the', 'film', 'focuses', 'on', 'a', 'snowball', 'fight', 'subplot', 'and', 'an', 'inevitably', 'oversentimental', 'climax', 'that', 'could', 'be', 'telegraphed', 'before', 'i', 'even', 'sat', 'down', 'to', 'watch', 'the', 'movie', '.', 'the', 'performances', 'are', 'respectable', 'enough', ',', 'but', 'no', 'one', 'deserves', 'to', 'be', 'punished', 'by', 'appearing', 'in', 'a', 'silly', 'film', 'like', 'this', '.', 'michael', 'keaton', 'at', 'least', 'got', 'off', 'easy', ',', 'since', 'he', 'disappears', 'after', 'the', 'first', 'twenty', 'minutes', ',', 'but', 'what', 'exactly', 'does', 'he', 'think', 'he', 'is', 'doing', 'with', 'his', 'career', 'here', '?', 'i', 'have', 'always', 'liked', 'kelly', 'preston', '.', 'she', 'is', 'clearly', 'a', 'talented', ',', 'charismatic', 'actress', ',', 'but', 'has', 'never', 'been', 'given', 'a', 'good', 'role', 'in', 'her', 'life', ',', 'usually', 'having', 'to', 'settle', 'for', 'a', 'one', '-', 'dimensional', 'supporting', 'character', ',', 'as', 'in', ',', '1997', "'", 's', ',', '"', 'nothing', 'to', 'lose', ',', '"', 'and', ',', '"', 'addicted', 'to', 'love', '.', '"', 'joseph', 'cross', 'was', 'probably', 'the', 'highlight', 'in', 'the', 'cast', ',', 'since', 'he', 'believably', 'portrayed', 'a', 'boy', 'suffering', 'the', 'loss', 'of', 'a', 'parent', '.', 'in', 'one', 'of', 'the', 'only', 'subplots', 'that', 'actually', 'works', ',', 'due', 'to', 'its', 'wittiness', ',', 'henry', 'rollins', 'is', 'highly', 'amusing', 'as', 'a', 'hockey', 'coach', 'who', 'becomes', 'terrified', 'and', 'paranoid', 'after', 'seeing', 'the', 'live', 'snowman', '.', 'this', 'brief', 'hint', 'of', 'cleverness', 'is', 'pushed', 'to', 'the', 'side', ',', 'however', ',', 'by', 'the', 'tried', '-', 'and', '-', 'true', 'main', 'plot', 'at', 'hand', ',', 'which', 'is', 'the', 'sappy', 'story', 'of', 'a', 'father', 'and', 'son', '.', 'since', 'i', 'knew', 'what', 'was', 'going', 'to', 'happen', 'by', 'the', 'time', 'the', 'conclusion', 'came', 'around', ',', 'i', 'had', 'no', 'choice', 'but', 'to', 'sit', 'there', 'and', 'listen', 'to', 'painfully', 'insipid', ',', 'cringe', '-', 'inducing', 'lines', 'of', 'dialogue', '.', 'some', 'of', 'my', 'favorites', 'was', 'an', 'interaction', 'between', 'the', 'son', 'and', 'father', ':', '"', 'you', 'da', 'man', ',', '"', 'says', 'charlie', '.', '"', 'no', ',', 'i', 'da', 'snowman', ',', '"', 'replies', 'jack', '.', 'or', 'how', 'about', 'this', 'little', 'zinger', ',', 'coming', 'from', 'a', 'school', 'bully', 'that', 'miraculously', 'becomes', 'friendly', 'towards', 'charlie', 'and', 'tries', 'to', 'help', 'him', 'out', ':', '"', 'snowdad', 'is', 'better', 'than', 'no', 'dad', '.', '"', 'do', 'people', 'really', 'get', 'paid', 'in', 'hollywood', 'for', 'writing', 'pieces', 'of', 'trash', 'like', 'this', '?', 'the', 'snowman', ',', 'created', 'by', 'john', 'henson', "'", 's', 'creature', 'shop', ',', 'is', 'more', 'believable', 'than', 'the', 'snowman', 'from', 'last', 'year', "'", 's', 'unintentionally', 'hilarious', 'direct', '-', 'to', '-', 'video', 'horror', 'flick', ',', 'also', 'called', ',', '"', 'jack', 'frost', ',', '"', 'but', 'it', 'still', 'was', 'difficult', 'to', 'tell', 'if', 'it', 'was', 'a', 'person', 'in', 'a', 'suit', 'or', 'computer', 'effects', '.', 'either', 'way', ',', 'it', 'was', 'an', 'awful', 'lot', 'of', 'work', 'to', 'go', 'through', ',', 'just', 'to', 'come', 'up', 'with', 'a', 'final', 'product', 'as', 'featherbrained', 'as', 'this', 'project', '.', 'as', 'a', 'seasonal', 'holiday', 'picture', ',', '"', 'jack', 'frost', ',', '"', 'is', 'pretty', 'much', 'a', 'clunker', '.', 'a', 'better', 'christmas', 'film', 'from', 'this', 'year', 'is', ',', '"', 'i', "'", 'll', 'be', 'home', 'for', 'christmas', '.', '"', 'better', 'yet', ',', 'my', 'suggestion', 'would', 'be', 'to', 'stay', 'home', 'and', 'watch', 'a', 'quality', 'film', ',', 'such', 'as', ',', '"', 'it', "'", 's', 'a', 'wonderful', 'life', ',', '"', '"', 'a', 'christmas', 'story', ',', '"', 'or', ',', '"', 'prancer', '.', '"', '"', 'jack', 'frost', ',', '"', 'is', 'an', 'earnest', ',', 'but', 'severely', 'misguided', 'film', ',', 'and', 'children', ',', 'as', 'well', 'as', 'adults', ',', 'deserve', 'better', '.', 'i', 'doubt', 'they', 'would', 'want', 'to', 'see', 'a', 'movie', 'about', 'the', 'death', 'of', 'a', 'parent', ',', 'anyway', '.'], 'neg')
Most common words: [(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]
The word happy: 215

We'll use the 4000 most common words as features

In [4]:
print(len(all_words))
word_features = list(all_words.keys())[:4000]
39768

The find_features function will determine which of the 3000 word features are contained in the review

In [5]:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

Lets use an example from a negative review

In [8]:
features = find_features(movie_reviews.words('neg/cv000_29416.txt'))
for key, value in features.items():
    if value == True:
        print(key)
plot
:
two
teen
couples
go
to
a
church
party
,
drink
and
then
drive
.
they
get
into
an
accident
one
of
the
guys
dies
but
his
girlfriend
continues
see
him
in
her
life
has
nightmares
what
'
s
deal
?
watch
movie
"
sorta
find
out
critique
mind
-
fuck
for
generation
that
touches
on
very
cool
idea
presents
it
bad
package
which
is
makes
this
review
even
harder
write
since
i
generally
applaud
films
attempt
break
mold
mess
with
your
head
such
(
lost
highway
&
memento
)
there
are
good
ways
making
all
types
these
folks
just
didn
t
snag
correctly
seem
have
taken
pretty
neat
concept
executed
terribly
so
problems
well
its
main
problem
simply
too
jumbled
starts
off
normal
downshifts
fantasy
world
you
as
audience
member
no
going
dreams
characters
coming
back
from
dead
others
who
look
like
strange
apparitions
disappearances
looooot
chase
scenes
tons
weird
things
happen
most
not
explained
now
personally
don
trying
unravel
film
every
when
does
give
me
same
clue
over
again
kind
fed
up
after
while
biggest
obviously
got
big
secret
hide
seems
want
completely
until
final
five
minutes
do
make
entertaining
thrilling
or
engaging
meantime
really
sad
part
arrow
both
dig
flicks
we
actually
figured
by
half
way
point
strangeness
did
start
little
bit
sense
still
more
guess
bottom
line
movies
should
always
sure
before
given
password
enter
understanding
mean
showing
melissa
sagemiller
running
away
visions
about
20
throughout
plain
lazy
!
okay
people
chasing
know
need
how
giving
us
different
offering
further
insight
down
apparently
studio
took
director
chopped
themselves
shows
might
ve
been
decent
here
somewhere
suits
decided
turning
music
video
edge
would
actors
although
wes
bentley
seemed
be
playing
exact
character
he
american
beauty
only
new
neighborhood
my
kudos
holds
own
entire
feeling
unraveling
overall
doesn
stick
because
entertain
confusing
rarely
excites
feels
redundant
runtime
despite
ending
explanation
craziness
came
oh
horror
slasher
flick
packaged
someone
assuming
genre
hot
kids
also
wrapped
production
years
ago
sitting
shelves
ever
whatever
skip
where
joblo
nightmare
elm
street
3
7
/
10
blair
witch
2
crow
9
salvation
4
stir
echoes
8

Now lets do it for all the documents

In [9]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

We can split the featuresets into training and testing datasets using sklearn

In [11]:
from sklearn import model_selection
training, testing = model_selection.train_test_split(featuresets, test_size = 0.25)

print(len(training))
print(len(testing))
1500
500

We can use sklearn algorithms in NLTK

In [12]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC

model = SklearnClassifier(SVC(kernel = 'linear'))

Train the model on the training data

In [13]:
model.train(training)
Out[13]:
<SklearnClassifier(SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))>

Test on the testing dataset!

In [14]:
accuracy = nltk.classify.accuracy(model, testing)*100
print("SVC Accuracy: {}".format(accuracy))
SVC Accuracy: 79.4