Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data (Wikipedia). This rapidly improving area of artificial intelligence covers tasks such as speech recognition, natural-language understanding, and natural-language generation. In the following projects, we're going to build a strong NLP foundation by practicing tokenization, stop-word removal, stemming, part-of-speech tagging, chunking and chinking, named-entity recognition, and text classification.
So, let's begin.
import nltk
import sys
import sklearn
import random
# confirm the environment is set up correctly
print(sys.version)
print(nltk.__version__)
print(sklearn.__version__)
# uncomment to download the NLTK corpora used below
# nltk.download()
Now that we have the nltk package installed, let's go over some basic natural language processing vocabulary:
Corpus: A body of text, singular ("corpora" is the plural). Example: a collection of medical journals.
Lexicon: Words and their meanings. Example: an English dictionary. Consider, however, that different fields will have different lexicons.
Token: Each "entity" that results from splitting text up based on rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenize the sentences out of a paragraph.
When using Natural Language Processing, our goal is to perform some analysis or processing so that a computer can respond to text appropriately.
The process of converting data to something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is filtering out useless data. In natural language processing, these useless words are referred to as stop words.
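For example, here is a minimal sketch of stop-word filtering with NLTK's built-in English stop word list (the sample sentence is invented for illustration):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
sentence = "This is a sample sentence, showing off stop word filtration."
stop_words = set(stopwords.words("english"))
filtered = [w for w in word_tokenize(sentence) if w.lower() not in stop_words]
print(filtered)  # filler words like "this", "is", and "a" are gone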
Stemming, which attempts to normalize sentences, is another preprocessing step that we can perform. In the English language, different variations of words and sentences often have the same meaning. Stemming is a way to account for these variations; furthermore, it will help us shorten the sentences and shorten our lookup. For example, consider the following sentences:
I was taking a ride in the car.
I was riding in the car.
These sentences mean the same thing, as noted by the same tense (-ing) in each sentence; however, that isn't intuitively understood by the computer. To account for all the variations of words in the English language, we can use the Porter stemmer, which has been around since 1979.
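Here is a minimal sketch of the Porter stemmer, using an invented list of word variations:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
for w in ["ride", "riding", "rides"]:
    print(ps.stem(w))  # all three reduce to the stem "ride"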
With pre-processing covered, we can move on to part-of-speech (POS) tagging: labeling each word as a noun, verb, adjective, and so on. We'll train NLTK's unsupervised PunktSentenceTokenizer on the 2005 State of the Union address and use it to split the 2006 address into sentences:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
# train the unsupervised Punkt tokenizer on the 2005 speech,
# then split the 2006 speech into sentences
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        # POS-tag the first five sentences
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
Now that each word has been tagged with a part of speech, we can move on to chunking: grouping the words into meaningful clusters. The main goal of chunking is to group words into "noun phrases": a noun with any associated verbs, adjectives, or adverbs.
The part of speech tags that were generated in the previous step will be combined with regular expressions, such as the following:
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            # combine the part-of-speech tags with a regular expression:
            # <RB.?> adverbs, <VB.?> verbs, <NNP> proper nouns, <NN> singular nouns
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            # draw the chunks with nltk
            # chunked.draw()
    except Exception as e:
        print(str(e))

process_content()
Sometimes there are words in our chunks that we don't want; we can remove them using a process called chinking.
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            # The main difference here is the }{ vs. the {}: the }{ defines a chink,
            # which removes one or more verbs, prepositions, determiners, or the
            # word 'to' from the chunk.
            chunkGram = r"""Chunk: {<.*>+}
                                   }<VB.?|IN|DT|TO>+{"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            # print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)
            # chunked.draw()
    except Exception as e:
        print(str(e))

process_content()
One of the most common forms of chunking in natural language processing is called "named-entity recognition." NLTK is able to identify people, organizations, locations, monetary figures, and more.
There are two major options with NLTK's named-entity recognition: either recognize all named entities as a single class, or recognize each named entity as its respective type, such as a person, organization, or location.
Here, with the option binary=True, something is either tagged as a named entity or it is not; there is no further detail.
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            # namedEnt.draw()
    except Exception as e:
        print(str(e))

process_content()
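For the second option, here is a minimal sketch: dropping binary=True makes ne_chunk label each entity with its type (PERSON, ORGANIZATION, GPE, and so on):
tagged = nltk.pos_tag(nltk.word_tokenize(tokenized[5]))
namedEnt = nltk.ne_chunk(tagged)  # binary defaults to False, so entities keep their types
for subtree in namedEnt.subtrees(filter=lambda t: t.label() != 'S'):
    print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))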
Finally, let's put these pieces together and build a simple text classifier for sentiment analysis, using NLTK's movie_reviews corpus of positive and negative reviews:
from nltk.corpus import movie_reviews
# pair each review's words with its 'pos' or 'neg' label
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
# shuffle the documents
random.shuffle(documents)
print('Number of Documents: {}'.format(len(documents)))
print('First Review: {}'.format(documents[0]))
# build a frequency distribution of every (lowercased) word in the corpus
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
print('Most common words: {}'.format(all_words.most_common(15)))
print('The word happy: {}'.format(all_words["happy"]))
print(len(all_words))  # number of distinct words in the corpus
We'll use the 4,000 most frequent words as our features:
# take the 4,000 most frequent words as the feature vocabulary
word_features = [w for (w, count) in all_words.most_common(4000)]
def find_features(document):
    # mark, for each feature word, whether it appears in the document
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features
features = find_features(movie_reviews.words('neg/cv000_29416.txt'))
for key, value in features.items():
    if value == True:
        print(key)
featuresets = [(find_features(rev), category) for (rev, category) in documents]
from sklearn import model_selection
# hold out 25% of the feature sets for testing
training, testing = model_selection.train_test_split(featuresets, test_size=0.25)
print(len(training))
print(len(testing))
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC
# wrap a scikit-learn linear SVM in NLTK's classifier interface
model = SklearnClassifier(SVC(kernel='linear'))
model.train(training)
accuracy = nltk.classify.accuracy(model, testing) * 100
print("SVC Accuracy: {}".format(accuracy))