NLP: Natural Language Processing | The Ultimate Guide


Table of Contents – Natural Language Processing (NLP)

  • NLP
  • Python
  • Python Libraries for NLP
  • NLTK
  • Components of NLP
  • What is Tokenizing?
  • Stats in NLP
  • What is Word Cloud?
  • What is Stemming?
  • Stopwords
  • Chunking and Chinking
  • What is WordNet?
  • Applications of NLP
  • Pros and Cons of NLP
  • NLP’s impact on the future
  • Conclusion

Natural Language Processing (NLP) is the sub-part of Artificial Intelligence that explores how machines interact with human language.

The most common applications that use NLP include search engines such as Google, word processors such as Microsoft Word, grammar checkers such as Grammarly, Google Translate, call-center systems, and personal assistants such as Alexa and Siri.

Who can learn NLP (Natural Language Processing)?

  • Students and industry professionals who wish to get hands-on with a powerful AI technology – Natural Language Processing
  • Users who want to learn NLP and implement it in their projects using Python and take their AI/ML skills to the next level

Benefits of NLP (Natural Language Processing)

  • Perform large-scale analysis
  • Get a more objective and accurate analysis.
  • Streamline processes and reduce costs
  • Improve customer satisfaction
  • Better understand your market.
  • Gain real, actionable insights

Requirements

  • Fundamentals of AI and ML will be an added benefit.
  • Previous knowledge and experience with the Python programming language are a must.

NLP

Language is a way of communicating with each other. Natural Language Processing, also known as NLP, is a subfield of computer science and Artificial Intelligence that helps computers understand and process human language.

Natural Language Processing - NLP

In simple words, NLP is a program that helps machines understand our language, the language in which we write and speak.

NLP is one of the critical features of the blooming Artificial Intelligence sector. NLP enables machines to read, understand, and react. It is all about teaching machines how to understand human languages and extract meaning from text.

Artificial intelligence

Artificial Intelligence is a wide-ranging branch of computer science concerned with building intelligent machines that can perform tasks that normally require human intelligence.

Artificial Intelligence Robot standing

It can also be described as replicating human intelligence in programmed machines so that machines think like humans and mimic their actions.

The main characteristic of AI is its ability to rationalize and take actions that achieve a specific goal. Similarly, machine learning is a related concept wherein programs automatically learn and adapt to new data without human assistance.

A typical interaction between humans and machines using Natural Language Processing proceeds as follows:

  1. A person talks to a machine
  2. The machine records the audio
  3. The audio is converted to text format
  4. The text is processed further to data that is understood by the machine
  5. The data is then converted to audio format
  6. The machine reverts by playing the audio file

Brief History of NLP

The history of NLP is divided into four phases, each with a different style and concern.

  1. Machine Translation Phase: This phase lasted from the late 1940s to the 1960s and concentrated mainly on machine translation. It was a period of optimism; research had already kickstarted in the 1950s. In 1954, the Georgetown-IBM experiment demonstrated automatic translation from Russian to English. The same year, the journal Machine Translation began publication, and in the following years international conferences on machine translation were held.
  2. AI-Influenced Phase: This phase lasted from the late 1960s to the 1970s. The work done during this phase focused on world knowledge and its role in the construction and manipulation of meaning representations; hence it is known as the AI-flavored phase. The phase began around 1961 with work on the problems of addressing and constructing data, and it was heavily influenced by AI. The BASEBALL question-answering system appeared in the same year; its input was restricted, and only simple language processing was involved.
  3. Grammatico-logical Phase: This phase lasted from the 1970s to the late 1980s. Because the previous phase had not succeeded in building practical systems, researchers turned to reasoning in AI and used logic for knowledge representation. During this phase, many practical resources and tools were developed, including natural language tools with operational systems.
  4. Lexical & Corpus Phase: In this phase, the grammar that appeared in the late 1980s was given a lexicalized approach, which proved highly influential. The introduction of machine learning concepts for language processing revolutionized NLP in this decade.

The Processes of NLP

The processes of NLP

The logical steps involved in Natural Language Processing are as follows: 

1 – Morphological Processing: The first phase of NLP is also known as the lexical analysis phase. Its primary function is to break the input into sets of tokens corresponding to sentences and words. In short, the input words are broken down by separating out all suffixes and prefixes.

For example, the word “unsettled” is divided into the sub-words “un” and “settled”.

2 – Syntax Analysis: The second phase of NLP mainly checks the syntax. Its purpose is twofold:

To check that the sentence is well formed, and to break it up into a structure that shows the syntactic relationships between the different words. It checks the basic grammar rules.

For example, 
Boy the goes to store 

This is syntactically wrong, since the determiner “the” cannot come directly after the noun it modifies.

3 – Semantic Analysis: It is the third phase of NLP. The main goal of this phase is to extract the dictionary meaning from the text.

Natural Language Processing Pyramid

The text is checked for meaning. Even if it has correct syntax, it would be useless if it has no meaning.

For example, 
She is wearing a colorless green dress

How colorless can a green dress be?

4 – Pragmatic Analysis: This is the fourth and final phase of NLP. This phase fits the actual objects or events in the given context to the object references obtained during the previous phase.

For example, “Put the banana in the basket on the shelf” can have two interpretations. 

It can be interpreted as either “put the banana into the basket that is on the shelf” or “take the banana that is in the basket and put it on the shelf.”

Python

Python Logo

Python is a general-purpose programming language that is used worldwide. It is a high-level, interactive, object-oriented language that is versatile, concise, and easy to read, so it can be applied to almost any task.

Unlike many other languages, Python relies less on punctuation and more on English keywords, which makes it easier to read. It has countless libraries and packages, which makes Python a good choice for web development, data science, machine learning, and more.

Benefits of Python

  • Python is a language for beginner-level programmers. It also supports the development of a wide range of applications.
  • We do not need to compile our program in Python before executing it. Python code is processed at runtime by the interpreter.
  • Python is an interactive platform. We can sit at the prompt and interact with the interpreter directly to write our programs.
  • Python has fewer keywords, a simple structure, and defined syntax. It helps to learn the language quickly.
  • Python supports an object-oriented technique of programming that encapsulates code within objects, making it easy to maintain.
  • Python’s libraries are portable and compatible with various platforms.
  • Python allows interactive testing and debugging code parts, also called interactive mode.

Installation of Python for NLP

Many systems have Python already installed in them by default.

To check whether your Windows machine has Python installed, either type python in the search bar or open the command line and run the following command:

python --version

If Python is not installed on your computer, go to https://www.python.org/downloads/ and select the latest version for Windows.

  • Click on the Downloaded File
  • Select Customize Installation
  • And then click next,

On the next screen, 

  • Select the advanced options
  • Give a Custom install location
  • Click Install

Click the close button once the installation is done.

Here is the image illustrating the step-by-step procedure for Python3 installation.

How to Install Python3 on Windows Step-By-Step with illustrations.

Why do we use Python for NLP?

Python is a straightforward programming language, but it is also powerful. It has great functionality for processing linguistic data.

It has a shallow learning curve, transparent syntax and semantics, and good string-handling functionality. Hence, Python is the most preferred implementation language for NLP.

Python has an extensive standard library that includes numerical, web-data, and graphical processing components. It is also used for scientific research and is mainly known for its productivity, quality, and software maintainability. Many industries also use Python.

Python is a scripting language that supports interactive exploration. It allows variables to be typed dynamically, which further helps rapid development. As an object-oriented language, Python allows data and methods to be encapsulated and reused easily.

Its toolkit, NLTK (discussed below), easily accommodates new components, whether or not those components extend existing functionality, and the toolkit itself is well organized.

NLTK

Python provides NLTK(Natural Language Toolkit), a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.

NLTK was developed keeping some basic requirements in mind, which included:

  • Consistency
  • Simplicity
  • Extensibility
  • Modularity
  • Good documentation

NLTK organization

The toolkit can accommodate new components easily, even if they replicate or extend existing functionality. It provides standard interfaces for performing tasks such as tagging, parsing, and tokenization, and primary classes to represent data relevant to NLP. Its documentation covers nomenclature, data structures, and implementations.

Different components of the toolkit interact using well-defined interfaces. Individual projects can be completed using only small parts of the toolkit, which helps students learn to use it incrementally.

NLTK is organized into task-specific packages. Each package combines data structures for representing certain kinds of information with implementations of standard algorithms.

Python Libraries for NLP

Natural language processing projects used to require people with deep knowledge of machine learning, mathematics, and linguistics. Nowadays, developers rely on ready-made tools that simplify text processing so that the main focus can remain on building machine learning models.

Many Python tools and libraries are available to solve natural language processing problems. Python is considered an excellent choice for implementing NLP: its transparent semantics and simple syntax make it well suited for NLP tasks, and it integrates easily with other languages.

Python helps developers with its extensive collection of NLP tools and libraries, which help NLP-related tasks, including 

  • document classification, 
  • topic modeling, 
  • sentiment analysis, 
  • part-of-speech(POS) tagging, and 
  • word vectors. 

The eight most widely used Python libraries listed below can help us deliver good quality projects.

Python Libraries for NLP

Let’s discuss them further.

1) Natural language Toolkit – NLTK

NLTK is an essential tool for natural language processing and machine learning. It is one of the essential libraries in Python and helps with

  • classification, 
  • Parsing, 
  • tagging, and 
  • tokenization tasks. 

NLTK library was introduced at the University of Pennsylvania by Steven Bird and Edward Loper. NLTK is used around the globe by many universities in their courses. For more: https://www.nltk.org/

2) TextBlob

For developers who are starting out with NLP in Python, TextBlob is quite essential. It provides a friendly environment for beginners and an easy interface for learning and performing most NLP tasks, making it one of the best choices for NLP beginners. For more: https://textblob.readthedocs.io/en/dev/

3) CoreNLP

The CoreNLP library is written in Java and was developed at Stanford University. It is often used by developers trying their hand at NLP in Python: when integrated with NLTK, CoreNLP’s components can boost NLTK’s efficiency. For more: https://stanfordnlp.github.io/CoreNLP/

4) Gensim

Identifying semantic similarity between two documents through topic modeling is the specialty of the Gensim Python library. Unlike packages that only target batch and in-memory processing, it supports efficient data streaming and incremental algorithms.

Its key features include excellent memory-usage optimization and processing speed, and on top of that, its vector space modeling capabilities are the cherry on the cake. For more: https://github.com/RaRe-Technologies/gensim

5) SpaCy 

SpaCy is a relatively new library designed for production use. It makes NLP more accessible than other Python libraries such as NLTK. SpaCy has one of the fastest syntactic parsers available, and its toolkit is written in Cython, making it efficient and fast. For more: https://spacy.io/

6) Polyglot

Polyglot is a lesser-known library, but it offers a broad range of analyses and impressive language coverage. It is similar to SpaCy in terms of efficiency; it differs in that it is driven through dedicated commands on the command line. For more: https://polyglot.readthedocs.io/en/latest/index.html

7) Scikit-learn

The Scikit-learn library is handy for machine learning developers since it provides many algorithms. Developers make the most of its features because it has excellent documentation. It provides functions that use the bag-of-words method to solve text classification problems.

8) Pattern

Pattern is another important NLP library for Python developers handling natural language. Pattern helps us with

  • vector space modeling, 
  • clustering, 
  • WordNet, and 
  • part-of-speech tagging. 

It is primarily a web-mining tool, so it may not be sufficient on its own for other natural language processing tasks. For more: https://www.clips.uantwerpen.be/pages/pattern

NLTK

Python NLP with NLTK

NLTK is a beautiful tool for working in computational linguistics using Python and a great library to play with NLP. Many open-source NLP tools are available, but NLTK, the Natural Language Toolkit, scores high for its ease of use and concept explanation. Python has an easy and speedy learning curve, and since NLTK is written in Python, the two form a perfect learning kit.

NLTK (Natural Language Toolkit) is a complete package that contains libraries and programs for statistical language processing. NLTK covers tokenization, lemmatization, punctuation handling, character counts, stemming, and word counts. It is a free, open-source, community-driven project.

NLTK is considered the most powerful NLP library; it contains packages that help machines understand human language and reply with an appropriate response.

Installing NLTK

Python is a prerequisite for installing NLTK. We have already discussed the installation of Python; now, let’s install NLTK. Below are the steps for the installation procedure –

  • Firstly, we need to open the command prompt and then go to the location of the pip folder.
  • Then hit the following command to install –

pip install nltk


Installation should be done successfully.

The next step is to open the Python shell from the Start menu and enter the following command to verify whether NLTK has been installed:

import nltk

We have successfully installed NLTK on our machine if no error occurs.

Downloading NLTK’s Dataset and Packages

Once NLTK is installed on our computers, we must download its datasets (corpus). With the following commands, we can download all the NLTK datasets −

import nltk 
nltk.download()

Downloading NLTK Dataset and Packages

We will get this NLTK downloading window.

NLTK downloading window with Collections

Click on the download button to start the downloads.
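If you prefer not to download every collection, you can fetch only the resources the examples in this guide rely on. The following is a minimal sketch; the package identifiers below (punkt, stopwords, wordnet, averaged_perceptron_tagger) are standard NLTK resource names, but check the downloader window if a name differs in your NLTK version.

import nltk

# Download only what the later examples need instead of the full collection.
nltk.download('punkt')                        # tokenizers used by word_tokenize / sent_tokenize
nltk.download('stopwords')                    # stopword lists
nltk.download('wordnet')                      # WordNet corpus for lemmatization and synsets
nltk.download('averaged_perceptron_tagger')   # POS tagger used by nltk.pos_tag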

Running the NLTK script

Let’s see an example to understand how to run the NLTK script in which we will be implementing tokenization. The nltk.tokenize is the package provided by the NLTK module to achieve the process of tokenization. 

Tokenizing sentences into words: Splitting a sentence into words or generating a list of words from a string is necessary for any text processing operation.

Let’s check this with the help of various modules provided by the nltk.tokenize package.

For basic word tokenization, word_tokenize module is used.

Start by importing the natural language toolkit(nltk).

import nltk

Then we need to import the word_tokenize class

from nltk.tokenize import word_tokenize

Now, input the sentence we want to tokenize:

word_tokenize("Code Part Time provides high quality tutorials")

The output we get is –

['Code', 'Part', 'Time', 'provides', 'high', 'quality', 'tutorials']

Components of NLP

Word Level Analysis – Regular expressions

Regular expressions (RE) are used to find a set of strings with the help of a specialized syntax held in a pattern. A regular expression is a language for specifying text search strings.

In MS Word, regular expressions can be used for advanced find-and-replace, and many search engines use regular expression features for searching strings.

Some important Characteristics of Regular Expressions:

  • A Regular Expression is a formula in a special language, used for specifying a sequence of symbols and simple classes of strings. It is an algebraic notation for characterizing a set of strings.
  • An American mathematician named Stephen Cole Kleene formalized the regular expression language.
  • Two main things are required for RE: the pattern we need to search for, and the corpus of text we want to search in.
What is a Regular Expression

In mathematical terms, Regular Expressions are defined as follows:

– φ is a Regular Expression denoting the empty language.

– ε is a Regular Expression denoting the language that contains only the empty string.

– If A and B are Regular Expressions, then

  A, B

  A.B (the concatenation of A and B)

  A+B (the union of A and B)

  A*, B* (the Kleene closure of A and B) are also Regular Expressions.

Any expression obtained by applying the rules mentioned above is likewise a Regular Expression.
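As a small illustration of the two ingredients mentioned above (a pattern and a corpus of text), here is a minimal sketch using Python's built-in re module; the sample text and pattern are invented for demonstration and are not part of the original article.

import re

# The corpus of text we want to search in (a made-up sample).
corpus = "NLP helps computers process text. Processing text needs patterns."

# The pattern we need to search for: any word starting with "process" (case-insensitive).
pattern = r"\bprocess\w*"

matches = re.findall(pattern, corpus, flags=re.IGNORECASE)
print(matches)   # ['process', 'Processing']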

Syntactic Analysis

Syntactic Analysis is a phase of NLP whose primary function is to analyze the grammatical structure of the text. The syntactic analyzer compares the text to formal grammar rules and checks whether it is well formed.

It is defined as analyzing the strings of symbols in natural language according to formal grammar rules, also known as Parsing.

Syntactic Analysis - Top-Down Bottom-Up

Let’s see the two types of Parsing 

  1. Top-down Parsing – It uses a recursive procedure to process the input. The parser builds the parse tree starting from the start symbol and tries to transform the start symbol into the input. The main disadvantage of this recursive-descent approach is the need for backtracking.
  2. Bottom-up Parsing – In this kind of Parsing, the smallest parts are at the bottom of the upside-down tree, and larger structures sit in successively higher layers until a single unit at the top, or “root”, of the tree describes the entire input stream. A bottom-up parser discovers and processes that tree starting from the bottom-left end and works its way incrementally upwards and rightwards.

Derivation

A derivation is a sequence of production rules applied one after another to obtain the input string. During Parsing, we need to decide which non-terminal to replace and which production rule to apply.

Types of Derivation

Types of Derivation

  1. Left-most Derivation: The sentential form of the input is scanned and replaced from left to right. The sentential form in this case is known as the left-sentential form.
  2. Right-most Derivation: In this type of derivation, the sentential form of the input is scanned and replaced from right to left. The sentential form is known as the right-sentential form.

Grammar

Grammar is essential to describe the syntactic structure of well-formed programs. It also denotes syntactical rules for conversation in natural languages. Linguists have attempted to define grammar since the inception of natural languages like English, Chinese, Hindi, etc. 

The theory of formal languages is also applied in computer science, mainly in programming languages and data structures. For example, precise grammar rules in the C language state how functions are made out of lists of declarations and statements.

There are various types of grammar used in the field of computer science. One of them is Context-free grammar. (CFG)

Context-free grammar, also called CFG, is a notation for describing languages. It is a superset of Regular grammar. Let’s see it in the following diagram −

Context-Free Grammar (CFG)

CFG consists of a finite set of grammar rules with the following four components −

  1. Set of Non-terminals: It is represented by V. Non-terminals are syntactic variables that denote sets of strings, and these sets help to further define the language generated by the grammar.
  2. Set of Terminals: Terminals, also called tokens, are represented by Σ. Strings are formed from these basic symbols.
  3. Set of Productions: It is represented by P. This set defines how the terminals and non-terminals can be combined. Every production (P) consists of a non-terminal, an arrow, and a sequence of terminals and/or non-terminals: the non-terminal forms the left side of the production, and the sequence forms the right side.
  4. Start Symbol: Derivations begin from the start symbol, represented by S. A non-terminal symbol is always designated as the start symbol.
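To make these four components concrete, here is a minimal sketch using NLTK's CFG class with a toy grammar; the grammar rules and the sentence are invented for illustration and are not taken from the original article.

import nltk

# A toy context-free grammar: S is the start symbol, NP/VP/DT/NN/V are non-terminals,
# and the quoted words are terminals.
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> V NP
DT -> 'the'
NN -> 'boy' | 'store'
V -> 'sees'
""")

# Parse a sentence generated by the grammar and print its tree.
parser = nltk.ChartParser(toy_grammar)
for tree in parser.parse("the boy sees the store".split()):
    print(tree)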

Semantic Analysis

Semantic Analysis is the process of extracting meaning from text. It assists computers in comprehending and interpreting phrases, paragraphs, and entire documents by evaluating their grammatical structure and establishing links between individual words in a given context.

It is a critical subtask of Natural Language Processing (NLP) and is at the heart of machine learning tools such as search engines, chatbots, and text analysis.

Semantic analysis-driven solutions can assist businesses in extracting relevant data from unstructured data sources such as emails, support issues, and consumer comments.

Lexical Semantics

Lexical semantics is the study of the meanings of individual words. It covers nouns, adjectives, adverbs, affixes (sub-word units), compound nouns, and phrases. Words, sub-words, and affixes are collectively referred to as lexical items.

We can also define lexical semantics as the relationship between lexical items, the meaning of sentences, and the syntax of sentences. 

Working of Semantics Analysis:

Lexical semantics plays a crucial role in semantic analysis. It allows machines to understand relationships between lexical items (words, phrasal verbs, etc.):

  • Hyponyms: The specific lexical items of a generic lexical item (hypernym), e.g., orange is a hyponym of fruit (hypernym).
  • Meronomy: A logical arrangement of text and words that denotes a constituent part of or member of something, e.g., a segment of an orange
  • Polysemy: Words or phrases whose senses are slightly different but share a core meaning (e.g., I read a paper, and then I wrote a paper)
  • Synonyms: Some words that have the same sense or nearly the same meaning as another, e.g., happy, content, ecstatic, overjoyed
  • Antonyms: The words that have close to opposite meanings, e.g., happy, sad
  • Homonyms: Any two words that sound the same and are spelled alike but have a different meaning, e.g., orange (color), orange (fruit)
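Several of these relations can be explored programmatically with NLTK's WordNet interface (covered in more detail later in this guide). The following is a minimal sketch; it assumes the wordnet corpus has been downloaded, and the chosen words are only examples.

from nltk.corpus import wordnet

# Take the first synset for "orange".
orange = wordnet.synsets('orange')[0]

# Hypernyms: the more generic concepts that this sense of "orange" belongs to.
print(orange.hypernyms())

# Hyponyms of "fruit": more specific kinds of fruit.
fruit = wordnet.synset('fruit.n.01')
print(fruit.hyponyms()[:5])

# Synonyms: lemma names that share the synset of "happy".
print(wordnet.synset('happy.a.01').lemma_names())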

Semantic Analysis Techniques

Depending on the type of information one would like to obtain from data, one can use one of two semantic analysis techniques:

  1. Text classification model: This analysis model assigns predefined categories to text.
  2. Text extractor: This semantic analysis technique pulls out specific information from the text.

Types of Classification Models:

  • Topic classification: The text is sorted into predefined categories based on its content. For example, customer care representatives may wish to categorize support tickets as they arrive at their help desk; based on its semantic content, a machine learning model can determine whether a ticket should be categorized as a “Payment issue” or a “Shipping issue”.
  • Sentiment analysis: This classification method detects positive, negative, or neutral emotions in a text to gauge urgency (see the sketch after this list). For example, tag Twitter mentions by sentiment to get a sense of how customers feel about your brand and identify disgruntled customers in real time.
  • Intent classification: The text is classified based on what customers want to do next. You can tag sales emails as “Interested” and “Not Interested” to proactively reach out to those who may want to try your product.
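As a small illustration of sentiment classification, here is a hedged sketch using NLTK's built-in VADER analyzer; it assumes the vader_lexicon resource has been downloaded, and the example sentence is invented.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')   # one-time download of the VADER lexicon

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The delivery was fast and the support team was wonderful!")
print(scores)   # a dict with 'neg', 'neu', 'pos', and 'compound' scores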

Types of Extraction Models

  • Keyword extraction: It includes finding relevant words and expressions in a text. This method is used alone or alongside one of the above methods to gain more granular insights. For instance, one can analyze the keywords in many tweets that have been categorized as “coding” and detect which words/topics are mentioned most often.
  • Entity extraction: This method identifies the named entities in text, like names of people, companies, places, etc. For example, A customer care staff may find it beneficial to automatically collect product names, shipment numbers, email addresses, and any other pertinent data from customer support tickets.
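A basic form of entity extraction can be sketched with NLTK's named-entity chunker; this is a minimal illustration, it assumes the maxent_ne_chunker and words resources have been downloaded, and the sample sentence is invented.

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Steven Bird works at the University of Pennsylvania in Philadelphia."

# POS-tag the tokens, then run the named-entity chunker over the tagged words.
tree = ne_chunk(pos_tag(word_tokenize(sentence)))

# Print only the subtrees labeled as named entities (PERSON, ORGANIZATION, GPE, ...).
for subtree in tree.subtrees():
    if subtree.label() != 'S':
        print(subtree.label(), [word for word, tag in subtree.leaves()])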

What is Tokenizing?

Tokenizing is breaking up a piece of text into smaller pieces, such as sentences and words. For example, a word can be a token in a sentence, while a sentence can be a token in a paragraph. NLP builds applications that include QA systems, language translation, voice systems, and more; therefore, to build them it is necessary to understand the text pattern.

Tokenization is necessary to understand these patterns. Hence, it is the base step for further steps such as lemmatization and stemming.

nltk.tokenize is the package to achieve the process of tokenization provided by the NLTK module.

NLTK tokenization further comprises sub-modules –

tokenization - word and sentence

The Natural Language toolkit has two fundamental modules

  1. word tokenize
  2. sentence tokenize

Tokenization of words

The word_tokenize() method splits a sentence into words. The output of word tokenization can be converted to a DataFrame for better text understanding in machine learning applications.

Machine learning models mostly need numeric data to be processed in order to make predictions. Tokenization of words therefore becomes critical during the conversion of text (strings) to numeric data. It can also be used for text-cleaning processes such as punctuation removal, numeric-character removal, or stemming.

Let us see an example,

import nltk 
from nltk.tokenize import word_tokenize  
word_tokenize("This is a beautiful day.")

The Output of the code above will be –

['This', 'is', 'a', 'beautiful', 'day', '.']

Explanation of the above code:

  • Firstly, we need to import the natural language toolkit(nltk) and the word_tokenize class.
  • Then, input the sentence you want to convert to tokens. This function also separates punctuation marks from words, as can be seen in the output.
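As mentioned earlier, the tokenized output can be turned into a DataFrame for further analysis. The following is a minimal sketch; it assumes the pandas library is installed, which is not otherwise covered in this guide.

from nltk.tokenize import word_tokenize
import pandas as pd

tokens = word_tokenize("This is a beautiful day.")

# One row per token; such a column can then be joined with labels or other features.
df = pd.DataFrame({'token': tokens})
print(df)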

Tokenization of Sentences

Tokenization of sentences is done by the sub-module sent_tokenize. What if we need to count the average number of words per sentence?

We need both the NLTK sentence tokenizer and the word tokenizer to calculate that ratio. The result is numeric, so it can serve as a feature for machine training. (A small sketch of this calculation follows the example below.)

For example

from nltk.tokenize import sent_tokenize 
text1 = "God is Great! I won a lottery." 
print(sent_tokenize(text1))
tokenization of sentences

The input has a total of seven words and two sentences.

The output of the code above will be –

['God is Great!', 'I won a lottery.']

Explanation of the above code:

  • The first line imported the sent_tokenize module. 
  • Later the sentence tokenizer in the NLTK module parsed the sentences and showed the output.

Hence, sent_tokenize helps to break the text into individual sentences.
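Returning to the question raised above, here is a minimal sketch of computing the average number of words per sentence by combining sent_tokenize and word_tokenize; the sample text is the one used in the example above.

from nltk.tokenize import sent_tokenize, word_tokenize

text1 = "God is Great! I won a lottery."

sentences = sent_tokenize(text1)
words_per_sentence = [len(word_tokenize(s)) for s in sentences]

# Average number of word tokens per sentence (punctuation counts as tokens here).
average = sum(words_per_sentence) / len(sentences)
print(average)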

Stats in NLP

Statistical NLP derives conclusions from large amounts of data. Statistical natural language processing uses machine learning algorithms to train natural language processing models.

After successful training on a significant volume of data, the trained model produces favorable results during inference. Statistical NLP is easy to scale and allows faster development. Let's look at some of its features.

The Frequency Distribution

Frequency Distribution Paragraphs

Take the paragraphs above (as shown in the image) as input for finding the frequency. Here's the code to determine the frequency of words in our text.

from nltk.probability import FreqDist 

# 'words' is assumed to be the list of word tokens produced earlier with word_tokenize.
fdist = FreqDist(words) 
print(fdist.most_common(10))

Explanation of the above code:

  • We need to import the required libraries and then find the frequency. 
  • NLTK has a FreqDist class that gives you the frequency of words within a text.

NLTK's FreqDist accepts any iterable. If you pass it a string, it iterates character by character and counts characters instead of words.

  • To count words, you need to feed FreqDist words.
  • The last step is to print the ten most common words.

The output of the code above will be –

Frequency Distribution Result

The most commonly used words are punctuation marks and stopwords. We need to remove such words to analyze the actual text.

Plot The Frequency Graph

Here is a graph plotting code to visualize the word distribution in the text. 

from nltk.probability import FreqDist 
import matplotlib.pyplot as plt 

# 'words' is the tokenized word list from the earlier example.
fdist = FreqDist(words) 
fdist.plot(10)

The output of the code above will be –

Plot the Frequency Graph

The above graph shows that the period “.” is used nine times in the text. Punctuation marks aren’t significant when it comes to processing natural language. Therefore, we shall be removing such punctuation marks in the next step. 

Removing Punctuation Marks

Now, let us remove the punctuation marks, as they are not very useful. We will use the isalpha() method to separate the punctuation marks from the actual text. The isalpha() method returns True if all the characters in a string are alphabet letters (a-z).

The if statement checks each word using the isalpha() method. Additionally, we create a new list called words_no_punc to contain the words in lowercase but without punctuation marks.

To store the words in lowercase, we append them to the words_no_punc list using the lower() method.

Then we print the words and, finally, the total count.

# 'words' is the tokenized word list from the earlier example.
words_no_punc = [] 
for w in words: 
	if w.isalpha(): 
		words_no_punc.append(w.lower()) 
print(words_no_punc) 
print(len(words_no_punc))

The output of the code above will be –

Removing Punctuation Marks Words

The entire punctuation marks from our text are excluded. These can also be cross-checked with the number of words. 

Plotting graph without punctuation marks

Here is a graph code to visualize the word distribution in the text without the punctuation marks.

from nltk.probability import FreqDist 

# Build the frequency distribution over the lowercase words without punctuation.
words_no_punc = [] 
for w in words: 
	if w.isalpha(): 
		words_no_punc.append(w.lower()) 
fdist = FreqDist(words_no_punc) 
print(fdist.most_common(10)) 
fdist.plot(10)

We use the same function, FreqDist, which gives the frequency of words within the text. In the above code, we use the plot function to plot the graph of the most common words without punctuation marks. 

The Output will be –

[('and', 7), ('the', 7), ('little', 5), ('a', 4), ('was', 4), ('pig', 4), ('he', 4), ('house', 4), ('to', 3), ('out', 3)]

Plotting graph without punctuation marks

What is Word Cloud?

A Word Cloud is a data visualization technique. Words from the given text are displayed on a chart: the essential or most frequent words appear in a bolder and larger font, while the less frequent or less essential ones appear in a smaller font.

In NLP, it is a very helpful technique that gives us an idea of how the words in a text can be analyzed. Some of the WordCloud properties are listed below (a parameter sketch follows this list):

  • font_path: Specifies the path for the fonts we want to use.
  • width: Determines the width of the canvas.
  • height: Determines the height of the canvas.
  • min_font_size: Indicates the smallest font size to use.
  • max_font_size: Specifies the largest font size to use.
  • font_step: Specifies the step size for the font.
  • max_words: Specifies the maximum number of words on the word cloud.
  • stopwords: Words that will be excluded from the word cloud.
  • background_color: Indicates the background color for canvas.
  • normalize_plurals: Removes the trailing “s” from words.
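Here is a minimal sketch showing several of these parameters passed to the WordCloud constructor; the parameter values and the sample text are chosen only for illustration.

from wordcloud import WordCloud, STOPWORDS 
import matplotlib.pyplot as plt 

text = "little lazy house built pit lazy enough worked old upon"

# Configure the canvas size, word limit, stopword list, and background color.
wordcloud = WordCloud(width=800, height=400,
                      max_words=50,
                      stopwords=set(STOPWORDS),
                      background_color="white",
                      normalize_plurals=True).generate(text)

plt.imshow(wordcloud) 
plt.axis("off") 
plt.show()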

Implementation of Word cloud

Let’s plot a graph to visualize the implementation of Word Cloud.

from wordcloud import WordCloud 
import matplotlib.pyplot as plt  
text= "little lazy house built pit lazy enough worked old upon"
wordcloud = WordCloud().generate(text) 
plt.figure(figsize = (12, 12)) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.show()

Explanation of Steps in the above code –

  • We will start by importing the wordcloud library.
  • The matplotlib.pyplot module is a state-based interface to matplotlib. It provides a MATLAB-like way of plotting and is imported here as plt.
  • After that, we will generate the wordcloud using generate().
  • We have used the plt to plot the wordcloud and figure() function to create a new figure present in plt.
  • Finally, the imshow() function displays data as an image.

The output of the code above will be –

Implementation of Word Cloud

In the graph, we can see that the most frequent words are displayed in larger fonts.

Word clouds are quick to produce and very simple to understand. They are visually appealing and engaging, but they lack the context of the words.

What is Stemming?

Stemming is used to normalize words by truncating them to their base form. In simple words, it is a process in which the base word is extracted by removing affixes from the actual word.

Depending on the context used, a single word can be taken in multiple forms in English and many other languages.

Let’s take an example of the verb “study“. It can be used in many forms like “studied“, “studying“, and “studies,” depending upon the context.

While tokenizing, the interpreter counts these forms as different words even though they share the same underlying meaning.

An example of Stemming in NLP

According to the steps of NLP, the meaning of the content has to be analyzed, and Stemming is used for this purpose. The process does not necessarily produce an exact grammatical or dictionary word for a particular group of words. Stemming is mainly used in search engines for indexing words.

Instead of storing all forms of a word, a search engine stores only the base word, which reduces the index size and increases retrieval accuracy. Let's have a look at an example:

import nltk 
from nltk.stem import PorterStemmer 
word_stemmer = PorterStemmer() 
word_stemmer.stem('writing')

Explanation of the above code:

  • We first need to import the Natural language Toolkit-nltk. For stemming a word, we need some algorithm.
  • So we need to import the PorterStemmer class to implement the Porter Stemmer algorithm.
  • The next step is to create an instance of the Porter Stemmer class.
  • Finally, we need to input the word we want to stem.

The output of the code above will be –

write

Various Stemming Algorithms

Various Stemming Algorithms

NLTK provides stemmers through the stem() method defined on a common interface. This interface is implemented by stemmers such as the

  1. PorterStemmer
  2. LancasterStemmer
  3. RegexpStemmer, and 
  4. SnowballStemmer

1) Porter Stemming algorithm

The Porter stemming algorithm is known as the most common algorithm. It mainly eliminates and replaces the common suffixes of English words. 

PorterStemmer class

Often, the stem is a shorter word having a similar root meaning. The PorterStemmer class helps to stem the words with the porter stemmer algorithms. This class has a vast knowledge of the various word forms and their suffixes, which helps it transform the input word into the final stem word. 

Example

import nltk 
from nltk.stem import PorterStemmer 
word_stemmer = PorterStemmer() 
word_stemmer.stem('writing')

Explanation of the above code:

  • The first essential step is to import the natural language toolkit(nltk)
  • Then we import the PorterStemmer class, which will implement the Porter Stemmer algorithm.
  • Later, we create an instance of Porter Stemmer’s class. 
  • Finally, we give the word we want to stem.

The output of the code above will be –

'write'

2) Lancaster stemming algorithm

As the name suggests, it was developed at Lancaster University and is another very common stemming algorithm.

LancasterStemmer class

LancasterStemmer class helps to implement the Lancaster stemming algorithms for the word we want to stem.

Example

import nltk 
from nltk.stem import LancasterStemmer 
Lanc_stemmer = LancasterStemmer() 
Lanc_stemmer.stem('helps')

The output of the code above will be –

'help'

3) Regular Expression stemming algorithm

This stemming algorithm helps us to customize and create our stemmer.

RegexpStemmer class

The RegexpStemmer class helps us implement the Regular Expression stemming algorithm. It takes a single regular expression and removes any prefix or suffix that matches the expression.

Example

import nltk 
from nltk.stem import RegexpStemmer 
# RegexpStemmer needs the regular expression to strip; here we remove "ing".
Reg_stemmer = RegexpStemmer('ing') 
Reg_stemmer.stem('ingeat')

The output of the code above will be –

'eat'

4) Snowball stemming algorithm

Snowball stemming algorithm is one of the convenient stemming algorithms.

SnowballStemmer class

Snowball stemming algorithms are implemented using the SnowballStemmer class, which supports stemming in more than a dozen languages besides English. To use this class, we create an instance with the name of the language we want and then call the stem() method.

Example

import nltk 
from nltk.stem import SnowballStemmer 
French_stemmer = SnowballStemmer('french') 
French_stemmer.stem('Bonjoura')

The output of the code above will be –

'bonjour'

Exploring Lemmatization: Stemmer vs. Lemmatizer

The Lemmatization technique is similar to stemming. The output is called a 'lemma', which is the root word.

Exploring Lemmatization

It finds the dictionary word rather than truncating the original word. NLTK has a WordNetLemmatizer class, which acts as a thin wrapper around the WordNet corpus.

To find the lemma, it uses the morphy() function of the WordNet CorpusReader class.

Example –

import nltk 
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
lemmatizer.lemmatize('books')

Explanation of the above code:

  • The preliminary step is to import the natural language toolkit(nltk)
  • Then, we will have to import the WordNetLemmatizer class to implement the lemmatization technique.
  • We will proceed by creating an instance of WordNetLemmatizer class.
  • The last step is to call the lemmatize() method and pass in the word whose lemma you want to find.

The output of the code above will be –

'book'

Difference between stemming and Lemmatization examples

Let us understand it with one more example using PorterStemmer:

from nltk.stem import PorterStemmer 
stemmer = PorterStemmer() 
print (stemmer.stem('studies'))

The output of the code above will be –

studi

Using the process of stemming, the word “studies” gets truncated to “studi“. 

Let us see the same example on using Lemmatization,

from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
print(lemmatizer.lemmatize('studies'))

The output of the code above will be –

study

With lemmatization, the word “studies” displays its dictionary word “study“. 

The PorterStemmer class cuts off the 'es' from the word, while the WordNetLemmatizer class finds the dictionary word.

Talking in simple terms, PorterStemmer(stemming) only looks at the form of the word, whereas lemmatization finds out the meaning of the word. We will always get a valid word after lemmatization. 
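To see the difference side by side, here is a small sketch that runs both the Porter stemmer and the WordNet lemmatizer over a few sample words; the word list is chosen only for illustration.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "feet", "corpora"]:
    # The stemmer truncates the surface form; the lemmatizer returns a dictionary word.
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))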

Stopwords

Many common words in our text data do not contribute to the meaning of a sentence. Such words are not necessary for natural language processing or even information retrieval. Stopwords like 'the' and 'a' are among the most commonly used.

NLTK stopwords corpus

The Natural Language Toolkit comes with a stopwords corpus containing word lists for many languages. Let us see the following example, using the English stopwords.

from nltk.corpus import stopwords 
english_stops = set(stopwords.words('english')) 
words1 = ['I', 'am', 'a', 'writer'] 
# Keep only the words that are not in the English stopword list.
print([word for word in words1 if word not in english_stops])

The output of the code above will be –

['I', 'writer']

Here is the list of Stopwords:

NLTK Stopwords corpus

POS(Parts-of-Speech) Tagging

Tagging can be defined as a kind of classification in which descriptions of tokens are automatically assigned. The descriptors, called tags, represent parts of speech such as nouns, verbs, adjectives, and conjunctions.

Parts-of-Speech (POS) tagging is the process of transforming a sentence, given as a list of words, into a list of tuples, where each tuple has the form (word, tag). POS tagging is thus the task of assigning one of the parts of speech to each given word.

Some of the NLTK POS Tags Examples are mentioned in the image below:

Parts of Speech Tagging List

Let us see an example –

import nltk 
from nltk import word_tokenize 
sentence1 = "I am going to College" 
print (nltk.pos_tag(word_tokenize(sentence1)))

The output of the code above will be –

[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('college', 'NN')]

Chunking and Chinking

Chunking –

Example of Content after chunking

Chunking is one of the most essential processes in natural language processing. It is a process in which we extract meaningful phrases from unstructured text.

Chunking works on top of parts-of-speech tagging. It groups simple text into phrases that are more meaningful than individual words, since it is hard to infer meaningful information just by tokenizing a bunch of words. The input we provide is the POS tags, and the output is the chunks.

The Process of Chunking

A phrase can be defined as a meaningful group of words. There are five kinds of phrases:

  1. Noun Phrases (NP)
  2. Verb Phrases (VP)
  3. Adjective Phrases (ADJP)
  4. Adverb Phrases (ADVP)
  5. Prepositional Phrases (PP)

Some rules are being defined for the phrase structure: 

  • S (Sentence) → NP VP
  • NP → {Determiner, Noun, Pronoun, Proper name}
  • VP → V (NP) (PP) (Adverb)
  • PP → Preposition (NP)
  • AP → Adjective (PP)

Let us look at an Example,

grammar= "NP : {<DT>?<]]>*<NN>}"  
parser = nltk. RegexpParser(grammar) 
output = parser.parse(tagged_words) 
print (output) 
output.draw()

The output of the code above will be –

(S A/DT very/RB (NP beautiful/JJ young/JJ lady/NN) is/VBZ walking/VBG on/IN (NP the/DT beach/NN))

Role of Phrases in the process of chunking

Explanation of the above code:

  • We need to Extract Noun Phrases from the text. 
  • We define '?' as an optional character and '*' as zero or more repetitions.
  • We will first create a parser. Then we will start by parsing the text.
  • At last, we need to get an output.

This is how we successfully extract noun phrases from the text.

Chinking

Chinking removes a small part from a chunk. In many scenarios we need only part of the text from the whole chunk, and in complicated situations chunking may leave unusable data. Chinking comes into the picture in such situations, excluding some of the chunked text.

Let’s check out an example. We will be taking the whole string as a chunk, and with the help of the chinking process, we will eliminate the adjectives from it.

Chinking is most commonly used when a lot of unusable data is left even after the chunking procedure. The chinking grammar is written inside inverted curly braces, i.e., }pattern{.

Let's see an example of Chinking –

grammar = "7"" NP: {<.*>+} 
}<]]>+{""" 
parser = nltk. RegexpParser(grammar) 
output = parser.parse(tagged_words) 
print (output) 
output.draw()

In the above Chinking example, we take the whole string as one chunk and then exclude the adjectives from that chunk.

Then we create a parser, parse the string, and finally display the output.

The output of the code above will be –

(S (NP A/DT very/RB) beautiful/JJ young/JJ (NP lady/NN is/VBZ walking/VBG on/IN the/DT beach/NN))

Chinking Example

What is WordNet?

WordNet

Princeton University created WordNet, a huge lexical database for the English language. It is included as part of the NLTK corpus collection.

Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms called synsets, and each synset expresses a distinct concept.

Some of its uses are as follows:

  • It helps find the meaning of a word.
  • It also helps us find synonyms and antonyms of a word.
  • For words that have multiple uses and definitions, it helps with word sense disambiguation.
  • Using WordNet, word similarities and relations can also be explored.

Wordnet can be imported with the following command

from nltk.corpus import wordnet

Synset instances

A cluster of synonymous words expressing the same concept is called a synset. Using WordNet, we get a list of Synset instances.

The ‘wordnet.synsets(word)‘ provides us the list of Synsets.

Let us look at an example below –

from nltk.corpus import wordnet as wnt 
syn = wnt.synsets('dog')[0] 
syn.name() 
syn.definition() 
syn.examples()

Explanation of the above code:

  • We will have to import Wordnet first. 
  • The second step would be to provide the word we want to look up the Synset for. 
  • We have used the name() method to get the unique name for the Synset.
  • We have used the definition() method to get the word’s definition.
  • We also have used the examples() method to get the examples of the word.

The output of the code above will be –

'dog.n.01' 'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds' ['the dog barked all night']
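Building on this, WordNet can also be used for the synonym, antonym, and similarity lookups mentioned earlier. Here is a minimal sketch; the words chosen are only examples.

from nltk.corpus import wordnet 

# Collect synonyms and antonyms of "happy" from its synsets.
synonyms, antonyms = [], []
for syn in wordnet.synsets('happy'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        for ant in lemma.antonyms():
            antonyms.append(ant.name())
print(set(synonyms))
print(set(antonyms))

# A simple path-based similarity score between "dog" and "cat".
print(wordnet.synset('dog.n.01').path_similarity(wordnet.synset('cat.n.01')))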

Applications of NLP (Natural Language Processing)

Natural Language Processing is used widely in this era of automation.

It is a fast-growing technology that emerges from various forms of AI to create interactive communication between machines and humans, and it is becoming essential for future applications. Let us see some applications of NLP:

1) Machine Translation

Machine translation is a significant application of NLP. At least once, we all have used machine translation to understand something written in a language we do not understand.

Machine translation translates text or speech from one source language into another language. Google Translate is a prevalent example of machine translation.

Machine Translation - an example of Google Translate

2) Spam Detection

Email Spam Detection

Spam detection techniques are used to detect unwanted emails in our mailboxes. Such emails are one of the biggest nuisances for users.

Promotional messages take up most of the space in our inboxes. Therefore, spam filters are a must as the first line of defense against these messages. NLP functionality can be used to develop spam filtering systems.

3) Question Answering

Alexa Devices - Question Answering Systems

Question answering applications are also essential applications of natural language processing. The primary focus of QA apps is to build systems that automatically answer questions asked by humans in their natural language.

Search engines like Google help us get the world's information at just a click, but they still lag behind when humans pose questions in their natural language.

A computer that understands our natural language must be able to translate sentences into an internal representation so that the system can generate reasonable answers.

4) Sentiment analysis

Sentiment analysis is used to recognize the sentiments in various posts. It also analyzes the sender's behavior, emotional state, and attitude. Many companies use sentiment analysis to find out the sentiments and opinions of their customers online, which helps them learn what customers think about their services and products.

This application runs with the help of natural language processing and statistics by assigning values to the text (positive, negative, or neutral) and identifying the mood of the context (happy, sad, angry, etc.).

5) Chatbot

Robotics Chatbot

Many companies use chatbots to provide customer chat services. NLP helps chatbots analyze, understand, and prioritize questions according to complexity, which helps them reply to customer questions faster.

6) Spelling Correction

MS Word - an example of Spelling Correction feature

Word processing software such as PowerPoint and MS Word also uses NLP for spelling correction.

7) Speech recognition

Speech recognition is a technique that helps us convert spoken words into text form. Each of us uses it in our daily routine. 

Speech Recognition of a boy by his mobile phone

Speech recognition algorithms are used in various applications, including mobile, video recovery, voice user interfaces, and home automation.

Pros and Cons of NLP Systems

Pros of NLP Systems –

  • As users, we can ask a question about any topic and get an answer within seconds.
  • The NLP system provides answers in natural language.
  • The NLP system does not provide unwanted and unnecessary answers; it gives us exact answers to our questions.
  • The NLP system helps us communicate with the computer in our natural language.
  • The more information provided in the question, the more accuracy we get in the answers.
  • NLP systems are very time efficient.
  • NLP is also used to improve the efficiency of documentation processes.
  • NLP helps in structuring a highly unstructured data source.

Cons of NLP Systems –

  • NLP systems cannot easily adapt to new domains and have limited functions; that is why they are built for single, specific tasks.
  • NLP systems lack a User interface, making it difficult for users to communicate with the system.
  • Complex Query Language: The NLP system will not be able to provide us with the correct answer if the question is poorly worded or unclear.
  • NLP systems may not show us context.

How does NLP impact the future?

Soon, Our computer systems and machines will be using NLP to learn from the information online and use it in the real practical world.

Also, when combined with natural language generation, systems will have more potential to receive and give valuable and resourceful data or information. However, that still requires lots of effort.

NLP and Bots

Chatbots help customers with questions and direct them to the relevant resources and products at any time of day.

Robots using NLP around its customers

Hence chatbots need to be intelligent and easy to use, especially in customer service, where customers have high expectations and little patience. With the help of NLP, chatbots understand language, so customers can communicate in their own words.

Through integration with semantic and other cognitive technologies that enable a more in-depth comprehension of human language, chatbots will be able to comprehend and respond to more sophisticated and longer-form requests and functions in multiple contexts, all in real-time.

This enhanced functionality will help other bots become more successful and precise over time, from virtual assistants like Cortana and Amazon’s Alexa to more automation- or task-oriented platforms.

Bots combined with NLP will help us understand the text and perform actions such as sharing geoinformation, retrieving links and images, or executing more complex actions. 

NLP and invisible UI

Human communication with machines, whether by text or conversation, is essential. Amazon's Echo is an excellent example of humans directly interacting with technology. The invisible UI relies on direct interaction between user and machine through methods like voice, text, or both.

As NLP gets better at understanding us, what we say, no matter how we say it, and what we are doing, it will be essential for any invisible or zero UI application.

Smarter Search

The future of NLP also includes smarter search, something we've been talking about for a long time.

The same skills that enable a chatbot to understand a customer’s request can also be used to offer a “search like you talk” feature (similar to how you might question Siri), rather than focusing on keywords or themes.

Intelligence from unstructured information

Understanding human language is compelling when applied to extract information and reveal meaning and sentiment in large amounts of text content (i.e., unstructured information), especially the types of content that people must manually examine.

A more adequate and accurate understanding between humans and machines will strengthen efficiency on both sides. NLP will be critical in recognizing the authentic voice of the user and customer and enabling more seamless engagement on any platform that uses language and human communication.


Conclusion

NLP's future is closely linked to the growth of AI. A major challenge for AI is processing human natural language; solving it would be a big step toward making our computers as intelligent as humans.

Natural language toolkits will become more effective. In the future, with the help of NLP, computers will be able to learn from the information available online and use it in the real world.

When combined with NLP, Computers will become more capable of providing more resourceful data or information. 


References –

WordNet Search – 3.1. http://wordnetweb.princeton.edu/perl/webwn?o8=1&o9=&s=dog&h=00000000&j=7

You may like to Explore –

Sneak peek into Artificial Intelligence – AI

Cheers!

Happy Coding.

About the Author

This article was authored by Rawnak.