
An Introduction to Natural Language Processing

Introduction

Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand, generate, and interact with human language. It is an interdisciplinary domain spanning computer science, psychology, and linguistics. Many of you have probably interacted with NLP without realizing it: virtual assistants like Siri, Alexa, and Cortana rely on it to both understand a user's request and respond in different human languages. Other NLP-powered tools include text summarizers, sentiment analysis, grammar checkers, and text translation. For example, while replying to an email or chat, the suggested replies are generated using NLP.



History

The field of Natural Language Processing began in the 1940s, after World War II. People wanted to understand each other across languages, but translation by hand was slow and difficult, so researchers tried to build machines that could translate between languages automatically for smooth communication. In 1950, Alan Turing published the article "Computing Machinery and Intelligence," which proposed what is now called the Turing test. In 1954, the Georgetown-IBM experiment demonstrated the first machine translation system, translating Russian sentences into English.


In the 1970s, researchers split NLP into two camps: symbolic and stochastic. Symbolic researchers focused on formal languages and generative syntax, work that fed into the beginnings of AI, while stochastic researchers were interested in statistical and probabilistic methods, working on pattern recognition in text. The introduction of corpora and large language datasets led to statistical language models and the application of techniques such as Hidden Markov Models and probabilistic context-free grammars.


In recent years, deep learning has revolutionized NLP. Embedding models like Word2Vec and GloVe, and transformer-based models like GPT and BERT, have pushed the boundaries of NLP performance, enabling machines to understand and generate language with greater accuracy and fluency.


Types of Natural Language Processing


Information Extraction: NLP searches for and retrieves specific information from large collections of unstructured data, as web search engines do.


Lemmatization and Stemming: These techniques reduce words to their base form, like "Jumping" to "Jump." The two use different procedures: lemmatization is generally the better choice, since a lemma is always a meaningful word, while stemming can produce truncated non-words.
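
The difference is easy to see with NLTK's PorterStemmer and WordNetLemmatizer. A minimal sketch, assuming NLTK is installed and the WordNet corpus has been downloaded:

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # dictionary used by the lemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["jumping", "studies", "easily"]:
        print(word,
              "| stem:", stemmer.stem(word),                    # can be a non-word, e.g. "studi"
              "| lemma:", lemmatizer.lemmatize(word, pos="v"))  # always a meaningful word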


Named Entity Recognition: This aims to extract the key entities and their roles from text. For example, take the Harry Potter sentence "Harry went to the magical kingdom Hogwarts." Here, "Harry" and "Hogwarts" are the highlighted, special words for understanding what the sentence is about. NER is a technique that helps fetch names, places, and other important entities as information.
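
A minimal NER sketch with spaCy's small English model, assuming spaCy and its en_core_web_sm model are installed (exactly which entities get tagged depends on the model):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Harry went to the magical kingdom Hogwarts.")

    for ent in doc.ents:
        print(ent.text, ent.label_)  # e.g. "Harry" tagged as PERSON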


Sentiment Analysis: Also known as opinion mining, this determines the sentiment or emotion expressed in a text, judging whether it is positive, negative, or neutral. For example, in "I like all Harry Potter movies; all are fantastic!" the sentiment is positive. On the other hand, if someone writes "I hate to eat tomatoes," the sentiment is negative, as the person expresses hatred for tomatoes.
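
A minimal sketch using NLTK's VADER analyzer, assuming the vader_lexicon resource has been downloaded; the compound score runs from -1 (most negative) to +1 (most positive):

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    print(sia.polarity_scores("I like all Harry Potter movies; all are fantastic!"))
    print(sia.polarity_scores("I hate to eat tomatoes."))
    # the 'compound' score is positive for the first sentence, negative for the second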


Summarization: This NLP technique condenses a large text into a shorter version while retaining its main points and essential information. It aims to produce a concise summary that captures the essence of the original text.
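
Hugging Face's pipeline API gives a quick way to try this. A sketch assuming the transformers library and a backend such as PyTorch are installed (the default summarization model is downloaded on first use):

    from transformers import pipeline

    summarizer = pipeline("summarization")
    article = (
        "Natural Language Processing is a branch of artificial intelligence "
        "that enables computers to understand, generate, and interact with "
        "human language. It powers virtual assistants, machine translation, "
        "grammar checkers, and many other everyday tools."
    )
    print(summarizer(article, max_length=30, min_length=5)[0]["summary_text"])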


Machine Translation: The automatic translation of text or speech from one language to another using computer algorithms. It bridges the language barrier and facilitates communication between people who speak different languages.
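
The same pipeline API exposes translation. A sketch assuming transformers is installed (the default English-to-French model is downloaded on first use):

    from transformers import pipeline

    translator = pipeline("translation_en_to_fr")
    print(translator("NLP bridges the language barrier.")[0]["translation_text"])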


Virtual Assistant: NLP powers chatbots and virtual assistants, enabling them to understand user queries and respond naturally and helpfully.



Working of Natural Language Processing


Step 1 - Tokenization: The text is divided into individual words or tokens.

Step 2 - Text Cleaning: Noise is removed, and the text is converted to lowercase.

Step 3 - Part-of-Speech Tagging: Each token is assigned a part-of-speech label, such as a noun, verb, adjective, etc.

Step 4 - Parsing: The sentence structure is analyzed to understand the relationships between words.

Step 5 - Entity Recognition: Named entities, such as names, dates, and locations, are identified.

Step 6 - Semantic Analysis: The meaning of the text is extracted using various methods like sentiment analysis or semantic role labelling.

Step 7 - Language Modelling: It is used to predict the probability of the next word in a sequence, enabling tasks like text completion or generation.
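
Several of these steps can be seen together in a few lines of spaCy. A minimal sketch covering steps 1, 3, 4, and 5, assuming the en_core_web_sm model is installed:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Harry visited Hogwarts in September.")

    for token in doc:
        # Step 1 (tokenization), Step 3 (POS tag), Step 4 (dependency relation)
        print(token.text, token.pos_, token.dep_, "head:", token.head.text)

    for ent in doc.ents:
        # Step 5 (entity recognition): names, places, dates, etc.
        print(ent.text, ent.label_)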


Libraries Used


NLTK (Natural Language Toolkit): One of the most well-known and popular Python NLP libraries is NLTK. It offers simple interfaces and features for tasks like tokenization, part-of-speech tagging, named entity identification, sentiment analysis, and more. For those new to NLP, NLTK also provides a selection of text corpora and language processing tools, making it a great place to start.
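
A first-steps sketch, assuming NLTK is installed and its tokenizer and tagger resources have been downloaded:

    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("NLTK makes natural language processing approachable.")
    print(nltk.pos_tag(tokens))  # list of (word, part-of-speech tag) pairs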


spaCy: It is another popular NLP library for Python, designed to be efficient, fast, and production-ready. It offers robust capabilities for tasks like tokenization, part-of-speech tagging, dependency parsing, named entity recognition, and more. spaCy is widely used in industry applications due to its speed and ease of use.
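
Beyond the NER example shown earlier, the same Doc object exposes noun chunks and the dependency parse. A small sketch, again assuming en_core_web_sm is installed:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("spaCy parses long documents quickly in production systems.")

    print([chunk.text for chunk in doc.noun_chunks])  # noun phrases
    print([(t.text, t.dep_) for t in doc])            # dependency labels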


Gensim: Gensim is mainly concerned with topic modeling and document similarity analysis. Its tools can build Latent Semantic Analysis (LSA) models, Word2Vec word embeddings, and Doc2Vec document embeddings. Gensim is frequently employed in recommendation systems, document retrieval, and document clustering.
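
A toy Word2Vec sketch with Gensim, assuming gensim 4.x is installed (real embeddings need far more text than this tiny corpus):

    from gensim.models import Word2Vec

    sentences = [
        ["harry", "went", "to", "hogwarts"],
        ["hermione", "went", "to", "hogwarts"],
        ["harry", "likes", "quidditch"],
    ]
    model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=100)
    print(model.wv.most_similar("hogwarts", topn=2))  # nearest words by cosine similarity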


Transformers: Hugging Face created the Transformers library, a cutting-edge NLP library that supports a variety of transformer-based models like BERT, GPT, RoBERTa, and others. It provides pre-trained models for a variety of NLP tasks, making it simple to fine-tune and apply them to particular uses, including sentiment analysis, text generation, machine translation, and question-answering systems.
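
As a sketch of how little code a pre-trained model needs, here is the question-answering pipeline (assuming transformers and a backend such as PyTorch are installed; the default model is downloaded on first use):

    from transformers import pipeline

    qa = pipeline("question-answering")
    result = qa(question="Where did Harry go?",
                context="Harry went to the magical kingdom Hogwarts.")
    print(result["answer"])  # expected to be a span such as "Hogwarts"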


Stanford NLP: Stanford NLP offers a collection of NLP tools written in Java. It covers tasks like named entity recognition, sentiment analysis, dependency parsing, and part-of-speech tagging. Stanford NLP's models are renowned for their quality and performance.


TextBlob: TextBlob is a beginner-friendly Python library built on top of NLTK. It offers a simple API for common tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and spelling correction.
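
A minimal TextBlob sketch, assuming textblob is installed (its corpora may need a one-time download via python -m textblob.download_corpora):

    from textblob import TextBlob

    blob = TextBlob("I like all Harry Potter movies; all are fantastic!")
    print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)
    print(blob.noun_phrases)  # noun phrases detected in the text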


CoreNLP: Another Stanford package, CoreNLP, has strong NLP features for tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, and more. Although it has bindings for various programming languages, CoreNLP itself is written in Java.


AllenNLP: AllenNLP is a research-focused NLP library built on top of PyTorch. It provides pre-built components for a variety of applications, including semantic role labeling, question answering, and text classification. Additionally, AllenNLP makes it simple to experiment with different NLP models and architectures.

Author - Jinal Swarnakar


