How to Lemmatize a Dataframe in Python

In this tutorial, we will show you how to lemmatize a dataframe in Python using the NLTK library.

Updated March 29, 2023

Hello and welcome to this beginner’s tutorial on how to lemmatize a dataframe in Python. Lemmatization is the process of converting a word to its base form, or lemma. It’s a useful technique in natural language processing because it groups the inflected forms of a word together, so that “runs” and “ran,” for example, can be counted as the same term during text analysis.

What is a Dataframe?

A dataframe is a two-dimensional tabular data structure in which data is organized in rows and columns, similar to a spreadsheet or a SQL table. Dataframes are a popular data structure in data science and machine learning and are used for data manipulation, analysis, and visualization.
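As a small illustration of the row-and-column layout, here is a minimal sketch that uses the pandas library, which this tutorial relies on later. The id column and its values are made up for the example.

```python
import pandas as pd

# Each dictionary key becomes a column name; each list holds that column's values
df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['I am running', 'He ran fast', 'They are runners'],
})

print(df.shape)          # (rows, columns) -> (3, 2)
print(list(df.columns))  # ['id', 'text']
print(df['text'][0])     # I am running
```

Selecting a single column, as in df['text'], returns a pandas Series, and that is the object a lemmatization function gets applied to.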

What is Lemmatization?

Lemmatization is the process of reducing a word to its base form, or lemma, using morphological analysis. The lemma is the dictionary form of the word, and which lemma is correct depends on the word’s part of speech (noun, verb, adjective, or adverb). For example, when treated as verbs, the words “running,” “ran,” and “runs” all have the lemma “run.”

Method: Using the NLTK Library

The Natural Language Toolkit (NLTK) is a popular Python library for natural language processing. It provides a wide range of tools and resources for text processing, including lemmatization.

Here’s an example of how to lemmatize a dataframe using the NLTK library:

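import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer

# Download the tokenizer models and the WordNet data (needed once per environment).
# The tokenizer resource is named 'punkt' in older NLTK releases and 'punkt_tab'
# in newer ones, so both names are requested here.
for resource in ('punkt', 'punkt_tab', 'wordnet'):
    nltk.download(resource, quiet=True)

# Create a sample dataframe
df = pd.DataFrame({'text': ['I am running', 'He ran fast', 'They are runners']})

# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()

# Define a function to lemmatize text
def lemmatize_text(text):
    # Tokenize the text into words
    words = nltk.word_tokenize(text)

    # Lemmatize each word (lemmatize() treats every word as a noun unless pos= is given)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

    # Join the lemmatized words back into a string
    lemmatized_text = ' '.join(lemmatized_words)

    return lemmatized_text

# Apply the lemmatize_text function to the text column of the dataframe
df['lemmatized_text'] = df['text'].apply(lemmatize_text)

# Print the dataframe, which now holds both the original and the lemmatized text
print(df)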

In this code, we first import pandas and NLTK, then create a sample dataframe whose text column contains three sentences. We create a lemmatizer object from NLTK’s WordNetLemmatizer class and define a function, lemmatize_text, that takes a string, tokenizes it into words, lemmatizes each word, and joins the results back into a string. We apply this function to the text column with the apply() method and store the result in a new column called lemmatized_text. Finally, we print the dataframe, which now shows the original and lemmatized text side by side.

One caveat: lemmatize() treats every word as a noun unless it is given a part of speech through its pos argument. With the sample sentences, “runners” becomes “runner,” but “running” and “ran” are returned unchanged rather than reduced to “run.”

Conclusion

In this tutorial, we have shown you how to lemmatize a dataframe in Python using the NLTK library. We have covered the basics of lemmatization, creating a lemmatizer object, defining a lemmatization function, applying the function to a dataframe column, and printing the resulting dataframe with its original and lemmatized columns.

Lemmatization is a common preprocessing step in natural language processing. By grouping the inflected forms of a word under a single lemma, it can improve the results of text analysis and classification tasks.