How to convert token list into wordnet lemma list using nltk?

I have a list of tokens extracted from a PDF source. I am able to preprocess the text and tokenize it, but I want to loop through the tokens and convert each token in the list to its lemma in the WordNet corpus. My token list looks like this:



['0000', 'Everyone', 'age', 'remembers', 'Þ', 'rst', 'heard', 'contest', 'I', 'sitting', 'hideout', 'watching', ...]


There are no lemmas for words like 'Everyone', '0000', 'Þ' and many more, which I need to eliminate. But for words like 'age', 'remembers', 'heard', etc., the token list is supposed to look like:



['age', 'remember', 'hear', ...]


I am checking the synonyms with this code:



syns = wn.synsets("heard")
print(syns[0].lemmas()[0].name())
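# For "heard", this should print 'hear' (the first lemma name of the first synset).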


So far I have created the function clean_text() in Python for preprocessing. It looks like this:



def clean_text(text):
    # Eliminating punctuations
    text = "".join([word for word in text if word not in string.punctuation])
    # tokenizing
    tokens = re.split(r"\W+", text)
    # lemmatizing and removing stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # converting token list into synset
    syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
    return text


I am getting this error:



syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
AttributeError: 'list' object has no attribute 'lower'


How do I get the lemma list for the tokens?



The full code:



import string
import re
from wordcloud import WordCloud
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import wordnet
import PyPDF4
import matplotlib
import numpy as np
from PIL import Image

stopwords = nltk.corpus.stopwords.words('english')
moreStopwords = ['clin97803078874365pallr1indd']  # additional stopwords to be removed manually.
wn = nltk.WordNetLemmatizer()

data = PyPDF4.PdfFileReader(open('ReadyPlayerOne.pdf', 'rb'))
pageData = ''
for page in data.pages:
    pageData += page.extractText()
# print(pageData)


def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split(r"\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    syns = [text.lemmas()[0].name() for text in wordnet.synsets(text)]
    return syns


print(clean_text(pageData))

Tags: python, nltk, wordnet

asked Nov 21 '18 at 16:41 by Tony

  • You should check your imports: in nltk, wordnet might refer to different objects, and some of them do not have a synsets attribute.
    – BlueSheepToken
    Nov 21 '18 at 16:48

  • Oh, yes, I was importing wordnet as wn and assigning the WordNetLemmatizer to the wn variable too. But now I am getting this error: AttributeError: 'list' object has no attribute 'lower'
    – Tony
    Nov 21 '18 at 16:52

  • Nice, you should edit your post; I might not be able to help you there.
    – BlueSheepToken
    Nov 21 '18 at 16:54

  • Hi Tony, it's best to create a Minimal, Complete, and Verifiable example. See here: stackoverflow.com/help/mcve. This means we should be able to copy/paste your code and run it in a REPL so that we can confirm the error you're seeing, and (ideally) point you in the right direction for a fix. Good luck!
    – Matt Messersmith
    Nov 21 '18 at 16:56

  • Done, thanks @Matt
    – Tony
    Nov 21 '18 at 17:04

1 Answer

You are calling wordnet.synsets(text) with a list of words (check what text is at that point) when you should call it with a single word.
The preprocessing in wordnet.synsets tries to apply .lower() to its argument, hence the error (AttributeError: 'list' object has no attribute 'lower').
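
A minimal sketch of the failure, assuming the standard from nltk.corpus import wordnet import (this snippet is not part of the original code):

from nltk.corpus import wordnet

wordnet.synsets("heard")    # fine: synsets() lowercases the single word internally
wordnet.synsets(["heard"])  # AttributeError: 'list' object has no attribute 'lower'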



Below is a working version of clean_text with a fix for this problem:



import string
import re
import nltk
from nltk.corpus import wordnet

stopwords = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()

def clean_text(text):
    # Strip punctuation characters
    text = "".join([word for word in text if word not in string.punctuation])
    # Tokenize on runs of non-word characters
    tokens = re.split(r"\W+", text)
    # Lemmatize and drop stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # Look words up one at a time: wordnet.synsets() expects a single word
    lemmas = []
    for token in text:
        lemmas += [synset.lemmas()[0].name() for synset in wordnet.synsets(token)]
    return lemmas


text = "The grass was greener."

print(clean_text(text))


Returns:



['grass', 'Grass', 'supergrass', 'eatage', 'pot', 'grass', 'grass', 'grass', 'grass', 'grass', 'denounce', 'green', 'green', 'green', 'green', 'fleeceable']
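
Note that tokens with no WordNet synsets at all (such as '0000' or 'Þ' from the question) contribute nothing to lemmas, so they are eliminated as a side effect.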

answered Nov 21 '18 at 17:24 by Julian Peller

  • Hey Julian, is there a way to avoid the repetition of words like grass, which is repeated so many times?
    – Tony
    Nov 21 '18 at 17:37

  • @Tony sure, use a set instead of a list.
    – Matt Messersmith
    Nov 21 '18 at 17:49

  • @Tony, you can use a set as @Matt Messersmith suggested to remove duplicates from your final list. Replace return lemmas with return list(set(lemmas)).
    – Julian Peller
    Nov 21 '18 at 17:57
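
As a sketch of that set suggestion (not from the original thread): list(set(lemmas)) removes duplicates but loses the original order, while dict.fromkeys preserves first-seen order (Python 3.7+):

lemmas = ['grass', 'Grass', 'supergrass', 'grass', 'green', 'green']

print(list(set(lemmas)))             # unordered, e.g. ['green', 'Grass', 'supergrass', 'grass']
print(list(dict.fromkeys(lemmas)))   # ['grass', 'Grass', 'supergrass', 'green']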