Counting lower vs non-lowercase tokens for tokenized text with several conditions
Assuming the text is tokenized on whitespace for a natural language processing task, the goal is to count the occurrences of each word (regardless of casing) after passing each token through several conditions.
The current code works as it's supposed to, but is there a way to optimize the if-else conditions to make them cleaner or more direct?
First, the function has to determine whether:
- the token is an XML tag; if so, ignore it and move to the next token
- the token is in a list of predefined delayed sentence start symbols; if so, ignore it and move to the next token
    # Skip XML tags.
    if re.search(r"(<\S[^>]*>)", token):
        continue
    # Skip if sentence start symbols.
    elif token in self.DELAYED_SENT_START:
        continue
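The two skip checks above lead to the same `continue`, so one way to flatten them is a single predicate. This is only a sketch: `should_skip`, `XML_TAG`, and the shortened symbol set are hypothetical names and values chosen for illustration, and the tag pattern assumes tags look like `<tag ...>`.

```python
import re

# Assumption: tags look like <tag ...>; compiling once avoids re-parsing per token.
XML_TAG = re.compile(r"<\S[^>]*>")
# Assumption: an illustrative subset of the delayed sentence-start symbols.
DELAYED_SENT_START = frozenset(["(", "[", '"', "'"])


def should_skip(token):
    """Return True for tokens that are ignored without touching any state."""
    return bool(XML_TAG.search(token)) or token in DELAYED_SENT_START


tokens = "<s> ( Hello world )".split()
kept = [t for t in tokens if not should_skip(t)]
print(kept)  # ['Hello', 'world', ')']
```

Using a `frozenset` keeps the membership test constant-time, which may matter little for eight symbols but costs nothing.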
Then it checks whether to toggle the `is_first_word` flag, which tracks whether the token is the first word of a sentence; note that there can be many sentences in each line.
- if the token is in a list of predefined sentence endings and `is_first_word` is False, then set `is_first_word` to True and move on to the next token
- if there's nothing to case, since none of the characters match the letter regex, then set `is_first_word` to False and move on to the next token
    # Resets the `is_first_word` after seeing sent end symbols.
    if not is_first_word and token in self.SENT_END:
        is_first_word = True
        continue
    # Skips words with nothing to case.
    if not re.search(self.SKIP_LETTERS_REGEX, token):
        is_first_word = False
        continue
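To see how the flag behaves across several sentences in one line, the toggling can be isolated in a small generator. This is a sketch under assumptions: `mark_first_words` and `has_cased_letter` are hypothetical helpers, and `str` methods stand in for the Perluniprops-based letter regex.

```python
SENT_END = frozenset([".", ":", "?", "!"])


def has_cased_letter(token):
    # Assumption: stand-in for the Ll/Lu/Lt character class, using str methods.
    return any(ch.islower() or ch.isupper() or ch.istitle() for ch in token)


def mark_first_words(tokens):
    """Yield (token, is_first_word) for every token that can be cased."""
    is_first_word = True
    for token in tokens:
        # A sentence-end symbol re-arms the flag for the next sentence.
        if not is_first_word and token in SENT_END:
            is_first_word = True
            continue
        # Tokens without letters clear the flag and are not counted.
        if not has_cased_letter(token):
            is_first_word = False
            continue
        yield token, is_first_word
        is_first_word = False


pairs = list(mark_first_words("Hello world . 42 is odd . It rains".split()))
print(pairs)
```

Note that `42` clears the flag, so `is` is not treated as a sentence-initial word even though it is the first cased token of its sentence.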
Finally, after filtering out the tokens that cannot be weighted, the function updates the weight.
First, the weight is set to 0, and then set to 1 if the token is not the first word (`is_first_word` is False).
Then, if the `possibly_use_first_token` option is set, check whether the first character of the token is lowercase; if so, count the word with full weight. Otherwise, if `i == 1`, assign it a weight of 0.1, which is better than leaving the weight at 0.
Finally, update the counts if the weight is non-zero, and set the `is_first_word` toggle to False.
    current_word_weight = 0
    if not is_first_word:
        current_word_weight = 1
    elif possibly_use_first_token:
        # Gated special handling of first word of sentence.
        # Check if first character of token is lowercase.
        if token[0].islower():
            current_word_weight = 1
        elif i == 1:
            current_word_weight = 0.1
    if current_word_weight > 0:
        casing[token.lower()][token] += current_word_weight
    is_first_word = False
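The weighting rules above can also be read as a pure function with early returns, which removes the mutable `current_word_weight` variable. This is a sketch, not the original code: `word_weight` is a hypothetical helper name, and it mirrors the `i == 1` condition exactly as written in the question.

```python
from collections import Counter, defaultdict


def word_weight(token, index, is_first_word, possibly_use_first_token=False):
    """Return how much evidence `token` contributes to the casing counts."""
    if not is_first_word:
        return 1
    if possibly_use_first_token:
        # A lowercase first character is not caused by the sentence start.
        if token[0].islower():
            return 1
        if index == 1:
            return 0.1
    return 0


casing = defaultdict(Counter)
weight = word_weight("the", 0, is_first_word=True, possibly_use_first_token=True)
if weight > 0:
    casing["the".lower()]["the"] += weight
```

Because the function has no side effects, each branch can be checked in isolation without building a corpus file.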
The full code is in the `train()` function below:
    import re
    from collections import defaultdict, Counter

    from six import text_type

    from sacremoses.corpus import Perluniprops
    from sacremoses.corpus import NonbreakingPrefixes

    perluniprops = Perluniprops()


    class MosesTruecaser(object):
        """
        This is a Python port of the Moses Truecaser from
        https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
        https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/truecase.perl
        """
        # Perl Unicode Properties character sets.
        Lowercase_Letter = text_type(''.join(perluniprops.chars('Lowercase_Letter')))
        Uppercase_Letter = text_type(''.join(perluniprops.chars('Uppercase_Letter')))
        Titlecase_Letter = text_type(''.join(perluniprops.chars('Uppercase_Letter')))

        def __init__(self):
            # Initialize the object.
            super(MosesTruecaser, self).__init__()
            # Character class matching any lowercase, uppercase or titlecase letter.
            self.SKIP_LETTERS_REGEX = r"[{}{}{}]".format(
                self.Lowercase_Letter, self.Uppercase_Letter, self.Titlecase_Letter)
            self.SENT_END = [".", ":", "?", "!"]
            # Includes the entity-escaped forms of quotes and brackets.
            self.DELAYED_SENT_START = ["(", "[", "\"", "'", "&apos;", "&quot;", "&#91;", "&#93;"]

        def train(self, filename, possibly_use_first_token=False):
            casing = defaultdict(Counter)
            with open(filename) as fin:
                for line in fin:
                    # Keep track of first words in the sentence(s) of the line.
                    is_first_word = True
                    for i, token in enumerate(line.split()):
                        # Skip XML tags.
                        if re.search(r"(<\S[^>]*>)", token):
                            continue
                        # Skip if sentence start symbols.
                        elif token in self.DELAYED_SENT_START:
                            continue
                        # Resets the `is_first_word` after seeing sent end symbols.
                        if not is_first_word and token in self.SENT_END:
                            is_first_word = True
                            continue
                        # Skips words with nothing to case.
                        if not re.search(self.SKIP_LETTERS_REGEX, token):
                            is_first_word = False
                            continue
                        current_word_weight = 0
                        if not is_first_word:
                            current_word_weight = 1
                        elif possibly_use_first_token:
                            # Gated special handling of first word of sentence.
                            # Check if first character of token is lowercase.
                            if token[0].islower():
                                current_word_weight = 1
                            elif i == 1:
                                current_word_weight = 0.1
                        if current_word_weight > 0:
                            casing[token.lower()][token] += current_word_weight
                        is_first_word = False
            return casing
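For a concrete input and output, here is a self-contained sketch of the same loop running on in-memory lines instead of a file. It is an approximation rather than the class above: `count_casing` and `has_cased_letter` are hypothetical names, `str` methods replace the Perluniprops character sets so that `sacremoses` is not required, and the delayed sentence-start set is shortened.

```python
import re
from collections import Counter, defaultdict

XML_TAG = re.compile(r"<\S[^>]*>")
SENT_END = frozenset([".", ":", "?", "!"])
# Assumption: shortened symbol list, enough for the example below.
DELAYED_SENT_START = frozenset(["(", "[", '"', "'"])


def has_cased_letter(token):
    # Assumption: str methods approximate the Ll/Lu/Lt character class.
    return any(ch.islower() or ch.isupper() or ch.istitle() for ch in token)


def count_casing(lines, possibly_use_first_token=False):
    casing = defaultdict(Counter)
    for line in lines:
        is_first_word = True
        for i, token in enumerate(line.split()):
            if XML_TAG.search(token) or token in DELAYED_SENT_START:
                continue
            if not is_first_word and token in SENT_END:
                is_first_word = True
                continue
            if not has_cased_letter(token):
                is_first_word = False
                continue
            weight = 0
            if not is_first_word:
                weight = 1
            elif possibly_use_first_token:
                if token[0].islower():
                    weight = 1
                elif i == 1:
                    weight = 0.1
            if weight > 0:
                casing[token.lower()][token] += weight
            is_first_word = False
    return casing


lines = ["The cat saw the Cat .", "the end"]
casing = count_casing(lines)
print(dict(casing["cat"]))  # {'cat': 1, 'Cat': 1}
print(dict(casing["the"]))  # {'the': 1}
```

With the default settings, both sentence-initial tokens (`The` and the second line's `the`) are ignored. Passing `possibly_use_first_token=True` counts the lowercase sentence-initial `the` as well, so `casing["the"]["the"]` becomes 2.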
python regex natural-language-processing
Could you provide some example inputs and the corresponding outputs?
– 200_success
15 mins ago
edited 18 mins ago by 200_success
asked 22 mins ago by alvas