Counting lower vs non-lowercase tokens for tokenized text with several conditions

Given text that has been whitespace-tokenized for a natural language processing task, the goal is to count how often each word occurs in each of its cased forms (grouped by the lowercased word), subject to a few conditions.

The current code works as it's supposed to, but is there a way to optimize the if-else conditions to make it cleaner or more direct?

First, the function has to determine whether:

the token is an XML tag; if so, ignore it and move to the next token

the token is in a list of predefined delayed sentence-start symbols; if so, ignore it and move to the next token

# Skip XML tags.
if re.search(r"(<\S[^>]*>)", token):
    continue
# Skip delayed sentence-start symbols.
elif token in self.DELAYED_SENT_START:
    continue
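
The two skip checks can also be read as a single predicate. The following is a minimal sketch, not the code from the class: `XML_TAG`, `DELAYED_SENT_START` (as a module-level set) and `is_skippable` are hypothetical names, and the tag pattern assumes that a tag is `<`, a non-space character, then anything up to `>`.

```python
import re

# Hypothetical stand-ins for the attributes used in the question.
# Assumption: an XML tag is "<", one non-space character, then anything up to ">".
XML_TAG = re.compile(r"<\S[^>]*>")
DELAYED_SENT_START = {"(", "[", '"', "'", "&apos;", "&quot;", "]"}


def is_skippable(token):
    """Return True for XML tags and delayed sentence-start symbols."""
    return bool(XML_TAG.search(token)) or token in DELAYED_SENT_START


tokens = ["<s>", "(", "Hello", "world", ")"]
kept = [t for t in tokens if not is_skippable(t)]
print(kept)
```

Note that `")"` is kept here because it is not in the sentence-start set; it would be handled by the later "nothing to case" check.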

Next, it decides whether to toggle the is_first_word flag, which tracks whether the token is the first word of a sentence; note that a line can contain several sentences.

If the token is in a list of predefined sentence-ending symbols and is_first_word is False, set is_first_word to True and move on to the next token.

If there's nothing to case, because none of the token's characters matches the letter regex, set is_first_word to False and move on to the next token.

# Reset `is_first_word` after seeing a sentence-end symbol.
if not is_first_word and token in self.SENT_END:
    is_first_word = True
    continue

# Skip tokens with nothing to case.
if not re.search(self.SKIP_LETTERS_REGEX, token):
    is_first_word = False
    continue
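
To see how the flag behaves across several sentences on one line, here is a small standalone sketch. It is an illustration under assumptions: `weightable_tokens` and `HAS_LETTER` are hypothetical names, and ASCII letters stand in for the Unicode letter classes used by the class.

```python
import re

SENT_END = {".", ":", "?", "!"}
# Assumption: ASCII letters stand in for the Unicode letter classes.
HAS_LETTER = re.compile(r"[A-Za-z]")


def weightable_tokens(tokens):
    """Yield (token, is_first_word) for tokens that reach the weighting step."""
    is_first_word = True
    for token in tokens:
        # A sentence-end symbol re-arms the flag for the next sentence.
        if not is_first_word and token in SENT_END:
            is_first_word = True
            continue
        # A token without letters is skipped, but it still clears the flag.
        if not HAS_LETTER.search(token):
            is_first_word = False
            continue
        yield token, is_first_word
        is_first_word = False


result = list(weightable_tokens("The cat sat . 42 dogs ran !".split()))
print(result)
```

In this input, `42` opens the second sentence and has nothing to case, so it clears the flag and `dogs` is no longer treated as a first word.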

Finally, after skipping the tokens that can't be weighted, the function updates the weight.

The weight starts at 0 and is set to 1 if the token is not the first word of a sentence.

If the token is the first word and the possibly_use_first_token option is set, check whether the token starts with a lowercase character; if so, count the word with full weight. Otherwise, if the token index i is 1, assign it a weight of 0.1, which is better than leaving the weight at 0.

Then update the counts if the weight is non-zero, and set the is_first_word toggle to False.

current_word_weight = 0
if not is_first_word:
    current_word_weight = 1
elif possibly_use_first_token:
    # Gated special handling of the first word of a sentence.
    # Check whether the first character of the token is lowercase.
    if token[0].islower():
        current_word_weight = 1
    elif i == 1:
        current_word_weight = 0.1

if current_word_weight > 0:
    casing[token.lower()][token] += current_word_weight

is_first_word = False
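
One possible way to flatten the weight logic is to move it into a small function with early returns. This is only a sketch of the same rules as described above; `word_weight` is a hypothetical helper, not part of the class.

```python
def word_weight(token, i, is_first_word, possibly_use_first_token=False):
    """Return the evidence weight for one token (hypothetical helper)."""
    if not is_first_word:
        return 1
    if possibly_use_first_token:
        # A lowercase first character counts as full evidence.
        if token[0].islower():
            return 1
        # Same condition as in the question: token index 1 gets partial weight.
        if i == 1:
            return 0.1
    return 0


print(word_weight("cat", 3, False))
print(word_weight("The", 0, True))
print(word_weight("the", 0, True, possibly_use_first_token=True))
print(word_weight("The", 1, True, possibly_use_first_token=True))
```

With such a helper, the loop body would reduce to computing the weight and updating the counter only when it is greater than zero.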

The full code is in the train() function below:

import re
from collections import defaultdict, Counter

from six import text_type

from sacremoses.corpus import Perluniprops
from sacremoses.corpus import NonbreakingPrefixes

perluniprops = Perluniprops()


class MosesTruecaser(object):
    """
    This is a Python port of the Moses Truecaser from
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/truecase.perl
    """
    # Perl Unicode Properties character sets.
    Lowercase_Letter = text_type(''.join(perluniprops.chars('Lowercase_Letter')))
    Uppercase_Letter = text_type(''.join(perluniprops.chars('Uppercase_Letter')))
    # NOTE: this also loads 'Uppercase_Letter'; 'Titlecase_Letter' may have been intended.
    Titlecase_Letter = text_type(''.join(perluniprops.chars('Uppercase_Letter')))

    def __init__(self):
        # Initialize the object.
        super(MosesTruecaser, self).__init__()
        # Character class matching any letter that can be cased.
        # Class attributes have to be reached through `self` inside a method.
        self.SKIP_LETTERS_REGEX = r"[{}{}{}]".format(
            self.Lowercase_Letter, self.Uppercase_Letter, self.Titlecase_Letter)

        self.SENT_END = [".", ":", "?", "!"]
        self.DELAYED_SENT_START = ["(", "[", '"', "'", "&apos;", "&quot;", "[", "]"]

    def train(self, filename, possibly_use_first_token=False):
        casing = defaultdict(Counter)
        with open(filename) as fin:
            for line in fin:
                # Keep track of first words in the sentence(s) of the line.
                is_first_word = True
                for i, token in enumerate(line.split()):
                    # Skip XML tags.
                    if re.search(r"(<\S[^>]*>)", token):
                        continue
                    # Skip delayed sentence-start symbols.
                    elif token in self.DELAYED_SENT_START:
                        continue

                    # Reset `is_first_word` after seeing a sentence-end symbol.
                    if not is_first_word and token in self.SENT_END:
                        is_first_word = True
                        continue

                    # Skip tokens with nothing to case.
                    if not re.search(self.SKIP_LETTERS_REGEX, token):
                        is_first_word = False
                        continue

                    current_word_weight = 0
                    if not is_first_word:
                        current_word_weight = 1
                    elif possibly_use_first_token:
                        # Gated special handling of the first word of a sentence.
                        # Check whether the first character of the token is lowercase.
                        if token[0].islower():
                            current_word_weight = 1
                        elif i == 1:
                            current_word_weight = 0.1

                    if current_word_weight > 0:
                        casing[token.lower()][token] += current_word_weight

                    is_first_word = False
        return casing
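
For readers unfamiliar with the `defaultdict(Counter)` structure that `train()` returns, here is a tiny standalone illustration. The (token, weight) pairs are made up for the example, and picking the most heavily weighted form per word is only an assumption about how the counts might be used afterwards.

```python
from collections import Counter, defaultdict

casing = defaultdict(Counter)

# Hypothetical (token, weight) pairs, as the training loop might produce them.
for token, weight in [("the", 1), ("The", 0.1), ("the", 1), ("NASA", 1)]:
    # Outer key: lowercased word. Inner key: the surface form that was seen.
    casing[token.lower()][token] += weight

# Assumption: the form with the largest accumulated weight is the preferred casing.
best = {word: forms.most_common(1)[0][0] for word, forms in casing.items()}
print(dict(casing))
print(best)
```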

edited 18 mins ago by 200_success

asked 22 mins ago by alvas

Could you provide some example inputs and the corresponding outputs?
– 200_success
15 mins ago

python regex natural-language-processing
