Replace words not in a dict to

I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>.

I tried str.maketrans but its key should be a single char.

Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.

Are there better solutions?

edited Nov 25 '18 at 2:36

W.Ambrozic

901212

asked Nov 25 '18 at 1:54

Huang Yuheng

154216

I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a set is way faster in my experience to check for membership.

– Thu Yein Tun
Nov 25 '18 at 3:31

add a comment |

I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>.

I tried str.maketrans but its key should be a single char.

Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.

Are there better solutions?

edited Nov 25 '18 at 2:36

W.Ambrozic

901212

asked Nov 25 '18 at 1:54

Huang Yuheng

154216

I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a set is way faster in my experience to check for membership.

– Thu Yein Tun
Nov 25 '18 at 3:31

add a comment |

I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>.

I tried str.maketrans but its key should be a single char.

Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.

Are there better solutions?

edited Nov 25 '18 at 2:36

W.Ambrozic

901212

asked Nov 25 '18 at 1:54

Huang Yuheng

154216

I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>.

I tried str.maketrans but its key should be a single char.

Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.

Are there better solutions?

python string

edited Nov 25 '18 at 2:36

W.Ambrozic

901212

asked Nov 25 '18 at 1:54

Huang Yuheng

154216

edited Nov 25 '18 at 2:36

W.Ambrozic

901212

asked Nov 25 '18 at 1:54

Huang Yuheng

154216

edited Nov 25 '18 at 2:36

W.Ambrozic

901212

edited Nov 25 '18 at 2:36

W.Ambrozic

901212

edited Nov 25 '18 at 2:36

W.Ambrozic

901212

asked Nov 25 '18 at 1:54

Huang Yuheng

154216

asked Nov 25 '18 at 1:54

Huang Yuheng

154216

asked Nov 25 '18 at 1:54

Huang Yuheng

154216

I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a set is way faster in my experience to check for membership.

– Thu Yein Tun
Nov 25 '18 at 3:31

add a comment |

I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a set is way faster in my experience to check for membership.

– Thu Yein Tun
Nov 25 '18 at 3:31

I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a set is way faster in my experience to check for membership.

– Thu Yein Tun
Nov 25 '18 at 3:31

add a comment |

1 Answer
1

active

oldest

votes

We break down the problem into two parts :

Given the list of words, passage, find the indices, i for which passage[i] is not in another list of words dictionary.

Then simpy put <unk> at those indices.

The, major work is required in 1. To do that, we first convert the list of strings to 2D numpy arrays, so that we can carry out the operations efficiently. Also, we sort the dictionary which is required below in binary search. Also, we pad the dictionary with 0s to have the same number of columns as passage_enc.

# assume passage, dictionary are initially lists of words

passage = np.array(passage)  # np array of dtype='<U4'

passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]  # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars



dictionary = np.array(dictionary)

dictionary = np.sort(dictionary)    

dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]

pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))    

dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)

Then we just iterate over passage, and check if the string(now an array) is in dictionary. It will take O(n * m), n, m being the sizes of passage and dictionary respectively.
But, we can improve this by sorting dictionary before hand and doing a binary search in that. So, it becomes O(n * logm).

Also, we JIT compile the code to make it faster. Below, I use numba.

import numba as nb

import numpy as np



@nb.njit(cache=True)  # cache as being used multiple times

def smaller(a, b):

    n = len(a)

    i = 0

    while(i<n and a[i] == b[i]):

        i+=1

    if(i==n):

        return False

    return a[i] < b[i]



@nb.njit(cache=True)

def bin_index(array, item):

    first, last = 0, len(array) - 1



    while first <= last:

        mid = (first + last) // 2

        if np.all(array[mid] == item):

            return mid



        if smaller(item, array[mid]):

            last = mid - 1

        else:

            first = mid + 1



    return -1



@nb.njit(cache=True)

def replace(dictionary, passage):

    unknown_indices = 

    n = len(passage)

    for i in range(n):

        ind = bin_index(dictionary, passage[i])

        if(ind == -1):

            unknown_indices.append(i)

    return unknown_indices

Check it on sample data

import nltk

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

passage = np.array(emma)

passage = np.repeat(passage, 50)  # bloat coprus to have around 10mil words

passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]



persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')

dictionary = np.array(persuasion)

dictionary = np.sort(dictionary)  # sort for binary search



dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]

pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))



dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)  # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]

Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :

unknown_indices = replace(dictionary_enc, passage_enc)

takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.

Then, it is simple:

passage[unknown_indices] = "<unk>"

P.S : I guess, we can get a bit more speed by using parallel=True in the njit decorator for replace. I am getting some weird error in that, will edit if I am able to sort it out.

edited Dec 4 '18 at 10:22

answered Nov 25 '18 at 7:16

Deepak Saini

1,599815

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53464017%2freplace-words-not-in-a-dict-to-unk%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

We break down the problem into two parts :

Given the list of words, passage, find the indices, i for which passage[i] is not in another list of words dictionary.

Then simpy put <unk> at those indices.

# assume passage, dictionary are initially lists of words

passage = np.array(passage)  # np array of dtype='<U4'

passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]  # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars



dictionary = np.array(dictionary)

dictionary = np.sort(dictionary)    

dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]

pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))    

dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)

Also, we JIT compile the code to make it faster. Below, I use numba.

import numba as nb

import numpy as np



@nb.njit(cache=True)  # cache as being used multiple times

def smaller(a, b):

    n = len(a)

    i = 0

    while(i<n and a[i] == b[i]):

        i+=1

    if(i==n):

        return False

    return a[i] < b[i]



@nb.njit(cache=True)

def bin_index(array, item):

    first, last = 0, len(array) - 1



    while first <= last:

        mid = (first + last) // 2

        if np.all(array[mid] == item):

            return mid



        if smaller(item, array[mid]):

            last = mid - 1

        else:

            first = mid + 1



    return -1



@nb.njit(cache=True)

def replace(dictionary, passage):

    unknown_indices = 

    n = len(passage)

    for i in range(n):

        ind = bin_index(dictionary, passage[i])

        if(ind == -1):

            unknown_indices.append(i)

    return unknown_indices

Check it on sample data

import nltk

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

passage = np.array(emma)

passage = np.repeat(passage, 50)  # bloat coprus to have around 10mil words

passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]



persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')

dictionary = np.array(persuasion)

dictionary = np.sort(dictionary)  # sort for binary search



dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]

pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))



dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)  # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]

Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :

unknown_indices = replace(dictionary_enc, passage_enc)

takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.

Then, it is simple:

passage[unknown_indices] = "<unk>"

P.S : I guess, we can get a bit more speed by using parallel=True in the njit decorator for replace. I am getting some weird error in that, will edit if I am able to sort it out.

edited Dec 4 '18 at 10:22

answered Nov 25 '18 at 7:16

Deepak Saini

1,599815

add a comment |

We break down the problem into two parts :

Given the list of words, passage, find the indices, i for which passage[i] is not in another list of words dictionary.

Then simpy put <unk> at those indices.

# assume passage, dictionary are initially lists of words

passage = np.array(passage)  # np array of dtype='<U4'

passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]  # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars



dictionary = np.array(dictionary)

dictionary = np.sort(dictionary)    

dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]

pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))    

dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)

Also, we JIT compile the code to make it faster. Below, I use numba.

import numba as nb

import numpy as np



@nb.njit(cache=True)  # cache as being used multiple times

def smaller(a, b):

    n = len(a)

    i = 0

    while(i<n and a[i] == b[i]):

        i+=1

    if(i==n):

        return False

    return a[i] < b[i]



@nb.njit(cache=True)

def bin_index(array, item):

    first, last = 0, len(array) - 1



    while first <= last:

        mid = (first + last) // 2

        if np.all(array[mid] == item):

            return mid



        if smaller(item, array[mid]):

            last = mid - 1

        else:

            first = mid + 1



    return -1



@nb.njit(cache=True)

def replace(dictionary, passage):

    unknown_indices = 

    n = len(passage)

    for i in range(n):

        ind = bin_index(dictionary, passage[i])

        if(ind == -1):

            unknown_indices.append(i)

    return unknown_indices

Check it on sample data

import nltk

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

passage = np.array(emma)

passage = np.repeat(passage, 50)  # bloat coprus to have around 10mil words

passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]



persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')

dictionary = np.array(persuasion)

dictionary = np.sort(dictionary)  # sort for binary search



dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]

pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))



dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)  # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]

Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :

unknown_indices = replace(dictionary_enc, passage_enc)

takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.

Then, it is simple:

passage[unknown_indices] = "<unk>"

P.S : I guess, we can get a bit more speed by using parallel=True in the njit decorator for replace. I am getting some weird error in that, will edit if I am able to sort it out.

edited Dec 4 '18 at 10:22

answered Nov 25 '18 at 7:16

Deepak Saini

1,599815

add a comment |

We break down the problem into two parts :

Given the list of words, passage, find the indices, i for which passage[i] is not in another list of words dictionary.

Then simpy put <unk> at those indices.

# assume passage, dictionary are initially lists of words

passage = np.array(passage)  # np array of dtype='<U4'

passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]  # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars



dictionary = np.array(dictionary)

dictionary = np.sort(dictionary)    

dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]

pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))    

dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)

Also, we JIT compile the code to make it faster. Below, I use numba.

import numba as nb

import numpy as np



@nb.njit(cache=True)  # cache as being used multiple times

def smaller(a, b):

    n = len(a)

    i = 0

    while(i<n and a[i] == b[i]):

        i+=1

    if(i==n):

        return False

    return a[i] < b[i]



@nb.njit(cache=True)

def bin_index(array, item):

    first, last = 0, len(array) - 1



    while first <= last:

        mid = (first + last) // 2

        if np.all(array[mid] == item):

            return mid



        if smaller(item, array[mid]):

            last = mid - 1

        else:

            first = mid + 1



    return -1



@nb.njit(cache=True)

def replace(dictionary, passage):

    unknown_indices = 

    n = len(passage)

    for i in range(n):

        ind = bin_index(dictionary, passage[i])

        if(ind == -1):

            unknown_indices.append(i)

    return unknown_indices

Check it on sample data

import nltk

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

passage = np.array(emma)

passage = np.repeat(passage, 50)  # bloat coprus to have around 10mil words

passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]



persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')

dictionary = np.array(persuasion)

dictionary = np.sort(dictionary)  # sort for binary search



dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]

pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))



dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)  # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]

Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :

unknown_indices = replace(dictionary_enc, passage_enc)

takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.

Then, it is simple:

passage[unknown_indices] = "<unk>"

P.S : I guess, we can get a bit more speed by using parallel=True in the njit decorator for replace. I am getting some weird error in that, will edit if I am able to sort it out.

edited Dec 4 '18 at 10:22

answered Nov 25 '18 at 7:16

Deepak Saini

1,599815

We break down the problem into two parts :

Given the list of words, passage, find the indices, i for which passage[i] is not in another list of words dictionary.

Then simpy put <unk> at those indices.

# assume passage, dictionary are initially lists of words

passage = np.array(passage)  # np array of dtype='<U4'

passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]  # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars



dictionary = np.array(dictionary)

dictionary = np.sort(dictionary)    

dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]

pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))    

dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)

Also, we JIT compile the code to make it faster. Below, I use numba.

import numba as nb

import numpy as np



@nb.njit(cache=True)  # cache as being used multiple times

def smaller(a, b):

    n = len(a)

    i = 0

    while(i<n and a[i] == b[i]):

        i+=1

    if(i==n):

        return False

    return a[i] < b[i]



@nb.njit(cache=True)

def bin_index(array, item):

    first, last = 0, len(array) - 1



    while first <= last:

        mid = (first + last) // 2

        if np.all(array[mid] == item):

            return mid



        if smaller(item, array[mid]):

            last = mid - 1

        else:

            first = mid + 1



    return -1



@nb.njit(cache=True)

def replace(dictionary, passage):

    unknown_indices = 

    n = len(passage)

    for i in range(n):

        ind = bin_index(dictionary, passage[i])

        if(ind == -1):

            unknown_indices.append(i)

    return unknown_indices

Check it on sample data

import nltk

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

passage = np.array(emma)

passage = np.repeat(passage, 50)  # bloat coprus to have around 10mil words

passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]



persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')

dictionary = np.array(persuasion)

dictionary = np.sort(dictionary)  # sort for binary search



dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]

pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))



dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)  # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]

Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :

unknown_indices = replace(dictionary_enc, passage_enc)

takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.

Then, it is simple:

passage[unknown_indices] = "<unk>"

P.S : I guess, we can get a bit more speed by using parallel=True in the njit decorator for replace. I am getting some weird error in that, will edit if I am able to sort it out.

edited Dec 4 '18 at 10:22

answered Nov 25 '18 at 7:16

Deepak Saini

1,599815

edited Dec 4 '18 at 10:22

answered Nov 25 '18 at 7:16

Deepak Saini

1,599815

answered Nov 25 '18 at 7:16

Deepak Saini

1,599815

answered Nov 25 '18 at 7:16

Deepak Saini

1,599815

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

Kyx8 US7G,A6oBVS6,VMcmDt FDKITjpvHik,Qt8doHf9YdQ8tyGhBNlPagPnVr,v9rHs1ylcis97hz5sxm4GmuoEn r6

搜尋此網誌

Tukukkk