Replace words not in a dict to
I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>
.
I tried str.maketrans
but its key should be a single char.
Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.
Are there better solutions?
python string
add a comment |
I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>
.
I tried str.maketrans
but its key should be a single char.
Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.
Are there better solutions?
python string
I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, aset
is way faster in my experience to check for membership.
– Thu Yein Tun
Nov 25 '18 at 3:31
add a comment |
I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>
.
I tried str.maketrans
but its key should be a single char.
Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.
Are there better solutions?
python string
I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>
.
I tried str.maketrans
but its key should be a single char.
Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.
Are there better solutions?
python string
python string
edited Nov 25 '18 at 2:36
W.Ambrozic
901212
901212
asked Nov 25 '18 at 1:54
Huang YuhengHuang Yuheng
154216
154216
I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, aset
is way faster in my experience to check for membership.
– Thu Yein Tun
Nov 25 '18 at 3:31
add a comment |
I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, aset
is way faster in my experience to check for membership.
– Thu Yein Tun
Nov 25 '18 at 3:31
I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a
set
is way faster in my experience to check for membership.– Thu Yein Tun
Nov 25 '18 at 3:31
I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a
set
is way faster in my experience to check for membership.– Thu Yein Tun
Nov 25 '18 at 3:31
add a comment |
1 Answer
1
active
oldest
votes
We break down the problem into two parts :
- Given the list of words,
passage
, find the indices, i for whichpassage[i]
is not in another list of wordsdictionary
. - Then simpy put
<unk>
at those indices.
The, major work is required in 1. To do that, we first convert the list of strings to 2D numpy arrays, so that we can carry out the operations efficiently. Also, we sort the dictionary which is required below in binary search. Also, we pad the dictionary with 0s to have the same number of columns as passage_enc
.
# assume passage, dictionary are initially lists of words
passage = np.array(passage) # np array of dtype='<U4'
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4] # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars
dictionary = np.array(dictionary)
dictionary = np.sort(dictionary)
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)
Then we just iterate over passage, and check if the string(now an array) is in dictionary. It will take O(n * m), n, m being the sizes of passage and dictionary respectively.
But, we can improve this by sorting dictionary before hand and doing a binary search in that. So, it becomes O(n * logm).
Also, we JIT compile the code to make it faster. Below, I use numba.
import numba as nb
import numpy as np
@nb.njit(cache=True) # cache as being used multiple times
def smaller(a, b):
n = len(a)
i = 0
while(i<n and a[i] == b[i]):
i+=1
if(i==n):
return False
return a[i] < b[i]
@nb.njit(cache=True)
def bin_index(array, item):
first, last = 0, len(array) - 1
while first <= last:
mid = (first + last) // 2
if np.all(array[mid] == item):
return mid
if smaller(item, array[mid]):
last = mid - 1
else:
first = mid + 1
return -1
@nb.njit(cache=True)
def replace(dictionary, passage):
unknown_indices =
n = len(passage)
for i in range(n):
ind = bin_index(dictionary, passage[i])
if(ind == -1):
unknown_indices.append(i)
return unknown_indices
Check it on sample data
import nltk
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
passage = np.array(emma)
passage = np.repeat(passage, 50) # bloat coprus to have around 10mil words
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]
persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')
dictionary = np.array(persuasion)
dictionary = np.sort(dictionary) # sort for binary search
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8) # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]
Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :
unknown_indices = replace(dictionary_enc, passage_enc)
takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.
Then, it is simple:
passage[unknown_indices] = "<unk>"
P.S : I guess, we can get a bit more speed by using parallel=True
in the njit decorator for replace
. I am getting some weird error in that, will edit if I am able to sort it out.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53464017%2freplace-words-not-in-a-dict-to-unk%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
We break down the problem into two parts :
- Given the list of words,
passage
, find the indices, i for whichpassage[i]
is not in another list of wordsdictionary
. - Then simpy put
<unk>
at those indices.
The, major work is required in 1. To do that, we first convert the list of strings to 2D numpy arrays, so that we can carry out the operations efficiently. Also, we sort the dictionary which is required below in binary search. Also, we pad the dictionary with 0s to have the same number of columns as passage_enc
.
# assume passage, dictionary are initially lists of words
passage = np.array(passage) # np array of dtype='<U4'
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4] # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars
dictionary = np.array(dictionary)
dictionary = np.sort(dictionary)
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)
Then we just iterate over passage, and check if the string(now an array) is in dictionary. It will take O(n * m), n, m being the sizes of passage and dictionary respectively.
But, we can improve this by sorting dictionary before hand and doing a binary search in that. So, it becomes O(n * logm).
Also, we JIT compile the code to make it faster. Below, I use numba.
import numba as nb
import numpy as np
@nb.njit(cache=True) # cache as being used multiple times
def smaller(a, b):
n = len(a)
i = 0
while(i<n and a[i] == b[i]):
i+=1
if(i==n):
return False
return a[i] < b[i]
@nb.njit(cache=True)
def bin_index(array, item):
first, last = 0, len(array) - 1
while first <= last:
mid = (first + last) // 2
if np.all(array[mid] == item):
return mid
if smaller(item, array[mid]):
last = mid - 1
else:
first = mid + 1
return -1
@nb.njit(cache=True)
def replace(dictionary, passage):
unknown_indices =
n = len(passage)
for i in range(n):
ind = bin_index(dictionary, passage[i])
if(ind == -1):
unknown_indices.append(i)
return unknown_indices
Check it on sample data
import nltk
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
passage = np.array(emma)
passage = np.repeat(passage, 50) # bloat coprus to have around 10mil words
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]
persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')
dictionary = np.array(persuasion)
dictionary = np.sort(dictionary) # sort for binary search
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8) # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]
Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :
unknown_indices = replace(dictionary_enc, passage_enc)
takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.
Then, it is simple:
passage[unknown_indices] = "<unk>"
P.S : I guess, we can get a bit more speed by using parallel=True
in the njit decorator for replace
. I am getting some weird error in that, will edit if I am able to sort it out.
add a comment |
We break down the problem into two parts :
- Given the list of words,
passage
, find the indices, i for whichpassage[i]
is not in another list of wordsdictionary
. - Then simpy put
<unk>
at those indices.
The, major work is required in 1. To do that, we first convert the list of strings to 2D numpy arrays, so that we can carry out the operations efficiently. Also, we sort the dictionary which is required below in binary search. Also, we pad the dictionary with 0s to have the same number of columns as passage_enc
.
# assume passage, dictionary are initially lists of words
passage = np.array(passage) # np array of dtype='<U4'
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4] # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars
dictionary = np.array(dictionary)
dictionary = np.sort(dictionary)
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)
Then we just iterate over passage, and check if the string(now an array) is in dictionary. It will take O(n * m), n, m being the sizes of passage and dictionary respectively.
But, we can improve this by sorting dictionary before hand and doing a binary search in that. So, it becomes O(n * logm).
Also, we JIT compile the code to make it faster. Below, I use numba.
import numba as nb
import numpy as np
@nb.njit(cache=True) # cache as being used multiple times
def smaller(a, b):
n = len(a)
i = 0
while(i<n and a[i] == b[i]):
i+=1
if(i==n):
return False
return a[i] < b[i]
@nb.njit(cache=True)
def bin_index(array, item):
first, last = 0, len(array) - 1
while first <= last:
mid = (first + last) // 2
if np.all(array[mid] == item):
return mid
if smaller(item, array[mid]):
last = mid - 1
else:
first = mid + 1
return -1
@nb.njit(cache=True)
def replace(dictionary, passage):
unknown_indices =
n = len(passage)
for i in range(n):
ind = bin_index(dictionary, passage[i])
if(ind == -1):
unknown_indices.append(i)
return unknown_indices
Check it on sample data
import nltk
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
passage = np.array(emma)
passage = np.repeat(passage, 50) # bloat coprus to have around 10mil words
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]
persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')
dictionary = np.array(persuasion)
dictionary = np.sort(dictionary) # sort for binary search
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8) # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]
Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :
unknown_indices = replace(dictionary_enc, passage_enc)
takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.
Then, it is simple:
passage[unknown_indices] = "<unk>"
P.S : I guess, we can get a bit more speed by using parallel=True
in the njit decorator for replace
. I am getting some weird error in that, will edit if I am able to sort it out.
add a comment |
We break down the problem into two parts :
- Given the list of words,
passage
, find the indices, i for whichpassage[i]
is not in another list of wordsdictionary
. - Then simpy put
<unk>
at those indices.
The, major work is required in 1. To do that, we first convert the list of strings to 2D numpy arrays, so that we can carry out the operations efficiently. Also, we sort the dictionary which is required below in binary search. Also, we pad the dictionary with 0s to have the same number of columns as passage_enc
.
# assume passage, dictionary are initially lists of words
passage = np.array(passage) # np array of dtype='<U4'
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4] # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars
dictionary = np.array(dictionary)
dictionary = np.sort(dictionary)
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)
Then we just iterate over passage, and check if the string(now an array) is in dictionary. It will take O(n * m), n, m being the sizes of passage and dictionary respectively.
But, we can improve this by sorting dictionary before hand and doing a binary search in that. So, it becomes O(n * logm).
Also, we JIT compile the code to make it faster. Below, I use numba.
import numba as nb
import numpy as np
@nb.njit(cache=True) # cache as being used multiple times
def smaller(a, b):
n = len(a)
i = 0
while(i<n and a[i] == b[i]):
i+=1
if(i==n):
return False
return a[i] < b[i]
@nb.njit(cache=True)
def bin_index(array, item):
first, last = 0, len(array) - 1
while first <= last:
mid = (first + last) // 2
if np.all(array[mid] == item):
return mid
if smaller(item, array[mid]):
last = mid - 1
else:
first = mid + 1
return -1
@nb.njit(cache=True)
def replace(dictionary, passage):
unknown_indices =
n = len(passage)
for i in range(n):
ind = bin_index(dictionary, passage[i])
if(ind == -1):
unknown_indices.append(i)
return unknown_indices
Check it on sample data
import nltk
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
passage = np.array(emma)
passage = np.repeat(passage, 50) # bloat coprus to have around 10mil words
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]
persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')
dictionary = np.array(persuasion)
dictionary = np.sort(dictionary) # sort for binary search
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8) # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]
Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :
unknown_indices = replace(dictionary_enc, passage_enc)
takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.
Then, it is simple:
passage[unknown_indices] = "<unk>"
P.S : I guess, we can get a bit more speed by using parallel=True
in the njit decorator for replace
. I am getting some weird error in that, will edit if I am able to sort it out.
We break down the problem into two parts :
- Given the list of words,
passage
, find the indices, i for whichpassage[i]
is not in another list of wordsdictionary
. - Then simpy put
<unk>
at those indices.
The, major work is required in 1. To do that, we first convert the list of strings to 2D numpy arrays, so that we can carry out the operations efficiently. Also, we sort the dictionary which is required below in binary search. Also, we pad the dictionary with 0s to have the same number of columns as passage_enc
.
# assume passage, dictionary are initially lists of words
passage = np.array(passage) # np array of dtype='<U4'
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4] # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars
dictionary = np.array(dictionary)
dictionary = np.sort(dictionary)
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)
Then we just iterate over passage, and check if the string(now an array) is in dictionary. It will take O(n * m), n, m being the sizes of passage and dictionary respectively.
But, we can improve this by sorting dictionary before hand and doing a binary search in that. So, it becomes O(n * logm).
Also, we JIT compile the code to make it faster. Below, I use numba.
import numba as nb
import numpy as np
@nb.njit(cache=True) # cache as being used multiple times
def smaller(a, b):
n = len(a)
i = 0
while(i<n and a[i] == b[i]):
i+=1
if(i==n):
return False
return a[i] < b[i]
@nb.njit(cache=True)
def bin_index(array, item):
first, last = 0, len(array) - 1
while first <= last:
mid = (first + last) // 2
if np.all(array[mid] == item):
return mid
if smaller(item, array[mid]):
last = mid - 1
else:
first = mid + 1
return -1
@nb.njit(cache=True)
def replace(dictionary, passage):
unknown_indices =
n = len(passage)
for i in range(n):
ind = bin_index(dictionary, passage[i])
if(ind == -1):
unknown_indices.append(i)
return unknown_indices
Check it on sample data
import nltk
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
passage = np.array(emma)
passage = np.repeat(passage, 50) # bloat coprus to have around 10mil words
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]
persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')
dictionary = np.array(persuasion)
dictionary = np.sort(dictionary) # sort for binary search
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8) # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]
Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :
unknown_indices = replace(dictionary_enc, passage_enc)
takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.
Then, it is simple:
passage[unknown_indices] = "<unk>"
P.S : I guess, we can get a bit more speed by using parallel=True
in the njit decorator for replace
. I am getting some weird error in that, will edit if I am able to sort it out.
edited Dec 4 '18 at 10:22
answered Nov 25 '18 at 7:16
Deepak SainiDeepak Saini
1,599815
1,599815
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53464017%2freplace-words-not-in-a-dict-to-unk%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a
set
is way faster in my experience to check for membership.– Thu Yein Tun
Nov 25 '18 at 3:31