Replace words not in a dict to












2















I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>.



I tried str.maketrans but its key should be a single char.



Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.



Are there better solutions?










share|improve this question

























  • I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a set is way faster in my experience to check for membership.

    – Thu Yein Tun
    Nov 25 '18 at 3:31
















2















I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>.



I tried str.maketrans but its key should be a single char.



Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.



Are there better solutions?










share|improve this question

























  • I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a set is way faster in my experience to check for membership.

    – Thu Yein Tun
    Nov 25 '18 at 3:31














2












2








2


1






I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>.



I tried str.maketrans but its key should be a single char.



Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.



Are there better solutions?










share|improve this question
















I have a dictionary (with 10k+ words) and a passage (with 10M+ words). I want to replace all words which don't appear in the dictionary with <unk>.



I tried str.maketrans but its key should be a single char.



Then I tried this https://stackoverflow.com/a/40348578/5634636 but the regex is extremely slow.



Are there better solutions?







python string






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 25 '18 at 2:36









W.Ambrozic

901212




901212










asked Nov 25 '18 at 1:54









Huang YuhengHuang Yuheng

154216




154216













  • I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a set is way faster in my experience to check for membership.

    – Thu Yein Tun
    Nov 25 '18 at 3:31



















  • I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a set is way faster in my experience to check for membership.

    – Thu Yein Tun
    Nov 25 '18 at 3:31

















I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a set is way faster in my experience to check for membership.

– Thu Yein Tun
Nov 25 '18 at 3:31





I think you'll have to follow a similar approach as the solution you listed. But instead of iterating a list, a set is way faster in my experience to check for membership.

– Thu Yein Tun
Nov 25 '18 at 3:31












1 Answer
1






active

oldest

votes


















2














We break down the problem into two parts :




  • Given the list of words, passage, find the indices, i for which passage[i] is not in another list of words dictionary.

  • Then simpy put <unk> at those indices.


The, major work is required in 1. To do that, we first convert the list of strings to 2D numpy arrays, so that we can carry out the operations efficiently. Also, we sort the dictionary which is required below in binary search. Also, we pad the dictionary with 0s to have the same number of columns as passage_enc.



# assume passage, dictionary are initially lists of words
passage = np.array(passage) # np array of dtype='<U4'
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4] # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars

dictionary = np.array(dictionary)
dictionary = np.sort(dictionary)
dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)


Then we just iterate over passage, and check if the string(now an array) is in dictionary. It will take O(n * m), n, m being the sizes of passage and dictionary respectively.
But, we can improve this by sorting dictionary before hand and doing a binary search in that. So, it becomes O(n * logm).



Also, we JIT compile the code to make it faster. Below, I use numba.



import numba as nb
import numpy as np

@nb.njit(cache=True) # cache as being used multiple times
def smaller(a, b):
n = len(a)
i = 0
while(i<n and a[i] == b[i]):
i+=1
if(i==n):
return False
return a[i] < b[i]

@nb.njit(cache=True)
def bin_index(array, item):
first, last = 0, len(array) - 1

while first <= last:
mid = (first + last) // 2
if np.all(array[mid] == item):
return mid

if smaller(item, array[mid]):
last = mid - 1
else:
first = mid + 1

return -1

@nb.njit(cache=True)
def replace(dictionary, passage):
unknown_indices =
n = len(passage)
for i in range(n):
ind = bin_index(dictionary, passage[i])
if(ind == -1):
unknown_indices.append(i)
return unknown_indices


Check it on sample data



import nltk
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
passage = np.array(emma)
passage = np.repeat(passage, 50) # bloat coprus to have around 10mil words
passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]

persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')
dictionary = np.array(persuasion)
dictionary = np.sort(dictionary) # sort for binary search

dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))

dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8) # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]


Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :



unknown_indices = replace(dictionary_enc, passage_enc)


takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.



Then, it is simple:



passage[unknown_indices] = "<unk>"


P.S : I guess, we can get a bit more speed by using parallel=True in the njit decorator for replace. I am getting some weird error in that, will edit if I am able to sort it out.






share|improve this answer

























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53464017%2freplace-words-not-in-a-dict-to-unk%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    2














    We break down the problem into two parts :




    • Given the list of words, passage, find the indices, i for which passage[i] is not in another list of words dictionary.

    • Then simpy put <unk> at those indices.


    The, major work is required in 1. To do that, we first convert the list of strings to 2D numpy arrays, so that we can carry out the operations efficiently. Also, we sort the dictionary which is required below in binary search. Also, we pad the dictionary with 0s to have the same number of columns as passage_enc.



    # assume passage, dictionary are initially lists of words
    passage = np.array(passage) # np array of dtype='<U4'
    passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4] # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars

    dictionary = np.array(dictionary)
    dictionary = np.sort(dictionary)
    dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
    pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
    dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)


    Then we just iterate over passage, and check if the string(now an array) is in dictionary. It will take O(n * m), n, m being the sizes of passage and dictionary respectively.
    But, we can improve this by sorting dictionary before hand and doing a binary search in that. So, it becomes O(n * logm).



    Also, we JIT compile the code to make it faster. Below, I use numba.



    import numba as nb
    import numpy as np

    @nb.njit(cache=True) # cache as being used multiple times
    def smaller(a, b):
    n = len(a)
    i = 0
    while(i<n and a[i] == b[i]):
    i+=1
    if(i==n):
    return False
    return a[i] < b[i]

    @nb.njit(cache=True)
    def bin_index(array, item):
    first, last = 0, len(array) - 1

    while first <= last:
    mid = (first + last) // 2
    if np.all(array[mid] == item):
    return mid

    if smaller(item, array[mid]):
    last = mid - 1
    else:
    first = mid + 1

    return -1

    @nb.njit(cache=True)
    def replace(dictionary, passage):
    unknown_indices =
    n = len(passage)
    for i in range(n):
    ind = bin_index(dictionary, passage[i])
    if(ind == -1):
    unknown_indices.append(i)
    return unknown_indices


    Check it on sample data



    import nltk
    emma = nltk.corpus.gutenberg.words('austen-emma.txt')
    passage = np.array(emma)
    passage = np.repeat(passage, 50) # bloat coprus to have around 10mil words
    passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]

    persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')
    dictionary = np.array(persuasion)
    dictionary = np.sort(dictionary) # sort for binary search

    dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
    pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))

    dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8) # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]


    Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :



    unknown_indices = replace(dictionary_enc, passage_enc)


    takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.



    Then, it is simple:



    passage[unknown_indices] = "<unk>"


    P.S : I guess, we can get a bit more speed by using parallel=True in the njit decorator for replace. I am getting some weird error in that, will edit if I am able to sort it out.






    share|improve this answer






























      2














      We break down the problem into two parts :




      • Given the list of words, passage, find the indices, i for which passage[i] is not in another list of words dictionary.

      • Then simpy put <unk> at those indices.


      The, major work is required in 1. To do that, we first convert the list of strings to 2D numpy arrays, so that we can carry out the operations efficiently. Also, we sort the dictionary which is required below in binary search. Also, we pad the dictionary with 0s to have the same number of columns as passage_enc.



      # assume passage, dictionary are initially lists of words
      passage = np.array(passage) # np array of dtype='<U4'
      passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4] # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars

      dictionary = np.array(dictionary)
      dictionary = np.sort(dictionary)
      dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
      pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
      dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)


      Then we just iterate over passage, and check if the string(now an array) is in dictionary. It will take O(n * m), n, m being the sizes of passage and dictionary respectively.
      But, we can improve this by sorting dictionary before hand and doing a binary search in that. So, it becomes O(n * logm).



      Also, we JIT compile the code to make it faster. Below, I use numba.



      import numba as nb
      import numpy as np

      @nb.njit(cache=True) # cache as being used multiple times
      def smaller(a, b):
      n = len(a)
      i = 0
      while(i<n and a[i] == b[i]):
      i+=1
      if(i==n):
      return False
      return a[i] < b[i]

      @nb.njit(cache=True)
      def bin_index(array, item):
      first, last = 0, len(array) - 1

      while first <= last:
      mid = (first + last) // 2
      if np.all(array[mid] == item):
      return mid

      if smaller(item, array[mid]):
      last = mid - 1
      else:
      first = mid + 1

      return -1

      @nb.njit(cache=True)
      def replace(dictionary, passage):
      unknown_indices =
      n = len(passage)
      for i in range(n):
      ind = bin_index(dictionary, passage[i])
      if(ind == -1):
      unknown_indices.append(i)
      return unknown_indices


      Check it on sample data



      import nltk
      emma = nltk.corpus.gutenberg.words('austen-emma.txt')
      passage = np.array(emma)
      passage = np.repeat(passage, 50) # bloat coprus to have around 10mil words
      passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]

      persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')
      dictionary = np.array(persuasion)
      dictionary = np.sort(dictionary) # sort for binary search

      dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
      pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))

      dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8) # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]


      Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :



      unknown_indices = replace(dictionary_enc, passage_enc)


      takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.



      Then, it is simple:



      passage[unknown_indices] = "<unk>"


      P.S : I guess, we can get a bit more speed by using parallel=True in the njit decorator for replace. I am getting some weird error in that, will edit if I am able to sort it out.






      share|improve this answer




























        2












        2








        2







        We break down the problem into two parts :




        • Given the list of words, passage, find the indices, i for which passage[i] is not in another list of words dictionary.

        • Then simpy put <unk> at those indices.


        The, major work is required in 1. To do that, we first convert the list of strings to 2D numpy arrays, so that we can carry out the operations efficiently. Also, we sort the dictionary which is required below in binary search. Also, we pad the dictionary with 0s to have the same number of columns as passage_enc.



        # assume passage, dictionary are initially lists of words
        passage = np.array(passage) # np array of dtype='<U4'
        passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4] # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars

        dictionary = np.array(dictionary)
        dictionary = np.sort(dictionary)
        dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
        pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
        dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)


        Then we just iterate over passage, and check if the string(now an array) is in dictionary. It will take O(n * m), n, m being the sizes of passage and dictionary respectively.
        But, we can improve this by sorting dictionary before hand and doing a binary search in that. So, it becomes O(n * logm).



        Also, we JIT compile the code to make it faster. Below, I use numba.



        import numba as nb
        import numpy as np

        @nb.njit(cache=True) # cache as being used multiple times
        def smaller(a, b):
        n = len(a)
        i = 0
        while(i<n and a[i] == b[i]):
        i+=1
        if(i==n):
        return False
        return a[i] < b[i]

        @nb.njit(cache=True)
        def bin_index(array, item):
        first, last = 0, len(array) - 1

        while first <= last:
        mid = (first + last) // 2
        if np.all(array[mid] == item):
        return mid

        if smaller(item, array[mid]):
        last = mid - 1
        else:
        first = mid + 1

        return -1

        @nb.njit(cache=True)
        def replace(dictionary, passage):
        unknown_indices =
        n = len(passage)
        for i in range(n):
        ind = bin_index(dictionary, passage[i])
        if(ind == -1):
        unknown_indices.append(i)
        return unknown_indices


        Check it on sample data



        import nltk
        emma = nltk.corpus.gutenberg.words('austen-emma.txt')
        passage = np.array(emma)
        passage = np.repeat(passage, 50) # bloat coprus to have around 10mil words
        passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]

        persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')
        dictionary = np.array(persuasion)
        dictionary = np.sort(dictionary) # sort for binary search

        dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
        pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))

        dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8) # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]


        Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :



        unknown_indices = replace(dictionary_enc, passage_enc)


        takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.



        Then, it is simple:



        passage[unknown_indices] = "<unk>"


        P.S : I guess, we can get a bit more speed by using parallel=True in the njit decorator for replace. I am getting some weird error in that, will edit if I am able to sort it out.






        share|improve this answer















        We break down the problem into two parts :




        • Given the list of words, passage, find the indices, i for which passage[i] is not in another list of words dictionary.

        • Then simpy put <unk> at those indices.


        The, major work is required in 1. To do that, we first convert the list of strings to 2D numpy arrays, so that we can carry out the operations efficiently. Also, we sort the dictionary which is required below in binary search. Also, we pad the dictionary with 0s to have the same number of columns as passage_enc.



        # assume passage, dictionary are initially lists of words
        passage = np.array(passage) # np array of dtype='<U4'
        passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4] # 2D np array of size len(passage) x max(len(x) for x in passage), with ords of chars

        dictionary = np.array(dictionary)
        dictionary = np.sort(dictionary)
        dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
        pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))
        dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8)


        Then we just iterate over passage, and check if the string(now an array) is in dictionary. It will take O(n * m), n, m being the sizes of passage and dictionary respectively.
        But, we can improve this by sorting dictionary before hand and doing a binary search in that. So, it becomes O(n * logm).



        Also, we JIT compile the code to make it faster. Below, I use numba.



        import numba as nb
        import numpy as np

        @nb.njit(cache=True) # cache as being used multiple times
        def smaller(a, b):
        n = len(a)
        i = 0
        while(i<n and a[i] == b[i]):
        i+=1
        if(i==n):
        return False
        return a[i] < b[i]

        @nb.njit(cache=True)
        def bin_index(array, item):
        first, last = 0, len(array) - 1

        while first <= last:
        mid = (first + last) // 2
        if np.all(array[mid] == item):
        return mid

        if smaller(item, array[mid]):
        last = mid - 1
        else:
        first = mid + 1

        return -1

        @nb.njit(cache=True)
        def replace(dictionary, passage):
        unknown_indices =
        n = len(passage)
        for i in range(n):
        ind = bin_index(dictionary, passage[i])
        if(ind == -1):
        unknown_indices.append(i)
        return unknown_indices


        Check it on sample data



        import nltk
        emma = nltk.corpus.gutenberg.words('austen-emma.txt')
        passage = np.array(emma)
        passage = np.repeat(passage, 50) # bloat coprus to have around 10mil words
        passage_enc = passage.view(np.uint8).reshape(-1, passage.itemsize)[:, ::4]

        persuasion = nltk.corpus.gutenberg.words('austen-persuasion.txt')
        dictionary = np.array(persuasion)
        dictionary = np.sort(dictionary) # sort for binary search

        dictionary_enc = dictionary.view(np.uint8).reshape(-1, dictionary.itemsize)[:, ::4]
        pad = np.zeros((len(dictionary), passage_enc.shape[1] - dictionary_enc.shape[1]))

        dictionary_enc = np.hstack([dictionary_enc, pad]).astype(np.uint8) # pad with zeros so as to make dictionary_enc and passage_enc of same shape[1]


        Size of both passage and dictionary, finally come out to be of the order the OP require, for timing purposes. This call :



        unknown_indices = replace(dictionary_enc, passage_enc)


        takes 17.028s(including the preprocessing time, obviously not including the time to load the corpora) on my 8 core, 16 G system.



        Then, it is simple:



        passage[unknown_indices] = "<unk>"


        P.S : I guess, we can get a bit more speed by using parallel=True in the njit decorator for replace. I am getting some weird error in that, will edit if I am able to sort it out.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Dec 4 '18 at 10:22

























        answered Nov 25 '18 at 7:16









        Deepak SainiDeepak Saini

        1,599815




        1,599815
































            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53464017%2freplace-words-not-in-a-dict-to-unk%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            404 Error Contact Form 7 ajax form submitting

            How to know if a Active Directory user can login interactively

            TypeError: fit_transform() missing 1 required positional argument: 'X'