Pythonic way to compare a list of words against a list of sentences and print the matching line












I'm currently cleaning out our database and it's becoming very time-consuming. The typical

    for email in emails:

loop is nowhere near fast enough.

For instance, I am currently comparing a list of 230,000 emails against a full records list of 39,000,000 lines. It would take hours to match these emails to the record lines they belong to and print them. Does anyone have any idea how to add threading to this query to speed it up? And although this is super fast:

    strings = ("string1", "string2", "string3")
    for line in file:
        if any(s in line for s in strings):
            print("yay!")

That would never print the matching line, just the needle.

Thank you in advance.










migrated from serverfault.com Nov 21 '18 at 20:16


This question came from our site for system and network administrators.















  • So you want to print "yay" (or do something meaningful) if any word in the file matches one of the strings (emails)?
    – slider
    Nov 21 '18 at 20:29










  • Are you scanning for whole lines or parts of lines in emails from the records list?
    – Marcel Wilson
    Nov 21 '18 at 20:35










  • This isn't the fault of the loop, but rather that you have O(n*m) code where both n and m are large. With 230,000 emails and 39,000,000 lines, it requires at least 9 trillion comparison operations (!). Any language will be slow with the same algorithm. You'd want to convert the problem to an O(n+m) problem instead, and slider's answer using a set does that.
    – Lie Ryan
    Nov 21 '18 at 22:14


















python






Question asked Nov 21 '18 at 20:13 by wuzz
Question edited Nov 21 '18 at 22:06 by pizza static void main




2 Answers

Here's an example solution using threads. This code splits your data into roughly equal chunks and passes each chunk to compare() in its own thread; the number of threads is whatever we declare.



    strings = ("string1", "string2", "string3")
    lines = ['some random', 'lines with string3', 'and without it',
             '1234', 'string2', 'string1',
             "string1", 'abcd', 'xyz']

    def compare(x, thread_idx):
        print('Thread-{} started'.format(thread_idx))
        for line in x:
            if any(s in line for s in strings):
                print("We got one of strings in line: {}".format(line))
        print('Thread-{} finished'.format(thread_idx))


Threading part:



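    from threading import Thread

    threads = []
    threads_amount = 3
    # Ceiling division, so leftover lines are not dropped when
    # len(lines) is not a multiple of threads_amount.
    chunk_size = -(-len(lines) // threads_amount)

    for idx, start in enumerate(range(0, len(lines), chunk_size), 1):
        chunk = lines[start:start + chunk_size]
        threads.append(Thread(target=compare, args=(chunk, idx)))
        threads[-1].start()

    # Wait for every worker thread to finish.
    for thread in threads:
        thread.join()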


Output:



    Thread-1 started
    Thread-2 started
    Thread-3 started
    We got one of strings in line: string2
    We got one of strings in line: string1
    We got one of strings in line: string1
    We got one of strings in line: lines with string3
    Thread-2 finished
    Thread-3 finished
    Thread-1 finished
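
Printing from inside worker threads gives output in an unpredictable order and leaves nothing to work with afterwards. As a rough sketch only (not the answerer's code), the same chunking idea can be written with the standard library's concurrent.futures so that each worker returns its matching lines. The helper names find_matches and chunked and the workers value are made up for illustration. Keep in mind that in standard CPython builds the global interpreter lock stops threads from running this kind of pure-Python string matching in parallel, so measure before assuming a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

strings = ("string1", "string2", "string3")
lines = ['some random', 'lines with string3', 'and without it',
         '1234', 'string2', 'string1',
         'string1', 'abcd', 'xyz']


def find_matches(chunk):
    """Return the lines in chunk that contain any of the search strings."""
    return [line for line in chunk if any(s in line for s in strings)]


def chunked(seq, size):
    """Split seq into consecutive slices of at most size items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


workers = 3  # hypothetical worker count, chosen only for this example
chunk_size = -(-len(lines) // workers)  # ceiling division

with ThreadPoolExecutor(max_workers=workers) as pool:
    # map() yields results in chunk order, so the output order is stable.
    results = list(pool.map(find_matches, chunked(lines, chunk_size)))

# Flatten the per-chunk results into one list of matching lines.
matches = [line for chunk in results for line in chunk]
for line in matches:
    print("We got one of strings in line: {}".format(line))
```

Because the matches come back as a list, they can be written to a file or processed further instead of only being printed.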





  • OMG Thank you so much, this is perfect.
    – wuzz
    Nov 21 '18 at 22:34










  • Just want to say thank you one more time, just sorted a 39 million line file against 280,000 unique records in 17 seconds using 1000 threads
    – wuzz
    Nov 21 '18 at 23:44



















One possibility is to use a set to store the emails. This makes the membership check (word in emails) O(1) on average, so the work done is proportional to the total number of words in your file:



    emails = {"string1", "string2", "string3"}  # this is a set

    for line in f:
        if any(word in emails for word in line.split()):
            print("yay!")


Your original solution is O(nm) (for n words and m emails), as opposed to O(n) with the set.
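
Since the question asks for the matching line itself, here is a minimal sketch of the set approach that yields the whole record line. It rests on assumptions not stated in the thread: that the emails sit inside delimited records (so line.split() alone would not isolate them), that matching should be case-insensitive, and that a loose regular expression is good enough to pull email-like tokens out of a line. The names EMAIL_RE and matching_lines and the sample data are invented for the example.

```python
import io
import re

# Deliberately loose, hypothetical pattern for email-like tokens;
# adjust it to the real record format.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def matching_lines(records, emails):
    """Yield each record line that contains at least one wanted email."""
    wanted = {e.strip().lower() for e in emails}  # O(1) average lookups
    for line in records:
        found = EMAIL_RE.findall(line.lower())
        if any(token in wanted for token in found):
            yield line.rstrip("\n")


# Tiny in-memory stand-in for the 39-million-line records file.
records = io.StringIO(
    "1,Alice,alice@example.com,NY\n"
    "2,Bob,bob@example.org,LA\n"
    "3,Carol,carol@example.net,SF\n"
)
emails = ["bob@example.org", "carol@example.net", "nobody@example.com"]

matches = list(matching_lines(records, emails))
for line in matches:
    print(line)
```

With a real file, records would be the open file object, so the lines are streamed one at a time and only the email set has to fit in memory.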






Threading answer: answered Nov 21 '18 at 20:32 by Filip Młynarski

Set answer: answered Nov 21 '18 at 20:31 by slider
