Pythonic way to compare a list of words against a list of sentences and print the matching line

I'm currently cleaning out our database and it's becoming very time consuming. The typical

for email in emails:

loop is nowhere near fast enough.

For instance, I am currently comparing a list of 230,000 emails to a full records list of 39,000,000 lines. It would take hours to match these emails to the record lines they belong to and print them. Does anyone have any idea how to implement threading in this query to speed it up? And although this is super fast:

strings = ("string1", "string2", "string3")
for line in file:
    if any(s in line for s in strings):
        print("yay!")

it would never print the matching line, just the needle.

Thank you in advance

edited Nov 21 '18 at 22:06 by pizza static void main

asked Nov 21 '18 at 20:13 by wuzz

migrated from serverfault.com Nov 21 '18 at 20:16

This question came from our site for system and network administrators.

So you want to print "yay" (or do something meaningful) if any word in the file matches one of the strings (emails)?
– slider
Nov 21 '18 at 20:29

Are you scanning for whole lines or parts of lines in emails from the records list?
– Marcel Wilson
Nov 21 '18 at 20:35

This isn't the fault of the loop, but rather that you have O(n*m) code where both n and m are large values. This will require roughly 9 trillion comparison operations (!). Any language will be slow with the same algorithm. You'd want to convert the problem to an O(n+m) problem instead, and slider's answer using a set does that.
– Lie Ryan
Nov 21 '18 at 22:14

python

2 Answers

Here's an example solution using threads. This code splits your data into equal chunks and passes each chunk to compare() in a separate thread; the number of threads is the amount we declare in threads_amount.

strings = ("string1", "string2", "string3")
lines = ['some random', 'lines with string3', 'and without it',
         '1234', 'string2', 'string1',
         "string1", 'abcd', 'xyz']

def compare(x, thread_idx):
    print('Thread-{} started'.format(thread_idx))
    for line in x:
        if any(s in line for s in strings):
            print("We got one of strings in line: {}".format(line))
    print('Thread-{} finished'.format(thread_idx))

Threading part:

from threading import Thread

threads = []
threads_amount = 3
# Ceiling division, so trailing lines are not dropped when the split is uneven
chunk_size = max(1, -(-len(lines) // threads_amount))

for idx, start in enumerate(range(0, len(lines), chunk_size), 1):
    threads.append(Thread(target=compare, args=(lines[start:start + chunk_size], idx)))
    threads[-1].start()

# Wait for every thread that was actually started
for thread in threads:
    thread.join()

Output:

Thread-1 started
Thread-2 started
Thread-3 started
We got one of strings in line: string2
We got one of strings in line: string1
We got one of strings in line: string1
We got one of strings in line: lines with string3
Thread-2 finished
Thread-3 finished
Thread-1 finished
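
The same chunking idea can also be sketched with concurrent.futures, collecting the matching lines instead of printing from inside each thread. This is only a sketch under assumptions: find_matches and chunked are hypothetical helper names, and the data is the toy list from above. Keep in mind that in CPython the global interpreter lock generally keeps threads from running this kind of pure-Python scanning in parallel, so the sketch shows the structure rather than promising a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

strings = ("string1", "string2", "string3")
lines = ['some random', 'lines with string3', 'and without it',
         '1234', 'string2', 'string1',
         'string1', 'abcd', 'xyz']


def find_matches(chunk):
    """Return the lines in chunk that contain any of the search strings."""
    return [line for line in chunk if any(s in line for s in strings)]


def chunked(seq, n_chunks):
    """Split seq into at most n_chunks consecutive slices, dropping nothing."""
    size = max(1, -(-len(seq) // n_chunks))  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]


with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in chunk order, so matches keep the input order
    matches = [line
               for part in pool.map(find_matches, chunked(lines, 3))
               for line in part]

for line in matches:
    print(line)
```

Returning the matches from each worker, rather than printing inside it, keeps the output in the original line order and avoids interleaved prints.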

answered Nov 21 '18 at 20:32 by Filip Młynarski

OMG Thank you so much, this is perfect.
– wuzz
Nov 21 '18 at 22:34

Just want to say thank you one more time, just sorted a 39 million line file against 280,000 unique records in 17 seconds using 1000 threads
– wuzz
Nov 21 '18 at 23:44


One possibility is to use a set to store emails. This makes the membership check (word in emails) O(1) on average, so the work done is proportional to the total number of words in your file:

emails = {"string1", "string2", "string3"} # this is a set

for line in f:
    if any(word in emails for word in line.split()):
        print("yay!")

Your original solution is O(nm) (for n words and m emails), as opposed to O(n) with the set.
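
Since the question asks for the matching line itself, the set lookup above could be adapted along these lines. It is a minimal sketch, assuming each email appears in a record as a whitespace-separated word (a different delimiter would need a different split); the sample addresses, the in-memory records stand-in, and the matching_lines helper are all hypothetical.

```python
import io

emails = {"alice@example.com", "bob@example.com"}  # hypothetical needles

# Stand-in for the real records file; io.StringIO iterates line by line
# in the same way an open text file does
records = io.StringIO(
    "1 alice@example.com Alice Smith\n"
    "2 carol@example.com Carol Jones\n"
    "3 bob@example.com Bob Brown\n"
)


def matching_lines(f, needles):
    """Yield each line that has at least one whitespace-separated word in needles."""
    for line in f:
        # isdisjoint() is False as soon as one word of the line is in the set
        if not needles.isdisjoint(line.split()):
            yield line.rstrip("\n")


matched = list(matching_lines(records, emails))
for line in matched:
    print(line)
```

With a real file, records would be replaced by the object returned from open(), and the generator reads one line at a time, so the 39,000,000 lines never need to be held in memory at once.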

answered Nov 21 '18 at 20:31 by slider
