Pythonic way to compare a list of words against a list of sentences and print the matching line
I'm currently cleaning out our database and it's becoming very time-consuming. The typical
    for email in emails:
loop is nowhere near fast enough.
For instance, I am currently comparing a list of 230,000 emails against a 39,000,000-line list of full records. It would take hours to match these emails to the record lines they belong to and print them. Does anyone have any idea how to implement threading in this query to speed it up? And although this is super fast:
    strings = ("string1", "string2", "string3")
    for line in file:
        if any(s in line for s in strings):
            print("yay!")
That would never print the matching line, just the needle.
Thank you in advance.
python
edited Nov 21 '18 at 22:06 by pizza static void main
asked Nov 21 '18 at 20:13 by wuzz
migrated from serverfault.com Nov 21 '18 at 20:16
This question came from our site for system and network administrators.
So you want to print "yay" (or do something meaningful) if any word in the file matches one of the strings (emails)?
– slider
Nov 21 '18 at 20:29
Are you scanning for whole lines or parts of lines in emails from the records list?
– Marcel Wilson
Nov 21 '18 at 20:35
This isn't the fault of the loop, but rather that you have O(n*m) code where both n and m are large. This will require roughly 9 trillion comparison operations (!). Any language will be slow with the same algorithm. You'd want to convert the problem to an O(n+m) problem instead, and slider's answer using a set does that.
– Lie Ryan
Nov 21 '18 at 22:14
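The estimate in the comment above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and it counts one substring test per (email, line) pair without accounting for how expensive each test is.

```python
# Sizes quoted in the question.
num_emails = 230_000
num_lines = 39_000_000

# Nested-loop approach: every email is tested against every line.
nested_checks = num_emails * num_lines

# Set-based approach: build the set once, then make roughly one pass over the lines.
set_steps = num_emails + num_lines

print(f"nested loop: {nested_checks:,} checks")
print(f"set lookup:  {set_steps:,} steps")
```

The gap between the two figures is why changing the algorithm tends to matter more than adding threads.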
2 Answers
Here's an example solution using threads. This code splits your data into equal chunks, one per declared thread, and passes each chunk as an argument to compare().
    strings = ("string1", "string2", "string3")
    lines = ['some random', 'lines with string3', 'and without it',
             '1234', 'string2', 'string1',
             "string1", 'abcd', 'xyz']
    def compare(x, thread_idx):
        # Scan one chunk of lines and print every line containing a search string.
        print('Thread-{} started'.format(thread_idx))
        for line in x:
            if any(s in line for s in strings):
                print("We got one of strings in line: {}".format(line))
        print('Thread-{} finished'.format(thread_idx))
Threading part:
    from threading import Thread
    threads = []
    threads_amount = 3
    # Ceiling division, so a remainder is not silently dropped.
    chunk_size = max(1, -(-len(lines) // threads_amount))
    for idx, start in enumerate(range(0, len(lines), chunk_size), 1):
        threads.append(Thread(target=compare, args=(lines[start:start + chunk_size], idx)))
        threads[-1].start()
    # Wait for every thread that was actually started.
    for t in threads:
        t.join()
Output (the order may vary between runs):
    Thread-1 started
    Thread-2 started
    Thread-3 started
    We got one of strings in line: string2
    We got one of strings in line: string1
    We got one of strings in line: string1
    We got one of strings in line: lines with string3
    Thread-2 finished
    Thread-3 finished
    Thread-1 finished
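As a variation on the threaded answer above, here is a minimal sketch that collects the matching lines instead of printing them from inside each thread. It uses concurrent.futures.ThreadPoolExecutor from the standard library; the helper names find_matches and split_into_chunks and the workers value are assumptions made for illustration, not part of the original answer. Note that in CPython the global interpreter lock generally keeps threads from running pure-Python, CPU-bound work in parallel, so any speed-up should be measured rather than assumed.

```python
from concurrent.futures import ThreadPoolExecutor

strings = ("string1", "string2", "string3")
lines = ['some random', 'lines with string3', 'and without it',
         '1234', 'string2', 'string1',
         "string1", 'abcd', 'xyz']


def find_matches(chunk):
    """Return the lines in chunk that contain any of the search strings."""
    return [line for line in chunk if any(s in line for s in strings)]


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks slices without dropping a remainder."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


workers = 3  # hypothetical value, tune for your machine
chunks = split_into_chunks(lines, workers)

with ThreadPoolExecutor(max_workers=workers) as pool:
    # map() yields results in the same order as the input chunks,
    # so the matches keep their original file order.
    matches = [line for result in pool.map(find_matches, chunks) for line in result]

for line in matches:
    print(line)
```

Returning the matches from each worker keeps the output in file order and avoids interleaved print calls from several threads.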
answered Nov 21 '18 at 20:32 by Filip Młynarski
OMG Thank you so much, this is perfect.
– wuzz
Nov 21 '18 at 22:34
Just want to say thank you one more time, just sorted a 39 million line file against 280,000 unique records in 17 seconds using 1000 threads
– wuzz
Nov 21 '18 at 23:44
One possibility is to use a set to store the emails. This makes the check "if word in emails" O(1) on average, so the work done is proportional to the total number of words in your file:
    emails = {"string1", "string2", "string3"}  # this is a set
    for line in f:
        if any(word in emails for word in line.split()):
            print("yay!")
Your original solution is O(nm) (for n words and m emails), as opposed to O(n) with the set.
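Since the question asks for the matching line itself, the set-based check above can be adapted as in the following sketch. The function name matching_lines and the sample data are hypothetical, and the sketch assumes each email appears in a record as its own whitespace-separated token; records that use another delimiter (commas, for example) would need a different split.

```python
import io


def matching_lines(records, emails):
    """Yield each record line that has at least one whitespace-separated token in emails."""
    wanted = set(emails)  # set membership is O(1) on average
    for line in records:
        if any(word in wanted for word in line.split()):
            yield line.rstrip("\n")


# Hypothetical sample data standing in for the real email list and records file.
emails = ["alice@example.com", "bob@example.com"]
records = io.StringIO(
    "1 alice@example.com Alice\n"
    "2 carol@example.com Carol\n"
    "3 bob@example.com Bob\n"
)

for line in matching_lines(records, emails):
    print(line)
```

With a real file, records would be the open file object, so lines are read one at a time and the 39,000,000-line file never has to fit in memory.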
answered Nov 21 '18 at 20:31 by slider