Word frequencies from large body of scraped text
I have a file with word frequency logs from a very messy corpus of scraped Polish text that I am trying to clean to get accurate word frequencies. Since this is a big text file, I divided it into batches.
Here is a snippet from the original file:
1 środka(byłe
1 środka.było
1 środkacccxli.
1 (środkach)
1 „środkach”
1 środkach
1 środkach...
1 środkach.",
1 środkach"
1 środkach".
1 środkachwzorem
1 środkach.życie
1 środkajak
1 "środkami"
1 (środkami)
1 „środkami”)
1 środkami!"
1 środkami”
1 środkami)?
1 środkami˝.
My goal is to clean up the true word labels and remove the noisy ones (e.g. several words concatenated through punctuation). This is what the first part of the script is meant to do:
# -*- coding: utf-8 -*-
import io
import pandas as pd
import numpy as np

num_batches = 54
for i in range(1, num_batches + 1):
    infile_path = r'input_batch_' + str(i) + r'.txt'
    outfile_path = r'output_batch_' + str(i) + r'.txt'
    with io.open(infile_path, 'r', encoding='utf8') as infile, \
            io.open(outfile_path, 'w', encoding='utf8') as outfile:
        entries_raw = infile.readlines()
        entries_single = [x.strip() for x in entries_raw]
        # Each line holds a frequency and a word separated by a tab
        entries = [x.split('\t') for x in entries_single]
        data = pd.DataFrame({"word": [], "freq": []})
        for j in range(len(entries)):
            data.loc[j] = entries[j][1], entries[j][0]
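A minimal sketch of the label cleaning described above, assuming Python 3 and assuming that a true label consists only of letters once the punctuation around it is stripped. The names `EDGE_PUNCT` and `clean_label` are hypothetical and not part of the script. Entries glued together without punctuation (such as środkachwzorem) cannot be detected this way and would need a dictionary lookup.

```python
import re

# Hypothetical pattern: runs of non-word characters at either end of a token.
# In Python 3, \W is Unicode-aware, so Polish letters count as word characters.
EDGE_PUNCT = re.compile(r"^\W+|\W+$")


def clean_label(token):
    """Strip punctuation from both ends; return None for noisy tokens.

    A token is treated as noisy when anything other than letters remains
    after stripping (for example two words joined by '.' or '(').
    """
    core = EDGE_PUNCT.sub("", token)
    return core if core.isalpha() else None


raw_tokens = ['„środkach”', '(środkami)', 'środkami)?', 'środka(byłe', 'środkach.życie']
cleaned = [clean_label(t) for t in raw_tokens]
```

Tokens that come back as `None` can be dropped before the frequencies are summed.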
As you can see in the data sample above, several noisy entries belong to the same true label. Once cleaned, their frequencies should be added. This is what I try to achieve in the second part of my script:
        freq_dict = dict()
        keys = np.unique(data['word'])
        for key in keys:
            # Scans the whole frame once per unique word
            for x in range(len(data)):
                if data['word'][x] == key:
                    if key in freq_dict:
                        prior_freq = freq_dict.get(key)
                        freq_dict[key] = prior_freq + data['freq'][x]
                    else:
                        freq_dict[key] = data['freq'][x]
        file_name = u'sample_' + str(i) + u'.csv'
        data.to_csv(file_name, index=False)
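For comparison, a sketch of the same summation done in a single pass with `collections.Counter`, assuming the input is an iterable of (word, frequency) pairs. The function name `sum_frequencies` and the sample pairs are made up for illustration. The nested loops above compare every unique word against every row, whereas a single pass touches each row once. The sketch also converts the frequency to `int`, because values obtained from `str.split` are strings and `+` would concatenate them.

```python
from collections import Counter


def sum_frequencies(pairs):
    """Add up frequencies per word in one pass over (word, freq) pairs."""
    totals = Counter()
    for word, freq in pairs:
        # freq may arrive as text read from a file, so convert it first
        totals[word] += int(freq)
    return totals


# Illustrative data only, in the shape produced by splitting each line
pairs = [("środkach", "1"), ("środkami", "1"), ("środkach", "1")]
totals = sum_frequencies(pairs)
```

A `Counter` is a `dict` subclass, so the result can still be used wherever the script expects a dictionary.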
The problem with this code is that it is either buggy (running into an infinite loop or something similar) or very slow, even when processing a single batch, to the point of being impractical. Are there ways to streamline this code to make it computationally tractable? In particular, can I achieve the same goal without using for loops? Or by using a data structure other than a dictionary for the word-frequency lookup?
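One loop-free possibility, sketched under the assumption that each batch is a two-column, tab-separated file with the frequency first and the word second. The function name `load_batch` and the in-memory sample are hypothetical. `csv.QUOTE_NONE` is passed because the labels contain literal quote characters, and `keep_default_na=False` keeps labels such as "NA" from being read as missing values.

```python
import csv
import io

import pandas as pd


def load_batch(source):
    """Read one batch (a path or file-like object) into a DataFrame."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["freq", "word"],
        dtype={"freq": "int64", "word": str},
        quoting=csv.QUOTE_NONE,   # treat quote characters as ordinary text
        keep_default_na=False,    # do not turn strings like "NA" into NaN
    )


# Illustrative in-memory batch standing in for input_batch_<i>.txt
sample = io.StringIO('1\tśrodkach\n1\t"środkami"\n2\tśrodkach\n')
data = load_batch(sample)

# Sum the frequencies of identical labels without an explicit Python loop
totals = data.groupby("word")["freq"].sum()
```

`totals` is a Series indexed by word, and `totals.to_csv(...)` would write the summed counts rather than the raw rows.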
Here is the code in one piece with fixed indentation, in case you are able to reproduce my issues on your end:
# -*- coding: utf-8 -*-
import io
import pandas as pd
import numpy as np

num_batches = 54
for i in range(1, num_batches + 1):
    infile_path = r'input_batch_' + str(i) + r'.txt'
    outfile_path = r'output_batch_' + str(i) + r'.txt'
    with io.open(infile_path, 'r', encoding='utf8') as infile, \
            io.open(outfile_path, 'w', encoding='utf8') as outfile:
        entries_raw = infile.readlines()
        entries_single = [x.strip() for x in entries_raw]
        # Each line holds a frequency and a word separated by a tab
        entries = [x.split('\t') for x in entries_single]
        data = pd.DataFrame({"word": [], "freq": []})
        for j in range(len(entries)):
            data.loc[j] = entries[j][1], entries[j][0]
        freq_dict = dict()
        keys = np.unique(data['word'])
        for key in keys:
            # Scans the whole frame once per unique word
            for x in range(len(data)):
                if data['word'][x] == key:
                    if key in freq_dict:
                        prior_freq = freq_dict.get(key)
                        freq_dict[key] = prior_freq + data['freq'][x]
                    else:
                        freq_dict[key] = data['freq'][x]
        file_name = u'sample_' + str(i) + u'.csv'
        data.to_csv(file_name, index=False)
python performance dictionary lookup
Your code will not run. Please fix your indentation. – Reinderien, 26 mins ago
I've added the fixed code in one piece below. Thank you! – Des Grieux, 18 mins ago
edited 11 mins ago by Jamal♦
asked 35 mins ago by Des Grieux