Word frequencies from large body of scraped text












I have a file with word frequency logs from a very messy corpus of scraped Polish text that I am trying to clean to get accurate word frequencies. Since this is a big text file, I divided it into batches.



Here is a snippet from the original file:




1 środka(byłe
1 środka.było
1 środkacccxli.
1 (środkach)
1 „środkach”
1 środ­kach
1 środkach...
1 środkach.",
1 środkach"
1 środkach".
1 środkachwzorem
1 środkach.życie
1 środkajak
1 "środkami"
1 (środkami)
1 „środkami”)
1 środkami!"
1 środkami”
1 środkami)?
1 środkami˝.



My goal is to clean true word labels and remove noisy word labels (e.g. collocations of words concatenated through punctuation). This is what is achieved by the first part of the script.




# -*- coding: utf-8 -*-

import io
import pandas as pd
import numpy as np

num_batches = 54

for i in range(1, num_batches + 1):

    infile_path = r'input_batch_' + str(i) + r'.txt'
    outfile_path = r'output_batch_' + str(i) + r'.txt'

    with io.open(infile_path, 'r', encoding='utf8') as infile, \
            io.open(outfile_path, 'w', encoding='utf8') as outfile:

        # Each line holds a frequency and a label separated by a tab.
        entries_raw = infile.readlines()
        entries_single = [x.strip() for x in entries_raw]
        entries = [x.split('\t') for x in entries_single]

        data = pd.DataFrame({"word": [], "freq": []})

        # Copy the parsed rows into the DataFrame one at a time.
        for j in range(len(entries)):
            data.loc[j] = entries[j][1], entries[j][0]
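
To make the cleaning goal concrete, here is a minimal sketch of one possible rule. The rule itself is an assumption and is not taken from the script: remove soft hyphens, strip non-letter characters from both ends of a label, and discard the label if anything other than letters remains inside it. The name `clean_label` is hypothetical.

```python
SOFT_HYPHEN = "\u00ad"


def clean_label(raw):
    """Return a cleaned word, or None if the label looks like noise.

    Assumed rule (hypothetical, not from the original script): drop soft
    hyphens, trim non-letter characters from both ends, and reject any
    label that still contains a non-letter character.
    """
    label = raw.replace(SOFT_HYPHEN, "")
    # Skip leading characters that are not letters.
    start = 0
    while start < len(label) and not label[start].isalpha():
        start += 1
    # Skip trailing characters that are not letters.
    end = len(label)
    while end > start and not label[end - 1].isalpha():
        end -= 1
    core = label[start:end]
    # str.isalpha() is False for the empty string and for any string
    # containing punctuation, so both cases are rejected here.
    if not core.isalpha():
        return None
    return core


for raw in ["(środkach)", "„środkami”)", "środka.było", "środ\u00adkach"]:
    print(repr(raw), "->", clean_label(raw))
```

A rule like this cannot detect words concatenated without any punctuation (for example `środkachwzorem`); catching those would presumably need a check against a word list.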



As you can see in the data sample above, several noisy entries belong to the same true label. Once cleaned, their frequencies should be added. This is what I try to achieve in the second part of my script:




        freq_dict = dict()
        keys = np.unique(data['word'])

        # For every unique label, scan all rows and add up the matches.
        for key in keys:
            for x in range(len(data)):
                if data['word'][x] == key:
                    if key in freq_dict:
                        prior_freq = freq_dict.get(key)
                        freq_dict[key] = prior_freq + data['freq'][x]
                    else:
                        freq_dict[key] = data['freq'][x]

        file_name = u'sample_' + str(i) + u'.csv'
        data.to_csv(file_name, index=False)



The problem with this code is that it is either buggy (perhaps running into an infinite loop) or very slow, even for a single batch, to the point of being impractical. Are there ways to streamline this code to make it computationally tractable? In particular, can I achieve the same goal without using for loops, or by using a data structure other than a dictionary for the word-frequency lookup?
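
One possible direction, offered as a sketch rather than a definitive fix: a dictionary lookup takes constant time on average, so the summation can be done in a single pass over the rows with no inner scan. The example below uses `collections.Counter` and assumes each row is a (frequency, label) pair whose frequency is still a string, as it would be after reading a text file. The helper name `sum_frequencies` and the sample rows are made up for illustration.

```python
from collections import Counter


def sum_frequencies(rows):
    """Add up frequencies per word in one pass over (freq, word) pairs."""
    totals = Counter()
    for freq, word in rows:
        # int() matters: values read from a text file are strings, and
        # adding strings would concatenate them instead of summing.
        totals[word] += int(freq)
    return totals


# Made-up rows in the same (frequency, label) order as the input file.
rows = [("1", "środkach"), ("1", "środkami"), ("1", "środkach"), ("2", "środkach")]
totals = sum_frequencies(rows)
print(totals["środkach"])  # 4
print(totals["środkami"])  # 1
```

The same loop could be fed directly from the lines of each batch file, which would avoid building a DataFrame row by row.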



Here is the code in one piece with fixed indentation, in case you are able to reproduce my issues on your end:



# -*- coding: utf-8 -*-

import io
import pandas as pd
import numpy as np

num_batches = 54

for i in range(1, num_batches + 1):

    infile_path = r'input_batch_' + str(i) + r'.txt'
    outfile_path = r'output_batch_' + str(i) + r'.txt'

    with io.open(infile_path, 'r', encoding='utf8') as infile, \
            io.open(outfile_path, 'w', encoding='utf8') as outfile:

        # Each line holds a frequency and a label separated by a tab.
        entries_raw = infile.readlines()
        entries_single = [x.strip() for x in entries_raw]
        entries = [x.split('\t') for x in entries_single]

        data = pd.DataFrame({"word": [], "freq": []})

        # Copy the parsed rows into the DataFrame one at a time.
        for j in range(len(entries)):
            data.loc[j] = entries[j][1], entries[j][0]

        freq_dict = dict()
        keys = np.unique(data['word'])

        # For every unique label, scan all rows and add up the matches.
        for key in keys:
            for x in range(len(data)):
                if data['word'][x] == key:
                    if key in freq_dict:
                        prior_freq = freq_dict.get(key)
                        freq_dict[key] = prior_freq + data['freq'][x]
                    else:
                        freq_dict[key] = data['freq'][x]

        file_name = u'sample_' + str(i) + u'.csv'
        data.to_csv(file_name, index=False)
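
As a rough illustration of why the nested loops in the script are likely slow rather than stuck: they compare every row against every unique label, so the work grows with the product of the two counts. The sketch below only counts steps on made-up data. The function names and the batch size are hypothetical, and it does not measure the extra overhead of indexing a DataFrame one cell at a time.

```python
def nested_loop_steps(words):
    """Count comparisons made by scanning every row once per unique key."""
    steps = 0
    for key in set(words):
        for word in words:
            steps += 1  # stands for one `word == key` comparison
    return steps


def single_pass_steps(words):
    """Count dictionary updates when each row is visited exactly once."""
    totals = {}
    steps = 0
    for word in words:
        totals[word] = totals.get(word, 0) + 1
        steps += 1
    return steps


# Hypothetical batch: 1,000 rows containing 500 distinct labels.
words = ["w%d" % (n % 500) for n in range(1000)]
print(nested_loop_steps(words))  # 500000
print(single_pass_steps(words))  # 1000
```

In a scraped corpus most labels occur only once, so the number of unique labels is close to the number of rows and the nested version grows roughly quadratically with batch size.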









  • Your code will not run. Please fix your indentation.
    – Reinderien
    26 mins ago










  • I've added the fixed code in one piece below. Thank you!
    – Des Grieux
    18 mins ago
















python performance dictionary lookup






edited 11 mins ago by Jamal

asked 35 mins ago by Des Grieux























