Word frequencies from large body of scraped text

I have a file with word frequency logs from a very messy corpus of scraped Polish text that I am trying to clean to get accurate word frequencies. Since this is a big text file, I divided it into batches.



Here is a snippet from the original file:




 1 środka(byłe
1 środka.było
1 środkacccxli.
1 (środkach)
1 „środkach”
1 środ­kach
1 środkach...
1 środkach.",
1 środkach"
1 środkach".
1 środkachwzorem
1 środkach.życie
1 środkajak
1 "środkami"
1 (środkami)
1 „środkami”)
1 środkami!"
1 środkami”
1 środkami)?
1 środkami˝.



My goal is to keep the true word labels and remove the noisy ones (e.g. several words run together through punctuation). The first part of the script takes care of this cleaning.




# -*- coding: utf-8 -*-

import io
import pandas as pd
import numpy as np

num_batches = 54

for i in range(1, num_batches + 1):

    infile_path = r'input_batch_' + str(i) + r'.txt'
    outfile_path = r'output_batch_' + str(i) + r'.txt'

    with io.open(infile_path, 'r', encoding='utf8') as infile, \
            io.open(outfile_path, 'w', encoding='utf8') as outfile:

        # Split every line of the batch into (frequency, word).
        entries_raw = infile.readlines()
        entries_single = [x.strip() for x in entries_raw]
        entries = [x.split('\t') for x in entries_single]

        # Build the data frame one row at a time.
        data = pd.DataFrame({"word": [], "freq": []})

        for j in range(len(entries)):
            data.loc[j] = entries[j][1], entries[j][0]


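To make this concrete, I suspect each batch could be loaded in a single call instead of appending rows one at a time, roughly along the lines of the untested sketch below; I have not checked how the stray quote characters in the data would be handled.

# Untested sketch: read one batch straight into a DataFrame instead of
# filling it row by row with .loc (columns assumed to be the count first,
# then the word, separated by whitespace).
data = pd.read_csv(infile_path, sep=r'\s+', header=None,
                   names=['freq', 'word'], encoding='utf-8')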

As you can see in the data sample above, several noisy entries belong to the same true label. Once cleaned, their frequencies should be added. This is what I try to achieve in the second part of my script:




freq_dict = dict()
keys = np.unique(data['word'])

for key in keys:
    for x in range(len(data)):
        if data['word'][x] == key:
            if key in freq_dict:
                prior_freq = freq_dict.get(key)
                freq_dict[key] = prior_freq + data['freq'][x]
            else:
                freq_dict[key] = data['freq'][x]

file_name = u'sample_' + str(i) + u'.csv'
data.to_csv(file_name, index=False)



The problem with this code is that it is either buggy (it appears to run into an infinite loop or something similar) or simply very slow, even for a single batch, to the point of being impractical. Are there ways to streamline this code so that it becomes computationally tractable? In particular, can I achieve the same goal without the for loops, or by using a different data structure than a dictionary for the word-frequency lookup?


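To make the question concrete, this is the kind of loop-free aggregation I am hoping exists. It is only a rough, untested sketch, and it assumes the freq column is first converted to numbers (in my script the counts come in as strings):

# Untested sketch: sum the frequencies of identical word labels without
# any explicit Python loops, using a pandas groupby.
data['freq'] = data['freq'].astype(int)
collapsed = data.groupby('word', as_index=False)['freq'].sum()
collapsed.to_csv(u'sample_' + str(i) + u'.csv', index=False)

My hope is that something like this would avoid scanning the whole data frame once per unique key, which is what the nested loops above do.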

Here is the code in one piece with fixed indentation, in case you are able to reproduce my issues on your end:



# -*- coding: utf-8 -*-

import io
import pandas as pd
import numpy as np

num_batches = 54

for i in range(1, num_batches + 1):

    infile_path = r'input_batch_' + str(i) + r'.txt'
    outfile_path = r'output_batch_' + str(i) + r'.txt'

    with io.open(infile_path, 'r', encoding='utf8') as infile, \
            io.open(outfile_path, 'w', encoding='utf8') as outfile:

        # Split every line of the batch into (frequency, word).
        entries_raw = infile.readlines()
        entries_single = [x.strip() for x in entries_raw]
        entries = [x.split('\t') for x in entries_single]

        # Build the data frame one row at a time.
        data = pd.DataFrame({"word": [], "freq": []})

        for j in range(len(entries)):
            data.loc[j] = entries[j][1], entries[j][0]

        # Sum the frequencies of entries that share the same word label.
        freq_dict = dict()
        keys = np.unique(data['word'])

        for key in keys:
            for x in range(len(data)):
                if data['word'][x] == key:
                    if key in freq_dict:
                        prior_freq = freq_dict.get(key)
                        freq_dict[key] = prior_freq + data['freq'][x]
                    else:
                        freq_dict[key] = data['freq'][x]

        file_name = u'sample_' + str(i) + u'.csv'
        data.to_csv(file_name, index=False)

Tags: python, performance, dictionary, lookup

edited 11 mins ago by Jamal
asked 35 mins ago by Des Grieux

  • Your code will not run. Please fix your indentation. – Reinderien, 26 mins ago

  • I've added the fixed code in one piece below. Thank you! – Des Grieux, 18 mins ago