Improving a pipeline in Python that is very slow for big files

I have a text file like this small example:

>chr12:86512-86521
CGGCCAAAG
>chr16:96990-96999
CTTTCATTT
>chr16:97016-97025
TTTTGATTA
>chr16:97068-97077
ATTTAGGGA

This file is divided into records of 2 lines each. The line that starts with > is the ID, and the 2nd line is a sequence made of the letters A, T, C and G. Every sequence has length 9, so each sequence has 9 positions. I want to get the frequency of each of the 4 letters at every position (9 positions in total).

Here is the expected output for the small example:

one = {'T': 1, 'A': 1, 'C': 2, 'G': 0}
two = {'T': 3, 'A': 0, 'C': 0, 'G': 1}
three = {'T': 3, 'A': 0, 'C': 0, 'G': 1}
four = {'T': 3, 'A': 0, 'C': 1, 'G': 0}
five = {'T': 0, 'A': 1, 'C': 2, 'G': 1}
six = {'T': 0, 'A': 3, 'C': 0, 'G': 1}
seven = {'T': 2, 'A': 1, 'C': 0, 'G': 1}
eight = {'T': 2, 'A': 1, 'C': 0, 'G': 1}
nine = {'T': 1, 'A': 2, 'C': 0, 'G': 1}
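To make the definition of "frequency per position" concrete, here is a minimal sketch that reproduces the table for the four example sequences. The helper name `column_frequencies` and the `alphabet` parameter are only illustrative, not part of the pipeline in this question. It transposes the sequences with `zip` and counts each column.

```python
from collections import Counter

# The four sequences from the small example.
sequences = ["CGGCCAAAG", "CTTTCATTT", "TTTTGATTA", "ATTTAGGGA"]

def column_frequencies(seqs, alphabet="TACG"):
    """Return one {letter: count} dict per position (illustrative helper)."""
    result = []
    # zip(*seqs) groups all first letters together, then all second letters, etc.
    for column in zip(*seqs):
        counts = Counter(column)
        # A Counter returns 0 for missing keys, so absent letters are reported as 0.
        result.append({letter: counts[letter] for letter in alphabet})
    return result

freqs = column_frequencies(sequences)
for position, counts in enumerate(freqs, start=1):
    print(position, counts)
```

Note that this version needs all sequences in memory at once, so it is meant as an illustration of the expected result rather than as a solution for big files.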

I am doing this in Python with the following code, which has 3 steps. Steps 1 and 2 work fine, but could you help me improve step 3, which makes the pipeline very slow for big files?

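`step1`: parse the file into a comma-separated file

import csv

def fasta_to_textfile(filename, outfile):
    # Write one CSV row per record: the header fields followed by the joined sequence.
    with open(filename) as f, open(outfile, 'w', newline='') as out_handle:
        header = sequence = None
        out = csv.writer(out_handle, delimiter=',')
        for line in f:
            if line.startswith('>'):
                # A new header closes the previous record, so write it out first.
                if header:
                    entry = header + [''.join(sequence)]
                    out.writerow(entry)
                # Drop the leading '>' and the trailing newline.
                header = line.strip('>\n').split('|')
                sequence = []
            else:
                sequence.append(line.strip())
        # Write the last record, which has no header after it.
        if header:
            entry = header + [''.join(sequence)]
            out.writerow(entry)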

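`step2`: read the comma-separated file into a Python dictionary

def file_to_dict(filename):
    answer = {}
    # The with block closes the file once the loop has finished.
    with open(filename, 'r') as f:
        for line in f:
            # Each row is expected to hold exactly two fields: ID and sequence.
            k, v = line.strip().split(',')
            answer[k.strip()] = v.strip()
    return answer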

To call the functions from steps 1 and 2:

fasta_to_textfile('infile.txt', 'out.txt')
d = file_to_dict('out.txt')
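Since steps 1 and 2 together only move the records from the input file into a dictionary, the same dictionary could probably be built without the intermediate comma-separated file. The following is a rough sketch, not a drop-in replacement: `iter_fasta` is a hypothetical name, and it assumes each header is a single field (no splitting on `|`) and that blank lines can be ignored.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-style lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record, which has no header after it.
    if header is not None:
        yield header, "".join(chunks)

# Works on any iterable of lines, e.g. an open file object.
sample = [">chr12:86512-86521\n", "CGGCCAAAG\n", ">chr16:96990-96999\n", "CTTTCATTT\n"]
d = dict(iter_fasta(sample))
print(d)
```

With a real file this would be used as `with open('infile.txt') as f: d = dict(iter_fasta(f))`. As with the original dictionary, records that share the same ID would overwrite each other.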

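`step3`: get the frequency

from collections import Counter

# One list per position, collecting the letter found at that position.
one = []
two = []
three = []
four = []
five = []
six = []
seven = []
eight = []
nine = []

mylist = d.values()
for seq in mylist:
    one.append(seq[0])
    two.append(seq[1])
    three.append(seq[2])
    four.append(seq[3])
    five.append(seq[4])
    six.append(seq[5])
    seven.append(seq[6])
    eight.append(seq[7])
    nine.append(seq[8])

# Count the letters collected for each position.
one = Counter(one)
two = Counter(two)
three = Counter(three)
four = Counter(four)
five = Counter(five)
six = Counter(six)
seven = Counter(seven)
eight = Counter(eight)
nine = Counter(nine)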
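One possible way to avoid the nine intermediate lists is to keep one `Counter` per position and update it while iterating over the sequences, so no per-letter list is ever built. This is only a sketch under the assumption that every sequence has at least `length` letters. The name `position_counts` is made up for illustration, and whether it is actually faster on a given file would need to be measured.

```python
from collections import Counter

def position_counts(seqs, length=9):
    """Count letters per position in one pass over an iterable of sequences."""
    counters = [Counter() for _ in range(length)]
    for seq in seqs:
        # zip pairs each position's counter with the letter at that position
        # and stops after `length` letters.
        for counter, letter in zip(counters, seq):
            counter[letter] += 1
    return counters

counts = position_counts(["CGGCCAAAG", "CTTTCATTT", "TTTTGATTA", "ATTTAGGGA"])
for position, counter in enumerate(counts, start=1):
    print(position, dict(counter))
```

Because `seqs` can be any iterable, it could be fed directly from `d.values()` or from a generator that reads the file, and letters that never occur at a position simply read as 0 (for example `counts[0]['G']`).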

asked 10 mins ago

user188727

New contributor
