Getting the frequency of letters in each position

I have a text file like this example:

>chr12:86512-86521
CGGCCAAAG
>chr16:96990-96999
CTTTCATTT
>chr16:97016-97025
TTTTGATTA
>chr16:97068-97077
ATTTAGGGA

The file is divided into records of two lines each. The line that starts with > is the ID, and the second line is a sequence made up of the letters A, T, C and G. Every sequence has length 9, so there are 9 positions per sequence. I want to get the frequency of each of the 4 letters at every one of the 9 positions.

Here is the expected output for the small example:

one = {'T': 1, 'A': 1, 'C': 2, 'G': 0}
two = {'T': 3, 'A': 0, 'C': 0, 'G': 1}
three = {'T': 3, 'A': 0, 'C': 0, 'G': 1}
four = {'T': 3, 'A': 0, 'C': 1, 'G': 0}
five = {'T': 0, 'A': 1, 'C': 2, 'G': 1}
six = {'T': 0, 'A': 3, 'C': 0, 'G': 1}
seven = {'T': 2, 'A': 1, 'C': 0, 'G': 1}
eight = {'T': 2, 'A': 1, 'C': 0, 'G': 1}
nine = {'T': 1, 'A': 2, 'C': 0, 'G': 1}

I am doing this in Python using the following code, which has 3 steps. Steps 1 and 2 work fine, but could you help me improve step 3, which makes this pipeline very slow for big files?

Step 1: to parse the file into a comma-separated file

def fasta_to_textfile(filename, outfile):
    with open(filename) as f, open(outfile, 'w') as outfile:
        header = sequence = None
        out = csv.writer(outfile, delimiter=',')
        for line in f:
            if line.startswith('>'):
                # A new record starts: write out the previous one, if any.
                if header:
                    entry = header + [''.join(sequence)]
                    out.writerow(entry)
                header = line.strip('>\n').split('|')
                sequence = []
            else:
                sequence.append(line.strip())
        # Write out the last record.
        if header:
            entry = header + [''.join(sequence)]
            out.writerow(entry)

Step 2: comma-separated file to a Python dictionary

def file_to_dict(filename):
    f = open(filename, 'r')
    answer = {}
    for line in f:
        k, v = line.strip().split(',')
        answer[k.strip()] = v.strip()
    return answer

To call the functions from steps 1 and 2:

a = fasta_to_textfile('infile.txt', 'out.txt')
d = file_to_dict('out.txt')

Step 3: to get the frequency

one=[]
two=[]
three=[]
four=[]
five=[]
six=[]
seven=[]
eight=[]
nine=[]

mylist = d.values()

for seq in mylist:
    one.append(seq[0])
    two.append(seq[1])
    three.append(seq[2])
    four.append(seq[3])
    five.append(seq[4])
    six.append(seq[5])
    seven.append(seq[6])
    eight.append(seq[7])
    nine.append(seq[8])

from collections import Counter

one=Counter(one)
two=Counter(two)
three=Counter(three)
four=Counter(four)
five=Counter(five)

edited 12 mins ago

Jamal♦

asked Dec 22 '18 at 20:20

user188727

python

1 Answer

You forgot to add import csv and from collections import Counter; you probably missed them while copy-pasting. Also, the spacing around your = signs is inconsistent in step 3; try to follow PEP 8. Finally, a is useless in this line:

a = fasta_to_textfile('infile.txt', 'out.txt')

Since fasta_to_textfile has no return statement, the call evaluates to None, so a is always None.
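A small sketch of that behaviour, using a hypothetical stand-in function (write_nothing_back is not part of the asker's code) that, like fasta_to_textfile, never executes a return statement:

```python
def write_nothing_back(path):
    """Hypothetical stand-in: does some work but has no return statement."""
    _ = path.upper()  # placeholder for real work such as writing a file


# Binding the call to a name only ever stores None.
result = write_nothing_back('out.txt')
print(result is None)
```

Calling the function on its own line, without assigning the result, expresses the same thing more clearly.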

Is the conversion to the CSV file really necessary? An example of a pipeline without it:

1. Read the file.

2. Extract the sequences and load them into an N×9 table, where N is the number of sequences.

3. Swap the rows and columns (numpy can help you out here).

4. Use a simple for loop that applies Counter to each row (which is really a column of the original table); this can be written in far fewer lines.

Sadly, I don't have time to rewrite bits of your code right now.
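As an illustration only, the four steps above might be sketched as follows. This is neither the asker's nor the answerer's code: the names read_sequences and position_counts are hypothetical, the built-in zip(*...) stands in for the numpy transpose, and every sequence is assumed to have the same length and to contain only A, T, C and G.

```python
from collections import Counter


def read_sequences(lines):
    """Yield sequence lines, skipping blank lines and '>' header lines."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith('>'):
            yield line


def position_counts(sequences, alphabet='ATCG'):
    """Return one Counter per position; every letter of `alphabet` is a key."""
    counts = []
    # zip(*sequences) swaps rows and columns: one tuple of letters per position.
    # Note that it stops at the shortest sequence and holds all rows in memory.
    for column in zip(*sequences):
        counter = Counter(dict.fromkeys(alphabet, 0))  # start every letter at 0
        counter.update(column)
        counts.append(counter)
    return counts


# The small example from the question, inlined so the sketch is self-contained.
example = """\
>chr12:86512-86521
CGGCCAAAG
>chr16:96990-96999
CTTTCATTT
>chr16:97016-97025
TTTTGATTA
>chr16:97068-97077
ATTTAGGGA
"""

counts = position_counts(read_sequences(example.splitlines()))
for position, counter in enumerate(counts, start=1):
    print(position, dict(counter))
```

For a real file, passing the open file object to read_sequences instead of example.splitlines() should work the same way. Keeping the counters in a list rather than in nine separate variables is a design choice of this sketch: it removes the repeated append and Counter lines from step 3.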

One last thing - are you sure your example is correct? I tried loading it but got: ValueError: not enough values to unpack (expected 2, got 1)...
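Python raises that error when the right-hand side of k, v = ... yields a single item, which is what line.strip().split(',') returns for a blank line or a line without a comma. Whether that is what happened with the example file is an assumption. The sketch below only shows one way the parsing could tolerate such lines; the name parse_pairs and the sample rows are hypothetical.

```python
def parse_pairs(lines):
    """Build a dict from 'key,value' lines, skipping lines without a comma."""
    answer = {}
    for line in lines:
        # str.partition always returns three parts; sep is '' if no comma exists.
        key, sep, value = line.strip().partition(',')
        if not sep:
            continue  # blank or malformed line: skipped in this sketch
        answer[key.strip()] = value.strip()
    return answer


# One well-formed row, one blank line, and one row with no comma.
rows = ['chr12:86512-86521,CGGCCAAAG', '', 'CTTTCATTT']
pairs = parse_pairs(rows)
print(pairs)
```

Silently skipping bad rows is only one option; raising an error that names the offending line may be preferable when the input is expected to be clean.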

edited Dec 23 '18 at 0:28

answered Dec 23 '18 at 0:17

user171191
