s04 - Computing GC Content | 自由微信

Y叔 2018-06-01

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are ‘C’ or ‘G’. For example, the GC-content of “AGCTATAG” is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with ‘>’, followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with ‘>’ indicates the label of the next string.

In Rosalind’s implementation, a string in FASTA format will be labeled by the ID “Rosalind_xxxx”, where “xxxx” denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample Dataset

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808
60.919540

Solution

#!/usr/bin/env python3

f=open("DATA/rosalind_gc.txt", "r")
seq = f.readlines()
gc = 0
n = 0
max_gc = 0
for s in seq:
    s = s.strip()
    if s[0] == '>':
        if n == 0:
            desc = s[1:]
            continue
        else:
            if gc*100/n > max_gc:
                max_gc = gc*100/n
                max_desc = desc
        n = 0
        gc = 0
        desc = s[1:]
     else:
        n += len(s)
        gc += s.count('G')
        gc += s.count('C')

if max_gc < gc*100/n:
    max_gc = gc*100/n
    max_desc = desc

print(max_desc)
print(max_gc)

这道题第一次出现了fasta文件，直接解析文件算gc含量，当我们第二次看到fasta文件的时候，就该写函数来解析了，而不是像上面这样子。预先上biopython的SeqIO来解析fasta，下次我们将自己写一个解析fasta的函数。

#!/usr/bin/env python3

from Bio import SeqIO

fasta_sequences = SeqIO.parse("DATA/rosalind_gc.txt",'fasta')

def gc(seq):
    return (seq.count('G') + seq.count('C'))/len(seq)

id_vector = []
gc_vector = []

for fasta in fasta_sequences:
    id_vector.append(fasta.id)
    gc_vector.append(gc(str(fasta.seq)))

max_gc = max(gc_vector)
max_gc_idx = gc_vector.index(max_gc)
max_gc_id = id_vector[max_gc_idx]

print(max_gc_id)
print(max_gc * 100)

李尚福、魏凤和双双被拿下，与美国一份报告是否有关？

有的人走了，却永远活着

圈内疯传某谣言

不要放过这些人渣

“辣条二哥”直追卫龙，麻辣王子工厂遇洪受关注

Problem

Sample Dataset

Sample Output

Solution

您可能也对以下帖子感兴趣

李尚福、魏凤和双双被拿下，与美国一份报告是否有关？

有的人走了，却永远活着

圈内疯传某谣言

不要放过这些人渣

“辣条二哥”直追卫龙，麻辣王子工厂遇洪受关注

生成图片，分享到微信朋友圈

Problem

Sample Dataset

Sample Output

Solution

您可能也对以下帖子感兴趣