查看原文
其他

Y叔 2018-06-01

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are ‘C’ or ‘G’. For example, the GC-content of “AGCTATAG” is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with ‘>’, followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with ‘>’ indicates the label of the next string.

In Rosalind’s implementation, a string in FASTA format will be labeled by the ID “Rosalind_xxxx”, where “xxxx” denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample Dataset

>Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC TCCCACTAATAATTCTGAGG >Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT ATATCCATTTGTCAGCAGACACGC >Rosalind_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808 60.919540

Solution

#!/usr/bin/env python3

f=open("DATA/rosalind_gc.txt", "r") seq = f.readlines()
gc = 0
n = 0
max_gc = 0
for s in seq:    s = s.strip()
   if s[0] == '>':
       if n == 0:            desc = s[1:]
          continue        else:
           if gc*100/n > max_gc:                max_gc = gc*100/n                max_desc = desc        n = 0        gc = 0        desc = s[1:]
    else:        n += len(s)        gc += s.count('G')        gc += s.count('C')

if
max_gc < gc*100/n:    max_gc = gc*100/n    max_desc = desc print(max_desc) print(max_gc)

这道题第一次出现了fasta文件,直接解析文件算gc含量,当我们第二次看到fasta文件的时候,就该写函数来解析了,而不是像上面这样子。预先上biopython的SeqIO来解析fasta,下次我们将自己写一个解析fasta的函数。

#!/usr/bin/env python3

from Bio import SeqIO fasta_sequences = SeqIO.parse("DATA/rosalind_gc.txt",'fasta')

def gc(seq):    return (seq.count('G') + seq.count('C'))/len(seq) id_vector = [] gc_vector = []

for fasta in fasta_sequences:    id_vector.append(fasta.id)    gc_vector.append(gc(str(fasta.seq))) max_gc = max(gc_vector) max_gc_idx = gc_vector.index(max_gc) max_gc_id = id_vector[max_gc_idx] print(max_gc_id) print(max_gc * 100)

    您可能也对以下帖子感兴趣

    文章有问题?点此查看未经处理的缓存