其他
跟着 Cell 学 Ribo-seq 分析 三 (Metagene Plot)
随缘
1前言
走过一条长长的路,路两旁的一幅幅动态,提醒着曾经我也和他们一样的拥有。
2引言
绘制所有基因在 start codon 上游 50nt 及下游 1500 nt 的平均 desity 的分布。作者采用对每个位置处的 density 对该基因的长度和表达量进行标准化。
3distance from start codon
读入之前 center weighting
计算的 density 以及自己准备的基因位置信息文件
, 首先计算每个基因的表达量,作者 筛选了大于 50 的基因 进行分析,此外筛选 长度大于 400nt 的基因 作为分析,然后对基因 上下游的-50-1500 nt 的 density 进行标准化, 最后对 所有基因每个位置出的标准化的 density 进行加和。
代码实现:
"""
Supplementary Note 12: Meta-gene analysis from start codon
Authors: Eugene Oh, Annemarie Becker
inputFileP:
read density file for plus strand (Supplementary Note 2)
col0: position along genome
col1: read density at that position
inputFileM:
read density file for minus strand (Supplementary Note 2)
col0: position along genome
col1: read density at that position
inputListP:
E. coli MC4100 gene list of the plus strand
col0: gene name
col1: start coordinate of gene
col2: stop coordinate of gene
inputListM
E. coli MC4100 gene list of the minus strand
col0: gene name
col1: start coordinate of gene
col2: stop coordinate of gene
outputFile:
average read density along an average transcript
col0: position along the average transcript, counted from the start codon
col1: mean read density at that position
"""
def metagene(inputFileP, inputFileM, inputListP, inputListM, outputFile):
mylist = range(-50, 1501)
rangeDict = dict([i, 0] for i in mylist)
countDict = dict([i, 0] for i in mylist)
### Plus Strand ###
# Upload plus strand data in dictionary
DictP = {}
inFile = open(inputFileP, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
DictP[col0] = col1
line = inFile.readline()
# Upload plus strand gene list
inFile = open(inputListP, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = str(fields[0]) #gene name
col1 = int(fields[1]) #start of gene
col2 = int(fields[2]) #stop of gene
length = abs(col1 - col2) + 1
# Select genes
if length > 400: #use genes longer than 400 nt; this value can be changed
normVal = 0
for Z in range(col1, col2 + 1):
if Z in DictP:
normVal += DictP[Z] #determine expression level
if normVal > 50: #continue if sum of reads is greater than 50; this can be changed if desired
meanNorm = normVal / length #calculate read density (average reads per base)
for K in range(-50, 1501):
elem = col1 + K
if elem in DictP and K <= length:
reads = DictP[elem] / meanNorm #divide reads at one position by average number of reads per base for this gene
rangeDict[K] += reads #sum up reads at that position from all genes
for K in countDict: #how often was this position counted in the calculation
if K <= length:
countDict[K] += 1
elif length <= 400:
pass
line = inFile.readline()
### Minus Strand ###
## Upload minus strand data in dictionary
DictM = {}
inFile = open(inputFileM, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
DictM[col0] = col1
line = inFile.readline()
# Upload minus strand gene list
inFile = open(inputListM, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = str(fields[0])
col1 = int(fields[1])
col2 = int(fields[2])
length = abs(col1 - col2) + 1
# Select genes
if length > 400:
normVal = 0
for Z in range(col2, col1 + 1):
if Z in DictM:
normVal += DictM[Z]
if normVal > 50:
meanNorm = normVal / length
for K in range(-50, 1501):
elem = col1 - K
if elem in DictM and K <= length:
reads = DictM[elem] / meanNorm
rangeDict[K] += reads
for K in countDict:
if K <= length:
countDict[K] += 1
elif length <= 400:
pass
line = inFile.readline()
### Output data ###
tupledlist1 = list(rangeDict.items())
tupledlist1.sort()
tupledlist2 = list(countDict.items())
tupledlist2.sort()
fullDict = {}
zippedlist = zip(tupledlist1, tupledlist2)
for elem in zippedlist:
col0 = elem[0][0] #list0 col0 = position (K)
col1 = elem[0][1] #list0 col1 = norm read number
col2 = elem[1][1] #list1 col1 = how often was position counted
fullDict[col0] = col1 / col2 #normalization2
# Finish output
tupledlist = list(fullDict.items())
tupledlist.sort()
outFile = open(outputFile, 'w')
for J in tupledlist:
for K in range(2):
outFile.write(str(J[K]))
if K < 1:
outFile.write('\t')
outFile.write('\n')
代码也比较好懂,最后批量运行:
# metagene(inputFileP, inputFileM, inputListP, inputListM, outputFile)
# sample name
inputFileP = ['Ssb1-inter-rep1.PDensity.txt','Ssb1-inter-rep2.PDensity.txt','Ssb2-inter-rep1.PDensity.txt','Ssb2-inter-rep2.PDensity.txt',
'Ssb1-trans-rep1.PDensity.txt','Ssb1-trans-rep2.PDensity.txt','Ssb2-trans-rep1.PDensity.txt','Ssb2-trans-rep2.PDensity.txt']
inputFileM = ['Ssb1-inter-rep1.MDensity.txt','Ssb1-inter-rep2.MDensity.txt','Ssb2-inter-rep1.MDensity.txt','Ssb2-inter-rep2.MDensity.txt',
'Ssb1-trans-rep1.MDensity.txt','Ssb1-trans-rep2.MDensity.txt','Ssb2-trans-rep1.MDensity.txt','Ssb2-trans-rep2.MDensity.txt']
inputListP = 'geneListPlus.txt';inputListM = 'geneListMlus.txt'
outputFile = ['Ssb1-inter-rep1.MetaDist2StartCodon.txt','Ssb1-inter-rep2.MetaDist2StartCodon.txt','Ssb2-inter-rep1.MetaDist2StartCodon.txt','Ssb2-inter-rep2.MetaDist2StartCodon.txt',
'Ssb1-trans-rep1.MetaDist2StartCodon.txt','Ssb1-trans-rep2.MetaDist2StartCodon.txt','Ssb2-trans-rep1.MetaDist2StartCodon.txt','Ssb2-trans-rep2.MetaDist2StartCodon.txt']
# run
for i in range(0,8):
metagene(''.join(['2.ribo-density-data/',inputFileP[i]]),''.join(['2.ribo-density-data/',inputFileM[i]]),
inputListP,inputListM,
''.join(['5.metagene-plot-data/',outputFile[i]]))
4distance from stop codon
然后绘制离 stop codon 上游 1500nt
及下游 50nt
的,代码类似:
"""
Supplementary Note 13: Meta-gene analysis from stop codon
Authors: Eugene Oh, Annemarie Becker
inputFileP:
read density file for plus strand (Supplementary Note 2)
col0: position along genome
col1: read density at that position
inputFileM:
read density file for minus strand (Supplementary Note 2)
col0: position along genome
col1: read density at that position
inputListP:
E. coli MC4100 gene list of the plus strand
col0: gene name
col1: start coordinate of gene
col2: stop coordinate of gene
inputListM
E. coli MC4100 gene list of the minus strand
col0: gene name
col1: start coordinate of gene
col2: stop coordinate of gene
outputFile:
average read density along an average transcript
col0: position along the average transcript, counted from the stop codon
col1: mean read density at that position
"""
def metagene(inputFileP, inputFileM, inputListP, inputListM, outputFile):
mylist = range(-1500, 51)
rangeDict = dict([i, 0] for i in mylist)
countDict = dict([i, 0] for i in mylist)
### Plus Strand ###
# Upload plus strand data in dictionary
DictP = {}
inFile = open(inputFileP, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
DictP[col0] = col1
line = inFile.readline()
# Upload plus strand gene list
inFile = open(inputListP, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = str(fields[0]) #gene name
col1 = int(fields[1]) #start of gene
col2 = int(fields[2]) #stop of gene
length = abs(col1 - col2) + 1
# Select genes
if length > 400: #use genes longer than 400 nt; this value can be changed
normVal = 0
for Z in range(col1, col2 + 1):
if Z in DictP:
normVal += DictP[Z] #determine expression level
if normVal > 50: #continue if sum of reads is greater than 50; this can be changed if desired
meanNorm = normVal / length #calculate read density (average reads per base)
for K in range(-1500, 51):
elem = col2 + K
if elem in DictP and abs(K) <= length:
reads = DictP[elem] / meanNorm #divide reads at one position by average number of reads per base for this gene
rangeDict[K] += reads #sum up reads at that position from all genes
for K in countDict: #how often was this position counted in the calculation
if abs(K) <= length:
countDict[K] += 1
elif length <= 400:
pass
line = inFile.readline()
### Minus Strand ###
## Upload minus strand data in dictionary
DictM = {}
inFile = open(inputFileM, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
DictM[col0] = col1
line = inFile.readline()
# Upload minus strand gene list
inFile = open(inputListM, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = str(fields[0])
col1 = int(fields[1])
col2 = int(fields[2])
length = abs(col1 - col2) + 1
# Select genes
if length > 400:
normVal = 0
for Z in range(col2, col1 + 1):
if Z in DictM:
normVal += DictM[Z]
if normVal > 50:
meanNorm = normVal / length
for K in range(-1500, 51):
elem = col2 - K
if elem in DictM and abs(K) <= length:
reads = DictM[elem] / meanNorm
rangeDict[K] += reads
for K in countDict:
if abs(K) <= length:
countDict[K] += 1
elif length <= 400:
pass
line = inFile.readline()
### Output data ###
tupledlist1 = list(rangeDict.items())
tupledlist1.sort()
tupledlist2 = list(countDict.items())
tupledlist2.sort()
fullDict = {}
zippedlist = zip(tupledlist1, tupledlist2)
for elem in zippedlist:
col0 = elem[0][0] #list0 col0 = position (K)
col1 = elem[0][1] #list0 col1 = norm read number
col2 = elem[1][1] #list1 col1 = how often was position counted
fullDict[col0] = col1 / col2 #normalization2
# Finish output
tupledlist = list(fullDict.items())
tupledlist.sort()
outFile = open(outputFile, 'w')
for J in tupledlist:
for K in range(2):
outFile.write(str(J[K]))
if K < 1:
outFile.write('\t')
outFile.write('\n')
批量运行:
# metagene(inputFileP, inputFileM, inputListP, inputListM, outputFile)
# sample name
inputFileP = ['Ssb1-inter-rep1.PDensity.txt','Ssb1-inter-rep2.PDensity.txt','Ssb2-inter-rep1.PDensity.txt','Ssb2-inter-rep2.PDensity.txt',
'Ssb1-trans-rep1.PDensity.txt','Ssb1-trans-rep2.PDensity.txt','Ssb2-trans-rep1.PDensity.txt','Ssb2-trans-rep2.PDensity.txt']
inputFileM = ['Ssb1-inter-rep1.MDensity.txt','Ssb1-inter-rep2.MDensity.txt','Ssb2-inter-rep1.MDensity.txt','Ssb2-inter-rep2.MDensity.txt',
'Ssb1-trans-rep1.MDensity.txt','Ssb1-trans-rep2.MDensity.txt','Ssb2-trans-rep1.MDensity.txt','Ssb2-trans-rep2.MDensity.txt']
inputListP = 'geneListPlus.txt';inputListM = 'geneListMlus.txt'
outputFile = ['Ssb1-inter-rep1.MetaDist2StopCodon.txt','Ssb1-inter-rep2.MetaDist2StopCodon.txt','Ssb2-inter-rep1.MetaDist2StopCodon.txt','Ssb2-inter-rep2.MetaDist2StopCodon.txt',
'Ssb1-trans-rep1.MetaDist2StopCodon.txt','Ssb1-trans-rep2.MetaDist2StopCodon.txt','Ssb2-trans-rep1.MetaDist2StopCodon.txt','Ssb2-trans-rep2.MetaDist2StopCodon.txt']
# run
for i in range(0,8):
metagene(''.join(['2.ribo-density-data/',inputFileP[i]]),''.join(['2.ribo-density-data/',inputFileM[i]]),
inputListP,inputListM,
''.join(['5.metagene-plot-data/',outputFile[i]]))
最终文件:
5绘图
微信扫一扫付费阅读本文
可试读69%
微信扫一扫付费阅读本文