Beijing Biomedical Engineering

Microorganism detection algorithm based on high-throughput sequencing data

Authors: LI Jiangyu, WANG Xiaolei, LIU Yang, MAO Yiqing, ZHAO Dongsheng, WANG Yumin

Affiliation: Institute of Health Service and Medical Information, Academy of Military Medical Sciences (Beijing 100850)

Keywords: high-throughput sequencing; microorganism detection; sequence alignment; sequence assembly; algorithm

Year·Volume·Issue (pages): 2013·32·5 (463-466)

Abstract:

Objective: To design a microorganism detection algorithm for high-throughput sequencing data that is capable, fast, runs locally, and does not depend on a particular runtime environment.

Methods: The microorganism genomes are divided into groups (viruses, bacteria, and fungi). In each round, one group of genomes is used as the reference: the reads that map to it are extracted, reads originating from the human genome are filtered out, and the remaining reads are assembled into contigs that are then aligned against the genomes of that group. The virus group is used first. If the alignment results identify the species of the microorganism, the procedure ends; otherwise the next group of genomes is used. If the species is still undetermined after all groups have been used, the human reads are filtered out of the remaining sequencing data and the rest is assembled. Contigs that cannot be matched to any microorganism genome by sequence alignment are classified as genome fragments of an unknown pathogenic microorganism.

Results: Simulated data and real sequencing data were analyzed with the new algorithm, using RINS for comparison. For the known pathogenic microorganism, the average processing time was 75 min for the new algorithm and 767 min for RINS; the two algorithms gave the same detection result, and the new algorithm produced longer assembled sequences. For the unknown pathogenic microorganism sample, the average processing time was 64 min for the new algorithm and 584 min for RINS, and the new algorithm recovered a fairly complete original sequence. For the real data (SRR073726), the average processing time was 23 min for the new algorithm and 68 min for RINS, with the same detection result.

Conclusions: The algorithm detects microorganisms accurately and quickly. It can also discover unknown microorganisms and output fragments of their genomes.
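The group-by-group control flow described in the Methods can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the `Steps` container, the `detect` function, and the toy stand-ins are hypothetical names, and each stage here is a simple substring check where a real pipeline would call a read aligner, an assembler, and a contig search tool. The sketch also assumes that reads mapped to a group that yields no identification are set aside before the next group is tried, which the abstract does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

Reads = List[str]
Contigs = List[str]


@dataclass
class Steps:
    """Hypothetical pluggable stages of the pipeline."""
    map_reads: Callable[[Reads, str], Reads]           # reads mapping to one genome group
    remove_human: Callable[[Reads], Reads]             # drop reads of human origin
    assemble: Callable[[Reads], Contigs]               # reads -> contigs
    identify: Callable[[Contigs, str], Optional[str]]  # species name, or None


@dataclass
class Result:
    species: Optional[str]  # None means no known species was identified
    group: Optional[str]
    contigs: Contigs


def detect(reads: Sequence[str], groups: Sequence[str], steps: Steps) -> Result:
    remaining = list(reads)
    for group in groups:
        mapped = steps.map_reads(remaining, group)
        contigs = steps.assemble(steps.remove_human(mapped))
        species = steps.identify(contigs, group)
        if species is not None:
            return Result(species, group, contigs)  # stop at the first hit
        # Assumption: reads already mapped to this group are set aside.
        mapped_set = set(mapped)
        remaining = [r for r in remaining if r not in mapped_set]
    # No group matched: assemble what is left after removing human reads.
    contigs = steps.assemble(steps.remove_human(remaining))
    unknown = [c for c in contigs
               if all(steps.identify([c], g) is None for g in groups)]
    return Result(None, None, unknown)


def toy_steps(refs: Dict[str, Dict[str, str]], human: str) -> Steps:
    """Toy stand-ins: 'mapping' is exact substring matching, 'assembly' only
    removes duplicate reads. Real tools would replace every one of these."""
    def map_reads(reads: Reads, group: str) -> Reads:
        return [r for r in reads if any(r in g for g in refs[group].values())]

    def remove_human(reads: Reads) -> Reads:
        return [r for r in reads if r not in human]

    def assemble(reads: Reads) -> Contigs:
        return sorted(set(reads))

    def identify(contigs: Contigs, group: str) -> Optional[str]:
        for name, genome in refs[group].items():
            if any(c in genome for c in contigs):
                return name
        return None

    return Steps(map_reads, remove_human, assemble, identify)


# Made-up reference sequences, for illustration only.
REFS = {
    "virus": {"virusA": "ACGTTGCAAC"},
    "bacteria": {"bactB": "GGGCCCTTTA"},
    "fungi": {"fungC": "TTAACCGGTT"},
}
HUMAN = "AAAAACCCCCGGGGG"
GROUPS = ["virus", "bacteria", "fungi"]  # virus group is tried first
STEPS = toy_steps(REFS, HUMAN)

result = detect(["ACGTT", "AAAAC"], GROUPS, STEPS)
print(result.species, result.group)
```

Stopping at the first group that yields an identification is one plausible reason the procedure can be fast: when the sample contains a virus, the bacterial and fungal genomes are never used.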
