Beijing Biomedical Engineering

Parallel processing and analysis of gene variant information in an omics big data environment

Authors: Huang Zhizhun (黃芝準), Wang Hongqiang (王紅強)

Affiliation: Hefei Institute of Intelligent Machines, Chinese Academy of Sciences (Hefei 230031)

Keywords: second-generation sequencing technology; Hadoop; sequence data analysis; gene mutation information; single nucleotide polymorphism

Classification number: R318.04

Year·Volume·Issue (pages): 2017·36·4 (366-371)

Abstract:

With the development and application of second-generation sequencing technology, the volume of sequencing data is growing rapidly, and analyzing massive sequencing data efficiently, quickly, and reliably has become an urgent need in biological research. Many traditional sequencing data analysis programs support only a single function, do not provide a complete analysis workflow, and lack the processing capacity to handle massive sequencing data. To address these problems, this paper designs sequencing data analysis software based on the Hadoop framework that integrates several sequence analysis programs commonly used in biological research, thereby automating the analysis of sequencing data. Given raw sequencing data as input, the software performs base quality control, sequence alignment, SNP site information extraction, and mutant gene information generation, and finally outputs a detailed report of mutant gene information. The software automates data analysis, improves its efficiency, and greatly reduces the workload of data analysts.
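The staged workflow described in the abstract can be illustrated with a small map/reduce-style sketch. This is not the authors' implementation: the abstract does not specify the software's internals, so every function name, data layout, and threshold below is a hypothetical choice made for illustration. The sketch assumes reads have already been aligned by an external tool, uses plain Python rather than Hadoop, and covers only base quality control and a naive SNP site extraction.

```python
from collections import defaultdict

# Hypothetical thresholds, chosen only for illustration.
MIN_MEAN_QUALITY = 20    # minimum mean Phred score for a read to be kept
MIN_DEPTH = 2            # minimum number of reads covering a position
MIN_ALT_FRACTION = 0.8   # minimum fraction of reads supporting the variant


def mean_phred(quality_string, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def quality_filter(reads):
    """Base quality control: keep reads whose mean quality meets the threshold."""
    return [r for r in reads if mean_phred(r["qual"]) >= MIN_MEAN_QUALITY]


def map_read(read):
    """Map step: emit (reference position, observed base) for each aligned base."""
    for i, base in enumerate(read["seq"]):
        yield read["pos"] + i, base


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_position(pos, bases, reference):
    """Reduce step: report a SNP when a non-reference base dominates the pileup."""
    counts = defaultdict(int)
    for base in bases:
        counts[base] += 1
    top_base, top_count = max(counts.items(), key=lambda kv: kv[1])
    depth = len(bases)
    if (top_base != reference[pos]
            and depth >= MIN_DEPTH
            and top_count / depth >= MIN_ALT_FRACTION):
        return {"pos": pos, "ref": reference[pos], "alt": top_base, "depth": depth}
    return None


def call_snps(reads, reference):
    """Run quality control, then the map, shuffle, and reduce phases in order."""
    kept = quality_filter(reads)
    pairs = (pair for read in kept for pair in map_read(read))
    grouped = shuffle(pairs)
    calls = (reduce_position(pos, bases, reference)
             for pos, bases in sorted(grouped.items()))
    return [call for call in calls if call is not None]


if __name__ == "__main__":
    reference = "ACGTACGT"
    reads = [
        {"pos": 0, "seq": "ACGA", "qual": "IIII"},  # high quality, mismatch at 3
        {"pos": 2, "seq": "GAAC", "qual": "IIII"},  # high quality, mismatch at 3
        {"pos": 1, "seq": "CTTA", "qual": "!!!!"},  # low quality, filtered out
    ]
    print(call_snps(reads, reference))
    # prints [{'pos': 3, 'ref': 'T', 'alt': 'A', 'depth': 2}]
```

On a real cluster the map and reduce functions would run as distributed tasks over partitions of the read set, and alignment and variant calling would be delegated to established tools rather than the simplified majority rule used here.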
