Beijing Biomedical Engineering

A mislabeled-sample recognition method for small-sample data

Authors: Qin Ruibin (秦瑞斌), Zheng Haoran (鄭浩然), Zhou Hong (周宏)

Affiliation: School of Computer Science and Technology, University of Science and Technology of China (Hefei 230027)

Keywords: mislabeling; small-sample data; microarray

Publication year · volume · issue (pages): 2012 · 31 · 6 (574-578)

Abstract:

Objective To address mislabeled samples in small-sample data, this paper proposes UCL-stability, a weighted mislabeled-sample recognition algorithm built on the CL-stability algorithm. Methods UCL-stability defines a voting weight from the number of significantly differential features that can be selected from the data after a sample's label is flipped; the weight measures how strongly flipping that sample's label affects classification. Results Experiments on two cancer gene expression (microarray) data sets indicate that both UCL-stability and CL-stability effectively identify suspect samples. Further experiments with artificially mislabeled samples show that UCL-stability achieves higher precision and recall than CL-stability, which uses no voting weights. Conclusions UCL-stability considers not only the effect that a single sample's labeling error has on classifier design in small-sample data, but also how that effect differs from sample to sample. By using feature information to quantify this difference, UCL-stability achieves better results.
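The weighted voting idea in the abstract can be illustrated with a rough sketch. The abstract does not give the weight formula, the classifier, or the feature-selection test, so everything below is an assumption made for illustration: the Welch t-statistic cutoff `t_threshold`, the nearest-centroid classifier, the `+ 1` smoothing of the weights, and all function names are hypothetical and are not taken from the paper. The sketch flips one label at a time, weights each flip by the number of differential features the flipped data yields, and scores each sample by the weighted share of flips under which a leave-one-out classifier disagrees with its given label. A small helper computes precision and recall for an artificial-mislabeling experiment.

```python
import numpy as np


def count_differential_features(X, y, t_threshold=3.0):
    """Count features whose absolute Welch t-statistic between classes 0 and 1
    exceeds t_threshold (an assumed stand-in for differential-feature selection).
    Each class needs at least two samples."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(a.mean(axis=0) - b.mean(axis=0)) / np.maximum(se, 1e-12)
    return int((t > t_threshold).sum())


def nearest_centroid_predict(X_train, y_train, x):
    """Return 1 if x is closer to the class-1 centroid, else 0 (assumed classifier)."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))


def weighted_suspicion_scores(X, y, t_threshold=3.0):
    """Score each sample in [0, 1]; higher means more likely mislabeled.

    For every other sample j, flip j's label, train without sample i, and
    check whether the prediction for i disagrees with i's given label.
    Each flip votes with a weight derived from the number of differential
    features found after that flip (hypothetical weighting)."""
    n = len(y)
    flipped = []
    weights = np.empty(n)
    for j in range(n):
        y_flip = y.copy()
        y_flip[j] = 1 - y_flip[j]
        flipped.append(y_flip)
        # +1 keeps every weight positive; this smoothing is an assumption.
        weights[j] = count_differential_features(X, y_flip, t_threshold) + 1

    scores = np.zeros(n)
    for i in range(n):
        keep = np.arange(n) != i  # leave sample i out of training
        disagree = total = 0.0
        for j in range(n):
            if j == i:
                continue
            pred = nearest_centroid_predict(X[keep], flipped[j][keep], X[i])
            disagree += weights[j] * (pred != y[i])
            total += weights[j]
        scores[i] = disagree / total
    return scores


def precision_recall(flagged, truly_mislabeled):
    """Precision and recall of a set of flagged sample indices against the
    set of samples whose labels were deliberately flipped."""
    flagged, truth = set(flagged), set(truly_mislabeled)
    tp = len(flagged & truth)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall
```

Under these assumptions, samples whose score is close to 1 would be treated as suspects. The cutoff is a tuning choice that the abstract does not specify, and setting all weights equal reduces the score to an unweighted vote.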

