Beijing Biomedical Engineering

Study on a structured patient feature representation method based on the Skip-gram word embedding algorithm

Authors: Huang Yanqun, Wang Ni, Liu Honglei, Fei Xiaolu, Wei Lan, Chen Hui

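Affiliations: School of Biomedical Engineering, Capital Medical University (Beijing 100069); Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University (Beijing 100069); Xuanwu Hospital, Capital Medical University (Beijing 100053)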

Keywords: electronic medical records; Skip-gram algorithm; feature representation; natural language processing; word embedding

Classification codes: R318; TP31

Publication year·volume·issue (pages): 2019·38·6 (568-574)

Abstract:

Objective: To find, based on the Skip-gram word embedding algorithm from representation learning, a method that overcomes the high dimensionality of structured patient features in electronic medical records (EMR) and represents those features at a semantic level.

Methods: Data were derived from the EMR system of a tertiary hospital in Beijing, China. Three categories of structured patient features were extracted: two discrete categories (disease history and medications) and one continuous category (laboratory tests), with each laboratory value discretized according to its normal reference range. Using the Skip-gram algorithm, the discrete features and the discretized laboratory features were embedded as concept vectors into a single low-dimensional real-valued vector space. To evaluate the effectiveness of the representation and reveal latent relationships between features, t-SNE was used to visualize the concept space, and cosine distances between concept vectors were calculated to quantify those relationships and corroborate the visualization.

Results: Representing patient features as concept vectors reduced the dimensionality of the traditional feature representation and revealed latent relationships between features to some degree. Clinically related features lay close to one another in the concept vector space.

Conclusions: Structured patient features can be represented as meaningful low-dimensional vectors using the Skip-gram algorithm. This offers one approach to the high dimensionality of EMR data representation and to the analysis of latent relationships among structured features.
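The preprocessing described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the reference ranges, the token format (`GLU_high` and so on), the example record, and the window size are all assumptions made for illustration. The sketch shows how a laboratory value might be discretized by its normal range, how one patient's features might be flattened into a token sequence, which (center, context) pairs Skip-gram would train on, and how cosine similarity between two vectors is computed. Training the embedding itself is left to a word2vec implementation.

```python
from math import sqrt

# Hypothetical reference ranges (low, high); real ranges depend on the laboratory.
REFERENCE_RANGES = {
    "GLU": (3.9, 6.1),  # fasting glucose, mmol/L (illustrative values)
    "K": (3.5, 5.3),    # serum potassium, mmol/L (illustrative values)
}


def discretize_lab(name, value):
    """Map a continuous lab value to a token: low, normal, or high."""
    low, high = REFERENCE_RANGES[name]
    if value < low:
        return f"{name}_low"
    if value > high:
        return f"{name}_high"
    return f"{name}_normal"


def patient_tokens(diagnoses, medications, labs):
    """Flatten one patient's record into a single list of feature tokens."""
    lab_tokens = [discretize_lab(name, value) for name, value in labs.items()]
    return list(diagnoses) + list(medications) + lab_tokens


def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs, the training examples Skip-gram uses."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; cosine distance is 1 minus this."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical record: one diagnosis, one medication, two lab results.
record = patient_tokens(
    diagnoses=["E11"],  # ICD-10 code for type 2 diabetes mellitus
    medications=["metformin"],
    labs={"GLU": 9.2, "K": 4.1},
)
pairs = list(skipgram_pairs(record, window=1))
print(record)
print(pairs)
```

Because the features within one record have no natural order, a small sliding window is only one possible choice; treating the whole record as the context of each feature is another, and the abstract does not say which was used.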

