Beijing Biomedical Engineering

Patient privacy-preserving algorithm for Chinese electronic medical records based on rules and machine learning

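Authors: Wang Yangyang (王陽陽), Zheng Xichuan (鄭西川)

Affiliations: Shanghai Sixth People's Hospital Affiliated to Shanghai Jiao Tong University (Shanghai 200033); School of Biomedical Engineering, Shanghai Jiao Tong University (Shanghai 200230)

Keywords: privacy protection; electronic medical record; named entity; regular expression; hidden Markov model

Classification number: R318.04

Year · Volume · Issue (pages): 2019 · 38 · 5 (492-497)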

Abstract:

Objective: To address the risk of patient privacy leakage in the publication and sharing of medical data, and the low efficiency of manual de-identification, this paper proposes an algorithm that combines rules with machine learning to effectively remove patient privacy information from electronic medical records. Methods: Following the US Health Insurance Portability and Accountability Act (HIPAA) and the writing conventions of Chinese electronic medical records, private data are divided into three categories: numbers, dates, and named entities. Regular expressions identify numeric and date private data, and a hidden Markov model identifies named entities. Discharge summaries from Shanghai Sixth People's Hospital serve as test data, and the recall and precision of private-data identification are evaluated with the hold-out method. Results: The model achieves an overall recall above 90%. Recall for numeric and date private data exceeds 96%, and recognition of Chinese personal names also outperforms identification by a single person. Conclusions: The model combining rules and machine learning effectively identifies patients' private data and facilitates the sharing of medical data.
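The rule-based part of the method and its evaluation can be illustrated with a minimal sketch. The two patterns, the placeholder tags, and the function names `redact` and `recall_precision` are assumptions made for illustration; the abstract does not give the paper's actual rules, and the hidden Markov model step for named entities is not shown here.

```python
import re

# Hypothetical date pattern: a four-digit year, month, and day separated by
# "-", "/", "." or the Chinese markers 年/月, with an optional trailing 日.
DATE_RE = re.compile(r"\d{4}\s*[-/.年]\s*\d{1,2}\s*[-/.月]\s*\d{1,2}日?")

# Hypothetical numeric-identifier pattern: a run of six or more digits,
# optionally ending in X (as in some ID numbers). Short numbers such as
# ages or lab values are left untouched.
NUMBER_RE = re.compile(r"\d{6,}[Xx]?")


def redact(text):
    """Replace date-like and long-number-like spans with placeholder tags."""
    text = DATE_RE.sub("[DATE]", text)      # dates first, so their digits survive as one unit
    text = NUMBER_RE.sub("[NUMBER]", text)  # then remaining long digit runs
    return text


def recall_precision(predicted, gold):
    """Compute (recall, precision) over sets of (start, end) spans."""
    true_positives = len(predicted & gold)
    recall = true_positives / len(gold) if gold else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision


if __name__ == "__main__":
    # Made-up sample sentence, not taken from the paper's data.
    sample = "患者于2018年3月5日入院，住院号12345678。"
    print(redact(sample))
```

Under the hold-out method mentioned in the abstract, a function like `recall_precision` would be applied only to the held-out portion of the annotated discharge summaries, comparing the spans flagged by the rules against the manually labelled ones.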
