Beijing Biomedical Engineering

Recognition technology of the laboratory sheet based on Tesseract

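Authors: ZHANG Congyue, YIN Ziming, SUN Dayun, DAI Wei

Affiliation: School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology (Shanghai 200093)

Keywords: laboratory sheet; optical character recognition; image processing; error correction

Classification number: R318.04; TP391.5

Year·Volume·Issue (pages): 2019·38·3 (283-289)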

Abstract:

Objective: Laboratory sheets are a faithful record of a patient's health status, so converting paper laboratory sheets into electronic medical records is valuable for insurance claims, hospital transfers, remote consultation, and the establishment of health records. However, clinical practice currently lacks a tool that can recognize the contents of a laboratory sheet and convert it directly into a structured electronic medical record. This paper therefore designs a complete, automated optical character recognition (OCR) method for medical laboratory sheets.

Methods: The laboratory sheet image was first preprocessed: it was binarized with Otsu's method, and the Hough transform was used for deskewing and feature extraction. The contents of the sheet were then recognized using Tesseract's beam search algorithm and the k-nearest neighbor algorithm. The character library was trained, and a medical dictionary file and an ambiguous-character (unicharambigs) file were used to correct recognition errors. On this basis, an OCR engine for medical laboratory sheets was built. Finally, the accuracy of the OCR engine was evaluated on 302 laboratory sheet records collected from a community hospital in Shanghai.

Results: The recognition accuracy of the proposed method was 92.72%, which basically meets clinical needs.

Conclusion: The Tesseract-based OCR engine removes the need to enter laboratory sheet data by hand. A doctor only needs to photograph a laboratory sheet and upload the photo, and the engine converts its contents into a structured electronic medical record. This greatly improves doctors' working efficiency and makes the data easier to reuse.
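The binarization step named in the Methods can be illustrated with a minimal, pure-Python sketch of Otsu's method. This is not the authors' implementation, and the function names `otsu_threshold` and `binarize` are hypothetical. It assumes the image is supplied as a flat list of 8-bit grayscale values, and it picks the threshold that maximizes the between-class variance of the two resulting pixel groups.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels with value <= t form one class (dark, e.g. printed text) and
    pixels with value > t form the other (bright, e.g. paper).
    """
    # Build the 256-bin grayscale histogram.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # bright class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor 1 / total**2.
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels):
    """Map each pixel to 0 (dark class) or 255 (bright class)."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]


if __name__ == "__main__":
    # Toy "image": two dark gray levels (text) and two bright ones (paper).
    sample = [20] * 50 + [30] * 50 + [200] * 50 + [220] * 50
    print(otsu_threshold(sample))
    print(binarize(sample)[:3], binarize(sample)[-3:])
```

In practice this step would normally be delegated to an image-processing library rather than written by hand. The sketch only makes the selection criterion explicit: the threshold falls in the gap between the dark text pixels and the bright paper background.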

