
First published June 28, 2007 as JAMIA PrePrint; doi:10.1197/jamia.M2444
Journal of the American Medical Informatics Association 2007;14(5):550-563
© 2007 American Medical Informatics Association


A more recent version of this article appeared on September 1, 2007

Submitted on March 19, 2007
Accepted on June 15, 2007

Evaluating the State-of-the-Art in Automatic De-identification

Özlem Uzuner PhD1*, Yuan Luo1, and Peter Szolovits PhD2

Affiliation of the authors: 1 University at Albany, SUNY, Albany, NY; 2 MIT CSAIL, Cambridge, MA

* To whom correspondence should be addressed.

As part of the i2b2 (Informatics for Integrating Biology to the Bedside) project, the authors organized a Natural Language Processing (NLP) challenge on automatically removing private health information (PHI) from medical discharge records. This manuscript provides an overview of this "de-identification challenge", describes the data and the annotation process, explains the evaluation metrics, discusses the nature of the systems that addressed the challenge, and analyzes the results of the received system runs. The challenge data consisted of discharge summaries drawn from the Partners Healthcare system. The authors prepared this data for the challenge by replacing authentic PHI with synthesized surrogates. To focus the challenge on non-dictionary-based de-identification methods, the data was enriched with out-of-vocabulary PHI surrogates, i.e., made-up names. The data also included some PHI surrogates that were ambiguous with medical non-PHI terms. A total of seven teams participated in the challenge. Each team submitted up to three system runs, for a total of sixteen submissions. The authors used precision, recall, and F-measure to evaluate the submitted system runs based on their token-level and instance-level performance against the ground truth. The systems with the best performance scored above 98% in F-measure for all categories of identifiers. Most out-of-vocabulary PHI could be identified accurately. However, identifying ambiguous PHI proved challenging. The performance of the systems on the test data set is encouraging. Future evaluations of these systems will involve larger data sets from more heterogeneous sources.
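The abstract names precision, recall, and F-measure at the token level and the instance level but does not spell out the computation. The following is a minimal sketch of how such scores are commonly computed, not the challenge's actual scoring code. It assumes a balanced F-measure (equal weight on precision and recall) and exact-match scoring of instances; the function name, the category labels, and the toy annotations are all hypothetical.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f(
    gold: Set[Hashable], predicted: Set[Hashable]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for two sets of PHI annotations.

    Assumption: balanced F-measure, i.e. the harmonic mean of precision
    and recall. The abstract does not state which weighting was used.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Token level: every PHI token is scored on its own.
# Each item is (token_index, category); the values are made up.
gold_tokens = {(3, "PATIENT"), (4, "PATIENT"), (10, "DATE"), (15, "HOSPITAL")}
pred_tokens = {(3, "PATIENT"), (10, "DATE"), (15, "DOCTOR")}
token_scores = precision_recall_f(gold_tokens, pred_tokens)

# Instance level: a multi-token PHI mention is scored as one unit.
# Each item is (first_token, last_token, category); a prediction counts
# only if boundaries and category both match (an assumption of this sketch).
gold_instances = {(3, 4, "PATIENT"), (10, 10, "DATE"), (15, 15, "HOSPITAL")}
pred_instances = {(3, 3, "PATIENT"), (10, 10, "DATE"), (15, 15, "DOCTOR")}
instance_scores = precision_recall_f(gold_instances, pred_instances)

print("token-level    P=%.3f R=%.3f F=%.3f" % token_scores)
print("instance-level P=%.3f R=%.3f F=%.3f" % instance_scores)
```

In this toy data the system finds only the first token of a two-token patient name. That earns partial credit at the token level but none at the instance level, which is why the two views of performance can differ.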







