
First published June 28, 2007 as JAMIA PrePrint; doi:10.1197/jamia.M2435
Journal of the American Medical Informatics Association 2007;14(5):564-573
© 2007 American Medical Informatics Association


A more recent version of this article appeared on September 1, 2007

Submitted on March 13, 2007
Accepted on June 11, 2007

Rapidly Retargetable Approaches to De-identification in Medical Records

Ben Wellner (1), Matt Huyck (2), Scott Mardis (3), John Aberdeen (3)*, Alex Morgan (4), Leonid Peshkin (2), Alex Yeh (3), Janet Hitzeman (3), and Lynette Hirschman (3)

Affiliation of the authors: (1) The MITRE Corporation, Bedford, MA; Department of Computer Science, Brandeis University, Waltham, MA; (2) Center for Biomedical Informatics, Harvard Medical School, Boston, MA; (3) The MITRE Corporation, Bedford, MA; (4) The MITRE Corporation, Bedford, MA; Stanford Biomedical Informatics, Palo Alto, CA

* To whom correspondence should be addressed.

Objective This paper describes a successful approach to de-identification that was developed to participate in a recent AMIA-sponsored challenge evaluation.

Method Our approach focused on rapidly adapting two existing named entity recognition toolkits, Carafe and LingPipe, to the de-identification task.

Results The "out of the box" Carafe system achieved a very good score (phrase F-measure of 0.9664) with only four hours of work to adapt it to the de-identification task. With further tuning, we were able to reduce the token-level error term by over 36% through task-specific feature engineering and the introduction of a lexicon, achieving a phrase F-measure of 0.9736.
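As a rough illustration of how the scores above relate, the sketch below computes an F-measure from precision and recall and a relative error reduction between two scores. The function names and the example counts are hypothetical, and treating "error" as 1 minus the F-measure is an assumption made here for illustration, not a definition taken from the paper. Note that the 36% reduction reported above is a token-level figure and cannot be derived from the two phrase-level F-measures; under the assumption used here, those two scores alone imply a phrase-level reduction of roughly 21%.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (balanced F when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def phrase_f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-measure from phrase-level counts (counts are hypothetical inputs)."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return f_measure(precision, recall)


def relative_error_reduction(baseline_score: float, tuned_score: float) -> float:
    """Fraction of the baseline error removed, assuming error = 1 - score."""
    baseline_error = 1.0 - baseline_score
    tuned_error = 1.0 - tuned_score
    return (baseline_error - tuned_error) / baseline_error


if __name__ == "__main__":
    # Phrase-level F-measures reported in the abstract.
    reduction = relative_error_reduction(0.9664, 0.9736)
    print(f"Phrase-level relative error reduction: {reduction:.1%}")
```

The `beta` parameter is included only to show how an F-measure can weight recall against precision; it is not a description of the tuning method developed for Carafe.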

Conclusions We were able to achieve good performance on the de-identification task by the rapid retargeting of existing toolkits. For the Carafe system, we developed a method for tuning the balance of recall vs. precision, as well as a confidence score that correlated well with the measured F-score.







