Author: Ruch, P; Baud, R; Geissbühler, A

Article Title: Evaluating and reducing the effect of data corruption when applying bag of words approaches to medical records

Abstract:
Unlike journal corpora, which are supposed to be carefully reviewed before being published, documents in a patient record are often corrupted by misspelled words and by conventional graphies or abbreviations. After a survey of the domain, the paper focuses on evaluating the effect of such corruption on an information retrieval (IR) engine. The IR system uses a classical bag of words approach, with stems as representation items and term frequency-inverse document frequency (tf-idf) as the weighting scheme; we pay special attention to the normalization factor. First results show that even low corruption levels (3%) affect retrieval effectiveness (by 4-7%), whereas higher corruption levels can affect retrieval effectiveness by 25%. We then show that an improved automatic spelling correction system, applied to the corrupted collection, can almost restore the retrieval effectiveness of the engine. (C) 2002 Elsevier Science Ireland Ltd. All rights reserved.
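
To illustrate why a single misspelling matters in a bag of words model, the following is a minimal sketch and not the authors' system. It assumes whitespace tokens instead of stems, a plain tf * log(N/df) weight, and cosine normalization as one possible normalization factor; the sample records, the misspelling "chets", and the function names are all hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one cosine-normalized tf-idf vector (a dict) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Leave an all-zero vector unchanged to avoid dividing by zero.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return vectors

def score(query, vector):
    """Sum the document weights of the query terms (a simple dot product)."""
    return sum(vector.get(t, 0.0) for t in query)

# Hypothetical toy collection of three short records.
clean = [
    "patient reports chest pain".split(),
    "patient denies fever".split(),
    "fracture of the left wrist".split(),
]
# Same collection with one misspelled token in the first record.
corrupted = [["patient", "reports", "chets", "pain"]] + clean[1:]

query = ["chest", "pain"]
clean_score = score(query, tfidf_vectors(clean)[0])
corrupted_score = score(query, tfidf_vectors(corrupted)[0])
print(f"clean: {clean_score:.3f}  corrupted: {corrupted_score:.3f}")
```

In this toy case the misspelled token no longer matches the query term, so the first record keeps only the contribution of "pain" and its score drops. A spelling corrector that maps "chets" back to "chest" before indexing would restore the original score, which is the kind of recovery the abstract describes at collection scale.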

Keywords: corruption; information retrieval; medical records; spelling correction; natural language processing

DOI: 10.1016/S1386-5056(02)00057-6

Source: INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS
