To appear in Artificial Intelligence in Medicine journal, 2002

Uniqueness of Medical Data Mining

Krzysztof J. Cios (1,2,3,4) and G. William Moore (5,6,7)

(1) University of Colorado at Denver
(2) University of Colorado at Boulder
(3) University of Colorado Health Sciences Center, Denver, CO
(4) 4cData LLC, Golden, CO
(5) Baltimore Veterans Affairs Medical Center, Baltimore, MD
(6) University of Maryland School of Medicine, Baltimore, MD
(7) The Johns Hopkins University School of Medicine, Baltimore, MD

Keywords: medical data mining; unique features of medical data mining and knowledge discovery; ethical, security and legal aspects of medical data mining

0. Introduction

This article emphasizes the uniqueness of medical data mining. This is a position paper in which the authors' intent, based on their medical and data mining experience, is to alert the data mining community to the unique features of medical data mining. The reason for writing the paper is that researchers who perform data mining in other fields may not be aware of the constraints and difficulties of mining the privacy-sensitive, heterogeneous data of medicine. We discuss ethical, security and legal aspects of medical data mining. In addition, we pose several questions that must be answered by the community, so that both the patients on whom the data are collected and the data miners can benefit.

Human medical data are at once the most rewarding and the most difficult of all biological data to mine and analyze. Humans are the most closely watched species on earth. Human subjects can provide observations that cannot easily be gained from animal studies, such as visual and auditory sensations, the perception of pain, discomfort, hallucinations, and recollection of possibly relevant prior traumas and exposures. Most animal studies are short-term, and therefore cannot track long-term disease processes of medical interest, such as preneoplasia or atherosclerosis. With human data, there is no issue of having to extrapolate animal observations to the human species. Some
three-quarters of a billion persons living in North America, Europe, and Asia have at least some of their medical information collected in electronic form, at least transiently. These subjects generate volumes of data that an animal experimentalist can only dream of. On the other hand, there are ethical, legal, and social constraints on data collection and distribution that do not apply to non-human species, and that limit the scientific conclusions that may be drawn.

The major points of uniqueness of medical data may be organized under four general headings: heterogeneity of medical data; ethical, legal, and social issues; statistical philosophy; and special status of medicine.

1. Heterogeneity of medical data

Raw medical data are voluminous and heterogeneous. Medical data may be collected from various images, interviews with the patient, laboratory data, and the physician's observations and interpretations. All these components may bear upon the diagnosis, prognosis, and treatment of the patient, and cannot be ignored. The major areas of heterogeneity of medical data may be organized under these headings: volume and complexity of medical data; physician's interpretation; sensitivity and specificity analysis; poor mathematical characterization; and canonical form.

1.1. Volume and complexity of medical data

Raw medical data are voluminous and heterogeneous. Medical data may be collected from various images, interviews with the patient, and the physician's notes and interpretations. All these data elements may bear upon the diagnosis, prognosis, and treatment of the patient, and must be taken into account in data mining research.

More and more medical procedures employ imaging as a preferred diagnostic tool. Thus, there is a need to develop methods for efficient mining in databases of images, which is more difficult than mining in purely numerical databases. As an example, imaging techniques like SPECT, MRI, and PET, and the collection of ECG or EEG signals, can generate
gigabytes of data per day. A single cardiac SPECT procedure on one patient may contain dozens of two-dimensional images. In addition, an image of the patient's organ will almost always be accompanied by other clinical information, as well as the physician's interpretation (clinical impression, diagnosis). This heterogeneity requires high-capacity data storage devices and new tools to analyze such data. It is obviously very difficult for an unaided human to process gigabytes of records, although dealing with images is comparatively easy for humans because we are able to recognize patterns, grasp basic trends in data, and formulate rational decisions. The stored information becomes less useful if it is not available in an easily comprehensible format. Visualization techniques will play an increasing role in this setting, since images are the easiest for humans to comprehend, and they can provide a great deal of information in a single snapshot of the results.

1.2. Importance of physician's interpretation

The physician's interpretation of images, signals, or any other clinical data is written in unstructured, free-text English that is very difficult to standardize and thus difficult to mine. Even specialists from the same discipline cannot agree on unambiguous terms to be used in describing a patient's condition. Not only do they use different names (synonyms) to describe the same disease, but they render the task even more daunting by using different grammatical constructions to describe relationships among medical entities.

It has been suggested that computer translation may hold part of the solution for processing the physician's interpretation (Manning and Schuetze, 2000; Ceusters, 2000). Principles of computer translation may be summarized as follows (Nagao, 1992). Machine translation is typically composed of the following three steps: analysis of a source language sentence; transfer from one language to another; and generation of a target language sentence. Natural language can be regarded as a huge set
of exceptional expressions; as many expressions as possible must be collected in the dictionary. It is an endless job. One of the difficulties of translation is that the translation of an input sentence is not unique (see 1.5. Canonical form). Current translation systems can analyze and translate sentences composed of fewer than ten words. One reason for failure on longer sentences is ambiguity. Even a human cannot understand the meaning of a long sentence at the first reading. Grammatical rules in machine translation can be