Ontologies in Biomedicine: The Good, the Bad, and the Ugly

Compiled for internal use

Very First Draft. Comments, Corrections and Extensions Welcome

(to: phismith@buffalo.edu)

1. Caveats:

1. Not everything on this list is described by its authors as an ontology.

2. The list has been prepared for illustrative purposes, as a preliminary guide to the sorts of pitfalls that we face in the building of ontologies. Its goal is to draw attention primarily to what is wrong with ontologies. Thus it should be used in conjunction with lists of ontologies in biomedicine such as those prepared at:

http://www.cs.man.ac.uk/~stevensr/ontology.html

http://anil.cchmc.org/Bio-Ontologies.html

http://lsdis.cs.uga.edu/~cthomas/bio_ontologies.html

and of course with the OBO (Open Biomedical Ontologies Consortium) ontology library: http://obo.sourceforge.net

2. First Draft List of Criteria to be satisfied by Good Ontologies (NB think of these as rules of thumb, or goals to keep constantly in mind – the world is too messy to support them all simultaneously)

a. Each ontology should have as its backbone a taxonomy based on the is_a relation (for ‘is a subtype of’). This should be as far as possible a true hierarchy (single inheritance).

b. The taxonomy should have one root, with a suite of high-level children of the root of a sort which yield a top-down view of the structure of the whole ontology. (One does not have this e.g. in SNOMED, or in the Cell Ontology.)

c. The expressions corresponding to the constituent nodes of the taxonomy and to its relations (is_a, part_of, etc.) should be explicitly defined in both human-readable and computable formats. The latter should be formalized versions of the former. Such definitions should then provide the rationale for establishing the class subsumption inheritance hierarchy.

d. There should be clear rules governing how definitions are formulated.

e. The ontology should distinguish between the types (classes, universals) represented by this taxonomy, and the tokens (individuals, particulars, instances) instantiated by these types on the side of reality.

f. The relationships should be used consistently to ensure valid inferences both within and between ontologies. One should be able to reliably query on instance data, computationally.

g. Classification systems that have existed for centuries have been human interpretable, but never computable. So, being able to compute on an ontology is important.

h. An ontology should accommodate change in knowledge. It should have clear procedures for adding new terms, and clear procedures for correcting erroneous entries. All prior versions should be easily accessible.

i. There should be clear rules governing how to select terms and how to resolve problems in case of difficult terms.

j. The different types of problem cases in the treatment of terms, relations and definitions, should be carefully documented, and best practices for the resolution of these problems tested and promulgated.

k. The scope of an ontology should be clearly specified, both in terms of the domain of instances over which it applies and in terms of the types of relations in that domain (and thus of the pertinent type of scientific inquiry). The family of terms in a given ontology should then have a natural unity, which should also be reflected in the name of the ontology. This criterion is not satisfied e.g. by the various so-called 'tissue' ontologies discussed in the last section below. Indeed the family of tissue terms is one important example of a problem area – reflecting the fact that the term 'tissue' is ambiguous as between KIND of tissue and PORTION of tissue. (A similar ambiguity applies e.g. to ‘substance’.)
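
Several of the criteria above, notably (a), (b) and (e), can be checked mechanically. The following Python sketch is illustrative only: the toy terms, the dictionary layout and the helper names (roots, multiple_parents, ancestors, types_of) are assumptions made for this example and are not taken from any of the ontologies discussed in this list.

```python
# Toy is_a backbone: each type maps to the list of its asserted parents.
# All terms and helper names are hypothetical, chosen only for illustration.
IS_A = {
    "anatomical entity": [],
    "organ": ["anatomical entity"],
    "cell": ["anatomical entity"],
    "heart": ["organ"],
    "neuron": ["cell"],
}

# Tokens (individuals) are stored apart from types, as criterion (e) asks.
INSTANCE_OF = {"this patient's heart": "heart"}


def roots(is_a):
    """Return the types that have no parent; criterion (b) wants exactly one."""
    return sorted(t for t, parents in is_a.items() if not parents)


def multiple_parents(is_a):
    """Return the types with more than one parent; criterion (a) wants none."""
    return sorted(t for t, parents in is_a.items() if len(parents) > 1)


def ancestors(term, is_a):
    """Collect every type reachable from `term` by following is_a upwards."""
    seen = set()
    stack = list(is_a[term])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a[parent])
    return seen


def types_of(token, instance_of, is_a):
    """A token instantiates its asserted type and all of that type's ancestors."""
    asserted = instance_of[token]
    return {asserted} | ancestors(asserted, is_a)
```

Keeping INSTANCE_OF separate from IS_A is the design choice that matters here: a query on instance data then never confuses "is a subtype of" with "is an instance of".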

3. The Rankings

3.1 Very Good

The Foundational Model of Anatomy (FMA): http://sig.biostr.washington.edu/projects/fm/

Very clear statement of scope (structural human anatomy, at all levels of granularity, from the whole organism to the biological macromolecule); very powerful treatment of definitions (from which the entire FMA hierarchy is generated); very quick turn-around time for correction of errors; very few unfortunate artifacts in the ontology deriving from its specific computer representation (Protégé).

3.2 The Good

GALEN

Motivation: to find ways of storing detailed clinical information in a computer system so that both (1) clinicians are able to store and review information at a level of detail relevant to them and (2) computers can manipulate what is stored, for retrieval, abstraction, display, comparison.
Very powerful (Description Logic-based) formal structure, thus tight organization and careful treatment of terms; unfortunately remains only partially developed after some years of lying fallow. Now in some respects outdated.

3.3 The Intermediate (= still need many modifications)

Gene Ontology

Open source; very useful; poor treatment of the relations between the entities covered by its three separate ontologies

Reactome http://www.reactome.org/

A rich knowledgebase of biological processes, but with incoherent treatment of top-level categories. Thus ReferentEntity (embracing e.g. small molecules) is treated as a sibling of PhysicalEntity (embracing complexes, molecules, ions and particles). Similarly CatalystActivity is treated as a sibling of Event.

SNOMED http://www.snomed.org/

Swissprot http://us.expasy.org/sprot/
Protein knowledgebase

Sequence Ontology http://song.sourceforge.net/

Cell Ontology http://www.xspan.org/obo/

Zebrafish Anatomy and Development Ontology http://obo.sourceforge.net/cgi-bin/detail.cgi?zfishanat

NANDA International Taxonomy http://www.nanda.org/html/taxonomy.html

A conceptual system that guides the classification of nursing diagnoses in a taxonomy.

ICNP International Classification for Nursing Practice http://www.icn.ch/icnp.htm
A combinatorial terminology for nursing practice that facilitates crossmapping of local terms and existing vocabularies and classifications

National Cancer Institute Thesaurus

http://www.mindswap.org/2003/CancerOntology/

Top-level structure recognizes the existence of three (disjoint) classes of cells: cells, normal cells, abnormal cells. Recognizes three (disjoint) classes of plants: vascular plants, non-vascular plants, other plants. Inherits many of the problematic features from other terminologies in the UMLS.

UMLS
http://www.nlm.nih.gov/research/umls/

ICD-10
http://www.icd10.ch/index.asp?lang=EN

3.4 The Bad

UMLS Semantic Network http://semanticnetwork.nlm.nih.gov/

Recognizes only one subtype of plant – algae (which are not plants)

Treats the digestive system as a conceptual part of the organism

Clinical Terms Version 3 (The Read Codes):

http://www.nhsia.nhs.uk/terms/pages/publications/v3refman/chap2.pdf

(Early?) versions classify chemicals into: chemicals whose name begins with ‘A’, chemicals whose name begins with ‘B’, chemicals whose name begins with ‘C’, ...

Incorporated into SNOMED-CT

LOINC Logical Observation Identifiers Names and Codes: http://www.regenstrief.org/loinc

Goal: to facilitate the exchange and pooling of results, such as blood hemoglobin, serum potassium, or vital signs, for clinical care, outcomes management, and research

tissue ontologies

Problem: LOINC reveals its origins in the punchcard era; a typical string:

12189-7 | CREATINE KINASE.MB/CREATINE KINASE.TOTAL | CFR | PT | SET/PLAS | QN | CALCULATION
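
To make the structure of such a string explicit, here is a minimal Python sketch that splits it into named fields. It assumes that the seven pipe-delimited fields are the code followed by the six LOINC name axes (component, property, time aspect, system, scale, method); the helper name parse_loinc_line is hypothetical, and the string is copied verbatim from the line above.

```python
# Assumed mapping of positions 2-7 onto the six LOINC name axes.
LOINC_AXES = ("component", "property", "time_aspect", "system", "scale", "method")


def parse_loinc_line(line):
    """Split a pipe-delimited line into a dict keyed by code and axis name."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 1 + len(LOINC_AXES):
        raise ValueError(f"expected {1 + len(LOINC_AXES)} fields, got {len(parts)}")
    code, *axes = parts
    record = {"code": code}
    record.update(zip(LOINC_AXES, axes))
    return record


# The example string, exactly as quoted in the text.
EXAMPLE = ("12189-7 | CREATINE KINASE.MB/CREATINE KINASE.TOTAL | CFR | PT | "
           "SET/PLAS | QN | CALCULATION")
record = parse_loinc_line(EXAMPLE)
```

Nothing in the string itself tells the reader which position carries which axis, which is the point of the complaint: the meaning lives in the column order.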

Health Level 7 Reference Information Model (HL7 RIM): http://www.hl7.org/Library/data-model/RIM/modelpage_mem.htm

HL7 is a standard for the exchange of information between clinical information systems. It has proved very crumbly as a standard: every hospital has its own version of HL7. The RIM is designed to overcome this problem by defining the world of healthcare data (a consensus view of the entire healthcare universe). One problem with the RIM is that very many entities in the healthcare universe (e.g. disorders, genes, ribosomes) are identified as documents. Because of the counterintuitive nature of this identification, RIM documentation is itself highly counterintuitive, and the RIM community itself is subject to constant fights.

Medical Entities Dictionary (MED): http://med.dmi.columbia.edu/

Semantic network style.

MedDRA v. 3: http://www.meddramsso.com/NewWeb2003/medra_overview/

Has hierarchies, but you can’t tell by browsing through the hierarchies whether different terms represent the same thing or not.

MedDRA v. 3 does not assign unique codes to its terms, but rather works with unique terms collected from various sources which are left unchanged for reasons of ‘compatibility’. Some source terminologies, such as WHO-ART (World Health Organization Adverse Reaction Terminology), had all terms in uppercase, some not. So a unique term might be “COLD”, but also “cold”, and “Cold”, and “cOLd”, .... Each unique term in MedDRA v. 3 must be assigned a single meaning, but MedDRA does this in a haphazard way. Thus the 4-character string “COLD” might be assigned the meaning common cold or cold temperature or (as is in fact the case <check>) chronic obstructive lung disease.

Suppose, now, that a medical doctor in a pharmaceutical company has the task of coding into MedDRA handwritten reports received from practising physicians engaged in clinical studies. She must then, according to the coding rules set up by her department, either code a sentence such as “patient coughing and sneezing, ... diagnosis: COLD” as referring to chronic obstructive lung disease (which is obviously wrong), or make a phone call to the physician to ensure that he in fact meant “cold” and not “COLD”.
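
The collision described here can be made concrete with a small Python sketch. Everything in it is hypothetical: the pooled term list, the helper name case_variants, and the meanings attached to each spelling, which simply follow the scenario in the text and are not taken from MedDRA itself.

```python
from collections import defaultdict

# Hypothetical pool of verbatim terms, as if merged from sources with
# different capitalisation conventions. No identifiers are attached.
TERMS = ["COLD", "cold", "Cold", "cOLd", "Fever"]


def case_variants(terms):
    """Group terms that differ only in letter case; keep groups of size > 1."""
    groups = defaultdict(list)
    for term in terms:
        groups[term.casefold()].append(term)
    return {key: group for key, group in groups.items() if len(group) > 1}


# With the string itself as the key, each spelling carries exactly one
# meaning, so capitalisation alone decides the diagnosis that gets coded.
# The meanings below are illustrative, following the scenario in the text.
MEANING_BY_STRING = {
    "COLD": "chronic obstructive lung disease",
    "cold": "common cold",
}
```

A scheme that attached a unique code to each meaning, and treated the spellings as mere labels for that code, would let a tool flag every group returned by case_variants for review instead of silently choosing one reading.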

MEDCIN: http://www.medicomp.com/index_html.htm

Mixes up everything that can be mixed up

International Classification of Primary Care (ICPC): http://www.ulb.ac.be/esp/wicc/icpc2.html

Tries to explain general medicine (family medicine) by means of about 800 classes.

MeSH
MGED
eVoc
PATO
Mouse Pathology

3.5 The Ugly

ICD-10-PCS: http://www.cms.hhs.gov/paymentsystems/icd9/icd10.asp

Based on good principles but worked out using an ugly representation

UMLS Semantic Network

Special Mention: Ugly Tissue Ontologies
1. TissueDB (http://tissuedb.ontology.ims.u-tokyo.ac.jp:8082/tissuedb/)
Has nothing to do with tissues. What they call tissue is basically all the structures one can identify histologically.
2. Brenda Tissue Ontology
http://www.brenda.uni-koeln.de/ontology/tissue/tree/update/update_files/BrendaTissue
Has nothing to do with tissue. Or rather, here, basically everything is a tissue. Thus it contains statements like: arm is-a limb
3. Auckland Anatomy Ontology Tissue Class View
http://n2.bioeng5.bioeng.auckland.ac.nz/ontology/anatomy/ontology_class_view?class_uri=http%3A//physiome.bioeng.auckland.ac.nz/anatomy/all%23Tissue
Classifies tissue into: Connective tissue, Epithelial tissue, Glandular tissue, Muscle tissue, Nervous tissue; but proceeding further down the hierarchy we find not tissues but organs and organ parts such as SimpleTubularGland, SimpleAcinarGland, etc. Moreover EndocrineGland is asserted to have two ‘instances’ (we presume they mean subclasses): EndocrineGland (!), and FollicularEndocrineGland. Among the ‘instances’ of ConnectiveTissue are listed: Left Humerus, Right Tibia, and so on. So nonsense, here, too.
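
Errors of the EndocrineGland kind, where a class is listed beneath itself, are easy to detect automatically. The Python sketch below is a minimal illustration under stated assumptions: the dictionary is a hand transcription of the example just described, not an export from the ontology, and the helper names in_cycle and cyclic_terms are hypothetical.

```python
# Parent -> listed children, transcribed by hand from the example above.
LISTED_UNDER = {
    "EndocrineGland": ["EndocrineGland", "FollicularEndocrineGland"],
    "FollicularEndocrineGland": [],
}


def in_cycle(term, listed_under):
    """Return True if `term` is reachable again by following listed children."""
    seen = set()
    stack = list(listed_under.get(term, []))
    while stack:
        child = stack.pop()
        if child == term:
            return True
        if child not in seen:
            seen.add(child)
            stack.extend(listed_under.get(child, []))
    return False


def cyclic_terms(listed_under):
    """Return, in sorted order, every term that sits beneath itself."""
    return sorted(t for t in listed_under if in_cycle(t, listed_under))
```

Run against LISTED_UNDER, cyclic_terms reports EndocrineGland; the same check also catches longer loops in which a class reappears several levels below itself.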