Ontologies in Biomedicine:
The Good, the Bad, and the Ugly
Compiled for internal use
Very First Draft. Comments, Corrections and Extensions Welcome
(to: phismith@buffalo.edu)
1. Caveats:
1. Not everything on this list is described by its authors as an ontology.
2. The list has been prepared for illustrative purposes, as a
preliminary guide to the sorts of pitfalls that we face in the building of
ontologies. Its goal is to draw attention primarily to what is wrong with
ontologies. Thus it should be used in conjunction with lists of ontologies in
biomedicine such as those prepared at:
http://www.cs.man.ac.uk/~stevensr/ontology.html
http://anil.cchmc.org/Bio-Ontologies.html
http://lsdis.cs.uga.edu/~cthomas/bio_ontologies.html
and of course with the OBO (Open Biomedical Ontologies Consortium)
ontology library: http://obo.sourceforge.net
2. First Draft List of Criteria to be satisfied by Good Ontologies
(NB think of these as rules of thumb, or goals to keep constantly in
mind – the world is too messy to support them all simultaneously)
a. Each ontology should have as its backbone a taxonomy based on the is_a
relation (for ‘is a subtype of’). This should be as far as possible a true
hierarchy (single inheritance).
b. The taxonomy should have one root, with a suite of high-level children of
the root of a sort which yield a top-down view of the structure of the whole
ontology. (One does not have this e.g. in SNOMED, or in the Cell Ontology.)
c. The expressions corresponding to the constituent nodes of the taxonomy and
to its relations (is_a, part_of, etc.) should be explicitly defined in both
human-readable and computable formats. The latter should be formalized
versions of the former. Such definitions should then provide the rationale for
establishing the class subsumption inheritance hierarchy.
d. There should be clear rules governing how definitions are formulated.
e. The ontology should distinguish between the types (classes, universals)
represented by this taxonomy, and the tokens (individuals, particulars,
instances) instantiated by these types on the side of reality.
f. The relationships should be used consistently to ensure valid inferences
both within and between ontologies. One should be able to reliably query on
instance data, computationally.
g. Classification systems that have existed for centuries have been human
interpretable, but never computable. So, being able to compute on an ontology
is important.
h. An ontology should accommodate change in knowledge. It should have clear
procedures for adding new terms, and clear procedures for correcting erroneous
entries. All prior versions should be easily accessible.
i. There should be clear rules governing how to select terms and how to
resolve problems in case of difficult terms.
j. The different types of problem cases in the treatment of terms, relations
and definitions should be carefully documented, and best practices for the
resolution of these problems tested and promulgated.
k. The scope of an ontology should be clearly specified, both in terms of the
domain of instances over which it applies and in terms of the types of
relations in that domain (and thus to the pertinent type of scientific
inquiry). The family of terms in a given ontology should then have a natural
unity, which should also be reflected in the name of the ontology. This
criterion is not satisfied e.g. by the various so-called 'tissue' ontologies
discussed in the last section below. Indeed the family of tissue terms is one
important example of a problem area – reflecting the fact that the term
'tissue' is ambiguous as between KIND of tissue and PORTION of tissue. (A
similar ambiguity applies e.g. to ‘substance’.)
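Criteria (a) and (b) are the easiest of the above to check mechanically. The
following is a minimal sketch only, assuming the taxonomy is available as a
list of (child, parent) is_a pairs; the toy terms and the function name
check_backbone are invented for illustration and are not taken from any of
the ontologies discussed here.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: (child, parent) pairs for the is_a relation.
IS_A = [
    ("cell", "anatomical structure"),
    ("organ", "anatomical structure"),
    ("neuron", "cell"),
    ("secretory cell", "cell"),
    ("secretory neuron", "neuron"),
    ("secretory neuron", "secretory cell"),  # second parent: multiple inheritance
]


def check_backbone(is_a_pairs):
    """Return (roots, multi_parent) for a list of (child, parent) pairs."""
    parents = defaultdict(set)
    nodes = set()
    for child, parent in is_a_pairs:
        parents[child].add(parent)
        nodes.update((child, parent))
    # Criterion (b): roots are nodes with no is_a parent; ideally exactly one.
    roots = sorted(n for n in nodes if n not in parents)
    # Criterion (a): nodes with more than one parent break single inheritance.
    multi_parent = {n: sorted(p) for n, p in parents.items() if len(p) > 1}
    return roots, multi_parent


roots, multi_parent = check_backbone(IS_A)
print("roots:", roots)
print("nodes with several parents:", multi_parent)
```

A report of more than one root, or of any node with several parents, does not
by itself show that the ontology is wrong (these are rules of thumb), but it
marks the places where the backbone departs from a true hierarchy.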
3. The Rankings
3.1 Very Good
The Foundational Model of Anatomy (FMA): http://sig.biostr.washington.edu/projects/fm/
Very clear statement of scope (structural human anatomy, at all levels
of granularity, from the whole organism to the biological macromolecule); very
powerful treatment of definitions (from which the entire FMA hierarchy is
generated); very quick turn-around time for correction of errors; very few
unfortunate artifacts in the ontology deriving from its specific computer
representation (Protégé).
3.2 The Good
GALEN
Motivation: to find ways of storing detailed clinical information in a
computer system so that both (1) clinicians are able to store and review
information at a level of detail relevant to them and (2) computers can
manipulate what is stored, for retrieval, abstraction, display, comparison.
Very powerful (Description Logic-based) formal structure, thus tight
organization and careful treatment of terms; unfortunately remains only
partially developed after some years of lying fallow. Now in some respects
outdated.
3.3 The Intermediate (= still need many modifications)
Gene Ontology
Open source; very useful; poor treatment of the relations between the
entities covered by its three separate ontologies
Reactome http://www.reactome.org/
A rich knowledgebase of biological processes, but with incoherent
treatment of top-level categories. Thus ReferentEntity (embracing e.g. small
molecules) is treated as a sibling of PhysicalEntity (embracing complexes,
molecules, ions and particles). Similarly CatalystActivity is treated as a
sibling of Event.
SNOMED http://www.snomed.org/
Swissprot http://us.expasy.org/sprot/
Protein knowledgebase
Sequence Ontology http://song.sourceforge.net/
Cell Ontology http://www.xspan.org/obo/
Zebrafish Anatomy and Development Ontology http://obo.sourceforge.net/cgi-bin/detail.cgi?zfishanat
NANDA International Taxonomy http://www.nanda.org/html/taxonomy.html
A conceptual system that guides the classification of nursing diagnoses
in a taxonomy.
ICNP International Classification for Nursing Practice http://www.icn.ch/icnp.htm
A combinatorial terminology for nursing
practice that facilitates crossmapping of local terms and existing vocabularies
and classifications
National Cancer Institute Thesaurus
http://www.mindswap.org/2003/CancerOntology/
Top-level structure recognizes the existence of three (disjoint)
classes of cells: cells, normal cells, abnormal cells. Recognizes three
(disjoint) classes of plants: vascular plants, non-vascular plants, other
plants. Inherits many of the problematic features from other terminologies in
the UMLS.
UMLS
http://www.nlm.nih.gov/research/umls/
ICD-10
http://www.icd10.ch/index.asp?lang=EN
3.4 The Bad
UMLS Semantic Network http://semanticnetwork.nlm.nih.gov/
Recognizes only one subtype of plant – algae (which are not plants)
Treats the digestive system as a conceptual part of
the organism
Clinical Terms Version 3 (The Read Codes):
http://www.nhsia.nhs.uk/terms/pages/publications/v3refman/chap2.pdf
(Early?) versions classify chemicals into: chemicals
whose name begins with ‘A’, chemicals whose name begins with ‘B’, chemicals
whose name begins with ‘C’, ...
Incorporated into SNOMED-CT
LOINC Logical Observation Identifiers Names and Codes: http://www.regenstrief.org/loinc
Goal: to facilitate the exchange and pooling of
results, such as blood hemoglobin, serum potassium, or vital signs, for
clinical care, outcomes management, and research
Problem: reveals its origins in the punchcard era;
typical string:
12189-7 | CREATINE KINASE.MB/CREATINE KINASE.TOTAL | CFR | PT | SER/PLAS | QN | CALCULATION
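To see what such a string encodes, it helps to split it into the code plus
the six axes of a LOINC name (component, property, time aspect, system,
scale, method). The following is a minimal sketch, assuming the
pipe-separated layout of the string quoted above, with the system axis
written SER/PLAS (the LOINC abbreviation for serum or plasma); the class and
function names are invented for illustration and are not part of any LOINC
tooling.

```python
from typing import NamedTuple


class LoincRecord(NamedTuple):
    """One LOINC code with the six axes of its name (illustrative only)."""
    code: str
    component: str
    kind_of_property: str
    time_aspect: str
    system: str
    scale: str
    method: str


def parse_loinc_line(line: str) -> LoincRecord:
    """Split a pipe-separated LOINC string into named fields."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 pipe-separated fields, got {len(fields)}")
    return LoincRecord(*fields)


record = parse_loinc_line(
    "12189-7 | CREATINE KINASE.MB/CREATINE KINASE.TOTAL | CFR | PT "
    "| SER/PLAS | QN | CALCULATION"
)
for name, value in record._asdict().items():
    print(f"{name:>16}: {value}")
```

Nothing in the string itself says which position carries which axis; the
meaning of each field is purely a matter of its position in the row.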
Health Level 7 Reference Information Model (HL7 RIM): http://www.hl7.org/Library/data-model/RIM/modelpage_mem.htm
HL7 is a standard for exchange of information between
clinical information systems (has proved very crumbly as a standard; every
hospital has its own version of HL7); the RIM is designed to overcome this
problem by defining the world of healthcare data (a consensus view of the
entire healthcare universe); one problem with the RIM is that very many
entities in the healthcare universe (e.g. disorders, genes, ribosomes) are
identified as documents; because of the counterintuitive nature of this
identification, RIM documentation is itself highly counterintuitive, and the
RIM community itself is subject to constant fights
Medical Entities Dictionary (MED): http://med.dmi.columbia.edu/
Semantic network style.
MedDRA v. 3: http://www.meddramsso.com/NewWeb2003/medra_overview/
Has hierarchies, but you can’t tell by browsing through
the hierarchies whether different terms represent the same thing or not.
MedDRA v. 3 does not assign unique codes to its terms,
but rather works with unique terms collected from various sources which are left
unchanged for reasons of ‘compatibility’. Some source terminologies, such as
WHO-ART (World Health Organization Adverse Reaction Terminology), had all terms
in uppercase, some not. So a unique term might be “COLD”, but also “cold”, and
“Cold”, and “cOLd”, .... Each unique term in MedDRA v. 3 must be assigned a
single meaning, but MedDRA does this in a haphazard way. Thus the 4-character
string “COLD” might be assigned the meaning ‘common cold’ or ‘cold temperature’
or (as is in fact the case <check>) ‘chronic obstructive lung disease’.
Suppose, now, that a medical doctor in a pharmaceutical company has the task of
coding into MedDRA handwritten reports received from practising physicians
engaged in clinical studies. She must then, according to the coding rules set
up by her department, either code a sentence such as “patient coughing and
sneezing, ... diagnosis: COLD” as referring to chronic obstructive lung disease
(which is obviously wrong), or make a phone call to the physician to ensure
that he in fact meant “cold” and not “COLD”.
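Collisions of this kind can be found mechanically by grouping terms without
regard to case and flagging every group that carries more than one meaning.
The following is a minimal sketch only; the term table is hypothetical, made
up from the examples above, and is not an extract from MedDRA.

```python
from collections import defaultdict

# Hypothetical term table: each case-sensitive string carries one meaning.
TERMS = {
    "COLD": "chronic obstructive lung disease",
    "cold": "common cold",
    "Cold": "cold temperature",
    "Cough": "cough",
    "COUGH": "cough",
}


def ambiguous_spellings(terms):
    """Group terms case-insensitively; keep groups with several meanings."""
    meanings = defaultdict(set)
    for term, meaning in terms.items():
        meanings[term.casefold()].add(meaning)
    return {key: sorted(vals) for key, vals in meanings.items() if len(vals) > 1}


conflicts = ambiguous_spellings(TERMS)
for spelling, meanings in conflicts.items():
    print(f"{spelling!r} is used for: {', '.join(meanings)}")
```

Here “Cough” and “COUGH” are harmless duplicates, since they share a meaning,
while the three spellings of “cold” are flagged because a coder who sees only
the handwritten word cannot tell which entry was intended.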
MEDCIN: http://www.medicomp.com/index_html.htm
Mixes up everything that can be mixed up
International Classification of Primary Care (ICPC): http://www.ulb.ac.be/esp/wicc/icpc2.html
Tries to explain general medicine (family medicine) by
means of about 800 classes.
MeSH
MGED
eVoc
PATO
Mouse Pathology
3.5 The Ugly
ICD-10-PCS: http://www.cms.hhs.gov/paymentsystems/icd9/icd10.asp
Based on good principles but worked out using an ugly
representation
UMLS Semantic Network
Special Mention: Ugly Tissue Ontologies
1. TissueDB (http://tissuedb.ontology.ims.u-tokyo.ac.jp:8082/tissuedb/)
has nothing to do with tissues. What they call tissue is basically all the
structures one can identify histologically.
2. Brenda Tissue Ontology
http://www.brenda.uni-koeln.de/ontology/tissue/tree/update/update_files/BrendaTissue
has nothing to do with tissue. Or rather, here, basically everything is a
tissue. Thus it contains statements like: arm is-a limb
3. Auckland Anatomy Ontology Tissue Class View
http://n2.bioeng5.bioeng.auckland.ac.nz/ontology/anatomy/ontology_class_view?class_uri=http%3A//physiome.bioeng.auckland.ac.nz/anatomy/all%23Tissue
Classifies tissue into: Connective tissue, Epithelial tissue, Glandular tissue,
Muscle tissue, Nervous tissue; but proceeding further down the hierarchy we
find not tissues but organs and organ parts such as SimpleTubularGland,
SimpleAcinarGland, etc. Moreover EndocrineGland is asserted to have two ‘instances’
(we presume they mean subclasses): EndocrineGland (!), and
FollicularEndocrineGland. Among the ‘instances’ of ConnectiveTissue are
listed: Left Humerus, Right Tibia, and so on. So nonsense, here, too.
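A class that appears beneath itself is the kind of error a simple script can
catch. The following is a minimal sketch, assuming the class view has been
exported as a mapping from each class to the entries listed under it; the
mapping below is a hypothetical excerpt built from the examples just given,
and the function name is invented.

```python
# Hypothetical excerpt: class -> entries listed beneath it in the class view.
LISTED_UNDER = {
    "EndocrineGland": ["EndocrineGland", "FollicularEndocrineGland"],
    "ConnectiveTissue": ["Left Humerus", "Right Tibia"],
}


def classes_in_cycles(listed_under):
    """Return the classes that can be reached from themselves."""

    def reachable(start):
        # Depth-first walk over everything listed (directly or not) under start.
        seen, stack = set(), list(listed_under.get(start, []))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(listed_under.get(node, []))
        return seen

    return sorted(c for c in listed_under if c in reachable(c))


print(classes_in_cycles(LISTED_UNDER))
```

The check catches the EndocrineGland case but not the ConnectiveTissue one:
that Left Humerus is an organ rather than a tissue is an error of content,
which no purely structural test will find.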