2017-03-14

Extensible Code Lists: an RDF Solution

We are all familiar with code lists, or value sets as they are also called: the permissible values for a variable. For example, Race, as described by the U.S. Office of Management and Budget (OMB), has five permissible values: White; Black or African American; Asian; American Indian or Alaska Native; and Native Hawaiian or Other Pacific Islander.

Some standard code lists are incomplete; that is, they don't capture the universe of possible values for a variable. These code lists are called extensible: the sponsor may create custom terms and add them to the code list. Managing these custom terms is a challenge. Here is an idea we are proposing for the PhUSE SDTM Data in RDF project, which we are kicking off next week at the PhUSE Computational Science Symposium. RDF has a unique advantage over other solutions in that it is designed to work with data distributed across the web, so it can be used to integrate multiple dictionaries from multiple sources. Here's one way it can work.

First one creates a study terminology ontology containing all the standard terminology concepts needed for clinical trials. It looks something like this:
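A minimal sketch in Turtle gives the flavor (all namespaces and class names here are illustrative assumptions, not the project's actual ontology):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix code: <http://example.org/study-terminology#> .   # hypothetical namespace

# Top-level concept classes for standard study terminology
code:PopulationFlag rdf:type owl:Class .
code:VitalSign      rdf:type owl:Class .

# Standard terms are individuals of these classes
code:EFF rdf:type code:PopulationFlag ; rdfs:label "Efficacy Population Flag" .
code:SAF rdf:type code:PopulationFlag ; rdfs:label "Safety Population Flag" .
code:ITT rdf:type code:PopulationFlag ; rdfs:label "Intent to Treat Population Flag" .

# A vital signs term can link out to SDTM terminology published in RDF
code:DiastolicBloodPressure rdf:type code:VitalSign ;
    rdfs:seeAlso <http://example.org/cdisc-terminology#DIABP> .   # hypothetical IRI
```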


One can see how to leverage other terminologies. For example, the Vital Signs class links to SDTM terminology expressed in RDF. In this case the resources shown here are for Diastolic Blood Pressure.

Now you create a second ontology for custom terms, which looks very similar to the first one:


In this example, the sponsor defined three custom flags for subjects who completed 8, 16, and 24 weeks of treatment, respectively. These are entered as individuals of a custom:PopulationFlag class. Next, one imports the standard terminology ontology and uses the rdfs:subClassOf property to declare that the custom terms are sub-classes of the standard concepts. So now it looks like this:
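A sketch of the combined picture in Turtle (namespaces are illustrative assumptions):

```turtle
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix code:   <http://example.org/study-terminology#> .   # hypothetical namespace
@prefix custom: <http://example.org/custom-terminology#> .  # hypothetical namespace

# The custom ontology imports the standard one...
<http://example.org/custom-terminology>
    rdf:type owl:Ontology ;
    owl:imports <http://example.org/study-terminology> .

# ...and declares its class a subclass of the standard concept
custom:PopulationFlag rdfs:subClassOf code:PopulationFlag .

# The three sponsor-defined completion flags
custom:Completed8Weeks  rdf:type custom:PopulationFlag .
custom:Completed16Weeks rdf:type custom:PopulationFlag .
custom:Completed24Weeks rdf:type custom:PopulationFlag .
```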


Looking at the code:PopulationFlag example, there are three standard population flags specified: Efficacy (EFF), Safety (SAF), and Intent to Treat (ITT). In addition, there are the three custom flags previously described.

The nice thing about this approach is that the custom terms exist independently from the standard terms and can easily be removed or ignored for the next study, yet they are linked to the standard terms so that tools treat them the same. A SPARQL query looking for all members of the code:PopulationFlag class will return six individuals. For the next study, one can create a different set of custom terms. The "web" of study terminologies begins to look like the figure below. One can imagine a diverse library of controlled terms, all available for implementation almost literally at one's fingertips.
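Such a query might be sketched like this (namespaces are illustrative assumptions); with RDFS subclass inference enabled, individuals typed with the custom subclass are also returned as members of the standard class:

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX code: <http://example.org/study-terminology#>   # hypothetical namespace

SELECT ?flag
WHERE {
  ?flag rdf:type code:PopulationFlag .
}
# With subclass inference: EFF, SAF, ITT plus the three custom
# completion flags -- six individuals in all.
```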

One can link to other terminologies in the same way. Ideally, all the standard ontologies exist on the web and one merely links to them, thereby taking advantage of Linked Data principles.

I appreciate your comments. 

2017-03-03

Temporal Concepts in Clinical Trials

As the saying goes, "timing is everything." This is no less true in clinical trials because knowing when activities occurred or how long they last often holds the key to proper interpretation of the data. Documenting temporal elements for activities in clinical trials is therefore crucial.

In the RDF world, we can leverage the work of others who have thought about this issue in great detail. It turns out that the World Wide Web Consortium (www.w3.org) has developed a time ontology in OWL for anyone to use. It's rather simple and elegant and has useful applications for our Study Ontology. Linking our study ontology with the W3C time ontology is a nice example of the benefits of Linked Data. The ontology goes like this....

A temporal entity can be either a time:Instant (a single point in time) or a time:Interval (a span with duration). Intervals have properties like time:hasBeginning and time:hasEnd. The two are not totally disjoint, because one can consider an Instant as an Interval whose beginning and end are the same, but this is a minor point.

For many Activities, such as a blood test or a vital signs measurement, all we really care about is the date/time it occurred: for all practical purposes, a time:Instant. Some activities do have a duration worth knowing about, so one can attach a time:Interval to them. The nice thing about an Interval is that it links the beginning instant to the end instant ... they go together, and the time:Interval resource is what holds them together.
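For example, a point-in-time measurement might be modeled with a time:Instant like this (the study: namespace and the linking property are illustrative assumptions; time:inXSDDateTime comes from the W3C time ontology):

```turtle
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix time:  <http://www.w3.org/2006/time#> .
@prefix study: <http://example.org/study#> .   # hypothetical namespace

study:VitalSignsMeasurement1
    study:occursAt study:Instant1 .   # hypothetical property

study:Instant1 rdf:type time:Instant ;
    time:inXSDDateTime "2017-03-01T09:30:00"^^xsd:dateTime .
```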

So let's look at some examples taken from the SDTM of some important Intervals and how they might look in the RDF when we link to the w3c time ontology. As always, I use Turtle syntax as it's very human-readable:

study:ReferenceStudyInterval rdf:type time:Interval;
     time:hasBeginning sdtm:RFSTDTC ;
     time:hasEnd sdtm:RFENDTC .

study:ReferenceExposureInterval rdf:type time:Interval;
     time:hasBeginning sdtm:RFXSTDTC ;
     time:hasEnd sdtm:RFXENDTC .

and another important one:

study:Lifespan rdf:type time:Interval;
     time:hasBeginning sdtm:BRTHDTC ;
     time:hasEnd sdtm:DTHDTC .

Now here is where it gets fun. Let's say you want to derive RFXSTDTC and RFXENDTC (first and last day of exposure). Imagine your database has various time:Interval triples for each subject, each describing a fixed dose interval. Imagine in this example, Person1 participates in 3 fixed dosing intervals, as shown in the RDF as follows:

study:Person1 study:participatesIn study:DrugAdministration1, study:DrugAdministration2,
          study:DrugAdministration3.

Each administration is associated with an interval: Interval1, Interval2, Interval3, each of which has a time:hasBeginning and a time:hasEnd date. One can write a SPARQL query that pulls out the minimum (earliest) time:hasBeginning date and the maximum (latest) time:hasEnd date across all the drug administration intervals, thereby deriving the two SDTM dates of interest automatically. The same can be done for RFPENDTC (reference participation end date). I can't tell you how often this date is wrong in actual study data submissions. A SPARQL query can identify all dates for all study activities associated with a Subject and pick out the maximum date, which is the RFPENDTC. Best of all, these standard queries can exist as resources on the web using SPIN (SPARQL Inferencing Notation) for anyone to use.
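A sketch of such a query (the study: namespace and the interval-linking property are illustrative assumptions; MIN and MAX are standard SPARQL 1.1 aggregates):

```sparql
PREFIX time:  <http://www.w3.org/2006/time#>
PREFIX study: <http://example.org/study#>    # hypothetical namespace

SELECT ?person (MIN(?start) AS ?rfxstdtc) (MAX(?end) AS ?rfxendtc)
WHERE {
  ?person   study:participatesIn ?admin .
  ?admin    study:hasInterval    ?interval .   # hypothetical linking property
  ?interval time:hasBeginning    ?beginInstant ;
            time:hasEnd          ?endInstant .
  ?beginInstant time:inXSDDateTime ?start .
  ?endInstant   time:inXSDDateTime ?end .
}
GROUP BY ?person
```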

But first, you need study data in the RDF and a Study Ontology.


2017-03-01

What's in a Name?

Standardizing clinical trial data is all about automation. Standard data enable automated processes that bring efficiency and reduce human error. But automating a process, for example an analysis of a lab test across multiple subjects in a trial, requires computers and information systems to identify that lab test unambiguously. This is called computable semantic interoperability (CSI). The key word is "computable." It's not enough that a human can identify the lab test of interest; computers need to do the same. I previously wrote about the interoperability problem and I revisit it here today, focusing on test names.

There are two situations that impede CSI: [1] when the same Thing goes by two different names, or, even more troublesome, [2] when two different Things go by the same name. When I say Mustang, do I mean the car or the horse? Some describe the term Mustang as "overloaded" because it can represent more than one Thing. Issue #1 is addressed by controlled terminology: synonyms can be mapped to a controlled term that all agree to use. Issue #2 is more challenging, but it is avoidable by assigning different names to different things. I consider this a best practice to promote CSI.

As an example, let's look at the CDISC controlled term "glucose" (code C105585). The definition is "a measurement of the glucose in a biological specimen." The reality is that a serum glucose and a urine glucose are two completely different tests, with different clinical meaning and interpretation. I have been advocating for more granular lab test names for a long time so that computers can easily distinguish different tests. The counter-argument is that serum glucose is really two concepts: the specimen and the "thing" being measured (known as the component, or analyte, in LOINC), and therefore should be represented as two different variables. In fact, the SDTM does have a separate field for specimen information (LBSPEC), and don't get me wrong, there is value in separate specimen information, but that doesn't diminish the need for different test names. The problem is, one has to tell or program a computer: "if test=glucose, look at the specimen information to pick out the correct glucose test." But what about another observation, say "Occurrence Indicator" (an FATEST as described in the Malaria Therapeutic Area User's Guide)? One must know to look at another field (FAOBJ) to understand that the occurrence is a fever or a chill. Where to look for that additional data is not always obvious and varies by test. In the Malaria example, we have two different occurrences, and each should have its own name: Fever Indicator, Chills Indicator.
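The burden can be sketched with SDTM-style data in RDF (all predicate names here are illustrative assumptions):

```sparql
PREFIX sdtm: <http://example.org/sdtm#>   # hypothetical namespace

# Overloaded test name: this unintentionally pools serum and urine glucose.
SELECT ?subject ?result
WHERE {
  ?obs sdtm:lbtestcd "GLUC" ;
       sdtm:usubjid  ?subject ;
       sdtm:lbstresn ?result .
}

# Every consumer must remember the extra specimen constraint:
#   ?obs sdtm:lbspec "SERUM" .
# Distinct names (e.g. "Serum Glucose", "Urine Glucose") would make the
# query unambiguous on its own.
```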

There are two problems with relying on other data fields to disambiguate an overloaded concept: [1] keeping track of which field disambiguates which test is onerous, and [2] new lab tests are being added all the time. (By the way, LOINC avoids this problem by assigning different codes to different tests and providing separate data fields for the component (analyte), specimen, method, etc.)

This problem became clear to me when a colleague at FDA, using an automated analysis tool to analyze serum glucose levels among thousands of patients, was getting funny results. After quite some digging, she realized the tool was pooling serum and urine glucoses. She and I knew to look at LBSPEC; the tool, however, wasn't smart enough to do so. I wonder how many other analyses of other tests have this problem and go unrecognized.

So, in the interest of promoting true computable semantic interoperability without burdening data recipients with unnecessary algorithms to disambiguate overloaded terms, please remember to name different things differently. It can be that simple.