Semantic Database Part 1

Preface

“Complexity, do not walk away from it... Run!” Venkat Subramaniam

We all sit on the shoulders of giants. Remember this giant from the past: Edgar Codd. He made it possible to use slow computers with small hard disks efficiently. He did this by normalizing databases. He connected data, split them, and rearranged them in such a way that they did not lose integrity but needed as little space as possible.

Complexity

It was a technical solution for a technical problem. See the picture for what the result looked like. It was completely stripped of semantics. To understand the data, one had to study the structures, including the software code. This complexity was often the cause of problems.

People have died from the complexity of software. Airplanes have crashed, and hospitals have seen medication errors and amputations of the wrong leg. Add to that misunderstandings, inaccessible data, and bugs. It is not a small problem. Terrible things can happen if a part of the software is misunderstood.

Semantics

A company has a table for parts with an attribute for pressure. Everything is in mmHg, or was it hPa, or atm, or psi? The whole table can become worthless if the source code gets lost.

Imagine you are hired as the new programmer and you have to maintain this table. How do you know what is in it? Maybe you hope there is documentation? Maybe you hope the software around it is well maintained? Maybe you hope other programmers understood the documentation in the same way you do?

Change

There is one constant in everything, and that is change. Everything always changes, and software changes with it. No code stands like a house: the lifetime of code is short, but the lifetime of data is long. So software must be able to change smoothly, with respect for the data, which are there to stay.

In a Codd database change is hard: the data are tied to the software code, and the software code changes but the data do not. This makes change dangerous and expensive.

The Semantic Database

Thomas Beale invented OpenEhr in the late nineties of the last century. OpenEhr is a specification for a semantic health database. It took another ten years before computers were capable of running software that could implement these ideas. The idea is that software must relate to the semantics in all details.

In OpenEhr you can never confuse centimetres with inches, because the unit in use is described in the archetype, and archetypes are the documentation of ALL data. No data will ever enter the system before they are defined in an archetype and validated against it!

In OpenEhr change is easier. Old data remain accessible, understandable and queryable, and new data can be added according to changed definitions and changed requirements without conflicting with the old data. Change becomes smooth and easy.

In OpenEhr semantic compositions remain grouped together the way they were meant to be grouped at the moment of storage. No tricks, no indexes, no lengthy, complicated SQL necessary.

These are all advantages that are impossible, or very hard, to achieve in other systems. For these reasons I was one of the early followers of OpenEhr. I learned about it in 2003. In 2008 I wrote a partial implementation, and I rewrote it a few times in the years after that, every time a little better and more complete.

Now I think that writing an OpenEhr kernel is not so difficult anymore, with the optional support of NoSQL, cloud technology, good grammar interpreters, data-exchange standards, API concepts, and modularity in microservices.

And now, thank you, Thomas, for letting me sit on your shoulders, I come to the next idea, which is what this blog is about: explaining the Semantic Database. OpenEhr is a good example to start with.

OpenEhr, a description of the parts

So for the moment we can focus on the overall architecture. As an example we see an overview of the OpenEhr system. I will discuss the parts.

First the Reference Model. It is important to understand the different parts, and how domain-knowledge is projected.

Reference Model

The Reference Model, abbreviated RM, determines the purpose of the system. Thomas Beale designed the different parts of the RM. Several are usable in every domain, and only one part is domain specific.

Datastructures/Datavalues Part

This part defines the data structures and the datavalue classes. Datavalue classes represent extended primitives and the coded-text classes. The data structures are List, Table and Tree. All data structures and datavalues are semantically oriented, not technically oriented. So you will not find a hashtable, a string, an integer or a double in this part of the RM, but semantic replacements for them. This part is not domain specific and can be used in the generic Semantic Database concept.
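To make the idea of "semantic replacements" for primitives a bit more concrete, here is a minimal sketch in Python. The class names (Quantity, CodedText, Element, ItemTree) and the code "at1001" are my own illustrations and not the OpenEhr class definitions; the point is only that a value never travels without its unit or its terminology.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Quantity:
    """A measured value that always carries its unit (illustrative name)."""
    magnitude: float
    units: str


@dataclass(frozen=True)
class CodedText:
    """Text bound to a code in a named terminology (illustrative name)."""
    value: str
    terminology: str
    code: str


@dataclass
class Element:
    """A named leaf holding one semantic datavalue."""
    name: str
    value: object


@dataclass
class ItemTree:
    """A tree-shaped data structure holding named items."""
    name: str
    items: list = field(default_factory=list)

    def find(self, name):
        """Return the first direct child with the given name, or None."""
        return next((i for i in self.items if i.name == name), None)


pressure = ItemTree("blood pressure", [
    Element("systolic", Quantity(120.0, "mm[Hg]")),
    # "local" and "at1001" are made-up terminology values for this sketch
    Element("position", CodedText("Sitting", "local", "at1001")),
])

systolic = pressure.find("systolic").value
print(systolic.magnitude, systolic.units)
```

A reader of this structure does not need the source code to know that 120 means mm[Hg]; the unit is part of the value itself.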

Common Part

The Common Part of the RM is also not domain specific, but it has classes which support the archetype concept, which I explain in the next part of this blog. So the Common Part will also be part of the Semantic Database concept.

Support Part

This part has supporting classes for terminology, measurement units and id generation. Most applications will need some of them.

Demographic Part

This is the optional third part. It contains all the classes you need to describe people, organizations, addresses, relations, etc.

Domain Part

This is the part that is domain-based. In OpenEhr it consists of clinical classes, but in an open way. Unlike most clinical data models, the OpenEhr domain part does not describe detailed specifications. You will not find a class for blood pressure, for a specific treatment or for any other clinical item. It is focussed on process instead of situation.

The OBSERVATION (which can represent blood pressure, body temperature, heartbeat, laboratory data, radiology) is very important. Clinicians observe a lot. Like all other domain classes of OpenEhr, the OBSERVATION is very open. It only has two attributes, data and state, both of type DATASTRUCTURE, which is the parent of all possible structures. So it is very flexible. This is an important lesson for when one designs the Domain Part of a Semantic Database. Some other classes in the OpenEhr domain part are EVALUATION, INSTRUCTION and ACTIVITY. A special class in the OpenEhr domain is COMPOSITION. It is used to group various clinical classes together. For example, the clinical classes belonging to a single encounter can be grouped together into a COMPOSITION.

For example, a Domain Part for accountancy could have SALES, PURCHASE and BANKTRANSACTION as primary classes. Modelling the Domain Part must be done very carefully. It will be the central part of the system, and it will remain so for many years. Classes must be as generic as possible, but still have a base idea for which they are designed.

When you build an application to store animals, maybe you have classes for BIRD, MAMMAL, FISH etc. But where do you put the platypus? Have a special class for that? I don’t think so. Maybe have a class for Exceptional Animals. So how about an animal-application RM based on continent, or on environment? This is the most important task when starting an application using the Semantic Database. It takes weeks or months to think it over and over, rethink it, sleep on it. Afterwards it should look as if it was designed in a few hours. As simple as possible, but not simpler.
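How open such a domain class is can be sketched in a few lines of Python. This is a simplification under my own assumptions: DataStructure is reduced to a dictionary, and the field contents are invented examples. It only shows that one generic Observation class serves very different measurements, and that a Composition groups them.

```python
from dataclasses import dataclass, field


@dataclass
class DataStructure:
    """Stand-in for the parent of all structures (List, Table, Tree)."""
    items: dict = field(default_factory=dict)


@dataclass
class Observation:
    """Generic domain class with only 'data' and 'state'."""
    data: DataStructure
    state: DataStructure


@dataclass
class Composition:
    """Groups the entries that belong together, e.g. one encounter."""
    name: str
    content: list = field(default_factory=list)


# Two very different measurements use the same generic class.
blood_pressure = Observation(
    data=DataStructure({"systolic": (120, "mm[Hg]"), "diastolic": (80, "mm[Hg]")}),
    state=DataStructure({"position": "sitting"}),
)
temperature = Observation(
    data=DataStructure({"temperature": (37.2, "Cel")}),
    state=DataStructure({"exposure": "normal clothing"}),
)

encounter = Composition("encounter", [blood_pressure, temperature])
print(len(encounter.content), "observations in one composition")
```

Nothing in the class itself says "blood pressure"; that meaning has to come from somewhere else, which is where the archetypes come in.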

Archetype Model

The archetypes are the entities that give the generic classes a specific meaning. They (kind of) specialize an OBSERVATION into a blood-pressure measurement class. The archetypes follow the RM: all attributes which are defined in the RM are allowed in the archetype, and no others. Archetypes make attributes obligatory or optional, they give cardinality to lists, and they put constraints on datavalues.

The archetypes do that using the Archetype Object Model. This is a class model which can, independently of the RM, validate archetypes. The Archetype Object Model, AOM, is also able to validate an archetyped dataset against an archetype.
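What validating a dataset against an archetype could look like is sketched below. This is not the AOM and not ADL; the archetype is a plain Python dictionary of my own invention, with made-up ranges, and cardinality of lists is left out. It only illustrates the three kinds of rules mentioned above: obligatory versus optional attributes, no attributes outside the definition, and constraints on the values.

```python
# Hypothetical archetype: a plain dict of constraints per attribute.
BLOOD_PRESSURE_ARCHETYPE = {
    "systolic": {"required": True, "units": "mm[Hg]", "range": (0, 1000)},
    "diastolic": {"required": True, "units": "mm[Hg]", "range": (0, 1000)},
    "comment": {"required": False},
}


def validate(dataset, archetype):
    """Return a list of violations; an empty list means the dataset is valid."""
    errors = []
    # Attributes that the archetype does not define are not allowed.
    for name in dataset:
        if name not in archetype:
            errors.append(f"{name}: not defined in the archetype")
    for name, rule in archetype.items():
        if name not in dataset:
            if rule["required"]:
                errors.append(f"{name}: obligatory but missing")
            continue
        if "units" in rule:
            magnitude, units = dataset[name]
            if units != rule["units"]:
                errors.append(f"{name}: expected {rule['units']}, got {units}")
            low, high = rule["range"]
            if not low <= magnitude <= high:
                errors.append(f"{name}: {magnitude} outside {low}..{high}")
    return errors


good = {"systolic": (120, "mm[Hg]"), "diastolic": (80, "mm[Hg]")}
bad = {"systolic": (16, "kPa"), "pulse": (70, "/min")}

print(validate(good, BLOOD_PRESSURE_ARCHETYPE))
print(validate(bad, BLOOD_PRESSURE_ARCHETYPE))
```

The second dataset is rejected for three reasons: an attribute the archetype does not know, the wrong unit, and a missing obligatory attribute. The pressure-unit confusion from the beginning of this blog cannot slip in unnoticed.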

What is the use of this? Why would we do this?

Most programmers with 10 years of experience know the horror stories of applications which became unmaintainable, stranded in complexity. Companies started a new one, were able to transfer some of the data, and lost millions of dollars. Mostly they put it differently. They always find a reason to spend millions again and lose data. They never call it a failure. Now there is a new reason to do it over again: the Semantic Database. Maybe there will be new applications afterwards, but they will never have to lose or restructure data.

Documentation

The weak point in software is often documentation. A non-semantic application or database does not describe its data by itself; documentation is not an implicit mechanism. Some companies dump data into NoSQL databases and rely on the software to understand them. They make the same error as those who used Codd databases.

Despite good intentions and worried managers, documentation is not always up to date. Documentation goes its own way, which is not necessarily the same way as the code goes. Bad documentation is even worse than no documentation at all. Working code is the primary goal; documentation is not always formalized, and not everything there is to know is in it.

I once found this comment: “When I wrote this, only God and I understood what I was doing. Now, God only knows, I found another job.”

Check here for more fun on documentation: https://wiki.c2.com/?FunnyThingsSeenInSourceCodeAndDocumentation

What Uncle Bob told us about documentation is true. Good software documents itself. The semantic database does. Documentation is one of its pillars, and it is implicit because of the archetype system. See many examples of clinical archetypes: https://www.openehr.org/ckm/

The archetypes in this clinical repository are well built and are reviewed by many domain experts, not by technicians. They are translated, some even into Dutch. These archetypes are used to validate, store and retrieve data, and no technician is part of this process.

Querying

Now that we have this semantic datastore, with every dataset documented as part of the process, we can be sure the documentation is up to date and will remain up to date.

Now we want to retrieve data. For this purpose there is AQL, the Archetype Query Language. It is somewhat similar to SQL combined with XPath. Instead of tables and fields, it has archetype paths. We only need SELECT statements. We never change data, we never create data, we never delete data. We just overwrite them by posting a new dataset, or we mark them as deleted. If an existing dataset is overwritten, the replaced dataset is moved to an archive. Nothing ever gets lost in a semantic database. An audit service keeps track of all additions and reverts, and of who did them.
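The "nothing ever gets lost" behaviour can be sketched as a small in-memory store. The class and method names below are hypothetical and say nothing about how a real OpenEhr kernel stores versions; the sketch only shows the three rules: posting over an existing dataset moves the old one to an archive, deleting is only a mark, and every action is written to an audit log.

```python
from datetime import datetime, timezone


class VersionedStore:
    """Hypothetical store: overwriting archives, deleting only marks."""

    def __init__(self):
        self.current = {}   # uid -> latest dataset
        self.archive = {}   # uid -> replaced datasets, oldest first
        self.audit = []     # (timestamp, user, action, uid)

    def _log(self, user, action, uid):
        self.audit.append((datetime.now(timezone.utc), user, action, uid))

    def _archive_current(self, uid):
        self.archive.setdefault(uid, []).append(self.current[uid])

    def post(self, uid, dataset, user):
        """Store a dataset; an existing one moves to the archive first."""
        action = "create"
        if uid in self.current:
            self._archive_current(uid)
            action = "overwrite"
        self.current[uid] = dataset
        self._log(user, action, uid)

    def mark_deleted(self, uid, user):
        """Replace the current version by a deletion marker; keep the data."""
        self._archive_current(uid)
        self.current[uid] = {"deleted": True}
        self._log(user, "delete", uid)


store = VersionedStore()
store.post("bp-1", {"systolic": 120}, user="alice")
store.post("bp-1", {"systolic": 125}, user="bob")
store.mark_deleted("bp-1", user="alice")

print(store.archive["bp-1"])
print([(user, action) for _, user, action, _ in store.audit])
```

After the delete, both earlier versions are still in the archive, and the audit log shows who created, overwrote and deleted the dataset.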

So we only have SELECT, and the same semantic paths that are used in the archetypes are used in the queries. AQL is meant to be used by domain experts. There is no need to be a technical expert, and no knowledge of Codd is needed.

But archetypes are still technical entities. Computers must read them, so precision is very necessary. For this purpose, tooling is needed: tooling which hides the technical aspects of archetypes but shows the semantic concepts. This kind of tooling is not very hard to build. The same goes for a query editor. Below is an AQL query.

SELECT
    o/data[at0001]/…/items[at0004]/value AS systolic,
    o/data[at0001]/…/items[at0005]/value AS diastolic
FROM
    EHR[ehr_id=$ehrId]
CONTAINS
    COMPOSITION c [openEHR-EHR-COMPOSITION.encounter.v1]
CONTAINS
    OBSERVATION o [openEHR-EHR-OBSERVATION.blood_pressure.v1]
WHERE
    o/data[at0001]/…/items[at0004]/value/value >= 140 OR
    o/data[at0001]/…/items[at0005]/value/value >= 90

We can recognize the archetypes we query. This query wants all high blood pressures measured during an encounter. Most queries in AQL are as simple as this. This is because there is no Codd structure to implement in the query.
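For readers who do not know AQL, the WHERE clause of the query above comes down to the following filter. This is only an in-memory illustration in Python with invented readings and plain dictionary keys in place of the archetype paths; it is not how an AQL engine works.

```python
# Invented readings; the keys stand in for the archetype paths in the query.
readings = [
    {"systolic": 118, "diastolic": 76},
    {"systolic": 150, "diastolic": 85},
    {"systolic": 130, "diastolic": 95},
]

# Same condition as the WHERE clause: systolic >= 140 OR diastolic >= 90.
high = [r for r in readings if r["systolic"] >= 140 or r["diastolic"] >= 90]
print(high)
```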

Finally

This is it for now. The blog explains why a Semantic Database is needed, and explains the parts of it by using OpenEhr as an example.

Thank you for reading.

Bert Verhees, Senior Software Engineer at Profit4Cloud