Sunday, November 10, 2019

Data Mining on ClinicalTrials.gov's XML database

ClinicalTrials.gov is one of the largest clinical trial registries. However, its search functionality has limitations.

For example, if you search for completed 'double-hit lymphoma' trials, ct.gov returns zero results. You could take that at face value and assume the data doesn't exist, or get your hands dirty and double-check the results.

You could do this using the clinicaltrials.gov API or the XML database download. Although the API scales better, its learning curve can be steep, so the XML download may be better suited for beginners.

Luckily, clinicaltrials.gov offers an XML download of all of its data. So you can:
1. Download the data
2. Create your own non-relational database
3. Create queries to validate the default search engine results

Part 1:

The first step is pretty straightforward. All you have to do is paste the following link into the browser and hit enter: https://clinicaltrials.gov/AllPublicXML.zip

This will download a single huge zip file with all the information we need. If you extract it, you will find each individual study as its own file in a tree-like XML structure.
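If you would rather peek into the archive before setting up a database, here is a minimal Python sketch using only the standard library. It assumes the archive is saved as AllPublicXML.zip, that every .xml member holds exactly one study, and that the study ID sits under id_info/nct_id (the same path the queries in Part 3 use). The helper names iter_studies and nct_id are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET


def iter_studies(zip_source):
    """Yield (member_name, root_element) for each XML file in the archive.

    zip_source may be a path such as 'AllPublicXML.zip' or a file-like object.
    Members are parsed one at a time, so the archive is never fully extracted.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip directories and non-XML members
            with archive.open(name) as handle:
                yield name, ET.parse(handle).getroot()


def nct_id(study):
    """Return the NCT identifier of a parsed study, or None if it is missing."""
    return study.findtext("id_info/nct_id")
```

Looping over iter_studies('AllPublicXML.zip') and printing nct_id(study) for the first few studies is a quick sanity check that the download is intact.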

To filter the data further, you can explore clinicaltrials.gov's download documentation.



Part 2:

For the second step, there is a pretty easy solution. You can install something like BaseX, which can load the XML files. In fact, BaseX can load the zipped XML archive directly.



You will have to create a new database and then add AllPublicXML.zip as its source. You can give the database a more meaningful name, like ClinicalTrialsDB.


Note: To get the most recent results, you will have to download AllPublicXML.zip again and recreate the database from it.

Part 3:

The last part is building queries. Let's create one to answer the 'double hit lymphoma' question.

(: List the NCT ID of every study in which some element's text contains the phrase :)
for $article in db:open('ClinicalTrialsDB')/clinical_study
where $article//*[text() contains text { 'double hit lymphoma' } any]
return $article/id_info/nct_id/text()


The query above returns 20 results, 3 of which are completed. You can even broaden this query by using the definition of 'double-hit lymphoma': a lymphoma with a MYC rearrangement plus a BCL2 or BCL6 rearrangement.

(: Studies that mention lymphoma and MYC, plus either BCL2 or BCL6 :)
for $article in db:open('ClinicalTrialsDB')/clinical_study
where $article//*[text() contains text { 'lymphoma' } any] and
      $article//*[text() contains text { 'MYC' } any] and
      ($article//*[text() contains text { 'BCL2' } any] or
       $article//*[text() contains text { 'BCL6' } any])
return $article/id_info/nct_id/text()

These queries illustrate the advantage of this approach over the default clinicaltrials.gov search. As with any other query, you can keep adding AND and OR clauses to refine it.
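If you want to cross-check the same AND/OR logic outside BaseX, the sketch below is a rough Python approximation, not a reimplementation of BaseX's full-text engine. It assumes that lowercasing and splitting on non-word characters is close enough to the tokenization behind 'contains text', and it tests each text node on its own, as the queries above do. The names tokens, has_phrase and is_double_hit_candidate are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TOKEN = re.compile(r"\w+")


def tokens(text):
    """Lowercase word tokens of a text node; punctuation acts as a separator."""
    return [t.lower() for t in TOKEN.findall(text or "")]


def has_phrase(study, phrase):
    """True if any single text node of the study contains the phrase."""
    wanted = tokens(phrase)
    if not wanted:
        return False
    n = len(wanted)
    for element in study.iter():
        # element.text and element.tail together cover every text node
        for chunk in (element.text, element.tail):
            found = tokens(chunk)
            if any(found[i:i + n] == wanted for i in range(len(found) - n + 1)):
                return True
    return False


def is_double_hit_candidate(study):
    """lymphoma AND MYC AND (BCL2 OR BCL6), mirroring the expanded query."""
    return (
        has_phrase(study, "lymphoma")
        and has_phrase(study, "MYC")
        and (has_phrase(study, "BCL2") or has_phrase(study, "BCL6"))
    )
```

Because punctuation is treated as a separator, 'double-hit lymphoma' and 'double hit lymphoma' both match the phrase 'double hit lymphoma' in this sketch, while a spelling such as 'BCL-2' does not match 'BCL2'.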

You can also use the XML schema to put conditions on specific nodes rather than the entire tree.
e.g. $article/study_design_info/masking = 'None (Open Label)'
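The same idea of testing one specific node instead of the whole tree can be sketched in Python. The element names overall_status and study_design_info/masking are assumptions about the schema of the downloaded files, so check them against a real study file before relying on them. The helpers field_equals and is_completed_open_label are hypothetical.

```python
import xml.etree.ElementTree as ET


def field_equals(study, path, expected):
    """True if the element at `path` exists and its trimmed text equals `expected`."""
    value = study.findtext(path)
    return value is not None and value.strip() == expected


def is_completed_open_label(study):
    """Completed studies whose masking is 'None (Open Label)'."""
    return (
        field_equals(study, "overall_status", "Completed")
        and field_equals(study, "study_design_info/masking", "None (Open Label)")
    )
```

A check like field_equals(study, "overall_status", "Completed") is also one way to count how many of the matching studies are completed.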
