Dataset Insights and Summaries from our Health Data Science Team

Whether you’re a seasoned scientist or new to the field, accessing and understanding complex health datasets can be a major challenge. Health Data Scientist Lars Murdock explains how the BHF Data Science Centre team has developed resources to help researchers explore data availability, quality, and key characteristics at a glance.

Making sense of messy data

Conducting research using routinely collected health data has many natural obstacles to overcome. A survey might be designed to answer a specific research question, but health data is not. We’re often using information collected for a different purpose, such as clinical use or hospital administration. Even data collected specifically for research might not have your particular research question in mind.

Additionally, it might not be collected or recorded perfectly. Sometimes a piece of information can be logged in different ways, so different recording practices emerge. Sometimes different regions will use different software and systems. And there’ll be potential issues with collating it all: billions of records, from millions of staff, spread across thousands of institutions.

This all leads to routine health data being, in scientific terms, ‘messy’. But there’s a lot we can do to address this, and even if it’s not perfect it can still be incredibly useful.

In practice

We’ve supported dozens of research projects at the BHF Data Science Centre, working to tackle some of these obstacles. One of the key things we’ve noticed is that although the research and methods can vary greatly, they’re often using the same datasets and asking the same questions:

Can this dataset, or a combination of datasets, actually answer the research question?
How reliable is it and what are the limitations?

We can assist in a few ways, because answering these questions is largely a matter of being familiar with the datasets. So we want to offer a resource that allows researchers to become ‘familiar enough’, relatively quickly.

Summary and Insight Notebooks

We’ve developed notebooks on each dataset, called Summary notebooks and Insight notebooks. A Summary notebook is essentially a document with some code that queries the datasets, and answers typical concerns like:

How many people and records does this dataset capture?
Are they older or younger, more male or female, etc?
How far back do records go? How recent are they?
Do we have records from everywhere in the country or only some places?
How much information is missing from these records?
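
As a rough illustration of the kind of query behind these questions, here is a minimal sketch in plain Python. It is not the code from our notebooks: the toy records, the field names (person_id, event_date, sex, region) and the summarise helper are all hypothetical, made up for this example.

```python
from collections import Counter
from datetime import date

# Hypothetical toy records. Field names are illustrative only,
# not the schema of any real health dataset.
records = [
    {"person_id": "A", "event_date": date(2019, 3, 1), "sex": "F", "region": "North"},
    {"person_id": "A", "event_date": date(2020, 7, 15), "sex": "F", "region": "North"},
    {"person_id": "B", "event_date": date(2021, 1, 9), "sex": "M", "region": None},
    {"person_id": "C", "event_date": None, "sex": None, "region": "South"},
]


def summarise(rows):
    """Return topline counts, date coverage, regional spread and missingness."""
    dates = [r["event_date"] for r in rows if r["event_date"] is not None]
    fields = rows[0].keys() if rows else []
    # Fraction of records where each field is empty.
    missing = {f: sum(1 for r in rows if r[f] is None) / len(rows) for f in fields}
    return {
        "n_records": len(rows),
        "n_people": len({r["person_id"] for r in rows}),
        "earliest": min(dates) if dates else None,
        "latest": max(dates) if dates else None,
        "records_by_region": Counter(
            r["region"] for r in rows if r["region"] is not None
        ),
        "missing_fraction": missing,
    }


summary = summarise(records)
```

Each key in the returned dictionary lines up with one of the questions above: how many records and people, how far back and how recent, which regions, and how much is missing.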

It can take a while to write and run the code for these. But if somebody has already done it, researchers just need to read through the document and find what they need to get started.

However, datasets are always changing. They are updated and grow – often on a monthly basis. This means that our reporting has to keep up. Fortunately, we can schedule our code to rerun each time the data is updated, so our understanding doesn’t fall too far behind.

We also don’t know exactly what will be of interest to researchers. They might need more detail. So providing them with this code means they have an easy starting point to build on with further questions.

For example, we might provide an age breakdown and a sex breakdown of people in a dataset. But a researcher might want to know the age breakdown of the men and women separately. They would find this easy to do as most of the code already exists. Not only can they use this code for checking the dataset is fit for purpose, but it introduces a lot of code and functions that could be useful in coding their research itself.
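
To show how small that extension can be, here is a hedged sketch in plain Python. The toy data, the 20-year bands and the age_band helper are assumptions for illustration, not taken from our notebooks.

```python
from collections import Counter

# Hypothetical toy data: one row per person.
people = [
    {"age": 34, "sex": "F"},
    {"age": 71, "sex": "M"},
    {"age": 68, "sex": "F"},
    {"age": 45, "sex": "M"},
    {"age": 52, "sex": "M"},
]


def age_band(age, width=20):
    """Label an age with its band, e.g. 45 -> '40-59' for a width of 20."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# The two breakdowns a summary might already provide.
age_counts = Counter(age_band(p["age"]) for p in people)
sex_counts = Counter(p["sex"] for p in people)

# The researcher's extension: count by sex and age band together.
age_by_sex = Counter((p["sex"], age_band(p["age"])) for p in people)
```

The combined breakdown reuses the same helper and the same counting pattern, only with a pair of values as the key instead of a single value.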

Insight notebooks cover the less routine aspects. Summary notebooks answer the questions everyone asks. Insight notebooks highlight specific quirks, oddities, errors, or interactions: things that people wouldn’t necessarily know but might happen upon during their research. Collecting these doesn’t guarantee there’s nothing ‘strange’ left in a specific dataset, but at least we know the things others have already found.

Sometimes these issues can be fixed at source, sometimes they can be fixed for newer records but not retrospectively, and sometimes it’s too difficult to fix them. This is the advantage of having a notebook that’s ‘dynamic’, so we can rerun it to check the current situation.

Summary notebooks give researchers a head start in understanding the dataset they want to research. Insight notebooks provide a map of some of the pitfalls. There’s more information on our GitHub page.

Dataset Summary Dashboard

Some research projects may not be able to answer all their questions through the datasets we have access to, and some might need to make adjustments along the way.

However, where a dataset simply couldn’t answer the questions asked at all, we try to catch that early.

Health data is held, for good reason, very securely. This means that access takes time, money, and passing plenty of security checks. This can result in a circular problem. Data can’t be accessed without this thorough application process, but it’s hard to tell if the application process is worth doing without accessing the data first.

We’ve developed two approaches to this.

The first is recognising that some of those topline questions featured in the Summary notebooks don’t need to be answered with sensitive information included. For example, knowing that a dataset contains millions of individuals doesn’t disclose anything private. Nor does knowing that a dataset started in 1998.

This is information about the dataset, rather than information about the detail of the data. We call this topline information about datasets metadata. For researchers, just having some of this metadata can really help build confidence that their proposed work is viable. If they wanted to study people in the 1980s, they would quickly know whether or not a dataset is suitable and could move on if necessary.

We make sure this metadata is safe, compliant, and user-friendly before sharing it. To help researchers explore it properly, we’ve built an interactive dashboard called our Dataset Summary Dashboard. This captures all the health datasets we have access to in our consortium of researchers.

Get in touch

Once a researcher has explored the dashboard and wants to know more about the feasibility of their proposed project, they can reach out for our advice.

Whether a project is possible may hinge on very specific criteria. However, it’s not practical, or safe, for summary dashboards to cover every niche scenario.

But because the Health Data Science team works with this data every day, we often have a pretty good idea of whether a project would work. We can also flag any relevant issues, and signpost resources or similar projects that might be helpful to researchers.

Sometimes low-tech solutions are the best. We’re happy to chat, and a short call can save weeks. This is what team science is about. Please reach out if you have any questions about our work, or would like to discuss yours further.
