Pipelines, notebooks, and functions – more resources from our Health Data Science Team

10 Jun 2025

At the BHF Data Science Centre, we’re tackling the hidden heavy-lifting of data curation, making it faster, easier, and more reusable across projects. From templates to tools, Health Data Scientist Lars Murdock outlines how we’re helping researchers start with their best foot forward.

Tools and tactics to tackle repetitive work

It’s estimated that around 80% of the effort spent doing health data research goes on data curation. Gathering patient information from many messy tables into something concise and clear is essential before the actual analysis can begin.

For many who are new to health data research, there are several obstacles that slow progress down:

  • Less familiarity with the platforms where datasets are stored, known as secure data environments, and what’s available on them 
  • Less familiarity with systems that can handle large-scale data processing 
  • Less experience working with very large datasets and the best practices for handling them
  • Limited opportunity to share knowledge with researchers working on similar projects 

This is partly because working with very large datasets, and working in secure data environments, is still relatively new in the health research landscape. Another reason is that researchers coming from a health background bring valuable clinical knowledge to the project, but might not have as much experience with coding.

Addressing these curation obstacles is a key priority for the Health Data Scientists in the BHF Data Science Centre, as we work out how to speed up the process for researchers. The quicker we can get datasets from raw to ready for analysis, the further the funding, research, and computing resources will go – and the more data-driven insights can be found.

Every project is unique – but not THAT unique

If we’re aiming to help dozens of different research projects, each potentially using billions of patient records, then we have to be efficient. On our side are data science tools – specialist languages and development environments, a large amount of computing power, and a strong culture of streamlining work through automation.

Another aspect is that although the projects are all unique, they overlap in a lot of ways. The 80% effort figure is probably true for any project starting from scratch, but in our experience around 80% of what most projects are trying to do will already have been done in some form by other projects. So if curation work is thought of as building a bridge, then only small sections need to be redesigned for each project.

Getting data from raw to ready

A typical project might be interested in a particular group of people, and in gathering information about them to understand their health outcomes. If one project is studying heart disease in older women, and another is studying COVID-19 in younger men, both follow the same steps, and therefore similar instructions to code: ‘Find all the people matching description X, and gather information Y about them’.

This is a very simplified view, but it helps that the architecture of the metaphorical bridge looks very similar. We can then think about the sections that do tend to differ and ask how much they differ. 
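In code, that pattern could look something like the minimal sketch below, written here in Python with pandas. The table and column names (demographics, patient_id, sex, age) are made up for illustration, and in practice the data sit in secure data environments and run through large-scale processing systems.

```python
import pandas as pd

def select_cohort(demographics: pd.DataFrame, sex: str, min_age: int) -> pd.DataFrame:
    """'Find all the people matching description X' - here, a given sex and minimum age."""
    matches = (demographics["sex"] == sex) & (demographics["age"] >= min_age)
    return demographics.loc[matches, ["patient_id"]]

def gather_information(cohort: pd.DataFrame, records: pd.DataFrame) -> pd.DataFrame:
    """'Gather information Y about them' - join the cohort to the records of interest."""
    return cohort.merge(records, on="patient_id", how="left")
```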

Code is just a long sequence of instructions. When it gets really long, we try to break it up to make it more manageable – like chapters in a book. If we know that some chapters remain the same for every project, then we can offer that as standard, like a manual. And there will be some chapters that differ, but not by much. This approach is known as developing pipelines.
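A rough sketch of that idea, with purely illustrative step names: the pipeline simply runs the chapters in order, each one taking data in and passing data on to the next.

```python
def run_pipeline(raw_data, steps):
    """Run each 'chapter' in order: every step is a function that takes
    data in and returns data ready for the next step."""
    data = raw_data
    for step in steps:
        data = step(data)
    return data

# Illustrative only - each name would be a notebook or function in a real project:
# analysis_ready = run_pipeline(raw_data, [select_cohort, clean_ages, gather_information, quality_checks])
```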

A reusable template

For example, there might be a book chapter dedicated to a particular criterion, like selecting women for the first project and men for the second. This chapter will still be long, but all we’d need to do is swap the word ‘female’ for ‘male’ or vice versa, and the rest of the chapter would work just fine.  

This means that while we can’t offer a ready-made book chapter or bridge section, we can offer a template that works just as well, with very little input needed. 
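A minimal sketch of what such a template might look like, again with hypothetical column names and labels – the only project-specific input is the value passed in:

```python
import pandas as pd

def select_by_sex(demographics: pd.DataFrame, sex: str) -> pd.DataFrame:
    """Template 'chapter': everything stays the same except the label passed in."""
    return demographics[demographics["sex"] == sex]

# Project 1 (heart disease in older women): select_by_sex(demographics, "female")
# Project 2 (COVID-19 in younger men):      select_by_sex(demographics, "male")
```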

The two example projects also differ based on age. Sometimes though, an individual’s age might be mis-recorded in some of their data. So it’s sensible to gather all their records to identify any errors. However, many datasets will have this information, so for any given patient there may be dozens of potential age-related records to draw on. The code for this might not be very complicated, but it does require a lot of computational power and time to run across millions of patients. We’ve found that it’s better if we gather and prepare this information directly, a process called curation. That way we can hand over pre-cleaned data, rather than getting researchers on each project to clean it themselves. 
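As an illustration only – the real curation logic is more involved than this – one simple rule for reconciling conflicting records is to keep the value recorded most often for each patient. The table and column names are hypothetical:

```python
import pandas as pd

def consensus_date_of_birth(dob_records: pd.DataFrame) -> pd.DataFrame:
    """For each patient, keep the date of birth recorded most often across all
    source datasets, breaking ties by the earliest value."""
    counts = (
        dob_records.groupby(["patient_id", "date_of_birth"])
        .size()
        .reset_index(name="n_records")
        .sort_values(["patient_id", "n_records", "date_of_birth"],
                     ascending=[True, False, True])
    )
    return counts.drop_duplicates("patient_id")[["patient_id", "date_of_birth"]]
```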

The last detail on which the two example projects differ is disease. We can’t provide ‘oven-ready’ code that handles the specifics of both heart disease and COVID-19. They’ll come from different datasets, rely on different labels, and have different cleaning steps.

Experience for smarter shortcuts

But even here there are similarities! The instructions to read in the data will still be similar, and many of the quality control steps – like checking how many patients there are or when they were diagnosed – will be very close. So even when there is some assembly required, we can still provide some form of template, or even chunks of ready-made code like functions. 
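For instance, a shared quality-control function might report the headline figures that almost every project wants to see. The column names below are, again, hypothetical:

```python
import pandas as pd

def basic_quality_checks(records: pd.DataFrame) -> dict:
    """Headline checks most projects need: how many patients, and over what period."""
    return {
        "n_rows": len(records),
        "n_patients": records["patient_id"].nunique(),
        "earliest_diagnosis": records["diagnosis_date"].min(),
        "latest_diagnosis": records["diagnosis_date"].max(),
    }
```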

In fact, the more projects we support, the greater our library of existing code, and the more expertise we build as a team through experience. We have a whole range of projects – many focused on COVID-19, heart and circulatory disease, or both. This means there’s little wheel-reinventing needed for future projects.

Bridges, books, and better workflows

The metaphorical curation bridge is often referred to as a pipeline or workflow, while the sections or book chapters are known as notebooks. And the parts people might use to self-assemble are packages of functions, pre-cleaned datasets, handy reference lists, or some combination tied together into an algorithm. These form the bulk of the resources that we can offer. 
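To illustrate how those parts fit together, here is a sketch of a very simple algorithm: a reference list of diagnosis codes combined with a reusable function. The codes are genuine ICD-10 codes for ischaemic heart disease, but the list is illustrative rather than one of our published codelists, and the column names are hypothetical.

```python
import pandas as pd

# Illustrative reference list: ICD-10 codes for ischaemic heart disease.
HEART_DISEASE_CODES = {"I20", "I21", "I22", "I25"}

def flag_condition(diagnoses: pd.DataFrame, codelist: set) -> pd.DataFrame:
    """Combine a reference list with a reusable function: return the patients
    who have at least one diagnosis code in the list."""
    matches = diagnoses[diagnoses["code"].isin(codelist)]
    return matches[["patient_id"]].drop_duplicates()
```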

By working across so many projects, we can not only identify similarities and differences between projects, but also learn which code runs quickly or slowly, and which practices are most interpretable or reproducible. So while we’re coming up with our own ideas, we can also borrow the best parts from everyone else. This means new researchers really are starting their projects with their best foot forward.

Our menu of resources is available for researchers to browse on our website.
