I can’t see the forest for its trees. Big Data: It’s current and it’s newsworthy. But it’s complex. A number of factors have motivated and allowed us to better understand huge datasets from multiple sources and pull out broad, meaningful conclusions, particularly in healthcare:
- the digitization of information that was traditionally paper-based.
- the need for resource planning to tackle societal issues such as rising healthcare usage rates (and costs) resulting from the “greying tsunami”.
- the incredible computing power that has emerged in the last decade.
But have “they” figured out how to do it yet? Well, maybe now they have.
“I started exploring my data using Derek Beaton’s powerful pipelines after I saw him present at a Toronto hackathon. Through his work with ONDRI, Derek and his colleagues had developed a set of algorithms that could not only isolate outliers in huge datasets, but could then help us tease out relationships between different aspects of our data that we could not even conceptualize without the tool”, said Nathan Chan, a graduate student at the Institute of Medical Science, University of Toronto, working with the Multimodal Imaging Group (MMIG) in Toronto’s Centre for Addiction and Mental Health (CAMH).
Led by Drs. Ariel Graff-Guerrero and Philip Gerretsen, MMIG focuses on understanding the relationship between brain imaging and psychometric tests. However, during COVID, they tasked Chan and PhD student Julia Kim with looking at people’s attitudes and behaviours towards COVID through surveys of the Canadian public, focusing on uncovering ‘modifiable’ factors. The Outlier & Robust Structures (OuRS) R package (https://github.com/derekbeaton/OuRS/) and a companion ShinyApp (https://github.com/derekbeaton/ONDRIApps), developed by the neuroinformatics & biostatistics team as part of the Ontario Neurodegenerative Disease Research Initiative (ONDRI), turned out to be the right instruments for the job.
In applied science, researchers are often looking for the answers to practical questions, which can lead them to look at one aspect at a time, often focusing on a hypothesis and/or a 1:1 relationship between data points. This makes a solution easier to conceptualize quickly but can lead to mistaken conclusions, or at least conclusions that cannot be reproduced.
ONDRI’s OuRS tool, developed by the Neuroinformatics & Biostatistics (NIBS) team at Baycrest, led by Stephen Strother and Malcolm Binns, came from observations on one of the key factors in neurodegenerative disease: age. Neurodegeneration, if present, typically progresses with age. A few early data points that the NIBS team curated from ONDRI’s foundational study did not follow this known relationship. This led to extensive interaction with study operators at various labs, changes to data collection processes, and data cleanup, all to ensure data quality. Just as importantly, this process led to the development of OuRS.
The OuRS tool allows researchers to identify potential quality issues up front and helps them focus on hard-to-find outliers in their data. Once those are dealt with, the analyses that follow are more meaningful and accurate: scientists can focus on the remaining data and use the same sophisticated statistical tools to reveal ‘hidden’ relationships in it.
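To illustrate why some outliers are hard to find when variables are examined one at a time, here is a minimal sketch in Python. It is not the OuRS implementation (OuRS is an R package and its internals are not described here). It uses the classical, non-robust Mahalanobis distance as a textbook stand-in, and the age and score values, along with the function names, are made up for the example.

```python
import math

def z_scores(values):
    """Standardize one variable on its own (sample standard deviation)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix (n - 1 denominator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^-1 [dx dy]^T, with the 2x2 inverse written out.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Hypothetical (age, score) records in which the score falls steadily with age.
ages = list(range(50, 82, 2))
points = [(age, 100 - age + (1 if i % 2 == 0 else -1)) for i, age in enumerate(ages)]
# One made-up record that breaks the trend but stays in range on both variables.
points.append((78, 48))

age_z = z_scores([p[0] for p in points])
score_z = z_scores([p[1] for p in points])
distances = mahalanobis_2d(points)

# Index of the record that sits farthest from the joint pattern.
flagged = max(range(len(points)), key=lambda i: distances[i])
print(points[flagged], round(age_z[flagged], 2),
      round(score_z[flagged], 2), round(distances[flagged], 2))
```

In this toy data the appended record looks unremarkable on age alone and on score alone, yet it has by far the largest distance once both variables are considered together. A robust approach would additionally estimate the mean and covariance in a way that is less influenced by the outliers themselves, which this sketch does not attempt.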
“More data is great. But good data is always better than more data. From 2018 to 2020 at ONDRI, we’ve used these tools for close to 1,700 inquiries on our foundational study data, where we (neuroinformatics & biostatistics) go back and forth with data curators across the project. Our modeling tools, which were developed by our team including myself, Stephen Strother, Malcolm Binns, Steve Arnott, and Kelly Sunderland, were designed for this complex data and they are now helping us address these data queries in a highly efficient way. And we’re so glad to share these tools with others”, said Derek Beaton, postdoctoral fellow at Rotman Research Institute, Baycrest Health Sciences.
And share them they have: the ONDRI outlier detection pipelines helped researchers at CAMH discover interesting psychosocial profiles in response to COVID, and they can be used for many other applications, including those outside of health.