The Glue of Genomics: Will Science’s Unsung Data Heroes Abandon Academia?
Whitepaper
Published: November 21, 2023
Credit : iStock
Over the past twenty years, advances in genomics have turbocharged our ability to decode DNA and other nucleic acids.
They owe much to increased raw sequencing power and ever-more ingenious techniques for sampling genomic data, but would be for naught if we weren’t able to process and analyze the data torrent that pours from these sequencing projects.
This article explores how advances in genomic data analysis pipelines are supporting the growing field of next-generation sequencing.
Download this article to learn about:
- Collaboration in bioinformatics
- Improving best practices for utilizing software in science
- The battle to get recognition for research software engineers
Article Published: May 17, 2023 | Ruairi J Mackenzie
Advances in the last twenty years of genomics have turbocharged our ability to
decode DNA and other nucleic acids. At the turn of the century, the Human
Genome Project completed a 13-year journey to produce the first complete human
genome at a great financial cost.
In 2023, genome sequencing is a routine process that has expanded our
understanding of biology and fueled a similar enhancement in other -omics
disciplines.
These advances owe much to increased raw sequencing power and ever-more
ingenious techniques for sampling genomic data, be it from millions of individuals
in genome-wide association studies (GWAS) or from the heart of a lone
cell in single-nucleus RNA sequencing assays.
But all these advances would be for naught if we weren’t able to process and
analyze the data torrent that pours from these sequencing projects. Genomic data
analysis pipelines have had to progress as well to support this growing field.
A flexible field
Dr. Alison Meynert is a senior research fellow and the bioinformatics analysis core
manager at the MRC Institute of Genetics and Cancer (IGC) at the University of
Edinburgh. Meynert and her six-person team process data from
researcher–clinicians across the University. Meynert’s own background is in
computer science and software development, but she has now spent two decades
in bioinformatics – basically, she says, “from its infancy.”
While wet-lab researchers in individual niches might be able to laser-focus their
projects, Meynert’s team has to stay flexible. “We've had some years where there
have been big, nationally funded, whole-genome sequencing projects going on
where we'll have hundreds of samples coming in over the course of the year
across different cohorts for different research projects,” she says. Now, novel
techniques like nanopore and single-cell sequencing have to be considered as well.
That requires a deep and broad knowledge base. “Different sequencing machines
have totally different error profiles and output formats,” Meynert explains.
One thing that connects these different data sources is that, for Meynert and her
team, the goal remains to take complex, raw genomic data and turn it into a form
from which the researchers who produced it can extract relevant insights. “It’s a
very collaborative process,” says Meynert.
What makes bioinformatics so collaborative?
That collaboration is not just with the wet-lab researchers who come to the core
team for help, but with other core labs across the UK, Europe and even globally.
Meynert and her team are part of a community of bioinformaticians called nf-core.
This project began at the National Genomics Infrastructure in Stockholm, Sweden,
which created a set of standards for data analysis pipelines. The project is
sponsored by the Chan Zuckerberg Initiative and runs on cloud credits provided by
Amazon Web Services and Microsoft Azure, but the team is largely volunteers.
The IGC team is hosting nf-core’s next hackathon, Meynert tells me. “We basically
get the best of everybody’s contributions across the bioinformatics community. To
develop these pipelines, we have lots of arguments about how to do things.
Sometimes you come up with multiple ways of doing things.”
These collaborative events are commonplace across the bioinformatics world. But
this underlying ethos of sharing and collaboration, while heartening, stands in
stark contrast to practices elsewhere in biological and biomedical science. Projects like the
hugely influential FAIR Guiding Principles, which aimed to exploit the massive
increase in digital data to make information in science more findable, accessible,
interoperable and re-usable, have increased the amount of lip service paid to data
sharing. But a recent study in the Journal of Clinical Epidemiology skewered the idea
that science has significantly embraced data-sharing principles. The paper trawled
through 3,556 articles from over 300 open-access journals that were all published
in January 2019.
Just half of these studies indicated that authors were willing to share their data,
and of these, a jaw-dropping 93% of authors either did not respond to or declined
requests for data access. Is the more relaxed attitude to data sharing in
bioinformatics circles a sign that staff in these areas are more magnanimous and
beneficent people? Meynert says that the reasons are likely to be more practical.
“The bioinformatics community has always been very strongly based around
open-source software, I think in large part because if we want to develop a new tool,
we're going to need data to test it on. We need someone to have shared their data
for us to do that.”
Researchers in genomics and other biological disciplines fear being “scooped” by
rival scientists almost as much as they fear a deafening silence at the end of their
symposium talk. The resulting culture of secrecy and security around data has
proved difficult to rectify. But bioinformatics’ open-source approach has racked up
numerous success stories that benefit genomics data analysis researchers like
Meynert every day. She points to file formats like binary alignment map (BAM), a
compressed version of the text-based sequence alignment map (SAM) format.
SAM/BAM (and CRAM, a more compact version that uses reference-based compression) are some of
the most used formats in the genomics field. But they originated from individual
research groups becoming frustrated with existing formats and devising changes.
Initiatives like the Global Alliance for Genomics and Health (GA4GH) have
helped these formats become standardized, enabling them to be widely adopted
in the field. Code-hosting platforms like GitHub make it easy for these
innovations to be shared and used by other research groups. The Genome
Analysis Toolkit (GATK), developed at the Broad Institute, itself a
collaboration between MIT and Harvard, is one of the “workhorse tools of
genomics,” says Meynert.
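To give a sense of why a shared format matters, here is a minimal sketch of reading a single alignment line in the text-based SAM format. It assumes only the eleven mandatory tab-separated columns defined in the SAM specification; the class name `SamRecord`, the function `parse_sam_line` and the example read are hypothetical and made up for illustration, not taken from any tool mentioned in this article.

```python
from dataclasses import dataclass


@dataclass
class SamRecord:
    """The eleven mandatory columns of a SAM alignment line, in order."""
    qname: str   # read name
    flag: int    # bitwise flags describing the alignment
    rname: str   # reference sequence name, e.g. a chromosome
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description string
    rnext: str   # reference name of the mate read
    pnext: int   # position of the mate read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base quality string

    @property
    def is_unmapped(self) -> bool:
        # Bit 0x4 of FLAG marks a read that could not be aligned.
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        # Bit 0x10 of FLAG marks a read aligned to the reverse strand.
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; optional tags after column 11 are ignored."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(cols)}")
    return SamRecord(
        qname=cols[0], flag=int(cols[1]), rname=cols[2], pos=int(cols[3]),
        mapq=int(cols[4]), cigar=cols[5], rnext=cols[6], pnext=int(cols[7]),
        tlen=int(cols[8]), seq=cols[9], qual=cols[10],
    )


# A made-up record, for illustration only.
EXAMPLE_LINE = "read1\t16\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
record = parse_sam_line(EXAMPLE_LINE)
print(record.rname, record.pos, record.is_reverse, record.is_unmapped)
```

Because every tool agrees on what each column means, output from one group's aligner can be read by another group's downstream tool without custom conversion. In practice, analysts would likely reach for an established library or command-line tool rather than hand-rolled parsing, particularly for the binary BAM and CRAM formats.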
The unsung heroes of genomics
The efforts that have gone into creating these resources have arguably advanced
our ability to understand the genome as much as technical advances in gene
sequencing technology. But the researchers who make these innovations still work
in academic circles, where eligibility for grants and recognition
is still measured in publications and citations. How can these scientists
receive the recognition they deserve for their efforts? It’s a conundrum that has
motivated Neil Chue Hong, the founding director and principal investigator of
the Software Sustainability Institute (SSI), a project that works with all seven of the
UK’s major research councils to improve the practice of using software in science.
Chue Hong noted that one obvious barrier stopping academics who create
software and analysis tools from being recognized was the lack of a formal place in
the scientific register. At an SSI workshop in 2012, a group coined the term
research software engineer (RSE) to codify this position. “It’s a role that has always
been present in research for the last maybe four or five decades,” says Chue Hong.
“Because there's always been this idea of the researcher – the one who's good at
coding – that you ask, ‘How do I fix this piece of software that's not working?’ I think
what was causing problems was that there wasn't enough recognition for this role,
and as software use became more and more prevalent, that role became more
and more important.”
But the increasing influence of RSEs in academia has not been reflected in the
amount of recognition they receive. The supremacy of “the final publication” is part
of the issue. This gives an undue amount of credit to a mythical figure that Chue
Hong calls the “lone hero principal investigator” who is, in theory, credited
with the bulk of the work that goes into a scientific paper. In reality,
research is a collaborative practice, and a funding system that recognizes the
contribution of each lab member to a final publication is an important first step
towards realizing this, says Chue Hong: “[Funding bodies] are moving towards
recognizing things like narrative CVs and researcher resumes that show it’s not just
about the publications, but about the way you disseminate knowledge and the way
you pass on skills to other people,” he adds.
There’s progress being made elsewhere toward recognizing the contribution of RSEs
to genomics and other fields. The American National Standards Institute (ANSI) and
National Information Standards Organization (NISO) jointly announced the
publication of their CRediT Contributor Roles Taxonomy last year, a framework that
divides the practice of science into 14 contributor roles, including
conceptualization, funding acquisition, investigation and software. In 2022, the
American Chemical Society announced a pilot of the CRediT taxonomy in its journals.
Chue Hong says that he believes the main sticking point in this area lies in the peer
review process. “The last remaining barrier is getting peer reviewers to understand
that science is done very differently in 2023 than 20 years ago,” he suggests. This
might involve challenging principles that are embedded within researchers’
psyches. “Survivorship bias is unhelpful. Just because someone was successful, by
fighting through a particular way of doing things in academic research, doesn't
mean that everyone else has to fight that same fight,” says Chue Hong.
Genomics without glue
The battle to get recognition for RSEs is one that should concern all of science.
Much of genomics’ analysis toolkit wouldn’t exist without them, and one doesn’t
have to look too far ahead to see what might happen if RSEs continue to be shut
out of a recognized place in academia. “Increasingly, what people will do is quit
academia,” says Chue Hong. “You used to have a choice between an academic role
and an industry role in research. The difference was that academia was meant to
give you more job security, a better pension and a more fruitful working
environment, where you've got lots and lots of really interesting people to
collaborate with. The tradeoff was salary.
“Now, industry offers you possibly better working conditions, probably a better
pension, definitely better people to work with and a better salary. You have to
really want to work in academia now, if you're in a software role. The challenge
that I think the RSE sector faces is to encourage people to stay in academic
research environments and not go into industrial research environments.”
Those struggles will be familiar to many within academia. But RSEs play a vital role
at the intersection between an increasingly digital, data-rich informatics
environment and the wet-lab work that drives biology. Meynert says that her day
job will often involve taking tools created by other research groups and fitting
them to the task at hand, applying mortar to make different analysis tools work in
tandem. “An awful lot of informatics,” she says, “is gluing things together.” If
academic research begins to lose RSEs and their contribution, it might start to see
how much relies on that glue’s grip.
©2023 Technology Networks, all rights reserved, Part of the LabX Media Group