Biological dark data in a time of viral spillover
The world changed forever in 1989 when Tim Berners-Lee invented the World Wide Web. Today, immediate access to massive amounts of daily content, information, photos, videos, opinions and stories is a normal part of everyday life.
Berners-Lee later said of his invention that “it is the unexpected reuse of information that is the value added by the Web”.
Assistant Research Professor Nathan Upham arrived at ASU in February 2020 to work with the university’s Biodiversity Knowledge Integration Center (BioKIC) and Natural History Collections.
Scientists and researchers around the world found this to be truer than ever when the March 2020 closures due to the COVID-19 pandemic effectively cut off access to laboratories, physical libraries and collections.
This halt in the flow of information has highlighted an issue of growing concern in the scientific and academic communities – the staggering volume of human knowledge that is digitally inaccessible, and therefore effectively unavailable for synthetic analysis.
Nathan Upham, assistant research professor in Arizona State University’s School of Life Sciences, along with other scientists around the world, recently published a position paper in The Lancet Planetary Health highlighting the importance of this issue and the obstacles it has posed for researchers battling COVID-19.
“The COVID-19 pandemic shows that siloed science does not serve society as effectively as open, interconnected science,” Upham said.
Serving public health
In April 2020, Upham joined a task force formed by the Consortium of European Taxonomic Facilities (CETAF) and the Distributed System of Scientific Collections (DiSSCo).
The objective was to bring together experts from the bioinformatics and collections communities with those in virology and public health to provide a perspective on the wild host organisms involved in the COVID-19 pandemic, and to identify the infrastructure needed to prepare for future epidemics.
Upham led a subgroup of the task force through ASU’s Biodiversity Knowledge Integration Center (BioKIC). Early information on the SARS-CoV-2 coronavirus suggested it came from a bat. But which bat?
“We didn’t know much about it, either in terms of taxonomy or ecology,” Upham said. “Not all bats are created equal.”
There are over 1,400 species of bats in the world and they are found in all parts of the planet.
In addition, all of the approximately 6,500 described species of mammals, as well as many birds, can harbor viruses that could potentially be transmitted to humans – a so-called “spillover” event.
The task force immediately saw that the great diversity of mammals was not receiving the same attention as viruses and their consequences for human health.
Viruses are primarily studied by immunologists and virologists, with relatively little input from zoologists and disease ecologists. Knowledge of mammal hosts is a significant data gap.
“Evidence for the species – its DNA, preserved specimens, tissue and geography – should be linked to evidence for the virus,” Upham said. “And that’s the link that was missing.”
Nico Franz, Virginia M. Ullman Professor of Ecology and director of biocollections, said: “Innovations in data science can resolve tensions between evolving taxonomic knowledge and societal needs to reliably integrate information on mammalian viral vectors.
“Dr. Upham is leading the way in building a data language that better prepares us for future, fully expected disruptions in taxonomic knowledge.”
The National Institutes of Health recently awarded ASU a $300,000 grant to “intelligently predict the risks of viral spillover from bats and other wild mammals,” which will run until May 2023.
Upham is the principal investigator on the NIH project, leading a School of Life Sciences team that includes Franz, Associate Professor Beckett Sterner and Professor Arvind Varsani, who also works with the Biodesign Center for Fundamental and Applied Microbiomics.
“We realized that public health people needed to know more about mammals, and they didn’t have the most accurate information,” Upham said.
To complicate matters further, large amounts of existing data on mammalian taxonomy – including on bats – remain inaccessible.
Biological dark data
Physicists use the term “dark matter” to denote the large amounts of still-unmeasurable material that make up the building blocks of the universe. Biodiversity scientists have adopted a similar name, “dark data,” for scientific data which, although published, is cut off from digital knowledge resources.
Some data is inaccessible simply because it is not digitized. This includes old, rare physical collections or archives, and printed publications – also known as “gray literature”. Other data may technically be in digital format but is just as inaccessible, locked behind paywalls or trapped in unstructured formats, such as PDFs.
“The old purpose of publication was to print it – to send it out,” Upham said. “We need a new publishing paradigm where the goal is to connect pieces of data to form knowledge.”
For data to form digital knowledge, Upham and his colleagues on the task force agree that it must comply with the FAIR Data Principles – a set of guiding principles proposed by a consortium of scientists and organizations to support the reuse of digital assets, as published in Scientific Data in 2016.
According to the consortium’s proposal, data is “FAIR” when it is findable, accessible, interoperable and reusable.
However, even when the data is FAIR, it can still be behind paywalls and inside PDFs, rather than digitally connected and ready for analysis. Therefore, even though these guidelines have been adopted by many research institutions around the world since 2016, shedding light on the zoonotic origins of COVID-19 is exactly the kind of unexpected reuse of data for which biodiversity science was not prepared at the start of the pandemic.
Upham and his colleagues describe several ongoing efforts to address the problem, noting that the needed solution is twofold:
- First, existing biological data must be brought out of obscurity – print publications must not only be digitized, but key findings must be extracted and logged into freely accessible formats across taxonomic groups, starting with groups of immediate biomedical concern.
- Second, steps must be taken to stop contributing to the problem by switching to open access publishing formats that adhere to FAIR data principles and digitally tagging relevant data types for reuse, especially when linked to host-pathogen and host-host relationships.
Taxonomists, ecologists, data scientists and policy makers all play a vital role in this paradigm shift towards digital knowledge.
“People think taxonomy is boring, that it’s known,” Upham said. “But it’s not known. It is a constantly evolving space of knowledge that must be treated as its own science.”
Taxonomy is the scientific study of the naming, definition, and classification of groups of organisms based on common characteristics. And while it may seem simple, straightforward, and maybe even boring, it can be anything but.
Clear and consistent terminology in digital metadata is a key element in ensuring that data is findable, accessible, interoperable and reusable. But as our understanding of biodiversity expands over time, new traits are observed, classifications become more specific, species are split (or lumped together), and new language is introduced.
This complicates matters when the majority of data records are categorized by species name and those names often change or change meaning over time.
“A good example would be the North American deer mouse, Peromyscus maniculatus,” said Upham. “Until a few years ago, that single name encompassed what is now five species. With the division of the species, the old name does not disappear; it takes on a new meaning, which in this case refers to the deer mice east of the Mississippi River.”
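One way to picture the problem Upham describes is to store each record's species name together with the taxonomic treatment that gives the name its meaning. The short Python sketch below is only an illustration under stated assumptions: the class `TaxonConcept`, the labels "before split" and "after split", the lookup table `SPLITS` and the function `current_candidates` are all hypothetical, and the four placeholder entries stand in for species the article does not name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConcept:
    """A scientific name paired with the treatment that defines its meaning."""
    name: str
    treatment: str  # hypothetical label, e.g. "before split" or "after split"


# The older, broader meaning of the name.
OLD_DEER_MOUSE = TaxonConcept("Peromyscus maniculatus", "before split")

# Same name, narrower meaning: deer mice east of the Mississippi River.
NEW_DEER_MOUSE = TaxonConcept("Peromyscus maniculatus", "after split")

# Assumed lookup table: an older concept maps to the narrower concepts that
# replaced it. The placeholder entries are not real species names.
SPLITS = {
    OLD_DEER_MOUSE: [NEW_DEER_MOUSE]
    + [TaxonConcept(f"placeholder species {i}", "after split") for i in range(2, 6)],
}


def current_candidates(concept: TaxonConcept) -> list[TaxonConcept]:
    """Return the present-day concepts a record filed under `concept` may belong to."""
    return SPLITS.get(concept, [concept])


if __name__ == "__main__":
    for candidate in current_candidates(OLD_DEER_MOUSE):
        print(candidate.name, "|", candidate.treatment)
```

In this sketch, a record filed under the old meaning of the name resolves to five possible present-day species, while a record filed under the new meaning resolves only to itself. The name string alone cannot distinguish the two cases, which is why the treatment is stored alongside it.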
Upham got his start in biocollections and fieldwork in the Great Basin of Nevada, traveling to the wilderness to do population genetic surveys on weekends while studying for his undergraduate degree in Los Angeles. He then went on to graduate school, where he studied the fossils and DNA of the rodent group that includes South America’s capybara, the largest living rodent in the world.
“I really like the field-to-lab aspect of research,” he said. “It opens your eyes to nature in a different way – it’s your job to be there, not just because you’re camping or something. And you bring that information back to the lab.”
He arrived at ASU in February 2020 to work with BioKIC and the ASU Natural History Collections, excited about the opportunity to work on “frameworks to move taxonomy to a next-generation space, where it is no longer an obstacle but rather the main tool we use to link observations of biodiversity,” he said.