Hecatomb: Skirting the Shadows with Viral Dark Matter

Mark Davis | UTS Features (2024-2025)

Schrödinger’s cat famously posed the question of how to classify the unobserved. In this thought experiment, Schrödinger imagines a cat in a box equipped with a mechanism to kill it. Until we look inside, the cat’s status is unknown. Unbound by reality, the cat is simultaneously alive and dead. Particles governed by quantum mechanics behave in the same way: their state is unknowable until it is observed. Viral dark matter, or the undefined genetic data of viruses, is quite similar. While we know it exists, we are unsure of how to catalog it. A 2015 study from the University of Pennsylvania School of Medicine found that in samples taken from human skin, 90% of the viral data collected was dark matter. This abundance of unknown viral samples could be the key to finding the next pandemic-causing virus before it finds us. In the wake of COVID-19, it’s imperative that scientists look inside Schrödinger’s box and determine the true identity of the dark matter they possess. This problem is a lot bigger than a single feline, though. If there were instead thousands of cats in a box, could we truly know whether each one is alive or dead, peacefully sleeping or ready to pounce? Similarly, the genetic identity of a virus and the dangers it poses are unknown until we peek inside.

While scientists worldwide possess many samples of viruses and their genetic data, they lack an efficient way to classify these viruses according to their genetic differences.
To catalog them, scientists first take an environmental sample of a substance containing viruses, such as soil. Computer programs then analyze these viral samples to determine their genetic sequences. The sampled sequences are compared to a database of reference virus sequences, which includes many different families of known viruses, like SARS-CoV-2, the virus that causes COVID-19. These reference sequences are contained in a massive viral database called the virosphere, which retains the genomes of all known viruses and bacteriophages. At Flinders University in Australia, an NIH-funded project has developed a new strategy for sequencing and classifying viral dark matter. The researchers are working to improve the field of viral metagenomics: the analysis of all viral genomes found in an environmental sample.

Once researchers at Flinders collect a physical sample containing viral DNA, modern genetic sequencing technologies allow them to read the order of nucleotides, the chemical bases that make up an organism’s unique genetic profile. With the genetic sequence stored in a digital format, researchers can annotate it by searching for similarities to other organisms, with the hope of determining the sampled virus’s taxonomy. Successful annotation depends on the size and diversity of the database used for comparisons and on the ability of the search algorithms to find these similarities. These same two factors are also where metagenomics runs into trouble.
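To make the idea of annotation concrete, the sketch below compares a query sequence against a tiny reference database by counting shared k-mers (short subsequences of fixed length). This is a toy stand-in and not how Hecatomb searches: real pipelines rely on alignment-based search tools and far larger databases. The sequences, the names `virus_A` and `virus_B`, and the 0.5 threshold are all made up for illustration.

```python
def kmers(seq: str, k: int = 4) -> set[str]:
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def annotate(query: str, references: dict[str, str],
             k: int = 4, threshold: float = 0.5) -> tuple[str, float]:
    """Label a query with the reference sharing the most k-mers.

    The score is the fraction of the query's k-mers found in a reference.
    Queries scoring below `threshold` (an arbitrary, hypothetical cutoff)
    stay "unclassified", which is the toy equivalent of viral dark matter.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return "unclassified", 0.0

    best_name, best_score = "unclassified", 0.0
    for name, ref in references.items():
        score = len(query_kmers & kmers(ref, k)) / len(query_kmers)
        if score > best_score:
            best_name, best_score = name, score

    if best_score < threshold:
        return "unclassified", best_score
    return best_name, best_score


# Hypothetical reference "database" with two made-up genomes.
references = {
    "virus_A": "ATGGCGTACCTGAAGT",
    "virus_B": "TTTTCCCCGGGGAAAA",
}

print(annotate("GCGTACCTGA", references))   # fragment of virus_A
print(annotate("ACACACACAC", references))   # matches nothing in the database
```

The second query shows why the database matters as much as the algorithm: a sequence with no relative in the reference set cannot be named, however good the search is.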

Metagenomics currently faces a roadblock when it comes to reference genome databases. Researchers must choose between a large database, which casts a wide net but is less precise, and a smaller, more accurate one, which isn’t feasible for a large-scale study. In other words, they are forced to trade scalability against accuracy. Beyond this, annotations can also be flawed due to false-positive classifications. Since viruses integrate some host genetic material, a viral sequence may be falsely classified as belonging to its host organism.

In the face of these viral sequencing difficulties, researchers have developed a new bioinformatic program to combat issues in the field. Dubbed “Hecatomb,” it’s named for the Ancient Greek tradition of sacrificing animals in large numbers. Rather than animals, Hecatomb “sacrifices” thousands of sequences so that the taxonomic comparisons in subsequent metagenomic analyses are more reliable. Those sacrificed are the incorrectly classified sequences, which are “killed” by Hecatomb’s quality control capabilities.

Hecatomb runs in four modules: preprocessing, annotation, assembly, and final annotation. In preprocessing, contaminants such as sequences identified as coming from the virus’s host are removed. Next, annotations are made to the viral sequence data. Individual reads, meaning short genome segments, are used here rather than entire genomes. As matches are made to the reference database, the program verifies their quality multiple times before taxonomic data (kingdom, phylum, etc.) is put into a table. Each match is also assigned a Baltimore virus type, a classification numbered one to seven that depends on the kind of genome a virus carries and how it replicates. A viral genome may be made of DNA or RNA, and it may be single-stranded or double-stranded. An adenovirus, for example, is a double-stranded DNA virus and belongs to group one. The third step is the assembly of contiguous sequences, or contigs, which are built by joining overlapping fragments of the genome, like a puzzle whose pieces connect at their ends to form a long chain. Finally, these longer assembled sequences are annotated and contribute to a growing record of data for the sampled virome.
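The puzzle-piece picture of assembly can be sketched in a few lines of Python. The example below is a toy greedy merger, not Hecatomb’s assembler: it repeatedly joins the two fragments whose ends overlap the most. The reads, the function names, and the minimum overlap of three bases are assumptions chosen for illustration, and the sketch ignores sequencing errors, repeats, and reverse-complement strands, all of which real assemblers must handle.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Greedily merge the best-overlapping pair until no overlaps remain."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to join
        # Join the pair, keeping the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three made-up reads whose ends overlap by four bases each.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(assemble(reads))  # ['ATGGCGTACCTGA']
```

Reads that share no overlap are simply returned unmerged, which mirrors how fragments from unrelated genomes stay in separate contigs.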

The true power of Hecatomb is in the way it keeps itself clean, like an Ancient Greek sacrificial altar wiped fresh between each religious rite. While managing reference-based annotations of both short and long sequences, the program cultivates a rich array of genetic data for accurate viral identification. Hecatomb has proven its potential in several studies, one of which involved Simian Immunodeficiency Virus (SIV)-infected rhesus macaques. Using the Hecatomb database to categorize SIV samples was more efficient than traditional methods, reducing the time and resources needed to achieve the same viral identification.

These improved capabilities for viral metagenomics are both impactful and scalable. For instance, the false-positive classifications that are so prevalent in other computer programs, like VirSorter2, can have serious impacts if left unchecked; an artificially high estimate of species diversity can emerge from falsely classified viruses in an environmental sample. Virome analysis is also becoming increasingly important for ecological studies in the face of climate change. Studies being conducted in subzero Arctic brines are working against the clock to understand virus and microbe interactions in these extreme climates. Viruses in these fairly constant environmental conditions have been found to experience lower evolutionary pressures than those in a more fluctuating sea environment. These viruses can contribute important data for climate scientists to model ecological responses in microbial communities. These viral ecosystems affect carbon cycling, and a shift in this ecosystem’s structure could lead to broader negative climate impacts. It’s vital that efficient methods for metagenomic analysis in these studies are available in a warming climate. Otherwise, we won’t know the microbial impacts until they’ve already caused devastation on a large scale.

Hecatomb also has extensive public health applications. Accurate, efficient viral identification is key both for future pandemic prevention and for ongoing COVID-19 healthcare. For instance, a misclassified virus in a clinical sample can lead to an ill-informed patient diagnosis. Discovery and classification of novel viruses allow medical officials to form a public health framework that is shaped around vulnerable populations and can lead to behavior change strategies. A primary example would be encouraging the use of masks for a respiratory virus, so that immunocompromised people are safer from infection. Developing these precautions first requires knowledge of the virus itself. Whereas previous metagenomic programs would struggle to quickly provide this vital information to public health officials, Hecatomb offers a brighter future. By shedding Hecatomb’s light on viral dark matter, virologists and geneticists can improve modern knowledge of viruses and the societal framework that upholds us all.


Works Cited:

Roach, Michael J, et al. “Hecatomb: An Integrated Software Platform for Viral Metagenomics.” OUP Academic, Oxford University Press

Desk, News. “Unlocking the World’s ‘Virosphere.’” News

“‘Sacrifice’ of Virus Data Clears the Path to Open a Disease Discovery Pipeline.” ScienceDaily, ScienceDaily

“Metagenomics.” Genome.Gov

Krishnamurthy, Siddharth R., and David Wang. “Origins and challenges of viral dark matter.” Virus Research, vol. 239

Roach, Michael J, Sarah J Beecroft, Kathie A Mihindukulasuriya, Leran Wang, Anne Paredes, Luis Alberto Cárdenas, et al. “Hecatomb: An integrated software platform for viral metagenomics.” GigaScience, vol. 13

“WHO Public Health and Social Measures Initiative.” World Health Organization, World Health Organization

Hannigan, Geoffrey D., et al. “The human skin double-stranded DNA Virome: Topographical and temporal diversity, genetic enrichment, and dynamic associations with the host microbiome.” mBio, vol. 6, no. 5