Scientists from the Universities of Manchester and Oxford have developed an artificial intelligence framework that can identify and track new and concerning COVID-19 variants and could help with other infections in the future. The framework combines dimensionality reduction techniques with a new interpretable clustering algorithm called CLASSIX, developed by mathematicians at the University of Manchester. This allows groups of viral genomes that may pose a future risk to be identified quickly from very large volumes of data.
The research, published this week in the Proceedings of the National Academy of Sciences (PNAS), could support traditional methods of tracking viral evolution, such as phylogenetic analysis, which currently require extensive manual curation.
Roberto Cahuantzi, a researcher at the University of Manchester and first and corresponding author of the paper, said: "Since the emergence of COVID-19, we have seen multiple waves of new variants, with increased transmissibility, evasion of immune responses and increased disease severity. Scientists are now stepping up their efforts to pinpoint these new variants of concern, such as alpha, delta and omicron, at their earliest stages. If we can find a fast and effective way to do this, we can respond more proactively, for example by developing targeted vaccines, and may even be able to eliminate variants before they become established."
Like many other RNA viruses, COVID-19 has a high mutation rate and a short time between generations, meaning it can evolve extremely quickly. This means identifying new strains that may cause problems in the future will require a huge effort.
Currently, nearly 16 million sequences are available from the GISAID database (Global Initiative on Sharing All Influenza Data), which provides access to genomic data of influenza viruses.
Mapping the evolution and history of all COVID-19 genomes from this data currently requires a significant amount of computer and human time.
The method described enables the automation of such tasks. It took the researchers just one to two days to process 5.7 million high-coverage sequences on a standard modern laptop, something that is not possible with existing methods. The reduced resource requirements put the identification of relevant pathogen strains within reach of more researchers.
Thomas House, professor of mathematical sciences at the University of Manchester, said: "The unprecedented amount of genetic data produced during the pandemic requires us to improve our methods and analyze it thoroughly. The data is still growing rapidly, but unless the benefit of curating it is demonstrated, it is at risk of being removed or deleted."
"We know that human experts' time is limited, so our approach should not completely replace the work of humans, but should work alongside them to complete the work faster and free up our experts to work on other important development work."
The proposed method works by breaking down the genetic sequences of the COVID-19 virus into smaller "words" (called 3-mers), which are counted so that each sequence is represented by numbers. It then uses machine learning techniques to group similar sequences based on their word patterns.
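The word-counting step can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the function names, the use of overlapping windows, the normalization to frequencies, the handling of ambiguous bases and the toy sequences are all assumptions made for illustration, and a plain Euclidean distance stands in for the dimensionality reduction and CLASSIX clustering described in the article.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_vector(sequence, k=3, alphabet="ACGT"):
    """Turn a sequence into a fixed-length vector of k-mer frequencies.

    Hypothetical helper: overlapping windows and frequency normalization
    are illustrative choices, not details taken from the paper.
    """
    # Every possible word over the alphabet, in a fixed order (64 for k=3).
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    seq = sequence.upper()
    # Slide a window of length k across the sequence and count each word.
    windows = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Words with ambiguous bases (e.g. N) are not in the vocabulary
    # and are therefore ignored.
    counts = [windows.get(word, 0) for word in vocabulary]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy sequences invented for this example (real genomes are ~30,000 bases).
reference = "ACGTACGTTAGCACGT"
near_copy = "ACGTACGTTAGCACGA"   # differs from the reference by one base
unrelated = "GGGGCCCCGGGGCCCC"

ref_vec = kmer_vector(reference)
dist_near = euclidean(ref_vec, kmer_vector(near_copy))
dist_far = euclidean(ref_vec, kmer_vector(unrelated))

print("reference vs near copy:", dist_near)
print("reference vs unrelated:", dist_far)
```

Because every sequence becomes a vector of the same length regardless of genome size, sequences with similar word patterns end up close together, which is the property a clustering step relies on.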
Stefan Güttel, Professor of Applied Mathematics at the University of Manchester, said: "The clustering algorithm we developed, CLASSIX, is much less computationally demanding than traditional methods and is fully interpretable, that is, it provides both textual and visual explanations of the calculated clusters."
Roberto Cahuantzi added: "Our analysis is a proof-of-concept that demonstrates the potential use of machine learning methods as an early warning tool for the early detection of emerging major variants without relying on generated phylogenies. While phylogeny remains the 'gold standard' for understanding viral ancestry, these machine learning methods are able to accommodate orders of magnitude more sequences than current phylogenetic methods at low computational cost."
Compiled from: SciTechDaily