Seminars

NO.197 Computational Pangenomics

Shonan Village Center

February 20 - 24, 2023 (Check-in: February 19, 2023)

Organizers

  • Paola Bonizzoni
    • University of Milano-Bicocca, Italy
  • Alberto Policriti
    • University of Udine, Italy
  • Kunihiko Sadakane
    • University of Tokyo, Japan

Overview

Important: The venue will be "Shonan OVA", not Shonan Village Center.

Description of the Meeting

Computational Pangenomics encompasses the different research efforts aimed at moving the existing paradigm from a sequence-based reference genome to a pan-genome, i.e., an evolutionarily coherent collection of genomes. Such a transition is urgently needed to effectively exploit the mass of data produced by technical advances in, and the widespread adoption of, sequencing technologies. Graph-based representations of collections of genomes and diploid-aware assemblers have recently been proposed, but a large amount of work is still needed to bring a pan-genomic view into current research practice. Indeed, the traditional approach considers a single sequence as the reference genome; that sequence was obtained from tissue samples of unknown donors and has been refined through the integration of different samples. The human reference genome is therefore a fusion of several individuals’ genomes, in which the characteristics of each single genome are lost. This approach has led to important contributions to our understanding of human physiology and of several pathologies, such as cancer. However, it was essentially motivated by the limits of early sequencing technologies and by their associated costs. In recent years, new sequencing technologies have revolutionized the field by increasing the throughput (i.e., the amount of sequence produced in a single run), by increasing the length of the produced sequences (longer sequences make it easier to disambiguate repetitive regions of the genome), and by increasing the quality of the base calls (fewer errors make it possible to reliably capture variations among individuals), while costs have decreased dramatically (sequencing can almost be considered a routine task). The resulting wealth of data bears the promise of a new course for precision medicine (i.e., adapting treatments to each individual’s genetic profile).
For example, thanks to the advancements in sequencing technologies, it is now possible to characterize the genetic content of a single cell. This has profound implications for the study of the evolution of cancer, where the genetic content of different cancer cells may differ because of the progressive accumulation of mutations that occur during the replication of the cancer cells. Finally, we should note that, even though the human genome was the main focus in the early days of Bioinformatics, we are now witnessing the spread of sequencing technologies to the characterization of a growing number of species. For example, widespread sequencing efforts targeting the novel SARS-CoV-2 virus played a central role in the response to the pandemic, since the characterization of virus variations is aiding in tracking the international spread and in the development of vaccines. From the computational perspective, the core problem is now how to find, represent, and query or compare a very large set of genetic variations obtained from large collections of genomes, with the ultimate goal of making sense of such a wealth of data, both to improve our understanding of the underlying biological mechanisms and to fulfil the promises of translational precision medicine. Some initial and promising representations have been proposed, based either on (multi)graphs or on indexes of highly repetitive collections of strings, but much further work is needed to complete the transition to novel practical representations of the reference pan-genome and to novel algorithms able to exploit them.

As a consequence, the development of computational pangenomics must be sustained by the coordinated efforts of different highly specialized research areas: from research in stringology and indexing (for developing novel and efficient representations of the pangenome), to research in any area of Bioinformatics (for transitioning existing algorithms to the new paradigm), to research in data mining (for further exploring new applications and discovering potential new associations between genetic variations and phenotypic traits). The meeting aims to provide an occasion for researchers from these areas to present recent advances and to raise the questions that will drive future research efforts in computational pangenomics.

Report

No.197.pdf