REGINDEX
Compressed Indexes for Regular Languages with Applications to Computational Pan-genomics

Project

Sorting is, arguably, the most powerful algorithmic building block when it comes to indexing data. At the same time, the regularities exposed by sorting are precisely those enabling data compression. In the last two decades, this fascinating duality has led researchers to the design of compressed full-text indexes: data structures supporting fast pattern matching queries over compressed text. In this project, we revisit the natural generalization of the problem to labeled graphs from a new perspective: we interpret graphs as finite-state automata and investigate the connections existing between their propensity to be sorted and the regular languages they recognize. This novel language-theoretic approach makes it possible to transfer fundamental results between the mature fields of regular language theory and compressed text indexing. The project aims at building this bridge by developing a new theory of compressed regular language indexing.

Logo REGINDEX
Partial co-lexicographic order of a DFA

The project finds important applications to the rapidly-expanding field of computational pangenomics, where the goal is to study the variations contained in the genomes of an entire population. Recent research has shown that representing pan-genomes as labeled graphs is an important step to reduce reference allele bias. Existing approaches, however, can index only restricted classes of graphs, thereby limiting the practical applicability of such powerful pan-genome representations. 

The project’s approach, based on sorting regular languages by partial co-lexicographic orders (see figure), changes the perspective from which the compressed indexing problem has been tackled in the literature. This project aims at developing a theory of graph indexing and compression based on the natural interplay between sorting and regular language theory. 

Team

Nicola Prezza

Principal Investigator

Nicola received his PhD in Computer Science from the University of Udine (Italy) in 2017. He spent research periods (as a postdoc) at the universities of Pisa (Italy) and DTU (Denmark), and later has been Assistant professor at LUISS (Rome). He is now Associate professor at Ca' Foscari university of Venice. Nicola Prezza's research is focused on the interplay between data structures, data compression, and regular language theory. In 2018 he received the "best italian young researcher in Theoretical Computer Science" award from the italian chapter of the European Association for Theoretical Computer Science (IC-EATCS).

Sung-Hwan Kim
Sung-Hwan Kim

Postdoc

Sung-Hwan received his PhD degree in computer science and engineering from Pusan National University, Busan, South Korea, in 2018. Since then, he had been a postdoctoral researcher at Pusan National University until he joined Nicola Prezza's team in September 2022. He has been interested in several research topics including metric indexing, string matching, and sequence alignment, among which the most favorite is design and analysis of simple algorithms and space-efficient data structures for pattern matching problems.

Ruben Becker

Assistant professor (RTD-A)

Ruben holds a PhD degree from Saarland University (Germany). While being a doctoral student (2014-2018) he was a member of the Algorithms and Complexity department at Max Planck Institute for Informatics. He was then a Postdoctoral researcher in the Computer Science department of the Gran Sasso Science Institute in L'Aquila, Italy (2018-2022). Since November 2022 he is an assistant professor (RTD-A) at Ca' Foscari University of Venice. Ruben's research revolves around the topics algorithms, graphs, and randomness. For his PhD thesis, he obtained the Dr. Eduard Martin Preis that is awarded yearly to the best dissertation within the Faculty of Computer Science and Mathematics at Saarland University.

After studying Mathematics at the University of Zagreb (Zagreb, Croatia), Bojana completed a PhD in Theoretical Computer Science at the Max Planck Institute for Informatics (Saarbrücken, Germany) in 2018. Subsequently, she was a PostDoc at the Gran Sasso Science Institute (L’Aquila, Italy), until she joined Ca’ Foscari University of Venice (Venice, Italy) as a PostDoc on the REGINDEX project. Her areas of interest are Algorithms and Complexity and Algorithmic Game Theory.

After graduating in Medical bioinformatics in 2019, Davide is completing a Ph.D. in Computer Science at the University of Verona (Verona, Italy). He has now joined Ca' Foscari University of Venice (Venice, Italy) in February 2023 as a postdoc researcher in the REGINDEX group. His main area of interest is algorithms and data structures on strings; in particular, he focused on algorithms to perform pattern matching and to compute the Burrows-Wheeler Transform and the Suffix array of string collections.

Open positions

Several positions (PhDs, Postdocs, Assistant Professor) will be opened during the life span of the project (Sept 2022 - Sept 2027). Expressions of interest are welcome at any time: please send an email to nicola.prezza@unive.it if you are interested and/or have any questions.