REGINDEX
Compressed Indexes for Regular Languages with Applications to Computational Pan-genomics

Project Publications Team Contacts

Project

Sorting is, arguably, the most powerful algorithmic building block when it comes to indexing data. At the same time, the regularities exposed by sorting are precisely those enabling data compression. In the last two decades, this fascinating duality has led researchers to the design of compressed full-text indexes: data structures supporting fast pattern matching queries over compressed text. In this project, we revisit the natural generalization of the problem to labeled graphs from a new perspective: we interpret graphs as finite-state automata and investigate the connections existing between their propensity to be sorted and the regular languages they recognize. This novel language-theoretic approach makes it possible to transfer fundamental results between the mature fields of regular language theory and compressed text indexing. The project aims at building this bridge by developing a new theory of compressed regular language indexing.

The project finds important applications to the rapidly-expanding field of computational pangenomics, where the goal is to study the variations contained in the genomes of an entire population. Recent research has shown that representing pan-genomes as labeled graphs is an important step to reduce reference allele bias. Existing approaches, however, can index only restricted classes of graphs, thereby limiting the practical applicability of such powerful pan-genome representations.

The project’s approach, based on sorting regular languages by partial co-lexicographic orders (see figure), changes the perspective from which the compressed indexing problem has been tackled in the literature. This project aims at developing a theory of graph indexing and compression based on the natural interplay between sorting and regular language theory.

Software

Publications

Journal Papers

Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, Nicola Prezza. Co-lexicographically Ordering Automata and Regular Languages - Part I. Journal of the ACM (JACM), 2023.
Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza, Towards a Definitive Compressibility Measure for Repetitive Sequences, in IEEE Transactions on Information Theory (IEEE TIT), 2023.

Conference Papers

Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Alberto Policriti and Nicola Prezza. Optimal Wheeler Language Recognition. In: proceedings of SPIRE 2023.
Nicola Cotumaccio, Travis Gagie, Dominik Koeppl and Nicola Prezza. Space-time Trade-offs for the LCP Array of Wheeler DFAs. In: proceedings of SPIRE 2023.
Ruben Becker, Manuel Cáceres, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Francisco Olivares, Nicola Prezza. Sorting Finite Automata via Partition Refinement. In: proceedings of ESA 2023.
Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, Marinella Sciortino. Computing matching statistics on Wheeler DFAs. In: proceedings of DCC 2023.
Sung-Hwan Kim, Francisco Olivares, Nicola Prezza. Faster Prefix-Sorting Algorithms for Deterministic Finite Automata. In: proceedings of CPM 2023.

Workshop Presentations

Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Alberto Policriti and Nicola Prezza. Testing Wheelerness of Regular Languages. ICTCS 2023
Ruben Becker, Alessio Conte, Sung-Hwan Kim, Bojana Kodric, Andrea Marino, Nicola Prezza, Ana Shirley Ferreira da Silva and Davide Tonetto. The Graph Wheelerization Problem. Data Structures for Bioinformatics (DSB) 2023. Delft, Netherlands. March 21-22, 2023.
Riccardo Maso, Ruben Becker, Sung-Hwan Kim, Bojana Kodric and Nicola Prezza. Random Wheeler graphs. Data Structures for Bioinformatics (DSB) 2023. Delft, Netherlands. March 21-22, 2023.
Ruben Becker, Manuel Caceres Reyes, Sung-Hwan Kim, Bojana Kodric, Francisco Olivares and Nicola Prezza. Sorting Wheeler NFAs using relational partition refinement. Data Structures for Bioinformatics (DSB) 2023. Delft, Netherlands. March 21-22, 2023.
Sung-Hwan Kim, Francisco Olivares and Nicola Prezza. Prefix-Sorting Strings on Deterministic Finite Automata. Data Structures for Bioinformatics (DSB) 2023. Delft, Netherlands. March 21-22, 2023.
Ruben Becker, Manuel Caceres Reyes, Sung-Hwan Kim, Bojana Kodric, Francisco Olivares and Nicola Prezza. Using relational partition refinement to sort Wheeler NFAs. Sequences, 11-12 May 2023, University of London.

Team

Nicola Prezza

Principal Investigator

Bio

Nicola received his PhD in Computer Science from the University of Udine (Italy) in 2017. He spent research periods (as a postdoc) at the universities of Pisa (Italy) and DTU (Denmark), and later has been Assistant professor at LUISS (Rome). He is now Associate professor at Ca' Foscari university of Venice. Nicola Prezza's research is focused on the interplay between data structures, data compression, and regular language theory. In 2018 he received the "best italian young researcher in Theoretical Computer Science" award from the italian chapter of the European Association for Theoretical Computer Science (IC-EATCS).

Sung-Hwan Kim

Postdoc

Bio

Sung-Hwan received his PhD degree in computer science and engineering from Pusan National University, Busan, South Korea, in 2018. Since then, he had been a postdoctoral researcher at Pusan National University until he joined Nicola Prezza's team in September 2022. He has been interested in several research topics including metric indexing, string matching, and sequence alignment, among which the most favorite is design and analysis of simple algorithms and space-efficient data structures for pattern matching problems.

Ruben Becker

Assistant professor (RTD-A)

Bio

Ruben holds a PhD degree from Saarland University (Germany). While being a doctoral student (2014-2018) he was a member of the Algorithms and Complexity department at Max Planck Institute for Informatics. He was then a Postdoctoral researcher in the Computer Science department of the Gran Sasso Science Institute in L'Aquila, Italy (2018-2022). Since November 2022 he is an assistant professor (RTD-A) at Ca' Foscari University of Venice. Ruben's research revolves around the topics algorithms, graphs, and randomness. For his PhD thesis, he obtained the Dr. Eduard Martin Preis that is awarded yearly to the best dissertation within the Faculty of Computer Science and Mathematics at Saarland University.

Bojana Kodric

Postdoc

Bio

After studying Mathematics at the University of Zagreb (Zagreb, Croatia), Bojana completed a PhD in Theoretical Computer Science at the Max Planck Institute for Informatics (Saarbrücken, Germany) in 2018. Subsequently, she was a PostDoc at the Gran Sasso Science Institute (L’Aquila, Italy), until she joined Ca’ Foscari University of Venice (Venice, Italy) as a PostDoc on the REGINDEX project. Her areas of interest are Algorithms and Complexity and Algorithmic Game Theory.

Davide Cenzato

Postdoc

Bio

After graduating in Medical bioinformatics in 2019, Davide is completing a Ph.D. in Computer Science at the University of Verona (Verona, Italy). He has now joined Ca' Foscari University of Venice (Venice, Italy) in February 2023 as a postdoc researcher in the REGINDEX group. His main area of interest is algorithms and data structures on strings; in particular, he focused on algorithms to perform pattern matching and to compute the Burrows-Wheeler Transform and the Suffix array of string collections.

Carlo Tosoni

PhD student

Bio

After graduating in 2021 with a degree in computer engineering from the University of Perugia, Carlo enrolled in the master's degree programme in computer science at the University of Pisa, where he graduated in July 2023. His chosen curriculum, 'Big Data Technologies', aimed to develop the skills needed to manage large collections of data in the most efficient possible manner. Two months later, he started a PhD programme at Ca' Foscari University in Venice, in the research group of Nicola Prezza.

Daniel Puttini

PhD student

Bio

Daniel earned his master's degree in Computer Science from the University of Parma in March 2023, having completed a thesis in bioinformatics. Following graduation, he collaborated with the Department of Food and Drugs at the University of Parma for six months. In November 2023, he joined the REGINDEX research group at Ca' Foscari University of Venice (Venice, Italy), where he is currently engaged as a (“pre-doc”) research fellow, eagerly anticipating the commencement of his doctoral studies.

Alessio Campanelli

PhD student

Bio

Alessio graduated in October 2024 in Computer Science at Ca' Foscari University of Venice (Italy), with a thesis on Bioinformatics. He then started his PhD programme at the same university, with the REGINDEX research group. He is a competitive programming enthusiast and is main area of interest is compact data structures.

Open positions

Several positions (PhDs, Postdocs, Assistant Professor) will be opened during the life span of the project (Sept 2022 - Sept 2027). Expressions of interest are welcome at any time: please send an email to nicola.prezza@unive.it if you are interested and/or have any questions.

Type	Name	Sender (Domain)	Description	Duration	Policy
Type	Name	Sender (Domain)	Description	Duration	Policy
Essential	_ga	Google (unive.it)	It registers a unique ID used to generate static data on how the visitor uses the website.	2 years	Information by Google
Essential	_gid	Google (unive.it)	It monitors and analyses traffic data and it stores the user's behaviour in order to improve his/her experience.	1 day	Information by Google
Essential	_gat[*]	Google (unive.it)	Used by Google Analytics to limit the frequency of requests.	1 day	Information by Google
Essential	_shibsession[], _shibsstate[]	Unive.it (www.unive.it)	They maintain the session data of the SingleSignOn.	session	Information by Ca' Foscari University
Essential	PHPSESSID	Unive.it (www.unive.it)	Unique user identifier for the website applications.	session	Information by Ca' Foscari University
Essential	_Secure-3PSIDCC	Google (google.com)	It allows to count visits and traffic sources in order to measure and improve the website performances. It helps to discover which webpages are more/less popular and how visitors browse into the website. It does not collect information on the user's identity or any personal data. All information are aggregated and anonymised.	session	Information by Google
Essential	SIDCC	Google (google.com)	Security cookie used to protect users' data from unauthorised accesses.	1 year	Information by Google
Essential	__Secure-3PAPISID	Google (google.com)	It allows to count visits and traffic sources in order to measure and improve the website performances. It helps to discover which webpages are more/less popular and how visitors browse into the website. It does not collect information on the user's identity or any personal data. All information are aggregated and anonymised.	1 year	Information by Google
Essential	SSID	Google (google.com)	It collects information on the website use every time users visit webpages containing Google services.	2 year	Information by Google
Essential	__Secure-3PSID	Google (google.com)	It allows to count visits and traffic sources in order to measure and improve the website performances. It helps to discover which webpages are more/less popular and how visitors browse into the website. It does not collect information on the user's identity or any personal data. All information are aggregated and anonymised.	2 year	Information by Google
Essential	SID; HSID	Google (google.com)	They are used to autheticate users ensuring access to their account only if they are the relevant owner. The "SID" and "HSID" cookies contain records with digital signature and encrypted records related to the users' Google Account ID and their more recent access. The combination of these cookies allow to block many types of attacks, such as the attempts to steal the contents sent via Google forms.	2 year	Information by Google
Essential	SAPISID	Google (google.com)	It stores the user's preferences and information every time he/she visits webpages containing Google services.	2 year	Information by Google
Essential	APISID	Google (google.com)	It stores the user's preferences and information every time he/she visits webpages containing Google services.	2 year	Information by Google
Essential	cookie[*]	Unive.it (www.unive.it)	It stores the user's preferences on cookies. user preferences on cookies.	1 month	Information by Ca' Foscari University
Essential	cookie	idp.unive.it	It stores the user's preferences on cookies.	1 month	Information by Ca' Foscari University
Essential	_idp[*]	Unive.it (idp.unive.it)	Authentication management and SingleSignOn. management and SingleSignOn	session	Information by Ca' Foscari University
Essential	fe_typo_user	Unive.it (www.unive.it)	Unique user identifier for the reserved area of the website	session	Information by Ca' Foscari University
Essential	g_UA_45500906_8	Google (unive.it)	It collects information on the website use every time users visit webpages containing Google services.	session	Information by Google
Essential	JSESSIONID	Unive.it (www.unive.it)	Used to create web sessions into the Personal Area.	session	Information by Ca' Foscari University
Essential	ADMCMD_prev	Unive.it (www.unive.it)	Used to create web sessions into the Personal Area.	session	Information by Ca' Foscari University
Essential	unive.it	Unive.it (www.unive.it)	It stores the user's preferences on cookies.	6 months	Information by Ca' Foscari University
Essential	noiframe	Unive.it (www.unive.it)	It stores the user's preferences on cookies.	6 months	Information by Ca' Foscari University
Google - Youtube	__Secure-1PAPISID	Google (google.com)	Used for targeting purposes in order to acquire web visitors' interests and show them pertinent and customised Google advertising.	2 years	Information by Google
Google - Youtube	CONSENT	Google (google.com)	Used by Google to store the user's preferences.	17 years	Information by Google
Google - Youtube	NID	Google (google.com)	Used by Google to store the user's preferences.	6 months	Information by Google
Google - Youtube	__Secure-1PSID	Google (google.com)	Used for targeting purposes in order to acquire web visitors' interests and show them pertinent and customised Google advertising.	2 years	Information by Google

REGINDEX Compressed Indexes for Regular Languages with Applications to Computational Pan-genomics

Project

Publications

Journal Papers

Conference Papers

Workshop Presentations

Team

Nicola Prezza

Bio

Sung-Hwan Kim

Bio

Ruben Becker

Bio

Bojana Kodric

Bio

Davide Cenzato

Bio

Carlo Tosoni

Bio

Daniel Puttini

Bio

Alessio Campanelli

Bio

Open positions

REGINDEX
Compressed Indexes for Regular Languages with Applications to Computational Pan-genomics