REGINDEX
Compressed Indexes for Regular Languages with Applications to Computational Pan-genomics

Project

Sorting is, arguably, the most powerful algorithmic building block when it comes to indexing data. At the same time, the regularities exposed by sorting are precisely those enabling data compression. In the last two decades, this fascinating duality has led researchers to the design of compressed full-text indexes: data structures supporting fast pattern matching queries over compressed text. In this project, we revisit the natural generalization of the problem to labeled graphs from a new perspective: we interpret graphs as finite-state automata and investigate the connections existing between their propensity to be sorted and the regular languages they recognize. This novel language-theoretic approach makes it possible to transfer fundamental results between the mature fields of regular language theory and compressed text indexing. The project aims at building this bridge by developing a new theory of compressed regular language indexing.

Logo REGINDEX
Partial co-lexicographic order of a DFA

The project finds important applications to the rapidly-expanding field of computational pangenomics, where the goal is to study the variations contained in the genomes of an entire population. Recent research has shown that representing pan-genomes as labeled graphs is an important step to reduce reference allele bias. Existing approaches, however, can index only restricted classes of graphs, thereby limiting the practical applicability of such powerful pan-genome representations. 

The project’s approach, based on sorting regular languages by partial co-lexicographic orders (see figure), changes the perspective from which the compressed indexing problem has been tackled in the literature. This project aims at developing a theory of graph indexing and compression based on the natural interplay between sorting and regular language theory. 

Open positions

Several positions (PhDs, Postdocs, Assistant Professor) will be opened during the life span of the project (Sept 2022 - Sept 2027). Expressions of interest are welcome at any time: please send an email to nicola.prezza@unive.it if you are interested and/or have any questions.

At the moment, expressions of interest are welcome for the following positions.

2 Postdoc and 1 Assistant Professor positions within algorithms and data structures

I am looking for two Postdocs and one Assistant Professor (non-tenure-track / RTD-A) in the field of algorithms and data structures, starting September 1st 2022. As a member of the project’s team, you will collaborate with me and the rest of the team on topics including (depending on your expertise):

  • Compression and indexing of texts and labeled graphs
  • Measures of repetitiveness (compressibility) for texts and graphs
  • Regular expression matching and indexes for regular languages 
  • Fine-grained lower bounds for regular expression matching
  • Implementation of efficient tools for solving pattern matching on labeled graphs

The above problems will be attacked by generalizing existing solutions for strings (for example: suffix sorting, grammar compression, run-length Burrows-Wheeler Transform, string attractors) using the framework recently published (a natural generalization of suffix sorting to labeled graphs).

Other duties

You will be encouraged to publish and participate in international conferences and workshops related to the project’s topics (the grant will finance these activities). There are no teaching duties associated with the Postdoc positions. The Assistant Professor will have a very limited teaching load (30 hours per year).

Qualifications

Familiarity with the above topics is a plus but it is not a strict requirement. The candidate should, however, be able to demonstrate a good publication track in one (or more) of the following broad areas: algorithms and data structures, information theory, formal language theory. A PhD in Computer Science is required for the Assistant Professor position, while it represents a plus for the Postdoc positions. Additional titles/qualities that represent a plus are: proficiency with the C++ programming language, strong publication record, ability to work in a research team. 

The campus

The research will take place at Ca’ Foscari University of Venice (scientific campus of via Torino - a ten minutes bus ride from Venice). 

Position and salary

Each of the two offered Postdoc positions is full time and lasts one year starting from September 1st 2022, with the possibility to extend the contract for two further years. The Postdoc positions are tax-free so the net salary is very competitive (about 28,000 net Euros per year). The Assistant Professor position is full-time and lasts 3 year starting from September 1st 2022, with the possibility to extend the contract for two further years. The net salary of the Assistant Professor is about 25,000 Euros per year. 

Application procedure

This page will be updated as soon as the official application procedure will be made public. At the moment, if you are interested or desire further information you are welcome to send me an email at nicola.prezza@unive.it