CASPER
Cross-Modal Alignment for Semantic Processing and Effective Representation

Overview

CASPER is an MSCA research project advancing multimodal AI by aligning heterogeneous modalities, including neural signals (EEG/MEG/fMRI), with language-centric models. The project develops graph-aware methods that align language, vision, audio/speech, and neural signals, combining graph-structured fusion, scalable training, and rigorous evaluation to deliver more grounded multimodal understanding and practical neural-to-speech reconstruction pipelines.

Cross-modal representation learning
Build unified representations across text, image, audio/speech, and video using structured fusion and graph-based learning.

GNN-LLM fusion and grounding
Integrate graph neural networks with multimodal LLMs to improve semantic grounding, interpretability, and compositional reasoning.

Neural decoding to speech and audio
Develop neural foundation models to decode and reconstruct speech/audio from brain activity (EEG/MEG/fMRI), targeting reliable zero-/few-shot transfer.

Research goals

  • Learn shared semantic spaces across text, image, audio/speech, and video with explicit cross-modal alignment objectives
  • Improve grounding and robustness of multimodal generation and reasoning via graph-structured representations
  • Enable neural decoding of speech/audio (and related percepts) from EEG/MEG/fMRI with strong generalization and uncertainty awareness
  • Release reproducible artifacts (code, models, evaluation protocols) aligned with open science practices

Methodology

  • Graph-based multimodal fusion: hierarchical aggregation and concept-level representations using GNNs
  • Alignment + supervised fine-tuning: cross-modal contrastive objectives combined with task-specific adaptation on curated multimodal datasets
  • Neural foundation modeling: unified encoders for EEG/MEG/fMRI with cross-subject/session normalization and efficient decoding heads
  • Evaluation & reliability: benchmarking for factual/semantic grounding, robustness across domains, and neural reconstruction fidelity
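To make the cross-modal contrastive objective above concrete, here is a minimal NumPy sketch of a symmetric InfoNCE (CLIP-style) loss between two modality batches. The function names and the temperature value are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def symmetric_infonce(za, zb, temperature=0.07):
    """CLIP-style symmetric contrastive loss between two modality batches.

    za, zb: (batch, dim) embeddings where row i of za and row i of zb form
    a positive pair; all other rows in the batch act as negatives.
    """
    za, zb = l2_normalize(za), l2_normalize(zb)
    logits = za @ zb.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(za))               # positives sit on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # cross-entropy in both directions (a->b and b->a), averaged
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls paired embeddings together while pushing apart the other in-batch pairs, which is the basic mechanism behind shared semantic spaces.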

Research

Objectives

Our research is structured around three specific objectives (SOs) that drive innovation in multimodal AI, graph-structured fusion, and neural decoding.

These objectives work synergistically to advance multimodal AI systems, from unified representation learning to neural decoding pipelines that bridge brain activity and speech/audio reconstruction.

Multimodal representation learning

SO1
Develop a human-like multimodal agent that learns unified, graph-structured representations across text, image, audio/speech, and video for robust semantic understanding and generation.

Key activities

  • Build concept-level graphs from grid/patch and token features
  • Learn aligned multimodal embeddings with contrastive and structured objectives
  • Evaluate grounding, compositionality, and robustness across modalities
  • Release reusable representation and evaluation components

Supervised fine-tuning and optimization

SO2
Refine and optimize the multimodal agent through supervised fine-tuning on curated multimodal datasets, improving reliability, generalization, and controllability.

Key activities

  • Curate training mixtures and task formulations across modalities
  • Fine-tune with quality/robustness targets and safety-aware evaluation
  • Improve efficiency via parameter-efficient adaptation and scalable training
  • Establish benchmarking protocols for generalization and failure modes
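Parameter-efficient adaptation is commonly realised with LoRA-style low-rank updates; the sketch below assumes that approach purely to illustrate the efficiency argument (a frozen base weight plus a small trainable factorised update), and does not represent the project's actual fine-tuning setup.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (rank r).

    Only A and B are updated during fine-tuning, so the trainable
    parameter count drops from d_out * d_in to r * (d_in + d_out).
    """
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                   # frozen base weight
        self.A = rng.normal(scale=0.01, size=(r, W.shape[1]))
        self.B = np.zeros((W.shape[0], r))           # zero init: no change at start
        self.scale = alpha / r

    def __call__(self, x):
        # effective weight = frozen W + scaled low-rank correction
        return x @ (self.W + self.scale * self.B @ self.A).T
```

Because B starts at zero, the adapted layer initially reproduces the frozen model exactly, and fine-tuning only has to learn the small correction.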

Neural representation learning and decoding (EEG/MEG/fMRI)

SO3
Develop unified neural encoders and decoding pipelines that map brain activity to aligned multimodal semantics, enabling neural-to-speech/audio reconstruction with strong cross-subject transfer.

Key activities

  • Build modality-agnostic encoders for EEG/MEG/fMRI time series
  • Align neural embeddings with multimodal semantic spaces (text/audio/image)
  • Develop efficient decoders for speech/audio reconstruction and evaluation
  • Validate generalization across subjects, sessions, and datasets
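A common baseline for mapping neural features into an aligned semantic space is a closed-form ridge regression evaluated by retrieval accuracy; the sketch below illustrates that baseline on synthetic data and is not the project's decoder. All names and the evaluation protocol are illustrative assumptions.

```python
import numpy as np

def ridge_decoder(X, Y, lam=1.0):
    """Closed-form ridge map from neural features X (n, v) to semantic
    embeddings Y (n, d): W = (X^T X + lam * I)^{-1} X^T Y."""
    v = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(v), X.T @ Y)

def retrieval_accuracy(pred, Y):
    """Top-1 retrieval: each predicted embedding should be closest (by
    cosine similarity) to its own target among all candidates."""
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return float((np.argmax(p @ t.T, axis=1) == np.arange(len(Y))).mean())
```

Retrieval-style metrics of this kind are a standard way to quantify how well decoded embeddings land near the correct semantic target, before attempting full speech/audio reconstruction.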

Impact

Timeline
24-month technical roadmap, from literature review to neural decoding validation

Project duration:

24 months from literature review and data preparation to neural decoding validation

Key milestones:

3 critical checkpoints ensuring steady progress and high-quality deliverables

Technical phases:

4 structured research phases covering all aspects from design to validation

Phase 1
  • Literature review
  • State-of-the-art analysis
  • Dataset collection
  • Data preprocessing
Phase 2
  • Novel architecture design
  • Multimodal training framework
  • Cross-modal alignment
  • Integration of GNN & LLM
Phase 3
  • Visual task fine-tuning
  • Audio/speech processing
  • Model optimization
  • Hyperparameter tuning
Phase 4
  • Neural representation learning
  • Brain signal decoding
  • Lab validation (MIPLab & NoCE)
  • Performance evaluation

Project impacts

CASPER aims to generate significant scientific, societal, and economic impacts through groundbreaking research in cognitive neuroscience and artificial intelligence.

Scientific impact

  • Advance understanding of the neural mechanisms underlying semantic processing and multimodal cognition
  • Develop novel computational models bridging neuroscience and AI
  • Create new methodologies for analyzing brain-behavior relationships
  • Contribute to theoretical frameworks in cognitive neuroscience
  • Generate high-impact publications in top-tier journals

Societal impact

  • Improve diagnostic tools for neurological and psychiatric disorders
  • Enhance educational strategies for cognitive skill development
  • Inform policy decisions on mental health and cognitive wellness
  • Promote public understanding of brain function and plasticity
  • Support evidence-based interventions for cognitive enhancement

Economic impact

  • Create intellectual property and potential licensing opportunities
  • Foster innovation in neurotechnology and AI industries
  • Reduce healthcare costs through improved diagnostic methods
  • Generate employment opportunities in research and development
  • Attract investment in European neuroscience research initiatives

Long-term vision

  1. Research excellence
    Establish new standards for interdisciplinary research combining neuroscience, psychology, and artificial intelligence
  2. Knowledge translation
    Bridge the gap between basic research findings and practical applications in healthcare and education
  3. Capacity building
    Train the next generation of researchers in cutting-edge methodologies and interdisciplinary approaches
  4. Global collaboration
    Foster international partnerships and knowledge exchange in cognitive neuroscience research

Outcomes

Expected outcomes
Sharing our research findings and engaging with the scientific community through publications, conferences, and outreach activities

  • 15+ publications
  • 5+ collaborations
  • 3 patents
  • 10+ conferences

Team

The CASPER project brings together world-class researchers from leading academic institutions across Europe. This interdisciplinary collaboration combines expertise in computer science, neuroscience, and cognitive science to advance multimodal AI systems and neural decoding technologies for speech and audio reconstruction.

Researcher and expert supervisors

Ambuj Mehrish

MSCA Researcher (Principal Investigator)
Ca' Foscari University of Venice
ambuj.mehrish@unive.it

Expertise: Multimodal AI, Graph Neural Networks, Neural Decoding

Sebastiano Vascon

Principal Supervisor
Ca' Foscari University of Venice
sebastiano.vascon@unive.it

Expertise: Computer Vision, Graph Neural Networks, Pattern Recognition

Valentina Borghesani

Co-Supervisor
NoCE Labs, University of Geneva
valentina.borghesani@unige.ch

Expertise: Cognitive Neuroscience, Semantic Processing, fMRI Analysis

Dimitri Van De Ville

Co-Supervisor
Neuro-X Institute, EPFL
dimitri.vandeville@epfl.ch

Expertise: Neuroimaging, Signal Processing, Brain Connectivity

Partner institutions

CVML (Computer Vision & Machine Learning)
Computer Vision & Machine Learning Laboratory - Ca' Foscari University of Venice
Internationally renowned for pioneering work in graph theory, game theory, and computer vision applications.
EPFL
MIPLab (Medical Image Processing Lab) - EPFL
Based at the Neuro-X Institute, dedicated to advanced neuroimaging and signal processing for brain functional connectivity research.
NoCE
Neurobiology of Concepts Expression Laboratory - University of Geneva, Faculty of Psychology and Educational Sciences
Leading pioneering research at the nexus of neurobiology, semantics, and language processing.