A protein language model unveils the E. coli pangenome functional landscape regulating host proteostasis

Published in bioRxiv, 2026

Daniel Martinez-Martinez*, Andreea Aprodu*, Cassandra Backes*, Franziska Ottens, Aleksandra Zecic, Hannah Doherty, Jonas Widder, Ivan Andrew, Laurence Game, Iliyana Kaneva, Georgia Roumellioti, Alex Montoya, Holger Kramer, Thorsten Hoppe, Filipe Cabreiro. A protein language model unveils the E. coli pangenome functional landscape regulating host proteostasis. bioRxiv (2026). doi: 10.64898/2026.01.15.699719. *These authors contributed equally to this work.

Understanding how bacterial diversity at strain-level resolution shapes host physiology is a central challenge in microbiome research. The vast, functionally unknown genetic diversity within a species pangenome makes it difficult to connect genes to function and to their impact on host physiology. Here, we explore how the functional landscape of the Escherichia coli pangenome impacts transcriptional responses in Caenorhabditis elegans and show that traditional gene-centric methods fail to provide significant functional associations with the host. Thus, we developed a pangenome framework that leverages the protein language model ProtT5 and generates unique strain embeddings representing the functional potential of each of 9,558 E. coli isolates. Stratification of the pangenome into distinct functional guilds aligned with key host processes such as cell division, metabolism and proteostasis. Further, we identify a critical interplay between the extensive network of bacterial chaperones and proteases in regulating host proteostasis. We find that the bacterial chaperone DnaK/HSP70 and protease ClpX fine-tune the host ubiquitin-proteasome system by controlling propionate and vitamin B12 availability. These findings reveal a conserved ‘co-proteostasis’ mechanism as a key phenomenon modulating host-microbe interactions through metabolic communication. Our pangenome-to-phenotype approach offers a powerful strategy to decode bacterial pangenome functional diversity, directly linking microbial genomic variation to host physiological outcomes.
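
The abstract does not say how per-protein ProtT5 representations are combined into a single strain embedding. As a purely illustrative sketch, one simple possibility is to mean-pool the per-protein vectors of an isolate. Everything below is an assumption rather than the authors' method: the pooling rule, the function names `strain_embedding` and `cosine_similarity`, the embedding width of 1024, and the random vectors that stand in for real ProtT5 output.

```python
import numpy as np


def strain_embedding(protein_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool per-protein vectors (n_proteins, dim) into one strain vector (dim,).

    Hypothetical aggregation rule: the source does not specify how strain
    embeddings are derived from protein embeddings.
    """
    if protein_embeddings.ndim != 2 or protein_embeddings.shape[0] == 0:
        raise ValueError("expected a non-empty (n_proteins, dim) array")
    return protein_embeddings.mean(axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two strain vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Placeholder data: random vectors stand in for real per-protein embeddings.
rng = np.random.default_rng(0)
dim = 1024  # assumed embedding width, used here only for illustration
strain_a_proteins = rng.normal(size=(5, dim))  # toy isolate with 5 proteins
strain_b_proteins = rng.normal(size=(7, dim))  # toy isolate with 7 proteins

emb_a = strain_embedding(strain_a_proteins)
emb_b = strain_embedding(strain_b_proteins)
similarity = cosine_similarity(emb_a, emb_b)
print(f"strain vector shape: {emb_a.shape}, similarity: {similarity:.3f}")
```

Under this assumption, isolates with different gene counts map to vectors of the same length, so they can be compared or clustered directly, for example when grouping strains into functional guilds.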