Decoding the language of intrinsically disordered regions of proteins and their roles in human cancers

Kiersten Ruff, Matthew King in Rohit Pappu’s lab use machine learning to uncover the molecular grammars of disordered protein regions

Beth Miller 11.12.2025

Using unsupervised machine learning, researchers in Rohit Pappu’s lab uncovered a finite number of grammars referred to as GIN clusters. The analysis showed that specific GIN clusters were associated with determining the localization preferences of proteins in cells. They also help explain the functional organization at the molecular level and temporal ordering of key molecular processes such as the production of ribosomes, which are the protein translation machines. (Credit: Pappu lab)


Every function in a cell is associated with a particular protein or group of proteins, typically in a well-defined three-dimensional structure. However, intrinsically disordered regions of proteins defy this structure-function paradigm. A team of researchers in the McKelvey School of Engineering at Washington University in St. Louis has developed an algorithm to understand how intrinsically disordered regions in proteins might be organized into distinct functional classes.

Kiersten Ruff, staff research scientist, and Matthew King, a postdoctoral research associate, both in the lab of Rohit V. Pappu, the Gene K. Beare Distinguished Professor in the Department of Biomedical Engineering, used an algorithm previously developed in the Pappu lab, NARDINI+, to analyze the amino acid sequences of intrinsically disordered regions (IDRs) of proteins to uncover and organize so-called molecular grammars. NARDINI+ detects non-random amino acid usage patterns and non-random arrangements of amino acids along linear sequences. Applying unsupervised learning to these features, Ruff discovered that the grammars of naturally occurring sequences fall into a finite number of clusters, each with a specific set of functions. The effort led to the creation of a resource they refer to as GIN, for Grammars Inferred using NARDINI+.

Results of the research were published in Cell Nov. 12, 2025.

Pappu, who has studied IDRs in his lab for nearly two decades, said IDRs challenge conventional thinking and conventional approaches.

“If a protein region does not adopt a specific structure, then figuring out the functions it can perform and how it performs these functions becomes challenging,” he said.

His lab has leveraged physical principles aimed at decoding the information written into the amino acid sequences of IDRs to determine if there was sequence specificity to the types of conformations that particular IDRs adopt. Their work revealed that IDRs can adopt specific types of structures and perform specific types of functions, governed by the amino acid compositions, or the alphabet, and the linear arrangement of specific pairs of amino acid types, or the syntax, of IDR sequences.

Former doctoral student Megan (Cohan) White, former postdoctoral researcher Min Kyung Shinn, and Pappu introduced the NARDINI algorithm in 2022. It takes an IDR sequence as input and assesses whether the different syntaxes present within that sequence are non-random. Their work showed that key binary patterns were non-random, and that IDRs associated with similar functions shared similar non-random binary patterns in their sequences.
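The idea of testing whether a binary pattern is non-random can be illustrated with a short Python sketch. It is not the NARDINI implementation: the scoring function, the residue groupings and every name below are assumptions chosen for illustration. The sketch counts how often residues from the same group sit next to each other, then compares that count with the counts from shuffled versions of the same sequence, which preserve the alphabet but scramble the syntax.

```python
import random
from statistics import mean, pstdev

# Assumed residue groupings, for illustration only.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def same_group_adjacencies(seq, group_a, group_b):
    """Count neighboring residue pairs that both fall in group_a or both in group_b."""
    count = 0
    for left, right in zip(seq, seq[1:]):
        both_a = left in group_a and right in group_a
        both_b = left in group_b and right in group_b
        if both_a or both_b:
            count += 1
    return count


def patterning_z_score(seq, group_a, group_b, n_shuffles=500, seed=0):
    """Z-score of the observed count against a null of composition-preserving shuffles.

    Positive values mean same-group residues are more segregated into blocks
    than expected by chance; negative values mean they are more mixed.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = same_group_adjacencies(seq, group_a, group_b)
    residues = list(seq)
    null_counts = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same amino acid composition, random order
        null_counts.append(same_group_adjacencies(residues, group_a, group_b))
    spread = pstdev(null_counts)
    if spread == 0:
        return 0.0  # no variation in the null, so no meaningful z-score
    return (observed - mean(null_counts)) / spread


if __name__ == "__main__":
    blocky = "KKKKKKKKEEEEEEEE"       # charges segregated into two blocks
    alternating = "KEKEKEKEKEKEKEKE"  # charges perfectly mixed
    print(patterning_z_score(blocky, POSITIVE, NEGATIVE))
    print(patterning_z_score(alternating, POSITIVE, NEGATIVE))
```

In this toy measure, a sequence with charges segregated into blocks scores above zero and a sequence with alternating charges scores below zero, even though both have the same composition. A collection of such scores across many residue-group pairs would form the kind of feature vector that could be compared between sequences.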

That led Ruff to develop NARDINI+ as part of a collaboration with the lab of Cigall Kadoch, associate professor of pediatric oncology at the Dana-Farber Cancer Institute and Harvard Medical School and an investigator at the Howard Hughes Medical Institute. Ruff broadened the usage and scope of NARDINI+ with the intent to analyze all IDR sequences in the human proteome. In doing so, she asked whether a finite set of grammars is in use across the human proteome.

Using unsupervised machine learning, Ruff uncovered a finite number of grammars referred to as GIN clusters. The analysis showed that specific GIN clusters were associated with determining the localization preferences of proteins in cells. They also help explain the functional organization at the molecular level and temporal ordering of key molecular processes such as the production of ribosomes, which are the protein translation machines. Through collaborations with Kadoch and her students, the team accessed large-scale data generated by the Broad Institute to show that previously identified correlations between pairs of genes in cancer cells could be explained as being correlations between IDR grammars.

King, who works jointly in the Pappu and Kadoch labs, tested the accuracy of inferences that GIN generated regarding the localization preferences underwritten by IDR grammars.

“Having an IDR of a specific grammar seems to be a key determinant, although not the only determinant, of the preferred subcellular location of a protein,” King said.

“An important insight that has emerged from the creation of GIN is the realization that gene translocations that cause specific human cancers are the result of IDR grammars being upended by mutations,” Ruff said. “The prediction is that these altered grammars lead to a rewiring of specific interaction networks, which we can now identify, that activates cellular proliferation programs.”

GIN promises to be a useful resource to guide new research aimed at uncovering new information about IDRs, Pappu said. It also has the potential to enable the design of synthetic IDRs that can perform bespoke functions. Going forward, Pappu and his colleagues are working closely with Kadoch’s lab to design studies that will help them understand how altered IDR grammars drive proliferative programs in human cancers.

Further investigations into GIN were recently funded by the ASPIRE program sponsored by the Mark Foundation for Cancer Research, which has seeded a three-way collaboration among the labs of Kadoch; Pappu; and Denes Hnisz, research group leader at the Max Planck Institute for Molecular Genetics.

Ruff KM, King MR, Ying AW, Liu V, Pant A, Lieberman WE, Shinn MK, Su X, Kadoch C, Pappu RV. Molecular grammars of predicted intrinsically disordered regions that span the human proteome. Cell, Nov. 12, 2025. DOI: https://doi.org/10.1016/j.cell.2025.10.019

This research was supported with funding from the Air Force Office of Scientific Research (FA9550-20-1-0241); St. Jude Research Collaborative on the Biology and Biophysics of RNP Granules; National Science Foundation (MCB-2227268); National Institutes of Health (F32GM146418-01A1, K99GM152778, R01NS121114); and the Howard Hughes Medical Institute.

Code and data that provide access to GIN and three Google Colab notebooks associated with the study are available at https://github.com/kierstenruff/RUFF_KING_Grammars_of_IDRs_using_NARDINI-.

The McKelvey School of Engineering at Washington University in St. Louis promotes independent inquiry and education with an emphasis on scientific excellence, innovation and collaboration without boundaries. McKelvey Engineering has top-ranked research and graduate programs across departments, particularly in biomedical engineering, environmental engineering and computing, and has one of the most selective undergraduate programs in the country. With 165 full-time faculty, 1,524 undergraduate students, 1,554 graduate students and 22,000 living alumni, we are working to solve some of society’s greatest challenges; to prepare students to become leaders and innovate throughout their careers; and to be a catalyst of economic development for the St. Louis region and beyond.