Decoding the language of intrinsically disordered regions of proteins and their roles in human cancers
Kiersten Ruff and Matthew King in Rohit Pappu’s lab use machine learning to uncover the molecular grammars of disordered protein regions
Every function in a cell is associated with a particular protein or group of proteins, typically in a well-defined three-dimensional structure. However, intrinsically disordered regions of proteins defy this structure-function paradigm. A team of researchers in the McKelvey School of Engineering at Washington University in St. Louis has developed an algorithm to understand how intrinsically disordered regions in proteins might be organized into distinct functional classes.
Kiersten Ruff, staff research scientist, and Matthew King, a postdoctoral research associate, both work in the lab of Rohit V. Pappu, the Gene K. Beare Distinguished Professor in the Department of Biomedical Engineering. They used NARDINI+, an algorithm previously developed in the Pappu lab, to analyze the amino acid sequences of intrinsically disordered regions (IDRs) of proteins and to uncover and organize so-called molecular grammars. Using unsupervised learning, which detects non-random amino acid usage patterns and non-random arrangements of amino acids along linear sequences, Ruff discovered that the grammars of naturally occurring sequences fall into a finite number of clusters, each with a specific set of functions. The effort led to the creation of a resource they refer to as GIN, for Grammars Inferred using NARDINI+.
Results of the research were published in Cell Nov. 12, 2025.
Pappu, who has studied IDRs in his lab for nearly two decades, said IDRs challenge conventional thinking and conventional approaches.
“If a protein region does not adopt a specific structure, then figuring out the functions it can perform and how it performs these functions becomes challenging,” he said.
His lab has leveraged physical principles aimed at decoding the information written into the amino acid sequences of IDRs to determine if there was sequence specificity to the types of conformations that particular IDRs adopt. Their work revealed that IDRs can adopt specific types of structures and perform specific types of functions, governed by the amino acid compositions, or the alphabet, and the linear arrangement of specific pairs of amino acid types, or the syntax, of IDR sequences.
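The distinction between alphabet and syntax can be illustrated with a minimal Python sketch. The two sequences, the function names, and the choice of lysine (K) and glutamate (E) as the residue pair are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from collections import Counter

def composition(seq):
    """Fraction of each amino acid type in the sequence (the 'alphabet')."""
    total = len(seq)
    return {aa: n / total for aa, n in Counter(seq).items()}

def adjacent_pair_fraction(seq, group_a, group_b):
    """Fraction of neighboring positions that pair one residue from group_a
    with one from group_b: a crude, illustrative stand-in for 'syntax'."""
    pairs = list(zip(seq, seq[1:]))
    hits = sum(
        1
        for x, y in pairs
        if (x in group_a and y in group_b) or (x in group_b and y in group_a)
    )
    return hits / len(pairs)

# Hypothetical sequences: same composition, different linear arrangement.
mixed = "KEKEKEKEKE"
blocky = "KKKKKEEEEE"

print(composition(mixed) == composition(blocky))      # same alphabet usage
print(adjacent_pair_fraction(mixed, {"K"}, {"E"}))    # well mixed
print(adjacent_pair_fraction(blocky, {"K"}, {"E"}))   # segregated into blocks
```

Both sequences use the same alphabet in the same proportions, yet the arrangement of the two residue types along the chain differs sharply, which is the kind of difference the syntax is meant to capture.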
Former doctoral student Megan (Cohan) White, former postdoctoral researcher Min Kyung Shinn, and Pappu introduced the NARDINI algorithm in 2022. It takes an IDR sequence as input and assesses whether the different syntaxes present within that sequence are non-random. Their work showed that key binary patterns were non-random, and that IDRs associated with similar functions shared similar non-random binary patterns in their sequences.
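One common way to ask whether a pattern is non-random is to compare the observed sequence against shuffled versions of itself, which keep the composition fixed and scramble the arrangement. The Python sketch below shows that general idea only. It is not the NARDINI implementation: the mixing metric, the residue groups, the number of shuffles, and the function names are all assumptions made for illustration, and the article does not describe the reference ensemble NARDINI uses.

```python
import random
import statistics

POSITIVE = set("KR")  # assumed grouping of positively charged residues
NEGATIVE = set("DE")  # assumed grouping of negatively charged residues

def mixing_score(seq, group_a, group_b):
    """Fraction of neighboring positions pairing group_a with group_b
    (a simplified, hypothetical patterning metric)."""
    pairs = list(zip(seq, seq[1:]))
    hits = sum(
        1
        for x, y in pairs
        if (x in group_a and y in group_b) or (x in group_b and y in group_a)
    )
    return hits / len(pairs)

def shuffle_z_score(seq, group_a, group_b, n_shuffles=1000, seed=0):
    """Z-score of the observed metric against composition-preserving shuffles.
    Large positive or negative values suggest a non-random arrangement."""
    rng = random.Random(seed)
    observed = mixing_score(seq, group_a, group_b)
    residues = list(seq)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition, scrambled order
        null_scores.append(mixing_score(residues, group_a, group_b))
    spread = statistics.pstdev(null_scores)
    if spread == 0:
        return 0.0  # every shuffle scores the same, so nothing to compare
    return (observed - statistics.fmean(null_scores)) / spread

# Hypothetical test sequences with identical composition.
z_blocky = shuffle_z_score("KKKKKKKKEEEEEEEE", POSITIVE, NEGATIVE)
z_mixed = shuffle_z_score("KEKEKEKEKEKEKEKE", POSITIVE, NEGATIVE)
print(z_blocky, z_mixed)
```

In this sketch the block-segregated sequence scores well below its shuffled counterparts and the strictly alternating one well above, so both would be flagged as non-random, in opposite directions.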
That led Ruff to develop NARDINI+ as part of a collaboration with the lab of Cigall Kadoch, associate professor of pediatric oncology at the Dana-Farber Cancer Institute and Harvard Medical School and an investigator at the Howard Hughes Medical Institute. Ruff broadened the usage and scope of NARDINI+ with the intent to analyze all IDR sequences in the human proteome. In doing so, she asked whether a finite set of grammars is in use across the human proteome.
Using unsupervised machine learning, Ruff uncovered a finite number of grammars referred to as GIN clusters. The analysis showed that specific GIN clusters were associated with determining the localization preferences of proteins in cells. They also help explain the functional organization at the molecular level and temporal ordering of key molecular processes such as the production of ribosomes, which are the protein translation machines. Through collaborations with Kadoch and her students, the team accessed large-scale data generated by the Broad Institute to show that previously identified correlations between pairs of genes in cancer cells could be explained as being correlations between IDR grammars.
King, who works jointly in the Pappu and Kadoch labs, tested the accuracy of inferences that GIN generated regarding the localization preferences underwritten by IDR grammars.
“Having an IDR of a specific grammar seems to be a key determinant, although not the only determinant, of the preferred subcellular location of a protein,” King said.
“An important insight that has emerged from the creation of GIN is the realization that gene translocations that cause specific human cancers are the result of IDR grammars being upended by mutations,” Ruff said. “The prediction is that these altered grammars lead to a rewiring of specific interaction networks, which we can now identify, that activates cellular proliferation programs.”
GIN promises to be a useful resource to guide new research aimed at uncovering new information about IDRs, Pappu said. It also has the potential to enable the design of synthetic IDRs that can perform bespoke functions. Going forward, Pappu and his colleagues are working closely with Kadoch’s lab to design studies that will help them understand how altered IDR grammars drive proliferative programs in human cancers.
Further investigations into GIN were recently funded by the ASPIRE program sponsored by the Mark Foundation for Cancer Research, which has seeded a three-way collaboration among the labs of Denes Hnisz, research group leader at the Max Planck Institute for Molecular Genetics, Kadoch and Pappu.
Ruff KM, King MR, Ying AW, Liu V, Pant A, Lieberman WE, Shinn MK, Su X, Kadoch C, Pappu RV. Molecular grammars of predicted intrinsically disordered regions that span the human proteome. Cell, Nov. 12, 2025. DOI: https://doi.org/10.1016/j.cell.2025.10.019
This research was supported with funding from the Air Force Office of Scientific Research (FA9550-20-1-0241); the St. Jude Research Collaborative on the Biology and Biophysics of RNP Granules; the National Science Foundation (MCB-2227268); the National Institutes of Health (F32GM146418-01A1, K99GM152778, R01NS121114); and the Howard Hughes Medical Institute.
Code and data that provide access to GIN and three Google Colab notebooks associated with the study are available at https://github.com/kierstenruff/RUFF_KING_Grammars_of_IDRs_using_NARDINI-.