Using structural features for inter-helical residue contact prediction in alpha-helical transmembrane proteins
Date
2025
Publisher
University of Delaware
Abstract
Residue contact maps offer a simplified 2D representation of spatial relationships within 3D protein structures. These maps can serve as a foundation for structural modeling. They can also function as standalone analytical tools, offering insight into protein characteristics and behavior and helping to pinpoint inter-helical binding sites, which are critical for understanding protein folding.

Numerous computational approaches have been developed for protein residue contact prediction, leveraging diverse features derived from sequence analysis, physicochemical properties, and evolutionary information. However, these methods have predominantly treated contact maps as a stepping stone towards 3D structure prediction and have therefore restricted themselves to sequence-based features. In contrast, in this work we incorporated structural information into residue contact prediction, focusing only on inter-helical regions in transmembrane proteins.

First, structural information surrounding a residue pair of interest was extracted and used to predict whether the pair forms a contact, while ensuring that no structural information specific to the residue pair itself was included. Features such as relative distances and angles were derived from the residue pair’s neighborhood and used to train a classifier. The proposed method was benchmarked against a state-of-the-art approach that relies solely on non-structural information. Experiments on held-out datasets demonstrated that our method achieves over 90% precision for the top L/2 and top L inter-helical contacts, significantly surpassing the comparison method. This performance might be an upper bound on what is achievable when only non-structural data is used. Additionally, the robustness of the model was tested by introducing Gaussian noise into PDB coordinates.
The results indicated that the model maintains strong performance even under high levels of coordinate noise, which propagated into the derived features, underscoring its reliability.

Then, given AlphaFold2’s strong performance in predicting 3D protein structures, we investigated how effectively AlphaFold2-predicted structures identify residue contacts, and whether structural features derived from predicted structures can further enhance residue contact prediction. On a well-known benchmark dataset, contact prediction using AlphaFold2’s structures, limited to inter-helical residue pairs, achieved an average precision of 83%, surpassing a state-of-the-art comparison model that relies solely on sequence-based features. We then developed a new procedure to extract features from a residue pair’s structural neighborhood, postulating that such structural features would improve contact prediction if the predicted 3D structure is reasonably accurate. Training on experimentally determined structures allowed the model to leverage knowledge from high-quality data. When tested on AlphaFold2-derived features, this approach significantly improved performance, with about 91.9% average precision on a held-out dataset and at least 89.5% average precision in all cross-validation experiments. These results emphasize the potential of integrating structural insights to enhance residue contact prediction.

Finally, we explored the idea that crowdedness around a target residue pair influences whether or not the pair forms a contact. We developed two measures of crowdedness in a residue’s 3D neighborhood: bin counts, defined in terms of relative residue distance, and residue contact number for inter-helical TM proteins, defined as the number of residues within a specified relative distance.
Since unsupervised language models such as MSA Transformer, trained on millions of sequences, are very accurate but also complementary to our approach, we combined the MSA Transformer score with our proposed features to assess the impact of crowdedness on residue contact prediction. We found that crowdedness measures can in fact increase the upper-bound performance by at least 7.65% average precision in cross-validation experiments and by at least 11.59% average precision in held-out experiments. Further, we developed a method to “transfer” this information when ground-truth crowdedness measures are unavailable. Our approach outperformed MSA Transformer by at least 1.15% average precision in cross-validation experiments and 1.85% average precision in held-out experiments.
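The quantities named in the abstract (inter-helical contacts, top-k precision, a contact-number style crowdedness measure, and Gaussian coordinate noise) can be illustrated with a minimal sketch. This is not the thesis code. It assumes one representative coordinate per residue, an 8 Å distance cutoff (a common convention, not stated in the abstract), and hypothetical function names and toy data; in practice `k` would be set to L/2 or L as defined in the thesis.

```python
import numpy as np


def pairwise_distances(coords):
    """All-against-all Euclidean distances for an (N, 3) coordinate array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def inter_helical_mask(helix_ids):
    """True where both residues lie in helices (id >= 0) and the helices differ."""
    h = np.asarray(helix_ids)
    in_helix = h >= 0
    return in_helix[:, None] & in_helix[None, :] & (h[:, None] != h[None, :])


def top_k_precision(scores, contacts, mask, k):
    """Fraction of true contacts among the k highest-scoring masked pairs (i < j)."""
    i, j = np.triu_indices(scores.shape[0], k=1)
    keep = mask[i, j]
    i, j = i[keep], j[keep]
    order = np.argsort(-scores[i, j])[:k]
    return float(contacts[i[order], j[order]].mean())


def contact_number(contacts, mask):
    """Per-residue count of inter-helical residues within the cutoff."""
    return (contacts & mask).sum(axis=1)


def add_gaussian_noise(coords, sigma, seed=0):
    """Perturb every coordinate with zero-mean Gaussian noise of std. dev. sigma."""
    rng = np.random.default_rng(seed)
    return coords + rng.normal(0.0, sigma, size=coords.shape)


# Toy data (hypothetical): two three-residue "helices".
coords = np.array([
    [0.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.0, 0.0, 10.0],    # helix 0
    [6.0, 0.0, 0.0], [6.0, 0.0, 5.0], [20.0, 0.0, 10.0],   # helix 1
])
helix_ids = [0, 0, 0, 1, 1, 1]
CUTOFF = 8.0  # angstroms; assumed, not taken from the abstract

dist = pairwise_distances(coords)
contacts = dist < CUTOFF
mask = inter_helical_mask(helix_ids)

# A "perfect" predictor for illustration: closer pairs get higher scores.
scores = -dist
precision_top5 = top_k_precision(scores, contacts, mask, k=5)
precision_all = top_k_precision(scores, contacts, mask, k=9)
crowdedness = contact_number(contacts, mask)
noisy_coords = add_gaussian_noise(coords, sigma=1.0)
```

Recomputing `contacts` and `crowdedness` from `noisy_coords` gives a simple way to see how coordinate noise propagates into derived features, which is the kind of robustness check the abstract describes.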
Keywords
AlphaFold2, Machine learning, Protein structure prediction, Residue contact prediction, Transmembrane proteins