Protessa: A New Method for Secondary Structure Assignment Based on Topology




Combs, Patrick Ford




Secondary structure assignment (SSA) is the classification of each residue in a protein structure as helix, strand, or coil. In this work, a new method for SSA is developed. Secondary structure (SS) is vital for stabilizing the overall structure and function of a protein; therefore, it plays a significant role in protein classification schemes, homology modeling, and structure comparison. It is also used to train secondary structure prediction methods, which try to determine secondary structure from the amino acid sequence alone. The task of SSA is difficult because helices and strands in proteins rarely conform to their theoretical ideals. Most existing SSA methods rely on parameters, such as hydrogen-bond patterns or inter-atomic distances, with arbitrary cutoffs. As a result, different SSA methods generate substantially differing assignments. ProTeSSA (Protein Tessellation-based Secondary Structure Assignment) is a new method that does not require parameters. It is based on the Delaunay tessellation (DT) of a protein’s C-alpha coordinates (CAC). The DT of a protein is a simplicial complex in which each residue is a member of a set of simplices, or tetrahedra, each forming a group of four natural neighbors. This topological data is mined to generate a descriptor for each residue, in part using a novel application of persistent homology. Residue-based models were trained and then evaluated on a test set of proteins that was kept separate from the training data. The ProTeSSA models achieved greater than 85% accuracy at the residue level, using the assignments of the protein structure’s author(s) as ground truth, and low misclassification between helices and strands: fewer than one per test protein. A k-means cluster model was also developed and achieved high accuracy. Since the cluster model did not require training with SSAs from other methods, it is purely objective and provides a fascinating counterpoint to current SSA methods.
The success of ProTeSSA indicates the potential to shift from parameter-based methods to an objective and consistent SSA method that relies solely on protein topology rather than parameters and cutoffs that stem from preconceived SS definitions.
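
The tessellation step described in the abstract can be illustrated with a minimal Python sketch. It assumes NumPy and SciPy are available and uses scipy.spatial.Delaunay to tessellate a set of C-alpha coordinates, then collects each residue's natural neighbors (the residues it shares a tetrahedron with). The helper names ideal_helix and natural_neighbors are hypothetical, and the input is a synthetic idealized alpha helix rather than a real structure. This is not the ProTeSSA descriptor itself, which also draws on persistent homology; it only shows the kind of topological data the method starts from.

```python
import numpy as np
from scipy.spatial import Delaunay


def ideal_helix(n_residues, radius=2.3, rise=1.5, turn_deg=100.0):
    """Hypothetical helper: C-alpha trace of an idealized alpha helix.

    Uses textbook values (in angstroms and degrees per residue) purely to
    produce illustrative input; real input would come from a structure file.
    """
    idx = np.arange(n_residues)
    theta = np.deg2rad(turn_deg) * idx
    return np.column_stack(
        (radius * np.cos(theta), radius * np.sin(theta), rise * idx)
    )


def natural_neighbors(ca_coords):
    """Tessellate C-alpha coordinates and list each residue's natural neighbors.

    Returns the array of tetrahedra (rows of four residue indices) and a
    dict mapping each residue index to the set of residues that share at
    least one tetrahedron with it.
    """
    points = np.asarray(ca_coords, dtype=float)
    tess = Delaunay(points)  # 3-D Delaunay tessellation: simplices are tetrahedra
    neighbors = {i: set() for i in range(len(points))}
    for simplex in tess.simplices:
        members = [int(v) for v in simplex]
        for i in members:
            neighbors[i].update(j for j in members if j != i)
    return tess.simplices, neighbors


simplices, neighbors = natural_neighbors(ideal_helix(12))
print("vertices per simplex:", simplices.shape[1])
for residue in range(3):
    print(residue, sorted(neighbors[residue]))
```

No distance or hydrogen-bond cutoff appears anywhere in the sketch: the neighbor sets follow entirely from the geometry of the point set, which is the property the abstract emphasizes.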



Bioinformatics, Delaunay tessellation, Machine learning, Persistent homology, Secondary structure assignment, Topology