To examine the possible relationship of guanine-dependent GpA conformations with ribonucleotide cleavage, two potential of mean force (PMF) calculations were performed in aqueous solution. In the first calculation, the guanosine glycosidic (Gchi) angle was used as the reaction coordinate, and computations were performed on two GpA ionic species: protonated (neutral) or deprotonated (negatively charged) guanosine ribose O2'. Similar energetic profiles featuring two minima corresponding to the anti and syn Gchi regions were obtained for both ionic forms. For both forms the anti conformation was more stable than the syn, and barriers of approximately 4 kcal/mol were obtained for the anti-to-syn transition. Structural analysis showed a remarkable sensitivity of the phosphate moiety to the conformation of the Gchi angle, suggesting a possible connection between this conformation and the mechanism of ribonucleotide cleavage. This hypothesis was confirmed by the second PMF calculation, for which the O2'-P distance for the deprotonated GpA was used as the reaction coordinate. The computations were performed from two selected starting points: the anti and syn minima determined in the first PMF study of the deprotonated guanosine ribose O2'. The simulations revealed that the O2' attack along the syn Gchi was more favorable than that along the anti Gchi: energetically, significantly lower barriers were obtained in the syn than in the anti conformation for the O-P bond formation; structurally, a shorter initial O2'-P distance and a better-suited orientation for an in-line attack were observed in the syn relative to the anti conformation. 
These results are consistent with the catalytically competent conformation of the barnase-ribonucleotide complex, which requires a guanine syn conformation of the substrate to enable abstraction of the ribose H2' proton by the general base Glu73, thereby suggesting a coupling between the reactive substrate conformation and enzyme structure and mechanism. (c) 2007 Wiley-Liss, Inc.
The performance of methods for predicting protein-protein interactions at the atomic scale is assessed by evaluating blind predictions performed during 2005-2007 as part of Rounds 6-12 of the community-wide experiment on Critical Assessment of PRedicted Interactions (CAPRI). These Rounds also included a new scoring experiment, where a larger set of models contributed by the predictors was made available to groups developing scoring functions. These groups scored the uploaded set and submitted their own best models for assessment. The structures of nine protein complexes including one homodimer were used as targets. These targets represent biologically relevant interactions involved in gene expression, signal transduction, RNA, or protein processing and membrane maintenance. For all the targets except one, predictions started from the experimentally determined structures of the free (unbound) components or from models derived by homology, making it mandatory for docking methods to model the conformational changes that often accompany association. In total, 63 groups and eight automatic servers, a substantial increase from previous years, submitted docking predictions, of which 1994 were evaluated here. Fifteen groups submitted 305 models for five targets in the scoring experiment. Assessment of the predictions reveals that 31 different groups produced models of acceptable and medium accuracy (but only one high-accuracy submission) for all the targets, except the homodimer. In the latter, none of the docking procedures reproduced the large conformational adjustment required for correct assembly, underscoring yet again that handling protein flexibility remains a major challenge. In the scoring experiment, a large fraction of the groups attained the set goal of singling out the correct association modes from incorrect solutions in the limited ensembles of contributed models. 
But in general they seemed unable to identify the best models, indicating that current scoring methods are probably not sensitive enough. With the increased focus on protein assemblies, in particular by structural genomics efforts, the growing community of CAPRI predictors is engaged more actively than ever in the development of better scoring functions and means of modeling conformational flexibility, which hold promise for much progress in the future. (c) 2007 Wiley-Liss, Inc.
BACKGROUND: In structural genomics, an important goal is the detection and classification of protein-protein interactions, given the structures of the interacting partners. We have developed empirical energy functions to identify native structures of protein-protein complexes among sets of decoy structures. To understand the role of amino acid diversity, we parameterized a series of functions, using a hierarchy of amino acid alphabets of increasing complexity, with 2, 3, 4, 6, and 20 amino acid groups. Compared to previous work, we used the simplest possible functional form, with residue-residue interactions and a stepwise distance-dependence. We used increased computational resources, however, constructing 290,000 decoys for 219 protein-protein complexes, with a realistic docking protocol where the protein partners are flexible and interact through a molecular mechanics energy function. The energy parameters were optimized to correctly assign as many native complexes as possible. To resolve the multiple minimum problem in parameter space, over 64000 starting parameter guesses were tried for each energy function. The optimized functions were tested by cross validation on subsets of our native and decoy structures, by blind tests on series of native and decoy structures available on the Web, and on models for 13 complexes submitted to the CAPRI structure prediction experiment. RESULTS: Performance is similar to several other statistical potentials of the same complexity. For example, the CAPRI target structure is correctly ranked ahead of 90% of its decoys in 6 cases out of 13. The hierarchy of amino acid alphabets leads to a coherent hierarchy of energy functions, with qualitatively similar parameters for similar amino acid types at all levels. Most remarkably, the performance with six amino acid classes is equivalent to that of the most detailed, 20-class energy function. 
CONCLUSION: This suggests that six carefully chosen amino acid classes are sufficient to encode specificity in protein-protein interactions, and provide a starting point to develop more complicated energy functions.
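The functional form described in this abstract (residue-residue interactions over a reduced amino acid alphabet, with a stepwise distance dependence) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration only: the two-class alphabet, the single 6.0 Angstrom step, the energy values, and the toy contact lists are hypothetical and are not the authors' optimized parameters.

```python
# Hypothetical two-class alphabet: hydrophobic (H) versus polar (P).
ALPHABET = {
    "ALA": "H", "VAL": "H", "LEU": "H", "ILE": "H",
    "SER": "P", "LYS": "P", "ASP": "P",
}

# Illustrative class-pair energies (arbitrary units); NOT fitted parameters.
ENERGY = {("H", "H"): -1.0, ("H", "P"): 0.2, ("P", "P"): -0.1}

CUTOFF = 6.0  # assumed single distance step, in Angstrom


def pair_energy(res_a, res_b, distance):
    """Stepwise term: a constant inside the cutoff, zero outside it."""
    if distance > CUTOFF:
        return 0.0
    key = tuple(sorted((ALPHABET[res_a], ALPHABET[res_b])))
    return ENERGY[key]


def complex_energy(contacts):
    """Sum the pair terms over (residue, residue, distance) interface contacts."""
    return sum(pair_energy(a, b, d) for a, b, d in contacts)


def fraction_of_decoys_beaten(native, decoys):
    """Fraction of decoys whose energy is higher (worse) than the native's."""
    native_e = complex_energy(native)
    beaten = sum(1 for decoy in decoys if complex_energy(decoy) > native_e)
    return beaten / len(decoys)


# Toy data: one native interface and three decoys.
native = [("LEU", "VAL", 4.5), ("LYS", "ASP", 5.0), ("LEU", "SER", 7.0)]
decoys = [
    [("LEU", "LYS", 4.0), ("VAL", "SER", 5.5)],
    [("LEU", "VAL", 8.0)],
    [("LEU", "ILE", 4.0), ("VAL", "ALA", 5.0)],
]

native_energy = complex_energy(native)
fraction = fraction_of_decoys_beaten(native, decoys)
print(native_energy, fraction)
```

The ranking helper mirrors the kind of statistic quoted in the abstract (a native structure ranked ahead of a given percentage of its decoys); the optimization of the energy parameters themselves is not sketched here.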
CAPRI is a community-wide experiment to test protein-protein docking methods in blind predictions. The Toronto meeting assessed structure predictions made from 2005-2007 on nine target protein-protein complexes or homodimers, and reported new developments in functions used to score predicted interactions, in treatment of conformational flexibility, and in taking nonstructural information into account in the predictions.
BACKGROUND: Most methods for predicting functional sites in protein 3D structures rely on information on related proteins and cannot be applied to proteins with no known relatives. Another limitation of these methods is the lack of a well-annotated set of functional sites to use as a benchmark for validating their predictions. Experimental findings and theoretical considerations suggest that residues involved in function often contribute unfavorably to the native state stability. We examine the possibility of systematically exploiting this intrinsic property to identify functional sites using an original procedure that detects destabilizing regions in protein structures. In addition, to relate destabilizing regions to known functional sites, a novel benchmark consisting of a diverse set of hand-curated protein functional sites is derived. RESULTS: A procedure for detecting clusters of destabilizing residues in protein structures is presented. Individual residue contributions to protein stability are evaluated using detailed atomic models and a force-field successfully applied in computational protein design. The most destabilizing residues, and some of their closest neighbours, are clustered into destabilizing regions following a rigorous protocol. Our procedure is applied to high quality apo-structures of 63 unrelated proteins. The biologically relevant binding sites of these proteins were annotated using all available information, including structural data and literature curation, resulting in the largest hand-curated data set of binding sites in proteins available to date. Comparing the destabilizing regions with the annotated binding sites in these proteins, we find that the overlap is on average limited, but significantly better than random. Results depend on the type of bound ligand. Significant overlap is obtained for most polysaccharide- and small ligand-binding sites, whereas no overlap is observed for most nucleic acid binding sites. 
These differences are rationalised in terms of the geometry and energetics of the binding site. CONCLUSION: We find that although destabilizing regions as detected here can in general not be used to predict binding sites in protein structures, they can provide useful information, particularly on the location of functional sites that bind polysaccharides and small ligands. This information can be exploited in methods for predicting function in protein structures with no known relatives. Our publicly available benchmark of hand-curated functional sites in proteins should help other workers derive and validate new prediction methods.
Reliable information on the physical and functional interactions between the gene products is an important prerequisite for deriving meaningful system-level descriptions of cellular processes. The available information about protein interactions in Saccharomyces cerevisiae has been vastly increased recently by two comprehensive tandem affinity purification/mass spectrometry (TAP/MS) studies. However, using somewhat different approaches, these studies produced diverging descriptions of the yeast interactome, clearly illustrating the fact that converting the purification data into accurate sets of protein-protein interactions and complexes remains a major challenge. Here, we review the major analytical steps involved in this process, with special focus on the task of deriving complexes from the network of binary interactions. Applying the Markov Cluster procedure to an alternative yeast interaction network, recently derived by combining the data from the two latest TAP/MS studies, we produce a new description of yeast protein complexes. Several objective criteria suggest that this new description is more accurate and meaningful than those previously published. The same criteria are also used to gauge the influence that different methods for deriving binary interactions and complexes may have on the results. Lastly, it is shown that employing identical procedures to process the latest purification data sets significantly improves the convergence between the resulting interactome descriptions.
An approach is presented for computing meaningful pathways in the network of small molecule metabolism comprising the chemical reactions characterized in all organisms. The metabolic network is described as a weighted graph in which all the compounds are included, but each compound is assigned a weight equal to the number of reactions in which it participates. Path finding is performed in this graph by searching for one or more paths with lowest weight. Performance is evaluated systematically by computing paths between the first and last reactions in annotated metabolic pathways, and comparing the intermediate reactions in the computed pathways to those in the annotated ones. For the sake of comparison, paths are also computed in the unweighted raw (all compounds and reactions) and filtered (highly connected pool metabolites removed) metabolic graphs. The correspondence between the computed and annotated pathways is very poor (<30%) in the raw graph, increases to approximately 65% in the filtered graph, and reaches approximately 85% in the weighted graph. Considering the best-matching path among the five lightest paths increases the correspondence to 92%, on average. We then show that the average distance between pairs of metabolites is significantly larger in the weighted graph than in the raw unfiltered graph, suggesting that the small-world properties previously reported for metabolic networks probably result from irrelevant shortcuts through pool metabolites. In addition, we provide evidence that the length of the shortest path in the weighted graph represents a valid measure of the "metabolic distance" between enzymes. We suggest that the success of our simplistic approach is rooted in the high degree of specificity of the reactions in metabolic pathways, presumably reflecting thermodynamic constraints operating in these pathways. We expect our approach to find useful applications in inferring metabolic pathways in newly sequenced genomes.
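The weighting idea in this abstract can be sketched as a lightest-path search over a compound/reaction graph in which each compound weighs as much as the number of reactions it takes part in. This is a minimal illustration, not the authors' implementation: the undirected treatment of reactions, the weight of 1 assigned to reaction nodes, and the toy reactions R1 to R7 are all assumptions.

```python
import heapq


def build_graph(reactions):
    """Undirected bipartite graph linking each reaction to its compounds."""
    adj = {}
    for rxn, (substrates, products) in reactions.items():
        for compound in list(substrates) + list(products):
            adj.setdefault(rxn, set()).add(compound)
            adj.setdefault(compound, set()).add(rxn)
    return adj


def node_weights(adj, reactions, reaction_weight=1):
    """Compound weight = number of reactions it participates in.
    The reaction weight of 1 is an assumption of this sketch."""
    return {
        node: (reaction_weight if node in reactions else len(neighbours))
        for node, neighbours in adj.items()
    }


def lightest_path(adj, weights, source, target):
    """Dijkstra search where a path costs the sum of its node weights."""
    best = {source: weights[source]}
    heap = [(weights[source], source, [source])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt in sorted(adj[node]):
            new_cost = cost + weights[nxt]
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return None


# Toy network: ATP and ADP act as pool metabolites shared by six reactions.
reactions = {
    "R1": (["A", "ATP"], ["B", "ADP"]),
    "R2": (["B"], ["C"]),
    "R3": (["C", "ATP"], ["D", "ADP"]),
    "R4": (["E", "ATP"], ["F", "ADP"]),
    "R5": (["G", "ATP"], ["H", "ADP"]),
    "R6": (["I", "ATP"], ["J", "ADP"]),
    "R7": (["K", "ATP"], ["L", "ADP"]),
}

adj = build_graph(reactions)
weights = node_weights(adj, reactions)
cost, path = lightest_path(adj, weights, "R1", "R3")
print(cost, path)
```

In this toy network an unweighted search would jump from R1 to R3 through ATP, whereas the weighted search prefers the route through B, R2 and C because ATP carries a weight of 6.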
Identification of protein-protein interactions often provides insight into protein function, and many cellular processes are performed by stable protein complexes. We used tandem affinity purification to process 4,562 different tagged proteins of the yeast Saccharomyces cerevisiae. Each preparation was analysed by both matrix-assisted laser desorption/ionization-time of flight mass spectrometry and liquid chromatography tandem mass spectrometry to increase coverage and accuracy. Machine learning was used to integrate the mass spectrometry scores and assign probabilities to the protein-protein interactions. Among 4,087 different proteins identified with high confidence by mass spectrometry from 2,357 successful purifications, our core data set (median precision of 0.69) comprises 7,123 protein-protein interactions involving 2,708 proteins. A Markov clustering algorithm organized these interactions into 547 protein complexes averaging 4.9 subunits per complex, about half of them absent from the MIPS database, as well as 429 additional interactions between pairs of complexes. The data (all of which are available online) will help future studies on individual proteins as well as functional genomics and systems biology.
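The Markov clustering step mentioned above alternates two operations on a column-stochastic matrix built from the interaction network: expansion (matrix squaring) and inflation (element-wise powering followed by column renormalization). The following pure-Python sketch illustrates that general scheme on a toy network; it is not the software used in the study, and the inflation value, iteration count, unit edge weights, self-loops, and protein labels A to F are assumptions.

```python
def normalize_columns(m):
    """Rescale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def multiply(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def markov_cluster(labels, edges, inflation=2.0, iterations=50):
    """Cluster an undirected, unweighted graph by expansion and inflation."""
    index = {name: i for i, name in enumerate(labels)}
    n = len(labels)
    # Start from the adjacency matrix with self-loops added.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        m[index[a]][index[b]] = 1.0
        m[index[b]][index[a]] = 1.0
    m = normalize_columns(m)
    for _ in range(iterations):
        m = multiply(m, m)  # expansion: let flow spread along longer walks
        m = normalize_columns(  # inflation: strengthen strong flow
            [[value ** inflation for value in row] for row in m]
        )
    # Each row that retains flow is an attractor; its nonzero columns
    # are the members of the corresponding cluster.
    clusters = set()
    for row in m:
        members = frozenset(labels[j] for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Toy interaction network: two triangles joined by a single edge (C-D).
labels = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("A", "C"), ("B", "C"),
         ("D", "E"), ("D", "F"), ("E", "F"),
         ("C", "D")]

clusters = markov_cluster(labels, edges)
print(clusters)
```

Raising the inflation parameter generally yields finer clusters; real interaction networks would also carry edge weights (for example interaction probabilities), which this sketch omits.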
Genetic analysis of a large Indian family with an autosomal dominant cataract phenotype allowed us to identify a novel cataract gene, CRYBA4. After a genomewide screen, linkage analysis identified a maximum LOD score of 3.20 (recombination fraction [theta] 0.001) with marker D22S1167 of the beta-crystallin gene cluster on chromosome 22. To date, CRYBA4 was the only gene in this cluster not associated with either human or murine cataracts. A pathogenic mutation was identified in exon 4 that segregated with the disease status. The c.317T>C sequence change is predicted to replace the highly conserved hydrophobic amino acid phenylalanine94 with the hydrophilic amino acid serine. Modeling suggests that this substitution would significantly reduce the intrinsic stability of the crystallin monomer, which would impair its ability to form the association modes critical for lens transparency. Considering that CRYBA4 associates with CRYBB2 and that the latter protein has been implicated in microphthalmia, mutational analysis of CRYBA4 was performed in 32 patients affected with microphthalmia (small eye). We identified a c.242T>C (Leu69Pro) sequence change in exon 4 in one patient, which is predicted here to disrupt the beta-sheet structure in CRYBA4. Protein folding would consequently be impaired, most probably leading to a structure with reduced stability in the mutant. This is the first report linking mutations in CRYBA4 to cataractogenesis and microphthalmia.
A comprehensive study is performed on the condition-dependent expression of genes coding for the components of hand-curated multi-protein complexes of the yeast Saccharomyces cerevisiae, in order to identify coherent transcriptional modules within these complexes. Such modules are defined as groups of genes within complexes whose expression profiles under a common set of experimental conditions allow us to discriminate them from random sets of genes. Our analysis reveals that complexes such as the cytoplasmic ribosome, the proteasome and the respiration chain complexes previously characterized as "stable" or "permanent" represent transcriptional modules that are coherently up- or down-regulated in many different conditions. Overall, however, some level of coherent expression is detected in only 71 of the 113 complexes with at least five different protein components that could be reliably analyzed. Of these, 26 behave as coherently expressed transcriptional modules encompassing all the components of the complex. In another 15, at least half of the components make up such modules and in ten, few or no modules are detected. In an additional 20 complexes coherent expression is detected, but in too few conditions to enable reliable module detection. Interestingly, the transcriptional modules, when detected, often correspond to one or more known sub-complexes with specific functions. Furthermore, detected modules are generally consistent with transcriptional modules identified on the basis of predicted cis-regulatory sequence motifs. Also, groups of genes shared between complexes that carry out related functions tend to be part of overlapping transcriptional modules identified in these complexes. Together these findings suggest that transcriptional modules may represent basic functional and evolutionary building blocks of protein complexes.
Our knowledge of metabolism can be represented as a network comprising several thousands of nodes (compounds and reactions). Several groups applied graph theory to analyse the topological properties of this network and to infer metabolic pathways by path finding. This is, however, not straightforward, with a major problem caused by traversing irrelevant shortcuts through highly connected nodes, which correspond to pool metabolites and co-factors (e.g. H2O, NADP and H+). In this study, we present a web server implementing two simple approaches, which circumvent this problem, thereby improving the relevance of the inferred pathways. In the simplest approach, the shortest path is computed, while filtering out a selection of highly connected compounds. In the second approach, the shortest path is computed on the weighted metabolic graph where each compound is assigned a weight equal to its connectivity in the network. This approach significantly increases the accuracy of the inferred pathways, enabling the correct inference of relatively long pathways (e.g. with as many as eight intermediate reactions). Available options include the calculation of the k-shortest paths between two specified seed nodes (either compounds or reactions). Multiple requests can be submitted in a queue. Results are returned by email, in textual as well as graphical formats (available at http://www.scmbb.ulb.ac.be/Path Finding/).
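The first of the two approaches (shortest path after filtering out highly connected compounds) can be sketched with a breadth-first search that skips excluded nodes. This is only an illustration under stated assumptions: the degree threshold used to pick the excluded compounds, the helper names, and the toy network are hypothetical, and the server's actual list of filtered compounds is not reproduced here.

```python
from collections import deque


def highly_connected(adj, compounds, max_degree):
    """Compounds taking part in more than max_degree reactions."""
    return {c for c in compounds if len(adj[c]) > max_degree}


def shortest_path(adj, source, target, excluded=frozenset()):
    """Breadth-first search that never steps onto an excluded node."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in sorted(adj[node]):
            if nxt not in seen and nxt not in excluded:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Toy network: H2O is shared by three reactions and acts as a shortcut.
edges = [
    ("R1", "A"), ("R1", "B"), ("R1", "H2O"),
    ("R2", "B"), ("R2", "C"),
    ("R3", "C"), ("R3", "D"), ("R3", "H2O"),
    ("R4", "E"), ("R4", "H2O"),
]
adj = {}
for reaction, compound in edges:
    adj.setdefault(reaction, set()).add(compound)
    adj.setdefault(compound, set()).add(reaction)

compounds = {"A", "B", "C", "D", "E", "H2O"}
excluded = highly_connected(adj, compounds, max_degree=2)

raw_path = shortest_path(adj, "R1", "R3")
filtered_path = shortest_path(adj, "R1", "R3", excluded)
print(raw_path, filtered_path)
```

In the raw graph the search takes the shortcut through H2O, while the filtered search recovers the route through the intermediate reaction R2.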
The current status of docking procedures for predicting protein-protein interactions starting from their three-dimensional (3D) structure is reassessed by evaluating blind predictions, performed during 2003-2004 as part of Rounds 3-5 of the community-wide experiment on Critical Assessment of PRedicted Interactions (CAPRI). Ten newly determined structures of protein-protein complexes were used as targets for these rounds. They comprised 2 enzyme-inhibitor complexes, 2 antigen-antibody complexes, 2 complexes involved in cellular signaling, 2 homo-oligomers, and a complex between 2 components of the bacterial cellulosome. For most targets, the predictors were given the experimental structures of 1 unbound and 1 bound component, with the latter in a random orientation. For some, the structure of the free component was derived from that of a related protein, requiring the use of homology modeling. In some of the targets, significant differences in conformation were displayed between the bound and unbound components, representing a major challenge for the docking procedures. For 1 target, predictions could not go to completion. In total, 1866 predictions submitted by 30 groups were evaluated. Over one-third of these groups applied completely novel docking algorithms and scoring functions, with several of them specifically addressing the challenge of dealing with side-chain and backbone flexibility. The quality of the predicted interactions was evaluated by comparison to the experimental structures of the targets, made available for the evaluation, using the well-agreed-upon criteria used previously. Twenty-four groups, which for the first time included an automatic Web server, produced predictions ranging from acceptable to highly accurate for all targets, including those where the structures of the bound and unbound forms differed substantially. 
These results and a brief survey of the methods used by participants of CAPRI Rounds 3-5 suggest that genuine progress in the performance of docking methods is being achieved, with CAPRI acting as the catalyst.
The Comprehensive Yeast Genome Database (CYGD) compiles a comprehensive data resource for information on the cellular functions of the yeast Saccharomyces cerevisiae and related species, chosen as the best understood model organism for eukaryotes. The database serves as a common resource generated by a European consortium, going beyond the provision of sequence information and functional annotations on individual genes and proteins. In addition, it provides information on the physical and functional interactions among proteins as well as other genetic elements. These cellular networks include metabolic and regulatory pathways, signal transduction and transport processes as well as co-regulated gene clusters. As more yeast genomes are published, their annotation becomes greatly facilitated using S. cerevisiae as a reference. CYGD provides a way of exploring related genomes with the aid of the S. cerevisiae genome as a backbone and SIMAP, the Similarity Matrix of Proteins. The comprehensive resource is available under http://mips.gsf.de/genre/proj/yeast/.
Increasingly complex schemes for representing solvent effects in an implicit fashion are being used in computational analyses of biological macromolecules. These schemes speed up the calculations by orders of magnitude and are assumed to compromise little on essential features of the solvation phenomenon. In this work we examine this assumption. Five implicit solvation models, a surface area-based empirical model, two models that approximate the generalized Born treatment and a finite difference Poisson-Boltzmann method are challenged in situations differing from those where these models were calibrated. These situations are encountered in automatic protein design procedures, whose job is to select sequences, which stabilize a given protein 3D structure, from a large number of alternatives. To this end we evaluate the energetic cost of burying amino acids in thousands of environments with different solvent exposures belonging, respectively, to decoys built with random sequences and to native protein crystal structures. In addition we perform actual sequence design calculations. Except for the crudest surface area-based procedure, all the tested models tend to favor the burial of polar amino acids in the protein interior over nonpolar ones, a behavior that leads to poor performance in protein design calculations. We show, on the other hand, that three of the examined models are nonetheless capable of discriminating between the native fold and many nonnative alternatives, a test commonly used to validate force fields. It is concluded that protein design is a particularly challenging test for implicit solvation models because it requires accurate estimates of the solvation contribution of individual residues. This contrasts with native recognition, which depends less on solvation and more on other nonbonded contributions.
Given the increasing interest in protein-protein interactions, the prediction of these interactions from sequence and structural information has become a booming activity. CAPRI, the community-wide experiment for assessing blind predictions of protein-protein interactions, is playing an important role in fostering progress in docking procedures. At the same time, novel methods are being derived for predicting regions of a protein that are likely to interact and for characterizing putative intermolecular contacts from sequence and structural data. Together with docking procedures, these methods provide an integrated computational approach that should be a valuable complement to genome-scale experimental studies of protein-protein interactions.
MOTIVATION: Several pattern discovery methods have been proposed to detect over-represented motifs in upstream sequences of co-regulated genes, and are for example used to predict cis-acting elements from clusters of co-expressed genes. The clusters to be analyzed are often noisy, containing a mixture of co-regulated and non-co-regulated genes. We propose a method to discriminate co-regulated from non-co-regulated genes on the basis of counts of pattern occurrences in their non-coding sequences. METHODS: String-based pattern discovery is combined with discriminant analysis to classify genes on the basis of putative regulatory motifs. RESULTS: The approach is evaluated by comparing the significance of patterns detected in annotated regulons (positive control), random gene selections (negative control) and high-throughput regulons (noisy data) from the yeast Saccharomyces cerevisiae. The classification is evaluated on the annotated regulons, and the robustness and rejection power are assessed with mixtures of co-regulated and random genes.
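The input to the classification step described above is a table of pattern occurrence counts per gene. A minimal sketch of how such a table could be built is given below; the gene names, upstream sequences, and the two hexanucleotide patterns are hypothetical, only one strand is scanned, and the discriminant analysis itself is not shown.

```python
def count_occurrences(sequence, pattern):
    """Count occurrences of pattern in sequence, overlaps included."""
    count = 0
    start = 0
    while True:
        position = sequence.find(pattern, start)
        if position == -1:
            return count
        count += 1
        start = position + 1  # step by one so overlapping matches count


def count_table(upstream_sequences, patterns):
    """Map each gene to its vector of pattern counts (one entry per pattern)."""
    return {
        gene: [count_occurrences(sequence, pattern) for pattern in patterns]
        for gene, sequence in upstream_sequences.items()
    }


# Hypothetical upstream regions and candidate regulatory patterns.
patterns = ["CACGTG", "GATAAG"]
upstream_sequences = {
    "geneA": "TTCACGTGAACACGTGTT",
    "geneB": "AAGATAAGCC",
}

table = count_table(upstream_sequences, patterns)
print(table)
```

Each row of the resulting table can then serve as the feature vector of one gene in a classifier that separates co-regulated from non-co-regulated genes.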
BACKGROUND: Multiprotein complexes play an essential role in many cellular processes. But our knowledge of the mechanism of their formation, regulation and lifetimes is very limited. We investigated transcriptional regulation of protein complexes in yeast using two approaches. First, known regulons, manually curated or identified by genome-wide screens, were mapped onto the components of multiprotein complexes. The complexes comprised manually curated ones and those characterized by high-throughput analyses. Second, putative regulatory sequence motifs were identified in the upstream regions of the genes involved in individual complexes and regulons were predicted on the basis of these motifs. RESULTS: Only a very small fraction of the analyzed complexes (5-6%) have subsets of their components mapping onto known regulons. Likewise, regulatory motifs are detected in only about 8-15% of the complexes, and in those, about half of the components are on average part of predicted regulons. In the manually curated complexes, the so-called 'permanent' assemblies have a larger fraction of their components belonging to putative regulons than 'transient' complexes. For the noisier set of complexes identified by high-throughput screens, valuable insights are obtained into the function and regulation of individual genes. CONCLUSIONS: A small fraction of the known multiprotein complexes in yeast seems to have at least a subset of their components co-regulated on the transcriptional level. Preliminary analysis of the regulatory motifs for these components suggests that the corresponding genes are likely to be co-regulated either together or in smaller subgroups, indicating that transcriptionally regulated modules might exist within complexes.
MALECON is a progressive combinatorial procedure for multiple alignments of protein structures. It searches a library of pairwise alignments for all three-protein alignments in which a specified number of residues is consistently aligned. These alignments are progressively expanded to include additional proteins and more spatially equivalent residues, subject to certain criteria. This action involves superimposing the aligned proteins by their hitherto equivalent residues and searching for additional Calpha atoms that lie close in space. The performance of MALECON is illustrated and compared with several extant multiple structure alignment methods by using as test the globin homologous superfamily, the OB and the Jellyrolls folds. MALECON gives better definitions of the common structural features in the structurally more diverse proteins of the OB and Jellyrolls folds, but it yields comparable results for the more similar globins. When no consistent multiple alignments can be derived for all members of a protein group, our procedure is still capable of automatically generating consistent alignments and common core definitions for subgroups of the members. This finding is illustrated for proteins of the OB fold and SH3 domains, believed to share common structural features, and should be very instrumental in homology modeling and investigations of protein evolution. Copyright 2004 Wiley-Liss, Inc.
The aMAZE LightBench (http://www.amaze.ulb.ac.be/) is a web interface to the aMAZE relational database, which contains information on gene expression, catalysed chemical reactions, regulatory interactions, protein assembly, as well as metabolic and signal transduction pathways. It allows the user to browse the information in an intuitive way, which also reflects the underlying data model. Moreover, links are provided to literature references and, whenever appropriate, to external databases.
The ACLAME database (http://aclame.ulb.ac.be) is a collection and classification of prokaryotic mobile genetic elements (MGEs) from various sources, comprising all known phage genomes, plasmids and transposons. In addition to providing information on the full genomes and genetic entities, it aims to build a comprehensive classification of the functional modules of MGEs at the protein, gene and higher levels. This first version contains a comprehensive classification of 5069 proteins from 119 DNA bacteriophages into over 400 functional families. This classification was produced automatically using TRIBE-MCL, a graph-theory-based Markov clustering algorithm that uses sequence measures as input, and then manually curated. Manual curation was aided by consulting annotations available in public databases retrieved through additional sequence similarity searches using Psi-Blast and Hidden Markov Models. The database is publicly accessible and open to expert volunteers willing to participate in its curation. Its web interface allows browsing as well as querying the classification. The main objectives are to collect and organize in a rational way the complexity inherent to MGEs, to extend and improve the inadequate annotation currently associated with MGEs and to screen known genomes for the validation and discovery of new MGEs.
Biochemical pathways such as metabolic, regulatory or signal transduction pathways can be viewed as interconnected processes forming an intricate network of functional and physical interactions between molecular species in the cell. The amount of information available on such pathways for different organisms is increasing very rapidly. This is offering the possibility of performing various analyses on the structure of the full network of pathways for one organism as well as across different organisms, and has therefore generated interest in developing databases for storing and managing this information. Analysing these networks remains far from straightforward owing to the nature of the databases, which are often heterogeneous, incomplete or inconsistent. Pathway analysis is hence a challenging problem in systems biology and in bioinformatics. Various forms of data models have been devised for the analysis of biochemical pathways. This paper presents an overview of the types of models used for this purpose, concentrating on those concerned with the structural aspects of biochemical networks. In particular, the different types of data models found in the literature are classified using a unified framework. In addition, how these models have been used in the analysis of biochemical networks is described. This enables us to underline the strengths and weaknesses of the different approaches, as well as to highlight relevant future research directions.
CCR5 is a G protein-coupled receptor responding to four natural agonists, the chemokines RANTES (regulated on activation normal T cell expressed and secreted), macrophage inflammatory protein (MIP)-1 alpha, MIP-1 beta, and monocyte chemotactic protein (MCP)-2, and is the main co-receptor for the macrophage-tropic human immunodeficiency virus strains. We have previously identified a structural motif in the second transmembrane helix of CCR5, which plays a crucial role in the mechanism of receptor activation. We now report the specific role of aromatic residues in helices 2 and 3 of CCR5 in this mechanism. Using site-directed mutagenesis and molecular modeling in a combined approach, we demonstrate that a cluster of aromatic residues at the extracellular border of these two helices are involved in chemokine-induced activation. These aromatic residues are involved in interhelical interactions that are key for the conformation of the helices and govern the functional response to chemokines in a ligand-specific manner. We therefore suggest that transmembrane helices 2 and 3 contain important structural elements for the activation mechanism of chemokine receptors, and possibly other related receptors as well.
CAPRI is a communitywide experiment to assess the capacity of protein-docking methods to predict protein-protein interactions. Nineteen groups participated in rounds 1 and 2 of CAPRI and submitted blind structure predictions for seven protein-protein complexes based on the known structure of the component proteins. The predictions were compared to the unpublished X-ray structures of the complexes. We describe here the motivations for launching CAPRI, the rules that we applied to select targets and run the experiment, and some conclusions that can already be drawn. The results stress the need for new scoring functions and for methods handling the conformation changes that were observed in some of the target systems. CAPRI has already been a powerful drive for the community of computational biologists who develop docking algorithms. We hope that this issue of Proteins will also be of interest to the community of structural biologists, which we call upon to provide new targets for future rounds of CAPRI, and to all molecular biologists who view protein-protein recognition as an essential process. Copyright 2003 Wiley-Liss, Inc.
The current status of docking procedures for predicting protein-protein interactions starting from their three-dimensional structure is assessed from a first major evaluation of blind predictions. This evaluation was performed as part of a communitywide experiment on Critical Assessment of PRedicted Interactions (CAPRI). Seven newly determined structures of protein-protein complexes were available as targets for this experiment. These were the complexes between a kinase and its protein substrate, between a T-cell receptor beta-chain and a superantigen, and five antigen-antibody complexes. For each target, the predictors were given the experimental structures of the free components, or of one free and one bound component in a random orientation. The structure of the complex was revealed only at the time of the evaluation. A total of 465 predictions submitted by 19 groups were evaluated. These groups used a wide range of algorithms and scoring functions, some of which were completely novel. The quality of the predicted interactions was evaluated by comparing residue-residue contacts and interface residues to those in the X-ray structures and by analyzing the fit of the ligand molecules (the smaller of the two proteins in the complex) or of interface residues only, in the predicted versus target complexes. A total of 14 groups produced predictions ranging from acceptable to highly accurate for five of the seven targets. The use of available biochemical and biological information, and in one instance structural information, played a key role in achieving this result. It was essential for identifying the native binding modes for the five correctly predicted targets, including the kinase-substrate complex where the enzyme changes conformation on association. But it was also the cause for missing the correct solution for the two remaining unpredicted targets, which involve unexpected antigen-antibody binding modes. 
Overall, this analysis reveals genuine progress in docking procedures but also illustrates the remaining serious limitations and points out the need for better scoring functions and more effective ways for handling conformational flexibility. Copyright 2003 Wiley-Liss, Inc.
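The comparison of residue-residue contacts described in the abstract above can be sketched as a simple set overlap. This is only an illustrative sketch: the function name, the residue labels and the representation of a contact as a (receptor residue, ligand residue) pair are assumptions, and the abstract does not state the distance cutoffs or quality thresholds actually applied in the evaluation.

```python
def contact_fraction(predicted_contacts, native_contacts):
    """Fraction of native residue-residue contacts reproduced by a prediction.

    Each contact is a (receptor_residue, ligand_residue) tuple. Hypothetical
    helper for illustration, not the evaluation code used in CAPRI.
    """
    native = set(native_contacts)
    if not native:
        raise ValueError("the native complex must have at least one contact")
    return len(native & set(predicted_contacts)) / len(native)


# Invented example: four native contacts, two of which are reproduced.
native = {("A10", "B5"), ("A11", "B7"), ("A12", "B9"), ("A15", "B2")}
predicted = {("A10", "B5"), ("A11", "B7"), ("A40", "B1")}
print(contact_fraction(predicted, native))
```

A prediction that recovers more of the native contacts scores closer to 1; contacts present only in the prediction do not raise the score.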
Homology modeling in combination with transmembrane topology predictions is used to build the atomic model of Neurospora crassa plasma membrane H+-ATPase, using as template the 2.6 Å crystal structure of rabbit sarcoplasmic reticulum Ca2+-ATPase [Toyoshima, C., Nakasako, M., Nomura, H. & Ogawa, H. (2000) Nature 405, 647-655]. Comparison of the two calcium-binding sites in the crystal structure of Ca2+-ATPase with the equivalent region in the H+-ATPase model shows that the latter is devoid of most of the negatively charged groups required to bind the cations, suggesting a different role for this region. Using the built model, a pathway for proton transport is then proposed from computed locations of internal polar cavities, large enough to contain at least one water molecule. As a control, the same approach is applied to the high-resolution crystal structure of halorhodopsin and the proton pump bacteriorhodopsin. This revealed a striking correspondence between the positions of internal polar cavities, those of crystallographic water molecules and, in the case of bacteriorhodopsin, the residues mediating proton translocation. In our H+-ATPase model, most of these cavities are in contact with residues previously shown to affect coupling of proton translocation to ATP hydrolysis. A string of six polar cavities identified in the cytoplasmic domain, the most accurate part of the model, suggests a proton entry path starting close to the phosphorylation site. Strikingly, members of the haloacid dehalogenase superfamily, which are close structural homologs of this domain but do not share the same function, display only one polar cavity in the vicinity of the conserved catalytic Asp residue.
An automatic protein design procedure was used to compute amino acid sequences of peptides likely to bind the HLA-A2 major histocompatibility complex (MHC) class I allele. The only inputs to the procedure are a structural template, a rotamer library, and a well established classical empirical force field. The calculations are performed on six different templates from x-ray structures of HLA-A0201-peptide complexes. Each template consists of the bound peptide backbone and the full atomic coordinates of the MHC protein. Sequences within 2 kcal/mol of the minimum energy sequence are computed for each template, and the sequences from all the templates are combined and ranked by their energies. The five lowest energy peptide sequences and five other low energy sequences re-ranked on the basis of their similarity to peptides known to bind the same MHC allele are chemically synthesized and tested for their ability to bind and form stable complexes with the HLA-A2 molecule. The most efficient binders are also tested for inhibition of the T cell receptor recognition of two known CD8(+) T effectors. Results show that all 10 peptides bind the expected MHC protein. The six strongest binders also form stable HLA-A2-peptide complexes, albeit to varying degrees, and three peptides display significant inhibition of CD8(+) T cell recognition. These results are rationalized in light of our knowledge of the three-dimensional structures of the HLA-A2-peptide and HLA-A2-peptide-T cell receptor complexes.
In recent years a large body of data has been obtained from Nuclear Magnetic Resonance and Circular Dichroism experiments on the influence of the amino acid sequence and various other parameters on the conformational state of peptides in solution. Interpreting the experimental data in terms of the conformational populations of the peptides remains a key problem, for which current solutions leave appreciable room for improvement. Considering that making this body of data available for surveys and analysis should be instrumental in tackling the problem, we undertook the development of Pescador: The 'PEptides in Solution ConformAtion Database: Online Resource'. Pescador contains data from NMR and CD spectroscopy on peptides in solution as well as information on the structural parameters derived from these data. It also features specialized Web-based tools for data deposition, and means for readily accessing the stored information for analysis purposes. To illustrate the use of the database in deriving information for the conformational analysis of peptides, we show how the alpha proton delta-values stored in Pescador and measured by NMR for different peptides in different laboratories can be used to derive a new set of 'random coil' chemical shift values. Firstly, we show these values to be very similar to those obtained experimentally for model peptides in water, and their variation with increasing Tri-Fluoro-Ethanol (TFE) concentration is similar to that reported for model peptides. We show, furthermore, that the chemical shift data in Pescador can be used to derive correction factors that take into account effects of neighboring residues. These correction factors compare favorably with those recently derived from a series of model GGXGG peptides (Schwarzinger et al., 2001). 
These encouraging results suggest that, as the quantity of NMR data on peptide deposited in Pescador increases, surveys of these data should be a valuable means of deriving key parameters for the analysis of peptide conformation.
A set of conserved water positions making direct contacts with the alpha1 and alpha2 domains of the MHC class-I protein was identified by a cluster analysis in 12 high-resolution crystal structures of proteins from different allele types and different species, comprising human, mouse and rat. The analysis revealed a total of 63 clusters, corresponding to water molecules, whose positions are conserved in half or more of the analyzed structures. Analysis of these clusters shows that the most conserved water positions-those appearing in the largest fraction of the structures-were also the most accurately defined, as measured by their normalized crystallographic B-factor. Not too surprisingly, these positions displayed better overlap and formed more H-bonds with the protein. In a second part of this work, a detailed analysis is presented of three of the most conserved water positions and their putative structural and functional roles are discussed. The most highly conserved of the three appears to play an important role in stabilizing the conformation of a twisted beta-turn between residues 118 and 122 (numbering of HLA-B3501, PDB code 1A1N). An equivalent water molecule was found to be associated with a similar beta-turn in 43 unrelated structures surveyed in the PDB, leading to the suggestion that this water molecule plays an important structural role in this type of turn. The second water molecule makes hydrogen bonds with residues lining pocket B in the peptide-binding groove and is suggested to play a role in modulating peptide recognition. The third highly conserved water molecule is located at the first kink of the alpha2 helix, possibly playing a role in determining the position of the N-terminal segment of that helix, which also carries side chains in contact with the bound peptide. 
This information on conserved water positions in MHC class-I molecules should be helpful in modeling interactions with bound peptide antigens and in designing new peptides with tailor-made affinities.
MOTIVATION: Comparing the 3D structures of two proteins or analyzing the structural changes undergone by a protein upon ligand binding or when it crystallizes under different conditions, can be both tricky and tedious, especially when the two proteins are distantly related, or when the structural changes are complex. Readily accessible tools for performing these tasks automatically and reliably should therefore be welcome. RESULTS: We describe a web interface to several automatic procedures for performing pairwise structure superposition in a flexible manner, for detailed analyses of conformational changes and for displaying the results in a pictorial fashion. AVAILABILITY: This interface can be accessed at the Brussels and Cuba Web sites, respectively: http://www.ucmb.ulb.ac.be/SCMBB/Tools.html and http://bio.cigb.edu.cu.
The program SFCHECK [Vaguine et al. (1999), Acta Cryst. D55, 191-205] is used to survey the quality of the structure-factor data and the agreement of those data with the atomic coordinates in 105 nucleic acid crystal structures for which structure-factor amplitudes have been deposited in the Nucleic Acid Database [NDB; Berman et al. (1992), Biophys. J. 63, 751-759]. Nucleic acid structures present a particular challenge for structure-quality evaluations. The majority of these structures, and DNA molecules in particular, have been solved by molecular replacement of the double-helical motif, whose high degree of symmetry can lead to problems in positioning the molecule in the unit cell. In this paper, the overall quality of each structure was evaluated using parameters such as the R factor, the correlation coefficient and various atomic error estimates. In addition, each structure is characterized by the average values of several local quality indicators, which include the atomic displacement, the density correlation, the B factor and the density index. The latter parameter measures the relative electron-density level at the atomic position. In order to assess the quality of the model in specific regions, the same local quality indicators are also surveyed for individual groups of atoms in each structure. Several of the global quality indicators are found to vary linearly with resolution and less than a dozen structures are found to exhibit values significantly different from the mean for these indicators, showing that the quality of the nucleic acid structures tends to be rather uniform. Analysis of the mutual dependence of the values of different local quality indicators, computed for individual residues and atom groups, reveals that these indicators essentially complement each other and are not redundant with the B factor. 
Using several of these indicators, it was found that the atomic coordinates of the nucleic acid bases tend to be better defined than those of the backbone. One of the local indicators, the density index, is particularly useful in spotting regions of the model that fit poorly in the electron density. Using this parameter, the quality of crystallographic water positions in the analyzed structures was surveyed and it was found that a sizable fraction of these positions have poorly defined electron density and may therefore not be reliable. The possibility that cases of poorly positioned water molecules are symptomatic of more widespread problems with the structure as a whole is also raised.
This review describes computational procedures for deriving the amino acid sequences that are compatible with a given protein backbone structure. Such procedures can be used to gain insight into the constraints imposed by the 3D structure on the protein sequence, or to design proteins that are likely to adopt a given backbone conformation. We start by presenting a short overview of the various types of approaches to protein design developed over more than a decade. This is followed by a more detailed presentation of a recently developed sequence selection procedure, DESIGNER. This latter presentation illustrates the basic principles underlying this type of procedure, describes what such procedures may teach us when applied to small proteins, and highlights issues that need to be addressed in order to go forward.
The thyrotropin (TSH) receptor is an interesting model to study G protein-coupled receptor activation as many point mutations can significantly increase its basal activity. Here, we identified a molecular interaction between Asp(633) in transmembrane helix 6 (TM6) and Asn(674) in TM7 of the TSHr that is crucial to maintain the inactive state through conformational constraint of the Asn. We show that these residues are perfectly conserved in the glycohormone receptor family, except in one case, where they are exchanged, suggesting a direct interaction. Molecular modeling of the TSHr, based on the high resolution structure of rhodopsin, strongly favors this hypothesis. Our approach combining site-directed mutagenesis with molecular modeling shows that mutations disrupting this interaction, like the D633A mutation in TM6, lead to high constitutive activation. The strongly activating N674D (TM7) mutation, which in our modeling breaks the TM6-TM7 link, is reverted to wild type-like behavior by an additional D633N mutation (TM6), which would restore this link. Moreover, we show that the Asn of TM7 (conserved in most G protein-coupled receptors) is mandatory for ligand-induced cAMP accumulation, suggesting an active role of this residue in activation. In the TSHr, the conformation of this Asn residue of TM7 would be constrained, in the inactive state, by its Asp partner in TM6.
Standard volumes for atoms in double-stranded B-DNA are derived using high resolution crystal structures from the Nucleic Acid Database (NDB) and compared with corresponding values derived from crystal structures of small organic compounds in the Cambridge Structural Database (CSD). Two different methods are used to compute these volumes: the classical Voronoi method, which does not depend on the size of atoms, and the related Radical Planes method which does. Results show that atomic groups buried in the interior of double-stranded DNA are, on average, more tightly packed than in related small molecules in the CSD. The packing efficiency of DNA atoms at the interfaces of 25 high resolution protein-DNA complexes is determined by computing the ratios between the volumes of interfacial DNA atoms and the corresponding standard volumes. These ratios are found to be close to unity, indicating that the DNA atoms at protein-DNA interfaces are as closely packed as in crystals of B-DNA. Analogous volume ratios, computed for buried protein atoms, are also near unity, confirming our earlier conclusions that the packing efficiency of these atoms is similar to that in the protein interior. In addition, we examine the number, volume and solvent occupation of cavities located at the protein-DNA interfaces and compare them with those in the protein interior. Cavities are found to be ubiquitous in the interfaces as well as inside the protein moieties. The frequency of solvent occupation of cavities is however higher in the interfaces, indicating that those are more hydrated than protein interiors. Lastly, we compare our results with those obtained using two different measures of shape complementarity of the analysed interfaces, and find that the correlation between our volume ratios and these measures, as well as between the measures themselves, is weak. 
Our results indicate that a tightly packed environment made up of DNA, protein and solvent atoms plays a significant role in protein-DNA recognition.
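The volume-ratio measure of packing efficiency described in the abstract above can be sketched as follows. The function name, the atom-type labels and all numerical values are hypothetical placeholders chosen for illustration; they are not the standard volumes derived in the study, and the Voronoi or Radical Planes volume calculation itself is assumed to have been done beforehand.

```python
def mean_volume_ratio(interface_atoms, standard_volumes):
    """Average ratio of observed atomic volume to the standard volume.

    interface_atoms: list of (atom_type, observed_volume) pairs.
    standard_volumes: dict mapping atom_type to its reference volume.
    A ratio near 1 indicates packing similar to the reference environment.
    """
    if not interface_atoms:
        raise ValueError("at least one atom is required")
    ratios = [volume / standard_volumes[atom_type]
              for atom_type, volume in interface_atoms]
    return sum(ratios) / len(ratios)


# Invented reference volumes and observed volumes, for illustration only.
standard = {"C": 20.0, "N": 15.0}
atoms = [("C", 21.0), ("N", 15.0), ("C", 19.0)]
print(round(mean_volume_ratio(atoms, standard), 3))
```

Ratios above 1 would indicate looser packing than the reference, and ratios below 1 tighter packing.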
CCR5 is a G-protein-coupled receptor activated by the chemokines RANTES (regulated on activation normal T cell expressed and secreted), macrophage inflammatory protein 1alpha and 1beta, and monocyte chemotactic protein 2 and is the main co-receptor for the macrophage-tropic human immunodeficiency virus strains. We have identified a sequence motif (TXP) in the second transmembrane helix of chemokine receptors and investigated its role by theoretical and experimental approaches. Molecular dynamics simulations of model alpha-helices in a nonpolar environment were used to show that a TXP motif strongly bends these helices, due to the coordinated action of the proline, which kinks the helix, and of the threonine, which further accentuates this structural deformation. Site-directed mutagenesis of the corresponding Pro and Thr residues in CCR5 allowed us to probe the consequences of these structural findings in the context of the whole receptor. The P84A mutation leads to a decreased binding affinity for chemokines and nearly abolishes the functional response of the receptor. In contrast, mutation of Thr-82(2.56) into Val, Ala, Cys, or Ser does not affect chemokine binding. However, the functional response was found to depend strongly on the nature of the substituted side chain. The rank order of impairment of receptor activation is P84A > T82V > T82A > T82C > T82S. This ranking of impairment parallels the bending of the alpha-helix observed in the molecular simulation study.
The most abundant alpha-amylase inhibitor (AAI) present in the seeds of Amaranthus hypochondriacus, a variety of the Mexican crop plant amaranth, is the smallest polypeptide (32 residues) known to inhibit alpha-amylase activity of insect larvae while leaving that of mammals unaffected. In solution, 1H NMR reveals that AAI isolated from amaranth seeds adopts a major trans (70%) and minor cis (30%) conformation, resulting from slow cis-trans isomerization of the Val15-Pro16 peptide bond. Both solution structures have been determined using 2D 1H-NMR spectroscopy and XPLOR followed by restrained energy refinement in the consistent-valence force field. For the major isomer, a total of 563 distance restraints, including 55 medium-range and 173 long-range ones, were available from the NOESY spectra. This rather large number of constraints from a protein of such a small size results from a compact fold, imposed through three disulfide bridges arranged in a cysteine-knot motif. The structure of the minor cis isomer has also been determined using a smaller constraint set. It reveals a different backbone conformation in the Pro10-Pro20 segment, while preserving the overall global fold. The energy-refined ensemble of the major isomer, consisting of 20 low-energy conformers with an average backbone rmsd of 0.29 +/- 0.19 Å and no violations larger than 0.4 Å, represents a considerable improvement in precision over a previously reported and independently performed calculation on AAI obtained through solid-phase synthesis, which was determined with only half the number of medium-range and long-range restraints reported here, and featured the trans isomer only. The resulting differences in ensemble precision have been quantified locally and globally, indicating that, for regions of the backbone and a good fraction of the side chains, the conformation is better defined in the new solution structure. 
Structural comparison of the solution structure with the X-ray structure of the inhibitor when bound to its alpha-amylase target in Tenebrio molitor shows that the backbone conformation is only slightly adjusted on complexation, while that of the side chains involved in protein-protein contacts is similar to those present in solution. Therefore, the overall conformation of AAI appears to be predisposed to binding to its target alpha-amylase, confirming the view that it acts as a lid on top of the alpha-amylase active site.
This paper describes how biological function can be represented in terms of molecular activities and processes. It presents several key features of a data model that is based on a conceptual description of the network of interactions between molecular entities within the cell and between cells. This model is implemented in the aMAZE database that presently deals with information on metabolic pathways, gene regulation, sub- or supracellular locations, and transport. It is shown that this model constitutes a useful generalisation of data representations currently implemented in metabolic pathway databases, and that it can furthermore include multiple schemes for categorising and classifying molecular entities, activities, processes and localisations. In particular, we highlight the flexibility offered by our system in representing multiple molecular activities and their control, in viewing biological function at different levels of resolution and in updating this view as our knowledge evolves.
Determining the biological function of a myriad of genes, and understanding how they interact to yield a living cell, is the major challenge of the post genome-sequencing era. The complexity of biological systems is such that this cannot be envisaged without the help of powerful computer systems capable of representing and analysing the intricate networks of physical and functional interactions between the different cellular components. In this review we try to provide the reader with an appreciation of where we stand in this regard. We discuss some of the inherent problems in describing the different facets of biological function, give an overview of how information on function is currently represented in the major biological databases, and describe different systems for organising and categorising the functions of gene products. In a second part, we present a new general data model, currently under development, which describes information on molecular function and cellular processes in a rigorous manner. The model is capable of representing a large variety of biochemical processes, including metabolic pathways, regulation of gene expression and signal transduction. It also incorporates taxonomies for categorising molecular entities, interactions and processes, and it offers means of viewing the information at different levels of resolution, and dealing with incomplete knowledge. The data model has been implemented in the database on protein function and cellular processes 'aMAZE' (http://www.ebi.ac.uk/research/pfbp/), which presently covers metabolic pathways and their regulation. Several tools for querying, displaying, and performing analyses on such pathways are briefly described in order to illustrate the practical applications enabled by the model.
A fully automatic procedure for predicting the amino acid sequences compatible with a given target structure is described. It is based on the CHARMM package, and uses an all atom force-field and rotamer libraries to describe and evaluate side-chain types and conformations. Sequences are ranked by a quantity akin to the free energy of folding, which incorporates hydration effects. Exact (Branch and Bound) and heuristic optimisation procedures are used to identify highly scoring sequences from an astronomical number of possibilities. These sequences include the minimum free energy sequence, as well as all amino acid sequences whose free energy lies within a specified window from the minimum. Several applications of our procedure are illustrated. Prediction of side-chain conformations for a set of ten proteins yields results comparable to those of established side-chain placement programs. Applications to sequence optimisation comprise the re-design of the protein cores of c-Crk SH3 domain, the B1 domain of protein G and Ubiquitin, and of surface residues of the SH3 domain. In all calculations, no restrictions are imposed on the amino acid composition and identical parameter settings are used for core and surface residues. The best scoring sequences for the protein cores are virtually identical to wild-type. They feature no more than one to three mutations in a total of 11-16 variable positions. Tests suggest that this is due to the balance between various contributions in the force-field rather than to overwhelming influence from packing constraints. The effectiveness of our force-field is further supported by the sequence predictions for surface residues of the SH3 domain. More mutations are predicted than in the core, seemingly in order to optimise the network of complementary interactions between polar and charged groups. 
This appears to be an important energetic requirement in the absence of the partner molecules with which the SH3 domain interacts, which were not included in the calculations. Finally, a detailed comparison between the sequences generated by the heuristic and exact optimisation algorithms calls for a note of caution concerning the efficiency of heuristic procedures in exploring sequence space. Copyright 2000 Academic Press.
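The idea of collecting the minimum free energy sequence together with all sequences lying within a specified window from the minimum can be sketched with a toy model. This is a deliberately simplified illustration, not the procedure of the abstract above: it assumes a purely additive per-position energy and enumerates every sequence exhaustively, whereas the described method relies on a CHARMM-based force field and on Branch and Bound or heuristic searches. The function name and all energy values are invented.

```python
from itertools import product


def sequences_within_window(position_energies, window):
    """List all sequences whose energy is within `window` of the minimum.

    position_energies: one dict per variable position, mapping a residue
    letter to its (toy, additive) energy contribution.
    Returns (sequence, energy) pairs sorted by energy, then alphabetically.
    """
    scored = []
    # Exhaustive enumeration; feasible only for very small toy problems.
    for combo in product(*(sorted(p) for p in position_energies)):
        energy = sum(p[res] for p, res in zip(position_energies, combo))
        scored.append(("".join(combo), energy))
    e_min = min(energy for _, energy in scored)
    kept = [(seq, energy) for seq, energy in scored if energy - e_min <= window]
    return sorted(kept, key=lambda item: (item[1], item[0]))


# Invented energies (arbitrary units) for two variable positions.
toy_positions = [{"A": 0.0, "V": 1.0}, {"L": 0.0, "I": 0.5}]
for seq, energy in sequences_within_window(toy_positions, window=1.0):
    print(seq, energy)
```

With realistic, non-additive energies the number of candidate sequences grows too quickly for enumeration, which is why exact pruning or heuristic search is needed in practice.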
The clearance of seven different ligands from the deeply buried active-site of Torpedo californica acetylcholinesterase is investigated by combining multiple copy sampling molecular dynamics simulations, with the analysis of protein-ligand interactions, protein motion and the electrostatic potential sampled by the ligand copies along their journey outwards. The considered ligands are the cations ammonium, methylammonium, and tetramethylammonium, the hydrophobic methane and neopentane, and the anionic product acetate and its neutral form, acetic acid. We find that the pathways explored by the different ligands vary with ligand size and chemical properties. Very small ligands, such as ammonium and methane, exit through several routes. One involves the main exit through the mouth of the enzyme gorge, another is through the so-called back door near Trp84, and a third uses a side door at a direction of approximately 45 degrees to the main exit. The larger polar ligands, methylammonium and acetic acid, leave through the main exit, but the bulkiest, tetramethylammonium and neopentane, as well as the smaller acetate ion, remain trapped in the enzyme gorge during the time of the simulations. The pattern of protein-ligand contacts during the diffusion process is highly non-random and differs for different ligands. A majority is made with aromatic side-chains, but classical H-bonds are also formed. In the case of acetate, but not acetic acid, the anionic and neutral form, respectively, of one of the reaction products, specific electrostatic interactions with protein groups, seem to slow ligand motion and interfere with protein flexibility; protonation of the acetate ion is therefore suggested to facilitate clearance. The Poisson-Boltzmann formalism is used to compute the electrostatic potential of the thermally fluctuating acetylcholinesterase protein at positions actually visited by the diffusing ligand copies. 
Ligands of different charge and size are shown to sample somewhat different electrostatic potentials during their migration, because they explore different microscopic routes. The potential along the clearance route of a cation such as methylammonium displays two clear minima at the active and peripheral anionic site. We find moreover that the electrostatic energy barrier that the cation needs to overcome when moving between these two sites is small in both directions, being of the order of the ligand kinetic energy. The peripheral site thus appears to play a role in trapping inbound cationic ligands as well as in cation clearance, and hence in product release. Copyright 2000 Academic Press.
A novel automatic procedure for identifying domains from protein atomic coordinates is presented. The procedure, termed STRUDL (STRUctural Domain Limits), does not take into account information on secondary structures and handles any number of domains made up of contiguous or non-contiguous chain segments. The core algorithm uses the Kernighan-Lin graph heuristic to partition the protein into residue sets which display minimum interactions between them. These interactions are deduced from the weighted Voronoi diagram. The generated partitions are accepted or rejected on the basis of optimized criteria, representing basic expected physical properties of structural domains. The graph heuristic approach is shown to be very effective: it closely approximates the exact solution provided by a branch and bound algorithm for a number of test proteins. In addition, the overall performance of STRUDL is assessed on a set of 787 representative proteins from the Protein Data Bank by comparison to domain definitions in the CATH protein classification. The domains assigned by STRUDL agree with the CATH assignments in at least 81% of the tested proteins. This result is comparable to that obtained previously using PUU (Holm and Sander, Proteins 1994;9: 256-268), the only other available algorithm designed to identify domains with any number of non-contiguous chain segments. A detailed discussion of the structures for which our assignments differ from those in CATH brings to light some clear inconsistencies between the concept of structural domains based on minimizing inter-domain interactions and that of delimiting structural motifs that represent acceptable folding topologies or architectures. Considering both concepts as complementary and combining them in a layered approach might be the way forward.
We analyzed the atomic models of 75 X-ray structures of protein-nucleic acid complexes with the aim of uncovering common properties. The interface area measured the extent of contact between the protein and nucleic acid. It was found to vary between 1120 and 5800 Å². Despite this wide variation, the interfaces in complexes of transcription factors with double-stranded DNA could be broken up into recognition modules where 12 +/- 3 nucleotides on the DNA side contact 24 +/- 6 amino acids on the protein side, with interface areas in the range 1600 +/- 400 Å². For enzymes acting on DNA, the recognition module is on average 600 Å² larger, due to the requirement of making an active site. As judged by its chemical and amino acid composition, the average protein surface in contact with the DNA is more polar than the solvent accessible surface or the typical protein-protein interface. The protein side is rich in positively charged groups from lysine and arginine side chains; on the DNA side the negative charges from phosphate groups dominate. Hydrogen bonding patterns were also analyzed, and we found one intermolecular hydrogen bond per 125 Å² of interface area in high-resolution structures. An equivalent number of polar interactions involved water molecules, which are generally abundant at protein-DNA interfaces. Calculations of Voronoi atomic volumes, performed in the presence and absence of water molecules, showed that protein atoms buried at the interface with DNA are on average as closely packed as in the protein interior. Water molecules contribute to the close packing, thereby mediating shape complementarity. Finally, conformational changes accompanying association were analyzed in 24 of the complexes for which the structure of the free protein was also available. On the DNA side the extent of deformation showed some correlation with the size of the interface area. On the protein side the type and size of the structural changes spanned a wide spectrum.
Disorder-to-order transitions, domain movements, quaternary and tertiary changes were observed, and the largest changes occurred in complexes with large interfaces.
In this paper we present SFCHECK, a stand-alone software package that features a unified set of procedures for evaluating the structure-factor data obtained from X-ray diffraction experiments and for assessing the agreement of the atomic coordinates with these data. The evaluation is performed completely automatically, and produces a concise PostScript pictorial output similar to that of PROCHECK [Laskowski, MacArthur, Moss & Thornton (1993). J. Appl. Cryst. 26, 283-291], greatly facilitating visual inspection of the results. The required inputs are the structure-factor amplitudes and the atomic coordinates. Having those, the program summarizes relevant information on the deposited structure factors and evaluates their quality using criteria such as data completeness, structure-factor uncertainty and the optical resolution computed from the Patterson origin peak. The dependence of various parameters on the nominal resolution (d spacing) is also given. To evaluate the global agreement of the atomic model with the experimental data, the program recomputes the R factor, the correlation coefficient between observed and calculated structure-factor amplitudes and Rfree (when appropriate). In addition, it gives several estimates of the average error in the atomic coordinates. The local agreement between the model and the electron-density map is evaluated on a per-residue basis, considering separately the macromolecule backbone and side-chain atoms, as well as solvent atoms and heterogroups. Among the criteria are the normalized average atomic displacement, the local density correlation coefficient and the polymer chain connectivity. The possibility of computing these criteria using the omit-map procedure is also provided. The described software should be a valuable tool in monitoring the refinement procedure and in assessing structures deposited in databases.
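The global agreement measures named above can be illustrated with a short sketch based on the textbook definitions of the crystallographic R factor and the linear correlation coefficient between observed and calculated amplitudes. This is not SFCHECK code: the function names are hypothetical, the amplitudes are invented, and the single overall scale factor is a simplifying assumption (real programs typically apply more elaborate scaling).

```python
import math


def scale_factor(f_obs, f_calc):
    """Overall scale that puts calculated amplitudes on the observed scale
    (assumption: a single resolution-independent factor)."""
    return sum(f_obs) / sum(f_calc)


def r_factor(f_obs, f_calc):
    """R = sum | |Fobs| - k |Fcalc| | / sum |Fobs|."""
    k = scale_factor(f_obs, f_calc)
    return sum(abs(o - k * c) for o, c in zip(f_obs, f_calc)) / sum(f_obs)


def correlation(f_obs, f_calc):
    """Pearson correlation coefficient between the two amplitude sets."""
    n = len(f_obs)
    mean_o = sum(f_obs) / n
    mean_c = sum(f_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(f_obs, f_calc))
    var_o = sum((o - mean_o) ** 2 for o in f_obs)
    var_c = sum((c - mean_c) ** 2 for c in f_calc)
    return cov / math.sqrt(var_o * var_c)


# Invented structure-factor amplitudes for three reflections.
f_obs_demo = [10.0, 20.0, 30.0]
f_calc_demo = [12.0, 18.0, 30.0]
print(r_factor(f_obs_demo, f_calc_demo), correlation(f_obs_demo, f_calc_demo))
```

Under the same assumptions, Rfree would be obtained by calling `r_factor` on only those reflections that were excluded from refinement.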
Barnase, an extracellular endoribonuclease from Bacillus amyloliquefaciens, hydrolyses single-stranded RNA. Its very low catalytic activity toward GpN dinucleotides, where N stands for any nucleoside, is markedly increased when a phosphate is added to the 3'-end, as in GpNp. Here we investigate the conformational properties of GpA and GpAp in solution, in order to determine whether differences in these properties may be related to the changes in enzymatic activity. Two independent 1.3 ns molecular dynamics trajectories are generated for each dinucleotide in the presence of explicit water molecules and counter ions. These trajectories are analysed by monitoring molecular properties, such as the solvent accessible surface area, the distance and orientation between the bases, the behaviour of torsion angles and formation of intramolecular H-bonds. To identify relevant correlations between these parameters, statistical techniques, comprising multiple regression, clustering and discriminant analysis, are used. Results show that GpA has a significant propensity to form folded conformations (approximately 50%), fostered by a small number of intramolecular H-bonds, whereas GpAp remains essentially extended. The latter behaviour seems to be due to an H-bond between the terminal phosphate and adenosine ribose group, which restricts rotation about the adenine Agamma angle. We also find that GpA folding is induced by a concerted motion of specific torsion angles, which is closely coupled to the formation of a network of flexible hydrogen bonds. Finally, on the basis of an expression for barnase KM, which incorporates the folded/extended conformational equilibria of the dinucleotide substrates, it is argued that our findings on the differences between these equilibria can qualitatively rationalize the experimentally measured differences in enzymatic properties. Copyright 1998 Academic Press.
The geometrical properties of zinc binding sites in a data set of high quality protein crystal structures deposited in the Protein Data Bank have been examined to identify important differences between zinc sites that are directly involved in catalysis and those that play a structural role. Coordination angles in the zinc primary coordination sphere are compared with ideal values for each coordination geometry, and zinc coordination distances are compared with those in small zinc complexes from the Cambridge Structural Database as a guide of expected trends. We find that distances and angles in the primary coordination sphere are in general close to the expected (or ideal) values. Deviations occur primarily for oxygen coordinating atoms and are found to be mainly due to H-bonding of the oxygen coordinating ligand to protein residues, bidentate binding arrangements, and multi-zinc sites. We find that H-bonding of oxygen containing residues (or water) to zinc bound histidines is almost universal in our data set and defines the elec-His-Zn motif. Analysis of the stereochemistry shows that carboxyl elec-His-Zn motifs are geometrically rigid, while water elec-His-Zn motifs show the most geometrical variation. As catalytic motifs have a higher proportion of carboxyl elec atoms than structural motifs, they provide a more rigid framework for zinc binding. This is understood biologically, as a small distortion in the zinc position in an enzyme can have serious consequences on the enzymatic reaction. We also analyze the sequence pattern of the zinc ligands and residues that provide elecs, and identify conserved hydrophobic residues in the endopeptidases that also appear to contribute to stabilizing the catalytic zinc site. A zinc binding template in protein crystal structures is derived from these observations.
A fully automatic classification procedure of short protein fragments is applied to identify connections between alpha-helices and beta-strands in a data set of 141 protein chains. It yields 15 structural families of alphabeta turns and 15 families of betaalpha turns with at least five members. The sequence and structural features of these turn motifs are analysed with the focus on the local interactions located at alpha-helix and beta-strand ends. This analysis reveals specific interaction patterns that occur frequently among the members of many of the identified turn motifs. For the beta-strands, novel patterns are identified at the strands' entry and exit; they involve side chain/side chain contacts and beta-turns, generally of type I or II. For the alpha-helices, the interaction patterns consist of several backbone/backbone or backbone/side chain hydrogen bonds and of hydrophobic contacts; they generalize the well known N-terminal capping and C-terminal Schellman motifs. The interaction patterns at both ends of alpha-helices and beta-strands are found to constitute favourable structure motifs with low amino acid sequence specificity; their possible stabilizing role is discussed. Finally, the robustness of our classification procedure and of the description of N- and C-cap interaction patterns is validated by repeating our analysis on a larger data set of 381 protein chains and showing that the results are maintained.
BACKGROUND: The classical picture of the hydrophobic stabilization of proteins invokes a resemblance between the protein interior and nonpolar solvents, but the extent to which this is the case has often been questioned. The protein interior is believed to be at least as tightly packed as organic crystals, and was shown to have very low compressibility. There is also evidence that these properties are not uniform throughout the protein, and conflicting views exist on the nature of sidechain packing and on its influence on the properties of the protein. RESULTS: In order to probe the physical properties of the protein, the free energy associated with the formation of empty cavities has been evaluated for two proteins: barnase and T4 lysozyme. To this end, the likelihood of encountering such cavities was computed from room temperature molecular dynamics trajectories of these proteins in water. The free energy was evaluated in each protein taken as a whole and in submolecular regions. The computed free energies yielded information on the manner in which empty space is distributed in the system, while the latter undergoes thermal motion, a property hitherto not analyzed in heterogeneous media such as proteins. Our results showed that the free energy of cavity formation is higher in proteins than in both water and hexane, providing direct evidence that the native protein medium differs in fundamental ways from the two liquids. Furthermore, although the packing density was found to be higher in nonpolar regions of the protein than in polar ones, the free energy cost of forming atomic size cavities is significantly lower in nonpolar regions, implying that these regions contain larger chunks of empty space, thereby increasing the likelihood of containing atomic size packing defects. These larger empty spaces occur preferentially where buried hydrophobic sidechains belonging to secondary structures meet one another. 
These particular locations also appear to be more compressible than other parts of the core or surface of the protein. CONCLUSIONS: The cavity free energy calculations described here provide a much more detailed physical picture of the protein matrix than volume and packing calculations. According to this picture, the packing of hydrophobic sidechains is tight in the interior of the protein, but far from uniform. In particular, the packing is tighter in regions where the backbone forms less regular hydrogen-bonding interactions than at interfaces between secondary structure elements, where such interactions are fully developed. This may have important implications on the role of sidechain packing in protein folding and stability.
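The step from cavity likelihood to free energy can be sketched as follows. The abstract does not state the formula, so the standard statistical-mechanical relation ΔG = -RT ln p is assumed here, with p estimated as the fraction of trial insertions along a trajectory that find an empty cavity of the chosen size. The function name and the counts in the example are hypothetical.

```python
import math

GAS_CONSTANT = 0.0019872041  # kcal / (mol K)


def cavity_free_energy(n_empty, n_trials, temperature=298.15):
    """Free energy (kcal/mol) of forming a cavity, assuming dG = -RT ln p,
    where p = n_empty / n_trials is the observed cavity probability."""
    if n_empty <= 0 or n_empty > n_trials:
        raise ValueError("need 0 < n_empty <= n_trials")
    p = n_empty / n_trials
    return -GAS_CONSTANT * temperature * math.log(p)


# Hypothetical counts: a region where cavities are found ten times more
# often has a lower cavity free energy.
rare = cavity_free_energy(10, 100000)
common = cavity_free_energy(100, 100000)
print(rare, common)
```

Read this way, a lower cavity free energy in nonpolar regions corresponds to a higher probability of finding atomic-size empty space there, which is the comparison the abstract draws.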
Standard ranges of atomic and residue volumes are computed in 64 highly resolved and well-refined protein crystal structures using the classical Voronoi procedure. Deviations of the atomic volumes from the standard values, evaluated as the volume Z-scores, are used to assess the quality of protein crystal structures. To score a structure globally, we compute the volume Z-score root mean square deviation (Z-score rms), which measures the average magnitude of the volume irregularities in the structure. We find that the Z-score rms decreases as the resolution and R-factor improve, consistent with the fact that these improvements generally reflect more accurate models. From the Z-score rms distribution in structures with a given resolution or R-factor, we determine the normal limits in Z-score rms values for structures solved at that resolution or R-factor. Structures whose Z-score rms exceeds these limits are considered outliers. Such structures also exhibit unusual stereochemistry, as revealed by other analyses. Absolute Z-scores of individual atoms are used to identify problems in specific regions within a protein model. These Z-scores correlate fairly well with the atomic B-factors, and atoms having absolute Z-scores > 3 occur at or near regions in the model where programs such as PROCHECK identify unusual stereochemistry. Atomic volumes, themselves not directly restrained in crystallographic refinement, can thus provide an independent, rather sensitive, measure of the quality of a protein structure. The volume-based structure validation procedures are implemented in the program PROVE (PROtein Volume Evaluation), which is accessible through the World Wide Web.
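The scoring scheme described above can be sketched in a few lines. This is an illustration rather than the PROVE implementation: the reference means and standard deviations are invented placeholders, the function names are hypothetical, and the Z-score rms is taken here to be the root mean square of the per-atom Z-scores, which is an assumed reading of the abstract.

```python
import math


def volume_z_scores(atoms, reference):
    """atoms: list of (atom_type, volume); reference: atom_type -> (mean, sd).
    Returns one Z-score per atom."""
    return [(volume - reference[kind][0]) / reference[kind][1]
            for kind, volume in atoms]


def z_score_rms(z_scores):
    """Root mean square of the Z-scores (global score of a structure)."""
    return math.sqrt(sum(z * z for z in z_scores) / len(z_scores))


def flag_outliers(z_scores, threshold=3.0):
    """Indices of atoms whose absolute Z-score exceeds the threshold."""
    return [i for i, z in enumerate(z_scores) if abs(z) > threshold]


# Placeholder reference volumes in cubic angstroms (not real standard values).
demo_reference = {"C": (20.0, 2.0), "N": (15.0, 1.0)}
demo_atoms = [("C", 22.0), ("N", 14.0), ("C", 27.0)]

demo_z = volume_z_scores(demo_atoms, demo_reference)
print(demo_z, z_score_rms(demo_z), flag_outliers(demo_z))
```

In this toy input the third atom has a Z-score of 3.5 and is the only one flagged, mirroring the use of absolute Z-scores > 3 to locate problem regions.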
The current status and future outlook of macromolecular structure databases and information handling, with particular reference to European databases, are reviewed. Issues concerning the efficiency with which data are represented, validated, archived and accessed are discussed in view of the fast growing body of information on structures of biological macromolecules.
An automatic procedure for the classification of short protein fragments, representing turn motifs between two consecutive secondary structures, is presented. This procedure has two steps. Fragments of given length are first grouped on the basis of their backbone dihedral angle values, and then clustered as a function of the root-mean-square deviation of their superimposed backbone atoms. The classification procedure identifies 63 families of turn motifs with at least five members, in a data set of 141 proteins. A detailed analysis is presented of the ten identified alpha alpha-turn families, of which four correspond to novel motifs. The sequence and structure features that characterize these families are described. It is found that some features are conserved within the fragments belonging to the same family, but their environment in the parent protein varies considerably. N-capping interactions and helix stop signals are encountered in a number of families, where they seem to stabilize the motif conformation. In two families, one with three residues in the loop, and one with four, an appreciable fraction of the members displays both types of characteristic helix end interactions in the same motif. Interestingly, contrary to most other alpha alpha-turns, the relative frequency of these two motifs is much higher than that of short protein segments with the same loop conformation. Furthermore, the family with three residues in the loop includes the helix-turn-helix motif known to bind DNA. It seems to be the only one among the ten identified families that can be related to biological function.
This paper evaluates the results of a protein structure prediction contest. The predictions were made using threading procedures, which employ techniques for aligning sequences with 3D structures to select the correct fold of a given sequence from a set of alternatives. Nine different teams submitted 86 predictions, on a total of 21 target proteins with little or no sequence homology to proteins of known structure. The 3D structures of these proteins were newly determined by experimental methods, but not yet published or otherwise available to the predictors. The predictions, made from the amino acid sequence alone, thus represent a genuine test of the current performance of threading methods. Only a subset of all the predictions is evaluated here. It corresponds to the 44 predictions submitted for the 11 target proteins seen to adopt known folds. The predictions for the remaining 10 proteins were not analyzed, although weak similarities with known folds may also exist in these proteins. We find that threading methods are capable of identifying the correct fold in many cases, but not reliably enough as yet. Every team correctly predicts a different set of targets, with virtually all targets predicted correctly by at least one team. Also, common folds such as TIM barrels are recognized more readily than folds with only a few known examples. However, quite surprisingly, the quality of the sequence-structure alignments, corresponding to correctly recognized folds, is generally very poor, as judged by comparison with the corresponding 3D structure alignments. Thus, threading cannot presently be relied upon to derive a detailed 3D model from the amino acid sequence. This raises a very intriguing question: how is fold recognition achieved? Our analysis suggests that it may be achieved because threading procedures maximize hydrophobic interactions in the protein core, and are reasonably good at recognizing local secondary structure.