research advances

A journey into the expanding protein universe

SBKB [doi:10.1038/fa_sbkb.2010.42]

By providing new structural knowledge, PSI-2 has covered considerable ground in characterizing protein space.

Source: NASA's Hubble Space Telescope

The sequences of millions of proteins are now available, thanks to advances in sequencing technology since the pioneering work of Fred Sanger. For some of these proteins, their structure, localization, function and even kinetics have been characterized, but for the vast majority, all that is known is their translated nucleotide sequence. As with many facets of biology, subsequent grouping into families (based on sequence similarity) should help in the quest to dissect the organization of protein space. Never have we been so well prepared to explore this protein universe: the advent of structural genomics — and, in particular, the second phase of the Protein Structure Initiative (PSI-2) — has already provided incredibly useful insights into structural and functional space.

The aim of the PSI is not to determine three-dimensional (3D) structures for every existing protein but, rather, to determine representative structures for each protein family, which can then be 'leveraged' for computational comparative modeling studies. The leverage of a solved structure corresponds to the number (and quality) of the models that can be generated using that structure as a template, and might be viewed as a measure of the success of a structural genomics endeavor, providing clues to the structure and function of other sequences within the family. Liu et al. 1 introduced the term 'novel leverage' as a measure of the addition of novel structural knowledge, applicable to proteins, domains or residues. When the novel leverage values for all protein structures deposited in the Protein Data Bank (PDB) between September 1, 2000, and August 31, 2006, were analyzed, structural genomics centers had contributed 27% of proteins, with more than 50% of this figure coming from the four large-scale PSI centers (JCSG, MCSG, NESG and NYSGXRC). In 2005, each deposited structural genomics structure provided, on average, a novel leverage of 36 proteins and 6,600 residues, compared with an average leverage of 15 proteins and 3,618 residues provided by structures from other structural biology groups.
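
The distinction between leverage and novel leverage can be made concrete with a small sketch. It is illustrative only and is not the procedure used by Liu et al.: it assumes that each deposited structure comes with a precomputed set of sequences that can be reliably modeled from it, and that a sequence counts as novel leverage only for the first structure able to serve as its template. The function name, structure labels and sequence identifiers are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def leverage_report(
    deposits: List[Tuple[str, Set[str]]]
) -> Dict[str, Tuple[int, int]]:
    """Return (leverage, novel leverage) per structure.

    `deposits` is ordered by deposition date. Each entry pairs a
    structure label with the set of sequences assumed to be modelable
    from that structure (an input this sketch takes as given).
    """
    covered: Set[str] = set()  # sequences that already have a template
    report: Dict[str, Tuple[int, int]] = {}
    for structure_id, modelable in deposits:
        novel = modelable - covered  # not modelable from any earlier structure
        report[structure_id] = (len(modelable), len(novel))
        covered |= modelable
    return report


# Hypothetical toy data: three structures deposited in this order.
deposits = [
    ("structure_A", {"seq1", "seq2", "seq3"}),
    ("structure_B", {"seq2", "seq3", "seq4"}),
    ("structure_C", {"seq1", "seq4"}),
]

report = leverage_report(deposits)
for structure_id, (leverage, novel) in report.items():
    print(f"{structure_id}: leverage={leverage}, novel leverage={novel}")
```

In this toy data the third structure has a leverage of two but a novel leverage of zero, because both of its sequences could already be modeled from earlier deposits. That is the sense in which novel leverage measures added, rather than total, structural knowledge.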

Nair et al. 2 reported that the PSI centers had contributed 7% of all experimental structures deposited in the PDB worldwide up to September 2008, even though the PSI only began contributing structures in 2000. The rate has increased rapidly in recent years, reaching almost 15% in 2008 (the last year for which complete statistics are available). This share is impressive given that it comes largely from the four large-scale PSI centers and is measured against total world output. The contribution of the PSI to generating novel leverage is even more exceptional: PSI-2 centers have contributed 30% of novel leverage since 2005, providing structural templates for more than 300,000 new reliable protein-structure models. Not only has PSI-2 provided more novel leverage, but it has also contributed more coverage than all other US efforts combined. The authors take the optimistic view that, if 'representatives' of protein families continue to be solved by structural genomics centers at the current rate, exploration of the protein universe could be complete within the next two decades.

It is, however, worth noting that the novel leverage of each deposited experimental structure has decreased considerably since 2001, 2 which might imply that most protein structure geometries or folds have already been identified. Indeed, when the PSI determined the 3D structures of representatives from 248 families of protein domains of unknown function (DUFs), 3 73% of the DUF families were found to adopt a previously known fold (e.g. SH3-like barrel, T-fold, TIM barrel). The overall similarity of these DUF families to known folds and families, accompanied by divergence in many local features, led the authors to conclude that they represent very distant homologs, a relationship that can be used to provide clues to their function. Some hypotheses regarding function could be made on the basis of a bound ligand in the structure, or from published literature and database entries, but establishing the actual function requires experimental determination.

Only 27% of the DUF families represent new folds, but this is still highly significant given the dwindling number of new folds currently being identified by the community worldwide. An interesting finding was that over one-third of these new folds harbor fragments (larger than supersecondary elements and typically involving 3-4 secondary elements, with a total length of around 50-100 residues) that are structurally very similar to regions in known folds, but are not necessarily related to them from an evolutionary perspective. The collection of known protein folds might, therefore, be reaching saturation, implying that the number of possible topologies or structures may well be limited.

In his analysis of almost 8 million sequences in the NCBI's nonredundant database of protein sequences, Michael Levitt 4 similarly reports that the number of single domain architecture families (SDAs; those with one region matched by a sequence profile) seems to be reaching a plateau. By contrast, the number of multidomain architecture families (MDAs) is growing rapidly, and new MDA families almost exclusively comprise combinations of domains found within SDA family members. Rather than clustering whole protein sequences, Levitt focused on the occurrence of single domain architectures, which are related, but do not correspond exactly, to structural domains. Structural genomics programs have played an important part in increasing the structural coverage of SDA families (26% of families have a member with a known structure); in turn, the limited number of SDA families has implications for structural genomics. If, as implied, the growth of MDA families stems from combining existing domains from SDA families, then it should, in theory, be feasible to determine a structural representative for each sequence profile.
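
The SDA and MDA bookkeeping described above can be sketched in a few lines. This is a toy illustration and not Levitt's actual pipeline: it assumes each protein has already been reduced to an ordered tuple of matched sequence-profile names, treats each distinct tuple as one architecture family, and uses invented profile labels purely as placeholders.

```python
from typing import Dict, Iterable, Set, Tuple

# An architecture is the ordered tuple of profile names matched on a protein.
Architecture = Tuple[str, ...]


def summarize_architectures(proteins: Iterable[Architecture]) -> Dict[str, int]:
    """Count SDA families, MDA families, and MDAs built only from SDA domains."""
    families: Set[Architecture] = set(proteins)  # one family per distinct tuple
    sda_families = {arch for arch in families if len(arch) == 1}
    mda_families = {arch for arch in families if len(arch) > 1}
    # Profiles that occur on their own in at least one protein.
    sda_profiles = {arch[0] for arch in sda_families}
    mda_from_sda = {
        arch for arch in mda_families
        if all(profile in sda_profiles for profile in arch)
    }
    return {
        "sda_families": len(sda_families),
        "mda_families": len(mda_families),
        "mda_built_from_sda_domains": len(mda_from_sda),
    }


# Hypothetical input: profile matches for six proteins.
proteins = [
    ("kinase",),
    ("SH3",),
    ("SH3",),                  # a second protein with the same architecture
    ("SH3", "kinase"),
    ("SH3", "SH3", "kinase"),
    ("kinase", "PDZ"),         # "PDZ" never occurs alone in this toy set
]

summary = summarize_architectures(proteins)
print(summary)
```

Here two of the three multidomain families are built entirely from domains that also occur alone. The observation reported in the text is that, in the real data, nearly all new MDA families are of this kind.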

The target selection process is fundamental to the impact of structural genomics. DUF families constitute a considerable proportion of uncharacterized protein space and are, therefore, obvious targets for substantially increasing structural coverage. The strategy used during PSI-2 has targeted representatives both from large, structurally uncharacterized protein domain families and from structurally uncharacterized subfamilies in very large and diverse families that show incomplete structural coverage. 5 Consequently, 28% of the domain structures solved by PSI-2 large-scale centers during the first 3 years were defined as being structurally novel, compared with 3% of domains solved worldwide by approaches outside structural genomics, which supports the selection strategy. By proceeding with its targeted selection strategy and continuing to draw on its established pipelines, technologies and expertise, the PSI seems well placed to further, and possibly complete, its exploration of the protein universe in the foreseeable future.

Katrin Legg

  1. J. Liu, G. T. Montelione and B. Rost Novel leverage of structural genomics.

    Nature Biotech. 25, 849-851 (2007). doi:10.1038/nbt0807-849

  2. R. Nair, J. Liu, T.-T. Soong, T. B. Acton, J. K. Everett et al. Structural genomics is the largest contributor of novel structural leverage.

    J. Struct. Funct. Genomics 10, 181-191 (2009). doi:10.1007/s10969-008-9055-6

  3. L. Jaroszewski, Z. Li, S. S. Krishna, C. Bakolitsa, J. Wooley et al. Exploration of uncharted regions of the protein universe.

    PLoS Biology 7, e1000205 (2009). doi:10.1371/journal.pbio.1000205

  4. M. Levitt Nature of the protein universe.

    Proc. Natl Acad. Sci. USA 106, 11079-11084 (2009). doi:10.1073/pnas.0905029106

  5. B. H. Dessailly, R. Nair, L. Jaroszewski, J. E. Fajardo, A. Kouranov et al. PSI-2: structural genomics to cover protein domain family space.

    Structure 17, 869-881 (2009). doi:10.1016/j.str.2009.03.015

