Understanding Missing Proteins: Why Should You Care?

We all know that proteins are important. Proteins are large macromolecules that play important roles in our bodies: they work as enzymes, maintain the structural integrity of our cells, help cells divide, and provide the signals that allow tissues and organ systems to function harmoniously. Understanding what proteins do is thus akin to understanding how the human body works, and that knowledge can help us develop better treatment options.

It is therefore important to learn about all the proteins that are expressed in the body.

Genomics and transcriptomics are not enough

Two developments have catapulted biology into a digital, data-heavy science. The first is genomics, the ability to sequence the entire DNA make-up of the living cells in our body. The second is transcriptomics, the ability to measure which genes are being expressed as intermediaries known as mRNA. mRNA, in turn, undergoes a process called translation to produce proteins.

Unfortunately, it is not enough to know whether a gene is present (in DNA) and expressed in its intermediary form (mRNA). Genes are regulated in complex ways, so more copies of a gene in the genome do not necessarily mean that more protein is produced. The same holds for translation: more copies of an mRNA do not necessarily mean more protein either. To add to this complexity, proteins undergo post-translational modification (PTM), in which various biochemical moieties are added to the protein structure, further increasing the diversity of proteins found in living cells.

While we have about 20,000 genes, the number of distinct proteins we may produce is estimated at 500,000 or more. Since we cannot rely on genomics and transcriptomics alone, we need to assay proteins directly.

It is difficult to observe proteins

The high-throughput assaying of proteins is called proteomics and is achieved with an instrument known as the mass spectrometer. The mass spectrometer (MS) does not report the nature of the proteins directly; instead, it produces complex peak patterns that arise from how proteins are broken down in the instrument.

Finding out which proteins are present in a cell is a reconstruction problem: given a series of mass peaks, we try to piece together which proteins are likely to have given rise to them. This fails when the peak patterns are not informative enough to allow reconstruction. To add to this complexity, the breakdown of proteins is probabilistic, so we do not necessarily observe the same peak patterns even when the proteins are the same. This gives rise to coverage issues (the inability to observe all the proteins) and consistency issues (the inability to reproducibly observe the same proteins).
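
The difference between coverage and consistency can be made concrete with a toy simulation. The sketch below is only an illustration under loud assumptions: the detection model (a protein is seen in a run with a probability proportional to its abundance) and the names simulate_runs, coverage and consistency are hypothetical and are not taken from the original article or from any proteomics software.

```python
import random


def simulate_runs(abundance, n_runs=3, detect_scale=1.0, seed=0):
    """Return the set of proteins detected in each simulated MS run.

    abundance maps a protein name to a relative abundance between 0 and 1.
    ASSUMPTION (toy model): a protein is detected in a run with probability
    min(1, abundance * detect_scale). Real instruments are far more complex.
    """
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        detected = {
            protein
            for protein, level in abundance.items()
            if rng.random() < min(1.0, level * detect_scale)
        }
        runs.append(detected)
    return runs


def coverage(runs, all_proteins):
    """Fraction of all proteins that were seen in at least one run."""
    seen = set().union(*runs)
    return len(seen) / len(all_proteins)


def consistency(runs):
    """Fraction of the detected proteins that were seen in every run."""
    seen = set().union(*runs)
    if not seen:
        return 0.0
    shared = set.intersection(*runs)
    return len(shared) / len(seen)


# Hypothetical sample: one abundant protein and two scarcer ones.
sample = {"abundant": 1.0, "moderate": 0.5, "rare": 0.01}
runs = simulate_runs(sample, n_runs=5, seed=1)
print("coverage:", coverage(runs, sample))
print("consistency:", consistency(runs))
```

In this toy model the abundant protein appears in every run, while the scarcer ones drop in and out between runs. That is the sense in which a protein can be "missing" in routine work even though its existence is not in doubt.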

What is a missing protein?

The official definition of a missing protein (MP), according to the chromosome-centric Human Proteome Project (cHPP), is that it is an unconfirmed genetic sequence for which a protein product has not been detected. It is estimated that MPs account for nearly a fifth of known gene sequences.

The official classification of MPs is based on the five neXtProt Protein Existence (PE) tiers, PE1 to PE5 [1]. PE1 proteins are not MPs because they have protein-level support; the subsequent tiers lack protein-level evidence and rest on increasingly less reliable evidence. However, proteins also go missing in everyday proteomics for a wider variety of reasons (e.g. low abundance, splice variants, and PTMs), so the current framework offers only a limited perspective that does not reflect day-to-day reality. That is, proteins are routinely missing whenever a proteomics experiment is conducted.

A functional perspective on missing proteins

A broader perspective is needed. We propose an expanded functional perspective of MPs that also includes proteins missed in routine experimentation. This matters because resolving the remaining MPs (approximately 18%) will not make proteomics a practical technology if proteins still cannot be observed consistently or quantitated properly.

This new classification is three-tiered, involving missing protein classes (MPC) 1 to 3.

MPC1 includes proteins for which sequence-level evidence is known but which are hard to observe (reasons: limited samples, low abundance, low instrument resolution).

MPC2 includes proteins that must be disambiguated from close relatives: splice variants, sequence family members, and PTM forms (reasons: lack of uniquely defining sequences detectable by MS; peptide-spectrum match (PSM) confidence bottlenecks; search-space problem; cross-interference).

MPC3 includes proteins for which evidence of existence is unknown or highly dubious (reasons: lack of primary sequence or homology information; lack of confident sequence; and individual variation). The traditional notion of an MP fits exactly here.
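
One of the MPC2 reasons, the lack of uniquely defining sequences, can be illustrated with a small in-silico digestion. The sketch below is a minimal example under stated assumptions: it uses the common simplified trypsin rule (cleave after K or R unless the next residue is P), allows no missed cleavages, and keeps only peptides of at least six residues. The function names and the three isoform sequences are invented for illustration and do not correspond to real proteins or to the authors' method.

```python
import re


def tryptic_digest(sequence, min_length=6):
    """Split a protein sequence into tryptic peptides.

    ASSUMPTION: simplified trypsin rule (cleave after K or R, not before P),
    no missed cleavages. Peptides shorter than min_length are discarded.
    """
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in peptides if len(p) >= min_length}


def unique_peptides(isoforms, min_length=6):
    """Map each isoform name to the peptides found in no other isoform."""
    digests = {
        name: tryptic_digest(seq, min_length) for name, seq in isoforms.items()
    }
    unique = {}
    for name, peptides in digests.items():
        others = set().union(
            *(d for other, d in digests.items() if other != name)
        )
        unique[name] = peptides - others
    return unique


# Hypothetical isoforms: two differ at the C-terminus, one is a truncation.
isoforms = {
    "isoform_1": "MAGTLLKDEVSPQRNNWTYEGK",
    "isoform_2": "MAGTLLKDEVSPQRAAFHHLVDK",
    "isoform_3": "MAGTLLKDEVSPQR",
}
result = unique_peptides(isoforms)
for name, peptides in result.items():
    print(name, sorted(peptides))
```

Here isoform_1 and isoform_2 each retain one peptide that no other isoform shares, so in principle they can be told apart. The truncated isoform_3 has no unique peptide at all: every peptide it yields also comes from the longer forms, so peptide evidence alone cannot confirm its presence.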

Resolving missing protein problems based on 3 levels of solutions

Because the MPC classification is problem-oriented, each class can be paired with obvious sources of solutions. These are broadly divisible into three categories: biological (LEVEL 1), technical (LEVEL 2), and analytical, or informatics (LEVEL 3). Each MPC may be tackled by various combinations of Level 1 to 3 solutions.

Biological (LEVEL 1) describes methods that manipulate living systems and includes techniques such as epigenetic manipulation [2], tissue cultures [3], and antibodies [4]. By manipulating living systems, you may deliberately express certain proteins, or increase the levels of a low-abundance protein, making detection more likely.

Technical (LEVEL 2) includes chemical labeling [5], protein isolation and purification [6], and MS-hardware advances [7,8]. One example of a technical enhancement is a better tag-based system that enables higher multiplexing, so that many samples may be analyzed simultaneously. Another is an even higher-resolution instrument that makes low-abundance proteins easier to observe (but note that higher resolution and sensitivity also bring more noise).

Analytical (Informatics) (LEVEL 3) comprises all data-driven solutions, including library search algorithms, de novo sequencing, and statistics. As current proteomic technologies become more advanced and sensitive and are able to obtain more measurements, they also become simultaneously noisier. There is great pressure to provide better informatics solutions in tackling such problems.

In practice, biological and technical advances (Levels 1 and 2) are largely sufficient for identifying unobserved gene sequences, but disambiguating splice variants and identifying completely new sequences are much harder problems. In our opinion, these harder MP issues are only resolvable through better informatics and statistical approaches (LEVEL 3).

Moving on

Understanding the complement of proteins expressed in the body is important for understanding how the body works. Unfortunately, current technology is inadequate for this purpose.

The current definition of an MP is restrictive and does not address the fact that proteins go missing routinely in everyday proteomic work. Moreover, a protein that is extremely difficult to detect by routine means would likely lack real-world applicability as a diagnostic marker anyway.

Rethinking what a missing protein is can help us move forward and supports practical efforts (e.g. biomarker development and drug target identification). We should also be mindful that, while biological and technical enhancements may help resolve these missing protein problems, high-resolution instruments are generating more and more data. To meet the demands of real-world application, informatics-based solutions are the best way forward.

These findings are described in the article entitled Understanding missing proteins: a functional perspective, recently published in the journal Drug Discovery Today. This work was conducted by Longjian Zhou from Tianjin University, Limsoon Wong from National University of Singapore, and Wilson Wen Bin Goh from Nanyang Technological University.


  1. Omenn, G.S. et al. (2015) Metrics for the Human Proteome Project 2015: progress on the human proteome and guidelines for high-confidence protein identification. J. Proteome Res. 14, 3452–3460
  2. Yang, L. et al. (2015) Finding missing proteins from the epigenetically manipulated human cell with stringent quality criteria. J. Proteome Res. 14, 3645–3657
  3. Fridriksdottir, A.J. et al. (2015) Propagation of oestrogen receptor-positive and oestrogen–responsive normal human breast cells in culture. Nat. Commun. 6, 8786
  4. Larsson, K. et al. (2006) Multiplexed PrEST immunization for high-throughput affinity proteomics. J. Immunol. Methods 315, 110–120
  5. Ma, Y. et al. (2017) HILAQ: a novel strategy for newly synthesized protein quantification. J. Proteome Res. 16, 2213–2220
  6. Guo, T. et al. (2015) Rapid mass spectrometric conversion of tissue biopsy samples into permanent quantitative digital proteome maps. Nat. Med. 21, 407–413
  7. Gillet, L.C. et al. (2012) Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell Proteomics 11, O111.016717
  8. Bruderer, R. et al. (2015) Extending the limits of quantitative proteome profiling with data-independent acquisition and application to acetaminophen-treated three-dimensional liver microtissues. Mol. Cell Proteomics 14, 1400–1410