Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies

Blakeley, Paul; Overton, Ian M.; Hubbard, Simon J.

Publisher Website: 10.1021/pr300411q
Scopus: eid_2-s2.0-84868310579
PMID: 23025403
WOS: WOS:000311190600009
Appears in Collections:
- Surgery: Journal/Magazine Articles

Title	Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies
Authors	Blakeley, Paul; Overton, Ian M.; Hubbard, Simon J.
Keywords	expressed sequence tag; false discovery rate; peptide spectrum match; posterior error probability; proteogenomics
Issue Date	2012
Citation	Journal of Proteome Research, 2012, v. 11, n. 11, p. 5221-5234. DOI: http://dx.doi.org/10.1021/pr300411q
Abstract	Proteogenomics has the potential to advance genome annotation through high quality peptide identifications derived from mass spectrometry experiments, which demonstrate that a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function, discovering novel genes and gene structure that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for groups and individual peptide spectrum matches (PSMs). These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates, leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. This error stems from the inflated and unusual nature of the six-frame database, where for every target sequence there exist five "incorrect" targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modeling and not the search engine. A variety of approaches to limit database size and remove noncoding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported. These results are of importance to groups carrying out proteogenomics, aiming to maximize the validation and discovery of gene structure in sequenced genomes, while still controlling for false positives. © 2012 American Chemical Society.
Persistent Identifier	http://hdl.handle.net/10722/335754
ISSN	1535-3893
2023 Impact Factor	3.8
2023 SCImago Journal Rankings	1.299
ISI Accession Number ID	WOS:000311190600009
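
The target:decoy FDR estimation mentioned in the abstract can be illustrated with a minimal sketch. This is a generic textbook-style estimator, not the authors' pipeline: the function names `target_decoy_qvalues` and `count_accepted`, the (score, is_decoy) input layout, and the simple decoys/targets ratio are all assumptions made for illustration. It shows how the number of accepted PSMs at a fixed threshold depends on how many decoy hits outrank the targets, which is the quantity a large six-frame database would be expected to influence.

```python
from typing import List, Tuple

# Hypothetical input layout: (search score, is_decoy); higher score = better match.
PSM = Tuple[float, bool]


def target_decoy_qvalues(psms: List[PSM]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) for each PSM, sorted by descending score.

    FDR at each score threshold is estimated as decoys / targets among the
    PSMs at or above that threshold (one common simple variant, assumed here).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Cap at 1.0 and avoid division by zero when no target has been seen yet.
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # q-value: the smallest FDR at this threshold or any more permissive one.
    qvals = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvals.append(running_min)
    qvals.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qvals)]


def count_accepted(psms: List[PSM], threshold: float = 0.01) -> int:
    """Count target PSMs whose q-value is at or below the threshold."""
    return sum(
        1
        for _, is_decoy, q in target_decoy_qvalues(psms)
        if not is_decoy and q <= threshold
    )


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    toy = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
    print(count_accepted(toy, threshold=0.01))
    print(count_accepted(toy, threshold=0.25))
```

In this toy input a single decoy ranked third restricts the 1% threshold to the two targets above it, while a looser threshold admits all four targets.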

DC Field	Value	Language
dc.contributor.author	Blakeley, Paul	-
dc.contributor.author	Overton, Ian M.	-
dc.contributor.author	Hubbard, Simon J.	-
dc.date.accessioned	2023-12-28T08:48:30Z	-
dc.date.available	2023-12-28T08:48:30Z	-
dc.date.issued	2012	-
dc.identifier.citation	Journal of Proteome Research, 2012, v. 11, n. 11, p. 5221-5234	-
dc.identifier.issn	1535-3893	-
dc.identifier.uri	http://hdl.handle.net/10722/335754	-
dc.description.abstract	Proteogenomics has the potential to advance genome annotation through high quality peptide identifications derived from mass spectrometry experiments, which demonstrate that a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function, discovering novel genes and gene structure that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for groups and individual peptide spectrum matches (PSMs). These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates, leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. This error stems from the inflated and unusual nature of the six-frame database, where for every target sequence there exist five "incorrect" targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modeling and not the search engine. A variety of approaches to limit database size and remove noncoding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported. These results are of importance to groups carrying out proteogenomics, aiming to maximize the validation and discovery of gene structure in sequenced genomes, while still controlling for false positives. © 2012 American Chemical Society.	-
dc.language	eng	-
dc.relation.ispartof	Journal of Proteome Research	-
dc.subject	expressed sequence tag	-
dc.subject	false discovery rate	-
dc.subject	peptide spectrum match	-
dc.subject	posterior error probability	-
dc.subject	proteogenomics	-
dc.title	Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies	-
dc.type	Article	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1021/pr300411q	-
dc.identifier.pmid	23025403	-
dc.identifier.scopus	eid_2-s2.0-84868310579	-
dc.identifier.volume	11	-
dc.identifier.issue	11	-
dc.identifier.spage	5221	-
dc.identifier.epage	5234	-
dc.identifier.eissn	1535-3907	-
dc.identifier.isi	WOS:000311190600009	-
