SNP selection and classification of genome-wide SNP data using stratified sampling random forests

Wu, Qingyao; Ye, Yunming; Liu, Yang; Ng, Michael K.

Publisher Website: 10.1109/TNB.2012.2214232
Scopus: eid_2-s2.0-84866484354
PMID: 22987127
WOS: WOS:000308959700004

Appears in Collections:
- Mathematics: Journal/Magazine Articles

Title	SNP selection and classification of genome-wide SNP data using stratified sampling random forests
Authors	Wu, Qingyao; Ye, Yunming; Liu, Yang; Ng, Michael K.
Keywords	SNP; Genome-wide association study; random forest; stratified sampling
Issue Date	2012
Citation	IEEE Transactions on Nanobioscience, 2012, v. 11, n. 3, p. 216-227. DOI: http://dx.doi.org/10.1109/TNB.2012.2214232
Abstract	In high-dimensional genome-wide association (GWA) case-control data for complex diseases, a large proportion of single-nucleotide polymorphisms (SNPs) are usually irrelevant to the disease. Simple random sampling in a random forest, using the default mtry parameter to choose the feature subspace, selects too many subspaces that contain no informative SNPs. An exhaustive search for an optimal mtry is often required to include useful, relevant SNPs and to discard the vast number of non-informative ones. However, such a search is too time-consuming to be practical for high-dimensional GWA data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection when generating decision trees in a random forest for high-dimensional GWA data. Our idea is to design an equal-width discretization scheme for informativeness that divides SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace from which a decision tree is generated. This stratified sampling procedure ensures that each subspace contains enough useful SNPs while avoiding the very high computational cost of an exhaustive search for an optimal mtry and maintaining the randomness of a random forest. We use two genome-wide SNP data sets (Parkinson case-control data comprising 408803 SNPs and Alzheimer case-control data comprising 380157 SNPs) to demonstrate that the proposed stratified sampling method is effective and that it generates better random forests, with higher accuracy and a lower error bound, than Breiman's random forest generation method. For the Parkinson data, we also report some interesting genes identified by the method that may be associated with neurological disorders and merit further biological investigation. © 2002-2011 IEEE.
Persistent Identifier	http://hdl.handle.net/10722/276667
ISSN	1536-1241
2023 Impact Factor	3.7
2023 SCImago Journal Rankings	0.659
ISI Accession Number ID	WOS:000308959700004
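
The subspace selection step described in the abstract (equal-width discretization of informativeness, then an equal draw from each group) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how informativeness is scored, so the scores below are placeholder values, and the function names, the group count, the per-group quota, and the handling of groups smaller than the quota are all assumptions made for illustration. Training a decision tree on each subspace is omitted.

```python
import random


def equal_width_groups(scores, n_groups):
    """Split feature indices into n_groups bins of equal score width.

    `scores` holds one informativeness value per SNP; how that value is
    computed is not specified here (assumption: higher means more informative).
    """
    lo, hi = min(scores), max(scores)
    # Fall back to a width of 1.0 when all scores are identical.
    width = (hi - lo) / n_groups or 1.0
    groups = [[] for _ in range(n_groups)]
    for idx, s in enumerate(scores):
        # Clamp so the maximum score lands in the last bin.
        b = min(int((s - lo) / width), n_groups - 1)
        groups[b].append(idx)
    return groups


def stratified_subspace(groups, per_group, rng):
    """Draw the same number of indices from each group and merge them.

    Assumption: a group with fewer than per_group members contributes
    all of its members; empty groups contribute nothing.
    """
    subspace = []
    for g in groups:
        subspace.extend(rng.sample(g, min(per_group, len(g))))
    return subspace


# Hypothetical informativeness scores for 12 SNPs (placeholder values).
scores = [0.0, 0.1, 0.2, 0.3, 0.5, 0.9, 1.0, 0.05, 0.15, 0.45, 0.6, 0.25]
groups = equal_width_groups(scores, n_groups=3)

# One subspace per tree; each would be used to grow one decision tree.
rng = random.Random(0)
subspaces = [stratified_subspace(groups, per_group=2, rng=rng) for _ in range(5)]
```

Because every group contributes to every subspace, the sparsely populated high-score group is represented in each tree, whereas a uniform draw over all SNPs would mostly pick from the crowded low-score group.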

DC Field	Value	Language
dc.contributor.author	Wu, Qingyao	-
dc.contributor.author	Ye, Yunming	-
dc.contributor.author	Liu, Yang	-
dc.contributor.author	Ng, Michael K.	-
dc.date.accessioned	2019-09-18T08:34:17Z	-
dc.date.available	2019-09-18T08:34:17Z	-
dc.date.issued	2012	-
dc.identifier.citation	IEEE Transactions on Nanobioscience, 2012, v. 11, n. 3, p. 216-227	-
dc.identifier.issn	1536-1241	-
dc.identifier.uri	http://hdl.handle.net/10722/276667	-
dc.language	eng	-
dc.relation.ispartof	IEEE Transactions on Nanobioscience	-
dc.subject	SNP	-
dc.subject	Genome-wide association study	-
dc.subject	random forest	-
dc.subject	stratified sampling	-
dc.title	SNP selection and classification of genome-wide SNP data using stratified sampling random forests	-
dc.type	Article	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1109/TNB.2012.2214232	-
dc.identifier.pmid	22987127	-
dc.identifier.scopus	eid_2-s2.0-84866484354	-
dc.identifier.volume	11	-
dc.identifier.issue	3	-
dc.identifier.spage	216	-
dc.identifier.epage	227	-
dc.identifier.isi	WOS:000308959700004	-
dc.identifier.issnl	1536-1241	-
