Finding motifs for insufficient number of sequences with strong binding to transcription factor

Chin, FYL; Leung, HCM; Yiu, SM; Lam, TW; Rosenfeld, R; Tsang, WW; Smith, DK; Jiang, Y

Scopus: eid_2-s2.0-2442447220

Appears in Collections:
- Computer Science: Conference papers

Conference Paper: Finding motifs for insufficient number of sequences with strong binding to transcription factor

Title	Finding motifs for insufficient number of sequences with strong binding to transcription factor
Authors	Chin, FYL; Leung, HCM; Yiu, SM; Lam, TW; Rosenfeld, R; Tsang, WW; Smith, DK; Jiang, Y
Keywords	Binding Energy; DNA Microarray; Motif Finding; Transcription Factor
Issue Date	2004
Publisher	ACM.
Citation	The 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2004), San Diego, CA., 27-31 March 2004. In Conference Proceedings. 2004, v. 8, p. 125-132
Abstract	Finding motifs is an important problem in computational biology. Our paper makes two major contributions to this problem. Firstly, we better characterize the types of problem instances that cannot be solved by most existing methods of finding motifs. Secondly, we introduce a different method, which is shown to succeed for various problem instances for which popular existing methods fail. Most existing computational methods for finding motifs are based on the strong-signal model, wherein only strong-signal sequences (i.e. those that are known to contain binding sites very similar to the motif) are considered as input and weak-signal sequences (i.e. those that do not contain any substring similar to the motif) are disregarded. Buhler and Tompa have studied the limitations of methods based on the strong-signal model. They characterized the problem instances for which the motif is unlikely to be found in terms of the number of input (strong-signal) sequences needed, under the assumption that each input sequence contains exactly one binding site. They further gave a method to calculate the minimum number of input sequences required. We re-characterize the limitations of the strong-signal model in terms of the minimum total number of binding sites, rather than the minimum number of strong-signal sequences, required to be in the input data set. To calculate the minimum total number of binding sites required, we represent a motif by a probability matrix instead of a string pattern. This new characterization is shown to be more general and realistic. Next, we introduce a more general and realistic energy-based model, which considers all available sequences (including weak-signal sequences) with varying degrees of binding strength to the transcription factors (as measured experimentally by observed color intensity).
Given varying degrees of binding strength, our model can consider sequences ranging from those that contain more than one binding site to those that are weak-signal sequences. By treating sequences with different degrees of binding strength differently, we develop a heuristic algorithm called EBMF (Energy-Based Motif Finding algorithm) using an EM-like approach to find motifs under our model. This EBMF algorithm can find motifs for data sets that do not even have the required minimum number of binding sites as previously derived for the strong-signal model. Our algorithm compares favorably with the common motif-finding programs AlignACE and MEME, which are based on the strong-signal model. In particular, for some simulated and real data sets, our algorithm finds the motif when both AlignACE and MEME fail to do so.
Persistent Identifier	http://hdl.handle.net/10722/93128
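
The abstract mentions representing a motif by a probability matrix rather than a string pattern. As a rough illustration only, the sketch below scores candidate binding sites against such a matrix using a log-odds ratio. The matrix values, the uniform background, and the names `MOTIF`, `log_odds`, and `best_site` are all hypothetical; this is not the EBMF algorithm or the paper's energy model.

```python
import math

# Hypothetical 4-column motif: each column is a probability
# distribution over the bases A, C, G, T (values are made up).
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def log_odds(site, motif=MOTIF, background=BACKGROUND):
    """Sum of log2(motif probability / background) over the site's bases."""
    if len(site) != len(motif):
        raise ValueError("site length must equal motif width")
    return sum(math.log2(col[base] / background)
               for col, base in zip(motif, site))


def best_site(sequence, motif=MOTIF, background=BACKGROUND):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(motif)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    starts = range(len(sequence) - width + 1)
    # max() over a range returns the first start on ties.
    start = max(starts,
                key=lambda i: log_odds(sequence[i:i + width], motif, background))
    return start, log_odds(sequence[start:start + width], motif, background)


if __name__ == "__main__":
    start, score = best_site("TTACGTTT")
    print(f"best window starts at {start} with log-odds score {score:.3f}")
```

A probability matrix lets every window receive a graded score, whereas a string pattern only distinguishes matches from mismatches; that graded score is what makes it possible to reason about sites of differing strength.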

DC Field	Value	Language
dc.contributor.author	Chin, FYL	en_HK
dc.contributor.author	Leung, HCM	en_HK
dc.contributor.author	Yiu, SM	en_HK
dc.contributor.author	Lam, TW	en_HK
dc.contributor.author	Rosenfeld, R	en_HK
dc.contributor.author	Tsang, WW	en_HK
dc.contributor.author	Smith, DK	en_HK
dc.contributor.author	Jiang, Y	en_HK
dc.date.accessioned	2010-09-25T14:51:44Z	-
dc.date.available	2010-09-25T14:51:44Z	-
dc.date.issued	2004	en_HK
dc.identifier.citation	The 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2004), San Diego, CA., 27-31 March 2004. In Conference Proceedings. 2004, v. 8, p. 125-132	en_HK
dc.identifier.uri	http://hdl.handle.net/10722/93128	-
dc.description.abstract	Finding motifs is an important problem in computational biology. Our paper makes two major contributions to this problem. Firstly, we better characterize the types of problem instances that cannot be solved by most existing methods of finding motifs. Secondly, we introduce a different method, which is shown to succeed for various problem instances for which popular existing methods fail. Most existing computational methods for finding motifs are based on the strong-signal model, wherein only strong-signal sequences (i.e. those that are known to contain binding sites very similar to the motif) are considered as input and weak-signal sequences (i.e. those that do not contain any substring similar to the motif) are disregarded. Buhler and Tompa have studied the limitations of methods based on the strong-signal model. They characterized the problem instances for which the motif is unlikely to be found in terms of the number of input (strong-signal) sequences needed, under the assumption that each input sequence contains exactly one binding site. They further gave a method to calculate the minimum number of input sequences required. We re-characterize the limitations of the strong-signal model in terms of the minimum total number of binding sites, rather than the minimum number of strong-signal sequences, required to be in the input data set. To calculate the minimum total number of binding sites required, we represent a motif by a probability matrix instead of a string pattern. This new characterization is shown to be more general and realistic. Next, we introduce a more general and realistic energy-based model, which considers all available sequences (including weak-signal sequences) with varying degrees of binding strength to the transcription factors (as measured experimentally by observed color intensity).
Given varying degrees of binding strength, our model can consider sequences ranging from those that contain more than one binding site to those that are weak-signal sequences. By treating sequences with different degrees of binding strength differently, we develop a heuristic algorithm called EBMF (Energy-Based Motif Finding algorithm) using an EM-like approach to find motifs under our model. This EBMF algorithm can find motifs for data sets that do not even have the required minimum number of binding sites as previously derived for the strong-signal model. Our algorithm compares favorably with the common motif-finding programs AlignACE and MEME, which are based on the strong-signal model. In particular, for some simulated and real data sets, our algorithm finds the motif when both AlignACE and MEME fail to do so.	en_HK
dc.language	eng	en_HK
dc.publisher	ACM.	-
dc.relation.ispartof	RECOMB 2004 - Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology	en_HK
dc.subject	Binding Energy	en_HK
dc.subject	DNA Microarray	en_HK
dc.subject	Motif Finding	en_HK
dc.subject	Transcription Factor	en_HK
dc.title	Finding motifs for insufficient number of sequences with strong binding to transcription factor	en_HK
dc.type	Conference_Paper	en_HK
dc.identifier.email	Chin, FYL:chin@cs.hku.hk	en_HK
dc.identifier.email	Leung, HCM:cmleung2@cs.hku.hk	en_HK
dc.identifier.email	Yiu, SM:smyiu@cs.hku.hk	en_HK
dc.identifier.email	Lam, TW:twlam@cs.hku.hk	en_HK
dc.identifier.email	Tsang, WW:tsang@cs.hku.hk	en_HK
dc.identifier.email	Smith, DK:dsmith@hkucc.hku.hk	-
dc.identifier.authority	Chin, FYL=rp00105	en_HK
dc.identifier.authority	Leung, HCM=rp00144	en_HK
dc.identifier.authority	Yiu, SM=rp00207	en_HK
dc.identifier.authority	Lam, TW=rp00135	en_HK
dc.identifier.authority	Tsang, WW=rp00179	en_HK
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.scopus	eid_2-s2.0-2442447220	en_HK
dc.identifier.hkuros	86051	en_HK
dc.identifier.hkuros	129061	-
dc.relation.references	http://www.scopus.com/mlt/select.url?eid=2-s2.0-2442447220&selection=ref&src=s&origin=recordpage	en_HK
dc.identifier.volume	8	en_HK
dc.identifier.spage	125	en_HK
dc.identifier.epage	132	en_HK
dc.identifier.scopusauthorid	Chin, FYL=7005101915	en_HK
dc.identifier.scopusauthorid	Leung, HCM=35233742700	en_HK
dc.identifier.scopusauthorid	Yiu, SM=7003282240	en_HK
dc.identifier.scopusauthorid	Lam, TW=7202523165	en_HK
dc.identifier.scopusauthorid	Rosenfeld, R=7201664625	en_HK
dc.identifier.scopusauthorid	Tsang, WW=7201558521	en_HK
dc.identifier.scopusauthorid	Smith, DK=7410351143	en_HK
dc.identifier.scopusauthorid	Jiang, Y=7404832549	en_HK
dc.customcontrol.immutable	sml 151014 - merged	-
