Title  Finding motifs for insufficient number of sequences with strong binding to transcription factor 

Authors  Chin, FYL; Leung, HCM; Yiu, SM; Lam, TW; Rosenfeld, R; Tsang, WW; Smith, DK; Jiang, Y
Keywords  Binding Energy; DNA Microarray; Motif Finding; Transcription Factor
Issue Date  2004 
Publisher  ACM. 
Citation  The 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2004), San Diego, CA, 27-31 March 2004. In Conference Proceedings. 2004, v. 8, p. 125-132
Abstract  Finding motifs is an important problem in computational biology. Our paper makes two major contributions to this problem. Firstly, we better characterize the types of problem instances that cannot be solved by most existing methods of finding motifs. Secondly, we introduce a different method, which is shown to succeed for various problem instances for which popular existing methods fail. Most existing computational methods for finding motifs are based on the strong-signal model, wherein only strong-signal sequences (i.e. those that are known to contain binding sites very similar to the motif) are considered as input and weak-signal sequences (i.e. those that do not contain any substring similar to the motif) are disregarded. Buhler and Tompa have studied the limitations of methods based on the strong-signal model. They characterized the problem instances for which the motif is unlikely to be found in terms of the number of input (strong-signal) sequences needed, under the assumption that each input sequence contains exactly one binding site. They further gave a method to calculate the minimum number of input sequences required. We re-characterize the limitations of the strong-signal model in terms of the minimum total number of binding sites, rather than the minimum number of strong-signal sequences, required to be in the input data set. To calculate the minimum total number of binding sites required, we represent a motif by a probability matrix instead of a string pattern. This new characterization is shown to be more general and realistic. Next, we introduce a more general and realistic energy-based model, which considers all available sequences (including weak-signal sequences) with varying degrees of binding strength to the transcription factors (as measured experimentally by observed color intensity).
Given varying degrees of binding strength, our model can consider sequences ranging from those that contain more than one binding site to those that are weak sequences. By treating sequences with different degrees of binding strength differently, we develop a heuristic algorithm called EBMF (Energy-Based Motif Finding algorithm) using an EM-like approach to find motifs under our model. This EBMF algorithm can find motifs for data sets that do not even have the required minimum number of binding sites as previously derived for the strong-signal model. Our algorithm compares favorably with the common motif-finding programs AlignACE and MEME, which are based on the strong-signal model. In particular, for some simulated and real data sets, our algorithm finds the motif when both AlignACE and MEME fail to do so.
Persistent Identifier  http://hdl.handle.net/10722/93128 
DC Field  Value  Language 

dc.contributor.author  Chin, FYL  en_HK 
dc.contributor.author  Leung, HCM  en_HK 
dc.contributor.author  Yiu, SM  en_HK 
dc.contributor.author  Lam, TW  en_HK 
dc.contributor.author  Rosenfeld, R  en_HK 
dc.contributor.author  Tsang, WW  en_HK 
dc.contributor.author  Smith, DK  en_HK 
dc.contributor.author  Jiang, Y  en_HK 
dc.date.accessioned  2010-09-25T14:51:44Z   
dc.date.available  2010-09-25T14:51:44Z   
dc.date.issued  2004  en_HK 
dc.identifier.citation  The 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2004), San Diego, CA, 27-31 March 2004. In Conference Proceedings. 2004, v. 8, p. 125-132  en_HK 
dc.identifier.uri  http://hdl.handle.net/10722/93128   
dc.description.abstract  Finding motifs is an important problem in computational biology. Our paper makes two major contributions to this problem. Firstly, we better characterize the types of problem instances that cannot be solved by most existing methods of finding motifs. Secondly, we introduce a different method, which is shown to succeed for various problem instances for which popular existing methods fail. Most existing computational methods for finding motifs are based on the strong-signal model, wherein only strong-signal sequences (i.e. those that are known to contain binding sites very similar to the motif) are considered as input and weak-signal sequences (i.e. those that do not contain any substring similar to the motif) are disregarded. Buhler and Tompa have studied the limitations of methods based on the strong-signal model. They characterized the problem instances for which the motif is unlikely to be found in terms of the number of input (strong-signal) sequences needed, under the assumption that each input sequence contains exactly one binding site. They further gave a method to calculate the minimum number of input sequences required. We re-characterize the limitations of the strong-signal model in terms of the minimum total number of binding sites, rather than the minimum number of strong-signal sequences, required to be in the input data set. To calculate the minimum total number of binding sites required, we represent a motif by a probability matrix instead of a string pattern. This new characterization is shown to be more general and realistic. Next, we introduce a more general and realistic energy-based model, which considers all available sequences (including weak-signal sequences) with varying degrees of binding strength to the transcription factors (as measured experimentally by observed color intensity).
Given varying degrees of binding strength, our model can consider sequences ranging from those that contain more than one binding site to those that are weak sequences. By treating sequences with different degrees of binding strength differently, we develop a heuristic algorithm called EBMF (Energy-Based Motif Finding algorithm) using an EM-like approach to find motifs under our model. This EBMF algorithm can find motifs for data sets that do not even have the required minimum number of binding sites as previously derived for the strong-signal model. Our algorithm compares favorably with the common motif-finding programs AlignACE and MEME, which are based on the strong-signal model. In particular, for some simulated and real data sets, our algorithm finds the motif when both AlignACE and MEME fail to do so.  en_HK 
dc.language  eng  en_HK 
dc.publisher  ACM.   
dc.relation.ispartof  RECOMB 2004 - Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology  en_HK 
dc.subject  Binding Energy  en_HK 
dc.subject  DNA Microarray  en_HK 
dc.subject  Motif Finding  en_HK 
dc.subject  Transcription Factor  en_HK 
dc.title  Finding motifs for insufficient number of sequences with strong binding to transcription factor  en_HK 
dc.type  Conference_Paper  en_HK 
dc.identifier.email  Chin, FYL:chin@cs.hku.hk  en_HK 
dc.identifier.email  Leung, HCM:cmleung2@cs.hku.hk  en_HK 
dc.identifier.email  Yiu, SM:smyiu@cs.hku.hk  en_HK 
dc.identifier.email  Lam, TW:twlam@cs.hku.hk  en_HK 
dc.identifier.email  Tsang, WW:tsang@cs.hku.hk  en_HK 
dc.identifier.email  Smith, DK:dsmith@hkucc.hku.hk   
dc.identifier.authority  Chin, FYL=rp00105  en_HK 
dc.identifier.authority  Leung, HCM=rp00144  en_HK 
dc.identifier.authority  Yiu, SM=rp00207  en_HK 
dc.identifier.authority  Lam, TW=rp00135  en_HK 
dc.identifier.authority  Tsang, WW=rp00179  en_HK 
dc.description.nature  link_to_subscribed_fulltext   
dc.identifier.scopus  eid_2-s2.0-2442447220  en_HK 
dc.identifier.hkuros  86051  en_HK 
dc.identifier.hkuros  129061   
dc.relation.references  http://www.scopus.com/mlt/select.url?eid=2-s2.0-2442447220&selection=ref&src=s&origin=recordpage  en_HK 
dc.identifier.volume  8  en_HK 
dc.identifier.spage  125  en_HK 
dc.identifier.epage  132  en_HK 
dc.identifier.scopusauthorid  Chin, FYL=7005101915  en_HK 
dc.identifier.scopusauthorid  Leung, HCM=35233742700  en_HK 
dc.identifier.scopusauthorid  Yiu, SM=7003282240  en_HK 
dc.identifier.scopusauthorid  Lam, TW=7202523165  en_HK 
dc.identifier.scopusauthorid  Rosenfeld, R=7201664625  en_HK 
dc.identifier.scopusauthorid  Tsang, WW=7201558521  en_HK 
dc.identifier.scopusauthorid  Smith, DK=7410351143  en_HK 
dc.identifier.scopusauthorid  Jiang, Y=7404832549  en_HK 
dc.customcontrol.immutable  sml 151014  merged   