Conference Paper: A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method

Title: A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method
Authors: Kuang, M; Ting, HF
Keywords: Multiple sequence alignment; classification; convolutional neural network; data-centric; decision model
Issue Date: 2020
Publisher: Association for Computing Machinery (ACM)
Citation: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '20), Virtual Conference, USA, 21-24 September 2020, article no. 72
Abstract: Multiple sequence alignment (MSA) is widely used to infer the evolutionary relationships among input sequences as well as the functional or structural roles of aligned residues. Traditionally, the MSA problem has been tackled by algorithm-centric approaches that apply classical algorithms (such as dynamic programming and divide-and-conquer) and proven strategies (such as progressive, non-progressive, consistency-based, and iterative-refinement strategies). Different single-algorithm MSA methods achieve different accuracies on protein families of different similarity. Therefore, to combine the advantages of different MSA methods, we present a new data-centric pipeline that uses a convolutional neural network (CNN) [3] to choose the better MSA method for protein families of different similarity. An MSA closely resembles a 2D picture with a clear hierarchical structure: the conserved regions and the corresponding conserved columns in an MSA can be seen as boxes and lines in a picture. CNNs are known to be very good at recognizing imperfect, noisy pictures, which suggests they may perform well at recognizing draft MSAs. Briefly, the method first uses a quick MSA method to construct large-scale draft MSAs from simulated protein families produced by the protein simulation tool INDELible [1]. The key step is training a CNN classifier that takes the draft MSAs as input and outputs the better MSA method. In our research, we simulated more than 640,000 protein families, with the number of sequences ranging from 3 to 64. The fastest (but not accurate) mode of a well-known MSA tool, Mafft (FFT-NS-1) [2], was used with default parameters to construct draft MSAs from those families. We regard these MSAs as two-color images, with one color for the aligned residues and the other for the gaps.
Two convolutional layers and a fully connected layer with a dropout rate of 0.5 were used to train the decision model. Preliminary results suggest more than 85% accuracy in choosing the better alignment solution between the newest versions of Mafft (L-INS-i) and Mafft (G-INS-i). Currently, we are improving the performance of this pipeline by selecting better categories for protein families and fine-tuning the decision model.
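The two-color encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `msa_to_binary_image`, the use of `-` as the gap character, the value 1 for residues and 0 for gaps, and the zero-padding to a fixed canvas are all assumptions made for illustration. The default canvas height of 64 rows only mirrors the stated upper bound on sequences per family.

```python
from typing import List, Optional

GAP = "-"  # assumed gap character in the draft alignment


def msa_to_binary_image(rows: List[str], height: int = 64,
                        width: Optional[int] = None) -> List[List[int]]:
    """Encode aligned sequences as a 2D grid: 1 = residue, 0 = gap.

    Rows and columns beyond the alignment are padded with 0 (an
    assumption; the abstract does not say how sizes are unified).
    """
    if not rows:
        raise ValueError("alignment is empty")
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("all aligned rows must have the same length")
    width = n_cols if width is None else width
    if len(rows) > height or n_cols > width:
        raise ValueError("alignment does not fit in the requested canvas")

    # Start from an all-gap canvas, then mark every aligned residue.
    image = [[0] * width for _ in range(height)]
    for i, row in enumerate(rows):
        for j, symbol in enumerate(row):
            if symbol != GAP:
                image[i][j] = 1
    return image


# Hypothetical three-sequence draft alignment, shown as '#' / '.' pixels.
draft = ["MK-TAY", "MKQTAY", "M--TAF"]
for pixels in msa_to_binary_image(draft, height=4, width=8):
    print("".join("#" if p else "." for p in pixels))
```

A grid like this could then be passed to an image classifier as a single-channel input. The abstract does not state how families of different sizes are brought to a common image shape, so the padding step here is only one plausible choice.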
Description: BCB Poster
Persistent Identifier: http://hdl.handle.net/10722/301185
ISBN: 9781450379649

 

DC Field | Value | Language
dc.contributor.author | Kuang, M | -
dc.contributor.author | Ting, HF | -
dc.date.accessioned | 2021-07-27T08:07:23Z | -
dc.date.available | 2021-07-27T08:07:23Z | -
dc.date.issued | 2020 | -
dc.identifier.citation | Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '20), Virtual Conference, USA, 21-24 September 2020, article no. 72 | -
dc.identifier.isbn | 9781450379649 | -
dc.identifier.uri | http://hdl.handle.net/10722/301185 | -
dc.description | BCB Poster | -
dc.language | eng | -
dc.publisher | Association for Computing Machinery (ACM). | -
dc.relation.ispartof | Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '20) | -
dc.subject | Multiple sequence alignment | -
dc.subject | classification | -
dc.subject | convolutional neural network | -
dc.subject | data-centric | -
dc.subject | decision model | -
dc.title | A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method | -
dc.type | Conference_Paper | -
dc.identifier.email | Ting, HF: hfting@cs.hku.hk | -
dc.identifier.authority | Ting, HF=rp00177 | -
dc.description.nature | abstract | -
dc.identifier.doi | 10.1145/3388440.3414909 | -
dc.identifier.hkuros | 323530 | -
dc.identifier.spage | article no. 72 | -
dc.identifier.epage | article no. 72 | -
dc.publisher.place | New York, NY | -
