A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method

Kuang, M; Ting, HF

Publisher Website: 10.1145/3388440.3414909
WOS: WOS:000936196100090

Appears in Collections:
- Computer Science: Conference papers

Conference Paper: A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method

Title	A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method
Authors	Kuang, M; Ting, HF
Keywords	Multiple sequence alignment; classification; convolutional neural network; data-centric; decision model
Issue Date	2020
Publisher	Association for Computing Machinery (ACM).
Citation	Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '20), Virtual Conference, USA, 21-24 September 2020, article no. 72. DOI: http://dx.doi.org/10.1145/3388440.3414909
Abstract	Multiple sequence alignment (MSA) is widely used to infer the evolutionary relationships among input sequences and the functional or structural roles of aligned residues. Traditionally, the MSA problem has been tackled by algorithm-centric approaches that apply classical algorithms (such as dynamic programming and divide-and-conquer) and proven strategies (such as progressive, non-progressive, consistency-based, and iterative-refinement strategies). Different single-algorithm MSA methods achieve different accuracies on protein families of different similarity. Therefore, to combine the advantages of different MSA methods, we present a new data-centric pipeline that uses a convolutional neural network (CNN) [3] to choose the better MSA method for protein families of different similarity. An MSA closely resembles a 2D picture with a clear hierarchical structure: its conserved regions and the corresponding conserved columns can be seen as boxes and lines in a picture. CNNs are known to be good at recognizing imperfect, noisy pictures, which suggests they may also perform well at recognizing draft MSAs. Briefly, the method first uses a quick MSA method to construct large-scale draft MSAs from simulated protein families produced by the protein simulation tool INDELible [1]. The key step is training a CNN classifier that takes the draft MSAs as input and outputs the better MSA method. In our research, we simulated more than 640,000 protein families, each containing between 3 and 64 sequences. The fastest (though not accurate) mode of the well-known MSA tool Mafft, FFT-NS-1 [2], was used with default parameters to construct draft MSAs from those families. We treat these MSAs as two-color images, with one color for aligned residues and the other for gaps. Two convolutional layers and a fully connected layer with a dropout rate of 0.5 were used to train the decision model. Preliminary results suggest an accuracy of more than 85% in classifying which of the newest versions of Mafft (L-INS-i) and Mafft (G-INS-i) gives the better alignment. Currently, we are improving the pipeline's performance by selecting better categories for protein families and fine-tuning the decision model.
Description	BCB Poster
Persistent Identifier	http://hdl.handle.net/10722/301185
ISBN	9781450379649
ISI Accession Number ID	WOS:000936196100090
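
The two-color image encoding described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `parse_fasta_alignment` and `msa_to_binary_image`, the FASTA input format, the choice of 1 for residues and 0 for gaps, and zero-padding up to 64 rows (the largest family size mentioned in the abstract) are all hypothetical choices made for the example.

```python
from typing import List

# Assumption: gaps are written as "-" or ".", as many aligners do.
GAP_CHARS = {"-", "."}


def parse_fasta_alignment(text: str) -> List[str]:
    """Return the aligned rows of a FASTA-formatted MSA, in file order."""
    rows: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current:
                rows.append("".join(current))
            current = []
        else:
            current.append(line)
    if current:
        rows.append("".join(current))
    return rows


def msa_to_binary_image(rows: List[str], height: int = 64) -> List[List[int]]:
    """Encode residues as 1 and gaps as 0, then zero-pad to `height` rows.

    Assumption: padding rows reuse the gap value 0, so the padding is
    indistinguishable from all-gap rows in this simple sketch.
    """
    if not rows:
        raise ValueError("alignment is empty")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("aligned rows must all have the same length")
    if len(rows) > height:
        raise ValueError("alignment has more rows than the image height")
    image = [[0 if ch in GAP_CHARS else 1 for ch in row] for row in rows]
    image.extend([0] * length for _ in range(height - len(rows)))
    return image


# Toy draft alignment of three sequences, padded to four rows for display.
draft = ">s1\nMK-LV\n>s2\nM-ALV\n>s3\nMKAL-\n"
for pixel_row in msa_to_binary_image(parse_fasta_alignment(draft), height=4):
    print(pixel_row)
```

A fixed image height is used here because a classifier with a fully connected layer generally expects inputs of a fixed size; how the original work handled families of different sizes and alignment lengths is not stated in this record.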

DC Field	Value	Language
dc.contributor.author	Kuang, M	-
dc.contributor.author	Ting, HF	-
dc.date.accessioned	2021-07-27T08:07:23Z	-
dc.date.available	2021-07-27T08:07:23Z	-
dc.date.issued	2020	-
dc.identifier.citation	Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '20), Virtual Conference, USA, 21-24 September 2020, article no. 72	-
dc.identifier.isbn	9781450379649	-
dc.identifier.uri	http://hdl.handle.net/10722/301185	-
dc.description	BCB Poster	-
dc.language	eng	-
dc.publisher	Association for Computing Machinery (ACM).	-
dc.relation.ispartof	Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '20)	-
dc.subject	Multiple sequence alignment	-
dc.subject	classification	-
dc.subject	convolutional neural network	-
dc.subject	data-centric	-
dc.subject	decision model	-
dc.title	A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method	-
dc.type	Conference_Paper	-
dc.identifier.email	Ting, HF: hfting@cs.hku.hk	-
dc.identifier.authority	Ting, HF=rp00177	-
dc.description.nature	abstract	-
dc.identifier.doi	10.1145/3388440.3414909	-
dc.identifier.hkuros	323530	-
dc.identifier.spage	article no. 72	-
dc.identifier.epage	article no. 72	-
dc.identifier.isi	WOS:000936196100090	-
dc.publisher.place	New York, NY	-
