Conference Paper: A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method
Title | A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method |
---|---|
Authors | Kuang, M; Ting, HF |
Keywords | Multiple sequence alignment classification convolutional neural network data-centric decision model |
Issue Date | 2020 |
Publisher | Association for Computing Machinery (ACM). |
Citation | Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '20), Virtual Conference, USA, 21-24 September 2020, article no. 72 |
Abstract | Multiple sequence alignment (MSA) is widely used to infer the evolutionary relationships among input sequences and the functional or structural roles of aligned residues. Traditionally, the MSA problem has been tackled by algorithm-centric approaches that apply classical algorithms (such as dynamic programming and divide-and-conquer) and proven strategies (such as progressive, non-progressive, consistency-based, and iterative-refinement strategies). Different single-algorithm MSA methods achieve different accuracies on protein families of different similarity. Therefore, to combine the advantages of different MSA methods, we present a new data-centric pipeline that uses a convolutional neural network (CNN) [3] to choose the better MSA method for protein families of different similarity.
An MSA closely resembles a 2D picture with a good hierarchical structure: the conserved regions and the corresponding conserved columns in an MSA can be seen as boxes and lines in a picture. CNNs are known to be very good at recognizing imperfect, noisy pictures, which suggests they may also perform well at recognizing draft MSAs. Briefly, the method first uses a quick MSA method to construct large-scale draft MSAs from simulated protein families produced by the protein simulation tool INDELible [1]. The key step is training a CNN classifier that takes the draft MSAs as input and outputs the better MSA method.
In our research, we simulated more than 640,000 protein families, each containing between 3 and 64 sequences. The fastest (but less accurate) mode of a widely used MSA tool, Mafft(FFT-NS-1) [2], was run with default parameters to construct draft MSAs from those families. We regard these MSAs as two-color images, with one color for the aligned residues and the other for the gaps. Two CNN layers and a fully connected layer with a dropout rate of 0.5 were used to train the decision model.
The preliminary results suggest more than 85% classification accuracy in choosing the better alignment solution between the newest versions of Mafft(L-INS-i) and Mafft(G-INS-i). Currently, we are improving the performance of this pipeline by selecting better categories for protein families and fine-tuning the decision model. |
Description | BCB Poster |
Persistent Identifier | http://hdl.handle.net/10722/301185 |
ISBN | 9781450379649 |
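The abstract describes treating each draft MSA as a two-color image, with one color for aligned residues and the other for gaps. The following is a minimal Python sketch of that encoding, not the authors' implementation. The function name `msa_to_binary_image`, the gap characters, the choice of 1 for residues and 0 for gaps, and the zero-padding to a fixed canvas are all assumptions made for illustration; the abstract does not state how families of different sizes (3 to 64 sequences) are brought to a common input shape.

```python
from typing import List, Optional

# Assumption: "-" and "." both denote gaps in the aligned rows.
GAP_CHARS = {"-", "."}


def msa_to_binary_image(
    rows: List[str], height: int = 64, width: Optional[int] = None
) -> List[List[int]]:
    """Encode an MSA as a two-color grid: 1 = residue, 0 = gap.

    The grid is zero-padded to `height` x `width` so that alignments of
    different sizes share one shape (hypothetical choice; padding cells
    are indistinguishable from gaps in this simple sketch).
    """
    if not rows:
        raise ValueError("the alignment must contain at least one row")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned rows must have the same length")
    if width is None:
        width = length
    if len(rows) > height or length > width:
        raise ValueError("the alignment does not fit in the requested canvas")

    image = [[0] * width for _ in range(height)]
    for i, row in enumerate(rows):
        for j, symbol in enumerate(row):
            if symbol not in GAP_CHARS:
                image[i][j] = 1
    return image


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    draft = ["MK-LV", "MKALV", "M--LV"]
    for pixel_row in msa_to_binary_image(draft, height=4, width=6):
        print("".join(str(pixel) for pixel in pixel_row))
```

A grid like this could then be fed to an image classifier; the kernel sizes, channel counts, and input dimensions of the CNN itself are not given in the abstract and are left out here.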
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Kuang, M | - |
dc.contributor.author | Ting, HF | - |
dc.date.accessioned | 2021-07-27T08:07:23Z | - |
dc.date.available | 2021-07-27T08:07:23Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '20), Virtual Conference, USA, 21-24 September 2020, article no. 72 | - |
dc.identifier.isbn | 9781450379649 | - |
dc.identifier.uri | http://hdl.handle.net/10722/301185 | - |
dc.description | BCB Poster | - |
dc.language | eng | - |
dc.publisher | Association for Computing Machinery (ACM). | - |
dc.relation.ispartof | Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '20) | - |
dc.subject | Multiple sequence alignment | - |
dc.subject | classification | - |
dc.subject | convolutional neural network | - |
dc.subject | data-centric | - |
dc.subject | decision model | - |
dc.title | A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Ting, HF: hfting@cs.hku.hk | - |
dc.identifier.authority | Ting, HF=rp00177 | - |
dc.description.nature | abstract | - |
dc.identifier.doi | 10.1145/3388440.3414909 | - |
dc.identifier.hkuros | 323530 | - |
dc.identifier.spage | article no. 72 | - |
dc.identifier.epage | article no. 72 | - |
dc.publisher.place | New York, NY | - |