Conference Paper: MegaPath: Low-Similarity Pathogen Detection from Metagenomic NGS Data
Title | MegaPath: Low-Similarity Pathogen Detection from Metagenomic NGS Data |
---|---|
Authors | Li, D; Leung, HCM; Wong, CK; Zhang, Y; Law, WC; Xin, Y; Luo, R; Ting, HF; Lam, TW |
Issue Date | 2018 |
Publisher | IEEE. The Journal's web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1800307 |
Citation | 2018 IEEE 8th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), Las Vegas, NV, USA, 18-20 October 2018 |
Abstract | Detecting the pathogen, the causal bacterium or virus, of an infection such as pneumonia is an important step in diagnosis. Traditional pathogen detection is time-consuming because an infectious disease may be caused by a wide range of pathogens, which must be checked one by one. This delays treatment and can even lead to mistreatment of patients. Unbiased next-generation sequencing (NGS) can detect DNA fragments (reads) of all species in a metagenomic sample containing a mixture of species. These NGS reads can be classified into different taxa by comparing them with a collection of reference genome sequences, and a pathogen can be detected if some reads match it. In clinical diagnosis, it is important that a classifier detects a significant number of reads supporting the potential pathogen and reports as few false classifications as possible, so that the pathogen receives a high abundance rank. Otherwise, the pathogen cannot be distinguished from background noise, and doctors need a long time to go through a long list of candidates to verify its existence. Existing metagenomic classifiers do not perform well on low-similarity pathogens, i.e., pathogens whose genomes are not similar to the reference. This is because most classifiers construct a characteristic profile (e.g. k-mers) for each reference and assign reads to species by comparing them with the profiles. When the characteristic profile does not match the genome of a low-similarity pathogen, this approach fails and produces many incorrect or nonspecific classifications. Some tools assign reads to reference sequences by local or semi-global alignment. The analysis time is long (over 4 hours for a typical dataset of 1 Gb), but more reads from the pathogen can be assigned correctly. However, the alignment scores of reads are still low for low-similarity pathogens, so these reads cannot be assigned to the pathogen specifically and the number of reads supporting the pathogen remains too low. To detect low-similarity pathogens, we introduce MegaPath for NGS-based pathogen detection. There are two major contributions. First, instead of assigning each read to a reference sequence one by one, MegaPath analyzes all aligned reads globally to determine a subset of reads with confident alignments. MegaPath then reassigns non-specifically aligned reads to species with confident alignments and discards unconfident alignments to avoid potential false classifications. This increases the number of reads supporting the pathogen and reduces the number of false positive assignments. Second, MegaPath adopts a fast alignment-based approach using an enhanced maximum-exact-match prefix seeding strategy and a SIMD-accelerated Smith-Waterman algorithm. Take a metagenomic NGS sample of cerebrospinal fluid (CSF) [1] as an example. The similarity of the pathogen to the reference is 18.7%. Centrifuge [2] and Kraken [4], both based on characteristic profiles, detect 31 and 6 reads from the pathogen, respectively. The abundance rank of the pathogen is 710 and 384, respectively. Thus, the doctor needs to go through a list of 300+ species to find the pathogen. With an alignment process taking 4 hours, SURPI [3] detects 76 reads from the pathogen, and its rank rises to 245. With better alignment tools and global analysis of reads, MegaPath takes less than one hour to detect 608 reads from the pathogen, and its rank is 33. Thus, MegaPath has the best performance among existing software with a reasonable running time. Experimental results for more datasets can be found in the full paper. In addition to detecting pathogens with known reference sequences, MegaPath can also detect pathogens without any similar DNA-level sequences in the reference database, using de novo assembly and protein alignment. |
Persistent Identifier | http://hdl.handle.net/10722/274110 |
ISBN | 9781538685204 |
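
The global reassignment idea described in the abstract can be illustrated with a small sketch. This is not MegaPath's implementation: the function names, the input layout (each read mapped to the set of species it aligns to equally well), and the `min_unique_reads` threshold are all assumptions made for illustration. The sketch treats a species as confident when enough reads align to it alone, moves ambiguous reads to a confident species, discards reads that hit no confident species, and then ranks species by read count.

```python
from collections import Counter


def reassign_reads(alignments, min_unique_reads=2):
    """Toy global reassignment (hypothetical, not MegaPath's actual code).

    alignments: dict mapping read id -> set of species names the read
                aligns to equally well.
    Returns a dict mapping each kept read id -> the single species assigned.
    """
    # Pass 1: reads that align to exactly one species count as specific evidence.
    unique_support = Counter()
    for species_set in alignments.values():
        if len(species_set) == 1:
            unique_support[next(iter(species_set))] += 1

    # Assumption: a species is "confident" once it has enough specific reads.
    confident = {s for s, n in unique_support.items() if n >= min_unique_reads}

    # Pass 2: keep only reads that hit a confident species.
    assigned = {}
    for read_id, species_set in alignments.items():
        candidates = species_set & confident
        if not candidates:
            continue  # discard unconfident alignments
        # Prefer the candidate with the most specific reads;
        # ties go to the alphabetically first name.
        assigned[read_id] = max(sorted(candidates), key=lambda s: unique_support[s])
    return assigned


def abundance_ranks(assigned):
    """Rank species by number of assigned reads (rank 1 = most reads)."""
    counts = Counter(assigned.values())
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {species: rank for rank, (species, _) in enumerate(ordered, start=1)}


# Hypothetical toy input: species A and B have specific reads, C has one.
example = {
    "r1": {"A"}, "r2": {"A"}, "r3": {"A"},
    "r4": {"B"}, "r5": {"B"},
    "r6": {"A", "C"}, "r7": {"B", "C"},
    "r8": {"C"}, "r9": {"A", "B"},
}
assigned = reassign_reads(example)
ranks = abundance_ranks(assigned)
```

In this toy input, species C has only one specific read, so read `r8` is dropped and the ambiguous reads `r6` and `r7` move to A and B. The effect mirrors the abstract's stated goal: more reads support the confident species and fewer spurious species appear in the ranked list.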
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Li, D | - |
dc.contributor.author | Leung, HCM | - |
dc.contributor.author | Wong, CK | - |
dc.contributor.author | Zhang, Y | - |
dc.contributor.author | Law, WC | - |
dc.contributor.author | Xin, Y | - |
dc.contributor.author | Luo, R | - |
dc.contributor.author | Ting, HF | - |
dc.contributor.author | Lam, TW | - |
dc.date.accessioned | 2019-08-18T14:55:17Z | - |
dc.date.available | 2019-08-18T14:55:17Z | - |
dc.date.issued | 2018 | - |
dc.identifier.citation | 2018 IEEE 8th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), Las Vegas, NV, USA, 18-20 October 2018 | - |
dc.identifier.isbn | 9781538685204 | - |
dc.identifier.uri | http://hdl.handle.net/10722/274110 | - |
dc.language | eng | - |
dc.publisher | IEEE. The Journal's web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1800307 | - |
dc.relation.ispartof | IEEE International Conference on Computational Advances in Bio and Medical Sciences Proceedings | - |
dc.relation.ispartof | 2018 IEEE 8th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS) | - |
dc.rights | IEEE International Conference on Computational Advances in Bio and Medical Sciences Proceedings. Copyright © IEEE. | - |
dc.rights | ©2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | - |
dc.title | MegaPath: Low-Similarity Pathogen Detection from Metagenomic NGS Data | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Leung, HCM: cmleung3@hku.hk | - |
dc.identifier.email | Zhang, Y: yifanz@hku.hk | - |
dc.identifier.email | Xin, Y: yxinbal@HKUCC-COM.hku.hk | - |
dc.identifier.email | Luo, R: rbluo@cs.hku.hk | - |
dc.identifier.email | Ting, HF: hfting@cs.hku.hk | - |
dc.identifier.email | Lam, TW: twlam@cs.hku.hk | - |
dc.identifier.authority | Leung, HCM=rp00144 | - |
dc.identifier.authority | Luo, R=rp02360 | - |
dc.identifier.authority | Ting, HF=rp00177 | - |
dc.identifier.authority | Lam, TW=rp00135 | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1109/ICCABS.2018.8541953 | - |
dc.identifier.hkuros | 302243 | - |
dc.identifier.volume | 2018 | - |
dc.publisher.place | United States | - |