postgraduate thesis: Statistical analysis of RNA-seq and scRNA-seq expression data

Title: Statistical analysis of RNA-seq and scRNA-seq expression data
Authors: Yip, Shun-hang (葉信恆)
Advisor(s): Sham, PC; Wang, JJ
Issue Date: 2018
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yip, S. [葉信恆]. (2018). Statistical analysis of RNA-seq and scRNA-seq expression data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: RNA-seq is a popular technique that utilizes next-generation sequencing to obtain transcriptome information from a cell population. It generates short sequence reads from the transcriptome, which can be utilized for gene annotation, expression quantification, fusion gene detection, differentially expressed gene (DEG) analysis, etc. This technology can be applied to single cells, which enables more in-depth study of the transcriptome. Single cell RNA-seq (scRNA-seq) enables highly variable transcript discovery, cell subpopulation analysis, etc., in addition to common RNA-seq data analyses. The analysis of RNA-seq data can be separated into two main categories. The first category focuses on the read sequences. These data allow the analysis of alternative splicing, gene annotation, post-transcriptional modifications, gene fusion, etc. The second category focuses on the expression data, which are obtained by counting the number of reads generated from each gene or transcript. Analysis of the expression data includes DEG analysis, highly variable transcript discovery, cell subpopulation analysis, etc. This thesis begins by briefly describing the background of RNA-seq analysis and the commonly utilized pipelines from the first category. The main focus of this thesis is the statistical analysis of the expression data. RNA-seq analysis tools that analyze expression data can often perform DEG analysis, and they have previously been shown to have advantages over one another in different aspects. For example, voom controls false positive rates well, DESeq2 focuses on precision, and edgeR has an advantage in overall accuracy. This prompts the development of a new method that can perform optimally in all of these aspects. On the other hand, scRNA-seq is a newer technology and many tools have been developed recently. Compared to RNA-seq data, scRNA-seq expression matrices contain more zero counts and their expression estimates are often less accurate.
Hence, scRNA-seq analysis methods often emphasize technical noise reduction. When evaluated with DEG analysis, which is a basic statistical test, existing scRNA-seq tools are shown to be inferior to existing RNA-seq methods in controlling false positive rates with real scRNA-seq data. To improve current analysis pipelines, the issue is pinpointed to the normalization and transformation step, which is crucial for the reduction of technical noise. The linear model and normality-based normalization and transformation method (Linnorm) is developed to normalize and transform scRNA-seq data for statistical analyses. Using real RNA-seq and scRNA-seq data, Linnorm is compared with existing normalization methods and shows improvements in multiple aspects.
Degree: Doctor of Philosophy
Subject: Nucleotide sequence - Statistical methods
Dept/Program: Biomedical Sciences
Persistent Identifier: http://hdl.handle.net/10722/266318

 

DC Field | Value | Language
dc.contributor.advisor | Sham, PC | -
dc.contributor.advisor | Wang, JJ | -
dc.contributor.author | Yip, Shun-hang | -
dc.contributor.author | 葉信恆 | -
dc.date.accessioned | 2019-01-18T01:52:02Z | -
dc.date.available | 2019-01-18T01:52:02Z | -
dc.date.issued | 2018 | -
dc.identifier.citation | Yip, S. [葉信恆]. (2018). Statistical analysis of RNA-seq and scRNA-seq expression data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/266318 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Nucleotide sequence - Statistical methods | -
dc.title | Statistical analysis of RNA-seq and scRNA-seq expression data | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Biomedical Sciences | -
dc.description.nature | published_or_final_version | -
dc.identifier.doi | 10.5353/th_991044069403703414 | -
dc.date.hkucongregation | 2018 | -
dc.identifier.mmsid | 991044069403703414 | -
