START : a parallel signal track analytical research tool for flexible and efficient analysis of genomic data

Zhu, Xinjie; 朱信杰


DOI: 10.5353/th_b5481898

Appears in Collections:
- Computer Science: Theses
- HKU Theses Online

postgraduate thesis: START : a parallel signal track analytical research tool for flexible and efficient analysis of genomic data

Title	START : a parallel signal track analytical research tool for flexible and efficient analysis of genomic data
Authors	Zhu, Xinjie 朱信杰
Issue Date	2015
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Zhu, X. [朱信杰]. (2015). START : a parallel signal track analytical research tool for flexible and efficient analysis of genomic data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5481898
Abstract	Signal Track Analytical Research Tool (START) is a parallel system for analyzing large-scale genomic data. Currently, genomic data analyses are usually performed by using custom scripts developed by individual research groups, and/or by the integrated use of multiple existing tools (such as BEDTools and Galaxy). The goals of START are 1) to provide a single tool that supports a wide spectrum of genomic data analyses that are commonly done by analysts; and 2) to greatly simplify these analysis tasks by means of a simple declarative language (STQL) with which users only need to specify what they want to do, rather than the detailed computational steps as to how the analysis task should be performed. START consists of four major components: 1) A declarative language called Signal Track Query Language (STQL), which is a SQL-like language we specifically designed to suit the needs for analyzing genomic signal tracks. 2) A STQL processing system built on top of a large-scale distributed architecture. The system is based on the Hadoop distributed storage and the MapReduce Big Data processing framework. It processes each user query using multiple machines in parallel. 3) A simple and user-friendly web site that helps users construct and execute queries, upload/download compressed data files in various formats, manage stored data, queries and analysis results, and share queries with other users. It also provides a complete help system, detailed specification of STQL, and a large number of sample queries for users to learn STQL and try START easily. Private files and queries are not accessible by other users. 4) A repository of public data popularly used for large-scale genomic data analysis, including data from ENCODE and Roadmap Epigenomics, that users can use in their analyses.
Degree	Doctor of Philosophy
Subject	Genomics - Data processing; Parallel programming (Computer science)
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/211136
HKU Library Item ID	b5481898

DC Field	Value	Language
dc.contributor.author	Zhu, Xinjie	-
dc.contributor.author	朱信杰	-
dc.date.accessioned	2015-07-07T23:10:45Z	-
dc.date.available	2015-07-07T23:10:45Z	-
dc.date.issued	2015	-
dc.identifier.citation	Zhu, X. [朱信杰]. (2015). START : a parallel signal track analytical research tool for flexible and efficient analysis of genomic data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5481898	-
dc.identifier.uri	http://hdl.handle.net/10722/211136	-
dc.description.abstract	Signal Track Analytical Research Tool (START) is a parallel system for analyzing large-scale genomic data. Currently, genomic data analyses are usually performed by using custom scripts developed by individual research groups, and/or by the integrated use of multiple existing tools (such as BEDTools and Galaxy). The goals of START are 1) to provide a single tool that supports a wide spectrum of genomic data analyses that are commonly done by analysts; and 2) to greatly simplify these analysis tasks by means of a simple declarative language (STQL) with which users only need to specify what they want to do, rather than the detailed computational steps as to how the analysis task should be performed. START consists of four major components: 1) A declarative language called Signal Track Query Language (STQL), which is a SQL-like language we specifically designed to suit the needs for analyzing genomic signal tracks. 2) A STQL processing system built on top of a large-scale distributed architecture. The system is based on the Hadoop distributed storage and the MapReduce Big Data processing framework. It processes each user query using multiple machines in parallel. 3) A simple and user-friendly web site that helps users construct and execute queries, upload/download compressed data files in various formats, manage stored data, queries and analysis results, and share queries with other users. It also provides a complete help system, detailed specification of STQL, and a large number of sample queries for users to learn STQL and try START easily. Private files and queries are not accessible by other users. 4) A repository of public data popularly used for large-scale genomic data analysis, including data from ENCODE and Roadmap Epigenomics, that users can use in their analyses.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Genomics - Data processing	-
dc.subject.lcsh	Parallel programming (Computer science)	-
dc.title	START : a parallel signal track analytical research tool for flexible and efficient analysis of genomic data	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5481898	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_b5481898	-
dc.identifier.mmsid	991005693919703414	-
