Links for fulltext (may require subscription):
- Publisher Website: https://doi.org/10.1145/3397166.3409128
- Scopus: eid_2-s2.0-85093915586
Citations:
- Scopus: 0
Conference Paper: Online scheduling of heterogeneous distributed machine learning jobs
Title | Online scheduling of heterogeneous distributed machine learning jobs |
---|---|
Authors | Zhang, Q; Zhou, R; Wu, C; Jiao, L; Li, Z |
Issue Date | 2020 |
Publisher | Association for Computing Machinery (ACM) |
Citation | Proceedings of the Twenty-first ACM International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing (Mobihoc '20), Virtual Conference, Boston, MA, USA, 11-14 October 2020, p. 111-120 |
Abstract | Distributed machine learning (ML) has played a key role in today's proliferation of AI services. A typical model of distributed ML is to partition training datasets over multiple worker nodes that update model parameters in parallel, adopting a parameter server architecture. ML training jobs are typically resource-elastic: the same job can be completed over different time spans under different resource configurations. A fundamental problem in a distributed ML cluster is how to exploit this demand elasticity and schedule jobs with different resource configurations, such that resource utilization is maximized and the average job completion time is minimized. To address this problem, we propose an online scheduling algorithm that decides the execution time window and the number and type of concurrent workers and parameter servers for each job upon its arrival, with the goal of minimizing the weighted average completion time. Our online algorithm consists of (i) an online scheduling framework that iteratively groups unprocessed ML training jobs into batches, and (ii) a batch scheduling algorithm that configures each ML job to maximize the total weight of the jobs scheduled in the current iteration. Our online algorithm guarantees a parameterized competitive ratio with polynomial time complexity. Extensive evaluations using real-world data demonstrate that it outperforms state-of-the-art schedulers in today's AI cloud systems. (An illustrative sketch of the batching framework follows this table.) |
Persistent Identifier | http://hdl.handle.net/10722/301417 |
ISBN | 9781450380157 |
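
The abstract describes an online framework that repeatedly groups unscheduled jobs into a batch, plus a batch algorithm that picks a resource configuration per job to maximize the total weight of the jobs it schedules. Below is a minimal, hypothetical Python sketch of that batching loop, not the authors' algorithm: the `Job` fields, the single aggregate `capacity`, and the greedy density rule inside `schedule_batch` are simplifying assumptions standing in for the paper's per-type worker/parameter-server configurations and its competitive-ratio machinery.

```python
# Illustrative sketch only; all names and the greedy rule are hypothetical
# stand-ins for the paper's batch scheduling algorithm.
from dataclasses import dataclass
from typing import Iterable, Iterator, List

@dataclass
class Job:
    job_id: int
    weight: float    # priority weight in the weighted completion-time objective
    demand: float    # resource units needed by one candidate configuration (assumed > 0)
    duration: float  # run time under that configuration (unused in this toy version)

def schedule_batch(jobs: List[Job], capacity: float) -> List[Job]:
    """Greedy stand-in for the batch step: pick jobs to (approximately)
    maximize total weight under one aggregate capacity constraint."""
    chosen: List[Job] = []
    used = 0.0
    # Classic density heuristic: highest weight per unit of demand first.
    for job in sorted(jobs, key=lambda j: j.weight / j.demand, reverse=True):
        if used + job.demand <= capacity:
            chosen.append(job)
            used += job.demand
    return chosen

def online_framework(arrivals: Iterable[List[Job]],
                     capacity: float) -> Iterator[List[Job]]:
    """Iteratively collect unscheduled jobs into a batch, as the abstract
    describes, and run the batch scheduler once per round."""
    pending: List[Job] = []
    for new_jobs in arrivals:            # jobs arriving in the current round
        pending.extend(new_jobs)
        scheduled = schedule_batch(pending, capacity)
        scheduled_ids = {j.job_id for j in scheduled}
        pending = [j for j in pending if j.job_id not in scheduled_ids]
        yield scheduled
```

A quick usage example under the same assumptions: jobs 1 and 2 arrive in round 0, job 3 in round 1, with 4 resource units available per round.

```python
rounds = [[Job(1, 3.0, 2.0, 5.0), Job(2, 1.0, 2.0, 4.0)],
          [Job(3, 2.0, 3.0, 6.0)]]
for t, batch in enumerate(online_framework(rounds, capacity=4.0)):
    print(t, [j.job_id for j in batch])
```
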
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zhang, Q | - |
dc.contributor.author | Zhou, R | - |
dc.contributor.author | Wu, C | - |
dc.contributor.author | Jiao, L | - |
dc.contributor.author | Li, Z | - |
dc.date.accessioned | 2021-07-27T08:10:45Z | - |
dc.date.available | 2021-07-27T08:10:45Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Proceedings of the Twenty-first ACM International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing (Mobihoc '20), Virtual Conference, Boston, MA, USA, 11-14 October 2020, p. 111-120 | - |
dc.identifier.isbn | 9781450380157 | - |
dc.identifier.uri | http://hdl.handle.net/10722/301417 | - |
dc.description.abstract | Distributed machine learning (ML) has played a key role in today's proliferation of AI services. A typical model of distributed ML is to partition training datasets over multiple worker nodes that update model parameters in parallel, adopting a parameter server architecture. ML training jobs are typically resource-elastic: the same job can be completed over different time spans under different resource configurations. A fundamental problem in a distributed ML cluster is how to exploit this demand elasticity and schedule jobs with different resource configurations, such that resource utilization is maximized and the average job completion time is minimized. To address this problem, we propose an online scheduling algorithm that decides the execution time window and the number and type of concurrent workers and parameter servers for each job upon its arrival, with the goal of minimizing the weighted average completion time. Our online algorithm consists of (i) an online scheduling framework that iteratively groups unprocessed ML training jobs into batches, and (ii) a batch scheduling algorithm that configures each ML job to maximize the total weight of the jobs scheduled in the current iteration. Our online algorithm guarantees a parameterized competitive ratio with polynomial time complexity. Extensive evaluations using real-world data demonstrate that it outperforms state-of-the-art schedulers in today's AI cloud systems. | -
dc.language | eng | - |
dc.publisher | Association for Computing Machinery (ACM). | - |
dc.relation.ispartof | Proceedings of the Twenty-First International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing | - |
dc.title | Online scheduling of heterogeneous distributed machine learning jobs | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Wu, C: cwu@cs.hku.hk | - |
dc.identifier.authority | Wu, C=rp01397 | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1145/3397166.3409128 | - |
dc.identifier.scopus | eid_2-s2.0-85093915586 | - |
dc.identifier.hkuros | 323514 | - |
dc.identifier.spage | 111 | - |
dc.identifier.epage | 120 | - |
dc.publisher.place | New York, NY | - |