Article: WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel

Title: WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel
Authors: Yang, Duo; Hu, Bing; Liu, An; Jin, A. Long; Yeung, Kwan L.; You, Yang
Keywords: Distributed machine learning; Heterogeneous environment; Parameter server; Stragglers; Synchronous parallel
Issue Date: 1-Sep-2024
Publisher: Elsevier
Citation: Parallel Computing: Systems & Applications, 2024, v. 121
Abstract: Parameter server is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers’ computing capabilities leads to the issue of stragglers, making parameter synchronization challenging. To address this issue, we propose a solution called Worker-Busy Synchronous Parallel (WBSP). This approach eliminates the waiting time of fast workers during the synchronization process and decouples the gradient upload and model download of fast workers into asymmetric parts. By doing so, it allows fast workers to complete multiple steps of local training and upload more gradients to the server, improving computational resource utilization. Additionally, the global model is only updated when the slowest worker uploads the gradients, ensuring the consistency of global models that are pulled down by all workers and the convergence of the global model. Building upon WBSP, we propose an optimized version to further reduce the communication overhead. It enables parallel execution of communication and computation tasks on workers to shorten the global synchronization interval, thereby improving training speed. We conduct theoretical analyses for the proposed mechanisms. Extensive experiments verify that our mechanism can reduce the required time to achieve the target accuracy by up to 60% compared with the fastest method and increase the proportion of computation time from 55%–72% in existing methods to 91%.
Persistent Identifier: http://hdl.handle.net/10722/351119
ISSN: 0167-8191
2023 Impact Factor: 2.0
2023 SCImago Journal Rankings: 0.460
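
The abstract describes the core WBSP scheduling rule in prose: fast workers never idle, they keep taking local training steps and pushing gradients, while the server applies a global update only once the slowest worker's gradient has arrived, so every worker then pulls the same consistent global model. The following minimal, self-contained Python sketch illustrates that idea only; it is not the authors' implementation, and the toy one-parameter model, the speed-based step counts, and the gradient-averaging rule are all assumptions made for illustration.

import random

# Illustrative sketch (not the paper's code) of WBSP-style synchronization:
# workers start each interval from the same global model, faster workers fit
# in more local steps and push more gradients, and the server updates the
# global model only after the slowest worker's gradient has arrived.

LR = 0.1  # learning rate (assumed)

def local_gradient(model):
    """Toy gradient of f(w) = 0.5 * w^2, i.e. grad = w, plus small noise."""
    return model + random.uniform(-0.01, 0.01)

def wbsp_round(model, speeds):
    """One global synchronization interval.

    Each worker performs as many local steps as its relative speed allows
    (the slowest worker performs exactly one), pushing a gradient per step.
    The server buffers all pushed gradients and applies them only once the
    slowest worker has contributed, keeping the pulled model consistent.
    """
    slowest = min(speeds)
    buffered = []
    for speed in speeds:
        # A worker that is k times faster fits about k local steps into the
        # interval defined by the slowest worker's single step.
        local_steps = max(1, int(speed / slowest))
        local_model = model  # every worker starts from the same global model
        for _ in range(local_steps):
            g = local_gradient(local_model)
            buffered.append(g)      # push gradient to the server
            local_model -= LR * g   # keep training locally, no idle waiting
    # Global update happens only now, after the slowest worker's upload.
    return model - LR * sum(buffered) / len(buffered)

if __name__ == "__main__":
    random.seed(0)
    model = 5.0
    speeds = [1.0, 2.0, 3.0, 6.0]  # heterogeneous worker speeds (assumed)
    for r in range(5):
        model = wbsp_round(model, speeds)
        print(f"round {r}: global model = {model:.4f}")

Running the sketch prints one consistent global model per synchronization interval; in the paper's optimized variant, communication and computation on workers additionally overlap to shorten that interval.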

 

DC Field                  Value
dc.contributor.author     Yang, Duo
dc.contributor.author     Hu, Bing
dc.contributor.author     Liu, An
dc.contributor.author     Jin, A. Long
dc.contributor.author     Yeung, Kwan L.
dc.contributor.author     You, Yang
dc.date.accessioned       2024-11-10T00:30:15Z
dc.date.available         2024-11-10T00:30:15Z
dc.date.issued            2024-09-01
dc.identifier.citation    Parallel Computing: Systems & Applications, 2024, v. 121
dc.identifier.issn        0167-8191
dc.identifier.uri         http://hdl.handle.net/10722/351119
dc.description.abstract   Parameter server is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers’ computing capabilities leads to the issue of stragglers, making parameter synchronization challenging. To address this issue, we propose a solution called Worker-Busy Synchronous Parallel (WBSP). This approach eliminates the waiting time of fast workers during the synchronization process and decouples the gradient upload and model download of fast workers into asymmetric parts. By doing so, it allows fast workers to complete multiple steps of local training and upload more gradients to the server, improving computational resource utilization. Additionally, the global model is only updated when the slowest worker uploads the gradients, ensuring the consistency of global models that are pulled down by all workers and the convergence of the global model. Building upon WBSP, we propose an optimized version to further reduce the communication overhead. It enables parallel execution of communication and computation tasks on workers to shorten the global synchronization interval, thereby improving training speed. We conduct theoretical analyses for the proposed mechanisms. Extensive experiments verify that our mechanism can reduce the required time to achieve the target accuracy by up to 60% compared with the fastest method and increase the proportion of computation time from 55%–72% in existing methods to 91%.
dc.language               eng
dc.publisher              Elsevier
dc.relation.ispartof      Parallel Computing: Systems & Applications
dc.rights                 This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject                Distributed machine learning
dc.subject                Heterogeneous environment
dc.subject                Parameter server
dc.subject                Stragglers
dc.subject                Synchronous parallel
dc.title                  WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel
dc.type                   Article
dc.identifier.doi         10.1016/j.parco.2024.103092
dc.identifier.scopus      eid_2-s2.0-85198006976
dc.identifier.volume      121
dc.identifier.issnl       0167-8191
