Conference Paper: IaC-Eval: A Code Generation Benchmark for Cloud Infrastructure-as-Code Programs

Title: IaC-Eval: A Code Generation Benchmark for Cloud Infrastructure-as-Code Programs
Authors: Kon, Patrick Tser Jern; Liu, Jiachen; Qiu, Yiming; Fan, Weijun; Lin, Ting He Lei; Zhang, Haoran; Park, Owen M.; Elengikal, George S.; Kang, Yuxin; Chen, Ang; Chowdhury, Mosharaf; Lee, Myungjin; Wang, Xinyu
Issue Date: 2024
Citation: Advances in Neural Information Processing Systems, 2024, v. 37
Abstract: Infrastructure-as-Code (IaC), an important component of cloud computing, allows the definition of cloud infrastructure in high-level programs. However, developing IaC programs is challenging, complicated by factors that include the burgeoning complexity of the cloud ecosystem (e.g., diversity of cloud services and workloads), and the relative scarcity of IaC-specific code examples and public repositories. While large language models (LLMs) have shown promise in general code generation and could potentially aid in IaC development, no benchmarks currently exist for evaluating their ability to generate IaC code. We present IaC-Eval, a first step in this research direction. IaC-Eval's dataset includes 458 human-curated scenarios covering a wide range of popular AWS services, at varying difficulty levels. Each scenario mainly comprises a natural language IaC problem description and an infrastructure intent specification. The former is fed as user input to the LLM, while the latter is a general notion used to verify if the generated IaC program conforms to the user's intent; by making explicit the problem's requirements that can encompass various cloud services, resources and internal infrastructure details. Our in-depth evaluation shows that contemporary LLMs perform poorly on IaC-Eval, with the top-performing model, GPT-4, obtaining a pass@1 accuracy of 19.36%. In contrast, it scores 86.6% on EvalPlus, a popular Python code generation benchmark, highlighting a need for advancements in this domain. We open-source the IaC-Eval dataset and evaluation framework at https://github.com/autoiac-project/iac-eval to enable future research on LLM-based IaC code generation.
Persistent Identifier: http://hdl.handle.net/10722/362998
ISSN: 1049-5258
2020 SCImago Journal Rankings: 1.399
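
Note on the reported metric: the abstract above cites a pass@1 accuracy of 19.36% for GPT-4, but the record itself does not define pass@1. The sketch below assumes the standard unbiased pass@k estimator of Chen et al. (2021), which most code-generation benchmarks use; the sample counts are purely illustrative and are not taken from the paper.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator (Chen et al., 2021): probability that at
        # least one of k samples, drawn from n generations of which c are
        # correct, passes the check.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Illustrative only (assumed numbers): 458 scenarios with one generation
    # each, so pass@1 reduces to the fraction of scenarios whose generated IaC
    # program satisfies its intent specification (89/458 ~= 19.4%, close to the
    # reported 19.36%).
    per_scenario_correct = [1] * 89 + [0] * 369
    score = sum(pass_at_k(1, c, 1) for c in per_scenario_correct) / len(per_scenario_correct)
    print(f"pass@1 = {score:.2%}")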

 

DC Field | Value | Language
dc.contributor.author | Kon, Patrick Tser Jern | -
dc.contributor.author | Liu, Jiachen | -
dc.contributor.author | Qiu, Yiming | -
dc.contributor.author | Fan, Weijun | -
dc.contributor.author | Lin, Ting He Lei | -
dc.contributor.author | Zhang, Haoran | -
dc.contributor.author | Park, Owen M. | -
dc.contributor.author | Elengikal, George S. | -
dc.contributor.author | Kang, Yuxin | -
dc.contributor.author | Chen, Ang | -
dc.contributor.author | Chowdhury, Mosharaf | -
dc.contributor.author | Lee, Myungjin | -
dc.contributor.author | Wang, Xinyu | -
dc.date.accessioned | 2025-10-10T07:43:58Z | -
dc.date.available | 2025-10-10T07:43:58Z | -
dc.date.issued | 2024 | -
dc.identifier.citation | Advances in Neural Information Processing Systems, 2024, v. 37 | -
dc.identifier.issn | 1049-5258 | -
dc.identifier.uri | http://hdl.handle.net/10722/362998 | -
dc.description.abstract | Infrastructure-as-Code (IaC), an important component of cloud computing, allows the definition of cloud infrastructure in high-level programs. However, developing IaC programs is challenging, complicated by factors that include the burgeoning complexity of the cloud ecosystem (e.g., diversity of cloud services and workloads), and the relative scarcity of IaC-specific code examples and public repositories. While large language models (LLMs) have shown promise in general code generation and could potentially aid in IaC development, no benchmarks currently exist for evaluating their ability to generate IaC code. We present IaC-Eval, a first step in this research direction. IaC-Eval's dataset includes 458 human-curated scenarios covering a wide range of popular AWS services, at varying difficulty levels. Each scenario mainly comprises a natural language IaC problem description and an infrastructure intent specification. The former is fed as user input to the LLM, while the latter is a general notion used to verify if the generated IaC program conforms to the user's intent; by making explicit the problem's requirements that can encompass various cloud services, resources and internal infrastructure details. Our in-depth evaluation shows that contemporary LLMs perform poorly on IaC-Eval, with the top-performing model, GPT-4, obtaining a pass@1 accuracy of 19.36%. In contrast, it scores 86.6% on EvalPlus, a popular Python code generation benchmark, highlighting a need for advancements in this domain. We open-source the IaC-Eval dataset and evaluation framework at https://github.com/autoiac-project/iac-eval to enable future research on LLM-based IaC code generation. | -
dc.language | eng | -
dc.relation.ispartof | Advances in Neural Information Processing Systems | -
dc.title | IaC-Eval: A Code Generation Benchmark for Cloud Infrastructure-as-Code Programs | -
dc.type | Conference_Paper | -
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.scopus | eid_2-s2.0-105000522034 | -
dc.identifier.volume | 37 | -
