Agentic CLI tools are AI coding tools that can create and delete files, run commands, and plan and implement an entire project. We tested the leading tools on 20 real-world web development scenarios to see which one delivers a production-ready website.
Agentic CLI benchmark results
Analysis and insights
Kiro CLI achieved the highest accuracy in our benchmark with a 77% success rate, excelling in particular at orchestrating interactive elements and complex component logic.
While Aider and Cline built basic structures successfully, they fell short on detailed functional requirements, such as complex form validations and multi-layered navigation menus.
Methodology of agentic CLI benchmark
This benchmark measures the ability of agentic CLI tools to build fully functional, interactive, and production-ready websites. To keep the benchmark fair, we prompted each CLI tool only once and did not send follow-up prompts to resolve errors.
Project pool and model configurations
We selected 20 diverse web projects, ranging from landing pages to complex admin dashboards. To ensure a fair comparison, we used the most capable model for each CLI and used Claude 4.5 Opus for the open-source CLIs:
- Kiro CLI, Aider, and Cline: Tested using Claude 4.5 Opus.
- Claude Code: Tested using Claude 4.5 Opus.
- OpenAI Codex CLI: Tested using OpenAI Codex 5.1.
- Gemini CLI: Tested using Gemini 3 Pro.
Functional checklists (browser-side audit)
Success was measured by human developers evaluating the final output in the browser, not just checking whether the code was syntactically correct. Each project was evaluated against 20 to 30 concrete criteria, focusing on:
- Component functionality: Verifying if navigation links direct correctly, search bars return results, and assets (images/icons) load as expected.
- User interaction: Testing if buttons perform expected actions and if contact forms capture data while providing appropriate user feedback.
- Visual & functional alignment: Ensuring the page layout matches the design requirements and meets responsive (mobile compatibility) standards.
The final scores represent the percentage of checklist items successfully met, averaged across all 20 projects.
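The scoring rule above can be sketched in a few lines. This is a minimal illustration, assuming each project's score is the share of its checklist items that passed and that the final score is the unweighted mean of those per-project percentages. The `ProjectAudit` shape, the function names, and the sample data are hypothetical and are not taken from the benchmark.

```typescript
// Hypothetical shape: one entry per project, with per-criterion pass/fail results.
interface ProjectAudit {
  name: string;
  checks: boolean[]; // true = checklist item met in the browser
}

// Percentage of checklist items met for a single project.
function projectScore(audit: ProjectAudit): number {
  if (audit.checks.length === 0) {
    throw new Error(`Project "${audit.name}" has no checklist items`);
  }
  const passed = audit.checks.filter(Boolean).length;
  return (passed / audit.checks.length) * 100;
}

// Unweighted mean of per-project percentages (assumed aggregation).
function benchmarkScore(audits: ProjectAudit[]): number {
  if (audits.length === 0) {
    throw new Error("No projects to score");
  }
  const total = audits.reduce((sum, a) => sum + projectScore(a), 0);
  return total / audits.length;
}

// Illustrative data only, not results from the benchmark.
const sample: ProjectAudit[] = [
  { name: "landing-page", checks: [true, true, true, false] },
  { name: "admin-dashboard", checks: [true, false, false, false] },
];
console.log(benchmarkScore(sample)); // 50
```

Averaging per-project percentages gives every project equal weight regardless of how many criteria it has. Pooling all checklist items into one ratio would instead weight the projects with longer checklists more heavily.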
Example benchmark task: Online education platform dashboard
One of our tasks is an online education platform dashboard. It requires building a complex dashboard with navigation, progress tracking, and course recommendations. Key excerpts from the task prompt:
Project scope & Tech stack:
“Build a complete LMS where instructors create and sell courses, students watch video lessons and take quizzes, and admins manage the platform. Role-based dashboards, course builders with drag-drop curriculum, video player with progress tracking, quiz system, and certificate generation.”
Specific UI & dashboard requirements:
“Student Dashboard:
- Progress overview card: Total courses enrolled, Completed, In progress
- Learning streak counter (days in a row)
- Continue watching section: Course cards with progress bars and ‘Continue’ button
- Recommended courses (based on enrolled courses)
Sidebar & Navigation: Modular navigation with links for Dashboard, Browse, My Learning, Wishlist, and Certificates.”
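To give a sense of the logic hidden behind a requirement like the learning streak counter, here is a minimal sketch. It assumes activity is recorded as ISO dates (YYYY-MM-DD) and that a streak still counts if the student was last active yesterday. The function names and that grace rule are assumptions for illustration, not part of the task prompt.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Converts an ISO date (YYYY-MM-DD) to a whole number of days since the Unix epoch (UTC).
function toDayNumber(isoDate: string): number {
  const ms = Date.parse(isoDate + "T00:00:00Z");
  if (Number.isNaN(ms)) {
    throw new Error(`Invalid date: ${isoDate}`);
  }
  return Math.floor(ms / DAY_MS);
}

// Counts consecutive active days ending today, or ending yesterday if the
// student has not been active yet today (assumed grace rule).
function learningStreak(activeDates: string[], today: string): number {
  const days = new Set(activeDates.map(toDayNumber));
  let cursor = toDayNumber(today);
  if (!days.has(cursor)) {
    cursor -= 1;
  }
  let streak = 0;
  while (days.has(cursor)) {
    streak += 1;
    cursor -= 1;
  }
  return streak;
}

// Active on Jan 1, then Jan 3 to Jan 5: the gap on Jan 2 ends the streak.
console.log(
  learningStreak(["2025-01-01", "2025-01-03", "2025-01-04", "2025-01-05"], "2025-01-05")
); // 3
```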
Advanced functional logic:
“Complex Features to Implement:
- Drag-and-Drop Curriculum Builder: Use @dnd-kit/core and @dnd-kit/sortable.
- Video Player with Progress Tracking: Save progress every 10 seconds, Resume from last position, Auto-mark complete at 90%.
- Certificate Generation: Template with student name, course title, completion date, and QR code.”
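The video progress requirement is a good example of what the checklist probes. The sketch below shows one way the three rules (save every 10 seconds, resume from the last position, auto-mark complete at 90%) could fit together. It is a simplified illustration under stated assumptions: the interval is measured in playback time rather than wall-clock time, an in-memory `Map` stands in for real persistence, and all names (`ProgressTracker`, `store`, `onTimeUpdate`) are hypothetical.

```typescript
// Thresholds come from the task prompt; every name below is hypothetical.
const SAVE_INTERVAL_SECONDS = 10;
const COMPLETE_THRESHOLD = 0.9;

interface LessonProgress {
  positionSeconds: number; // last saved playback position
  completed: boolean;
}

// Stand-in for a database or API; a real app would persist this server-side.
const store = new Map<string, LessonProgress>();

class ProgressTracker {
  private readonly lessonId: string;
  private readonly durationSeconds: number;
  private lastSavedAt: number;

  constructor(lessonId: string, durationSeconds: number) {
    if (durationSeconds <= 0) {
      throw new Error("durationSeconds must be positive");
    }
    this.lessonId = lessonId;
    this.durationSeconds = durationSeconds;
    this.lastSavedAt = this.resumePosition();
  }

  // Position to seek to when the lesson is reopened (0 if never watched).
  resumePosition(): number {
    return store.get(this.lessonId)?.positionSeconds ?? 0;
  }

  // Call from the player's time-update event with the current playback time.
  onTimeUpdate(currentSeconds: number): void {
    const wasCompleted = store.get(this.lessonId)?.completed ?? false;
    const reachedThreshold =
      currentSeconds / this.durationSeconds >= COMPLETE_THRESHOLD;
    const completed = wasCompleted || reachedThreshold;
    const dueForSave =
      Math.abs(currentSeconds - this.lastSavedAt) >= SAVE_INTERVAL_SECONDS;
    // Save on the interval, and immediately when the lesson first completes.
    if (dueForSave || (completed && !wasCompleted)) {
      store.set(this.lessonId, { positionSeconds: currentSeconds, completed });
      this.lastSavedAt = currentSeconds;
    }
  }
}

const tracker = new ProgressTracker("lesson-1", 600);
tracker.onTimeUpdate(4);   // not saved: under 10 s since the last save
tracker.onTimeUpdate(12);  // saved at 12 s
tracker.onTimeUpdate(545); // saved and marked complete (545 / 600 is above 0.9)
```

Keeping the rules in a class that only receives playback times makes them testable without a browser. In a real player, `onTimeUpdate` would be wired to the video element's time-update event.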
Results of example task
Details of the results of the example task
Kiro CLI vs Gemini CLI
Kiro CLI
As seen above, Kiro CLI delivered a UI with a functional search bar, a comprehensive sidebar, interactive progress cards, and a well-structured “Recommended for You” section. Kiro CLI completed this task with a 95% success rate.
Gemini CLI
In contrast, the Gemini CLI output remained at a skeletal level. It failed to implement the sidebar and search functionality, leaving the user with a mostly empty and non-functional interface. Gemini CLI completed this task with a 10% success rate.
Claude Code
Claude Code is a command-line interface that connects to Claude models.
Claude Code generates a summary at the end of each session that shows activity details. For example, one of our sessions reported a total cost of $0.0556 and an API processing time of 9 seconds.
Pricing & runtime behavior
The tool has a /cost command but no upfront control over spending or session limits.
Claude Code’s $20/month plan covers only a small amount of usage.
Note that there are other tools that can create websites with a single prompt for free.
In our AI code editor benchmark, which was a to-do app test, Claude Code was the top performer, successfully implementing all core features except drag-and-drop.
Gemini CLI
Gemini CLI is an open-source AI agent that provides the capabilities of the advanced Gemini models (e.g., Gemini 3 Pro) directly within the command line.
Pricing & runtime behavior
Gemini CLI offers a free tier that includes 60 requests/min and 1,000 requests/day with a personal Google account.
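For scripts that call a tool with quotas like these, a client-side budget check can avoid hitting the limits mid-run. The sketch below is a minimal illustration that assumes both limits behave as sliding windows. How the service actually counts and resets requests may differ, and the `RequestBudget` class is hypothetical rather than part of Gemini CLI.

```typescript
// Limits quoted above for the free tier; everything else is hypothetical.
const PER_MINUTE_LIMIT = 60;
const PER_DAY_LIMIT = 1000;
const MINUTE_MS = 60 * 1000;
const DAY_MS = 24 * 60 * MINUTE_MS;

class RequestBudget {
  private timestamps: number[] = []; // send times (ms) within the last day

  // Returns true and records the request if both limits allow it at time `now`.
  tryConsume(now: number): boolean {
    // Drop requests older than one day; they no longer count toward either limit.
    this.timestamps = this.timestamps.filter((t) => now - t < DAY_MS);
    const lastMinute = this.timestamps.filter((t) => now - t < MINUTE_MS).length;
    if (this.timestamps.length >= PER_DAY_LIMIT || lastMinute >= PER_MINUTE_LIMIT) {
      return false;
    }
    this.timestamps.push(now);
    return true;
  }
}

const budget = new RequestBudget();
let accepted = 0;
for (let i = 0; i < 100; i++) {
  if (budget.tryConsume(0)) accepted += 1;
}
console.log(accepted); // 60: the per-minute limit applies first
```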
OpenHands
OpenHands (formerly OpenDevin) is an open-source platform designed to create and deploy autonomous AI agents capable of performing comprehensive software development tasks. It is a community-driven project released under the MIT license.
Codex CLI
Codex CLI is an interactive terminal-based coding assistant from OpenAI, providing access to their specialized coding models.
Pricing & runtime behavior
- Security default: By default, the cloud agent’s sandbox is cut off from the internet for security, which may present a friction point for tasks requiring new package installation or external API access.
- Subscription model: Codex is generally not a standalone product but is included as a core feature within paid ChatGPT subscription plans (Plus, Pro, Business, Enterprise), positioning it as a value-add to a broader AI toolkit.
- API alternative: For heavy-duty or custom needs, users can bypass subscription limits and use the pay-as-you-go API, though this introduces less predictable costs based on usage.
Aider
Aider is one of the first open-source AI coding assistants.
Cline CLI
Cline is an open-source AI coding agent that uses a flexible LLM backend to plan and execute complex, multi-step software development tasks within a developer’s IDE (VS Code/JetBrains) or via its new Cline CLI.
As AI-generated code becomes more common, code review tools are essential for catching bugs and vulnerabilities. We evaluated the top tools on 309 PRs in our RevEval benchmark.
AI coding tools can be grouped into three categories:
- CLI-based coding agents: Tools for terminal-based development workflows that generate, edit, and refactor code through prompts and command-line interactions.
  - Examples: Aider, Devin, Claude Code, Codex CLI
- AI code editors: Also known as agentic IDEs, these tools provide a GUI similar to VS Code (most of them are built on VS Code).
  - Examples: GitHub Copilot, Cursor, Replit, Antigravity, and Cline
- Prompt-to-app builders: Low-code/no-code platforms to build apps using natural language prompts and visual workflows.
  - Examples: Bolt, Lovable, v0.dev, Firebase Studio, Dazl
Across tools like Claude Code, Gemini CLI, and OpenHands, common capabilities include:
- End-to-end code work: Create and modify files, fix bugs, refactor code, and run tests or linters directly from the terminal.
- Agentic workflows: Perform multi-step tasks such as task chaining, troubleshooting, search, and iterative debugging.
- Git & project management: Review history, resolve merges, manage branches, and create commits or pull requests.
- Command execution & automation: Run shell commands, automate analyses, and translate natural language into complex CLI operations.
- Deep context handling: Operate on full repositories with awareness of dependencies and project structure.
- Model flexibility: Support multiple cloud and, in some cases, local models; some tools allow using your own API key or choosing between plans.
- Sandboxed or controlled access: Offer modes ranging from read-only to full automation, often with isolated environments for safety.
Read more
For those exploring the broader ecosystem of agentic developer tools, here are our latest benchmarks:
- MCP benchmark: A comparison of the top MCP servers for web access.
- Remote browsers: How emerging browser infrastructure enables AI agents to interact with the web securely.
Principal Analyst
Cem Dilmegani
Cem’s work has been cited by leading global publications including Business Insider, Forbes, and the Washington Post; global firms like Deloitte and HPE; NGOs like the World Economic Forum; and supranational organizations like the European Commission. You can see more reputable companies and resources that referenced AIMultiple.
Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.
He led the technology strategy and procurement of a telco while reporting to the CEO. He also led the commercial growth of the deep tech company Hypatos, which went from zero to a seven-digit annual recurring revenue and a nine-digit valuation within 2 years. Cem’s work at Hypatos was covered by leading technology publications like TechCrunch and Business Insider.
Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.