Thoughts on Agent Tool Calling
1. Introduction
Large language models are no longer used solely as passive text generators. In modern agentic systems, they increasingly function as decision-making and orchestration components that select tools, generate action plans, call external services, and integrate returned results into subsequent reasoning steps. As a result, the interface between the model and the external world has become a central systems problem.
A key source of confusion in this area is the frequent conflation of LLMs with agents. In practical discourse, the two terms are often used interchangeably because an LLM is typically the generative core of an agent system. However, from an analytical perspective, they remain distinct. A classical reinforcement learning agent is commonly conceptualized through an iterative decision process over states, actions, and rewards. By contrast, many contemporary LLM agents do not directly implement such a stateful control policy. Instead, they act as prompt-conditioned text generators that produce structured outputs such as intermediate reasoning traces, plans, or tool-calling instructions. In this setting, the agent's externally meaningful action is often best understood as a sequence of tool calls emitted through text generation.
This observation motivates a broader question: what is the most appropriate interface for connecting LLM agents to tools and environments? Two competing design philosophies have emerged. One advocates returning to the command-line interface, emphasizing simplicity, low overhead, and practical alignment with the model's training distribution. The other favors structured interaction through protocols such as the Model Context Protocol (MCP), which provide explicit schemas, strong typing, and machine-readable service contracts.
This paper examines the trade-offs between these approaches. Our central claim is that the suitability of an interface depends on the structural properties of the task. When an agent performs local, linear, and highly parameterized operations, CLI often provides the most efficient and robust solution. However, as the interaction shifts toward remote services, nested objects, typed queries, and multi-system orchestration, text-only interfaces become increasingly fragile. Structured protocols such as MCP address many of these problems, but at the cost of increased context overhead and token consumption. We further argue that a promising middle ground is emerging through code-centric tooling, where the agent writes code against typed APIs rather than invoking raw tool schemas directly.
The main contributions of this paper are as follows:
- We clarify the distinction between LLMs and LLM-based agents from an action-interface perspective.
- We analyze the operational strengths of CLI for local agent tooling.
- We explain the structural limitations of CLI in the presence of nested data and remote service integration.
- We examine how MCP addresses these limitations through typed and discoverable interfaces, while also introducing context and token overhead.
- We discuss code-centric, language-agnostic APIs as an emerging alternative for improving tool scalability and multi-step composition.
- We propose a hybrid design framework for selecting tool interfaces according to task locality, structural complexity, and context budget.
2. LLMs, Agents, and Tool-Calling Actions
2.1. LLMs and Agents Are Not Identical
Although modern agent systems are usually built around an LLM, the model itself should not be equated with the full agent. In many deployed systems, the LLM serves as a generative component that produces candidate reasoning steps, action plans, or structured tool invocations. The surrounding runtime is responsible for interpreting these outputs, executing the requested operations, and feeding the results back into the model context.
This architecture differs from the standard reinforcement learning view in which an agent is defined by a policy that maps states to actions under an explicit transition process. Many LLM agents do not directly manipulate a formal state representation or optimize over a reward signal in real time. Instead, they repeatedly extend a conversational context with generated text fragments such as think, plan, act, or tool_call blocks. Under this view, the operational action of an LLM agent is not the internal reasoning text itself, but the sequence of tool calls that produces external effects.
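The division of labor between model and runtime described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `tool_call:` and `final:` line prefixes, the `word_count` tool, and the `scripted_model` function, which stands in for a real LLM so that the example is self-contained.

```python
from typing import Callable

# Hypothetical tool registry; the tool name and behavior are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
}

def scripted_model(context: list[str]) -> str:
    """Stand-in for an LLM: request a tool first, then answer from its result."""
    results = [line for line in context if line.startswith("tool_result:")]
    if not results:
        return "tool_call: word_count | the quick brown fox"
    return "final: " + results[-1].split(":", 1)[1].strip()

def run_agent(user_goal: str, max_steps: int = 5) -> tuple[str, list[str]]:
    """Extend the context step by step; only tool calls produce external effects."""
    context = [f"user: {user_goal}"]
    actions: list[str] = []  # the operational actions of the agent
    for _ in range(max_steps):
        output = scripted_model(context)
        context.append(output)
        if output.startswith("tool_call:"):
            body = output[len("tool_call:"):]
            name, arg = (part.strip() for part in body.split("|", 1))
            actions.append(name)
            # The runtime, not the model, executes the tool and feeds back the result.
            context.append(f"tool_result: {TOOLS[name](arg)}")
        elif output.startswith("final:"):
            return output[len("final:"):].strip(), actions
    raise RuntimeError("step budget exhausted")

answer, actions = run_agent("How many words are in the sentence?")
```

In this sketch the list `actions` records the externally meaningful behavior, while the generated text in `context` serves only as working material for the next step.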
2.2. Tool Calling as the Operational Core of Agency
The ability to call tools is therefore the fundamental mechanism through which an LLM agent interacts with the outside world. File operations, shell execution, web navigation, database lookup, and enterprise API access all become instances of tool-mediated action. This perspective shifts the design problem away from language generation alone and toward the structure of the interface that exposes tool capabilities to the model.
In current systems, tool use is often implemented through a special output format that signals a function call. The model emits a structured payload, typically encoded as a JSON-like object, which is then interpreted by the runtime and executed externally. This design enables action, but it also introduces a representational mismatch. Tool invocation relies on special conventions that do not arise naturally in ordinary text corpora, meaning that successful tool use depends not only on language modeling ability but also on specialized fine-tuning and interface design.
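As a hedged illustration of this convention, the following sketch shows how a runtime might interpret a JSON payload emitted by the model. The payload shape (`name` plus `arguments`), the `read_file` tool, and its stub implementation are assumptions chosen for the example and do not describe any particular vendor's format.

```python
import json

def read_file_stub(path: str) -> str:
    """Hypothetical tool; returns canned text instead of touching the disk."""
    return f"<contents of {path}>"

REGISTRY = {"read_file": read_file_stub}

def dispatch(raw_output: str) -> dict:
    """Parse a model-emitted payload and execute the named tool."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        # The model produced text that is not valid JSON.
        return {"ok": False, "error": f"malformed payload: {exc.msg}"}
    if not isinstance(payload, dict):
        return {"ok": False, "error": "payload must be a JSON object"}
    name = payload.get("name")
    args = payload.get("arguments", {})
    if not isinstance(name, str) or name not in REGISTRY:
        return {"ok": False, "error": f"unknown tool: {name!r}"}
    if not isinstance(args, dict):
        return {"ok": False, "error": "arguments must be a JSON object"}
    try:
        return {"ok": True, "result": REGISTRY[name](**args)}
    except TypeError as exc:
        # Raised when argument names do not match the tool signature.
        return {"ok": False, "error": f"bad arguments: {exc}"}

good = dispatch('{"name": "read_file", "arguments": {"path": "a.txt"}}')
bad = dispatch('{"name": "read_file", ')
```

Every branch that returns `"ok": False` corresponds to a way in which the special output convention can fail even when the model's underlying intent was reasonable.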
3. Why CLI Remains Attractive for LLM Agents
3.1. Historical Maturity and Engineering Robustness
The command-line interface has more than five decades of software engineering history and remains one of the most mature and thoroughly validated interaction paradigms in computing. Its persistence is not accidental. CLI offers a simple abstraction for invoking operations, composing commands, inspecting outputs, and handling failures through explicit exit codes and standard error channels.
For LLM agents, CLI has gained renewed relevance because its textual form aligns closely with the model's pretraining distribution. Modern LLMs have absorbed massive quantities of terminal transcripts, shell scripts, open-source repositories, configuration files, and developer discussions. As a result, they are often highly competent at producing shell commands in a zero-shot manner. Operations such as git log, docker ps, directory traversal, file filtering, and process inspection are readily expressible in a format that the model already understands well.
3.2. Suitability for Local, Linear, and Parameterized Tasks
CLI is particularly effective when the agent's task has three properties: locality, linearity, and parameterization.
Locality means the operation concerns resources available in the immediate execution environment, such as files, directories, logs, or processes. Linearity means that the input and output can be represented as relatively simple textual streams. Parameterization means the task can be specified through a finite list of flags, arguments, or pipeline operators.
Typical examples include identifying the programming language distribution of a repository, searching for matching files with regular expressions, extracting selected lines from logs, or listing active containers. For such tasks, wrapping functionality as a CLI command has low implementation cost, fast execution, and straightforward debugging behavior. Because failures usually produce explicit error messages, the agent can often recover quickly through self-correction.
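The recovery behavior described above can be sketched as follows. This is a minimal example under stated assumptions: the two candidate commands are hypothetical and invoke the Python interpreter itself so that the sketch does not depend on any particular shell utility, and the fixed candidate list stands in for the model proposing a corrected command after reading the error.

```python
import subprocess
import sys

def run_command(argv: list[str], timeout: float = 10.0) -> dict:
    """Run one command and return the signals a CLI exposes to the agent."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
    }

def first_success(candidates: list[list[str]]) -> tuple[dict, list[str]]:
    """Try candidate commands in order, collecting stderr from each failure."""
    feedback: list[str] = []
    for argv in candidates:
        result = run_command(argv)
        if result["exit_code"] == 0:
            return result, feedback
        # A non-zero exit code plus stderr is the explicit error signal.
        feedback.append(result["stderr"])
    raise RuntimeError("all candidates failed: " + "; ".join(feedback))

# Hypothetical commands: the first fails with a message, the second succeeds.
failing = [sys.executable, "-c", "import sys; sys.exit('bad flag')"]
working = [sys.executable, "-c", "print('3 files matched')"]
result, feedback = first_success([failing, working])
```

In a real agent loop the entries of `feedback` would be appended to the model context, giving the model the concrete error text it needs to revise the command.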
3.3. Low Context Overhead
Another major advantage of CLI is that the interface description can be minimal. In many cases, an agent can infer the expected command structure directly from prior knowledge or a short help message, without requiring a large tool schema to be loaded into context. This property makes CLI especially attractive in settings where token budget, latency, and inference cost are tightly constrained.
4. Structural Limitations of CLI for Complex Services
4.1. Text Is a Presentation Layer, Not a Native Structural Interface
Despite its strengths, CLI becomes problematic when the agent must interact with services whose inputs and outputs are deeply structured. A text terminal is fundamentally a presentation medium: it renders a flat stream of characters as rows and columns. This is well suited to human inspection and pattern recognition, but it does not natively preserve the hierarchical semantics required for deterministic machine parsing.
Modern software services increasingly communicate through nested JSON objects, arrays, enums, references, and complex DTOs. A typical REST or GraphQL response may contain dozens of related fields spanning multiple levels of hierarchy. When such objects are surfaced through a CLI wrapper, they are either flattened into text or handed off to downstream string processing utilities such as jq, awk, sed, or regular-expression based extraction. For an LLM agent, this introduces fragility at precisely the point where correctness matters most.
4.2. Loss of Type Safety
The central problem is the collapse of typed structure into weakly typed strings. Once booleans, integers, arrays, and object references are converted into textual fragments, the agent must reconstruct their meaning heuristically. In long interaction chains, this weakens state integrity and increases the probability of silent failure.
Consider an agent that must retrieve a complex user profile from a remote service, extract three nested attributes, and pass them to a subsequent system. In a CLI-centered workflow, the model often has to generate brittle text-processing pipelines to isolate the desired values. Small changes in spacing, column ordering, or output formatting can break the workflow. Because the parsing logic lives in model-generated text rather than being enforced by a schema, even minor service changes can propagate downstream and corrupt the workflow.
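The fragility of text-based extraction can be made concrete with a small sketch. The record layout, field names, and sample outputs below are hypothetical. The example contrasts a whitespace-based parser, of the kind an agent might improvise for CLI output, with extraction from a nested JSON document.

```python
import json

def parse_by_columns(line: str) -> dict:
    """Brittle: assumes 'id name active' separated by whitespace."""
    user_id, name, active = line.split()
    # Types must be reconstructed heuristically from strings.
    return {"id": int(user_id), "name": name, "active": active == "true"}

def parse_structured(document: str) -> dict:
    """Robust to key order and spacing; JSON carries the types itself."""
    record = json.loads(document)
    return {
        "id": record["id"],
        "name": record["profile"]["name"],
        "active": record["profile"]["flags"]["active"],
    }

original = parse_by_columns("42 alice true")
# The service starts capitalizing booleans: no exception, wrong value.
silently_wrong = parse_by_columns("42 alice True")

nested = '{"profile": {"flags": {"active": true}, "name": "alice"}, "id": 42}'
structured = parse_structured(nested)
```

Here `silently_wrong["active"]` evaluates to `False` without raising any error, which is the kind of silent failure discussed above, whereas the structured path returns the boolean exactly as the service encoded it.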
4.3. Poor Fit for Remote, High-Complexity Endpoints
These issues become more severe when the agent operates across multiple remote systems. Enterprise APIs, customer relationship platforms, ticketing systems, cloud dashboards, and internal data services usually expose objects with rich schemas and strict field-level semantics. Attempting to mediate these services through plain CLI text often produces an unnecessary serialization and deserialization cycle: structured data are flattened for display, then reinterpreted by the model, then possibly re-serialized for the next call. This conversion pipeline increases both error surface and cognitive load for the model.
The emergence of universal CLI bridges that wrap structured tools inside shell commands reflects this tension. Although such adapters can improve convenience, they do not eliminate the underlying structural mismatch. Instead, they often reintroduce complexity through DTO serialization, parsing conventions, and tool-specific formatting rules.
5. MCP as a Structured Interface for Agent Tooling
5.1. Core Design Motivation
The Model Context Protocol was proposed to standardize communication between LLMs and external tools, data sources, and services. Its value lies not primarily in local execution speed, but in providing a machine-readable, discoverable, and strongly typed bridge between the model and structured systems.
Under MCP, tools are exposed through explicit schemas that define input parameters, required fields, expected types, and return structures. Instead of inferring these properties from ad hoc text output, the agent receives a formal description of tool capabilities. This reduces ambiguity and enables a more principled interaction model.
5.2. Native Handling of Nested Objects and Typed Queries
The key advantage of MCP is that hierarchical structure is preserved rather than flattened away. If a remote service returns a complex organization tree, a nested issue object, or a customer record with typed subfields, the protocol can represent that structure directly. The model no longer needs to infer object boundaries from formatting conventions or reconstruct field meanings through textual heuristics.
This has important implications for reliability. When an agent calls a tool such as salesforce.updateRecord, it can be informed that recordId is a string, data is an object with required keys, and the returned value contains typed status metadata. This typed interface is especially important for multi-system workflows in which outputs from one service must be passed as valid inputs into another.
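To illustrate, the sketch below describes the `salesforce.updateRecord` example as a tool descriptor whose input schema is written in JSON Schema style, and checks arguments against it before execution. The descriptor contents are assumptions: the `description` text and the required `status` key are hypothetical, and the validator is a hand-written illustration covering only `type`, `required`, and `properties`, not a complete JSON Schema implementation.

```python
UPDATE_RECORD_TOOL = {
    "name": "salesforce.updateRecord",
    "description": "Update fields on an existing record.",  # hypothetical text
    "inputSchema": {
        "type": "object",
        "required": ["recordId", "data"],
        "properties": {
            "recordId": {"type": "string"},
            "data": {
                "type": "object",
                "required": ["status"],  # hypothetical required key
                "properties": {"status": {"type": "string"}},
            },
        },
    },
}

_PY_TYPES = {"string": str, "object": dict, "integer": int, "boolean": bool}

def validate(value, schema: dict, path: str = "$") -> list[str]:
    """Return a list of human-readable violations; empty means valid."""
    kind = schema["type"]
    wrong_type = not isinstance(value, _PY_TYPES[kind])
    # bool is a subclass of int in Python, so reject it for integers.
    if wrong_type or (kind == "integer" and isinstance(value, bool)):
        return [f"{path}: expected {kind}"]
    errors: list[str] = []
    if kind == "object":
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub_schema, f"{path}.{key}"))
    return errors

schema = UPDATE_RECORD_TOOL["inputSchema"]
ok_errors = validate({"recordId": "001", "data": {"status": "Closed"}}, schema)
bad_errors = validate({"recordId": 1, "data": {}}, schema)
```

Because violations are reported with a path into the argument object, the runtime can reject an invalid call before it reaches the remote service and return a precise correction hint to the model.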
5.3. Improved Discoverability and Safer Integration
Structured tool descriptions also improve discoverability and governance. Because the protocol exposes tool signatures explicitly, it becomes easier to enumerate available capabilities, validate arguments before execution, and apply policy controls around access patterns. In enterprise settings, this can support auditability, safer delegation, and more controlled integration with internal systems.
6. The Cost of Traditional MCP: Context and Token Inflation
6.1. Schema Loading Overhead
The most serious criticism of traditional MCP-style tooling is economic rather than conceptual. In many implementations, the client injects detailed tool schemas, argument descriptions, and return specifications directly into the model's context window at session start. In a composite enterprise environment containing code repositories, ticketing systems, CRM platforms, cloud services, and internal databases, the total schema payload can become very large.
This overhead creates a mismatch between representational rigor and context efficiency. For simple tasks that require only a single local action, the cost of loading large tool descriptions may far exceed the cost of the task itself. In such settings, a direct CLI command can be substantially cheaper and faster.
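A rough back-of-the-envelope comparison makes the mismatch visible. The sketch below rests on loudly stated assumptions: the figure of about four characters per token is only a crude heuristic that varies by tokenizer and content, the catalog of 50 tools is invented, and each tool descriptor is a hypothetical one with two parameters.

```python
import json

def approx_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude estimate; real tokenizers vary by model and content."""
    return max(1, round(len(text) / chars_per_token))

def make_tool_schema(index: int) -> dict:
    """Build a hypothetical tool descriptor with two typed parameters."""
    return {
        "name": f"service_{index}.update_record",
        "description": "Update fields on an existing record and return its status.",
        "inputSchema": {
            "type": "object",
            "required": ["record_id", "data"],
            "properties": {
                "record_id": {"type": "string", "description": "Record identifier."},
                "data": {"type": "object", "description": "Fields to update."},
            },
        },
    }

# Assumed catalog size; a real deployment may expose more or fewer tools.
catalog = [make_tool_schema(i) for i in range(50)]
schema_cost = approx_tokens(json.dumps(catalog))
cli_cost = approx_tokens("git log --oneline -n 5")
ratio = schema_cost / cli_cost
print(f"schemas ~{schema_cost} tokens, CLI command ~{cli_cost} tokens, ~{ratio:.0f}x")
```

The absolute numbers printed by this sketch should not be read as measurements. The point is only that the schema payload is paid at session start regardless of whether the task ever uses more than one tool.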
6.2. Context Rot and Tool Selection Degradation
Beyond financial cost, large schema payloads can degrade model performance. When the context window is saturated with tool definitions, the model must spread its attention across a large volume of auxiliary material that may be irrelevant to the immediate task. This raises the risk that core system instructions, user goals, or critical tool constraints will receive too little attention.
The practical consequence is a form of context degradation in which tool selection becomes less reliable, parameter inference worsens, and multi-step execution becomes more error-prone. The problem is not that structured schemas are inherently harmful, but that indiscriminate schema injection into the prompt turns structural clarity into context burden.
6.3. The Economic Pressure Toward Simplicity
These inefficiencies help explain why many developers continue to prefer CLI even when a more formal protocol exists. The issue is not merely conservatism or familiarity. Rather, there is a concrete trade-off between structural precision and token economy. If a task is local, simple, and short-lived, the additional overhead of a full schema-driven protocol may be difficult to justify.
7. Code-Centric Tool Interfaces as an Emerging Alternative
7.1. From Raw Tool Calls to Code-Level Abstraction
A promising alternative is to expose tools not as raw function-calling schemas, but as code-native APIs that the agent can manipulate directly. In this model, the LLM writes code against a library interface instead of repeatedly issuing tool calls through a schema-mediated conversational loop.
This approach offers two important advantages. First, agents often handle larger and more complex toolsets more effectively when tools are presented through familiar programming abstractions rather than through specialized tool-calling formats. One plausible reason is that modern LLMs have extensive exposure to real-world source code and software libraries during pretraining. By contrast, explicit tool-calling formats rely on special conventions and synthetic fine-tuning data that may not be as deeply represented in the training distribution.
Second, code-centric interfaces are especially effective when a task requires chaining multiple operations. In a traditional tool-calling loop, the output of each tool invocation is returned to the model context so that the model can copy relevant values into the next call. This repeated round-trip consumes tokens, adds latency, and competes for attention. When the model can instead write a short program, intermediate values can remain inside the program state rather than cycling through the context window. The model only needs to consume the final result or any exception requiring intervention.
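The chaining argument can be sketched with a small program of the kind an agent might write against a typed library. All names here are hypothetical: `FakeCrm` and `FakeTickets` are in-memory stand-ins for remote services, and the rule that gold-tier customers receive high priority is invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Customer:
    customer_id: str
    email: str
    tier: str

@dataclass(frozen=True)
class Ticket:
    ticket_id: int
    customer_id: str
    priority: str

class FakeCrm:
    """In-memory stand-in for a remote CRM client (hypothetical API)."""
    def get_customer(self, customer_id: str) -> Customer:
        return Customer(customer_id, f"{customer_id}@example.com", "gold")

class FakeTickets:
    """In-memory stand-in for a ticketing client (hypothetical API)."""
    def __init__(self) -> None:
        self.created: list[Ticket] = []

    def open_ticket(self, customer_id: str, priority: str) -> Ticket:
        ticket = Ticket(len(self.created) + 1, customer_id, priority)
        self.created.append(ticket)
        return ticket

def escalate(crm: FakeCrm, tickets: FakeTickets, customer_id: str) -> str:
    """Chain dependent calls; intermediate objects never enter the context."""
    customer = crm.get_customer(customer_id)  # stays in program state
    priority = "high" if customer.tier == "gold" else "normal"
    ticket = tickets.open_ticket(customer.customer_id, priority)
    # Only this short summary needs to be returned to the model.
    return f"ticket {ticket.ticket_id} opened with priority {ticket.priority}"

summary = escalate(FakeCrm(), FakeTickets(), "c-17")
```

In a schema-mediated loop, the full `Customer` record would be serialized into the context so that the model could copy `customer_id` and `tier` into the next call. Here only the one-line `summary` needs to be read back.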
7.2. Why Language-Agnostic Code APIs Are Appealing
Language-agnostic code interfaces provide a useful compromise between human readability and machine structure. They preserve callable methods, object properties, and data organization in a format that many LLMs appear to handle well. Unlike raw JSON schemas, code APIs allow the model to operate in a representational form that is both structurally expressive and naturally embedded in widely available software corpora.
In effect, code becomes a compression layer over tool interaction. Rather than loading every possible tool schema into the active context, the agent can import or reference a library and compose the necessary calls inside code. This shifts part of the burden from prompt-level description to executable abstraction. Importantly, this design should be understood as language-agnostic. Different agent frameworks may realize it through different implementation languages and runtime designs without changing the core architectural idea.
7.3. Remaining Constraints
However, code-centric tooling is not a universal replacement for MCP or CLI. Writing code introduces its own execution environment assumptions, debugging demands, and security considerations. It also presupposes that the runtime can safely execute generated programs or evaluate them within a controlled sandbox. Therefore, code-level abstraction should be understood as an additional design option rather than a complete substitute for other interfaces.
8. A Hybrid Design Framework for Agent Tooling
The preceding analysis suggests that no single interface is optimal across all task regimes. Instead, interface choice should be guided by the locality of execution, the structural complexity of the data, and the acceptable context budget.
- CLI is best suited for local, linear, parameterized tasks with simple textual inputs and outputs. It is particularly effective for developer workflows, filesystem interaction, shell operations, and quick environment inspection.
- MCP is better suited for remote and heterogeneous systems where preserving type information, nested structure, and discoverable capability descriptions is essential. It is especially valuable when agents must interact with enterprise services, orchestrate across multiple APIs, or manipulate structured DTOs reliably.
- Code-centric APIs provide an attractive option for workflows that require multi-step composition, intermediate state handling, and repeated tool chaining. They are especially promising when the runtime can safely support generated code and when the task benefits from keeping intermediate values out of the conversational context.
Based on these observations, we propose the following design principle: agent systems should adopt a tiered interface architecture. Lightweight local operations should default to CLI or equivalent minimal interfaces. Structured remote services should be exposed through typed protocol layers such as MCP. Multi-step workflows spanning several dependent calls should, where feasible, be elevated into code-centric abstractions that minimize context round-trips.
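One possible reading of this tiered principle is sketched below as a selection function. It is a sketch under stated assumptions rather than a definitive policy: the `TaskProfile` fields, the threshold of three dependent calls, and the ordering of the rules are all illustrative choices.

```python
from dataclasses import dataclass
from enum import Enum

class Interface(Enum):
    CLI = "cli"
    MCP = "mcp"
    CODE = "code"

@dataclass(frozen=True)
class TaskProfile:
    is_local: bool            # resources live in the execution environment
    has_nested_data: bool     # inputs or outputs are deeply structured
    dependent_calls: int      # length of the tool chain
    sandbox_available: bool   # runtime can safely execute generated code
    schema_tokens: int        # estimated cost of loading tool schemas
    context_budget: int       # tokens the session can spend on schemas

def choose_interface(task: TaskProfile, chain_threshold: int = 3) -> Interface:
    """Pick an interface tier from locality, structure, and context budget."""
    long_chain = task.dependent_calls >= chain_threshold
    over_budget = task.schema_tokens > task.context_budget
    # Code is only an option when generated programs can run safely.
    if (long_chain or over_budget) and task.sandbox_available:
        return Interface.CODE
    if task.is_local and not task.has_nested_data:
        return Interface.CLI
    return Interface.MCP

local_listing = TaskProfile(True, False, 1, True, 0, 8000)
crm_update = TaskProfile(False, True, 1, False, 2000, 8000)
multi_system = TaskProfile(False, True, 5, True, 2000, 8000)
```

With these illustrative profiles, `choose_interface` selects CLI for `local_listing`, MCP for `crm_update`, and code-level composition for `multi_system`. A production system would presumably tune the threshold and rule order empirically.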
9. Discussion
The debate between CLI and MCP is often framed as a contest between simplicity and sophistication. This framing is misleading. CLI does not succeed merely because it is simple, and MCP does not impose cost merely because it is structured. Rather, each interface encodes a different assumption about where complexity should reside.
CLI externalizes complexity into text manipulation and model inference. This works well when the task structure is already shallow and familiar. MCP internalizes complexity into explicit schema design and protocol semantics. This works well when the data are deeply structured and correctness depends on preserving type information. Code-centric interfaces distribute complexity into reusable abstractions and executable logic, reducing repeated context mediation at the cost of introducing a stronger runtime dependency.
From this perspective, the future of agent tooling is likely to be pluralistic. Efficient systems will combine presentation-efficient interfaces for local operations, schema-rich protocols for structured remote services, and code-level composition for complex workflows. The engineering challenge is therefore not to select one universal interface, but to design coordination layers that let agents move among these interfaces with minimal loss of fidelity or efficiency.
10. Conclusion
This paper examined how LLM agents should interface with tools and external systems through the contrasting paradigms of CLI, MCP, and code-centric APIs. We first argued that LLMs and agents should not be treated as identical concepts, because the operational action of many contemporary agent systems is realized through tool-calling sequences rather than through a classical reinforcement learning policy. We then showed why CLI remains highly effective for local, linear, and parameterized tasks, especially given its maturity, low context overhead, and strong alignment with the textual distributions seen during LLM pretraining.
At the same time, we identified the structural limitations of CLI when agents must manipulate nested objects, preserve type information, and interact with remote services exposing complex DTOs. MCP addresses these weaknesses by providing typed, discoverable, and machine-readable interfaces, but traditional MCP deployments often incur substantial token and context overhead through schema loading. We further discussed how code-centric interfaces may alleviate some of these costs by enabling the model to compose multi-step operations within code rather than repeatedly routing intermediate results through the context window.
The main conclusion is that agent interface design should be task-sensitive rather than ideology-driven. CLI, MCP, and code-native abstractions each solve different problems. Future agent systems should therefore adopt hybrid architectures that align interface choice with the structural complexity of the task, the need for type preservation, and the available context budget.