By now, we've explored all but one of the key building blocks of autonomous agents: Profiling (identity, goals, constraints), Knowledge (base facts), Memory (past contexts), Reasoning and Planning (task breakdown, inference, action plans), and Reflection (evaluating outcomes to improve future performance through feedback loops). The remaining one is Actions: the practical steps through which autonomous agents execute planned activities, interact with environments or external tools, and produce tangible outcomes. Actions bridge theory and reality, making them essential for agent autonomy. They enable an AI agent to "do something" rather than merely "say something."
In agentic AI, an action is any operation an agent performs to interact with external systems, going beyond passive text responses to actively fetch data, execute code, invoke APIs, or control interfaces. Tool integration is essential: it extends an agent's capabilities beyond its model weights, enabling true autonomy. Agentic AI dynamically applies tools and real-time information from sensors, databases, or web APIs to adapt and solve complex, real-world tasks.
In this article, we examine UI-driven versus API-driven approaches, clarify function calling within LLMs, and compare prominent open-source frameworks like LangGraph, Microsoft AutoGen, CrewAI, Composio, OctoTools, BabyAGI, and MemGPT (Letta). It's not a casual read, but it's packed with useful insights if you're into agents.
What's in today's episode?
Essential Components of Action
Tool Learning: UI-Based vs. API-Based Interactions
Function Calling: How LLMs Invoke External Functions
Open-Source Frameworks Enabling Actions (This overview of frameworks is a goldmine for anyone looking to build or experiment with agentic AI.)
Emerging Trends in AI Action Execution
Concluding Thoughts
Resources to dive deeper
Essential Components of Action
Tool Learning: UI-Based vs. API-Based Interactions
One fundamental choice in enabling agent actions is how the agent interacts with external tools or applications. Broadly, these interactions fall into two categories: UI-based interactions and API-based interactions.
UI-Based Tool Use: In this approach, an AI agent operates the software's user interface (UI) like a human would: clicking buttons, typing into forms, and reading on-screen information. Such a computer-use AI agent essentially simulates a human user's behavior on the frontend. This method is akin to robotic process automation (RPA) driven by AI. The advantage of UI-based action is that it works even when direct programmatic access is unavailable or prohibited. For example, if an enterprise application has no public API or a website's terms of service forbid scraping, an agent can still perform tasks by navigating the UI just as an employee would. UI-based agents inherently comply with front-end usage policies and can integrate workflows across multiple disparate applications. However, this approach can be slower and more brittle: changes in the interface or layout can break the agent's "script," and setting up a virtual browser or desktop environment for the agent to operate in adds complexity.
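To make the UI-based pattern concrete, here is a minimal sketch of an agent action that drives a web form the way a human would, assuming a Playwright-style `page` object with `goto`, `fill`, and `click` methods. The URL, CSS selectors, and the expense-report workflow are hypothetical, chosen only for illustration:

```python
def submit_expense_report(page, amount: str, description: str) -> None:
    """Operate the frontend like a human user: navigate, type, click.

    `page` is any browser-automation page object exposing goto/fill/click
    (e.g. a Playwright page); everything below mimics manual UI steps.
    """
    page.goto("https://intranet.example.com/expenses/new")  # hypothetical URL
    page.fill("#amount", amount)              # type into the amount field
    page.fill("#description", description)    # type into the description field
    page.click("button[type=submit]")         # press the submit button
```

Note how the brittleness discussed above shows up directly in the code: if the site renames the `#amount` field or moves the submit button, this script silently breaks, which is why UI-based agents need monitoring and frequent maintenance.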
API-Based Tool Use: Here, an agent uses backend APIs or function calls to interact with software systems directly. Instead of clicking through a web page to get stock prices, for instance, the agent might call a REST API that returns the data in JSON. API-based actions are more structured and efficient: they provide the agent with precise data or let it trigger defined operations (e.g., creating a calendar event via an API) without having to parse visual interfaces.
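The stock-price example above can be sketched in a few lines. The endpoint URL and the JSON payload shape (`{"symbol": ..., "price": ...}`) are hypothetical; the point is that one structured request replaces an entire sequence of UI clicks and screen parsing:

```python
import json
from urllib.request import urlopen

# Hypothetical quote endpoint; real services differ in URL and payload shape.
QUOTE_URL = "https://api.example.com/v1/quote?symbol={symbol}"

def parse_quote(raw: bytes) -> float:
    """Extract the price from a JSON payload like {"symbol": "ACME", "price": 123.45}."""
    return float(json.loads(raw)["price"])

def get_stock_price(symbol: str) -> float:
    """One API call yields machine-readable data: no HTML or screen parsing needed."""
    with urlopen(QUOTE_URL.format(symbol=symbol)) as resp:
        return parse_quote(resp.read())
```

Because the response is structured JSON rather than rendered pixels, the agent gets exact values and a clear failure mode (an HTTP error) instead of guessing at on-screen text.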
In practice, modern AI agent frameworks prioritize API-based tools for their reliability and speed. Even Anthropic's Computer Use tool, which lets an agent operate a virtual desktop through its UI, is exposed to the model as an API-style tool. Tool learning in this context means prompting the AI to understand when and how to use a tool, often through a series of prompts, constraints, and examples. Given descriptions of available tools and usage examples, an LLM-based agent can select the right tool for a query and generate the correct API call format, effectively learning the tool's interface from instructions. AI practitioners often note that it's easy to make this work once but hard to keep it working consistently. Research like Toolformer shows LLMs can be fine-tuned to insert API calls autonomously, but practical systems typically use prompt engineering or function-calling interfaces instead of retraining. For businesses, the choice between UI and API tools matters: API-focused agents excel in efficiency and scalability when robust APIs exist, while UI-based agents are necessary for legacy systems or UI-only platforms.
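"Learning the tool's interface from instructions" often amounts to rendering each tool's name and description into the agent's system prompt so the model can pick the right one. A minimal sketch, with hypothetical tool names and a deliberately simple response format:

```python
# Hypothetical tool registry: name -> human-readable description.
TOOLS = {
    "get_stock_price": "Return the latest price for a ticker symbol.",
    "create_event": "Create a calendar event from a title and an ISO date.",
}

def render_tool_prompt(tools: dict) -> str:
    """Format tool descriptions so an LLM can select and invoke one.

    The agent prepends this text to its system prompt; the model is expected
    to answer with a JSON object naming the chosen tool and its arguments.
    """
    lines = ["You can call exactly one of these tools:"]
    lines += [f"- {name}: {desc}" for name, desc in tools.items()]
    lines.append('Respond with JSON: {"tool": <name>, "arguments": {...}}')
    return "\n".join(lines)
```

The fragility practitioners complain about lives here: a vague description, or two tools with overlapping descriptions, is often all it takes for the model to start picking the wrong one.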
Function Calling: How LLMs Invoke External Functions
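At its core, function calling means the model emits a structured (typically JSON) call naming a function and its arguments, and the host program, not the model, actually executes it. A minimal dispatcher sketch, where the registry and the `add` example in the docstring are hypothetical:

```python
import json

def handle_tool_call(model_output: str, registry: dict):
    """Execute a function call emitted by an LLM.

    `model_output` is JSON such as {"tool": "add", "arguments": {"a": 2, "b": 3}};
    `registry` maps tool names to ordinary Python callables. The model only
    chooses; this host-side code performs the action and can validate first.
    """
    call = json.loads(model_output)
    fn = registry[call["tool"]]          # KeyError here means an unknown tool
    return fn(**call.get("arguments", {}))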

