By now, we've explored all but one of the key building blocks of autonomous agents: Profiling (identity, goals, constraints), Knowledge (base facts), Memory (past contexts), Reasoning and Planning (task breakdown, inference, action plans), and Reflection (evaluating outcomes to improve future performance through feedback loops). The remaining one is Actions: the practical steps through which autonomous agents execute planned activities, interact with environments or external tools, and produce tangible outcomes. Actions bridge theory and reality, making them essential for agent autonomy. They enable an AI agent to "do something" rather than merely "say something."
In agentic AI, an action is any operation an agent performs to interact with external systems, going beyond passive text responses to actively fetch data, execute code, invoke APIs, or control interfaces. Tool integration is essential: it extends an agent's capabilities beyond its model weights, enabling true autonomy. Agentic AI dynamically applies tools and real-time information from sensors, databases, or web APIs to adapt and solve complex, real-world tasks.
In this article, we examine UI-driven versus API-driven approaches, clarify function calling within LLMs, and compare prominent open-source frameworks like LangGraph, Microsoft AutoGen, CrewAI, Composio, OctoTools, BabyAGI, and MemGPT (Letta). It's not a casual read, but it's packed with useful insights if you're into agents.
What's in today's episode?
Essential Components of Action
Tool Learning: UI-Based vs. API-Based Interactions
Function Calling: How LLMs Invoke External Functions
Open-Source Frameworks Enabling Actions (This overview of frameworks is a goldmine for anyone looking to build or experiment with agentic AI.)
Emerging Trends in AI Action Execution
Concluding Thoughts
Resources to dive deeper
Essential Components of Action
Tool Learning: UI-Based vs. API-Based Interactions
One fundamental choice in enabling agent actions is how the agent interacts with external tools or applications. Broadly, these interactions fall into two categories: UI-based interactions and API-based interactions.
UI-Based Tool Use: In this approach, an AI agent operates the software's user interface (UI) like a human would: clicking buttons, typing into forms, and reading on-screen information. Such a computer-use AI agent essentially simulates a human user's behavior on the frontend. This method is akin to robotic process automation (RPA) driven by AI. The advantage of UI-based action is that it works even when direct programmatic access is unavailable or prohibited. For example, if an enterprise application has no public API or a website's terms of service forbid scraping, an agent can still perform tasks by navigating the UI just as an employee would. UI-based agents inherently comply with front-end usage policies and can integrate workflows across multiple disparate applications. However, this approach can be slower and more brittle: changes in the interface or layout can break the agent's "script," and setting up a virtual browser or desktop environment for the agent to operate in adds complexity.
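To make the UI-based pattern concrete, here is a minimal sketch of an agent action that drives a web form the way a human would, assuming a Playwright-style `page` object with `goto`, `fill`, and `click` methods. The URL, CSS selectors, and the expense-report workflow are hypothetical, chosen only for illustration:

```python
def submit_expense_report(page, amount: str, description: str) -> None:
    """Operate the frontend like a human user: navigate, type, click.

    `page` is any browser-automation page object exposing goto/fill/click
    (e.g. a Playwright page); everything below mimics manual UI steps.
    """
    page.goto("https://intranet.example.com/expenses/new")  # hypothetical URL
    page.fill("#amount", amount)              # type into the amount field
    page.fill("#description", description)    # type into the description field
    page.click("button[type=submit]")         # press the submit button
```

Note how the brittleness discussed above shows up directly in the code: if the site renames the `#amount` field or moves the submit button, this script silently breaks, which is why UI-based agents need monitoring and frequent maintenance.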
API-Based Tool Use: Here, an agent uses backend APIs or function calls to interact with software systems directly. Instead of clicking through a web page to get stock prices, for instance, the agent might call a REST API that returns the data in JSON. API-based actions are more structured and efficient: they provide the agent with precise data or let it trigger defined operations (e.g., creating a calendar event via an API) without having to parse visual interfaces.
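The stock-price example above can be sketched in a few lines. The endpoint URL and the JSON payload shape (`{"symbol": ..., "price": ...}`) are hypothetical; the point is that one structured request replaces an entire sequence of UI clicks and screen parsing:

```python
import json
from urllib.request import urlopen

# Hypothetical quote endpoint; real services differ in URL and payload shape.
QUOTE_URL = "https://api.example.com/v1/quote?symbol={symbol}"

def parse_quote(raw: bytes) -> float:
    """Extract the price from a JSON payload like {"symbol": "ACME", "price": 123.45}."""
    return float(json.loads(raw)["price"])

def get_stock_price(symbol: str) -> float:
    """One API call yields machine-readable data: no HTML or screen parsing needed."""
    with urlopen(QUOTE_URL.format(symbol=symbol)) as resp:
        return parse_quote(resp.read())
```

Because the response is structured JSON rather than rendered pixels, the agent gets exact values and a clear failure mode (an HTTP error) instead of guessing at on-screen text.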
In practice, modern AI agent frameworks prioritize API-based tools for their reliability and speed. Even Anthropic's Computer Use tool, which lets an agent operate a virtual desktop through its UI, is exposed to the model as an API-style tool. Tool learning in this context means prompting the AI to understand when and how to use a tool, often through a series of prompts, constraints, and examples. Given descriptions of available tools and usage examples, an LLM-based agent can select the right tool for a query and generate the correct API call format, effectively learning the tool's interface from instructions. AI practitioners often note that it's easy to make this work once but hard to keep it working consistently. Research like Toolformer shows LLMs can be fine-tuned to insert API calls autonomously, but practical systems typically use prompt engineering or function-calling interfaces instead of retraining. For businesses, the choice between UI and API tools matters: API-focused agents excel in efficiency and scalability when robust APIs exist, while UI-based agents are necessary for legacy systems or UI-only platforms.
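"Learning the tool's interface from instructions" often amounts to rendering each tool's name and description into the agent's system prompt so the model can pick the right one. A minimal sketch, with hypothetical tool names and a deliberately simple response format:

```python
# Hypothetical tool registry: name -> human-readable description.
TOOLS = {
    "get_stock_price": "Return the latest price for a ticker symbol.",
    "create_event": "Create a calendar event from a title and an ISO date.",
}

def render_tool_prompt(tools: dict) -> str:
    """Format tool descriptions so an LLM can select and invoke one.

    The agent prepends this text to its system prompt; the model is expected
    to answer with a JSON object naming the chosen tool and its arguments.
    """
    lines = ["You can call exactly one of these tools:"]
    lines += [f"- {name}: {desc}" for name, desc in tools.items()]
    lines.append('Respond with JSON: {"tool": <name>, "arguments": {...}}')
    return "\n".join(lines)
```

The fragility practitioners complain about lives here: a vague description, or two tools with overlapping descriptions, is often all it takes for the model to start picking the wrong one.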
Function Calling: How LLMs Invoke External Functions
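At its core, function calling means the model emits a structured (typically JSON) call naming a function and its arguments, and the host program, not the model, actually executes it. A minimal dispatcher sketch, where the registry and the `add` example in the docstring are hypothetical:

```python
import json

def handle_tool_call(model_output: str, registry: dict):
    """Execute a function call emitted by an LLM.

    `model_output` is JSON such as {"tool": "add", "arguments": {"a": 2, "b": 3}};
    `registry` maps tool names to ordinary Python callables. The model only
    chooses; this host-side code performs the action and can validate first.
    """
    call = json.loads(model_output)
    fn = registry[call["tool"]]          # KeyError here means an unknown tool
    return fn(**call.get("arguments", {}))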

