We live in a time of change and revision as we need better algorithms, more powerful computing, and new levels of AI. Efficiency is a new trending term in the AI world that has come to replace previous battles for the biggest models. With that, some fundamental approaches are being reconsidered.
For example, the multilayer perceptron (MLP), arguably the most important algorithm in the history of deep learning, recently received an alternative. A group of researchers proposed Kolmogorov–Arnold Networks, reporting better accuracy and interpretability than MLPs in certain tasks.
What is KAN? How can it improve the results achieved by the fundamental MLP? Let’s find out.
In today’s episode, we will cover:
What Are Kolmogorov-Arnold Networks (KAN)?
First, what is an MLP and where did it come from?
KAN story
KAN architecture
Is KAN Better Than MLP?
Advantages of KANs over MLPs
KAN’s limitations
Conclusion
Bonus: Resources
What Are Kolmogorov-Arnold Networks (KAN)?
Kolmogorov-Arnold Networks, or KANs, are a neural network architecture that replaces fixed activation functions with learnable spline functions on the edges between nodes. KANs make each connection learn its own flexible function.
The idea is inspired by the Kolmogorov-Arnold representation theorem, which showed that multivariate continuous functions can be represented through combinations of one-dimensional functions. So KANs revisit this mathematical foundation as a possible alternative to multilayer perceptrons (MLPs), especially for tasks where interpretability, precise function approximation, and flexible modeling matter.
KANs are not a universal replacement for MLPs yet. They can be more interpretable and accurate in some small-scale scientific tasks, but they are also slower to train and still need more testing at industrial scale.
First, what is an MLP and where did it come from?
Multilayer perceptrons (MLPs), a core type of feedforward neural network, are fundamental in artificial intelligence. Among different types of neural networks (NNs)*, feedforward neural networks (FNNs) are the simplest. Information flows only in one direction, from input to output, without loops or cycles in the network architecture.
*Neural networks are inspired by the processes happening in the human brain, where biological neurons work together to identify phenomena, weigh options, and arrive at conclusions.
To understand MLPs, let’s review some basics of neural networks. A multilayer perceptron consists of layers of nodes (also known as neurons or perceptrons):
Input Layer: This is the initial point of data entry into the network. Each node here represents a different feature of the input data, effectively translating raw data into a format the network can work with.
Hidden Layers: Situated between the input and output layers, these layers can vary in number and size depending on the network's complexity. Each neuron in these layers processes inputs from all the neurons in the previous layer, transforming them via weights, biases, and activation functions, then passing the result to the next layer.
Output Layer: This final layer outputs the network's predictions or classifications. The number of neurons here aligns with the desired output dimensions, depending on the specific task at hand.
Other important concepts:
Weights and biases: These parameters define the strength of connections between neurons. They are adjustable through training and crucial in determining the network’s output.
Activation functions: Functions like rectified linear unit (ReLU) or sigmoid dictate a neuron's output based on its input. These functions are essential for introducing non-linearity into the network, enabling it to model complex patterns.
Learning rule: An algorithmic approach, such as backpropagation, adjusts the network's weights and biases based on the error between the predicted output and the actual output, refining the model iteratively.
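The pieces above (layers, weights, biases, and activation functions) can be sketched as a tiny forward pass. This is a minimal illustration in plain Python, not a production implementation: the function names and the hand-picked weights are assumptions made for the example, and no training (learning rule) is shown. The weights are chosen so that the network computes XOR on binary inputs.

```python
def relu(z):
    """Rectified linear unit: a fixed activation applied at each hidden node."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: weighted sum plus bias per neuron, then activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer (ReLU) -> linear output layer."""
    hidden = dense(x, hidden_w, hidden_b, activation=relu)
    return dense(hidden, out_w, out_b)


# Hand-picked (not learned) parameters, chosen only for illustration.
hidden_w = [[1.0, 1.0], [1.0, 1.0]]  # two hidden neurons, two inputs each
hidden_b = [0.0, -1.0]
out_w = [[1.0, -2.0]]                # one output neuron
out_b = [0.0]

for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
    print(x, mlp_forward(x, hidden_w, hidden_b, out_w, out_b))
```

Note that the non-linearity lives in the nodes (`relu`), while the connections are plain numbers. In a real MLP these numbers would be found by backpropagation rather than set by hand.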
The roots of MLPs go back to the late 1950s and 1960s, when Frank Rosenblatt introduced his perceptron model, a simple neural network that could perform basic pattern recognition tasks. But the perceptron's capabilities were limited: in 1969, Marvin Minsky and Seymour Papert showed that a single-layer perceptron cannot solve problems that are not linearly separable, such as XOR. This finding temporarily halted neural network research.
Almost two decades passed until researchers began to explore the ability of multilayer feedforward networks to approximate general mappings from one finite dimensional space to another. In 1989, Kurt Hornik, Maxwell Stinchcombe, and Halbert White showed that MLPs with as few as one hidden layer, given enough hidden units, can approximate any continuous function on a bounded domain to any desired accuracy, no matter how complex the function is. Their result is known as the “universal approximation theorem.”
This property, known as "universal approximation," established MLPs as incredibly versatile tools capable of addressing a broad spectrum of tasks from simple regression to complex pattern recognition challenges in ML, without the need for custom-designed algorithms for each new problem.
According to the Deep Learning textbook, MLPs are “the quintessential deep learning models”. Today, MLPs are widely used in various fields of machine learning working with text, images and speech among others. Their flexibility in architecture and ability to approximate nonlinear functions make them a fundamental building block in deep learning and neural network research.
Who Invented KAN? History of Kolmogorov-Arnold Networks
Parallel to the universal approximation theorem, there is the Kolmogorov-Arnold representation theorem, proved in 1957 by Vladimir Arnold and Andrey Kolmogorov. It states that any multivariate continuous function on a bounded domain can be represented as a finite composition of continuous functions of a single variable and the operation of addition. In other words, the Kolmogorov-Arnold representation theorem shows that a function of many variables can be built entirely from sums and compositions of one-dimensional functions.
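For reference, the theorem is usually written in the following form. The notation here follows the convention used in the KAN paper and is an assumption of this sketch: f is a continuous function on the unit cube [0,1]^n, and each inner function \phi_{q,p} and outer function \Phi_q is a continuous function of a single variable.

```latex
f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

The only operation that mixes several variables is addition: every other building block takes one number in and gives one number out.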
Then why didn’t researchers use it in the first place? The problem is that these one-dimensional functions are not guaranteed to be smooth; they can be highly irregular, which makes them hard for a neural network to learn in practice. That’s why, in 1989, Federico Girosi and Tomaso Poggio argued that “Kolmogorov's Theorem Is Irrelevant” for network design, and interest in the theorem within the machine learning community largely faded.
Fast forward to 2024: the authors of the paper “KAN: Kolmogorov-Arnold Networks” revitalized and contextualized the theorem in today’s deep learning world. They suggest generalizing the theorem's original two-layer form to networks of arbitrary widths and depths, which would allow for a more flexible and robust application in machine learning models.
Furthermore, the authors highlight that many functions encountered in science and everyday life possess properties such as smoothness and sparse compositional structures. These characteristics are conducive to representations using the Kolmogorov-Arnold approach, which could make the theorem more practically applicable in real-world scenarios.
How KAN Works: Architecture and Spline Functions
The main innovation in KANs is the replacement of fixed activation functions on nodes with learnable activation functions on edges, which eliminates linear weight matrices entirely. Instead, each "weight" in the network is modeled as a learnable univariate function, specifically parameterized as a spline. This change means each connection in the network can adapt its function based on the data, potentially offering a more flexible and tailored data processing pathway.

Image Credit: KAN paper
This architecture handles inputs and performs computations in a fundamentally different way from MLPs. Other core features:
Absence of linear weights: In traditional networks, linear transformations (weights) play a critical role in mapping inputs through the network's layers. Here spline-based functions replace linear weights, allowing each connection to perform complex, non-linear transformations tailored to the specific requirements of the input data.
Summation nodes: The nodes here primarily function as summation points that aggregate inputs from incoming edges. These inputs are not subjected to any further non-linear transformation at the nodes themselves, a departure from the typical use of activation functions like ReLU or sigmoid in MLP nodes.
Spline parameterization: Each edge function in a KAN is parameterized as a spline, which offers a powerful way to model non-linear relationships with a relatively simple mathematical construct. Splines provide the flexibility to form smooth curves that can fit a wide range of data shapes, making them ideal for the varied functional transformations the network requires.
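The three features above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the edge functions here are piecewise-linear splines defined by their values at evenly spaced knots, whereas the KAN paper uses higher-order B-splines combined with a base function. The names `EdgeFunction` and `kan_layer` are hypothetical, and the coefficients are set by hand rather than learned.

```python
class EdgeFunction:
    """Learnable 1-D function living on one edge (simplified: piecewise linear).

    The trainable parameters are the values `coef[i]` at evenly spaced knots
    between `lo` and `hi`.
    """

    def __init__(self, coef, lo=-1.0, hi=1.0):
        self.coef = list(coef)
        self.lo, self.hi = lo, hi

    def __call__(self, x):
        n_intervals = len(self.coef) - 1
        # Clamp to the grid range, then locate the interval containing x.
        x = min(max(x, self.lo), self.hi)
        t = (x - self.lo) / (self.hi - self.lo) * n_intervals
        i = min(int(t), n_intervals - 1)
        frac = t - i
        # Linear interpolation between the two neighbouring knot values.
        return (1.0 - frac) * self.coef[i] + frac * self.coef[i + 1]


def kan_layer(inputs, edges):
    """edges[j][i] is the function on the edge from input i to output node j.

    Each output node only sums its incoming edge values: there is no weight
    matrix and no activation function applied at the node.
    """
    return [sum(phi(x) for phi, x in zip(row, inputs)) for row in edges]


# One output node fed by two inputs. Coefficients are hand-set for illustration:
# the first edge roughly follows x**2, the second edge is the identity.
edges = [[
    EdgeFunction([1.0, 0.25, 0.0, 0.25, 1.0]),
    EdgeFunction([-1.0, 0.0, 1.0]),
]]

out = kan_layer([0.5, 0.5], edges)
print(out)  # [0.75], i.e. 0.25 from the first edge plus 0.5 from the second
```

Compared with an MLP layer, the roles are swapped: the non-linearity sits on the edges and is learnable, while the nodes do nothing but add.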
Is KAN Better Than MLP?
These networks present several theoretical and practical advantages over traditional MLPs. They effectively integrate the approaches of MLPs and splines, capitalizing on the strengths of each to address their individual limitations.

Image Credit: KAN paper
Advantages of splines in KANs:
Precision in low dimensions: Splines are particularly adept at accurately modeling functions in low-dimensional spaces. This precision is crucial for tasks where fine-grained control over the function shape is necessary.
Local adjustability: The nature of splines allows for easy local adjustments, meaning that changes to the spline parameters can directly and precisely affect specific sections of the function without unintended impacts elsewhere.
Resolution flexibility: Splines can switch between different resolutions, providing a versatile tool for function approximation that can be coarse or detailed depending on the need.
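Local adjustability and resolution flexibility are easy to check numerically. The sketch below again assumes a simplified piecewise-linear spline with evenly spaced knots (the KAN paper works with B-splines); `spline_value` and `refine` are hypothetical helper names introduced for this example.

```python
def spline_value(coef, x, lo=0.0, hi=1.0):
    """Evaluate a piecewise-linear spline with evenly spaced knots on [lo, hi]."""
    n = len(coef) - 1
    x = min(max(x, lo), hi)
    t = (x - lo) / (hi - lo) * n
    i = min(int(t), n - 1)
    frac = t - i
    return (1.0 - frac) * coef[i] + frac * coef[i + 1]


def refine(coef, factor=2, lo=0.0, hi=1.0):
    """Resample the spline on a grid with `factor` times more intervals."""
    n_fine = (len(coef) - 1) * factor
    return [spline_value(coef, lo + (hi - lo) * j / n_fine, lo, hi)
            for j in range(n_fine + 1)]


coarse = [0.0, 1.0, 0.0, 1.0, 0.0]  # values at knots 0, 0.25, 0.5, 0.75, 1
bumped = list(coarse)
bumped[1] += 0.5                     # adjust only the value at the knot x = 0.25

# Local adjustability: the change is confined to the intervals touching x = 0.25.
changed_near = spline_value(bumped, 0.2) != spline_value(coarse, 0.2)
unchanged_far = spline_value(bumped, 0.8) == spline_value(coarse, 0.8)
print(changed_near, unchanged_far)

# Resolution flexibility: a finer grid starts out as the same function,
# but now has more coefficients available for further fitting.
fine = refine(coarse, factor=2)
print(len(coarse), len(fine))
print(spline_value(coarse, 0.3), spline_value(fine, 0.3))
```

With a dense weight matrix, by contrast, changing a single weight generally changes the output for every input.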
Advantages of MLPs in KANs:
Handling high-dimensional data: Unlike splines, MLPs are less prone to the curse of dimensionality, making them more effective in environments with high-dimensional input spaces.
Feature learning: MLPs excel at learning complex feature hierarchies from data, which is beneficial for tasks requiring the extraction and integration of informative features from raw inputs.
Advantages of KANs over MLPs
Deciding whether these networks are better than MLPs depends largely on application, the nature of the data, and the objectives of the model.
Here are some of the main advantages:
Function approximation: KANs might offer more precise and flexible function approximation, particularly because they use splines as activation functions, which can be adjusted to fit the specific features of the input data more closely than the fixed activation functions typically used in MLPs.
Interpretability: The use of splines allows for potentially greater interpretability. Since each connection’s function is directly modifiable and observable, it can be easier to understand how inputs are transformed throughout the network.
Flexibility: These networks provide flexibility in adapting the network to various resolutions and complexities of data, potentially offering better performance on tasks involving complex, non-linear relationships that can be better modeled with splines.
High-dimensional functionality: While traditionally splines suffer from the curse of dimensionality, the integration with MLP-like structures allows them to potentially handle high-dimensional data more effectively than splines alone.
KAN Limitations and Challenges

Image Credit: KAN paper
So far, the biggest problem is slow training. According to the authors, KANs are usually about 10x slower to train than MLPs with the same number of parameters, which limits mainstream adoption. The authors suggest:
“If you care about interpretability and/or accuracy, and slow training is not a major concern, we suggest trying KANs, at least for small-scale AI + Science problems.”
The architecture showed improvements on small-scale problems, but its scalability to larger, more complex datasets and problems typical in industrial applications remains untested. The unique aspect of this approach, using spline functions for each connection, might introduce a high cost in terms of the number of parameters and the computational resources required to learn these parameters effectively, especially as network depth and complexity increase.
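The parameter cost can be made concrete with a rough count. The sketch below assumes each KAN edge carries one B-spline with `grid + order` coefficients (the number of basis functions for `grid` intervals and spline order `order`) and ignores any extra per-edge scaling parameters an implementation may add. The function names and the example layer widths are assumptions chosen for illustration.

```python
def mlp_layer_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def kan_layer_params(n_in, n_out, grid=5, order=3):
    """Spline coefficients of one KAN layer: (grid + order) per edge (assumed convention)."""
    return n_in * n_out * (grid + order)


def total_params(widths, layer_fn):
    """Sum a per-layer count over consecutive pairs of layer widths."""
    return sum(layer_fn(a, b) for a, b in zip(widths, widths[1:]))


widths = [4, 16, 1]  # hypothetical network shape
print(total_params(widths, mlp_layer_params))  # 97
print(total_params(widths, kan_layer_params))  # 640
```

For the same shape, the KAN count is larger by roughly a factor of `grid + order`. The KAN paper argues that KANs often need much narrower layers than MLPs for comparable accuracy, so a same-shape comparison is an upper bound on the gap rather than a verdict.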
Also, as noted in Devan’s newsletter, there are other possible drawbacks:
Lack of research: Compared with transformers and diffusion models, few researchers are working on KANs, so there may be blockers that have not been discovered yet.
Market fit: Hardware favors Transformers and conventional neural networks, which might hinder adoption.
Conclusion
Despite the prominence of LLMs, significant research explores alternatives like KANs. With their spline-based edge functions, KANs offer precise, flexible function approximation and greater interpretability. Although they face challenges like slower training and untested scalability, the ongoing exploration of these architectures highlights the diversity of AI research.
In an era focused on AI efficiency, reevaluating approaches like the MLP is crucial. The resurgence of neural networks after periods of dormancy shows that promising technologies can overcome initial hurdles with sustained research. Similarly, KANs might evolve to become more efficient and widely applicable.
Bonus: Resources
KAN paper: https://arxiv.org/abs/2404.19756
KAN original code: https://github.com/KindXiaoming/pykan
Documentation: https://kindxiaoming.github.io/pykan/
A recording of the talk given by Ziming Liu, one of the authors of KAN, at Google TechTalks
Another recording of the talk given by Ziming Liu to Portal community
KAN extension: Convolutional Kolmogorov-Arnold Networks
Kolmogorov-Arnold Networks (KANs) for Time Series Analysis: https://arxiv.org/abs/2405.08790
Wav-KAN: Wavelet Kolmogorov-Arnold Networks: https://arxiv.org/abs/2405.12832
An interesting experiment: Kolmogorov-Arnold Networks (KAN) using Chebyshev polynomials instead of B-splines: https://github.com/SynodicMonth/ChebyKAN
Other great KAN sources (related papers, libraries, projects, discussions, and tutorials) can be found here: https://github.com/mintisan/awesome-kan
Papers mentioned in the article:
Rosenblatt’s Perceptron: Professor’s perceptron paved the way for AI – 60 years too soon
Critique of the Perceptron by Marvin Minsky and Seymour Papert: Perceptrons: An Introduction to Computational Geometry
Deep Learning textbook
Universal approximation theorem: Multilayer feedforward networks are universal approximators
Kolmogorov-Arnold Representation theorem: On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition
Critique of the Kolmogorov-Arnold Representation theorem applicability in machine learning: Representation Properties of Networks: Kolmogorov's Theorem Is Irrelevant
FAQ
What is KAN in AI?
KAN stands for Kolmogorov-Arnold Network. It is a neural network architecture where learnable spline functions are placed on the edges between nodes. This differs from traditional MLPs, which usually use fixed activation functions on nodes and linear weights between layers.
Who invented KAN?
The mathematical foundation comes from Andrey Kolmogorov and Vladimir Arnold, who proved the Kolmogorov-Arnold representation theorem in 1957. The modern KAN architecture was introduced in 2024 by researchers from MIT and Caltech who adapted this idea for today’s deep learning models.
What is the interpretability of KANs?
KANs can be more interpretable than standard MLPs because each connection is represented by a learnable spline function that can be inspected and adjusted. This makes it easier to see how inputs are transformed through the network, especially in smaller scientific or function-approximation tasks.
Is KAN better than MLP?
KAN is not simply better than MLP. It may offer better accuracy and interpretability in some small-scale tasks, especially where smooth function approximation matters. However, KANs are slower to train, less tested at scale, and not yet as broadly adopted as MLPs in mainstream AI systems.