• Turing Post

The Recipe for an AI Revolution: How ImageNet, AlexNet and GPUs Changed AI Forever

And the unbending faith! A behind-the-scenes look at the unlikely ingredients that fueled the 2000s AI boom

In our “History of Computer Vision (CV)” series, we’ve explored The Dawn of Computer Vision: From Concept to Early Models (1950-70s), CV in the 1980s: The Quiet Decade of Key Advancements, and CV's Great Leap Forward: From the 1990s to AlexNet, the model that proved revolutionary for deep learning.

So what is the recipe for an AI revolution? Today, we will summarize the ingredients of the computer vision and, more broadly, deep learning revolution that took place in the late 2000s. Take out your pens and notebooks – learning mode is on!

We are opening our Historical Series on Computer Vision to everyone. Please share it with those who might find it inspiring for their current research. It’s free to read. If you would still like to support us, click the button below; it will be truly appreciated →


The state of things in the 2000s

The beginning of the 2000s was the calm before the storm. The excitement over the potential of computer vision (CV) was as strong as the frustration with tool limitations and resource constraints. One of the most glaring obstacles was the lack of standardized datasets. Each research group curated its own small collection of images, making it nearly impossible to compare results and benchmark progress across different algorithms. This made it challenging to assess the true capabilities of emerging CV techniques. But large standardized datasets were simply not considered hugely important on the path to capable AI. Most teams instead focused on developing algorithms that, according to the consensus of the time, would push the AI industry forward.

The early methods and models, while groundbreaking in their own right, were quite limited. Traditional feature extraction methods like SIFT (Scale-Invariant Feature Transform) and HOG (Histogram of Oriented Gradients) were initially crucial for identifying image patterns. But they struggled to grasp the full spectrum of object categories and the complexities of real-world scenes. LeNet-5, with its elegant convolutional neural network (CNN) architecture, showed that deep learning could unlock the secrets of handwritten digit recognition. But what about the vast and intricate world of objects beyond those simple digits?

It’s hard to believe in our data-centric world that basically no one thought about creating a really large dataset with diverse, high-quality, real-world data.

ImageNet: Where Have We Gone? Where Are We Going? with Fei-Fei Li, 2017

While everyone targeted detail, no one targeted scale.

Meet Fei-Fei Li and ImageNet

In 2006, Fei-Fei Li, then a new professor at the University of Illinois Urbana-Champaign, sought to overcome the limitations of existing AI algorithms, which relied on datasets that were often small and lacked diversity. Poor datasets made it hard for models to learn general patterns. This scarcity of data often led to overfitting, as models would memorize the limited training examples rather than learning generalizable features.

During her research into existing methods for cataloging data, Li came across WordNet, a project by Princeton psychologist George Miller that organized words hierarchically. Intrigued by WordNet's approach to structuring knowledge, she reached out to Christiane Fellbaum, who had taken over the direction of the WordNet project after Miller. This interaction and her subsequent readings gave Li the idea to apply a similar hierarchical approach to visual data. In early 2007, Li joined the Princeton faculty and initiated the ImageNet project. Her first hire was Kai Li, a fellow professor who believed in her vision. Kai Li convinced Ph.D. student Jia Deng to transfer into her lab.

“The paradigm shift of the ImageNet thinking is that while a lot of people are paying attention to models, let’s pay attention to data. Data will redefine how we think about models.”

But how would you build such a dataset? To make a difference, it would need millions of annotated images. Fei-Fei Li initially planned to hire undergraduate students to find images manually for $10 an hour. But at that rate, a truly large dataset would take decades to complete.

Then the team considered using CV algorithms to pick photos from the internet, but after experimenting with them for a few months, they decided the approach wasn't sustainable: machine-generated datasets can only ever match the best algorithms of the time.

But many still believed that better algorithms, not more data, were the key to progress. The idea was simply ahead of its time. ImageNet failed to secure federal grants, drawing criticism and skepticism about its significance and feasibility. It took enormous confidence in one's vision to keep pushing and overcome every hurdle. And some luck: in a chance hallway conversation, a graduate student asked whether Li knew about Amazon Mechanical Turk. She decided to give it a try – and the project took off.

“He showed me the website, and I can tell you literally that day I knew the ImageNet project was going to happen. Suddenly we found a tool that could scale, that we could not possibly dream of by hiring Princeton undergrads.”

It took two years and 49,000 workers from 167 countries to create 12 subtrees with 5,247 synsets and 3.2 million images in total (with the aim of reaching around 50 million images within the next two years, as indicated in the original paper). In 2009, the team presented ImageNet for the first time at the Conference on Computer Vision and Pattern Recognition (CVPR) in Miami.

As a poster.

Because, again, almost no one really believed that such a dataset would make a difference.

Still pushing her vision, Fei-Fei Li and her team decided in 2010 to establish the ImageNet challenge: to 'democratize' the idea of using large-scale datasets for training computer vision algorithms, and to set a benchmark for evaluating the performance of different image recognition algorithms on a massive and diverse dataset. To truly advance the field, they believed it was necessary to reach a wider audience and encourage more researchers to explore the potential of ImageNet.

Did that by itself start the deep learning revolution? Of course not. A few other very important developments were happening during the same period, and a little before it.

Convolutional neural networks

We’ve mentioned LeNet, by LeCun et al., before. Do you know how he became interested in neural networks? Yann LeCun, who introduced convolutional neural networks (CNNs) in 1989, was initially studying electrical engineering. His interest in intelligent machines was sparked during his undergraduate studies by reading about the Piaget vs. Chomsky debate on language acquisition (Piattelli-Palmarini, 1983). Seymour Papert's mention of Rosenblatt's perceptron in the book inspired LeCun to explore neural networks. Ironically, Papert, along with Marvin Minsky, had previously contributed to the decline of neural network research in the late '60s and to one of the first AI winters.

By 1998, LeNet-5 achieved 99.05% accuracy on the MNIST dataset, marking a significant milestone in the development of CNNs and inspiring a few AI labs to keep working on them. The main roadblock for CNNs was computational: training deep CNNs was prohibitively slow and resource-intensive.

At the time, most deep learning work was done on central processing units (CPUs), often in small-scale experiments focusing on various learning algorithms and architectures. The deep learning community felt CNNs had big potential but was stuck with the limitations of CPUs. NVIDIA was game for a change.

NVIDIA introduces CUDA

The first people to notice the immense limitations of CPUs were not ML practitioners. In 1993, Jensen Huang, Chris Malachowsky, and Curtis Priem realized that 3D graphics in video games placed a lot of repetitive, math-intensive demands on PC central processing units (CPUs). What if a dedicated chip could perform these calculations more rapidly, in parallel? NVIDIA was born, and with it, in time, the first GeForce graphics accelerator chips. Initially creating GPUs (Graphics Processing Units) for video games, NVIDIA's team, and Jensen Huang specifically, soon had a bigger picture in mind.

In November 2006, NVIDIA introduced a groundbreaking solution for general-purpose computing on GPUs: CUDA (Compute Unified Device Architecture). CUDA, a parallel computing platform and programming model, leverages the power of NVIDIA GPUs to tackle complex computational problems more effectively than traditional CPU-based approaches. It was built around C/C++, with support for other popular languages such as Fortran and Python arriving through compilers and wrappers. Now the ML crowd could jump in and play with it.

A few AI pioneers immediately started experimenting with compute and GPUs.

According to Jürgen Schmidhuber, in 2010, his team showed that “GPUs can be used to train deep standard supervised NNs by plain backpropagation, achieving a 50-fold speedup over CPUs, and breaking the long-standing famous MNIST benchmark record, using pattern distortions. This really was all about GPUs – no novel NN techniques were necessary, no unsupervised pre-training, only decades-old stuff.” Around the same time, Andrew Ng's lab at Stanford was also moving towards GPUs for deep learning at scale. GPUs were still novel to the ML community, and the work was guided mostly by intuition. The reasoning was that a robust computational infrastructure could dramatically accelerate statistical model training, addressing many of the scaling challenges inherent in big data. At the time, it was a contentious and risky move.

Now circling back to another roadblock: the lack of vision regarding large standardized datasets.

Geoffrey Hinton’s lab and AlexNet’s breakthrough

Geoffrey Hinton. Well, let me just quote Britannica: “His family includes multiple mathematicians, among them Mary Everest Boole and her husband, George Boole, whose algebra of logic (known as Boolean logic) became the basis for modern computing. Other notable relatives include Joan Hinton, one of the few women to work on the Manhattan Project; Charles Howard Hinton, the mathematician famous for visualizing higher dimensions; and George Everest, the surveyor Mount Everest is named for.”

Geoffrey Hinton couldn’t disappoint. He earned a degree in experimental psychology and a PhD in artificial intelligence in 1978. In 1987, he became a professor at the University of Toronto. His lab became a continuous factory of AI talent.

Among his students were Alex Krizhevsky and Ilya Sutskever. They were not the first to implement CNNs in CUDA. The shift brought by AlexNet involved using a relatively standard convolutional neural network (ConvNet), but significantly scaling it up:

  • training it on the large ImageNet dataset

  • implementing it efficiently in CUDA/C++ (you can find the original code for AlexNet here). This approach utilized model-parallelism, splitting parallel convolution streams across two GPUs, which was quite innovative for the time.

According to the original paper, AlexNet’s large, deep convolutional neural network was trained on 1.2 million high-resolution images from the ImageNet LSVRC-2010 contest, achieving record-breaking results. The network's architecture consists of five convolutional and three fully-connected layers, and its depth was found to be crucial for its performance. To prevent overfitting, data augmentation techniques like image translations, horizontal reflections, and altering RGB channel intensities were employed. Additionally, a regularization method called "dropout" was used, where the output of each hidden neuron is randomly set to zero during training. The network was trained using stochastic gradient descent with specific parameter settings and achieved top-1 and top-5 error rates of 37.5% and 17.0% on the ILSVRC-2010 test set. The results demonstrate the potential of large, deep convolutional neural networks in image classification tasks and suggest that further improvements can be achieved with even larger networks and datasets.

Image Credit: The original paper

There were a few innovations introduced by AlexNet:

  • ReLU Nonlinearity: Utilization of Rectified Linear Units (ReLU) as activation functions, leading to faster training of deep neural networks compared to traditional saturating nonlinearities like tanh.

  • Training on Multiple GPUs: Implementation of cross-GPU parallelization, allowing for the training of larger networks that would not fit on a single GPU.

  • Local Response Normalization: Introduction of a normalization scheme inspired by lateral inhibition in real neurons, promoting competition between neuron outputs and aiding generalization.

  • Overlapping Pooling: Use of pooling layers with overlapping neighborhoods, reducing overfitting compared to traditional non-overlapping pooling.

  • Data Augmentation: Implementation of image translations, horizontal reflections, and PCA-based intensity alterations to artificially enlarge the dataset and improve the network's ability to generalize.

  • Dropout: Introduction of a regularization technique where random neurons are "dropped out" during training, forcing the network to learn more robust features and reducing overfitting.
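Three of these ingredients – ReLU, dropout, and overlapping pooling – are simple enough to sketch in a few lines of NumPy. This is an illustrative toy (1-D pooling, and the modern "inverted" dropout variant that rescales at training time), not the paper's actual implementation:

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x) -- non-saturating, unlike tanh, so training is faster
    return np.maximum(0.0, x)

def dropout(x, p, rng):
    # "Inverted" dropout: zero each activation with probability p,
    # rescale survivors so the expected activation is unchanged
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def max_pool_1d(x, size=3, stride=2):
    # Overlapping max pooling: stride (2) smaller than window (3), as in AlexNet
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, stride)])

x = np.array([-2.0, 1.0, 3.0, -1.0, 2.0, 0.5, 4.0])
a = relu(x)
print(a)               # negatives clipped to zero: [0. 1. 3. 0. 2. 0.5 4.]
print(max_pool_1d(a))  # windows [0,1,3], [3,0,2], [2,0.5,4] -> [3. 3. 4.]
print(dropout(a, p=0.5, rng=np.random.default_rng(0)))  # ~half zeroed, rest doubled
```

Note how adjacent pooling windows share elements; AlexNet's authors reported that this overlap slightly reduced overfitting compared to non-overlapping pooling.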

Participating in the ImageNet Challenge in 2012, AlexNet achieved phenomenal results, outperforming all previous models with a top-5 error rate of 15.3%, compared to the 26.2% achieved by the second-best entry.
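For readers unfamiliar with the metric: a prediction counts as correct under top-5 error if the true label appears anywhere among the model's five highest-scoring classes. A minimal NumPy sketch, using made-up toy scores rather than real ImageNet outputs:

```python
import numpy as np

def top_k_error(scores, labels, k):
    # Fraction of samples whose true label is NOT among the k highest-scoring classes
    topk = np.argsort(scores, axis=1)[:, ::-1][:, :k]   # top-k class indices per sample
    hits = np.any(topk == labels[:, None], axis=1)
    return 1.0 - hits.mean()

# Toy example: 4 samples, 5 classes (hypothetical scores)
scores = np.array([
    [0.10, 0.60, 0.20, 0.05, 0.05],   # best guess: class 1
    [0.30, 0.10, 0.40, 0.10, 0.10],   # best guess: class 2
    [0.20, 0.20, 0.20, 0.30, 0.10],   # best guess: class 3
    [0.50, 0.10, 0.10, 0.20, 0.10],   # best guess: class 0
])
labels = np.array([1, 0, 3, 2])

print(top_k_error(scores, labels, k=1))  # 0.5 -- samples 2 and 4 miss at top-1
print(top_k_error(scores, labels, k=5))  # 0.0 -- with all 5 classes allowed, every label is covered
```

With 1,000 ImageNet classes, top-5 is far stricter than this toy suggests; AlexNet's 15.3% meant the correct label was missing from all five guesses only about one time in seven.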

This success marked a pivotal moment for deep learning, leading to widespread adoption and further advancements in CNN architectures.

In one of the interviews (I don’t remember which), Ilya Sutskever described that moment as an absolute revelation: the combination of a huge amount of data and powerful compute would lead, he was sure, to unprecedented breakthroughs in AI. He was right. It paved the way for subsequent breakthroughs, including models like VGG, GoogLeNet, and ResNet, and the entire field of deep learning, driving it to the GenAI revolution we are currently experiencing.

So here is your recipe for the deep learning revolution:

  • Key Ingredients: A massive dataset opened for collaboration (ImageNet), a powerful neural network built on the shoulders of other giants (AlexNet), and computing power (NVIDIA’s GPUs). But most of all, the unbending faith of a few researchers in their vision (Fei-Fei Li).

  • Outcome: Transformed AI, breakthroughs in image recognition, NLP, and other fields. Led to today's AI-powered applications.

Thank you for reading!


We appreciate you!
