Foundation Models: AI's Next Frontier

Modern-day artificial intelligence (AI) centers on learning from data — and the more data there is, the better it learns.

That’s why, until now, AI research and applications have largely focused on training bigger AI models on more data using highly efficient computational resources. But while significant progress has been made in this area, many application areas, such as healthcare and manufacturing, have limited data available, which has restricted AI’s applicability there.

Foundation models could be the solution to this. The term “foundation model” refers to a general-purpose AI model. While traditional AI models must be trained on vast data sets for each individual use case, a foundation model can be adapted to a wide range of downstream tasks, which limits the legwork required to get an AI venture off the ground and improves efficiency. (Also read: 7 Key AI Adoption Challenges – and How to Overcome Them.)

Foundation models are based on standard ideas in transfer learning and recent advances in training deep learning models using self-supervised learning. They have also demonstrated striking emergent capabilities and significantly enhanced performance on a wide variety of use cases, making them an attractive prospect for the enterprise.

But the potential foundation models present is even bigger than that: They represent a growing paradigm shift in AI. Until now, AI researchers and developers have had to train models from scratch for each use case, which requires them to collect large task-specific data sets. In contrast, foundation models are general-purpose models that can be adapted to specific use cases using the data you already have available.

In this way, foundation models will make it easier for organizations to build AI into their operations, or to integrate it more deeply.

How Do Foundation Models Work?

From a technological viewpoint, foundation models are deep neural networks trained using self-supervised learning. Although these technologies have existed for many years, what’s really groundbreaking is the scale at which models are now being built with them.

Recent foundation models contain hundreds of billions to trillions of parameters and are trained on hundreds of gigabytes of data. Most existing foundation models are built on the transformer architecture.
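To give a feel for these parameter counts, here is a back-of-the-envelope sketch of how much storage the weights alone would need. The helper name `weights_size_gb` and the choice of 16-bit (2-byte) precision are assumptions made for illustration; real training also needs memory for optimizer state and activations, which this ignores.

```python
def weights_size_gb(num_parameters, bytes_per_parameter=2):
    """Storage needed for the weights alone, in gigabytes (10**9 bytes).

    bytes_per_parameter=2 assumes 16-bit precision (an illustrative choice).
    """
    return num_parameters * bytes_per_parameter / 1e9


for name, params in [("100 billion", 100e9), ("1 trillion", 1e12)]:
    size = weights_size_gb(params)
    print(f"{name} parameters at 16-bit precision: {size:,.0f} GB")
```

Even under these simplified assumptions, the weights of such models do not fit in the memory of a single commodity accelerator, which is one reason scale depends so heavily on hardware.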

While the transformer architecture is not integral to foundation models, it has a few properties that make it stand out as an ideal core for them:

They’re easily parallelizable. Transformers can be parallelized easily in both the training and inference phases. This property is especially vital for natural language processing (NLP), where previous state-of-the-art models, including recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, process data sequentially and hence cannot be parallelized across a sequence.
They have less inductive bias. Compared to other contemporary models, such as convolutional neural networks (CNNs) and RNNs, transformers have minimal inductive bias. Inductive bias refers to design choices made by assuming some characteristics of the input data, for example, feature locality in CNNs and sequential dependencies of features in RNNs. With fewer inductive biases, the transformer is a more universal architecture than other models, which makes it more suitable for building foundation models. However, this also means transformers require more training data due to the well-known trade-off between inductive bias and data. (Also read: Why Diversity is Essential for Quality Data to Train AI.)
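The parallelism point can be illustrated with self-attention, the core operation of the transformer architecture, next to a toy recurrence. This is a minimal sketch, not a real model: it omits the learned query, key and value projections (the inputs play all three roles), and the function names and the `decay` parameter are hypothetical. What it shows is that each attention output depends only on the inputs, so positions can be computed in any order, while each recurrent step needs the previous hidden state.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(position, inputs):
    """Self-attention output for one position (queries = keys = values = inputs)."""
    query = inputs[position]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key in inputs])
    return [sum(w * value[d] for w, value in zip(weights, inputs))
            for d in range(len(query))]


def self_attention(inputs, order=None):
    """Compute every position's output; the order of computation is irrelevant."""
    order = list(range(len(inputs))) if order is None else order
    outputs = [None] * len(inputs)
    for position in order:
        outputs[position] = attend(position, inputs)
    return outputs


def recurrent(inputs, decay=0.5):
    """Toy recurrence: step t cannot start until step t - 1 has finished."""
    hidden = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        hidden = [math.tanh(decay * h + xi) for h, xi in zip(hidden, x)]
        states.append(hidden)
    return states


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
forward = self_attention(tokens)
backward = self_attention(tokens, order=[2, 1, 0])
print(forward == backward)  # True: positions are independent of each other
```

Because no position waits on another, the loop in `self_attention` could be handed to parallel hardware as a single batch of work, whereas the loop in `recurrent` is inherently serial.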

Foundation models are generally trained using self-supervised learning, which, unlike supervised learning, requires less human intervention. Rather than relying on human-provided labels, self-supervised learning allows a model to “teach itself” by using the supervision signals available naturally within the training data.

Some examples of these supervision signals are:

Masking words within a sentence and training the model to recover the missing words, as BERT does.
Predicting the next word (more precisely, the next token) in a sequence, as GPT-3 does.
Judging whether two transformed versions come from the same image, as SimCLR does.
Judging the similarity between an image and its text caption, as CLIP does.
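As a rough illustration of the first two signals, the sketch below builds training examples directly from raw text, with no human labeling. It is a simplified assumption of how such objectives are set up: the function names, the `[MASK]` placeholder string, the mask rate and whitespace tokenization are illustrative choices, and the actual BERT and GPT-3 recipes differ in their details.

```python
import random

MASK = "[MASK]"  # placeholder string chosen for this sketch


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Masked-word objective: hide some tokens and keep them as labels."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK)
            labels.append(token)   # the model must recover this token
        else:
            inputs.append(token)
            labels.append(None)    # no prediction needed at this position
    return inputs, labels


def next_token_pairs(tokens):
    """Next-word objective: each prefix is an input, the following token its label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = "foundation models learn from unlabeled text".split()
masked_inputs, masked_labels = mask_tokens(sentence, mask_rate=0.3)
pairs = next_token_pairs(sentence)

print(masked_inputs)
print(masked_labels)
print(pairs[0])  # (['foundation'], 'models')
```

In both cases the labels come from the text itself, which is what lets this kind of training scale to very large unlabeled corpora.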

Self-supervised learning is useful for training foundation models for at least two reasons:

It has better scalability than supervised learning. This is because it’s much more convenient to get more unlabeled data than labeled data.
It learns more expressive features. This is because its supervision comes from the data itself, which is a richer space than the notoriously confined label spaces of supervised data.

The combination of a high-capacity, compute-efficient model architecture, a highly scalable training objective and potent hardware enables us to scale foundation models to an extraordinary level.

The Rise of Foundation Models

The rise of foundation models can be understood in terms of emergence and homogenization. Emergence means that a system’s behavior arises indirectly, through learning, rather than being explicitly designed. Homogenization refers to the consolidation of methods for building machine learning systems across a wide spectrum of applications.

To better contextualize where foundation models fit into the broader AI conversation, let’s explore the rise of AI over the last 30 years. (Also read: A Brief History of AI.)

1. Machine Learning

Most contemporary AI developments are driven by machine learning (ML), which learns predictive models from historical data in order to make predictions about the future. The rise of ML within AI began in the 1990s and was a paradigm shift from the way AI systems were built previously.

An ML algorithm can infer how to perform a given task from the data it is trained on. This was a major step toward homogenization, as a wide range of AI use cases can be realized using a single generic ML algorithm.

However, ML relies heavily on feature engineering, which requires domain experts to transform raw data into higher-level features.

2. Deep Learning

Neural networks saw a new beginning in the form of deep learning (DL) around 2010.

Unlike vanilla neural networks, DL models are powered by deep neural networks (i.e., neural networks with more computational layers), compute-efficient hardware and larger data sets. A major advantage of DL is that it takes raw input (e.g., pixels) and produces a hierarchy of features during training. Hence, in DL, features also emerge from the act of learning.

This development led DL to exhibit extraordinary performance on standard benchmarks. The rise of DL was also a step further towards homogenization, as the same DL algorithm could be used for many AI use cases without domain-specific feature engineering.

DL models, however, require a lot of domain-specific data for training. (Also read: Basic Machine Learning Terms You Should Know.)

3. Foundation Models

The era of foundation models started in 2018 in the field of natural language processing. Technically, foundation models are empowered by transfer learning and scale.

Transfer learning works by taking the knowledge an AI model gained while learning the tasks it can already perform and building on it to teach the model new tasks, essentially “transferring” the model’s knowledge to new use cases.

In deep learning, a dominant approach to transfer learning is to pre-train a model using self-supervised learning and then fine-tune it to a specific use case.
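The adaptation step can be sketched in a few lines. The example below is a toy under stated assumptions, not a real pipeline: `pretrained_features` is a hypothetical stand-in for a frozen pre-trained network, and only a small logistic-regression "head" is trained on a handful of labeled examples. This is the simplest variant of fine-tuning, in which the pre-trained weights are kept fixed; in practice, some or all of them are often updated as well.

```python
import math


def pretrained_features(x):
    """Hypothetical stand-in for a frozen pre-trained backbone."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(examples, labels, steps=500, lr=0.1):
    """Fit a logistic-regression head on frozen features with gradient descent."""
    features = [pretrained_features(x) for x in examples]
    weights, bias = [0.0, 0.0], 0.0
    n = len(features)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for f, y in zip(features, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, f))) - y
            grad_w = [g + error * v for g, v in zip(grad_w, f)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(x, weights, bias):
    f = pretrained_features(x)
    score = bias + sum(w * v for w, v in zip(weights, f))
    return 1 if sigmoid(score) >= 0.5 else 0


# A small task-specific data set: the label is 1 when x[0] > x[1].
examples = [[2.0, 1.0], [3.0, 0.5], [1.0, 2.0], [0.5, 3.0]]
labels = [1, 1, 0, 0]

weights, bias = train_head(examples, labels)
print([predict(x, weights, bias) for x in examples])
print(predict([4.0, 1.0], weights, bias), predict([1.0, 4.0], weights, bias))
```

Only three numbers are learned here (two weights and a bias), which hints at why adapting an existing model can get by with far less task-specific data than training one from scratch.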

While transfer learning makes foundation models realizable, scale makes them potent. Scale depends on three key factors:

Developing a compute-efficient model architecture that takes advantage of hardware parallelism (for example, the transformer).
Enhancing computer hardware with better throughput and memory (for example, GPUs).
Accessing larger data sets.

Unlike deep learning, where large task-specific data sets must be available for the model to learn use case-specific features, foundation models aim to create “general purpose” features which can be adapted to multiple use cases.

In this way, foundation models present the possibility of an unprecedented level of homogenization. Case in point: almost all state-of-the-art NLP models are adapted from one of a few foundation models (e.g., BERT, GPT-3, T5 and OPT), while models such as CLIP, DALL-E 2 and Codex play a similar role for images and code.

Conclusion

Foundation models represent the start of a paradigm shift in the way artificial intelligence systems are constructed and deployed in the world. They have already made their mark in NLP and are under exploration in other fields such as computer vision, speech recognition and reinforcement learning.

Given their potential, we can expect foundation models to move beyond the world of research and revolutionize the way AI is adopted in business. Automating processes within the enterprise will no longer require data science teams to re-train models from scratch for each task they want to automate; instead, they can start from a pre-trained model and fine-tune it for each use case. (Also read: 3 Amazing Examples of Artificial Intelligence in Action.)