Over the past 12 months, large language models (LLMs) have dominated the conversation around generative AI. However, behind the buzz of proprietary LLMs like ChatGPT and Google Bard, small language models (SLMs) have been quietly generating interest among industry leaders.
Earlier this month, Microsoft announced the release of Phi-2, a 2.7 billion parameter SLM with “outstanding” reasoning and language understanding capabilities.
This model has reportedly achieved state-of-the-art performance among models with fewer than 13 billion parameters and can even match or outperform models up to 25x its size.
READ MORE: What is the Role of Parameters in AI?
Likewise, when Google dropped its much-anticipated Gemini multimodal LLM, it made sure to include a lightweight version, Gemini Nano, which has between 1.8 billion and 3.25 billion parameters and is designed for on-device tasks.
So why are vendors like Microsoft and Google looking toward offering customers smaller but computationally efficient language models? There are many reasons, but perhaps the most significant is cost.
The Cost of LLMs
Cost is one of the most significant pain points when training and running an LLM. The GPUs that power modern LLMs are expensive to purchase and run. In general, the more parameters a model has, the more computational power and GPUs it needs to operate.
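To make the parameters-to-hardware relationship concrete, here is a back-of-the-envelope sketch (an illustration, not vendor pricing): the GPU memory needed just to hold a model's weights grows linearly with parameter count, which is a major reason larger models demand more, and more expensive, hardware.

```python
# Back-of-the-envelope estimate: GPU memory needed just to store model
# weights, assuming 16-bit (2-byte) precision per parameter. Real
# deployments also need memory for activations and the KV cache, so
# actual requirements are higher than this.

def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory (in GB) required to hold the model weights."""
    return num_params * bytes_per_param / 1e9

# A 2.7B-parameter SLM like Phi-2 vs. a 70B-parameter LLM like Llama 2 70B
print(weight_memory_gb(2.7e9))  # ~5.4 GB: fits on a single consumer GPU
print(weight_memory_gb(70e9))   # ~140 GB: requires multiple data-center GPUs
```

The roughly 26x difference in weight memory alone illustrates why a 70B-parameter model needs a cluster of GPUs while a 2.7B-parameter model can run on far cheaper hardware.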
This means it is expensive not just for enterprises to train their own LLMs but also to use pre-trained LLMs. For instance, according to OpenAI, pricing for a custom GPT-4 model starts at $2-3 million; training can take several months and requires “billions of tokens at a minimum.”
While it is not confirmed how much GPT-4 cost to train, some analysts have estimated that its predecessor, GPT-3, could have cost over $4 million, while others suggest that ChatGPT could cost as much as $700,000 per day to run.
READ MORE: The Evolution From GPT-1 to GPT-4
Although these figures seem high, the cost of developing an LLM can be much higher. For example, Dr. Jim Fan, a senior AI scientist at Nvidia, estimates that Llama-2 cost over $20 million to train yet still failed to surpass GPT-3.5.
You'll soon see lots of "Llama just dethroned ChatGPT" or "OpenAI is so done" posts on Twitter. Before your timeline gets flooded, I'll share my notes:
▸ Llama-2 likely costs $20M+ to train. Meta has done an incredible service to the community by releasing the model with a… pic.twitter.com/MrABHrmACv
— Jim Fan (@DrJimFan) July 18, 2023
Whether or not these estimates are accurate, it is indisputable that training or running an LLM requires a significant financial investment. That’s why vendors like Microsoft are turning toward more computationally lightweight models.
Phi-2 and the SLM Movement
SLMs are gaining traction in the generative AI market because they require less computational power than LLMs to generate insights and can thus operate more cost-effectively.
While GPT-4 is rumored to have been trained on 25,000 Nvidia A100 GPUs over a period of 90-100 days, Phi-2 took just 14 days to train on 96 A100 GPUs.
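Using the figures cited above (and treating the GPT-4 numbers as unconfirmed rumor), the gap in raw training compute can be sketched in GPU-days:

```python
# Order-of-magnitude comparison of training compute, in GPU-days, using
# the figures cited in this article. The GPT-4 numbers are rumored, not
# confirmed, so this is a rough sketch only.

gpt4_gpu_days = 25_000 * 95  # ~25,000 A100 GPUs for roughly 90-100 days
phi2_gpu_days = 96 * 14      # 96 A100 GPUs for 14 days

print(gpt4_gpu_days)   # 2375000 GPU-days
print(phi2_gpu_days)   # 1344 GPU-days

# Phi-2 used on the order of ~1,800x less training compute
print(round(gpt4_gpu_days / phi2_gpu_days))  # 1767
```

Even allowing for large errors in the rumored GPT-4 figures, the difference is three orders of magnitude, which translates directly into lower training cost.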
Although it hasn’t reached the level of performance of GPT-4, it has managed to outperform larger models across multiple benchmarks.
More specifically, it outperforms models like Mistral 7B and Llama-2 in areas including BBH, commonsense reasoning, language understanding (Llama 2 only), math, and coding. It has also outperformed Gemini Nano 2 on multiple benchmarks, including BBH, BoolQ, MBPP, and MMLU.
Considering that Phi-2 has performed on par with or even outperformed Llama 2 70B on certain benchmarks, it’s evident that SLMs can beat models with far more parameters on reasoning tasks. But how?
How Training Data Makes Phi-2 Come Together
In the case of Phi-2, Microsoft has suggested that one of the key drivers of the SLM’s success is the quality of its training data: the better the quality of data fed into the model, the better its overall performance.
With Phi-2, Microsoft used what it calls “textbook-quality” training data, which incorporates synthetic datasets to teach the model common sense reasoning and general knowledge (science, daily activities, theory of mind).
This synthetic data is then combined with web data that has been “filtered based on educational value and content quality.”
It’s worth noting that Phi-2 hasn’t undergone alignment via reinforcement learning from human feedback or instruction fine-tuning, so there is potential for its performance to be enhanced further through these measures.
In any case, the initial results highlight that low-parameter models can be competitive with larger parameter models if they are trained on carefully curated, high-quality datasets.
The Bottom Line
Although SLMs are a long way from reaching the capabilities of leading LLMs like GPT-4, Phi-2’s performance against Llama 2 70B on reasoning tasks suggests that the gap is closing.
Organizations that want to leverage generative AI on a more cost-efficient and computationally efficient basis can look toward SLMs as a potential alternative.