If there’s anything we’ve learned in 2023, it’s that open-source AI is rapidly gaining ground. While OpenAI’s November release of ChatGPT stole the headlines in 2022, more and more high-performance open-source large language models (LLMs) have been emerging for research and commercial use this year.
While these pre-trained open-source LLMs aren't yet ready to unseat proprietary models like GPT-4, they can be a viable alternative to LLMs like GPT-3.5.
Below, we look at six of the top open-source LLMs to watch in 2024 as the open-source AI ecosystem continues to evolve.
6 Best Open-Source LLMs
6. Llama 2: Best Open-Source LLM Overall
One of the most significant open-source LLMs to launch this year is Meta’s Llama 2, arguably the best open-source LLM for commercial use due to its overall versatility and performance.
Back in July, Meta and Microsoft announced the release of Llama 2, a pre-trained generative AI model trained on 2 trillion tokens and available in sizes ranging from 7 billion to 70 billion parameters. It's worth highlighting that Llama 2 was trained on 40% more data than Llama 1 and supports double the context length.
At the time of writing, Llama 2 remains one of the highest-performing open-source language models on the market, excelling in key benchmarks covering reasoning, coding proficiency, and knowledge.
Currently, the Hugging Face Open LLM Leaderboard ranks Llama 2 70B as the second-best LLM on the market, scoring 67.35 on average, 67.32 on ARC, 87.33 on HellaSwag, 69.83 on MMLU, and 44.92 on TruthfulQA.
Llama 2 has also demonstrated promising performance against proprietary models like GPT-4. Waleed Kadous, Chief Scientist at Anyscale and former Principal Engineer at Google, published a blog post finding that Llama 2 achieved roughly the same accuracy at summarization as GPT-4 while being around 30x cheaper to run.
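If you want to try Llama 2 yourself, the chat-tuned weights can be loaded with Hugging Face's transformers library once you've accepted Meta's license. Here's a minimal sketch, assuming access to the gated meta-llama/Llama-2-7b-chat-hf checkpoint and a GPU with enough memory:

```python
# Minimal sketch: running Llama 2 7B Chat with Hugging Face transformers.
# Assumes you have accepted Meta's license for the gated checkpoint and
# have the accelerate package installed (needed for device_map="auto").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on consumer GPUs
    device_map="auto",
)

# Llama 2 chat models expect the [INST] ... [/INST] prompt format.
prompt = "[INST] Summarize in one sentence: open-source LLMs gained significant ground in 2023. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```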
It’s worth noting that Meta also has a new version of Llama 2 called Llama 2 Long, designed to perform well when responding to long queries. It is a modified version of Llama 2 that was trained on 400 billion additional tokens and supports a 32,000-token context length.
At release, Meta claimed that the 70B variant of Llama 2 Long surpasses GPT-3.5-16k's performance on long-context tasks such as question answering, text summarization, and multi-document aggregation.
Pros
- Generates natural language
- Fine-tuned for chat use cases
- Few-shot learning
- Multi-task learning
- Uses fewer computational resources than LLMs of a similar size
- Translates between multiple languages
- Supports multiple programming languages
- Generates safer output
- Fine-tuned with a diverse dataset of over 1 million human annotations
Cons
- Training can be financially and computationally costly
- Not as creative as models like GPT-3.5
- Limited support for languages other than English
- Performance depends on pre-training data quality
- Prone to hallucinations
5. Falcon 180B: Most Powerful Open Access Model
One of the biggest open-access LLMs to launch in 2023 was Falcon 180B, a 180-billion-parameter language model from the United Arab Emirates' Technology Innovation Institute (TII), trained on 3.5 trillion tokens taken from the RefinedWeb dataset.
It was designed to excel in completing natural language tasks, and as of October 2023, is the top-ranked LLM on the Hugging Face Open LLM Leaderboard for pre-trained language models, achieving an average score of 68.74, 69.8 on ARC, 88.95 on HellaSwag, 70.54 on MMLU, and 45.67 on TruthfulQA.
The TII claims Falcon 180B has “performed exceptionally well” on reasoning, coding proficiency, and knowledge tests, outperforming competitors like Llama 2 in some areas and performing “on par” with Google’s PaLM 2, which powers the popular Bard chatbot.
Researchers who want to experiment with Falcon 180B in a chatbot context can use Falcon 180B Chat, a version of the main model fine-tuned on chat and instruction data.
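Loading a model of this scale is a serious undertaking. As a rough sketch, assuming access to the gated tiiuae/falcon-180B-chat checkpoint, 4-bit quantization via bitsandbytes can cut the memory footprint, though a multi-GPU node is still required:

```python
# Rough sketch: loading Falcon 180B Chat with 4-bit quantization.
# The fp16 weights alone are roughly 360GB, so even quantized you need
# a multi-GPU machine; device_map="auto" shards layers across GPUs.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-180B-chat"  # gated: requires accepting TII's license

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

prompt = "User: What is the RefinedWeb dataset?\nFalcon:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```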
However, one of the key limitations of Falcon 180B is its quite restrictive license. In addition to forbidding users from using the LLM to break local or international laws or to harm other living beings, it requires organizations that intend to host the model or offer managed services based on it to obtain a separate license.
In addition, Falcon 180B lacks the guardrails of proprietary LLMs and of open-source LLMs that have been fine-tuned for safety, like Llama 2, which means it can more easily be put to malicious use.
Pros
- More powerful than popular tools like GPT-3.5 and Llama 2
- Generates text
- Writes and debugs code
- Optimized for inference
- Available for research and commercial use
- Fine-tuned on chat and instruction data
- Trained on diverse data (including the RefinedWeb dataset)
Cons
- Open-access rather than open source
- Restrictions on commercial use
- Requires powerful hardware to run
- Not as user-friendly as other tools on the market
- Requires TII's permission before offering hosted access to the model
4. Code Llama: Best Open LLM for Code Generation
When it comes to code creation, one of the most exciting releases this year came from Meta in the form of Code Llama, an AI model created by further training Llama 2 on code-specific datasets, including 500 billion tokens of code and code-related data.
Code Llama is available in 7B, 13B, and 34B parameter sizes and has been fine-tuned to generate code and explain what code does in a range of languages, including Python, C++, Java, PHP, TypeScript (JavaScript), C#, Bash, and more.
For example, users can ask the model to write a function that outputs the Fibonacci sequence or to explain how to list all text files in a given directory.
This makes it ideal for developers aiming to streamline their workflows or novice coders looking to better understand what a piece of code does and how it works.
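To make the Fibonacci example above concrete, here is the kind of Python function Code Llama typically produces for that request (the exact output varies with sampling settings):

```python
def fibonacci(n):
    """Return the first n numbers of the Fibonacci sequence."""
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```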
There are two main variations of Code Llama: Code Llama Python and Code Llama Instruct. Code Llama Python is trained on an extra 100B tokens of Python code to offer users better code generation capabilities in the Python programming language.
Code Llama Instruct is a fine-tuned version of Code Llama trained on 5 billion tokens of human instruction data, developed to follow human instructions more reliably.
Pros
- Capable of generating natural language and code
- Fine-tuned variants available for Python (Code Llama Python) and instruction following (Code Llama Instruct)
- Supports a wide range of programming languages
- Can be used locally
- Free for research and commercial use under the same community license as Llama 2
Cons
- Coding performance lags behind GPT-4 without additional fine-tuning
- Largest variant tops out at 34B parameters
- Risk of prompt injections
- Prone to hallucination
3. Mistral: Best 7B Pretrained Model
In September 2023, Mistral AI announced the release of Mistral 7B, a small but high-performance open-source LLM with 7 billion parameters, designed to run more efficiently than larger closed-source models, which makes it well suited to real-time applications.
Mistral 7B uses techniques such as grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle longer sequences at a lower cost.
These techniques enable the LLM to process and generate large texts faster and at a lower cost than more resource-intensive LLMs.
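To make the sliding window idea concrete, here is a minimal sketch in plain PyTorch (an illustration of the masking pattern, not Mistral's actual implementation): each token attends only to itself and the previous few tokens, so per-token cost is bounded by the window size rather than the full sequence length.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query token may attend to.

    Token i attends to tokens max(0, i - window_size + 1) through i, so the
    cost per token is bounded by window_size rather than by seq_len.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row vector)
    return (j <= i) & (j > i - window_size)

# Each row shows which earlier tokens one query token can "see".
print(sliding_window_causal_mask(seq_len=8, window_size=4).int())
```

Information from outside the window still propagates indirectly: each layer's window attends over the previous layer's outputs, which themselves already summarize earlier tokens.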
The organization’s release announcement indicates that Mistral 7B scored 80.0% on ARC-e, 81.3% on HellaSwag, 60.1% on MMLU, and 30.5% on the HumanEval benchmark, significantly outperforming Llama 2 7B in each category.
Mistral AI also reported that Mistral 7B outperforms Llama 1 34B in code, mathematics, and reasoning while approaching Code Llama 7B's performance on code tasks.
Together, this information suggests that Mistral 7B is a viable choice for both natural language and code generation tasks.
There is also an alternative version of Mistral 7B called Mistral 7B Instruct, which has been trained on publicly available conversation datasets and outperforms all other 7B models on the MT-Bench benchmark.
On another note, it is worth mentioning that some commentators have voiced concerns over Mistral 7B's lack of content moderation, which has allowed it to generate problematic content, such as instructions for how to create a bomb.
Pros
- Generates natural language and code
- Fine-tuned version of the model available for chat use cases (Mistral 7B Instruct)
- Fast inference (via grouped-query attention)
- Reduced inference cost (via sliding window attention)
- Can be used locally
- No restrictions under Apache 2.0 license
Cons
- Coding performance lags behind GPT-4 without fine-tuning
- Limited parameter count
- Vulnerable to prompt injections
- Can hallucinate facts
2. Vicuna: Best Output Quality for Its Size
Vicuna 13B is an open-source chatbot released back in March 2023 by students and faculty members at UC Berkeley operating under the open research organization Large Model Systems Organization (LMSYS Org).
LMSYS Org’s researchers took Meta’s Llama model and fine-tuned it with 70,000 ChatGPT conversations shared by users on ShareGPT.com. Training Llama on this data has given Vicuna the ability to generate detailed and articulate responses to user queries with a level of sophistication comparable to ChatGPT.
For example, preliminary tests conducted by LMSYS Org suggest that Vicuna achieves 90% of the quality of ChatGPT and Bard while outperforming Llama and Stanford Alpaca in 90% of scenarios (although the researchers admit that further research is needed to fully evaluate the solution).
LMSYS Org also reports that Vicuna 13B achieved 6.39 on MT-Bench, an arena Elo rating of 1,061, and 52.1 on MMLU.
Similarly, on the AlpacaEval leaderboard, which ranks the instruction-following capabilities of language models, Vicuna 13B achieved a win rate of 82.11%, compared to 81.71% for GPT-3.5, and 92.66% for Llama 2 Chat 70B.
These results are impressive considering that Vicuna 13B cost roughly $300 to train.
There is also a larger version of Vicuna called Vicuna-33B, which scores 7.12 on MT-Bench and 59.2 on MMLU.
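Because Vicuna was fine-tuned on ShareGPT conversations, it answers best when prompted with its own conversation template. A minimal sketch, assuming the lmsys/vicuna-13b-v1.5 checkpoint on Hugging Face (the template below matches the format LMSYS documents for v1.1 and later):

```python
# Minimal sketch: prompting Vicuna with its conversation template.
# Assumes the lmsys/vicuna-13b-v1.5 checkpoint and a GPU with roughly
# 28GB of memory for fp16 weights (or add quantization for smaller cards).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Vicuna's template: a fixed system preamble plus USER/ASSISTANT turns.
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions. USER: Explain transfer learning in two sentences. ASSISTANT:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```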
Pros
- Produces detailed natural language output
- Lightweight
- Inexpensive to train (roughly $300)
- Fine-tuned with over 70K conversations taken from ShareGPT
- Commercially available
Cons
- Limited performance in tasks involving reasoning and mathematics
- Can hallucinate information
- Limited content moderation controls
1. Giraffe: Best Long-Context Model
In September 2023, Abacus.AI released a 70B version of Giraffe, a family of fine-tuned AI models based on Llama 2 that extends the model's context length from 4,096 to 32,000 tokens. Abacus.AI has given Giraffe a long context window to help improve performance on downstream processing tasks.
Extending the context length enables the LLM to retrieve more information from a downstream dataset while making fewer errors. It also helps the model maintain longer conversations with users.
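Giraffe reaches these lengths by modifying Llama 2's rotary position embeddings and then fine-tuning on longer sequences. As a rough illustration of the underlying idea (not Abacus.AI's exact recipe, which compared several interpolation schemes), Hugging Face transformers exposes a similar linear RoPE interpolation for Llama-family models:

```python
# Rough illustration of RoPE-based context extension, not Abacus.AI's exact
# recipe: linearly interpolating Llama 2's rotary position embeddings lets a
# model pretrained at 4,096 tokens accept much longer inputs.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # gated: requires accepting Meta's license
    rope_scaling={"type": "linear", "factor": 8.0},  # 4,096 * 8 = 32,768 tokens
    device_map="auto",
)
```

On its own, interpolation like this degrades quality at long range; Giraffe's checkpoints add fine-tuning on long sequences to recover it.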
Abacus.AI claims that Giraffe displays the best performance of all open-source models in extraction, coding, and mathematics. On the MT-Bench evaluation benchmark, the 70B version of Giraffe achieves a score of 7.01.
“We conducted an evaluation of the 70B model on our set of benchmarks that probe LLM performance over long contexts,” said Bindu Reddy, CEO of Abacus.AI.
“The 70B model improves significantly at the longest context windows (32k) for the document QA task vs the 13B model, scoring 61% accuracy vs. the 18% accuracy of 13B on our AltQA dataset. We also find that it outperforms the comparable LongChat-32k model at all context lengths, with an increasing performance at the longest context lengths (recording 61% vs. 35% accuracy at 32k context lengths).”
It’s also worth noting that Abacus.AI has reported that Giraffe 16k “should perform well on real-world tasks up to 16k context lengths” and potentially up to 20-24k context lengths.
Pros
- Understands and generates natural language text
- Large context window supports larger inputs and longer conversations
- The 16k model should perform well on tasks up to 16k context length
- A Vicuna-instruction fine-tuned version of the model is available
Cons
- Requires significant computational power
- Retrieval accuracy requires fine-tuning
- Prone to hallucinations
The Bottom Line
While this article only scratches the surface of the LLMs being developed and fine-tuned on an open-source basis, these models illustrate that the range of open-source AI solutions is growing rapidly.
If you want your LLM open-source and freely available, there are plenty of options on the market. As more iterations of these models continue to be released and fine-tuned, the utility of these solutions will continue to expand.