Insights: Breaking Down the Transformative Journey of GPT Models in AI, from GPT-1 to GPT-4

The GPT series has transformed the AI landscape. Each successive model exhibits progress in capabilities, with training computations (expressed in FLOPs) demonstrating the immense resources allocated. However, a recent study observed shifts in GPT-4 and GPT-3.5’s outputs as time progressed, suggesting that there has been an overall decline in their performance. Princeton researchers disputed these findings, suggesting biases in the datasets and evaluations, highlighting the challenges in assessing language models.

Artificial intelligence (AI) has seen major changes since the Generative Pre-trained Transformer (GPT) series started in 2018.

Successive models brought enhancements, upgrades, and challenges, capturing the interest of enthusiasts, researchers, and users. From GPT-1’s basic text creation to GPT-4’s diverse skills, the progress is evident. Continuous studies examine these models’ actions, shedding light on their changing skills and possible issues.

This article covers the growth and study of the generative pre-trained transformer models. It centers on their performance scores and insights from different tests.

The Evolution of the Generative Pre-Trained Transformer Series

An essential aspect of understanding the advancements in the GPT series is the training computation, often measured in total FLOP (floating-point operations). A FLOP represents a basic arithmetic operation, such as addition, subtraction, multiplication, or division, performed on two floating-point numbers.

When it comes to scale, one petaFLOP equals a staggering quadrillion (10^15) FLOP. This measure of computation showcases the vast resources invested in training these models.
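To put these units in perspective, the petaFLOP totals quoted throughout this article can be converted to raw FLOP with a simple calculation (an illustrative sketch using the figures given in the sections below):

```python
# Convert the article's quoted training-compute totals from petaFLOP to FLOP.
PETAFLOP = 10 ** 15  # one quadrillion floating-point operations

training_compute_petaflop = {
    "GPT-1": 17_600,           # 1.76e19 FLOP
    "GPT-2": 1_490_000,        # 1.49e21 FLOP
    "GPT-3": 314_000_000,      # 3.14e23 FLOP
    "GPT-4": 21_000_000_000,   # 2.1e25 FLOP
}

for model, pflop in training_compute_petaflop.items():
    print(f"{model}: {pflop * PETAFLOP:.3g} FLOP")
```

Each successive model's training budget is roughly two orders of magnitude larger than its predecessor's.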

Launch of GPT in 2018

GPT-1, introduced in June 2018, marked the inception of the generative pre-trained transformer model series. This laid the groundwork for the ChatGPT of today. GPT-1 showcased the potential of unsupervised learning in language understanding, predicting the next word in sentences using books as training data.


GPT-1 was trained using 17,600 petaFLOPs.

The leap to GPT-2 in 2019

In February 2019, GPT-2 emerged as a significant upgrade to the generative pre-trained transformer series. It exhibited substantial improvements in text generation, producing coherent, multi-paragraph content. However, due to potential misuse concerns, GPT-2’s public release was initially withheld. It was eventually launched in November 2019 after OpenAI‘s careful risk assessment.

GPT-2 was trained using 1.49 million petaFLOPs. 

The revolutionary GPT-3 in 2020

GPT-3 arrived as a monumental leap in June 2020. Its advanced text generation found applications in email drafting, article writing, poetry creation, and even programming code generation. It also demonstrated capabilities in answering factual queries and language translation.

GPT-3 was trained using 314 million petaFLOPs. 

GPT-3.5’s Impact 

GPT-3.5 is an improved version of GPT-3, released in 2022. This generative pre-trained transformer model has fewer parameters and uses fine-tuning for better machine learning (ML) performance, applying reinforcement learning from human feedback (RLHF) to make its outputs more accurate and effective. GPT-3.5 is also designed to align with human values, helping ensure the AI it powers is safe and reliable for humans to use.

This model is offered for free use by OpenAI. The number of petaFLOPs used for training is not available. 

Introduction of the multimodal GPT-4 in 2023

GPT-4, the most recent version, carries forward the trend of remarkable advancement, introducing enhancements such as:

  • Enhanced model alignment, enabling it to understand user intentions better;
  • Reduced chances of producing offensive or harmful content;
  • Heightened factual precision;
  • Improved steerability, allowing it to adapt its behavior based on user prompts;
  • Internet connectivity, a new feature enabling real-time Internet searching.

This model is offered to ChatGPT Plus subscribers. 

GPT-4 was trained using 21 billion petaFLOPs. 

GPT-3.5 vs. GPT-4: A Research Study

A research paper emerged from Stanford University and the University of California, Berkeley, that highlighted the shifts in GPT-4 and GPT-3.5’s outputs as time progressed. The paper suggests that there has been an overall decline in the performance of these generative pre-trained transformer models. 

Lingjiao Chen, Matei Zaharia, and James Zou studied OpenAI’s models by using API access to examine the models from March and June 2023. They conducted tests to understand the generative pre-trained transformer models’ evolution and adaptability over time.

Prime vs. Composite Numbers 

The researchers wanted to check whether GPT-4 and GPT-3.5 can tell whether numbers are prime or composite. They used 1,000 questions for this test: half were prime numbers taken from a list published in another paper, and the other half were composite numbers drawn from the range 1,000 to 20,000.

A method called Chain-of-Thought (CoT) prompting was used to help the generative pre-trained transformers reason. This method breaks the task down: first check whether the number is even, then find its square root, and finally test whether any prime numbers up to that square root divide it.
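The three-step check described above can be sketched in a few lines of Python (a minimal illustration; the `is_prime` helper name is mine, not from the study, and it trial-divides by every odd candidate, which covers all primes in that range):

```python
import math

def is_prime(n: int) -> bool:
    # Step 1: handle small cases, then rule out even numbers
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    # Step 2: only divisors up to the square root need checking
    limit = math.isqrt(n)
    # Step 3: test every odd candidate divisor up to that limit
    for d in range(3, limit + 1, 2):
        if n % d == 0:
            return False
    return True

print(is_prime(7919))   # True: 7919 is the 1,000th prime
print(is_prime(7917))   # False: 7917 = 3 * 7 * 13 * 29
```

The point of CoT prompting is to get the model to walk through exactly this kind of decomposition in words rather than answer in one step.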

These were the results:

GPT-4:
  • March 2023: 84% accuracy
  • June 2023: 51% accuracy

GPT-3.5:
  • March 2023: 49.6% accuracy
  • June 2023: 76.2% accuracy

Happy Numbers

The test aimed to check how well ChatGPT can identify happy numbers within a set range. A happy number is one where repeatedly summing the squares of its digits eventually produces 1.

For example, 13 is a happy number because 1 squared plus 3 squared equals 10, and then 1 squared plus 0 squared equals 1.

The study focused on this because the question has a definite numerical answer rather than a simple yes or no, and it involves only simple math.

For this test, 500 questions were created. Each question asked about how many happy numbers are in a certain range. The range’s size varied, and its start point was picked from numbers between 500 and 15,000. The test used CoT to help with logical thinking.
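As an illustration, the kind of range question posed to the models can be answered mechanically in a few lines of Python (the helper names here are my own, not from the study):

```python
def digit_square_sum(n: int) -> int:
    """Sum of the squares of n's decimal digits."""
    return sum(int(d) ** 2 for d in str(n))

def is_happy(n: int) -> bool:
    """Repeatedly sum squared digits; happy numbers reach 1,
    unhappy ones fall into a cycle that never contains 1."""
    seen = set()
    while n != 1 and n not in seen:
        seen.add(n)
        n = digit_square_sum(n)
    return n == 1

def count_happy(lo: int, hi: int) -> int:
    """How many happy numbers lie in the inclusive range [lo, hi]?"""
    return sum(is_happy(n) for n in range(lo, hi + 1))

print(count_happy(1, 10))  # 1, 7, and 10 are happy, so this prints 3
```

Because the answer is a single count, the researchers could grade the models' responses unambiguously.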

These were the results:

GPT-4:
  • March 2023: 83.6% accuracy
  • June 2023: 35.2% accuracy

GPT-3.5:
  • March 2023: 30.6% accuracy
  • June 2023: 48.2% accuracy

Sensitive/Dangerous Questions

This test looked at how the generative pre-trained transformer models handled sensitive questions. A set of 100 questions that could elicit harmful or controversial answers was created for this; models should ideally decline to answer such questions directly.

The researchers used manual labeling to see if a model answered a question directly.

These were the results:

GPT-4:
  • March 2023: 21.0% response rate
  • June 2023: 5.0% response rate

GPT-3.5:
  • March 2023: 2.0% response rate
  • June 2023: 8.0% response rate

Opinion Surveys

This test examined how the language models’ opinion biases changed over time using the OpinionQA dataset. This set had 1,506 opinion questions from top public polls. Questions were in multiple-choice style, and the models were told to “Pick the best single option.”

The main goal was to see whether the generative pre-trained transformer models were ready to give opinions.

These were the results:

GPT-4:
  • March 2023: 97.6% response rate
  • June 2023: 22.1% response rate

GPT-3.5:
  • March 2023: 94.3% response rate
  • June 2023: 96.7% response rate

Multi-hop Knowledge-intensive Questions

To study how well large language models (LLMs) can answer complex multi-hop questions, the researchers used an approach called the LangChain HotpotQA Agent. This approach involved having LLMs search through Wikipedia to find answers to intricate questions. 

The agent was then assigned the task of responding to each query in the HotpotQA dataset.

These were the results:

GPT-4:
  • March 2023: 1.2% exact match
  • June 2023: 37.8% exact match

GPT-3.5:
  • March 2023: 22.8% exact match
  • June 2023: 14.0% exact match

Generating Code

To assess the code generation capabilities of LLMs without the risk of data contamination, a novel dataset was curated using the latest 50 problems categorized as “easy” on LeetCode. These problems come with solutions and discussions that were made public in December 2022.

The generative pre-trained transformer models were presented with these problems, along with the original descriptions and Python code templates.

The code generated by the LLMs was directly submitted to the LeetCode online judge for assessment. If the generated code was accepted by the judge, it signified that the code adhered to Python’s rules and successfully passed the judge’s designated tests.

These were the results:

GPT-4:
  • March 2023: 52.0% directly executable
  • June 2023: 10.0% directly executable

GPT-3.5:
  • March 2023: 22.0% directly executable
  • June 2023: 2.0% directly executable

Medical Exam

This test set out to evaluate the progress of GPT-4 and GPT-3.5 in a specialized field: the United States Medical Licensing Examination (USMLE), a crucial examination for American doctors. The exam served as a benchmark for evaluating LLMs’ medical knowledge. The methodology involved having the generative pre-trained transformer models take the USMLE and then comparing their performance.

These were the results:

GPT-4:
  • March 2023: 86.6% accuracy rate
  • June 2023: 82.4% accuracy rate

GPT-3.5:
  • March 2023: 58.5% accuracy rate
  • June 2023: 57.7% accuracy rate

Visual Reasoning 

This test aimed to see how well LLMs did with visual tasks. Using the ARC dataset, a popular tool for such tests, they asked the LLMs to create grids based on given samples. These grids used colors represented in 2-D arrays. Out of 467 samples tested, they compared the LLMs’ answers to the correct ones to gauge their accuracy.

These were the results:

GPT-4:
  • March 2023: 24.6% exact match rate
  • June 2023: 27.2% exact match rate

GPT-3.5:
  • March 2023: 10.9% exact match rate
  • June 2023: 14.3% exact match rate


The results showed a shift in performance. Both generative pre-trained transformer models had accuracy changes for many tasks, with some tasks improving and others declining.

For example, GPT-4 did better with difficult questions but struggled with coding and math. On the other hand, GPT-3.5 had mixed results in some tasks.

Research indicates that LLMs continue to evolve. Continuous monitoring and assessment are crucial, especially for critical uses. The data emphasizes monitoring changes and the challenge of consistent performance in tasks.

Is GPT-4’s Performance Really Declining? A Closer Look

While the Stanford study raises concerns about GPT-4’s performance, other experts have offered a different perspective. Princeton University’s computer science professor Arvind Narayanan and Ph.D. candidate Sayash Kapoor delved into the paper’s findings and noted the following.

Understanding Chatbots

Chatbots like GPT-4 have two main features: capability (what they can do) and behavior (how they act). While capabilities are established during an intensive pre-training phase, behavior can be adjusted in the subsequent, more frequent fine-tuning phase. After pre-training, the model essentially acts as an autocomplete tool. Its ability to interact in a chat-like manner comes from fine-tuning.

Evaluating Code Generation

The study found that the newer GPT-4 version sometimes adds non-code text in its outputs. Instead of checking the accuracy of the code, the researchers only verified if it was directly executable. This means that the model’s efforts to provide more comprehensive answers were seen as negatives.
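Narayanan and Kapoor’s point can be illustrated with a simple post-processing step: if the model wraps otherwise-correct code in Markdown fences, stripping the fences before submission recovers executable code. This is a hypothetical sketch (the `extract_code` helper and its regex are my own, not from the study):

```python
import re

def extract_code(response: str) -> str:
    """If a model wrapped its answer in Markdown code fences, return
    only the code inside; otherwise return the response unchanged."""
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

fenced = "```python\nprint('hello')\n```"
print(extract_code(fenced))   # the bare code: print('hello')
print(extract_code("x = 1"))  # unfenced input passes through unchanged
```

Under a "directly executable" metric, the fenced response scores zero even though the code inside it is correct, which is exactly the measurement concern the Princeton researchers raised.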

Assessing Math Problems

The study used math problems centered around identifying prime numbers. However, all the numbers they tested were primes. This choice of data influenced the results. 

In fact, Narayanan and Kapoor tested the models with 500 composite numbers and realized that much of the performance degradation was due to this choice of evaluation data. 

In the March release, GPT-4 frequently predicted numbers to be prime, while the June version typically assumed they were composite. The researchers viewed this as a significant decline in performance, primarily because they only evaluated prime numbers. Interestingly, GPT-3.5 displays the opposite behavior.

GPT models comparison chart
Source: AI Snake Oil

In truth, all four models performed similarly poorly, as illustrated in the above graph. Their predictions were influenced by their initial calibration. In most cases, none of the models actually checked whether the numbers had divisors – they just pretended to by listing all the factors that needed to be checked without actually checking them. 
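The role of evaluation-set composition is easy to see with a toy calculation. This hypothetical sketch (the `score` helper is my own) shows why a model that always guesses one class looks either flawless or broken depending on the dataset:

```python
def score(answers, labels):
    """Fraction of questions where a fixed answer matches the true label."""
    return sum(a == l for a, l in zip(answers, labels)) / len(labels)

primes_only = [True] * 500               # the study's set: every number is prime
balanced = [True] * 250 + [False] * 250  # Narayanan and Kapoor's balanced check

always_prime = [True] * 500
always_composite = [False] * 500

print(score(always_prime, primes_only))      # 1.0 -- looks flawless
print(score(always_composite, primes_only))  # 0.0 -- looks broken
print(score(always_prime, balanced))         # 0.5 -- chance level
print(score(always_composite, balanced))     # 0.5 -- chance level
```

On a primes-only set, a shift from guessing “prime” to guessing “composite” looks like a collapse from near-perfect to near-zero accuracy, even though neither behavior reflects any real primality-testing ability.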

Ultimately, Narayanan and Kapoor concluded that the paper doesn’t conclusively prove that GPT-4’s capabilities have declined. However, it highlights the potential unintended consequences of fine-tuning, including significant behavioral changes. 

Evaluating language models remains a challenging task, and it’s crucial to approach such evaluations with a comprehensive understanding of the models’ capabilities and behaviors.

The Bottom Line

The generative pre-trained transformer series stands out in the AI field. But with new ideas comes the need for regular checks.

The performance path of these models, shown in studies, points to changing machine learning results. Some see a drop in skills, while others focus on testing details.

Still, the growth of GPT models holds big meaning for AI’s path ahead. And keeping a flexible view is key, looking at both the ups and downs of these tools.


Maria Webb
Technology Journalist

Maria is a technology journalist with over five years of experience and a deep interest in AI and machine learning. She excels in data-driven journalism, making complex topics both accessible and engaging for her audience. Her work is prominently featured on Techopedia, Business2Community, and Eurostat, where she provides creative technical writing. She holds a Bachelor of Arts Honours in English and a Master of Science in Strategic Management and Digital Marketing from the University of Malta. Maria's background includes journalism covering a range of topics from local events to international tech trends.