Stanford Study Reveals ChatGPT Performance Decline Over Time

Author: Ajax

ChatGPT Performance Fluctuations: A Deep Dive

A widely discussed study, 'How Is ChatGPT's Behavior Changing Over Time?', conducted by researchers at Stanford University and UC Berkeley, has documented notable variations in the performance and behavior of both GPT-3.5 and GPT-4. The research, published in the Harvard Data Science Review, compared the March 2023 and June 2023 versions of the models across seven distinct tasks: mathematical problem-solving, answering sensitive or dangerous questions, opinion surveys, multi-hop knowledge-intensive question answering, code generation, the US Medical Licensing Exam, and visual reasoning.

Significant Performance Changes

The study revealed considerable shifts in performance for both models within the three-month window. For instance, GPT-4's accuracy in distinguishing prime from composite numbers plummeted from 84% in March to 51% in June, a decline partly attributed to a diminished ability to follow chain-of-thought prompts. Conversely, GPT-3.5 improved on this same task over the same period, underscoring how unpredictably these models can change.
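
A before/after comparison of this kind is straightforward to sketch in code. The example below assumes the OpenAI Python client (v1+) and the dated snapshots gpt-4-0314 and gpt-4-0613; the prompt wording, the scoring rule, and the tiny number set are illustrative stand-ins, not the study's exact protocol.

```python
# Minimal sketch of a March-vs-June primality evaluation, assuming the
# OpenAI Python client (v1+) and dated snapshots. Prompt and scoring are
# illustrative, not the study's actual protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_number(model: str, n: int) -> str:
    """Ask whether n is prime, using a chain-of-thought style prompt."""
    prompt = (
        f"Is {n} a prime number? Think step by step, then give your final "
        "answer on the last line as a single word: prime or composite."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep answers as deterministic as possible
    )
    # Treat the last whitespace-separated token as the verdict.
    return resp.choices[0].message.content.strip().split()[-1].strip(".\"'").lower()

def accuracy(model: str, labeled: dict[int, str]) -> float:
    """Fraction of a balanced prime/composite set classified correctly."""
    return sum(classify_number(model, n) == lab for n, lab in labeled.items()) / len(labeled)

# Tiny balanced set for illustration; the study used a much larger one.
sample = {101: "prime", 221: "composite", 307: "prime", 561: "composite"}
for snapshot in ("gpt-4-0314", "gpt-4-0613"):
    print(snapshot, accuracy(snapshot, sample))
```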

Other Key Findings

  • Sensitive Questions: By June, GPT-4 had become less willing to answer sensitive questions and opinion-survey questions.
  • Reasoning Abilities: While GPT-4 improved in solving multi-step reasoning problems, GPT-3.5 experienced a decline in these tasks.
  • Code Generation Errors: Both models produced more formatting errors during code generation, such as wrapping code in extra markup that keeps it from running as-is (see the sketch after this list).
  • Instruction Following: GPT-4's ability to adhere to user instructions also showed a noticeable decline.
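
The code-generation finding in particular lends itself to a mechanical check. The sketch below shows one plausible way to test whether raw model output runs as-is, and to strip a surrounding markdown fence before retrying; it illustrates the idea rather than reproducing the study's evaluation code.

```python
# Illustrative check for "directly executable" model output: extra
# formatting (e.g. markdown ``` fences) counts as a formatting error
# because it keeps the code from running without cleanup.
import re

def is_directly_executable(output: str) -> bool:
    """True if the raw model output compiles as Python without cleanup."""
    try:
        compile(output, "<model-output>", "exec")
        return True
    except SyntaxError:
        return False

def strip_markdown_fences(output: str) -> str:
    """Remove a surrounding ```python ... ``` fence if present."""
    match = re.search(r"```(?:python)?\n(.*?)```", output, re.DOTALL)
    return match.group(1) if match else output

raw = "```python\nprint('hello')\n```"
print(is_directly_executable(raw))                         # False: fences break it
print(is_directly_executable(strip_markdown_fences(raw)))  # True after cleanup
```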

Evaluation Methodology: Testing the Models

The researchers evaluated GPT-3.5 and GPT-4 on a set of tasks chosen for diversity and representativeness, spanning seven major domains:

  • Mathematical problems
  • Sensitive/dangerous issues
  • Opinion surveys
  • Multi-hop knowledge-intensive questions
  • Code generation
  • US Medical Licensing Exam
  • Visual reasoning

To further understand the behavioral changes, the team developed a new benchmark focused on task-independent instruction following. This benchmark included four types of common instructions: answer extraction, stop apologizing, avoid specific words, and content filtering.
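
To make this concrete, compliance with such instructions can be scored mechanically. The sketch below assumes, purely for illustration, an answer-extraction format in which the model is told to place a final yes/no answer in square brackets; the study's exact formats may differ.

```python
# Illustrative scoring of compliance with an "answer extraction"
# instruction. The bracketed-answer format and sample responses are
# assumptions for the sketch, not the study's exact setup.
import re

INSTRUCTION = (
    "Answer the question, then put your final answer in square brackets, "
    "e.g. [yes] or [no]."
)

def follows_instruction(response: str) -> bool:
    """Compliant iff the response contains a bracketed yes/no answer."""
    return re.search(r"\[(yes|no)\]", response, re.IGNORECASE) is not None

def compliance_rate(responses: list[str]) -> float:
    return sum(map(follows_instruction, responses)) / len(responses)

# Toy before/after responses mimicking the reported March-to-June shift.
march_like = ["The claim holds. [yes]", "That does not follow. [no]"]
june_like = ["Yes, the claim holds.", "No, that does not follow."]
print(compliance_rate(march_like))  # 1.0
print(compliance_rate(june_like))   # 0.0
```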

Instruction Following: A Critical Test

This series of tests was designed to evaluate the models' ability to follow instructions independently of any specific skill or knowledge. In March, GPT-4 followed most individual instructions well, but by June it had begun to disregard them. For example, the compliance rate for answer-extraction instructions dropped from 99.5% to nearly zero, and fidelity to content-filtering instructions fell from 74.0% to 19.0%.

Specific Instruction Types and Performance

  • Answer Extraction: This instruction tests the model's ability to locate and identify the answer within a given text. GPT-4's compliance dropped significantly from 99.5% in March to nearly zero in June.
  • Stop Apologizing: This tests the model's ability to avoid using apologies or self-identifying as an AI. While compliant in March, GPT-4 frequently violated this instruction by June.
  • Avoid Specific Words: This instruction checks the model's ability to honor explicit lexical constraints. The decline from March to June indicates a reduced capacity to handle such constraints (a toy compliance check follows this list).
  • Content Filtering: This instruction requires the model to exclude specific topics or sensitive information. GPT-4's filtering ability significantly decreased from March to June, with only about 19% of sensitive issues handled correctly.
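
As promised above, here is a toy compliance check for the avoid-specific-words instruction type; the banned-word list and the simple tokenizer are illustrative assumptions.

```python
# Toy compliance check for the "avoid specific words" instruction type;
# the banned list and tokenizer are assumptions made for this sketch.
def avoids_banned_words(response: str, banned: set[str]) -> bool:
    """Compliant iff no banned word appears anywhere in the response."""
    tokens = {word.strip(".,;:!?\"'").lower() for word in response.split()}
    return banned.isdisjoint(tokens)

banned = {"basically", "clearly"}
print(avoids_banned_words("The result is positive.", banned))           # True
print(avoids_banned_words("Clearly, the result is positive.", banned))  # False
```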

Implications of the Research

The researchers emphasized that because GPT-3.5 and GPT-4 are closed-source, with OpenAI disclosing neither its training data nor its update process, users have little transparency into what changes with each major update. Studies like this one help developers and users understand ChatGPT's performance and behavioral dynamics, which is vital for ensuring the model's safety and the authenticity of its output. The findings also highlight the challenge of keeping these models consistent and reliable while they are being updated frequently.