OpenAI Model Parameters Leaked: Microsoft Paper Reveals GPT-4o Size
In the tech world, the parameter sizes of large language models (LLMs) have been a closely guarded secret. However, a recent medical paper co-authored by Microsoft and the University of Washington inadvertently disclosed parameter information for several OpenAI models, sparking widespread interest.
Parameter Disclosure
The paper revealed key information, including:
- GPT-4: Approximately 1.76 trillion parameters
- GPT-4o: Approximately 200 billion parameters
- GPT-4o mini: Approximately 8 billion parameters
- o1-preview: Approximately 300 billion parameters
- o1-mini: Approximately 100 billion parameters
- Claude 3.5 Sonnet: Approximately 175 billion parameters
It's important to note that the researchers stated these parameter counts are estimates.
GPT-4o Series Parameter Speculations
The parameter sizes of the GPT-4o series were lower than anticipated, particularly the mini version at only 8 billion parameters. Some commenters speculate that GPT-4o mini may use a Mixture of Experts (MoE) architecture: the model would hold roughly 400 billion parameters in total, but only about 8 billion would be activated for any given token. Such an architecture could let a model with a small active footprint draw on far more stored knowledge while keeping inference fast, as sketched below.
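The speculation above can be illustrated with a toy sketch. The PyTorch snippet below shows the general Mixture-of-Experts idea: a router picks a few experts per token, so only a small fraction of the total parameters runs in any forward pass. This is purely a conceptual illustration with made-up sizes and names; it says nothing about OpenAI's actual architecture.

```python
# Minimal illustration of Mixture-of-Experts routing (not OpenAI's actual design).
# Only top_k experts run per token, so active parameters << total parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)          # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                       # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, chosen = gate_logits.topk(self.top_k, dim=-1)  # pick top_k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                     # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512]); only 2 of 16 experts ran per token
```

With 16 experts and top-2 routing, only about an eighth of the expert parameters are exercised per token, which is the sense in which a model can be large in storage yet small in active compute.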
Claude 3.5 Sonnet Parameter Comparison
Furthermore, some commentators have pointed out that the parameter size of Claude 3.5 Sonnet is comparable to that of GPT-3 davinci. This observation raises questions about the relationship between model performance and scale.
MEDEC Benchmark: A New Standard for Medical Error Detection
The paper that revealed these parameters was actually centered on a new evaluation benchmark called MEDEC. This benchmark is designed to assess the capabilities of large language models in detecting and correcting medical errors. The focus is on errors in clinical notes, covering five areas: diagnosis, management, treatment, pharmacotherapy, and causal organism.
Data Source and Characteristics
The MEDEC dataset consists of 3,848 clinical texts, including 488 clinical notes drawn from three US hospital systems. These notes had not previously been seen by any large language model, which helps ensure the evaluation's authenticity and reliability. The dataset is currently being used in the MEDIQA-CORR shared task to evaluate the performance of 17 participating systems.
Testing and Results
The research team used the MEDEC dataset to test various advanced models, including o1-preview, GPT-4, Claude 3.5 Sonnet, and Gemini 2.0 Flash. Additionally, two professional physicians participated in the same error detection task for a human-machine comparison.
The results showed that while large language models performed well in medical error detection and correction, they still lag behind human doctors. This suggests that MEDEC is a challenging benchmark.
Core Paper Content: LLM Applications and Challenges in Healthcare
The paper highlights that surveys in US healthcare institutions reveal that one in five patients who read their clinical notes report finding errors. Among these errors, 40% are considered serious, with diagnostic errors being the most common.
LLM Applications and Risks in Medical Documentation
As large language models are increasingly used for medical documentation tasks, such as generating clinical notes, ensuring the accuracy and safety of the information they output is crucial. LLMs may produce hallucinations, outputting incorrect or fabricated content. These inaccuracies can have severe implications for clinical decisions.
Significance of the MEDEC Benchmark
To address these concerns and ensure the safety of LLMs in medical content generation, rigorous validation methods are essential. The MEDEC benchmark was introduced to evaluate models' ability to detect and correct medical errors in clinical text.
Construction of the MEDEC Dataset
The MEDEC dataset contains 3,848 clinical texts from different medical fields, annotated by eight medical professionals. The dataset covers five types of errors:
- Diagnosis: Inaccurate diagnosis provided.
- Management: Inaccurate next steps in management.
- Pharmacotherapy: Incorrect recommended drug therapy.
- Treatment: Inaccurate recommended treatment plan.
- Causal Organism: Incorrect causal organism or pathogen identified.
These error types were selected based on the most common question types in medical board exams.
Data Creation Methods
The dataset was constructed using two methods:
- Method #1 (MS): Using medical board exam questions from the MedQA collection, medical annotators injected incorrect answers into the context text.
- Method #2 (UW): Using real clinical note databases from three University of Washington hospital systems, medical student teams manually introduced errors into the records.
Both methods underwent strict quality control measures to ensure the accuracy and reliability of the data.
Medical Error Detection and Correction Methods
To evaluate the models' performance in medical error detection and correction, the researchers divided the process into three sub-tasks:
- Sub-task A: Predict the error flag (0: no error; 1: error).
- Sub-task B: Extract the sentence containing the error.
- Sub-task C: Generate a correction for the sentence containing the error.
The research team built solutions based on LLMs and used two different prompts to generate the required output.
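As a rough sketch of how such an LLM-based solution might be wired up, a single structured prompt can cover all three sub-tasks. The prompt wording, the output format, and the `query_llm` helper below are my own assumptions, not the prompts used in the paper.

```python
# Sketch of an LLM-based medical error detection/correction call (assumed prompt format,
# not the MEDEC paper's actual prompts). `query_llm` stands in for any chat-model API call.
from typing import Callable

PROMPT_TEMPLATE = """You are reviewing a clinical note for a single medical error
(diagnosis, management, treatment, pharmacotherapy, or causal organism).

Clinical note:
{note}

Reply in exactly three lines:
ERROR_FLAG: 0 or 1
ERROR_SENTENCE: the erroneous sentence, or NONE
CORRECTION: the corrected sentence, or NONE
"""

def detect_and_correct(note: str, query_llm: Callable[[str], str]) -> dict:
    """Run sub-tasks A (flag), B (sentence), and C (correction) with one prompt."""
    reply = query_llm(PROMPT_TEMPLATE.format(note=note))
    result = {"error_flag": 0, "error_sentence": None, "correction": None}
    for line in reply.splitlines():
        if line.startswith("ERROR_FLAG:"):
            result["error_flag"] = int(line.split(":", 1)[1].strip() or 0)
        elif line.startswith("ERROR_SENTENCE:"):
            value = line.split(":", 1)[1].strip()
            result["error_sentence"] = None if value == "NONE" else value
        elif line.startswith("CORRECTION:"):
            value = line.split(":", 1)[1].strip()
            result["correction"] = None if value == "NONE" else value
    return result
```

Any chat-model client can be plugged in as `query_llm`; the authors report using two different prompts, whereas this single-prompt variant just makes the sub-task structure explicit.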
Experiments and Results
Language Models
The researchers experimented with several language models, including Phi-3-7B, Claude 3.5 Sonnet, Gemini 2.0 Flash, ChatGPT, GPT-4, GPT-4o, o1-mini, and o1-preview.
Analysis of Experimental Results
The experimental results showed that Claude 3.5 Sonnet performed well in error flag detection and error sentence detection. o1-preview performed best in error correction. However, all models still performed worse than human physicians in medical error detection and correction.
The results also indicate that the models had issues with precision and, in many cases, over-predicted the existence of errors (i.e., hallucinated). Furthermore, there was a ranking discrepancy between classification performance and error correction generation performance.
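To make the precision point concrete, here is a minimal sketch of error-flag precision and recall for sub-task A. The metric definitions are standard; the example labels are invented to show how over-predicting errors inflates recall while hurting precision.

```python
# Precision and recall for the binary error-flag sub-task (illustrative labels only).
def precision_recall(gold: list[int], predicted: list[int]) -> tuple[float, float]:
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A model that over-predicts errors: it flags every note, so recall is perfect but precision drops.
gold = [1, 0, 0, 1, 0]
predicted = [1, 1, 1, 1, 1]
print(precision_recall(gold, predicted))  # (0.4, 1.0)
```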
Error Type Analysis
Across the different error types, o1-preview achieved higher recall in error flag and error sentence detection, while the physicians performed better in terms of accuracy.
Future Research Directions
Researchers stated that the next step in their research involves introducing more examples and optimizing prompts to further improve the models' performance in medical error detection and correction.