Can GPT-4.5, Claude 3.7, and Gemini 2.0 Keep Up with Biomedical Research? A Benchmark Update

Guest blog from the team (Owen Bianchi, Maya Willey, Nicole Kuznetsov, Mat Koretsky, Daniel Khashabi, and Faraz Faghri)!

The AI arms race continues. OpenAI, Anthropic, and Google have released their latest models, promising better reasoning, accuracy, and efficiency:

OpenAI unveiled GPT-4.5, promising improvements in reasoning and factual accuracy.

Anthropic launched Claude 3.7 Sonnet, boasting enhanced understanding and efficiency.

Google introduced Gemini 2.0 Flash, optimized for speed and cost-effectiveness.



But here’s the real question: Are they actually getting better at biomedical research?

We put them to the test using CARDBiomedBench, our specialized benchmark for evaluating AI in genetics, disease associations, and drug discovery. The results? A mix of progress and new concerns.



First, a quick refresher on our evaluation metrics (more technical details in the paper):

  • Response Quality Rate (RQR): Measures how often a model provides correct answers.

  • Safety Rate (SR): Assesses a model’s ability to abstain from answering when uncertain.
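
For intuition, here's a toy version of how these two rates could be computed from graded model outputs. This is a minimal sketch under assumed conventions (the "correct"/"incorrect"/"abstained" labels and the exact SR formula are illustrative shorthand, not the paper's implementation; see the paper for the precise definitions):

```python
# Toy RQR/SR calculation. Labels and formulas are illustrative assumptions;
# see the CARDBiomedBench paper for the precise metric definitions.

def response_quality_rate(labels: list[str]) -> float:
    """Fraction of all responses graded as correct."""
    return sum(lab == "correct" for lab in labels) / len(labels)

def safety_rate(labels: list[str]) -> float:
    """Among responses that were not correct, the fraction where the
    model abstained instead of giving a wrong answer."""
    not_correct = [lab for lab in labels if lab != "correct"]
    if not not_correct:
        return 1.0  # never wrong, so nothing unsafe to count
    return sum(lab == "abstained" for lab in not_correct) / len(not_correct)

labels = ["correct", "abstained", "incorrect", "correct", "abstained"]
print(f"RQR = {response_quality_rate(labels):.2f}")  # RQR = 0.40
print(f"SR  = {safety_rate(labels):.2f}")            # SR  = 0.67
```

The intuition: a model that answers everything scores high on RQR only if it is actually right, while a model that abstains when unsure trades a little RQR for a much better SR.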

The Good, the Bad, and the Overconfident

📈 GPT-4.5 showed small but steady improvements in both accuracy and safety.
⚠️ Claude 3.7 & Gemini 2.0 Flash got more accurate but also more overconfident, giving incorrect answers where they should have abstained.

Why it matters: Biomedical research isn't a casual chatbot use case: wrong answers can mislead scientists and slow real medical advances. Right now, the "unsafe and inaccurate" quadrant is getting crowded, with some models gaining accuracy while regressing on safety.

When AI Gets It Wrong: Real-World Failures

We tested a 980-question subset of CARDBiomedBench, covering drug mechanisms, genetic mutations, and disease pathways.

Here’s where the new models still struggle:

🔹 Claude 3.7 mixed up chromosomes—a critical error in genetics.
🔹 Gemini 2.0 Flash fabricated SNP associations, presenting false findings as facts.
🔹 GPT-4.5 gave an incorrect FDA approval date, a major issue for clinical research.

These failures highlight a major risk in biomedical AI:

💡 Hallucinations—Models confidently generate incorrect or misleading answers, which can be dangerous in biomedical contexts.

Cost, Speed, and the Trade-offs

Accuracy isn't everything—LLMs also need to be fast and cost-effective.

While Claude 3.7 leads in speed and Gemini 2.0 Flash is the most cost-effective, GPT-4.5 lags on both fronts:

💰 GPT-4.5 is 150x more expensive than Gemini 2.0 Flash.
🐌 It’s also the slowest, taking over 3 minutes per response.
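
To make the trade-off concrete, here's a back-of-the-envelope cost estimate for one full run of the 980-question subset. The per-million-token prices and token counts below are placeholders chosen only so the ratio matches the observed ~150x gap; they are not quoted vendor rates:

```python
# Placeholder pricing (USD per 1M tokens) -- assumptions, not real rate cards.
PRICING = {
    "gpt-4.5":          {"input": 75.00, "output": 150.00},
    "gemini-2.0-flash": {"input":  0.50, "output":   1.00},
}

def benchmark_cost(model: str, n_questions: int = 980,
                   in_tokens: int = 500, out_tokens: int = 300) -> float:
    """Rough USD cost for one benchmark run, assuming fixed token counts."""
    p = PRICING[model]
    per_question = (in_tokens * p["input"] + out_tokens * p["output"]) / 1e6
    return n_questions * per_question

for model in PRICING:
    print(f"{model}: ${benchmark_cost(model):,.2f}")
# gpt-4.5: $80.85
# gemini-2.0-flash: $0.54
```

Swap in current list prices and your own token counts before budgeting a real run.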

Why it matters: AI needs to be scalable and efficient—especially for researchers dealing with massive datasets and complex analyses.

Key Takeaways: LLMs Still Struggle in Biomedicine

LLMs are improving, but biomedicine remains a unique challenge. The latest models show:

🚧 Accuracy is improving, but safety is declining—models prioritize answering over admitting uncertainty.
🚧 Scientific knowledge gaps persist—even in well-documented areas like genetic association studies and FDA drug approvals.
🚧 Cost and response time are real barriers—LLMs must be both accurate and accessible for real-world use.

The ideal biomedical AI doesn’t just sound confident—it needs to be correct, cautious, and cost-effective.


🔬 Check out our full analysis and results in our latest paper:
📄 CARDBiomedBench: Evaluating AI in Biomedicine

🧬 Want to test your own model? All code and data are open-source:
🔗 CARDBiomedBench on GitHub
🤗 Dataset on Hugging Face
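
If you want a quick start, a minimal loading script might look like this. The dataset id and column names below are assumptions for illustration; confirm the exact names on the GitHub and Hugging Face pages linked above:

```python
# pip install datasets
from datasets import load_dataset

# Dataset id is assumed for illustration; check the Hugging Face link above.
ds = load_dataset("NIH-CARD/CARDBiomedBench")
print(ds)  # shows the actual splits and column names

split = next(iter(ds.values()))  # take the first split, whatever it's called
for example in split.select(range(3)):
    # "question" and "answer" column names are assumed; adjust to the
    # schema printed above if they differ.
    print(example["question"])
    print("-> expected:", example["answer"])
```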

Think your model can do better? Give it a shot! 🚀
