GSM8K-Platinum: Revealing Performance Gaps in Frontier LLMs

We present GSM8K-Platinum, a revised version of the GSM8K benchmark that reveals meaningful differences in frontier model capabilities.

Do Large Language Model Benchmarks Test Reliability?

We introduce the concept of platinum benchmarks to better quantify model reliability.

D3M: Improving Group Robustness via Dataset Selection

Using ContextCite for LLM Reliability

We use our method, ContextCite, to detect unverified statements and discover poisoned documents.

ContextCite: Attributing Model Generation to Context

We present ContextCite, a method for attributing statements generated by language models back to specific information provided in-context.

Editing Predictions by Modeling Model Computation

We use our component modeling framework to design targeted model edits.

Decomposing Predictions by Modeling Model Computation

We introduce a framework called component modeling for studying how model components collectively shape ML predictions.

How Can We Harness Pre-Training to Develop Robust Models?

We explore a simple principle for harnessing pre-training to develop robust models.