GSM8K-Platinum: Revealing Performance Gaps in Frontier LLMs
We present GSM8K-Platinum, a revised version of the GSM8K benchmark that reveals meaningful differences in frontier model capabilities.

Do Large Language Model Benchmarks Test Reliability?
We introduce the concept of platinum benchmarks to better quantify model reliability.

Using ContextCite for LLM reliability
We use our method ContextCite to detect unverified statements and discover poisoned documents.

ContextCite: Attributing Model Generation to Context
We present ContextCite, a method for attributing statements generated by language models back to specific information provided in-context.

Editing Predictions by Modeling Model Computation
We use our component modeling framework to design targeted model edits.

Decomposing Predictions by Modeling Model Computation
We introduce a framework called component modeling for studying how model components collectively shape ML predictions.

How Can We Harness Pre-Training to Develop Robust Models?
We explore a simple principle for harnessing pre-training to develop robust models.