YourBench
There are two things from our announcement today I wanted to highlight.
First, we’re launching YourBench, an open-source tool for custom benchmarking and synthetic data generation from any of your documents. It’s not a product we’re trying to sell - it’s a step toward rethinking how we evaluate AI models, and we’re sharing it with everyone. You can try it out now with your own documents at this demo link, or dig into the code and methodology on GitHub.
Evaluations as we know them are old and broken - general benchmarks give us a rough sense of capability, but they don’t tell you how a model will handle your specific tasks. YourBench is a first attempt to fix that, letting you test models on what actually matters to you. It’s rough around the edges, and it’ll take time to get right, but we think it’s a direction worth exploring.
Second, this feels like the start of something bigger. Most evaluations today are static - they’re snapshots of performance on generic datasets that don’t evolve with the problems we’re trying to solve. YourBench is modular and configurable, designed to iterate alongside the models it tests. You can throw your own documents at it, generate diverse questions, and use the result as a hard question set or even as training data. It’s not perfect yet: coverage isn’t flawless, and the questions it generates can miss the mark. But the idea is to build something we can improve over time, not a one-off score to brag about. I can imagine a future where evaluations aren’t just a report card, but a living tool that co-evolves with AI, helping us understand and push these systems further.
We don’t know exactly where this will go. It’s early, and we’re not sure how useful it’ll end up being in practice or how much it’ll shape research. Predictions are tricky - maybe this becomes a standard way to test domain-specific performance, or maybe it’s just a stepping stone. Either way, we’re excited to see what people do with it. Huge thanks to @clefourrier, @ailozovskaya, @lvwerra, @tur_gokhan, @dilekhakkanitur, and @Thom_Wolf for pouring their energy into this. They’ve helped make something cool happen, and I’m grateful to watch it unfold.