SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
A new benchmark tests AI’s ability to complete real-world software engineering tasks.
We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks sourced from Upwork, representing $1 million USD in real-world payouts. SWE-Lancer includes both independent engineering tasks—ranging from $50 bug fixes to $32,000 feature implementations—and managerial tasks, where models must select between competing technical proposals. Independent tasks are graded using end-to-end tests that are triple-verified by experienced software engineers. Managerial tasks are evaluated against the decisions made by the original hiring engineering managers. Our evaluation shows that even frontier models struggle to solve the majority of tasks. To support further research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond, available at https://github.com/openai/SWELancer-Benchmark. By linking model performance to real-world monetary value, SWE-Lancer provides a foundation for deeper research into the economic impact of AI on software engineering.
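Since SWE-Lancer ties performance to payouts, the headline metric can be read as dollars earned rather than tasks solved. A minimal sketch of that idea, assuming a hypothetical per-task result format (`payout_usd`, `passed`) rather than the official harness's schema: a model earns a task's full payout only if its end-to-end tests pass.

```python
def dollars_earned(results):
    """Sum the payouts of tasks whose end-to-end tests all passed.

    results: list of dicts with 'payout_usd' (float) and 'passed' (bool).
    (Hypothetical format for illustration, not the benchmark's actual schema.)
    """
    return sum(r["payout_usd"] for r in results if r["passed"])

# Example: two solved tasks out of three.
runs = [
    {"task": "bug-fix", "payout_usd": 50.0, "passed": True},
    {"task": "feature", "payout_usd": 32000.0, "passed": False},
    {"task": "ui-fix", "payout_usd": 250.0, "passed": True},
]
print(dollars_earned(runs))  # 300.0
```

All-or-nothing crediting mirrors freelance contracts: a partially working fix earns nothing, which is what makes the dollar totals economically meaningful.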
- Samuel Miserendino
- Michele Wang
- Tejal Patwardhan
- Johannes Heidecke