Yegor Denisov-Blanch

Research Scientist, Stanford AI Lab / Co‑Founder, P10Y

I study how AI is changing knowledge work.

AI is doing to knowledge work what machines once did to muscle, compressed from generations into years. The comforting story is that we will all just move up a level of abstraction. I study whether that story is true.

Featured in & recognized by

Washington Post cover
Re-shared by Elon Musk
Guided UN Policy & World Bank reports
100+ outlets worldwide

Featured

Ghost engineers: 9.5% of software engineers do nothing

I studied 50,000+ engineers and found a measurable group producing almost no work.

Read the coverage →

Does AI Actually Boost Developer Productivity? (300K+ views)

I break down where AI helps developers, where it stalls, and what the data shows.

Watch →

Can you prove AI ROI in software engineering? (35K views)

I show how to prove software-engineering AI ROI with data from 120,000 developers.

Watch →

Founder, operator, athlete, researcher

My Path

I left school at 14 and figured the rest out as I went. Each step made the next one possible.

Age 14

Left school during 8th grade.

School didn't feel very useful so I left at age 14.

Age 14–18

Built a B2B e-commerce business in Spain and grew it to $500K in revenue.

Nobody takes a 14-year-old seriously on a call, so I taught myself how to code and sold online before e-commerce was a thing.

2011

Enrolled directly into college, skipping 5 grades.

Skipped high school, scored well on the SAT, and learned via Khan Academy.

2013

National Champion - Olympic Weightlifting.

Became Master of Sport of Russia, a title awarded for national-champion-level athletic results.

2016

Graduated Top 1% from Indiana University's Kelley School of Business.

Studied Operations Research: optimization, stochastic systems, queues & simulation. The math of making systems run better.

2017–2022

Rose to Chief of Staff to DHL's CEO for Europe, the Middle East & Africa.

Led transformation work touching 2,500 employees across 25+ countries. Running logistics during COVID was a peak life experience.

2022–2025

Stanford MBA on a full-tuition scholarship.

Spent my days doing research, building things, and sitting in on as many CS classes as I could.

Dec 2024

Ghost engineer research showed that 9.5% of software engineers do almost no measurable work.

It made the cover of The Washington Post's Business section, was shared by Elon Musk and Marc Andreessen, and was picked up by 100+ outlets.

2025–Now

Co-founded P10Y, a company that measures AI's impact on software engineers.

Built on my research, deployed across dozens of enterprises, including all of Salesforce and part of a FAANG company.

2025–Now

Research Scientist at the Stanford AI Lab (SAIL).

Measuring how AI changes knowledge work, with data from hundreds of companies. Teaching team for CS321M: AI Measurement Science.

Other cool stuff

Awards

DHL Employee of the Year Nomination: 1 of 6 nominees out of 40,000 employees.
Master of Sport of Russia: National Champion equivalent. Awarded in 2013 for Olympic weightlifting.

Geography

Fluent in 4 languages: Russian, Spanish, English, and Catalan.
Lived in 6 countries: Whether I liked living somewhere mostly came down to whether I liked the food.

Courses I Enjoyed

CS329A: Self-Improving AI Agents
CS349D: Compound AI Systems
CS329T: Trustworthy Machine Learning
CS525: Training Data for AI
PSYC 233: Awareness and Stress
STRAMGT 514: Product Market Fit

Deep dive

Ghost engineers

I studied private Git data from more than 50,000 engineers across hundreds of companies. About 9.5% did almost no measurable work, less than one tenth as much as a typical engineer.

9.5%: engineers who do virtually nothing

50,000+: engineers analyzed, across 100s of companies

14% vs 6%: ghosts when fully-remote vs in-office

The estimate doesn't come from counting commits. A model scores every commit the way a panel of ten expert reviewers would: how hard was the work, how maintainable is it, how much value does it add. Counting commits only catches people who do nothing. Scoring the work catches people who commit a lot of nothing.
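The difference between counting and scoring can be shown with a minimal sketch. Everything here is an assumption made for illustration: the engineer names and scores are invented, the per-commit scores stand in for whatever a scoring model would produce, the median is used as the "typical engineer", and the 0.1 cutoff mirrors the "less than one tenth" figure above. It is not the model or pipeline used in the research.

```python
from statistics import median

# Hypothetical per-commit scores standing in for a scoring model's output.
# All names and numbers are invented for illustration.
commit_scores = {
    "alice": [8.0, 12.5, 9.5, 10.0],
    "bob": [11.0, 9.0],
    "carol": [0.2, 0.1, 0.3, 0.1, 0.2, 0.1],  # many commits, little substance
    "dave": [],                               # no commits at all
    "erin": [7.5, 6.5, 10.0],
}

GHOST_RATIO = 0.1  # assumed cutoff: one tenth of a typical engineer's output


def flag_by_commit_count(scores, min_commits=1):
    """Flag engineers with fewer than `min_commits` commits."""
    return sorted(name for name, s in scores.items() if len(s) < min_commits)


def flag_by_scored_output(scores, ratio=GHOST_RATIO):
    """Flag engineers whose total scored output is below ratio * median."""
    totals = {name: sum(s) for name, s in scores.items()}
    typical = median(totals.values())  # assumption: median = "typical engineer"
    return sorted(name for name, t in totals.items() if t < ratio * typical)


count_flags = flag_by_commit_count(commit_scores)
score_flags = flag_by_scored_output(commit_scores)

print(count_flags)  # ['dave']
print(score_flags)  # ['carol', 'dave']
```

In this toy data, counting commits flags only the engineer with no commits, while scoring also flags the one with six near-empty commits.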

The finding made the cover of the Washington Post's Business section, sparked a global debate about remote work and measurement, and was amplified by Elon Musk. The strongest validation came from the companies themselves: when they checked the engineers we flagged, the ghosts were real.

Washington Post coverage →
The original thread →

Ghost engineers

Selected high-signal coverage from a wider set of 100+ outlets worldwide.

Deep dive

AI & developer productivity

I measure what AI does to software output across 100,000 developers at hundreds of companies. For most of the AI boom the answer held: gains that are real, below the sales pitch, and uneven across tasks and teams. In December 2025 the answer started to change.

~100k: developers measured

Modest: average lift through 2025, below the hype

Dec 2025: inflection point in the data

The same expert-panel model scores every commit on time, quality, maintainability, and complexity, then tracks output as teams adopt AI. Through 2025 the average lift stayed smaller than the headlines, and it depended on the task, the age of the codebase, and how common the language was.
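As a rough illustration of what "tracking output as teams adopt AI" can mean, here is a minimal before/after sketch. The team names, the monthly output numbers, and the simple ratio-of-means definition of lift are all assumptions for this example. A plain before/after comparison like this ignores confounders and is not the methodology behind the published numbers.

```python
from statistics import mean

# Hypothetical monthly scored output per team, before and after AI adoption.
# Team names and values are invented for illustration.
teams = {
    "team_a": {"before": [100, 104, 96], "after": [125, 130, 120]},
    "team_b": {"before": [80, 82, 78], "after": [80, 84, 76]},
    "team_c": {"before": [90, 90, 90], "after": [99, 99, 99]},
}


def relative_lift(before, after):
    """Relative change in mean scored output after adoption (0.25 = +25%)."""
    return mean(after) / mean(before) - 1.0


lifts = {name: relative_lift(t["before"], t["after"]) for name, t in teams.items()}
average_lift = mean(lifts.values())

for name, lift in lifts.items():
    print(f"{name}: {lift:+.1%}")
print(f"average: {average_lift:+.1%}")
```

Even in this toy data the average hides the spread: one team gains a lot, one gains a little, and one does not move, which is why a single headline number says little about any given team.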

Most companies can't tell whether their AI investment pays off, because their metrics can't see it. That measurement gap, and the distance between teams that master AI and teams that don't, is the throughline of the work.

December 2025 marked an inflection in the data. Through the peak of the hype I kept saying we weren't there yet, and the numbers backed me up. I was right about the call and wrong about the clock: the shift arrived long before I expected it.

Watch: Does AI boost productivity? →
Watch: Can you prove AI ROI? →

AI & developer productivity

The AI-productivity work is now cited and discussed across institutional, enterprise, investor, podcast, and engineering-leadership channels.

World Bank · Amdocs · LeadDev · Aviator · CAST / BCG · GitLab · DPE Summit · The AI Conference · AI Engineer Code · Consumer Reports · William Blair · KPMG Belgium

17 papers

Publications

2026

Truthfulness Does Not Scale Like Reasoning: Why Polling Fails as a Proxy Verifier

Shows that language models can agree and still be wrong, so truthfulness needs real verification rather than voting or self-consistency.

ICML 2026 · AI Measurement & Evaluation · arXiv →

2026

Repository AI Configuration Is Associated with Three-Fold Differences in Code Quality After Agent Adoption

Introduces RAMP, a way to score how ready repositories are for coding agents, and links stronger setup with fewer quality problems after adoption.

ASE 2026 · AI & Work Performance

2026

Scale Dependent Data Duplication

Studies how duplicated training data affects models differently as they scale, showing that repetition can change performance in ways averages hide.

Preprint · Model Behavior & Data · arXiv →

2026

The Properties and Pathologies of V-Information

Revisits V-information and shows it can behave strangely under realistic modeling limits, making it risky to treat as a simple information measure.

Working paper · AI Measurement & Evaluation

2026

Internal Data Repetition Destroys Language Models

Shows that repeating data inside a training set can damage language models, even when the repeated examples look harmless at first.

5x ICML 2026 Workshops · 1 oral · Model Behavior & Data · OpenReview →

2026

AI Writes Faster Than Humans Can Review: A Longitudinal Study of an Enterprise 2x Mandate

Studies an enterprise AI-coding mandate across developers and pull requests, finding throughput gains alongside much heavier review load.

Preprint · AI & Work Performance · arXiv →

2026

Test Set Contamination of Generative Evaluations

Explains how generative benchmarks can be contaminated by test data, making models look better than they really are.

Preprint · AI Measurement & Evaluation · arXiv →

2026

Propagating Evaluation Failures in Awarded Papers on Language Model Sampling

Tracks how weak sampling evaluations passed through major papers, turning shaky claims into later assumptions that other work built on.

Working paper · AI Measurement & Evaluation

2026

The Artificial Hivemind That Wasn't

Rechecks claims that LLM answers collapse into one narrow style, and finds much more diversity across topics, models, and prompts.

Working paper · Model Behavior & Data

2026

VeriBench: An End-to-End Formal Verification Benchmark for AI Coding Agents in Lean 4

Tests whether coding agents can turn Python programs into Lean 4 specifications and proofs that verify end to end.

DL4C 2026 · AI4Math 2026 · 2x ICML Workshops · Agents & Verification · OpenReview →

2026

VeriBench-DT: Trustworthy Agentic Autoformalization with Original-Code Verification via Differential Testing

Tests whether an agent's Lean proof actually matches the original Python code, exposing proofs that are correct but about the wrong thing.

Benchmark · Agents & Verification

2026

Predictive AI Evaluation Competition

Proposes a competition where researchers predict evaluation results before they are run, making AI benchmarks harder to game after the fact.

NeurIPS 2026 · AI Measurement & Evaluation

2026

Certifying the Judge: Falsifiable Properties for LLM-Based Evaluation of Formal Code

Builds a protocol for testing whether LLM judges of Lean 4 specs behave sensibly, turning monotonicity and stability checks into a trust signal.

DL4C 2026 · ICML Workshop · Agents & Verification · OpenReview →

2025

Position: Machine Learning Conferences Should Establish a "Refutations and Critiques" Track

Argues that machine learning conferences need a formal place for critiques and corrections, so important mistakes can be reviewed openly.

NeurIPS 2025 · Oral · AI Measurement & Evaluation · arXiv →

2025

Measuring Determinism in Large Language Models for Software Code Review

Measures how much LLM code-review outputs vary across runs, which matters when teams want reliable and repeatable review results.

Preprint · AI & Work Performance · arXiv →

2025

Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models

Reexamines min-p sampling and finds that its claimed benefits are not supported by the available evaluation evidence.

Preprint · Model Behavior & Data · arXiv →

2024

Predicting Expert Evaluations in Software Code Reviews

Shows that models can predict expert code-review scores, making it cheaper to estimate code quality across large datasets.

Preprint · AI & Work Performance · arXiv →

Authored writing

Policy analysis and columns on AI, productivity, language, and remote work for public audiences.

Policy analysis · Permanent Mission of Kazakhstan to the United Nations · UN / Kazakhstan

Kazakhstan: Central Asia's AI Powerhouse

Strategic AI policy analysis commissioned for Kazakhstan's official diplomatic channel at the United Nations.

Column · The Astana Times · Kazakhstan

Kazakhstan: Central Asia's AI Powerhouse

A public case for Kazakhstan's AI opportunity, grounded in productivity data from nearly 100,000 developers across 500+ companies.

Column · Tengri News · Kazakhstan

Kazakhstan Becomes Central Asian Leader in AI Race

Russian-language column on how Kazakhstan can keep its emerging lead in AI-driven software productivity.

Column · El Confidencial / Teknautas · Spain

I have spent years analyzing productivity in Silicon Valley

Spanish-language analysis of AI, ghost workers, and the changing productivity model in Silicon Valley.

Column · El Español / Invertia · Spain

Artificial intelligence speaks English, not Spanish

Argument that AI's English-language bias creates a structural productivity disadvantage for Spanish-speaking economies.

Column · El Debate · Spain

Ghost workers: 9.3% of remote workers in Spain do nothing or almost nothing

Spanish op-ed on remote work, ghost workers, and why better measurement should protect merit rather than become surveillance.

Column · money.pl · Poland

They gain a lot from AI. But are they tripping over their own feet?

Polish business op-ed on AI productivity gains, rework, language effects, and what local firms need to get right.

Column · Infor.pl · Poland

Believe in the ghost: what do programmers really do when working remotely?

Polish-language piece explaining ghost engineers, remote-work structure, and data-driven productivity measurement.

Proof

Media & Talks

Used in policy by the World Bank and the United Nations, covered by 100+ outlets worldwide, and presented at major AI conferences.

Media

Washington Post

The business-section story that pushed ghost engineers into the mainstream. (2024)

World Bank

Cited in a global policy report on AI, jobs, and productivity. (2025)

United Nations

Policy analysis published through Kazakhstan's UN diplomatic channel. (2025)

Business Insider

Tech-industry writeup on underperformance and output measurement. (2024)

Yahoo Finance

Finance pickup translating the finding into estimated payroll waste. (2024)

404 Media

Independent tech press on overemployment, measurement, and surveillance tradeoffs. (2024)

Talks & events

June 2026 · GitLab Transcend · London

May 2026 · Factory AI · Silicon Valley

November 2025 · AI Engineer Code Summit · New York City

October 2025 · MLOps World · Austin, TX

September 2025 · The AI Conference · San Francisco

June 2025 · AI Engineer World's Fair · San Francisco

2025 · Vietnam IT Forum keynote

September 2024 · DPE Summit · San Francisco

2023 and 2024 · McKinsey roundtables · San Francisco

October 2023 · Globalize Silicon Valley Conference · San Francisco

Teaching

Stanford Computer Science · Spring 2026

CS321M: AI Measurement Science

A graduate course on how to measure AI systems when benchmarks saturate and evaluation methods disagree. Students learn to treat evaluation as a measurement problem, then build a new measurement approach of their own.

19 Lectures

45+ Readings

6 Textbook chapters

5 Guest lectures

Guest lectures from OpenAI, Google DeepMind, Harvard, MIT & Transluce

Off the clock

Interests

Lifting things

I've logged 5,000+ workouts and lifted 100+ million kg, about 10 Eiffel Towers' worth of weight.

Automating things

At 13, I started coding video game bots. Watching them play was more fun than playing myself.

Pushing limits

I enjoy doing side quests in human firmware.

Driving things

I enjoy driving things with engines at high speed.

Contact

Get in touch

If you study how work is changing, build in this space, or want to, get in touch.

ydebl [at] stanford [dot] edu