
The Limits of Ethical AI
Unprecedented access to high-stakes algorithmic experiment tests promise of Ethical AI
The Netherlands has become ground zero for AI experiments in social services. An algorithm played a leading role in a childcare benefits scandal that saw tens of thousands of families falsely accused of fraud and ultimately brought down the government. In the port city of Rotterdam, our investigation found that an algorithm ranked welfare recipients based on criteria including clothing and fluency in Dutch, and targeted single mothers with migrant backgrounds.
So when officials in Amsterdam—one of Europe’s most progressive cities—told us in 2023 that they were building a “fair” algorithm to detect welfare fraud, we wanted to know more.
Over the course of five years, Amsterdam’s welfare department engaged in a high-stakes experiment guided by Responsible AI: a framework of technical and ethical guidelines meant to ensure fairness, transparency, and accountability in automated systems. Promoted by academics, NGOs, multinational institutions, and a swelling consulting sector, Responsible AI has emerged as a leading response to the scandals surrounding algorithmic decision-making.
But few systems built using these principles have been subjected to independent journalistic scrutiny. Over several years, the city of Amsterdam spent hundreds of thousands of euros, hired consultants, spoke to academic experts, extensively audited its system for bias, and even brought in welfare recipients to provide feedback on the system’s design.
It also invited Lighthouse to look over its shoulder as everything unfolded.
Despite all this effort, the system still failed. In collaboration with MIT Technology Review and Trouw, we set out to understand why. We obtained unprecedented access to the system, the officials who built it, and the critics who fought against it.
The result is one of the first in-depth looks at a system developed under Responsible AI guidelines—and what happens when those promises meet reality.
STORYLINES
With our partners MIT Technology Review and Trouw, we explored an unanswered question: what does it actually mean for an algorithm to be deployed fairly?
Previous reporting, including our own, has covered the worst-case deployments of this technology. Because many of those systems were poorly designed or intentionally discriminatory, those stories could sidestep the thornier questions of when, if ever, this technology should be deployed and what fair AI would actually look like.
Amsterdam followed every piece of advice in the Responsible AI playbook. It debiased its system when early tests showed ethnic bias and brought on academics and consultants to shape its approach, ultimately choosing an explainable algorithm over more opaque alternatives. The city even consulted a participatory council of welfare recipients — the very people the system would scrutinize — who sharply criticized the project.
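What did that debiasing look like in practice? The reweighting step mentioned in the Methods section below generally works by adjusting how much each training example counts, so that group membership and the fraud label are no longer statistically linked in the training data. The sketch below is a minimal illustration of that general technique, using made-up column names and toy data rather than Amsterdam’s actual code.

```python
# Minimal sketch of "reweighing", one common debiasing technique.
# Column names and data are hypothetical, not Amsterdam's pipeline.
import pandas as pd

def reweigh(df: pd.DataFrame, group_col: str, label_col: str) -> pd.Series:
    """Return a weight per row so that group membership and the label
    become statistically independent in the weighted training set."""
    n = len(df)
    p_group = df[group_col].value_counts(normalize=True)
    p_label = df[label_col].value_counts(normalize=True)
    p_joint = df.groupby([group_col, label_col]).size() / n

    # Weight = probability expected under independence / observed probability.
    def weight(row):
        g, y = row[group_col], row[label_col]
        return (p_group[g] * p_label[y]) / p_joint[(g, y)]

    return df.apply(weight, axis=1)

# Toy example: the minority group is flagged more often in the historical
# data, so its non-flagged cases get up-weighted during training.
data = pd.DataFrame({
    "migrant_background": [1, 1, 1, 0, 0, 0, 0, 0],
    "flagged_for_fraud":  [1, 1, 0, 0, 0, 0, 1, 0],
})
data["weight"] = reweigh(data, "migrant_background", "flagged_for_fraud")
print(data)
```

Whether adjustments of this kind hold up once a model meets real applicants is exactly what the pilot, and later our audit, put to the test.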
Yet when the city deployed a pilot in the real world, the system continued to be plagued by biases. It was also no more effective than the human case workers it was designed to replace. As political pressure mounted, officials killed the project, bringing an expensive, multi-year experiment to a quiet end.
We reveal the different lessons drawn by participants and experts from Amsterdam’s experience of trying to build a Responsible AI system. These competing interpretations reflect deeper disagreements about whether Responsible AI can ever deliver on its promises, or whether some applications of artificial intelligence are fundamentally incompatible with human rights.
METHODS
In 2023, a few months after we published our original Suspicion Machines investigation, one of our reporters filed a public records request with the city of Amsterdam, asking for code and documents related to a fraud detection system the city had been developing.
In previous investigations, obtaining this type of information took months, even years, with agencies stonewalling us at every turn. We were therefore surprised when the city immediately disclosed everything we requested and invited us to an online meeting. In that meeting, it quickly became clear why the city had been so forthcoming: it had gone to great lengths to design a fair system and, at the time, was confident that it had succeeded.
When we first began speaking to the city, it was preparing for a pilot in which the model would score real-world welfare applicants. The pilot was ultimately a failure: the model was both biased and ineffective.
In the fall of 2024, nearly a year after we began reporting, the city shelved the project altogether. We wanted to better understand how biases had crept into the model as it was trained, reweighted, and then deployed in the real world. Doing so required auditing the system, but we hit a roadblock. While the city had disclosed machine learning models and extensive documentation, Europe’s GDPR barred it from sharing data on how the system had scored real-world welfare recipients – information essential for auditing such systems.
In what is, to our knowledge, a first, the city agreed to a remote-access arrangement. We sent code and tests to city officials, who ran them on the real data and returned aggregate results to us. Lighthouse has published a methodology explaining our analysis on our website, along with all the underlying code and data on our GitHub.
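As an illustration of how that arrangement worked, the tests returned only group-level aggregates rather than individual records. The sketch below shows the general pattern with hypothetical column names; the actual tests are documented in our published methodology and GitHub repository.

```python
# Illustrative sketch of a script one might send under a remote-execution
# audit: individual-level data is touched only inside the city's environment,
# and only aggregate error rates per group come back. Column names
# ("nationality_group", "flagged", "investigation_outcome") are hypothetical.
import pandas as pd

def group_error_rates(df: pd.DataFrame, group_col: str,
                      pred_col: str, outcome_col: str) -> pd.DataFrame:
    """Aggregate false-positive and false-negative rates per group."""
    rows = []
    for group, sub in df.groupby(group_col):
        negatives = sub[sub[outcome_col] == 0]
        positives = sub[sub[outcome_col] == 1]
        rows.append({
            group_col: group,
            "n": len(sub),
            # Share of people with no confirmed wrongdoing who were flagged.
            "false_positive_rate": (negatives[pred_col] == 1).mean(),
            # Share of confirmed cases the model missed.
            "false_negative_rate": (positives[pred_col] == 0).mean(),
        })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    # Toy data standing in for the real pilot records; in practice only the
    # resulting aggregate table would leave the city's environment.
    toy = pd.DataFrame({
        "nationality_group": ["A", "A", "B", "B", "B"],
        "flagged": [1, 0, 1, 1, 0],
        "investigation_outcome": [0, 0, 1, 0, 0],
    })
    print(group_error_rates(toy, "nationality_group", "flagged",
                            "investigation_outcome"))
```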
Months of ground reporting, including interviews with officials, welfare recipients and experts, allowed us to piece together how the project came together and why it ultimately failed.
Methodology
How we investigated Amsterdam’s attempt to build a ‘fair’ fraud detection model (June 11, 2025)