Yutong Liu & Kingston School of Art / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/

How we investigated Amsterdam’s attempt to build a ‘fair’ fraud detection model

For the past four years, Lighthouse has investigated welfare fraud detection algorithms deployed in five European countries. Our investigations have found evidence that these systems discriminated against vulnerable groups, often with steep consequences for people’s lives.

Governments and companies deploying these systems often show little regard for the biases these systems perpetuate against vulnerable groups. The city of Rotterdam told us that it had never run code designed to test whether its model was disproportionately flagging vulnerable groups. France’s social security agency, CNAF, confirmed to us that it had never audited its model for bias.

In January of 2023, three months before publishing our investigation into Rotterdam’s risk scoring algorithm, we sent a public records request to the city of Amsterdam, one of Europe’s most progressive capitals. Among other things, we asked for documents, code and data relating to a similar system Amsterdam had been developing. Given that we had fought for over a year to obtain this type of material in other investigations, we were surprised when the city immediately complied with our request.

The materials disclosed by the city were related to a machine learning model it was developing in order to predict which of the city’s residents were most likely to have submitted an incorrect application for welfare. At the time of our public records request, the model was still in development. The overall development goal for the model, according to the city’s internal documentation, was to have fewer welfare applicants investigated, but a higher share of those investigated rejected. Internal documents also emphasized two other aims: avoid bias against vulnerable groups, and outperform human caseworkers.

After reading through the documentation, it quickly became clear to us why the city had been so forthcoming. The city had gone to significant lengths in order to develop a model that was transparent and treated vulnerable groups fairly.

We wanted to investigate whether the city of Amsterdam had succeeded in developing a fair model. Over the course of eight months, we ran a series of our own tests on the model and on data containing real-world outcomes for people flagged as suspicious by the model. For the outcome data, the city ran our tests locally and returned aggregate results in order to comply with European data protection laws.

The past five years have seen a number of regulatory efforts to rein in harmful uses of AI. The EU AI Act, passed in 2024, will require AI systems deemed “high risk” to be registered with the European Commission. In New York City, employers have been prohibited since July 2023 from screening employees using AI without conducting bias audits. Meanwhile, the private sector, academics and multilateral institutions have produced a number of responsible AI frameworks. In taking on this investigation, we wanted to look ahead and understand the thorny reality of building a fair AI tool that makes consequential decisions about people’s lives.

The code and data underlying our analysis can be found on our GitHub.

How the model works

The City of Amsterdam’s model aimed to identify welfare applications that merit further investigation. In the existing process, staff members screen applications for mistakes and redirect any concerning ones to an investigator. The model was designed to replicate this first screening.

A flagged application is not automatically rejected. However, investigators are empowered to request a beneficiary’s bank statements, summon them for meetings, and even visit their homes. Past reporting, including our own, has shown how these investigations can be a stressful or even traumatising experience.

The model uses a type of algorithm called an Explainable Boosting Machine (EBM). EBMs prioritize explainability: the degree to which an AI algorithm is understandable, rather than a confusing “black box” of abstract mathematical processes. Amsterdam’s EBM model predicts ‘investigation worthiness’ based on variables that measure welfare applicants’ behavior and characteristics, known as features.

Model predictions are based on 15 features. None of these features explicitly referred to an applicant’s gender, racial background, or other demographic characteristics protected by anti-discrimination law. But the model’s designers were aware that features could be correlated with demographic groups in a way that would make them proxies.

 

The model calculates a risk score for each application using these features. The risk scores suggest whether the applicant merits further investigation or not. Applicants with a risk score above 0.56 were redirected for further investigation.
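To make this concrete, here is a minimal sketch, in Python, of how such an EBM risk scorer could be trained and used to flag applications with interpret’s open-source implementation. This is our illustration rather than the city’s code: the file name, column names and configuration are hypothetical, and only the 0.56 threshold is taken from the description above.

```python
# Minimal sketch of an EBM-based risk scorer (our illustration, not the city's
# code). File and column names are hypothetical; the 0.56 cutoff is the
# threshold described above.
import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier

THRESHOLD = 0.56  # risk scores above this value were flagged for investigation

df = pd.read_csv("training_investigations.csv")  # hypothetical training data
X = df.drop(columns=["investigation_worthy"])    # the behavioural features
y = df["investigation_worthy"]                   # label from past investigations

ebm = ExplainableBoostingClassifier(random_state=0)
ebm.fit(X, y)

# The risk score is the predicted probability of being "investigation worthy".
risk_scores = ebm.predict_proba(X)[:, 1]
flagged = risk_scores > THRESHOLD

# Unlike black-box models, an EBM exposes per-feature contributions.
global_explanation = ebm.explain_global()
```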

Training the Model

Amsterdam constructed its training dataset from past investigations. As a result, the trained machine learning model may reflect biases from the old analogue process. It is possible, for example, that case workers are more likely to wrongly label a non-Dutch person as ‘investigation-worthy’ than a Dutch person.

Using past investigations to train a model also means that the training data population looked quite different from the real-world population: more than 50 percent of the training data was labeled “investigation worthy” whereas the city’s analog process flagged only around 7 percent of the beneficiary population.

In addition, there were issues with the label itself. The city admitted that its labeling process was prone to subjective interpretation.

Fairness Definitions

The city built an extensive pipeline to test whether the model treated different groups fairly. It was primarily interested in testing whether its model performed equally well across demographic groups, which is a common approach to measuring fairness in the academic literature. In essence, this approach asks whether the model is equally good at finding “investigation-worthy” men compared to women or Dutch citizens compared to non-Dutch citizens. The model is considered fair if its performance is equal across these groups. The challenge is that there are myriad ways to measure a model’s performance. Depending on which measure of performance is chosen, the same model can be considered fair or unfair.

Most performance metrics can be easily computed from a model’s confusion matrix. A confusion matrix contains the following elements:

– True Positives: The share of people the model has correctly predicted to have done something wrong. For this investigation, these are welfare applicants predicted to be “investigation worthy” whose application actually warranted further investigation.

– False Positives: The share of people the model has incorrectly predicted to have done something wrong. These are welfare applicants predicted to be “investigation worthy” but whose application did not warrant further investigation.

– True Negatives: The share of people the model has correctly predicted to have done nothing wrong. These are people not flagged by the model, where human case workers also agreed that there was no need for further investigation.

– False Negatives: The share of people the model has incorrectly predicted to have done nothing wrong. These are the people not flagged by the model, but whose application actually warranted further investigation.
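As an illustration, these four cells can be computed from row-level model flags and investigation outcomes, broken down by group. The sketch below uses toy data and column names that are entirely ours; the city never shared row-level records with us, only aggregated matrices.

```python
# Sketch: confusion-matrix cells per demographic group, computed from
# row-level flags and outcomes. The toy data is entirely made up.
import pandas as pd
from sklearn.metrics import confusion_matrix

rows = pd.DataFrame({
    "flagged": [True, True, False, False, True, False],  # model's prediction
    "worthy":  [True, False, False, True, True, False],  # outcome of the human check
    "group":   ["Dutch", "Dutch", "Dutch", "non-Dutch", "non-Dutch", "non-Dutch"],
})

for group, sub in rows.groupby("group"):
    tn, fp, fn, tp = confusion_matrix(
        sub["worthy"], sub["flagged"], labels=[False, True]
    ).ravel()
    total = len(sub)
    # Expressed as shares of the group, matching the definitions above.
    print(group, {"TP": tp / total, "FP": fp / total, "TN": tn / total, "FN": fn / total})
```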

To evaluate a model’s outcome fairness, one needs such confusion matrices broken down by demographic characteristics, which we obtained for age, gender, nationality, ethnic background (Western vs Non-Western), and parenthood. As an example, here are the confusion matrices for Dutch and non-Dutch welfare applicants for the original model built by the city.

 

With these matrices in hand, we can formalize various definitions of model fairness.

– Statistical Parity: Equal performance across groups requires that the share of people who are flagged in each group is the same.

– Mathematical definition: (TP+FP)/TOTAL

– In practice this definition asks: Does the model flag one group at higher rates than another group?

– From the confusion matrices above we can see that people with a non-Dutch background are significantly more likely to be flagged than people with a Dutch background: 53.04% vs 34.58%.

– False Discovery Rate: Equal performance across groups requires that, among the people who are flagged, the share who are wrongly flagged is the same in each group.

– Mathematical definition: FP/(TP+FP)

– In practice this definition asks: When the model flags someone from a group as investigation worthy, how often is it wrong?

– The FDR for Dutch people is 39.60%, compared to 33.33% for non-Dutch people. A Dutch person who is flagged as investigation-worthy is therefore more likely to have been wrongly flagged than a flagged non-Dutch person.

– False Positive Share: Equal performance across groups requires that the share of each group (as a whole) that is wrongly flagged is the same.

– Mathematical definition: FP/TOTAL

– In practice this definition asks: How are wrongly flagged people distributed across the different groups?

– The share of people who are mistakenly flagged is greater for non-Dutch than for Dutch people: 17.68% vs 13.70%.

– False Positive Rate: Equal performance across groups requires that, among the people who are not investigation worthy, the share of people who are wrongly flagged is the same.

– Mathematical definition: FP/(TN+FP)

– In practice this definition asks: Given that someone is not investigation-worthy, what is the chance that they are wrongly selected?

– We can see that the FPR for non-Dutch people is greater than for Dutch people: 43.88% vs 25.86%. This means that a non-Dutch person who has not made a mistake is more likely to be wrongly flagged by the model than a Dutch person who has not made a mistake.
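The sketch below is our formalization of these four definitions. Because every metric is a ratio of confusion-matrix cells, it works the same whether the cells are counts or shares.

```python
# Our formalization of the four fairness metrics discussed above, computed
# from a single group's confusion-matrix cells.
def fairness_metrics(tp: float, fp: float, tn: float, fn: float) -> dict:
    total = tp + fp + tn + fn
    return {
        # Share of the group that is flagged at all.
        "statistical_parity": (tp + fp) / total,
        # Among the flagged, how often the model is wrong.
        "false_discovery_rate": fp / (tp + fp),
        # Share of the whole group that is wrongly flagged.
        "false_positive_share": fp / total,
        # Among those who did nothing wrong, how often they are flagged anyway.
        "false_positive_rate": fp / (tn + fp),
    }
```

Under a given definition, the model is considered fair when the corresponding value is (roughly) equal across groups, for example between the Dutch and non-Dutch matrices above.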

Deciding which definition of fairness to optimize for is a question of values and context. If a government decides to use statistical parity, it believes that different groups should not be flagged at different rates. In other words: men should not be more likely to be flagged than women.

False Discovery Rate, by contrast, is all about the people who are flagged for investigation. It means that a man flagged for investigation should be equally likely to have done nothing wrong as a woman who is flagged – i.e., the risk score should mean the same thing for members of different groups. If this criterion is not met, the model could be considered a violation of equal treatment norms.

Ultimately, the city decided to focus on equalising False Positive Share across groups. This decision was made after considering the harmful consequences beneficiaries faced when they were investigated without due cause. By focusing on this definition, the city chose to spread the burden of being wrongly investigated evenly across groups.

In choosing False Positive Share over False Discovery Rate, the city argued that False Discovery Rate narrowly focuses on the efficiency of the model’s predictions rather than overall disparate impact on vulnerable groups.

The City’s First Model

After the model had been trained on past investigations, the city ran it on recent applications made between 26 April 2021 and 28 March 2022. Applications deemed investigation worthy by the model were then reinvestigated by human experts to distinguish between true and false positives (these re-investigations had no consequences for the applicants). This “prepilot” data was used to test how well the model would perform under real-world conditions.

When the city checked the share of false positives for each group (its chosen definition of fairness), the results were worrying: the model showed strong bias against welfare applicants with a migration background.

People without Dutch citizenship were 30 percent more likely to be wrongly selected than people with Dutch citizenship, and people with a non-Western nationality were almost twice as likely to be wrongly selected as people with a Western passport. Overall, the initial model showed greater bias against vulnerable groups than the analogue process.

 

The city compared these false positive rates against the false positive rates of its analogue process. Interestingly, the analogue process was also biased, but not in the same way as the model: in the analogue process, women were more likely to be wrongly investigated than men, and an applicant with non-Dutch nationality actually appeared to be less likely to be wrongly investigated than an applicant with only Dutch nationality.

The City’s ‘Debiased’ Model

The city attempted to correct the model’s bias against non-Dutch citizens and those with non-Western citizenship. No effort was made to correct other demographic biases.

Debiasing was conducted by reweighting the model’s training data. Among investigation-worthy applicants, those with a Western nationality were weighted higher and those with a non-Western nationality were weighted lower. Conversely, among applicants who were not deemed investigation-worthy, those with a Western nationality were weighted lower and those with a non-Western nationality were weighted higher.
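We do not know the exact weighting formula the city used. The sketch below shows one standard scheme, “reweighing” in the sense of Kamiran and Calders, that produces the pattern described above: over-represented group/label combinations are weighted down, under-represented ones are weighted up. Column names are hypothetical.

```python
# One standard reweighting scheme (Kamiran & Calders' "reweighing"); not
# necessarily the city's exact formula. Column names are hypothetical.
import pandas as pd

def reweigh(df: pd.DataFrame, group_col: str, label_col: str) -> pd.Series:
    """Weight each row so that group and label become statistically
    independent in the weighted training data."""
    n = len(df)
    p_group = df[group_col].value_counts(normalize=True)
    p_label = df[label_col].value_counts(normalize=True)
    p_joint = df.groupby([group_col, label_col]).size() / n
    return df.apply(
        lambda row: p_group[row[group_col]] * p_label[row[label_col]]
        / p_joint[(row[group_col], row[label_col])],
        axis=1,
    )

# The weights can then be passed to a scikit-learn-style classifier, e.g.
# model.fit(X, y, sample_weight=reweigh(train, "western_nationality", "investigation_worthy"))
```

If applicants with a non-Western nationality are over-represented among rows labelled investigation-worthy, this scheme down-weights exactly those rows and up-weights Western positives, which matches the description above.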

 

The reweighting procedure was successful in achieving its primary goal: much of the bias against applicants with a migration background vanished. Before reweighting, applicants with non-Western nationalities were almost twice as likely to be wrongly flagged; after reweighting, Western and non-Western applicants had almost the same share of false positives.

In comparison to the analogue process, the retrained model now looked substantially “fairer.” But not all differences have vanished: men, for example, are still just as likely to be wrongly investigated.

The Debiased Model is Deployed in a Pilot

 

After the apparent success of the reweighting procedure, the city decided to deploy the system in a pilot between June and August 2023. The prepilot phase had involved city investigators checking applications flagged by the model to gauge its accuracy, but these checks had no consequences for the individuals who were flagged. This changed during the pilot, when the city used the model to check all incoming applications for “investigation worthiness”.

The results of that pilot showed that new biases had emerged in the model. In fact, in most cases the reweighted model was now more likely to wrongly flag exactly the opposite groups compared to the original model tested on the prepilot data.

Women were 22 percent more likely to be wrongly flagged than men, while welfare applicants with Dutch nationality were more likely to be wrongly flagged than non-Dutch applicants. Because the model fully replaced the analogue process during the pilot, the city could no longer compare evaluations from the model with the analogue procedure.

In addition to the reappearance of biases, the model’s performance in the pilot also deteriorated. Crucially, the model was meant to lead to fewer investigations and more rejections. What happened instead was mostly an increase in investigations, while the likelihood of finding investigation-worthy applications barely changed in comparison to the analogue process. In late November 2023, the city announced that it would shelve the pilot.

Fairness Tradeoffs

The story from the city’s perspective is relatively straightforward: an initial attempt to train the model showed substantial bias against non-Dutch applicants. Reweighting the training data appeared to alleviate much of this bias, but when the model was deployed in the real world, the bias rematerialized in the opposite direction. Finding this, the city decided to halt the program.

Amsterdam narrowly focused on one definition of fairness, specifically the share of false positives. However, there are many other valid definitions including False Discovery Rate and Statistical Parity.

 

The goal of the city’s reweighting exercise was to reduce the gap in false positives between Dutch beneficiaries and non-Dutch beneficiaries. This goal was largely achieved, but while the city made progress on the difference in False Positive Rate, the gap in the False Discovery Rate grew substantially.

This kind of tradeoff between different fairness definitions is not surprising. In fact, such a result is mathematically guaranteed: as long as the true rate of investigation-worthiness differs by group, reducing bias according to False Positive Rate will invariably produce greater bias according to False Discovery Rate and/or Statistical Parity. Like in a game of whack-a-mole, pushing bias down under one definition makes it pop up under another.
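A stylized example, with entirely hypothetical numbers, shows why. Suppose a model has the same False Positive Rate and True Positive Rate for two groups, so it performs “equally well” on both, but the groups differ in how many of their members are genuinely investigation-worthy. Their False Discovery Rates and flag rates then necessarily diverge:

```python
# Stylized, entirely hypothetical numbers: two groups with identical error
# rates (same FPR and TPR) but different shares of genuinely
# investigation-worthy members end up with different FDRs and flag rates.
def group_metrics(prevalence: float, fpr: float = 0.2, tpr: float = 0.7) -> dict:
    tp = tpr * prevalence        # share of the group correctly flagged
    fp = fpr * (1 - prevalence)  # share of the group wrongly flagged
    return {
        "flagged_share": tp + fp,                # statistical parity
        "false_discovery_rate": fp / (tp + fp),  # FDR
    }

print(group_metrics(prevalence=0.30))  # flagged share ≈ 0.35, FDR ≈ 0.40
print(group_metrics(prevalence=0.10))  # flagged share ≈ 0.25, FDR ≈ 0.72
```

Holding error rates equal across groups therefore forces the flag rates and False Discovery Rates apart whenever the underlying prevalence differs.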

 

The complications do not end here: trying to improve fairness outcomes for one group can worsen them for another. Full-time parents, for instance, were already more likely than non-parents to be wrongly flagged by the model (False Positive Rate) before reweighting. But after reweighting, the difference between the two groups increased rather than decreased. It appears that by making the model fairer based on migration/national background, the city inadvertently made it less fair for parents.

These are two sets of unavoidable tradeoffs: focusing on one fairness definition can lead to worse outcomes on others. Similarly, focusing on one group can lead to worse performance for other groups. In evaluating its model, the city chose to focus on false positives and on reducing ethnicity/nationality-based disparities. Precisely because the reweighting procedure made some gains in this direction, the model did worse on other dimensions.

Model Performance

In the pilot, the model’s performance substantially worsened, and the deterioration is difficult to attribute. The share of false positives relative to false negatives was much greater in the pilot, which suggests that the threshold used in the pilot phase was too low: too many people were flagged, and too many were wrongly flagged. It is not surprising that the model was poorly calibrated: the share of people deemed investigation worthy was much greater in the training data than in the real world.
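One straightforward remedy, sketched below purely as our illustration and not something the city necessarily did, is to recalibrate the decision threshold so that the model flags roughly the same share of applications as the analogue process, rather than reusing a cutoff tuned on training data in which more than half of the rows were labelled positive.

```python
# Our illustration, not the city's procedure: choose the threshold so that the
# model flags roughly the same share of applications as the analogue process
# (around 7 percent), instead of keeping a cutoff tuned on heavily imbalanced
# training data.
import numpy as np

def threshold_for_flag_rate(risk_scores: np.ndarray, target_flag_rate: float = 0.07) -> float:
    return float(np.quantile(risk_scores, 1 - target_flag_rate))

# flagged = risk_scores > threshold_for_flag_rate(risk_scores)
```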

But while this is a problem that could easily have been resolved by raising the threshold at which a person is flagged, the bias that emerged is less straightforward: the pilot showed substantially higher False Positive Shares for Dutch applicants relative to applicants with a migration background. In other words, the model now appeared biased in the exact opposite direction compared to the prepilot results before reweighting.

Technical Details

Data access

In spring of 2023, we sent a public records request to the city of Amsterdam requesting access to code and documentation relating to its attempts to build a fraud prediction model. After receiving the city’s technical documentation and code base, we realized that we would not be able to verify the city’s analyses and conduct our own audit without access to some kind of test data. However, we also knew that the city would not be able to legally share this data with us because of European data protection laws.

Amsterdam decided to work with us and run several custom analyses on its end before sharing aggregated results, which form the basis of our analysis. Specifically, the city shared with us confusion matrices for:

– 8 demographic groups: gender, age below 30, age below 40, age below 50, Dutch citizenship, Western citizenship, full-time parenthood and part-time parenthood
– 3 datasets: the training/test set, the prepilot set, and the pilot set
– 2 model versions: before and after reweighting

With this dataset in hand, we could trace the model’s development, show the impact of the reweighting procedure, and concretely discuss the tradeoffs between various fairness definitions. The full dataset is available here, within our GitHub repository.

Working with the city to run a substantial portion of our analysis was time consuming. It also did not function perfectly (see Limitations) and required a certain degree of trust. Despite these challenges, remote access regimes like this could become a template for making public sector algorithms auditable while respecting privacy regulations.

Limitations

The city of Amsterdam could not disclose row-level data for applicants scored by the model because of European data protection laws. Instead, we designed a series of tests which the city then ran on the real data and returned aggregate results. This approach involves a certain level of trust and innately carries the risk that the city may have provided us with incorrect or even fabricated numbers. At the same time, we have seen nothing to indicate that the numbers have been adjusted and our results largely align with what the city has itself written in various non-public, internal reports.

A more concerning limitation is that when the city re-ran parts of its analysis, it did not fully replicate its own data and results. For example, the city was unable to replicate its train and test split. Furthermore, the data related to the model after reweighting is not identical to what the city published in its bias report, and although the results are substantively the same, the differences cannot be explained by mere rounding errors.