Here at Civis, we build a lot of models. Most of the time we’re modeling people and their behavior because that’s what we’re particularly good at, but we’re hardly the only ones doing this — as we enter the age of “big data” more and more industries are applying machine learning techniques to drive person-level decision-making. This comes with exciting opportunities, but it also introduces an ethical dilemma: when machine learning models make decisions that affect people’s lives, how can you be sure those decisions are fair?
Defining “Fairness”
A central challenge in building fair models is quantifying some notion of ‘fairness’. In the US, legal precedent establishes one particular definition, but fairness remains an area of active research. In fact, a substantial portion of the academic literature is devoted to proposing new fairness metrics and proving that they satisfy various desirable mathematical properties. This proliferation of metrics reflects the multifaceted nature of ‘fairness’: no single formula can capture all of its nuances. That is good news for academics (more nuance means more papers to publish), but for practitioners the multitude of options can be overwhelming. To address this, my colleagues and I focused on three questions that guided our thinking about the tradeoffs between different definitions of fairness.
Group vs. Individual Fairness
For a given modeling task, do you care more about group-level fairness or individual-level fairness? Group fairness requires that different groups of people be treated the same on average. Individual fairness requires that similar individuals be treated similarly. Both are desirable, but in practice it’s usually not possible to optimize for both at once; in most cases it’s mathematically impossible. The debate around affirmative action in college admissions illustrates the conflict between the two: group fairness stipulates that admission rates be equal across groups, such as those defined by gender or race, while individual fairness requires that each applicant be evaluated independently of any broader context.
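To make the distinction concrete, here is a minimal sketch of how each notion might be checked on a toy admissions dataset. Everything here is an illustrative assumption: the numbers, the use of a single feature as the similarity measure, and the 0.05 similarity threshold are made up for the example.

```python
import numpy as np

# Hypothetical applicants: one feature (say, a test score), a group
# label, and a binary admission decision produced by some model.
features = np.array([0.80, 0.79, 0.60, 0.59, 0.40, 0.20])
groups   = np.array([0,    1,    0,    1,    0,    1])
admitted = np.array([1,    1,    1,    0,    0,    0])

# Group fairness (here, demographic parity): admission rates should be
# similar across groups.
for g in (0, 1):
    rate = admitted[groups == g].mean()
    print(f"group {g} admission rate: {rate:.2f}")

# Individual fairness: individuals with similar features should receive
# similar decisions. Flag pairs that are close in feature space but got
# different outcomes.
for i in range(len(features)):
    for j in range(i + 1, len(features)):
        if abs(features[i] - features[j]) < 0.05 and admitted[i] != admitted[j]:
            print(f"applicants {i} and {j} are similar but were treated differently")
```

In this toy example both checks flag a problem; in practice, fixing one often makes the other harder to satisfy.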
Balanced vs. Imbalanced Ground Truth
Is the ground truth for whatever you are trying to model balanced between different groups? Many intuitive definitions of fairness assume that ground truth is balanced, but in real-world scenarios this is not always the case. For example, suppose you are building a model to predict the occurrence of breast cancer. Men and women develop breast cancer at dramatically different rates, so a definition of fairness that required similar predictions for the two groups would be ill-suited to this problem. Unfortunately, determining the balance of ground truth is generally hard, because in most cases our only information about it comes from datasets that may themselves be biased.
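One simple diagnostic before committing to a fairness metric is to compare the base rate of the outcome across groups. A minimal sketch, using made-up labels and group names:

```python
import numpy as np

# Hypothetical labels (1 = condition present) with a recorded group attribute.
labels = np.array([1, 0, 0, 0, 1, 1, 1, 0, 1, 0])
groups = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])

# Base rate per group: if these differ substantially, a fairness definition
# that demands equal positive prediction rates across groups is at odds
# with the ground truth itself.
for g in np.unique(groups):
    mask = groups == g
    print(f"group {g}: base rate = {labels[mask].mean():.2f} (n = {mask.sum()})")
```

Keep in mind that this only measures the balance of the labels you happen to have, which, as the next section discusses, may themselves be biased.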
Sample Bias vs. Label Bias in your Data
Speaking of which, what types of bias might be present in your data? In our thinking, we focused on two types of bias introduced by the data-generating process: label bias and sample bias. Label bias occurs when the data-generating process systematically assigns labels differently for different groups.
For example, studies have shown that non-white kindergarten children are suspended at higher rates than white children for the same problem behavior, so a dataset of kindergarten disciplinary records might contain label bias. Accuracy is often a component of fairness definitions, but optimizing for accuracy when the data labels are biased can perpetuate biased decision making.
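A small simulation can illustrate the effect. Here the true rate of the behavior is identical across groups, but one group’s behavior is recorded more often; all of the rates below are made-up assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation of label bias: the true rate of the behavior is
# identical in both groups, but the labeling process records it more often
# for group 1.
n = 100_000
group = rng.integers(0, 2, size=n)
true_behavior = rng.random(n) < 0.10                 # same 10% rate everywhere
record_prob = np.where(group == 1, 0.90, 0.50)       # biased chance of a label
recorded = true_behavior & (rng.random(n) < record_prob)

for g in (0, 1):
    mask = group == g
    print(f"group {g}: true rate = {true_behavior[mask].mean():.3f}, "
          f"labeled rate = {recorded[mask].mean():.3f}")
```

A model evaluated purely on accuracy against the recorded labels is rewarded for reproducing the roughly 5% vs. 9% gap, even though the underlying behavior is the same in both groups.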
Sample bias occurs when the data-generating process samples different groups in different ways. For example, an analysis of New York City’s stop-and-frisk policy found that Black and Hispanic people were stopped by police at rates disproportionate to their share of the population, even after controlling for geographic variation and estimated levels of crime participation. A dataset describing these stops would contain sample bias because the process by which data points are collected differs for people of different races. Sample bias undermines the usefulness of accuracy as well as ratio-based comparisons, both of which are frequently used in definitions of algorithmic fairness.
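The distortion is easy to see in a toy simulation where both groups engage in a behavior at the same rate but one group is sampled far more often; the population size, behavior rate, and stop rates below are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation of sample bias: both groups engage in the behavior
# at the same rate, but group_b is stopped (i.e., sampled) five times as often.
true_rate = 0.05
stop_rate = {"group_a": 0.02, "group_b": 0.10}
pop_size = 50_000

for g, p_stop in stop_rate.items():
    behavior = rng.random(pop_size) < true_rate
    stopped = rng.random(pop_size) < p_stop
    in_data = behavior & stopped      # only stopped individuals appear in the data
    print(f"{g}: true positives = {behavior.sum()}, recorded positives = {in_data.sum()}")
```

Although the true positives are split roughly 50/50, about 80% of the recorded positives belong to group_b, so any ratio computed from the recorded counts will overstate that group’s involvement.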