


# Data Balance Analysis - Adult Census Income

Data Balance Analysis is relevant for gaining an overall understanding of datasets, but it becomes essential when thinking about building AI systems in a responsible way, especially in terms of fairness.

AI systems can sometimes exhibit unwanted, unfair behaviors. These behaviors can cause fairness-related harms that affect various groups of people. They may amplify the marginalization of particular groups whose needs and contexts are often overlooked during AI development and deployment. Fairness-related harms can have varying severities, and the cumulative impact of even seemingly non-severe harms can be burdensome. Such harms include:

- Allocation harms: When an AI system extends or withholds opportunities or resources in ways that negatively impact people's lives.
- Quality of service harms: When an AI system does not work as well for one group of people as it does for another.
- Stereotyping harms: When an AI system makes unfair generalizations about groups of people and reinforces negative stereotypes.
- Demeaning harms: When an AI system is actively derogatory or offensive.
- Over/underrepresentation harms: When an AI system over/underrepresents some groups of people or may even erase some groups entirely.

Note: Because fairness in AI is fundamentally a sociotechnical challenge, it's often impossible to fully "de-bias" an AI system. Instead, teams tasked with developing and deploying AI systems must work to identify, measure, and mitigate fairness-related harms as much as possible. Data Balance Analysis is a tool to help do so, in combination with others.

Data Balance Analysis consists of a combination of three groups of measures: Feature Balance Measures, Distribution Balance Measures, and Aggregate Balance Measures.

In summary, Data Balance Analysis, when used as a step for building ML models, has the following benefits:

- It reduces the costs of building ML through the early identification of data representation gaps. Before proceeding to train their models, data scientists can seek mitigation steps such as collecting more data, following a specific sampling mechanism, creating synthetic data, and so on.
- It enables easy end-to-end debugging of ML systems in combination with the RAI Toolbox by providing a clear view of model-related issues versus data-related issues.

Data Balance Analysis currently supports three transformers in the `synapse.ml.exploratory` namespace (a usage sketch follows the list):

- `FeatureBalanceMeasure` - supervised (requires label column)
- `DistributionBalanceMeasure` - unsupervised (doesn't require label column)
- `AggregateBalanceMeasure` - unsupervised (doesn't require label column)
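As a minimal sketch of how one of these transformers might be invoked, assume a Spark DataFrame loaded from the Adult Census Income dataset with `sex` and `race` columns and a 0/1 `income` label; the file path and column names here are illustrative, not prescribed by this document:

```python
from pyspark.sql import SparkSession
from synapse.ml.exploratory import FeatureBalanceMeasure

spark = SparkSession.builder.getOrCreate()

# Illustrative input: the Adult Census Income dataset with a 0/1 "income" label.
df = spark.read.parquet("adult_census_income.parquet")  # placeholder path

feature_balance = (
    FeatureBalanceMeasure()
    .setSensitiveCols(["race", "sex"])  # sensitive feature columns to compare
    .setLabelCol("income")              # supervised: requires the label column
    .transform(df)
)

feature_balance.show(truncate=False)
```

The two unsupervised transformers follow the same builder pattern, minus the label column.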
Note: If you're running this notebook in a Spark environment such as Azure Synapse or Databricks, then you can easily visualize the imbalance measures by calling the built-in plotting feature `display()`.

## Measure Explanations

### Feature Balance Measures

Feature Balance Measures allow us to see whether each combination of sensitive features is receiving the positive outcome (true prediction) at a balanced probability. Using the dataset, we can see if the various sexes and races are receiving >50k income at equal or unequal rates.

In this context, we define a feature balance measure, called the parity, for label $y$. It is the difference between the association metrics of two different sensitive classes, with respect to the association metric $A(x_i, y)$:

$$parity(y \vert x_A, x_B, A(\cdot)) \coloneqq A(x_A, y) - A(x_B, y)$$

Note: Many of these metrics were influenced by the paper Measuring Model Biases in the Absence of Ground Truth.

One such association metric is Demographic Parity, which measures whether each segment of a protected class (gender, for example) receives the positive outcome at equal rates; closer to zero means better parity:

$$DP = P(Y \vert A = Male) - P(Y \vert A = Female)$$
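To make the Demographic Parity formula concrete: if, for example, 30% of one group and 20% of another receive the positive label, the gap is 0.10. Below is a minimal sketch of that computation in plain PySpark, reusing the illustrative `df` and column names assumed in the earlier snippet:

```python
from pyspark.sql import functions as F

# P(Y = 1 | A = group): positive-outcome rate for each value of the
# sensitive column ("sex" here, assuming the Adult Census Income schema
# and an "income" label already binarized to 0/1).
rates = (
    df.groupBy("sex")
    .agg(F.avg("income").alias("positive_rate"))  # mean of a 0/1 label = rate
    .collect()
)
rate_by_group = {row["sex"]: row["positive_rate"] for row in rates}

# DP = P(Y | A = Male) - P(Y | A = Female); closer to zero means better parity.
dp = rate_by_group["Male"] - rate_by_group["Female"]
print(f"Demographic parity gap: {dp:.3f}")
```

`FeatureBalanceMeasure` reports this same kind of gap for every pair of classes within each sensitive column, alongside the other association metrics.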
