Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate

University of Maryland, College Park    University of Southern California
ICML 2024

*Equal Contribution

Abstract

Decisions made by machine learning models can have lasting impacts, making long-term fairness a critical consideration. It has been observed that ignoring the long-term effect and directly applying fairness criteria from static settings can actually worsen bias over time. To address biases in sequential decision-making, we introduce a long-term fairness concept named Equal Long-term BEnefit RaTe (ELBERT). This concept is seamlessly integrated into a Markov Decision Process (MDP) to consider the future effects of actions on long-term fairness, thus providing a unified framework for fair sequential decision-making problems. ELBERT effectively addresses the temporal discrimination issues found in previous long-term fairness notions. Additionally, we demonstrate that the policy gradient of Long-term Benefit Rate can be analytically simplified to standard policy gradients. This simplification makes conventional policy optimization methods viable for reducing bias, leading to our bias mitigation approach ELBERT-PO. Extensive experiments across diverse sequential decision-making environments consistently reveal that ELBERT-PO significantly diminishes bias while maintaining high utility. Code is available at https://github.com/umd-huang-lab/ELBERT.

Motivation

Directly imposing static fairness constraints, without considering the future effects of the current action/decision, can actually exacerbate bias in the long run. To explicitly address this, recent efforts model the long-term effects of the actions/decisions at each time step, in terms of both utility and fairness, using the framework of Markov Decision Processes (MDPs).

The predominant long-term fairness notion models long-term bias by estimating the accumulation of step-wise biases over future time steps. This is a ratio-before-aggregation notion: the bias (a ratio) is computed at each time step, and these step-wise biases are then aggregated. An illustrative example is shown in the figure below, concerning a loan application scenario where the bank aims to maximize profit while ensuring demographic parity. Under this notion, the result suggests that the sequential decisions are unbiased.

[Figure: illustrative loan application example]

However, the ratio-before-aggregation notion inadvertently leads to temporal discrimination: within the same group, decisions made for individuals at different time steps carry unequal importance in characterizing long-term unfairness. Consider the following two scenarios, which are almost identical except that the approvals for the red group at two time steps are reversed. Under the previous ratio-before-aggregation notions, the long-term bias is zero for trajectory A and non-zero for trajectory B. The bank therefore prefers A over B, and thus inadvertently favors approving red applicants at time $t+1$ over red applicants at time $t$, causing discrimination.

To address the issue of temporal discrimination, we adopt a ratio-after-aggregation metric. This metric considers the overall acceptance rate across time: the total number of approved loans over time, normalized by the total number of applicants. Since the decisions are aggregated across time before normalizing by the total number of applications, there is no distinction, in terms of fairness, between allocating an approval to a red applicant at time $t$ and at time $t+1$, avoiding the issue of temporal discrimination.
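To make this concrete, here is a toy numeric sketch (the numbers are illustrative, not from the paper) comparing the two notions on two trajectories that give the red group the same total number of approvals but distribute them differently across time:

```python
# Each entry is (number approved, number of applicants) for one time step.
blue  = [(5, 10), (5, 10)]   # identical in both trajectories
red_A = [(2, 4), (1, 2)]     # trajectory A
red_B = [(1, 4), (2, 2)]     # trajectory B: the red approvals at the two steps are swapped

def ratio_before_aggregation_bias(red, blue):
    # Previous notion: compute the acceptance-rate gap at each step, then accumulate.
    return sum(abs(ra / na - rb / nb) for (ra, na), (rb, nb) in zip(red, blue))

def ratio_after_aggregation_bias(red, blue):
    # ELBERT-style notion: overall acceptance rate = total approvals / total applicants.
    red_rate  = sum(a for a, _ in red)  / sum(n for _, n in red)
    blue_rate = sum(a for a, _ in blue) / sum(n for _, n in blue)
    return abs(red_rate - blue_rate)

print(ratio_before_aggregation_bias(red_A, blue))  # 0.0
print(ratio_before_aggregation_bias(red_B, blue))  # 0.75 -> the bank prefers A over B
print(ratio_after_aggregation_bias(red_A, blue))   # 0.0
print(ratio_after_aggregation_bias(red_B, blue))   # 0.0  -> A and B are treated identically
```

Both trajectories grant the red group the same total number of approvals, yet the ratio-before-aggregation notion penalizes B solely because of when those approvals happen, which is exactly the temporal discrimination described above.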

ELBERT: Equal Long-term Benefit Rate

To generally adapt static group fairness notions to sequential settings in the ratio-after-aggregation manner, we introduce Equal Long-term Benefit Rate (ELBERT). As a unified framework, it builds on the standard MDP formulation augmented with an immediate group supply and an immediate group demand, which we formalize as the Supply-Demand MDP (SD-MDP).

[Figure: the Supply-Demand MDP (SD-MDP)]

With the cumulative group supply $\eta_g^S(\pi)$ and the cumulative group demand $\eta_g^D(\pi)$, we define the Long-term Benefit Rate of group $g$ as $\eta_g^S(\pi)/\eta_g^D(\pi)$, and define the bias of a policy as the maximal difference in Long-term Benefit Rate among groups.

$$\text{Long-term Benefit Rate of group } g:\quad \frac{\eta_g^S(\pi)}{\eta_g^D(\pi)}, \qquad b(\pi) \;=\; \max_{g}\frac{\eta_g^S(\pi)}{\eta_g^D(\pi)} \;-\; \min_{g}\frac{\eta_g^S(\pi)}{\eta_g^D(\pi)}$$
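As a minimal sketch (function names are illustrative, not from the released code), the Long-term Benefit Rate and the bias can be computed directly from the per-group cumulative supply and demand:

```python
def long_term_benefit_rates(cum_supply, cum_demand):
    """cum_supply[g], cum_demand[g]: cumulative supply/demand of group g under the policy."""
    return {g: cum_supply[g] / cum_demand[g] for g in cum_supply}

def bias(cum_supply, cum_demand):
    """Maximal difference in Long-term Benefit Rate among groups."""
    rates = long_term_benefit_rates(cum_supply, cum_demand).values()
    return max(rates) - min(rates)

# Example: group 0 receives 30 approvals out of 60 applications, group 1 receives 25 out of 40.
print(bias({0: 30.0, 1: 25.0}, {0: 60.0, 1: 40.0}))  # 0.125
```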

Under the framework of ELBERT, the goal of reinforcement learning with fairness constraints is to find a policy that maximizes the cumulative reward while keeping the bias below a threshold $\delta$.

$$\max_{\pi}\ \eta(\pi) \quad \text{s.t.}\quad b(\pi)\le\delta,$$
where $\eta(\pi)$ denotes the expected cumulative reward.

ELBERT-PO

To solve the constrained problem, we propose to maximize the following unconstrained objective, where $\alpha$ is a constant controlling the trade-off between the standard RL reward and the bias.

$$\max_{\pi}\ J(\pi) \;=\; \eta(\pi) \;-\; \alpha\, b(\pi)^2$$

Challenge 1: Policy gradient of $b(\pi)$

Policy optimization methods are natural for the objective above. However, it was previously unclear how to handle $\nabla b(\pi)$, since $b(\pi)$ is not in the form of an expected total return.

To solve this, we first analytically reduce the objective's gradient to standard policy gradients:

$$\nabla_\theta J(\pi) \;=\; \nabla_\theta \eta(\pi) \;-\; \alpha \sum_{g} \frac{\partial h}{\partial z_g}\cdot \frac{\eta_g^D(\pi)\,\nabla_\theta \eta_g^S(\pi) \;-\; \eta_g^S(\pi)\,\nabla_\theta \eta_g^D(\pi)}{\big(\eta_g^D(\pi)\big)^2}$$

Here $h={b(\pi)}^2$ is viewed as a function of the vector $z$ of group Long-term Benefit Rates, and $\frac{\partial h}{\partial z_g}$ is its partial derivative w.r.t. the $g$-th coordinate.
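For concreteness, here is a small sketch (ours, not from the released code) of these partial derivatives when $h(z) = (\max_g z_g - \min_g z_g)^2$, as in the bias definition above. Only the arg-max and arg-min groups receive non-zero weight, which already hints at the non-smoothness issue discussed in Challenge 2.

```python
import numpy as np

def dh_dz(z):
    """Partial derivatives of h(z) = (max(z) - min(z))**2 w.r.t. each coordinate,
    assuming the arg-max and arg-min are unique."""
    z = np.asarray(z, dtype=float)
    gap = z.max() - z.min()
    grad = np.zeros_like(z)
    grad[z.argmax()] += 2.0 * gap   # increasing the largest rate increases the bias
    grad[z.argmin()] -= 2.0 * gap   # increasing the smallest rate decreases the bias
    return grad

print(dh_dz([0.50, 0.625, 0.55]))   # [-0.25  0.25  0.  ]
```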

Then we compute the gradient of the objective function using advantage functions as follows.

$$\nabla_\theta J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t}\nabla_\theta \log \pi(a_t\mid s_t)\, A^{\text{fair}}(s_t,a_t)\right], \qquad A^{\text{fair}} \;=\; A_{r} \;-\; \alpha\sum_{g}\frac{\partial h}{\partial z_g}\cdot\frac{A_{g}^{S} - \big(\eta_g^S(\pi)/\eta_g^D(\pi)\big)\,A_{g}^{D}}{\eta_g^D(\pi)},$$
where $A_r$, $A_g^S$ and $A_g^D$ are the advantage functions of the reward, the group supply and the group demand, respectively.

In practice, we use PPO with the fairness-aware advantage function to update the policy.
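As a minimal sketch of how such a fairness-aware advantage could be assembled before the standard PPO update, assuming the quotient-rule form of the gradient above (the function and variable names, e.g. `fairness_aware_advantage`, `A_S`, `A_D`, are ours, not the codebase's):

```python
import numpy as np

def fairness_aware_advantage(A_r, A_S, A_D, eta_S, eta_D, dh_dz, alpha):
    """Combine reward and per-group supply/demand advantages into a single advantage.

    A_r                : reward advantages, shape (T,)
    A_S[g], A_D[g]     : advantages of group g's supply / demand returns, shape (T,)
    eta_S[g], eta_D[g] : estimates of the cumulative group supply / demand
    dh_dz[g]           : partial derivative of h = b(pi)^2 w.r.t. group g's benefit rate
    alpha              : bias trade-off coefficient from the objective
    """
    A_fair = np.asarray(A_r, dtype=float).copy()
    for g in A_S:
        z_g = eta_S[g] / eta_D[g]
        # Per-group correction from the quotient rule: (A_S - z_g * A_D) / eta_D
        A_fair -= alpha * dh_dz[g] * (np.asarray(A_S[g]) - z_g * np.asarray(A_D[g])) / eta_D[g]
    return A_fair  # used in place of the usual advantage in PPO's clipped surrogate loss
```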

Challenge 2: Non-smoothness in multi-group bias

When there are multiple groups, the max and min operators in the objective can cause non-smoothness during training. This is especially problematic when several other groups have Long-term Benefit Rates close to the maximal or minimal values.

$$b(\pi) \;=\; \max_{g\in G}\frac{\eta_g^S(\pi)}{\eta_g^D(\pi)} \;-\; \min_{g\in G}\frac{\eta_g^S(\pi)}{\eta_g^D(\pi)}$$

To address this, we replace the max and min operators with smoothed versions controlled by a temperature $\beta>0$, and define the soft bias:

$$b_{\text{soft}}(\pi) \;=\; \frac{1}{\beta}\log\sum_{g\in G} e^{\beta z_g} \;+\; \frac{1}{\beta}\log\sum_{g\in G} e^{-\beta z_g}, \qquad z_g=\frac{\eta_g^S(\pi)}{\eta_g^D(\pi)},$$
i.e., the max is replaced by the log-sum-exp softmax and the min by the corresponding softmin.

The soft bias is an upper bound of the exact bias, and the quality of this approximation is controllable: the gap between the two decreases as $\beta$ increases and vanishes as $\beta\to\infty$.
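A small numerical sketch (ours) of the soft bias via log-sum-exp, illustrating the upper-bound property and its convergence to the exact bias as $\beta$ grows:

```python
import numpy as np
from scipy.special import logsumexp  # numerically stable log-sum-exp

def exact_bias(z):
    return np.max(z) - np.min(z)

def soft_bias(z, beta):
    z = np.asarray(z, dtype=float)
    soft_max = logsumexp(beta * z) / beta    # >= max(z)
    soft_min = -logsumexp(-beta * z) / beta  # <= min(z)
    return soft_max - soft_min               # upper-bounds the exact bias

z = [0.50, 0.625, 0.55]   # Long-term Benefit Rates of three hypothetical groups
print(exact_bias(z))      # 0.125
for beta in (1, 10, 100, 1000):
    print(beta, soft_bias(z, beta))  # decreases toward 0.125 as beta grows
```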

[Algorithm: ELBERT-PO]

Experimental Results

From loan approvals to medical allocations and attention distribution, our simulations show ELBERT-PO consistently achieves the lowest bias among all baselines while maintaining high rewards.

Poster

[Poster image]

BibTeX

@inproceedings{xuadapting,
        title={Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate},
        author={Xu, Yuancheng and Deng, Chenghao and Sun, Yanchao and Zheng, Ruijie and Wang, Xiyao and Zhao, Jieyu and Huang, Furong},
        booktitle={Forty-first International Conference on Machine Learning},
        year={2024}
      }