
Representation Bias in Data: A Survey on Identification and Resolution Techniques

Published: 13 July 2023

Abstract

Data-driven algorithms are only as good as the data they work with, yet datasets, especially social data, often fail to represent minorities adequately. Representation bias in data can arise for a variety of reasons, ranging from historical discrimination to selection and sampling biases in data acquisition and preparation. Given that “bias in, bias out,” one cannot expect AI-based solutions to produce equitable outcomes in societal applications without addressing issues such as representation bias. While fairness in machine learning models has been studied extensively, including in several review papers, bias in the data itself has received less attention. This article reviews the literature on identifying and resolving representation bias as a property of a dataset, independent of how the dataset is later consumed. The survey covers both structured (tabular) and unstructured (e.g., image, text, graph) data. It presents taxonomies that categorize the studied techniques along multiple design dimensions and provides a side-by-side comparison of their properties.
There is still a long way to go toward fully addressing representation bias in data. The authors hope that this survey motivates researchers to take on these challenges, building on the existing work within their respective domains.
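
To make the survey's subject concrete, the sketch below illustrates both sides of the problem on tabular data: it flags intersectional groups whose count falls below a coverage threshold (identification) and then naively rebalances one flagged group by oversampling with replacement (resolution). This is only a toy illustration in the spirit of the coverage-based and resampling-based techniques the survey reviews; the column names, the threshold tau, and the toy data are assumptions made for this example, not code from any of the surveyed works.

```python
# Illustrative sketch only: coverage-style identification of under-represented
# intersectional groups, plus naive oversampling as one possible resolution.
# Column names, the threshold, and the toy data are assumptions for this example.
import pandas as pd


def find_uncovered_groups(df: pd.DataFrame, attrs: list, tau: int) -> pd.DataFrame:
    """Return combinations of values of `attrs` that occur fewer than `tau` times."""
    counts = df.groupby(attrs).size().reset_index(name="count")
    return counts[counts["count"] < tau]


def oversample_group(df: pd.DataFrame, attrs: list, values: tuple,
                     target: int, seed: int = 0) -> pd.DataFrame:
    """Duplicate rows of the group identified by `values` until it has `target` rows."""
    mask = pd.Series(True, index=df.index)
    for attr, value in zip(attrs, values):
        mask &= df[attr] == value
    group = df[mask]
    n_extra = max(target - len(group), 0)
    if group.empty or n_extra == 0:
        return df.copy()  # nothing to duplicate, or the group is absent from the data
    extra = group.sample(n=n_extra, replace=True, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)


if __name__ == "__main__":
    data = pd.DataFrame({
        "sex":   ["F", "F", "F", "M", "M", "M", "M"],
        "race":  ["B", "W", "W", "B", "B", "W", "W"],
        "label": [1, 0, 1, 0, 1, 0, 1],
    })
    print(find_uncovered_groups(data, ["sex", "race"], tau=2))   # only (F, B) occurs once
    balanced = oversample_group(data, ["sex", "race"], ("F", "B"), target=2)
    print(balanced.groupby(["sex", "race"]).size())              # every group now has >= 2 rows
```

Note that duplicating minority rows only changes counts, not information content; the resolution techniques reviewed in the survey go further, for example by synthesizing new minority examples (SMOTE-style) or by acquiring additional data for the under-covered groups.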



Index Terms

  1. Representation Bias in Data: A Survey on Identification and Resolution Techniques


Information

    Published In

ACM Computing Surveys, Volume 55, Issue 13s
December 2023
1367 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3606252

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 July 2023
    Online AM: 17 March 2023
    Accepted: 14 February 2023
    Revised: 25 September 2022
    Received: 15 March 2022
    Published in CSUR Volume 55, Issue 13s


    Author Tags

    1. Responsible data science
    2. fairness in machine learning
    3. data equity systems
    4. data-centric AI
    5. AI-ready data

    Qualifiers

    • Survey

    Funding Sources

    • National Science Foundation

Article Metrics

• Downloads (Last 12 months): 1,696
• Downloads (Last 6 weeks): 158
Reflects downloads up to 05 Jan 2025


Cited By

• (2025) Effect of dataset representation bias on generalizability of machine learning models in predicting flexural properties of ultra-high-performance concrete (UHPC) beams. Engineering Structures 326, 119508. DOI: 10.1016/j.engstruct.2024.119508. Online publication date: Mar-2025
• (2025) Graph neural networks for classification and error detection in 2D architectural detail drawings. Automation in Construction 170, 105936. DOI: 10.1016/j.autcon.2024.105936. Online publication date: Feb-2025
• (2024) A risk-oriented approach to implementing technology for the disposal of anti-tank rocket projectiles. Problems of Emergency Situations, 66-80. DOI: 10.52363/2524-0226-2024-39-5. Online publication date: 24-Apr-2024
• (2024) Detection of Heart Disease Using ANN. Future of AI in Biomedicine and Biotechnology, 182-196. DOI: 10.4018/979-8-3693-3629-8.ch009. Online publication date: 30-May-2024
• (2024) Investigating and Mitigating the Performance–Fairness Tradeoff via Protected-Category Sampling. Electronics 13(15), 3024. DOI: 10.3390/electronics13153024. Online publication date: 31-Jul-2024
• (2024) A Comprehensive Review of AI Techniques for Addressing Algorithmic Bias in Job Hiring. AI 5(1), 383-404. DOI: 10.3390/ai5010019. Online publication date: 7-Feb-2024
• (2024) An Advance Review of Urban-AI and Ethical Considerations. Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Advances in Urban-AI, 24-33. DOI: 10.1145/3681780.3697246. Online publication date: 29-Oct-2024
• (2024) “Not quite there yet”: On Users Perception of Popular Healthcare Chatbot Apps for Personal Health Management. Proceedings of the 17th International Conference on PErvasive Technologies Related to Assistive Environments, 191-197. DOI: 10.1145/3652037.3652042. Online publication date: 26-Jun-2024
• (2024) Representation Debiasing of Generated Data Involving Domain Experts. Adjunct Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, 516-522. DOI: 10.1145/3631700.3664910. Online publication date: 27-Jun-2024
• (2024) A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2199-2208. DOI: 10.1145/3630106.3659033. Online publication date: 3-Jun-2024
