
Representation Bias in Data: A Survey on Identification and Resolution Techniques

Published: 13 July 2023

Abstract

Data-driven algorithms are only as good as the data they work with, yet datasets, especially social data, often fail to represent minorities adequately. Representation bias in data can arise for a variety of reasons, ranging from historical discrimination to selection and sampling biases in data acquisition and preparation. Given that “bias in, bias out,” one cannot expect AI-based solutions to produce equitable outcomes in societal applications without addressing issues such as representation bias. While fairness in machine learning models has been studied extensively, including in several review papers, bias in the data itself has received less attention. This article reviews the literature on identifying and resolving representation bias as a property of a dataset, independent of how the dataset is later consumed. The survey covers both structured (tabular) and unstructured (e.g., image, text, graph) data. It presents taxonomies that categorize the studied techniques along multiple design dimensions and provides a side-by-side comparison of their properties.
There is still a long way to go toward fully addressing representation bias in data. The authors hope that this survey motivates researchers to take on these challenges, building on the existing work within their respective domains.
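
To make the survey's subject concrete, the sketch below illustrates both sides of the problem on tabular data: it flags intersectional groups whose count falls below a coverage threshold (identification) and then naively rebalances one flagged group by oversampling with replacement (resolution). This is only a toy illustration in the spirit of the coverage-based and resampling-based techniques the survey reviews; the column names, the threshold tau, and the toy data are assumptions made for this example, not code from any of the surveyed works.

```python
# Illustrative sketch only: coverage-style identification of under-represented
# intersectional groups, plus naive oversampling as one possible resolution.
# Column names, the threshold, and the toy data are assumptions for this example.
import pandas as pd


def find_uncovered_groups(df: pd.DataFrame, attrs: list, tau: int) -> pd.DataFrame:
    """Return combinations of values of `attrs` that occur fewer than `tau` times."""
    counts = df.groupby(attrs).size().reset_index(name="count")
    return counts[counts["count"] < tau]


def oversample_group(df: pd.DataFrame, attrs: list, values: tuple,
                     target: int, seed: int = 0) -> pd.DataFrame:
    """Duplicate rows of the group identified by `values` until it has `target` rows."""
    mask = pd.Series(True, index=df.index)
    for attr, value in zip(attrs, values):
        mask &= df[attr] == value
    group = df[mask]
    n_extra = max(target - len(group), 0)
    if group.empty or n_extra == 0:
        return df.copy()  # nothing to duplicate, or the group is absent from the data
    extra = group.sample(n=n_extra, replace=True, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)


if __name__ == "__main__":
    data = pd.DataFrame({
        "sex":   ["F", "F", "F", "M", "M", "M", "M"],
        "race":  ["B", "W", "W", "B", "B", "W", "W"],
        "label": [1, 0, 1, 0, 1, 0, 1],
    })
    print(find_uncovered_groups(data, ["sex", "race"], tau=2))   # only (F, B) occurs once
    balanced = oversample_group(data, ["sex", "race"], ("F", "B"), target=2)
    print(balanced.groupby(["sex", "race"]).size())              # every group now has >= 2 rows
```

Note that duplicating minority rows only changes counts, not information content; the resolution techniques reviewed in the survey go further, for example by synthesizing new minority examples (SMOTE-style) or by acquiring additional data for the under-covered groups.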



Index Terms

  1. Representation Bias in Data: A Survey on Identification and Resolution Techniques


Information

    Published In

ACM Computing Surveys, Volume 55, Issue 13s
December 2023
1367 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3606252

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 July 2023
    Online AM: 17 March 2023
    Accepted: 14 February 2023
    Revised: 25 September 2022
    Received: 15 March 2022
    Published in CSUR Volume 55, Issue 13s


    Author Tags

    1. Responsible data science
    2. fairness in machine learning
    3. data equity systems
    4. data-centric AI
    5. AI-ready data

    Qualifiers

    • Survey

    Funding Sources

    • National Science Foundation

Article Metrics

• Downloads (Last 12 months): 1,696
• Downloads (Last 6 weeks): 158
Reflects downloads up to 05 Jan 2025


Cited By

• (2025) Effect of dataset representation bias on generalizability of machine learning models in predicting flexural properties of ultra-high-performance concrete (UHPC) beams. Engineering Structures 326, 119508. DOI: 10.1016/j.engstruct.2024.119508. Online publication date: Mar-2025
• (2025) Graph neural networks for classification and error detection in 2D architectural detail drawings. Automation in Construction 170, 105936. DOI: 10.1016/j.autcon.2024.105936. Online publication date: Feb-2025
• (2024) A risk-oriented approach to implementing technology for the disposal of anti-tank rocket projectiles. Problems of Emergency Situations, 66-80. DOI: 10.52363/2524-0226-2024-39-5. Online publication date: 24-Apr-2024
• (2024) Detection of Heart Disease Using ANN. Future of AI in Biomedicine and Biotechnology, 182-196. DOI: 10.4018/979-8-3693-3629-8.ch009. Online publication date: 30-May-2024
• (2024) Investigating and Mitigating the Performance–Fairness Tradeoff via Protected-Category Sampling. Electronics 13(15), 3024. DOI: 10.3390/electronics13153024. Online publication date: 31-Jul-2024
• (2024) A Comprehensive Review of AI Techniques for Addressing Algorithmic Bias in Job Hiring. AI 5(1), 383-404. DOI: 10.3390/ai5010019. Online publication date: 7-Feb-2024
• (2024) An Advance Review of Urban-AI and Ethical Considerations. Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Advances in Urban-AI, 24-33. DOI: 10.1145/3681780.3697246. Online publication date: 29-Oct-2024
• (2024) “Not quite there yet”: On Users Perception of Popular Healthcare Chatbot Apps for Personal Health Management. Proceedings of the 17th International Conference on PErvasive Technologies Related to Assistive Environments, 191-197. DOI: 10.1145/3652037.3652042. Online publication date: 26-Jun-2024
• (2024) Representation Debiasing of Generated Data Involving Domain Experts. Adjunct Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, 516-522. DOI: 10.1145/3631700.3664910. Online publication date: 27-Jun-2024
• (2024) A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2199-2208. DOI: 10.1145/3630106.3659033. Online publication date: 3-Jun-2024
