Abstract
AI-driven journalism refers to various methods and tools for gathering, verifying, producing, and distributing news information. Their potential is to extend human capabilities and create new forms of augmented journalism. Although scholars agree on the need to embed journalistic values in these systems to make them accountable, less attention has been paid to data quality, even though the accuracy and efficiency of results in any machine learning task depend on high-quality data. Assessing data quality in the context of AI-driven journalism requires a broader, interdisciplinary approach that draws on both the data quality challenges of machine learning and the ethical challenges of using machine learning in journalism. To better identify these challenges, we propose a data quality assessment framework to support the collection and pre-processing stages of machine learning. It relies on three core principles of ethical journalism: accuracy, fairness, and transparency. It also contributes to the shift from model-centric to data-centric AI by focusing on data quality to reduce reliance on large, error-prone datasets, make data labelling consistent, and better integrate journalistic knowledge.
Introduction
AI-driven journalism leverages techniques and tools to gather, verify, produce, and distribute news (Thurman et al., 2019). These tools enhance professional practice by speeding up labour-intensive tasks, publishing automated content, identifying trends, and providing insights into large numerical or textual datasets. They therefore have the potential to extend human capabilities, which can be seen as a new form of augmented journalism (Lindén, 2020). This echoes McLuhan’s theory of media as extensions of the human, in the sense that AI-driven technology enables journalists to overcome traditional limitations and improves their ability to gather, analyse, and disseminate information more efficiently and accurately.
However, recent AI-driven systems rely heavily on machine learning techniques, which are perceived as opaque and potentially biased, making them difficult to use in journalism (Guidotti et al., 2019). A feature of these systems is their reliance on high-quality data to avoid inaccurate analysis and unreliable decisions (Gupta et al., 2021). Explaining how data is collected, organised, cleaned, annotated and processed helps open up the black-box nature of these machine learning systems and thus builds trust between the journalist and the system and, in turn, between the journalist and the audience. This means understanding the data quality challenges that arise at different stages of the machine learning process (Gudivada et al., 2017), following the ‘garbage in, garbage out’ principle widely accepted in data and computer science. This is also true in journalism, where quality information requires quality data to ensure the accuracy and reliability of the news (e.g., Anderson, 2018; Dörr & Hollnbuchner, 2017; Lowrey et al., 2019).
Although the quality of digital information is fundamental to journalism (Stray, 2016), the concept of information quality is multifaceted and difficult to define. Ethical journalism is fundamentally linked to quality journalism, and both are deeply rooted in the philosophical concept of truth (García-Avilés, 2021; Porlezza, 2019). While ethical journalism refers to the adherence to ethical standards and principles that promote trustworthiness and ensure that journalists report the news with integrity and responsibility, thereby fostering public trust in the media, quality journalism is much more complicated to define. It has been described as referring to the degree or level of excellence of information (Sundar, 1998) or to the general but vague notion of having read a ‘good story’, which depends on the socio-cultural background of the recipient and can therefore vary from reader to reader (Clerwall, 2014). It has also been defined by several dimensions of quality—accuracy, comprehensibility, timeliness, reliability and validity—through which journalists commit to trustworthy content (Diakopoulos, 2019). This last proposition aligns with the multidimensional concept of data quality as it is approached in data and computer science, where poor data quality has well-documented consequences. For example, it can lead to wrong decisions, loss of credibility or financial losses due to incorrect data analysis (De Veaux & Hand, 2005; McCallum, 2012; McCausland, 2021). In journalism, poor data quality is likely to undermine public trust, affect media practices and weaken the watchdog function of the media.
Data quality challenges have been discussed in works on data and computational journalism (Anderson, 2018; Karlsen & Stavelin, 2014), but critical aspects of data quality remain under-documented (Lowrey et al., 2019). This paper aims to bridge the gap between machine learning development and journalism by focusing on three key areas: assessing the quality of machine learning datasets, guiding the development of machine learning solutions through a data-centric approach, and promoting AI and data literacy among journalists. It also examines the development of machine learning systems in newsrooms from the perspective of trustworthy AI, highlighting the need to reflect ethical journalism standards and their corollary, high data quality. By grounding machine learning applications in journalistic principles, we also aim to foster interdisciplinary collaborations.
Insofar as data quality is context- and use-dependent (Tayi & Ballou, 1998), this paper first examines the multidimensional concept of data quality and how it challenges machine learning systems in computer and data science. This section highlights the strong ethical dimensions of data quality, recognising that poor data quality can lead to biased or inaccurate reporting. The paper then explores the ethical challenges of AI-driven journalism, identifying specific journalistic requirements for building a flexible data quality assessment framework.
The proposed Accuracy-Fairness-Transparency (AFT) framework addresses key ethical concerns common to data science, machine learning and journalism (e.g. Antunes et al., 2018; Martens, 2022; Opdahl et al., 2023; Ward, 2015, 2018). It can be used to assess the quality of existing datasets for reusability or to support the development of training datasets throughout the data preparation pipeline (data collection, cleaning, augmentation and labelling). It can therefore be seen as a valuable tool to promote data and AI literacy among journalists. From a technical perspective, this framework contributes to the shift from model-centric AI to data-centric AI, where the focus is on ensuring the quality of the underlying data, enabling AI systems to perform well on smaller datasets, as opposed to the model-centric approach that requires large amounts of data (Jarrahi et al., 2022).
Data quality in machine learning
A common way of approaching data quality refers to data that adapts to the uses of data consumers, particularly in terms of accuracy, relevance and understandability (Wang & Strong, 1996). This means that data quality cannot be reduced to an examination of the syntax, format and structure of the data (Eckerson, 2002), nor can it be captured by a single overall measure. First, there is no absolute reference against which to assess the correctness of information stored in a database, which is always context and domain dependent (Boydens & van Hooland, 2011). Second, the quality of a process depends on the quality of the data it consumes (Batini et al., 2009; Moody & Shanks, 2003). Third, when derived from empirical observations, data represent a specific moment in time, which means that values are likely to evolve, just as the concepts and standards to which they refer are likely to evolve (Boydens & van Hooland, 2011; Taleb et al., 2018).
The multidimensionality of the concept of data quality is reflected in several complementary dimensions, which refer to sets of attributes that need to be assessed either formally or empirically (Table 1). The definitions of the accuracy, completeness and consistency dimensions have been refined over time, which means that the assessment of data quality also relies on empirical considerations and thus on human judgement. In addition, the assessment of data quality has also been approached through four complementary semiotic levels reflecting the technical and social aspects contained in datasets (Shanks, 1999): syntactic, relating to the structure of the data; semantic, relating to the meaning of the data; pragmatic, relating to the usefulness of the data; and social, relating to the knowledge shared through the data.
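To make this dual mode of assessment concrete, the sketch below shows one minimal way a single quality dimension could be recorded with both a formal (computed) score and an empirical (human) judgement. The structure and field names are our own illustration, not part of any established framework.

```python
# Minimal sketch: recording one data quality dimension with both a
# formal score and an empirical judgement (illustrative names only).
from dataclasses import dataclass
from typing import Optional

@dataclass
class DimensionAssessment:
    name: str                       # e.g. "completeness"
    formal_score: Optional[float]   # computed, e.g. share of non-missing values
    judgement: Optional[str]        # empirical note from a domain expert

assessment = DimensionAssessment(
    name="completeness",
    formal_score=0.97,              # 97% of fields populated
    judgement="coverage of smaller municipalities is known to be patchy",
)
```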
Scholars have recognised that the causes of poor quality are numerous and can relate to a variety of settings: the modelling of the information system (Lindland et al., 1994; Moody & Shanks, 2003), compliance with integrity constraints (Fox et al., 1994), a lack of routine validation (Eckerson, 2002), or heterogeneous data that can lead to interpretation problems (Madnick & Zhu, 2006). Other considerations add layers to this complexity: poor quality data may coexist with correct data without producing errors (Wang et al., 1995), and data may not contain errors but may not have the meaning expected by the user (Madnick & Zhu, 2006), demonstrating once again the impossibility of approaching data quality through the sole formal examination of the “entity–attribute–value” triplet that characterises relational databases (e.g. the entity “car” has an attribute “colour” which has a value “blue”).
The emergence of big data has led to a reconsideration of data quality issues in relation to user-generated data collected online or through sensors (Batini et al., 2015), the extreme size of which exceeds the capacity of database systems (Becker et al., 2015). As big data moves from the logic of being stored in relational databases to being stored in NoSQL databases—a type of database that provides a mechanism for storing and retrieving data, modelled in a way that enables flexible, scalable and schema-less data management—it challenges dimensions of quality such as trustworthiness, verifiability and data reputation (Batini et al., 2015). At its core, the problem is that trust in big data requires three levels of trust: in the sources, in the processes that bring them together, and in the environments in which the data is applied (e.g. Saha & Srivastava, 2014; Cai & Zhu, 2015; Liu et al., 2016). Big data issues are also related to incomplete, inaccurate, inconsistent or ambiguous structured and unstructured data (Eberendu, 2016).
The approach to data quality in machine learning encompasses all of these considerations. However, it requires a nuanced understanding of the specific processes involved, given the complexity of real-world data. Machine learning relies on computational models trained on real-world data to mimic human intelligence by transforming inputs into outcomes based on mathematical relationships that are difficult to derive through deductive reasoning or simple statistical analysis (Kläs & Vollmer, 2018). It is being used for a wide range of tasks based on various techniques and trained on different types of data. In journalism, machine learning models can be used for breaking news detection, disinformation detection, automated news gathering, content verification, data analysis and visualisation, predictive analysis of future trends, content generation or personalised news recommendations.
Machine learning tasks can be supervised, primarily for classification and prediction, and require annotated data—data that has been labelled or categorised by humans or automated processes. These labels serve as a guide for the model to learn from. Unsupervised learning relies on patterns inherent in unlabelled data for segmentation or information discovery. Semi-supervised learning combines labelled and unlabelled data, e.g. for ranking or automatic content generation. On the other hand, self-supervised learning can automatically learn useful representations found in the training data without the need for extensive human annotation (Jordan & Mitchell, 2015; Jaiswal et al., 2020).
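The contrast between these paradigms can be illustrated with a toy example. The following sketch, using scikit-learn on made-up data, fits a supervised classifier with human-provided labels and an unsupervised clustering model on the same features without labels; the data and model choices are ours, purely for illustration.

```python
# Toy contrast between supervised and unsupervised learning
# (illustrative only; data and models are arbitrary choices).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])                              # human-provided labels

supervised = LogisticRegression().fit(X, y)             # learns the labelled categories
unsupervised = KMeans(n_clusters=2, n_init=10).fit(X)   # finds structure on its own

print(supervised.predict([[0.15, 0.15]]))               # class guided by the labels
print(unsupervised.labels_)                             # clusters discovered in the data
```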
The main differences between supervised and unsupervised learning are not only related to the presence of labelled data. They are also related to the complexity of the task and the accuracy and efficiency of the results (Adnan & Akbar, 2019). In all cases, the quality of the results is influenced by the data provided as input to the system (Gudivada et al., 2017; Gupta et al., 2021). Research also highlights that models trained on incomplete or biased datasets can produce discriminatory results and affect the accuracy of the tasks (Miceli et al., 2021; Shin et al., 2022).
In machine learning, the pre-processing of data depends as much on the specifics of the data, including the types of variables used, as on the algorithm used (Kläs & Vollmer, 2018; Gupta et al., 2021; Ridzuan et al., 2022). Therefore, the choice of a machine learning algorithm depends on the task to be performed, taking into account its strengths and weaknesses regarding the influence of data on its behaviour. For example, the K-Nearest Neighbours (KNN) algorithm used for classification is considered fast and efficient but does not handle missing data well. The Naive Bayes algorithm, also considered fast and efficient, is not ideal for large datasets with many numerical features. With decision tree algorithms, a small change in the training data can lead to significant changes in the results (Lantz, 2014). Nevertheless, enough data must be available to train a model of sufficient quality (in terms of measures such as accuracy, precision and recall and, in a wider sense, fitness for use).
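The sensitivity of KNN to missing data mentioned above is easy to demonstrate: fitting the model on data containing missing values fails unless an imputation step runs first. The following scikit-learn sketch is a minimal illustration with made-up data; mean imputation is an arbitrary choice, not a recommendation.

```python
# Minimal sketch: KNN cannot handle missing values directly, so the
# pipeline imputes them first (toy data; mean imputation is arbitrary).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

model = make_pipeline(
    SimpleImputer(strategy="mean"),       # fills the missing values
    KNeighborsClassifier(n_neighbors=1),  # would fail on NaNs without imputation
)
model.fit(X, y)
print(model.predict([[4.5, 5.5]]))        # -> [1]
```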
Poor data quality can lead to inaccurate models, biased results and reduced performance. For example, using biased datasets that do not accurately represent the characteristics of a particular population can lead to biased and inaccurate results. Similarly, incomplete datasets with missing values can significantly affect the accuracy of machine learning models by providing insufficient or inaccurate training data that cannot produce satisfactory results. Data quality issues can arise at the data collection stage, as data availability does not necessarily equate to data quality (Elouataoui et al., 2022). This is particularly true when dealing with open data, user-generated content, or data from multiple sources (Hair & Sarstedt, 2021).
Pre-processing addresses classic data quality issues such as missing data, duplicates, highly correlated variables, and anomalous or inconsistent values. This stage also includes data normalisation and standardisation (Elouataoui et al., 2022; Foidl & Felderer, 2019; Polyzotis et al., 2018). Annotation of the data introduces additional complexity. Whether automated or crowdsourced, this process is prone to error despite implementing correction procedures (Gupta et al., 2021). Crowdsourcing, in particular, raises concerns about the reliability and accuracy of data annotation due to variability in user expertise and the subjective nature of interpretation (e.g. Chmielewski & Kucker, 2020; Lease, 2011; Ridzuan et al., 2022).
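The classic pre-processing operations mentioned above map onto a handful of routine steps. The sketch below shows what they might look like with pandas on a hypothetical dataset (the file name and columns are assumptions); real pipelines would tune each step to the data at hand.

```python
# Hedged sketch of the classic pre-processing steps named above,
# applied with pandas to a hypothetical dataset.
import pandas as pd

df = pd.read_csv("articles.csv")                  # hypothetical input file

df = df.drop_duplicates()                         # remove exact duplicates
missing_share = df.isna().mean()                  # share of missing values per column
corr = df.corr(numeric_only=True).abs()           # flag highly correlated variables

numeric = df.select_dtypes("number")
standardised = (numeric - numeric.mean()) / numeric.std()  # z-score standardisation
```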
Training datasets, which are used to adapt models to the data, are essential for assessing the suitability of data for machine learning tasks in terms of efficiency, accuracy and complexity (Gupta et al., 2021). The validation process aims to detect and correct errors introduced during data collection, aggregation, or annotation (Gupta et al., 2021; Polyzotis et al., 2018). Data validation is crucial for ensuring the reliability and quality of a system, but it is almost impossible to achieve exhaustively. Although assessing the risk of poor data quality remains feasible, there is still a gap in the discussion on the methods used to define the level of validation required at each stage of a machine learning process (Foidl & Felderer, 2019).
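Even without exhaustive validation, simple rule-based checks can catch some of the errors introduced during collection, aggregation or annotation. The sketch below is one minimal example; the column name "label" and the admissible values "real" and "fake" are hypothetical.

```python
# Minimal sketch of rule-based validation checks on a hypothetical
# annotated dataset ("label", "real" and "fake" are assumed values).
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of data quality issues found by simple rules."""
    issues = []
    if df.duplicated().any():
        issues.append("duplicate rows")                      # aggregation errors
    if df.isna().any().any():
        issues.append("missing values")                      # collection errors
    if "label" in df and not df["label"].isin({"real", "fake"}).all():
        issues.append("unexpected label values")             # annotation errors
    return issues
```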
Trust is essential for the acceptance and adoption of machine learning applications. It is rooted in data shaped by the knowledge and expertise of domain experts, which supports the accuracy and relevance of the outcomes (Mougan et al., 2022; Toreini et al., 2020). Consequently, users’ trust in AI systems also depends on the quality of the data, which must adhere to three basic principles for ensuring the reliability of machine learning applications: prevention, detection and correction (Ehrlinger et al., 2019). In this process, empirical considerations play an important role, as the construction or selection of datasets involves not only technical aspects but also human decisions that can significantly affect the outcome (Miceli et al., 2021).
Ethical foundations for AI-driven journalism
In journalism studies, ethical considerations are based on three main philosophical approaches: Aristotelian virtue ethics, which emphasises the importance of moral character in journalism; Kantian deontology, which stresses the role of duty and respect in ethical decision-making; and consequentialism, which assesses the moral implications of actions based on their consequences (Quinn, 2007; Sanders, 2003; Ward, 2015). This paper adopts a deontological perspective, defining journalism ethics as the principles and standards that guide the ethical behaviour of journalists. These principles govern the daily practice of journalism and emphasise the social responsibility of journalists towards their audiences (Bardoel & d'Haenens, 2004). Ethical journalism is fundamentally committed to respecting the truth by providing accurate, fair and truthful information, which is essential to maintaining credibility and trust in the news (Frost, 2015; van Dalen, 2020; Ward, 2018). Similar to the ‘garbage in, garbage out’ principle in data science, the credibility of news depends heavily on the trustworthiness of its sources. Assessing this trustworthiness is complex, involving both subjective judgments and empirical evidence (Reich, 2011).
Accuracy is about providing a faithful representation of the world. It can be approached through the broader concepts of truth, objectivity, fairness, transparency and credibility (Porlezza, 2019). In practice, it is about providing verified, independent and balanced information using reliable sources (Shapiro et al., 2013). As a tool for accountability, accuracy helps to build trust with audiences. Although it can be seen as an indicator to assess the quality of reporting, it is difficult to evaluate without introducing human judgements about the truthfulness and (non)misleading nature of the narrative (Ekström & Westlund, 2019; Porlezza, 2019). In data science, accuracy refers to the congruence of data with ground truth, closely following the definition used in journalism studies. In machine learning, it is a formal measure of the frequency with which a model correctly predicts an outcome.
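As a formal measure, accuracy in machine learning reduces to simple arithmetic: the share of predictions that match the ground truth. A minimal sketch:

```python
# Accuracy as a formal measure: the share of correct predictions.
def accuracy(y_true: list, y_pred: list) -> float:
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 correct -> 0.75
```

The distance between this formal measure and the journalistic sense of accuracy is worth noting: the former counts matches against a fixed ground truth, while the latter requires human judgement about the truthfulness of the narrative.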
Fairness means using unbiased methods and providing balanced information, which contributes to objective reporting (Helberger & Diakopoulos, 2022; Ward, 2015). As part of journalism culture, it is associated with honesty, privacy, professional distance, truth and impartiality (Ettema et al., 1998; Hanitzsch, 2007). In practice, it means being impartial and disinterested in order to avoid harmful biases and stereotypes (Ali & Hassoun, 2019). Fair reporting involves providing balanced information that gives equal consideration to all stakeholders in the story and presents all sides of the argument (Frost, 2015; Wien, 2005). Fairness also involves ensuring that there is no socio-political agenda behind the news (Ryan, 2001), maintaining non-partisanship (Cavaliere, 2020) and avoiding any other type of possible bias (Muñoz-Torres, 2012a, 2012b). These considerations are echoed in data science and machine learning, where fairness most often refers to unbiased data that prevents the model from perpetuating existing biases, for example those related to social characteristics such as gender or ethnicity (Caton & Haas, 2020). Fairness-conscious approaches include modifying data, incorporating fairness during training, or adjusting model output to reflect ethical principles of unbiased and balanced information (e.g. Le Quy et al., 2022; Pessach & Shmueli, 2022). However, addressing fairness in machine learning remains complex given the broader societal implications of deploying technological solutions that may change existing problems or create new ones (Selbst et al., 2019).
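One common formalisation of group fairness is demographic parity: the rate of positive predictions should not differ substantially between groups. The sketch below computes this gap on made-up data; the group labels and predictions are purely illustrative, and demographic parity is only one of the many fairness metrics discussed in the surveys cited above.

```python
# Hedged sketch: demographic parity gap on toy data (group names
# and predictions are made up for illustration).
from collections import defaultdict

def positive_rates(groups: list, predictions: list) -> dict:
    counts, positives = defaultdict(int), defaultdict(int)
    for g, p in zip(groups, predictions):
        counts[g] += 1
        positives[g] += p
    return {g: positives[g] / counts[g] for g in counts}

rates = positive_rates(["a", "a", "b", "b"], [1, 0, 1, 1])
print(max(rates.values()) - min(rates.values()))  # parity gap: 0.5
```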
Along with accuracy and fairness, objectivity is another essential standard of responsible reporting. It implies that information does not contain subjective characteristics, such as the feelings and beliefs of the journalist (Frost, 2015). However, objectivity is regularly contested because journalism involves human judgements that are inherently highly subjective (Donsbach & Klett, 1993; Muñoz-Torres, 2012a, 2012b; Tong & Zuo, 2021; Wien, 2005). From news gathering to sources, angles and narratives, journalism remains a matter of choice. Consequently, a journalist’s choice of topic, angle, sources, analysis and narrative delivery reflects individual subjectivity, which cannot be fully controlled due to its unconscious nature. Issues related to objectivity are linked to the philosophical challenge of separating facts from values and the idea that raw facts have no meaning unless they are linked to individual subjectivity (Muñoz-Torres, 2012a, 2012b). While objectivity cannot completely eliminate bias, it can be mitigated by factual accuracy, which is central to the presentation of objectively verified facts (Figdor, 2010). The editorial judgements required to shape news stories can also be scrutinised through the lens of reproducibility in scientific research (Figdor, 2010). Despite its controversial status, objectivity remains a critical element of journalists’ professional identity and ethos and is seen as a universally shared value (e.g. Deuze, 2005; Muñoz-Torres, 2012a, 2012b; Schudson & Anderson, 2009; Ward, 2018; Wien, 2005).
Over the past decade, transparency has become a buzzword, first in the Anglo-Saxon countries, as a means to increase credibility and trust among audiences by exposing the hidden factors behind news reporting, such as the processes and methods used (Craft & Vos, 2021; Koliska, 2022). Transparency refers to the exposure of the hidden factors that shape news information and permeates journalistic practices characterised by their openness, such as data journalism (e.g. Coddington, 2015; Mor & Reich, 2018; Zamith, 2019). The spread of fact-checking practices, under the banner of the International Fact-Checking Network (IFCN) and the European Fact-Checking Standards Network (EFCSN), and of data journalism practices that encourage openness has also contributed to the promotion of transparency (Cavaliere, 2020). Seen either as a substitute for objectivity or at least as a tool to enable accountability, openness and trust, the practice of transparency implies that journalists need to be trained in the ethical use of data and that audiences should be informed about the processes at work, especially when it comes to AI-driven processes that are also capable of automatically generating content (e.g. Hansen et al., 2017; Montal & Reich, 2017).
The ethical challenges of transparency in the use of machine learning in journalism are multifaceted and include the data used, the algorithms employed and the resulting outcomes (Dörr & Hollnbuchner, 2017; Karlsson, 2020). Nevertheless, transparency is not always easy to implement in journalism, where practitioners often lack the data and algorithmic literacy to understand how algorithms work (Porlezza & Eberwein, 2022). It is also not easy to implement in deep learning models, where even their creators have to learn how they work due to the multiplicity of their parameters (Boddington, 2017; Burkart & Huber, 2021).
Transparency in AI systems is one of the ethical requirements for ensuring the traceability of a system and building trust with its users (Floridi, 2019). It contributes to the comprehensibility of the system and thus makes it accountable (Boddington, 2017). However, transparency is not without limitations: firstly, users do not necessarily understand how the system works due to a lack of comprehensibility; secondly, it does not guarantee that social and ethical values are embedded in the system; thirdly, computer code may be protected for proprietary reasons; fourthly, transparency is not synonymous with accuracy and reliability (Bartneck et al., 2020; Burrell, 2016; Dourish, 2016). Moreover, transparency can lead to mistrust when users misinterpret correct AI predictions, and to overtrust when they misunderstand incorrect ones (Schmidt et al., 2020).
Given these limitations, explainability—as part of transparency—is considered a more relevant tool for understanding how a given system arrived at a given result (Bartneck et al., 2020; Shin et al., 2022). It also concerns the decision-making process (Rai, 2020; Graziani et al., 2022) and supports the accountability of AI systems (Burkart & Huber, 2021). However, it does not guarantee the absence of hidden agendas or the reliability of results (Ferrario & Loi, 2022; Bryson, 2020). Furthermore, the quality of explanations provided by AI systems is crucial; poor or misleading explanations can lead to incorrect user decisions. Moreover, simply providing explanations is insufficient if users cannot understand or interpret them correctly, especially if the explanations are too technical or irrelevant (Schmidt et al., 2020).
Responsible journalism practices are built on the pillars of accuracy, fairness, objectivity and transparency. These principles are intertwined and form the basis of accountable journalism, which is committed to serving the public interest and fulfilling its moral duty to society (Newton et al., 2004). Accountability is crucial for building trust with audiences as end users of news information and is equally important for establishing trust between journalists as users of AI systems (Siau & Wang, 2018; Rai, 2020).
In journalism studies, trust is seen as the foundation of credibility and reliability in the dissemination of public information. It derives its legitimacy from the public’s trust, which places a responsibility on journalists to provide accurate and trustworthy information. This trust is multifaceted and arises from the complex relationships between journalists, their sources and their audiences. It involves cognitive beliefs and behavioural intentions that reflect a nuanced interplay of perceptions of competence, benevolence, integrity and predictability (Grosser et al., 2016; Koliska et al., 2023; van Dalen, 2020). Trust is essential not only for journalism, but also for fostering user acceptance and facilitating human-AI interactions (Bartneck et al., 2020; Jacovi et al., 2021; Siau & Wang, 2018).
Trust in AI and machine learning for journalism ensures that these technologies are designed and implemented in a way that promotes user and public acceptance (Toreini et al., 2020). This requires an understanding of how AI affects trust dynamics, which in turn requires ethical considerations in the development and deployment of AI to maintain and enhance public trust. In machine learning, however, the paradox of trusting the results of a machine learning system is that trustworthiness is not a prerequisite for trust: “Trust can exist in a model that is not trustworthy, and a trustworthy model does not necessarily gain trust” (Jacovi et al., 2021, p. 627).
Building trust between journalists and AI technology therefore requires a balance between AI-enhanced human tasks and AI-automated routines (Opdahl et al., 2023). It also requires trust in the data on which the system relies. Given that data quality is use- and context-dependent (Wang & Strong, 1996), from a fitness-for-use perspective, meeting the ethical principles of journalism means that data must be accurate, reliable and unbiased. AI systems also challenge these standards because data availability does not equal quality (Elouataoui et al., 2022) and because incomplete or discriminatory datasets are likely to introduce bias (Gudivada et al., 2017).
If there is a recognised need to merge AI-driven systems with journalistic values in line with professional practice (Broussard et al., 2019; Gutierrez Lopez et al., 2022), it should start with the system’s data source. From a journalistic perspective, it is important to remember that if the upstream data is biased or flawed, the system is likely to reproduce these biases and errors (Marconi & Siegman, 2017; Hansen et al., 2017). Trusting the data is also a matter of trusting its origin, recognising that the credibility of the source is a cornerstone of ethical journalism, as it also ensures the accuracy and reliability of the news (Reich, 2011). At the same time, the data used in machine learning often lacks transparency about its social and technical characteristics and potential biases (Miceli et al., 2021; Wu, 2020). This opacity extends to the machine learning training process, where it is difficult to identify the individuals behind the system, their values and the ethical decisions that underpin it. These considerations also apply to the data used to train the system, as it is created by humans with specific intentions for human use.
Building the accuracy-fairness-transparency (AFT) framework
In data science, data quality assessment is considered critical. It leads to actions aimed at improving overall quality by identifying erroneous data elements and understanding their impact on work processes (Cichy & Rass, 2019). From a user perspective, the assessment of data quality indicators refers to their adaptation to human needs or user requirements through the aggregation of different information on data quality (Devillers et al., 2002; Cappiello et al., 2004). The assessment framework we have developed for AI-driven journalism is part of this tradition of data quality assessment. It is based on the findings of data and information quality assessment in the data science literature (e.g., Batini et al., 2009; Cichy & Rass, 2019; Fox et al., 1994; Pipino et al., 2002; Shanks, 1999; Wand & Wang, 1996) and on the core principles that underpin ethical journalism practices (Ward, 2015).
This framework is based on three ethical requirements of journalism: accuracy, given the need to respect the facts; fairness, to avoid biased or unbalanced data; and transparency, which is fundamental when working with AI systems. Given the tensions surrounding the concept, we have not included objectivity in this framework. These three ethical principles are strongly linked to the broader concept of trust, as trusting the outcomes of AI systems should start with trusting the data that feeds them. To build the framework, we compared how the multidimensional concept of data quality is usually assessed with the three requirements that we defined in light of our literature review in journalism studies. We found that these three requirements correspond to the semiotic levels of data quality. They all refer to dimensions that include formal and empirical characteristics to be assessed.
The principle of accuracy uses the journalist’s expertise in the application domain, drawing on their knowledge of the subject matter. The principle of fairness looks at the context in which the data is used. This involves identifying the data producer’s data management practices, tracing the data lifecycle from creation to validation, and considering the technical and journalistic implications of data use. The principle of transparency assumes that embedding journalistic values is a key factor in ensuring expertise and trustworthiness at every stage of data collection and pre-processing. This approach requires a shared understanding or transfer of journalistic expertise.
Linking the ethical principles with the semiotic levels of data quality makes it clear that the journalistic principle of accuracy is rooted in the pursuit of factual accuracy. This principle is reflected in the dimensions of accuracy, consistency, correctness and comprehensibility, which are essential to ensure the reliability and usability of the data. It requires application domain knowledge to deal with, for example, incorrect values or duplicates. The principle of fairness is partly related to the semantic level (completeness) and to the pragmatic level, and refers to the dimensions of timeliness, accessibility, objectivity, relevance and usability. It concerns the context of data production, validation, dissemination and use in journalism. Finally, the principle of transparency relates to the social level and includes the dimensions of provenance, credibility, reliability and verifiability of the data. However, it covers only part of what transparency in journalism means, which also involves being transparent with readers about the reporting methods used, such as revealing how and why information was gathered and verified, in order to build trust and credibility with the audience.
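This mapping lends itself to a simple checklist structure. The sketch below encodes the principle-to-dimension correspondence described above as a review template; how each slot is scored, formally or empirically, is deliberately left open, and the representation itself is only an illustration.

```python
# Sketch: the AFT principle-to-dimension mapping described above,
# encoded as a review template (scoring is deliberately left open).
AFT_DIMENSIONS = {
    "accuracy": ["accuracy", "consistency", "correctness", "comprehensibility"],
    "fairness": ["completeness", "timeliness", "accessibility",
                 "objectivity", "relevance", "usability"],
    "transparency": ["provenance", "credibility", "reliability", "verifiability"],
}

def review_template() -> dict:
    """One empty judgement slot per dimension, filled formally or empirically."""
    return {p: {d: None for d in dims} for p, dims in AFT_DIMENSIONS.items()}

review = review_template()
review["transparency"]["provenance"] = "annotators documented in a datasheet"
```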
The application of the framework can be either objective or subjective (Pipino et al., 2002), with the aim of identifying data quality issues and challenges before they occur, either broadly or in detail. However, its main limitation lies in the inherent nature of information, which, as an empirical element, is likely to evolve due to its volatile nature. As standards and knowledge evolve over time, yesterday’s references may not be the same as today’s. It also requires a deeper understanding of the datasets, including consideration of the annotation process for classification. Who were the annotators? How were they instructed? Without documented data, we cannot answer these fundamental questions. Ensuring these aspects is also linked to the need for human-led annotation, either based on expertise in the topic of the stories or with general journalism training (Torabi Asr & Taboada, 2019).
Another limitation is that the framework allows the identification of data quality issues without providing methods to resolve them, a task that falls within the scope of other research areas (Triguero et al., 2019). Nevertheless, it serves as a means of identifying, preventing and correcting primary data quality issues, fostering a critical perspective on data quality and contributing to the development of data literacy in journalism. It is also in line with the Data-Centric AI (DCAI) approach, which advocates a shift from a model-centric approach to data quality and reliability, as exemplified by the ‘garbage in, garbage out’ principle (Table 2). This perspective is particularly relevant given that machine learning research has primarily focused on training algorithms, overlooking potential data quality issues and unwanted errors in the data (Jarrahi et al., 2022; Whang et al., 2023; Zha et al., 2023).
The AFT framework does not provide specific methods for improving data quality, but its strength lies in its grounding in journalistic standards. Evaluating the different dimensions of data quality from the perspective of journalistic practice reveals both alignment and implementation challenges. Dimensions such as verification, accuracy, objectivity, relevance, timeliness, usability, credibility and reliability are closely aligned with core journalistic practices. They are relatively easy to implement because they reflect journalists’ core values and daily activities, emphasising verification and critique. For example, the reliability dimension can be assessed through a rigorous and systematic examination of all sources used in the journalistic process (Steensen et al., 2022). Conversely, technical dimensions such as interoperability, standardisation and encoding are more challenging and require specific technical knowledge and tools that may be beyond the skills of many journalists. While ensuring data completeness and accessibility fits well with the comprehensive and source-based nature of journalism, practical challenges such as legal access and quantifying completeness may arise.
The AFT framework can be practically applied in automated fact-checking, which can speed up a typically time-consuming process that is prone to data quality issues. These issues can arise from user-generated content, crowdsourced labelling methods or the use of outdated or incomplete datasets, all of which can compromise the reliability and accuracy of information used in journalistic contexts (Dierickx et al., 2023). As a generic solution, this framework can also be applied to automated image classification and selection. Ensuring accuracy in identifying and tagging images requires careful data validation and contextual understanding. The framework’s principles of accuracy, fairness and transparency can guide practitioners in selecting and verifying image datasets, addressing biases and ensuring that images used in news articles or reports are relevant, credible and representative of the intended narrative.
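As an illustration of what such an application might look like in code, the sketch below runs a few AFT-inspired checks over a hypothetical fact-checking dataset; all column names ('claim', 'source', 'published', 'annotator') are assumptions, not a documented schema.

```python
# Hedged sketch: AFT-inspired audit of a hypothetical fact-checking
# dataset; every column name here is an assumption for illustration.
import pandas as pd

def audit_claims(df: pd.DataFrame, max_age_days: int = 365) -> dict:
    age = pd.Timestamp.now() - pd.to_datetime(df["published"])
    return {
        "missing_source": df["source"].isna().mean(),           # transparency: provenance
        "unknown_annotator": df["annotator"].isna().mean(),     # transparency: verifiability
        "outdated_share": (age.dt.days > max_age_days).mean(),  # fairness: timeliness
        "duplicate_claims": df["claim"].duplicated().mean(),    # accuracy: consistency
    }
```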
In addition, the framework’s versatility allows it to be applied to a wide range of contexts and data types to ensure compliance with ethical standards and increase the reliability of AI-driven journalistic output. However, given that data quality, according to the fitness-for-use principle, refers to data that adapts to its intended use, the framework may need to be adapted for some specific cases or contexts.
Discussion and conclusion
The Accuracy-Fairness-Transparency (AFT) framework presented in this paper provides valuable guidance for machine learning developers working on journalism projects, effectively bridging technical considerations with ethical requirements. It has been designed as an adaptive and flexible tool that can be used in the various forms that AI-driven journalism systems can take. Its strength lies in its consideration of journalistic requirements and ethical principles framed by accountability. From this perspective, the use of this framework can help build bridges between communities of practice whose conceptions of journalism may differ (Sirén-Heikel et al., 2023). It thus facilitates the alignment of concepts between professional cultures that share common concerns but may interpret them differently due to their different epistemologies (Dierickx & Lindén, 2023).
The AFT framework provides a practical approach to merging AI systems with the ethical principles of journalism, responding to calls for better integration of journalistic values in technology (Komatsu et al., 2020). Although it was primarily tailored to machine learning developers working on AI-driven projects in newsrooms, it recognises the interdisciplinary nature of such projects and the potential involvement of journalists, whose expertise should be valued. This approach requires active engagement from journalists, which may not be feasible for all due to a lack of time or interest in technical aspects. Yet their expertise in the reliability and credibility of data sources and their knowledge of the context should not be overlooked. This also highlights the need for more interdisciplinarity in the study of AI-driven tools in journalism, as a focus on system performance alone is insufficient from an end-user perspective. In practice, this interdisciplinarity may mean that machine learning developers focus on the technical aspects of the framework, while journalists focus on the ethical and social implications. Nevertheless, integrating accuracy, fairness, and transparency principles into machine learning systems encourages critical reflection on data sourcing and processing.
As end-users of AI systems, journalists are not always aware of or informed about the origin of the data. Therefore, this framework can be used as a practical strategy to promote data and AI literacy among journalists and serves as a valuable tool for training programmes. It recognises the need for journalists to understand the role of AI in journalism and develop a critical awareness of its implications (Deuze & Beckett, 2022). If it is recognised that data journalism requires data literacy because working with data is related to scientific methods (Bobkowski & Etheridge, 2023), the same is true for AI-driven journalism, where every process starts with data.
The AFT framework also highlights the benefits of a data-centric approach to AI in journalism, whose main advantage lies in supporting the development of machine learning applications with smaller datasets. Another strength is that it helps users better understand opaque systems through the data that feeds them, as the results of machine learning are directly influenced by the quality of the data at each stage of the process. For all these reasons, this framework can help to build trustworthy systems. Although a thorough assessment requires a significant investment of time due to the meticulous scrutiny involved, the benefits far outweigh the cost, particularly in terms of accuracy and reliability.
Given that the relationship between journalists and AI-driven systems is based on trust, the data that feeds these systems must also be trusted. However, defining ‘good’ data in journalism remains challenging due to the multi-dimensional nature of quality. The quality of a dataset is intrinsically linked to the expertise of a particular application domain and the understanding of how data is collected, validated and disseminated. Hence, assessing data quality for machine learning applications in journalism requires consideration of both formal and empirical aspects of the dataset that can be considered from an ethical perspective.
The intersection of data quality, as approached in data and computer science, with journalistic standards shows that these different fields share common concerns in many ways. For example, the FAIR principles in data management promote the discoverability, accessibility, interoperability and reusability of data (Wilkinson et al., 2016). From an ethical perspective, data accuracy and validity, as well as unbiased data, are prerequisites for the ethical use of data (Saltz & Dewar, 2019). At the EU level, the Ethical Guidelines for Trustworthy AI emphasise that responsible and accountable AI systems rely on appropriate data governance as well as transparency, diversity, non-discrimination and fairness of data (Floridi, 2019).
As already mentioned, the AFT framework potentially supports the development of machine learning applications with smaller datasets. However, it is not directly applicable to foundation models, such as large language models, due to the inherent data quality challenges posed by large amounts of user-generated content, potentially biased sources, and copyrighted material (Dwivedi et al., 2023). While licences and agreements, such as those between news publishers and OpenAI, can help mitigate these issues by providing access to verified sources, they do not fully address all data quality concerns. In addition, considerations of timeliness and completeness remain critical to ensuring the reliability of the data used. The complexity of these issues highlights the need for ongoing research and the development of more sophisticated methods for addressing data quality in the most complex AI applications.
References
Adnan, K., & Akbar, R. (2019). An analytical study of information extraction from unstructured and multidimensional big data. Journal of Big Data, 6(1), 1–38. https://doi.org/10.1186/s40537-019-0254-8
Ali, W., & Hassoun, M. (2019). Artificial intelligence and automated journalism: Contemporary challenges and new opportunities. International Journal of Media, Journalism and Mass Communications, 5(1), 40–49. https://doi.org/10.20431/2454-9479.0501004
Anderson, C. W. (2018). Apostles of certainty: Data journalism and the politics of doubt. Oxford University Press.
Antunes, N., Balby, L., Figueiredo, F., Lourenco, N., Meira, W., & Santos, W. (2018, June). Fairness and transparency of machine learning for trustworthy cloud services. In 2018 48th annual IEEE/IFIP international conference on dependable systems and networks workshops (DSN-W) (pp. 188–193). IEEE.
Bardoel, J., & d’Haenens, L. (2004). Media meet the citizen: Beyond market mechanisms and government regulations. European Journal of Communication, 19(2), 165–194. https://doi.org/10.1177/0267323104042909
Bartneck, C., Lütge, C., Wagner, A., & Welsh, S. (2020). Trust and fairness in AI systems. In An introduction to ethics in robotics and AI (pp. 27–38). Springer. https://doi.org/10.1007/978-3-030-51110-4_4
Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys, 41(3), 1–52. https://doi.org/10.1145/1541880.1541883
Batini, C., Rula, A., Scannapieco, M., & Viscusi, G. (2015). From data quality to big data quality. Journal of Database Management, 26(1), 60–82. https://doi.org/10.4018/jdm.2015010103
Becker, D., King, T. D., & McMullen, B. (2015, October). Big data, big data quality problem. In 2015 IEEE international conference on big data (Big Data) (pp. 2644–2653). IEEE.
Bobkowski, P. S., & Etheridge, C. E. (2023). Spreadsheets, software, storytelling, visualization, lifelong learning: Essential data skills for journalism and strategic communication students. Science Communication, 45(1), 95–116. https://doi.org/10.1177/10755470221147887
Boddington, P. (2017). Towards a code of ethics for artificial intelligence. Springer.
Boydens, I., & van Hooland, S. (2011). Hermeneutics applied to the quality of empirical databases. The Journal of Documentation, 67(2), 279–289. https://doi.org/10.1108/00220411111109476
Broussard, M., Diakopoulos, N., Guzman, A. L., Abebe, R., Dupagne, M., & Chuan, C.-H. (2019). Artificial intelligence and journalism. Journalism & Mass Communication Quarterly, 96(3), 673–695. https://doi.org/10.1177/1077699019859901
Bryson, J. J. (2020). The Artificial Intelligence of the ethics of Artificial Intelligence. In M. D. Dubber, F. Pasquale, & S. Das (Eds.), The Oxford handbook of ethics of AI (pp. 1–25). Oxford University Press. https://doi.org/10.1093/oxfordhb/9780190067397.013.1
Burkart, N., & Huber, M. F. (2021). A survey on the explainability of supervised machine learning. Journal of Artificial Intelligence Research, 70, 245–317. https://doi.org/10.1613/jair.1.12228
Burrell, J. (2016). How the machine ‘thinks’: Understanding opacity in machine learning algorithms. Big Data & Society. https://doi.org/10.2139/ssrn.2660674
Cai, L., & Zhu, Y. (2015). The challenges of data quality and data quality assessment in the big data era. Data Science Journal, 14, 2. https://doi.org/10.5334/dsj-2015-002
Cappiello, C., Francalanci, C., & Pernici, B.(2004). Data quality assessment from the user’s perspective. In Proceedings of the 2004 international workshop on Information quality in information systems (pp. 68–73).
Caton, S., & Haas, C. (2020). Fairness in machine learning: A survey. ACM Computing Surveys.
Cavaliere, P. (2020). From journalistic ethics to fact-checking practices: Defining the standards of content governance in the fight against disinformation. Journal of Media Law, 12(2), 133–165. https://doi.org/10.1080/17577632.2020.1869486
Chmielewski, M., & Kucker, S. C. (2020). An MTurk crisis? Shifts in data quality and the impact on study results. Social Psychological and Personality Science, 11(4), 464–473. https://doi.org/10.1177/1948550619875149
Cichy, C., & Rass, S. (2019). An overview of data quality frameworks. IEEE Access: Practical Innovations, Open Solutions, 7, 24634–24648. https://doi.org/10.1109/access.2019.2899751
Clerwall, C. (2014). Enter the Robot Journalist: Users’ perceptions of automated content. Journalism Practice, 8(5), 519–531. https://doi.org/10.1080/17512786.2014.883116
Coddington, M. (2015). Clarifying Journalism’s Quantitative Turn: A typology for evaluating data journalism, computational journalism, and computer-assisted reporting. Digital Journalism, 3(3), 331–348. https://doi.org/10.1080/21670811.2014.976400
Craft, S., & Vos, T. P. (2021). The ethics of transparency. In L. Trifonova Price, K. Sanders, & W. N. Wyatt (Eds.), The Routledge companion to Journalism Ethics (pp. 175–183). Routledge.
Deuze, M. (2005). What is journalism? Professional identity and ideology of journalists reconsidered. Journalism, 6, 443–465. https://doi.org/10.1177/1464884905056815
Deuze, M., & Beckett, C. (2022). Imagination, algorithms and news: Developing AI literacy for journalism. Digital Journalism, 10(10), 1913–1918.
De Veaux, R. D., & Hand, D. J. (2005). How to lie with bad data. Statistical Science. https://doi.org/10.1214/088342305000000269
Devillers, R., Gervais, M., & Bédard, Y. (2002). Spatial data quality: From metadata to quality indicators and contextual end-user manual. In OEEPE/ISPRS joint workshop on spatial data quality management (pp. 21–22).
Diakopoulos, N. (2019). Automating the news: How algorithms are rewriting the media. Harvard University Press.
Dierickx, L., & Lindén, C. G. (2023). Fine-tuning languages: Epistemological foundations for ethical AI in journalism. In 2023 10th IEEE Swiss conference on data science (SDS) (pp. 42–49). IEEE.
Dierickx, L., Lindén, C., & Opdahl, A. (2023). Automated fact-checking to support professional practices: Systematic literature review and meta-analysis. International Journal of Communication, 17, 5170–5190.
Donsbach, W., & Klett, B. (1993). Subjective objectivity: How journalists in four countries define a key term of their profession. Gazette, 51(1), 53–83. https://doi.org/10.1177/001654929305100104
Dörr, K. N., & Hollnbuchner, K. (2017). Ethical challenges of algorithmic journalism. Digital Journalism, 5(4), 404–419. https://doi.org/10.1080/21670811.2016.1167612
Dourish, P. (2016). Algorithms and their others: Algorithmic culture in context. Big Data & Society, 3(2), 2053951716665128. https://doi.org/10.1177/2053951716665128
Dwivedi, Y. K., Kshetri, N., Hughes, L., Slade, E. L., Jeyaraj, A., Kar, A. K., & Wright, R. (2023). “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. International Journal of Information Management, 71, 102642.
Eberendu, A. C. (2016). Unstructured Data: An overview of the data of Big Data. International Journal of Computer Trends and Technology, 38(1), 46–50.
Eckerson, W. W. (2002). Data quality and the bottom line: Achieving business success through a commitment to high quality data. The Data Warehousing Institute.
Ehrlinger, L., Haunschmid, V., Palazzini, D., & Lettner, C. (2019). A DaQL to monitor data quality in machine learning applications. In Lecture Notes in Computer Science (pp. 227–237). Springer.
Ekström, M., & Westlund, O. (2019). The dislocation of news journalism: A conceptual framework for the study of epistemologies of digital journalism. Media and Communication, 7(1), 259–270. https://doi.org/10.17645/mac.v7i1.1763
Elouataoui, W., Alaoui, I. E., & Gahi, Y. (2022). Data quality in the era of big data: A global review. In Big data intelligence for smart applications (pp. 1–25). Springer.
Ettema, J. S., Glasser, T. L., & Glasser, T. (1998). Custodians of conscience: Investigative journalism and public virtue. Columbia University Press.
Ferrario, A., & Loi, M. (2022, June). How explainability contributes to trust in AI. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency (pp. 1457–1466).
Figdor, C. (2010). Objectivity in the news: Finding a way forward. Journal of Mass Media Ethics, 25(1), 19–33.
Floridi, L. (2019). Establishing the rules for building trustworthy AI. Nature Machine Intelligence, 1(6), 261–262. https://doi.org/10.2139/ssrn.3858392
Foidl, H., & Felderer, M. (2019, August). Risk-based data validation in machine learning-based software systems. In Proceedings of the 3rd ACM SIGSOFT international workshop on machine learning techniques for software quality evaluation (pp. 13–18).
Fox, C., Levitin, A., & Redman, T. (1994). The notion of data and its quality dimensions. Information Processing & Management, 30(1), 9–19. https://doi.org/10.1016/0306-4573(94)90020-5
Frost, C. (2015). Journalism ethics and regulation (4th ed.). Routledge. https://doi.org/10.4324/9781315757810
García-Avilés, J. A. (2021). An inquiry into the ethics of innovation in digital journalism. In M. Luengo & S. Herrera-Damas (Eds.), News media innovation reconsidered: Ethics and values in a creative reconstruction of journalism (pp. 1–19). Wiley.
Gudivada, V., Apon, A., & Ding, J. (2017). Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. International Journal on Advances in Software, 10(1), 1–20.
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi, D. (2019). A survey of methods for explaining black box models. ACM Computing Surveys, 51(5), 1–42. https://doi.org/10.1145/3236009
Gupta, N., Mujumdar, S., Patel, H., Masuda, S., Panwar, N., Bandyopadhyay, S., Mehta, S., Guttula, S., Afzal, S., Sharma Mittal, R., & Munigala, V. (2021). Data quality for machine learning tasks. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining.
Gutierrez Lopez, M., Porlezza, C., Cooper, G., Makri, S., MacFarlane, A., & Missaoui, S. (2022). A question of design: Strategies for embedding AI-driven tools into journalistic work routines. Digital Journalism. https://doi.org/10.1080/21670811.2022.2043759
Graziani, M., Dutkiewicz, L., Calvaresi, D., Amorim, J. P., Yordanova, K., Vered, M., Nair, R., Abreu, P. H., Blanke, T., Pulignano, V., et al. (2022). A global taxonomy of interpretable AI: Unifying the terminology for the technical and social sciences. Artificial Intelligence Review, 56(4), 3473–3504. https://doi.org/10.1007/s10462-022-10256-8
Grosser, K. M., Hase, V., & Blöbaum, B. (2016). Trust in online journalism. Trust and Communication in a Digitized World: Models and Concepts of Trust Research, 53–73.
Hair, J. F., Jr., & Sarstedt, M. (2021). Data, measurement, and causal inferences in machine learning: Opportunities and challenges for marketing. The Journal of Marketing Theory and Practice, 29(1), 65–77. https://doi.org/10.1080/10696679.2020.1860683
Hanitzsch, T. (2007). Deconstructing journalism culture: Toward a universal theory. Communication Theory, 17(4), 367–385. https://doi.org/10.1111/j.1468-2885.2007.00303.x
Hansen, M., Roca-Sales, M., Keegan, J. M., & King, G. (2017). Artificial intelligence: Practice and implications for journalism. Tow Center for Digital Journalism, Columbia University.
Helberger, N., & Diakopoulos, N. (2022). The European AI act and how it matters for research into AI in media and journalism. Digital Journalism. https://doi.org/10.1080/21670811.2022.2082505
Huh, Y. U., Keller, F. R., Redman, T. C., & Watkins, A. R. (1990). Data quality. Information and Software Technology, 32(8), 559–565. https://doi.org/10.1016/0950-5849(90)90146-i
Jacovi, A., Marasović, A., Miller, T., & Goldberg, Y. (2021). Formalizing trust in artificial intelligence. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. https://doi.org/10.1145/3442188.3445923
Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D., & Makedon, F. (2020). A survey on contrastive self-supervised learning. Technologies, 9(1), 2. https://doi.org/10.3390/technologies9010002
Jarrahi, M. H., Memariani, A., & Guha, S. (2022). The principles of data-centric AI (DCAI). arXiv preprint arXiv:2211.14611.
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255–260. https://doi.org/10.1126/science.aaa8415
Karlsen, J., & Stavelin, E. (2014). Computational journalism in Norwegian newsrooms. Journalism Practice, 8(1), 34–48. https://doi.org/10.1080/17512786.2013.813190
Karlsson, M. (2020). Dispersing the opacity of transparency in journalism on the appeal of different forms of transparency to the general public. Journalism Studies, 21(13), 1795–1814. https://doi.org/10.1080/1461670x.2020.1790028
Kläs, M., & Vollmer, A. M. (2018). Uncertainty in machine learning applications: A practice-driven classification of uncertainty. In Developments in language theory (pp. 431–438). Springer.
Koliska, M. (2022). Trust and journalistic transparency online. Journalism Studies, 23(12), 1488–1509. https://doi.org/10.1080/1461670x.2022.2102532
Koliska, M., Moroney, E., & Beavers, D. (2023). Trust through relationships in journalism. Journalism Studies. https://doi.org/10.1080/1461670X.2023.2209807
Komatsu, T., Gutierrez Lopez, M., Makri, S., Porlezza, C., Cooper, G., MacFarlane, A., & Missaoui, S. (2020, October). AI should embody our values: Investigating journalistic values to inform AI technology design. In Proceedings of the 11th nordic conference on human-computer interaction: Shaping experiences, shaping society (pp. 1–13).
Lantz, B. (2014). Machine learning with R. Shroff Publishers & Distributors.
Lease, M. (2011). On quality control and machine learning in crowdsourcing. In Proceedings of the 3rd human computation workshop (HCOMP) at AAAI.
Le Quy, T., Roy, A., Iosifidis, V., Zhang, W., & Ntoutsi, E. (2022). A survey on datasets for fairness-aware machine learning. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(3), e1452.
Lindén, C. G. (2020). What makes a reporter human? Questions de communication, 37(1), 337–351.
Lindland, O. I., Sindre, G., & Solvberg, A. (1994). Understanding quality in conceptual modeling. IEEE Software, 11(2), 42–49. https://doi.org/10.1109/52.268955
Liu, J., Li, J., Li, W., & Wu, J. (2016). Rethinking big data: A review on the data quality and usage issues. ISPRS Journal of Photogrammetry and Remote Sensing, 115, 134–142. https://doi.org/10.1016/j.isprsjprs.2015.11.006
Lowrey, W., Broussard, R., & Sherrill, L. A. (2019). Data journalism and black-boxed data sets. Newspaper Research Journal, 40(1), 69–82. https://doi.org/10.1177/0739532918814451
Madnick, S., & Zhu, H. (2006). Improving data quality through effective use of data semantics. Data & Knowledge Engineering, 59(2), 460–475. https://doi.org/10.1016/j.datak.2005.10.001
Marconi, F., & Siegman, A. (2017). The future of augmented journalism: A guide for newsrooms in the age of smart machines. Associated Press. https://www.ap.org/assets/files/2017_ai_guide.pdf
Martens, D. (2022). Data science ethics: Concepts, techniques, and cautionary tales. Oxford University Press.
McCallum, Q. E. (2012). Bad data handbook. O’Reilly Media.
McCausland, T. (2021). The bad data problem. Research Technology Management, 64(1), 68–71. https://doi.org/10.1080/08956308.2021.1844540
Miceli, M., Posada, J., & Yang, T. (2021). Studying up machine learning data: Why talk about bias when we mean power? Proceedings of the ACM on Human-Computer Interaction, 6, 1–14.
Montal, T., & Reich, Z. (2017). I, robot. You, journalist. Who is the author? Authorship, bylines and full disclosure in automated journalism. Digital Journalism, 5(7), 829–849. https://doi.org/10.1080/21670811.2016.1209083
Moody, D. L., & Shanks, G. G. (2003). Improving the quality of data models: Empirical validation of a quality management framework. Information Systems, 28(6), 619–650. https://doi.org/10.1016/s0306-4379(02)00043-1
Mor, N., & Reich, Z. (2018). From “Trust Me” to “Show Me” Journalism: Can DocumentCloud help to restore the deteriorating credibility of news? Journalism Practice, 12(9), 1091–1108. https://doi.org/10.1080/17512786.2017.1376593
Mougan, C., Kanellos, G., Micheler, J., Martinez, J., & Gottron, T. (2022). Introducing explainable supervised machine learning into interactive feedback loops for statistical production system. arXiv preprint arXiv:2202.03212.
Muñoz-Torres, J. R. (2012). Truth and objectivity in journalism: Anatomy of an endless misunderstanding. Journalism Studies, 13(4), 566–582. https://doi.org/10.1080/1461670x.2012.662401
Newton, L., Hodges, L., & Keith, S. (2004). Accountability in the professions: Accountability in journalism. Journal of Mass Media Ethics, 19(3), 166–190.
Opdahl, A. L., Tessem, B., Dang-Nguyen, D.-T., Motta, E., Setty, V., Throndsen, E., Tverberg, A., & Trattner, C. (2023). Trustworthy journalism through AI. Data & Knowledge Engineering, 146, 102182. https://doi.org/10.1016/j.datak.2023.102182
Pessach, D., & Shmueli, E. (2022). A review on fairness in machine learning. ACM Computing Surveys (CSUR), 55(3), 1–44.
Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211–218. https://doi.org/10.1145/505248.506010
Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). Data lifecycle challenges in production machine learning: A survey. SIGMOD Record, 47(2), 17–28. https://doi.org/10.1145/3299887.3299891
Porlezza, C. (2019). Accuracy in journalism. In Oxford research encyclopedia of communication. Oxford University Press.
Porlezza, C., & Eberwein, T. (2022). Uncharted territory: Datafication as a challenge for journalism ethics. In Media and change management (pp. 343–361). Springer.
Quinn, A. (2007). Moral virtues for journalists. Journal of Mass Media Ethics, 22(2–3), 168–186. https://doi.org/10.1080/08900520701315764
Rai, A. (2020). Explainable AI: From black box to glass box. Journal of the Academy of Marketing Science, 48(1), 137–141. https://doi.org/10.1007/s11747-019-00710-5
Reich, Z. (2011). Source credibility and journalism: Between visceral and discretional judgment. Journalism Practice, 5(1), 51–67.
Ridzuan, F., Wan Zainon, W. M. N., & Zairul, M. (2022). A thematic review on data quality challenges and dimension in the era of big data. In Lecture Notes in Electrical Engineering (pp. 725–737). Springer.
Ryan, M. (2001). Journalistic ethics, objectivity, existential journalism, standpoint epistemology, and public journalism. Journal of Mass Media Ethics, 16(1), 3–22. https://doi.org/10.1207/s15327728jmme1601_2
Saha, B., & Srivastava, D. (2014, March). Data quality: The other face of big data. In 2014 IEEE 30th international conference on data engineering (pp. 1294–1297). IEEE.
Saltz, J. S., & Dewar, N. (2019). Data science ethical considerations: A systematic literature review and proposed project framework. Ethics and Information Technology, 21, 197–208. https://doi.org/10.1007/s10676-019-09502-5
Sanders, K. (2003). Ethics & journalism. SAGE Publications.
Schmidt, P., Biessmann, F., & Teubner, T. (2020). Transparency and trust in artificial intelligence systems. Journal of Decision Systems, 29(4), 260–278. https://doi.org/10.1080/12460125.2020.18190
Schudson, M., & Anderson, C. (2009). Objectivity, professionalism, and truth seeking in journalism. In K. Wahl-Jorgensen & T. Hanitzsch (Eds.), The Handbook of Journalism Studies (pp. 108–121). Routledge. https://doi.org/10.4324/9780203877685-15
Selbst, A. D., Boyd, D., Friedler, S. A., Venkatasubramanian, S., & Vertesi, J. (2019, January). Fairness and abstraction in sociotechnical systems. In Proceedings of the conference on fairness, accountability, and transparency (pp. 59–68).
Shanks, G. (1999). Semiotic approach to understanding representation in information systems. In Proceedings of the information systems foundations workshop: ontology, semiotics and practice.
Shapiro, I., Brin, C., Bédard-Brûlé, I., & Mychajlowycz, K. (2013). Verification as a strategic ritual: How journalists retrospectively describe processes for ensuring accuracy. Journalism Practice, 7(6), 657–673. https://doi.org/10.1080/17512786.2013.765638
Shin, D., Hameleers, M., Park, Y. J., Kim, J. N., Trielli, D., & Diakopoulos, N. (2022). Countering algorithmic bias and disinformation and effectively harnessing the power of AI in media. Journalism & Mass Communication Quarterly, 99(4), 887–907. https://doi.org/10.1177/10776990221129245
Siau, K., & Wang, W. (2018). Building trust in artificial intelligence, machine learning, and robotics. Cutter Business Technology Journal, 31(2), 47–53.
Sirén-Heikel, S., Kjellman, M., & Lindén, C. G. (2023). At the crossroads of logics: Automating newswork with artificial intelligence—(Re) defining journalistic logics from the perspective of technologists. Journal of the Association for Information Science and Technology, 74(3), 354–366. https://doi.org/10.1002/asi.24656
Steensen, S., Belair-Gagnon, V., Graves, L., Kalsnes, B., & Westlund, O. (2022). Journalism and source criticism. Revised approaches to assessing truth-claims. Journalism Studies, 23(16), 2119–2137.
Stray, J. (2016). The curious journalist’s guide to data. Columbia Journalism Review. Retrieved February 1, 2023, from https://www.cjr.org/tow_center_reports/the_curious_journalists_guide_to_data.php
Strong, D. M., Lee, Y. W., & Wang, R. Y. (1997). Data quality in context. Communications of the ACM, 40(5), 103–110. https://doi.org/10.1145/253769.253804
Sundar, S. S. (1998). Effect of source attribution on perception of online news stories. Journalism & Mass Communication Quarterly, 75(1), 55–68. https://doi.org/10.1177/107769909807500108
Taleb, I., Serhani, M. A., & Dssouli, R. (2018, July). Big data quality: A survey. In 2018 IEEE international congress on big data (BigData Congress) (pp. 166–173). IEEE.
Tayi, G. K., & Ballou, D. P. (1998). Examining data quality. Communications of the ACM, 41(2), 54–57. https://doi.org/10.1145/269012.269021
Thurman, N., Lewis, S. C., & Kunert, J. (2019). Algorithms, automation, and news. Digital Journalism, 7(8), 980–992. https://doi.org/10.1080/21670811.2019.1685395
Tong, J., & Zuo, L. (2021). The inapplicability of objectivity: Understanding the work of data journalism. Journalism Practice, 15(2), 153–169. https://doi.org/10.1080/17512786.2019.1698974
Torabi Asr, F., & Taboada, M. (2019). Big Data and quality data for fake news and misinformation detection. Big Data & Society, 6(1), 205395171984331. https://doi.org/10.1177/2053951719843310
Toreini, E., Aitken, M., Coopamootoo, K., Elliott, K., Zelaya, C. G., & Van Moorsel, A. (2020, January). The relationship between trust in AI and trustworthy machine learning technologies. In Proceedings of the 2020 conference on fairness, accountability, and transparency (pp. 272–283).
Triguero, I., García-Gil, D., Maillo, J., Luengo, J., García, S., & Herrera, F. (2019). Transforming big data into smart data: An insight on the use of the k‐nearest neighbors algorithm to obtain quality data. Wiley Interdisciplinary Reviews. Data Mining and Knowledge Discovery, 9(2), e1289. https://doi.org/10.1002/widm.1289
van Dalen, A. (2020). Journalism, trust, and credibility. In K. Wahl-Jorgensen & T. Hanitzsch (Eds.), The Handbook of Journalism Studies (2nd ed., pp. 356–371). Routledge.
Wand, Y., & Wang, R. Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11), 86–95. https://doi.org/10.1145/240455.240479
Wang, R. Y., Reddy, M. P., & Kon, H. B. (1995). Toward quality data: An attribute-based approach. Decision Support Systems, 13(3–4), 349–372. https://doi.org/10.1016/0167-9236(93)e0050-n
Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems : JMIS, 12(4), 5–33. https://doi.org/10.1080/07421222.1996.11518099
Ward, S. J. A. (2015). The invention of journalism ethics: The path to objectivity and beyond. McGill-Queen's University Press.
Ward, S. J. A. (2018). Reconstructing journalism ethics: Disrupt, invent, collaborate. Media & Jornalismo, 18(32), 9–17. https://doi.org/10.14195/2183-5462_32_1
Wien, C. (2005). Defining objectivity within journalism: An overview. The NORDICOM Review of Nordic Research on Media & Communication, 26(2), 3–15.
Whang, S. E., Roh, Y., Song, H., & Lee, J. G. (2023). Data collection and quality challenges in deep learning: A data-centric AI perspective. The VLDB Journal, 32(4), 791–813. https://doi.org/10.1007/s00778-022-00775-9
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., & Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 1–9.
Wu, Y. (2020). Is automated journalistic writing less biased? An experimental test of auto-written and human-written news stories. Journalism Practice, 14(8), 1008–1028. https://doi.org/10.1080/17512786.2019.1682940
Zamith, R. (2019). Transparency, interactivity, diversity, and information provenance in everyday data journalism. Digital Journalism, 7(4), 470–489. https://doi.org/10.1080/21670811.2018.1554409
Zha, D., Bhat, Z. P., Lai, K. H., Yang, F., & Hu, X. (2023). Data-centric AI: Perspectives and challenges. In Proceedings of the 2023 SIAM international conference on data mining (SDM) (pp. 945–948). Society for Industrial and Applied Mathematics.
Acknowledgements
This research was funded by EU CEF Grant No. 2394203.
Funding
Open access funding provided by University of Bergen (incl Haukeland University Hospital).
Ethics declarations
Conflict of interest
The authors have no conflict of interest to declare.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Dierickx, L., Opdahl, A.L., Khan, S.A. et al. A data-centric approach for ethical and trustworthy AI in journalism. Ethics Inf Technol 26, 64 (2024). https://doi.org/10.1007/s10676-024-09801-6