Introduction

AI-driven journalism leverages computational techniques and tools to gather, verify, produce, and distribute news (Thurman et al., 2019). These tools enhance professional practice by speeding up labour-intensive tasks, publishing automated content, identifying trends, and providing insights into large numerical or textual datasets. They are therefore likely to extend human capabilities, a development that can be seen as a new form of augmented journalism (Lindén, 2020). This echoes McLuhan’s theory of media as extensions of the human, in the sense that AI-driven technology enables journalists to overcome traditional limitations, improving their ability to gather, analyse, and disseminate information more efficiently and accurately.

However, recent AI-driven systems rely heavily on machine learning techniques, which are perceived as opaque and potentially biased, making them difficult to use in journalism (Guidotti et al., 2019). A defining feature of these systems is their reliance on high-quality data: without it, they produce inaccurate analyses and unreliable decisions (Gupta et al., 2021). Explaining how data is collected, organised, cleaned, annotated and processed helps open up the black-box nature of these machine learning systems and thus build trust between the journalist and the system and, by extension, between the journalist and the audience. This means understanding the data quality challenges that arise at different stages of the machine learning process (Gudivada et al., 2017), following the ‘garbage in, garbage out’ principle widely accepted in data and computer science. This is also true in journalism, where quality information requires quality data to ensure the accuracy and reliability of the news (e.g., Anderson, 2018; Dörr & Hollnbuchner, 2017; Lowrey et al., 2019).

Although the quality of digital information is fundamental to journalism (Stray, 2016), the concept of information quality is multifaceted and difficult to define. Ethical journalism is fundamentally linked to quality journalism, and both are deeply rooted in the philosophical concept of truth (García-Avilés, 2021; Porlezza, 2019). While ethical journalism refers to adherence to ethical standards and principles that promote trustworthiness and ensure that journalists report the news with integrity and responsibility, thereby fostering public trust in the media, quality journalism is much harder to define. It has been described as referring to the degree or level of excellence of information (Sundar, 1998), or to the general but vague notion of having read a ‘good story’, which depends on the socio-cultural background of the recipient and can therefore vary from reader to reader (Clerwall, 2014). It has also been defined through several dimensions of quality—accuracy, comprehensibility, timeliness, reliability and validity—to which journalists commit when producing trustworthy content (Diakopoulos, 2019). This last proposition converges with the multidimensional concept of data quality as it is approached in data and computer science, where poor data quality has well-documented consequences: it can lead to wrong decisions, loss of credibility or financial losses due to incorrect data analysis (De Veaux & Hand, 2005; McCallum, 2012; McCausland, 2021). In journalism, poor data quality is likely to erode public trust, affect media practices and undermine the watchdog function of the media.

Data quality challenges have been discussed in works on data and computational journalism (Anderson, 2018; Karlsen & Stavelin, 2014), but critical aspects of data quality remain under-documented (Lowrey et al., 2019). This paper aims to bridge the gap between machine learning development and journalism by focusing on three key areas: assessing the quality of machine learning datasets, guiding the development of machine learning solutions through a data-centric approach, and promoting AI and data literacy among journalists. It also examines the development of machine learning systems in newsrooms from the perspective of trustworthy AI, highlighting the need to reflect ethical journalism standards and their corollary, high data quality. By grounding machine learning applications in journalistic principles, we also aim to foster interdisciplinary collaborations.

Insofar as data quality is context- and use-dependent (Tayi & Ballou, 1998), this paper first examines the multidimensional concept of data quality and how it challenges machine learning systems in computer and data science. This section highlights the strong ethical dimensions of data quality, recognising that poor data quality can lead to biased or inaccurate reporting. The paper then explores the ethical challenges of AI-driven journalism, identifying specific journalistic requirements for building a flexible data quality assessment framework.

The proposed Accuracy-Fairness-Transparency (AFT) framework addresses key ethical concerns common to data science, machine learning and journalism (e.g. Antunes et al., 2018; Martens, 2022; Opdahl et al., 2023; Ward, 2015, 2018). It can be used to assess the quality of existing datasets for reusability or to support the development of training datasets throughout the data preparation pipeline (data collection, cleaning, augmentation and labelling). It can therefore be seen as a valuable tool to promote data and AI literacy among journalists. From a technical perspective, this framework contributes to the shift from model-centric AI to data-centric AI, where the focus is on ensuring the quality of the underlying data, enabling AI systems to perform well on smaller datasets, as opposed to the model-centric approach that requires large amounts of data (Jarrahi et al., 2022).

Data quality in machine learning

A common way of approaching data quality refers to data that adapts to the uses of data consumers, particularly in terms of accuracy, relevance and understandability (Wang & Strong, 1996). This means that data quality cannot be reduced to an examination of the syntax, format and structure of the data (Eckerson, 2002), nor can it be assessed in absolute, context-free terms. First, there is no absolute reference against which to assess the correctness of information stored in a database, which is always context- and domain-dependent (Boydens & van Hooland, 2011). Second, the quality of a process depends on the quality of the data (Batini et al., 2009; Moody & Shanks, 2003). Third, when derived from empirical observations, data represent a specific moment, which means that values are likely to evolve, just as the concepts and standards to which they refer are likely to evolve (Boydens & van Hooland, 2011; Taleb et al., 2018).

The multidimensionality of the concept of data quality is reflected in several complementary dimensions, which refer to a set of attributes that need to be assessed either formally or empirically (Table 1). The definitions of the accuracy, completeness and consistency dimensions have been refined over time, which means that the assessment of data quality also relies on empirical considerations and thus on human judgement. In addition, the assessment of data quality has also been approached through four complementary semiotic levels reflecting the technical and social aspects of datasets (Shanks, 1999): syntactic, relating to the structure of the data; semantic, relating to its meaning; pragmatic, relating to its usefulness; and social, relating to the knowledge shared through it.

Table 1 Evolution of the definitions of data quality dimensions

Scholars have recognised that the causes of poor quality are numerous and can relate to a variety of settings: the modelling of the information system (Lindland et al., 1994; Moody & Shanks, 2003), compliance with integrity constraints (Fox et al., 1994), a lack of routine validation (Eckerson, 2002), or heterogeneous data that can lead to interpretation problems (Madnick & Zhu, 2006). Other considerations add layers to this complexity: poor quality data may coexist with correct data without producing errors (Wang et al., 1995), and data may not contain errors but may not have the meaning expected by the user (Madnick & Zhu, 2006), demonstrating once again the impossibility of approaching data quality through formal examination alone of the “entity–attribute–value” triplet that characterises relational databases (e.g. the entity “car” has an attribute “colour” which has a value “blue”).
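The point can be made concrete with a short sketch: a record can pass every formal check on the triplet and still misrepresent reality. The code below is purely illustrative; the constraint and values are invented for the example.

```python
# A minimal sketch: a record can satisfy every formal (syntactic) check
# on the entity-attribute-value triplet and still be wrong about the world.
ALLOWED_COLOURS = {"blue", "red", "green"}  # hypothetical integrity constraint

record = {"entity": "car", "attribute": "colour", "value": "blue"}

def is_syntactically_valid(rec: dict) -> bool:
    """Formal check only: structure of the triplet and domain of the value."""
    return (
        set(rec) == {"entity", "attribute", "value"}
        and rec["value"] in ALLOWED_COLOURS
    )

print(is_syntactically_valid(record))  # True
# Yet if the actual car is red, the record is semantically inaccurate:
# no formal check on the triplet alone can detect this.
```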

The emergence of big data has led to a reconsideration of data quality issues in relation to user-generated data collected online or through sensors (Batini et al., 2015), the extreme size of which exceeds the capacity of traditional database systems (Becker et al., 2015). As big data moves from relational databases to NoSQL databases—systems that store and retrieve data in flexible, scalable and schema-less models—it challenges quality dimensions such as trustworthiness, verifiability and data reputation (Batini et al., 2015). At its core, the problem is that trust in big data operates at three levels: trust in the sources, in the processes that bring them together, and in the environments in which the data is applied (e.g. Saha & Srivastava, 2014; Cai & Zhu, 2015; Liu et al., 2016). Big data issues are also related to incomplete, inaccurate, inconsistent or ambiguous structured and unstructured data (Eberendu, 2016).

The approach to data quality in machine learning encompasses all of these considerations. However, it requires a nuanced understanding of the specific processes involved, given the complexity of real-world data. Machine learning relies on computational models trained on real-world data to mimic human intelligence by transforming inputs into outcomes based on mathematical relationships that are difficult to derive through deductive reasoning or simple statistical analysis (Kläs & Vollmer, 2018). It is being used for a wide range of tasks based on various techniques and trained on different types of data. In journalism, machine learning models can be used for breaking news detection, disinformation detection, automated news gathering, content verification, data analysis and visualisation, predictive analysis of future trends, content generation or personalised news recommendations.

Machine learning tasks can be supervised or unsupervised. Supervised learning, used primarily for classification and prediction, requires annotated data—data that has been labelled or categorised by humans or automated processes—and these labels serve as a guide for the model to learn from. Unsupervised learning relies on patterns inherent in unlabelled data for segmentation or information discovery. Semi-supervised learning combines labelled and unlabelled data, e.g. for ranking or automatic content generation. Self-supervised learning, for its part, can automatically learn useful representations from the training data without the need for extensive human annotation (Jordan & Mitchell, 2015; Jaiswal et al., 2020).
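To make the distinction concrete, the following minimal sketch (assuming the scikit-learn library) trains a supervised classifier on labelled data and an unsupervised clustering algorithm on the same features without labels; the dataset and models are illustrative choices, not drawn from the works cited above.

```python
# Supervised vs. unsupervised learning on the same toy data:
# the classifier needs labels, the clustering algorithm does not.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y guide the model during training.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised prediction:", clf.predict(X[:1]))

# Unsupervised: only the patterns inherent in X are used.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("unsupervised clusters:", km.labels_[:5])
```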

The main differences between supervised and unsupervised learning are not only related to the presence of labelled data. They are also related to the complexity of the task and the accuracy and efficiency of the results (Adnan & Akbar, 2019). In all cases, the quality of the results is influenced by the data provided as input to the system (Gudivada et al., 2017; Gupta et al., 2021). Research also highlights that models trained on incomplete or biased datasets can produce discriminatory results and affect the accuracy of the tasks (Miceli et al., 2021; Shin et al., 2022).

In machine learning, the preprocessing of data depends as much on the specifics of the data, including the types of variables used, as on the algorithm employed (Kläs & Vollmer, 2018; Gupta et al., 2021; Ridzuan et al., 2022). The choice of a machine learning algorithm therefore depends on the task to be performed, taking into account its strengths and weaknesses with regard to how the data influence its behaviour. For example, the K-Nearest Neighbours (KNN) algorithm used for classification is considered fast and efficient but does not handle missing data well. The Naive Bayes algorithm, also considered fast and efficient, is not ideal for large datasets with many numerical features. With decision tree algorithms, a small change in the training data can lead to significant changes in the results (Lantz, 2014). In all cases, enough data must be available to train a model of sufficient quality (in terms of measures such as accuracy, precision and recall and, in a wider sense, fitness for use).
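The sensitivity of KNN to missing data can be illustrated as follows; this is a sketch assuming scikit-learn and NumPy, with toy values chosen purely for illustration.

```python
# KNN cannot cope with missing values, so the data must be imputed first.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, np.nan], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

# KNeighborsClassifier().fit(X, y) would raise an error on the NaNs above.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # fill with column means
knn = KNeighborsClassifier(n_neighbors=3).fit(X_imputed, y)
print(knn.predict(X_imputed[:1]))
```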

Poor data quality can lead to inaccurate models, biased results and reduced performance. For example, using datasets that do not accurately represent the characteristics of a particular population can lead to biased and inaccurate results. Similarly, incomplete datasets with missing values can significantly degrade the accuracy of machine learning models by providing insufficient or misleading training data. Data quality issues can arise as early as the data collection stage, since data availability does not necessarily equate to data quality (Elouataoui et al., 2022). This is particularly true when dealing with open data, user-generated content, or data from multiple sources (Hair & Sarstedt, 2021).

Pre-processing addresses classic data quality issues such as missing data, duplicates, highly correlated variables, and anomalous or inconsistent values. This stage also includes data normalisation and standardisation (Elouataoui et al., 2022; Foidl & Felderer, 2019; Polyzotis et al., 2018). Annotation of the data introduces additional complexity. Whether automated or crowdsourced, this process is prone to error despite implementing correction procedures (Gupta et al., 2021). Crowdsourcing, in particular, raises concerns about the reliability and accuracy of data annotation due to variability in user expertise and the subjective nature of interpretation (e.g. Chmielewski & Kucker, 2020; Lease, 2011; Ridzuan et al., 2022).
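The classic pre-processing steps listed above can be sketched in a few lines; the example below assumes pandas and scikit-learn, and the toy columns are invented for illustration.

```python
# Typical pre-processing: duplicates, missing values, correlated variables,
# and standardisation.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, None, 5.0],
    "b": [2.0, 4.0, 4.0, 8.0, 10.0],   # roughly 2 x column "a": highly correlated
    "c": [0.1, 0.5, 0.5, 0.9, 0.2],
})

df = df.drop_duplicates()                     # remove duplicate rows
df = df.fillna(df.mean(numeric_only=True))    # impute missing values
print(df.corr())                              # inspect highly correlated variables
X = StandardScaler().fit_transform(df)        # standardise: zero mean, unit variance
```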

Training datasets, which are used to fit models to the data, are essential for assessing the suitability of data for machine learning tasks in terms of efficiency, accuracy and complexity (Gupta et al., 2021). The validation process aims to detect and correct errors introduced during data collection, aggregation or annotation (Gupta et al., 2021; Polyzotis et al., 2018). Data validation is crucial for ensuring the reliability and quality of a system, but it is almost impossible to achieve exhaustively. Although assessing the risk of poor data quality remains feasible, there is still a gap in the discussion of methods for defining the level of validation required at each stage of a machine learning process (Foidl & Felderer, 2019).
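Although exhaustive validation is out of reach, lightweight rule-based checks can flag common errors early. The sketch below, assuming pandas, encodes a few such checks; the column names and the range rule are hypothetical.

```python
# Lightweight validation checks that flag errors introduced during
# collection, aggregation or annotation; a sketch, not an exhaustive validator.
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    issues = []
    if df.isna().any().any():
        issues.append("missing values")
    if df.duplicated().any():
        issues.append("duplicate rows")
    if (df["age"] < 0).any() or (df["age"] > 120).any():  # hypothetical range rule
        issues.append("out-of-range ages")
    return issues

df = pd.DataFrame({"age": [34, 34, -1], "income": [30000, 30000, None]})
print(validate(df))  # ['missing values', 'duplicate rows', 'out-of-range ages']
```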

Trust is essential for the acceptance and adoption of machine learning applications. It is rooted in data shaped by the knowledge and expertise of domain experts, which underpins the accuracy and relevance of the outcomes (Mougan et al., 2022; Toreini et al., 2020). Consequently, users’ trust in AI systems also depends on the quality of the data, which must adhere to three basic principles for ensuring the reliability of machine learning applications: prevention, detection and correction (Ehrlinger et al., 2019). In this process, empirical considerations play an important role, as the construction or selection of datasets involves not only technical aspects but also human decisions that can significantly affect the outcome (Miceli et al., 2021).

Ethical foundations for AI-driven journalism

In journalism studies, ethical considerations are based on three main philosophical approaches: Aristotelian virtue ethics, which emphasises the importance of moral character in journalism; Kantian deontology, which stresses the role of duty and respect in ethical decision-making; and consequentialism, which assesses the moral implications of actions based on their consequences (Quinn, 2007; Sanders, 2003; Ward, 2015). This paper adopts a deontological perspective, defining journalism ethics as the principles and standards that guide the ethical behaviour of journalists. These principles govern the daily practice of journalism and emphasise the social responsibility of journalists towards their audiences (Bardoel & d'Haenens, 2004). Ethical journalism is fundamentally committed to respecting the truth by providing accurate, fair and truthful information, which is essential to maintaining credibility and trust in the news (Frost, 2015; van Dalen, 2020; Ward, 2018). Similar to the ‘garbage in, garbage out’ principle in data science, the credibility of news depends heavily on the trustworthiness of its sources. Assessing this trustworthiness is complex, involving both subjective judgments and empirical evidence (Reich, 2011).

Accuracy is about providing a faithful representation of the world. It can be approached through the broader concepts of truth, objectivity, fairness, transparency and credibility (Porlezza, 2019). In practice, it is about providing verified, independent and balanced information using reliable sources (Shapiro et al., 2013). As a tool for accountability, accuracy helps to build trust with audiences. Although it can be seen as an indicator to assess the quality of reporting, it is difficult to evaluate without introducing human judgements about the truthfulness and (non)misleading nature of the narrative (Ekström & Westlund, 2019; Porlezza, 2019). In data science, accuracy refers to the congruence of data with ground truth, closely following the definition used in journalism studies. In machine learning, it is a formal measure of the frequency with which a model correctly predicts an outcome.
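For readers unfamiliar with the formal measure, the following minimal example (assuming scikit-learn) computes accuracy as the share of predictions that match the ground truth; the labels are invented.

```python
# Accuracy as a formal measure: correct predictions / total predictions.
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]   # ground-truth labels
y_pred = [1, 0, 0, 1, 0]   # model predictions
print(accuracy_score(y_true, y_pred))  # 4 correct out of 5 -> 0.8
```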

Fairness means using unbiased methods and providing balanced information, which contributes to objective reporting (Helberger & Diakopoulos, 2022; Ward, 2015). As part of journalistic culture, it is associated with honesty, privacy, professional distance, truth and impartiality (Ettema et al., 1998; Hanitzsch, 2007). In practice, it means being impartial and disinterested in order to avoid harmful biases and stereotypes (Ali & Hassoun, 2019). Fair reporting involves providing balanced information that gives equal consideration to all stakeholders in the story and presents all sides of the argument (Frost, 2015; Wien, 2005). Fairness also entails ensuring that no socio-political agenda lies behind the news (Ryan, 2001), maintaining non-partisanship (Cavaliere, 2020) and avoiding any other possible type of bias (Muñoz-Torres, 2012a, 2012b). These considerations are echoed in data science and machine learning, where fairness most often refers to unbiased data that prevents the model from perpetuating existing biases. For example, biases may relate to social characteristics such as gender or ethnicity (Caton & Haas, 2020). Fairness-conscious approaches include modifying the data, incorporating fairness constraints during training, or adjusting model outputs to reflect the ethical principles of unbiased and balanced information (e.g. Le Quy et al., 2022; Pessach & Shmueli, 2022). However, addressing fairness in machine learning remains complex given the broader societal implications of deploying technological solutions that may displace existing problems or create new ones (Selbst et al., 2019).
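One common diagnostic, among the many fairness notions discussed in this literature, is to compare positive-prediction rates across groups defined by a sensitive attribute (demographic parity). The sketch below, assuming pandas, uses invented data and column names.

```python
# Demographic parity check: compare positive-prediction rates across a
# sensitive attribute such as gender. A large gap signals potential bias.
import pandas as pd

df = pd.DataFrame({
    "gender": ["f", "f", "m", "m", "m", "f"],
    "predicted_positive": [1, 0, 1, 1, 1, 0],
})
rates = df.groupby("gender")["predicted_positive"].mean()
print(rates)                                      # f: 0.33..., m: 1.0
print("parity gap:", rates.max() - rates.min())   # 0.67: a large disparity
```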

Along with accuracy and fairness, objectivity is another essential standard of responsible reporting. It implies that information does not contain subjective characteristics, such as the feelings and beliefs of the journalist (Frost, 2015). However, objectivity is regularly contested because journalism involves human judgements that are inherently highly subjective (Donsbach & Klett, 1993; Muñoz-Torres, 2012a, 2012b; Tong & Zuo, 2021; Wien, 2005). From news gathering to sources, angles and narratives, journalism remains a matter of choice. Consequently, a journalist’s choice of topic, angle, sources, analysis and narrative delivery reflects individual subjectivity, which cannot be fully controlled due to its unconscious nature. Issues related to objectivity are linked to the philosophical challenge of separating facts from values and the idea that raw facts have no meaning unless they are linked to individual subjectivity (Muñoz-Torres, 2012a, 2012b). While objectivity cannot completely eliminate bias, it can be mitigated by factual accuracy, which is central to the presentation of objectively verified facts (Figdor, 2010). The editorial judgements required to shape news stories can also be scrutinised through the lens of reproducibility in scientific research (Figdor, 2010). Despite its controversial status, objectivity remains a critical element of journalists’ professional identity and ethos and is seen as a universally shared value (e.g. Deuze, 2005; Muñoz-Torres, 2012a, 2012b; Schudson & Anderson, 2009; Ward, 2018; Wien, 2005).

Over the past decade, transparency has become a buzzword, starting in the Anglo-Saxon countries, as a means of increasing credibility and trust among audiences by exposing the hidden factors behind news reporting, such as the processes and methods used (Craft & Vos, 2021; Koliska, 2022). It permeates journalistic practices characterised by their openness, such as data journalism (e.g. Coddington, 2015; Mor & Reich, 2018; Zamith, 2019). The spread of fact-checking practices, under the banner of the International Fact-Checking Network (IFCN) and the European Fact-Checking Standards Network (EFCSN), as well as data journalism practices that encourage openness, has also contributed to the promotion of transparency (Cavaliere, 2020). Seen either as a substitute for objectivity or at least as a tool to enable accountability, openness and trust, the practice of transparency implies that journalists need to be trained in the ethical use of data and that audiences should be informed about the processes at work, especially when AI-driven processes are capable of automatically generating content (e.g. Hansen et al., 2017; Montal & Reich, 2017).

The ethical challenges of transparency in the use of machine learning in journalism are multifaceted, encompassing the data used, the algorithms employed and the resulting outcomes (Dörr & Hollnbuchner, 2017; Karlsson, 2020). Nevertheless, transparency is not always easy to implement in journalism, where practitioners often lack the data and algorithmic literacy needed to understand how algorithms work (Porlezza & Eberwein, 2022). Nor is it easy to achieve in deep learning models, whose creators themselves must work out how the models behave given the sheer number of parameters involved (Boddington, 2017; Burkart & Huber, 2021).

Transparency in AI systems is one of the ethical requirements for ensuring the traceability of a system and building trust with its users (Floridi, 2019). It contributes to the comprehensibility of the system and thus makes it accountable (Boddington, 2017). However, transparency has its limitations: first, users do not necessarily understand how a system works if it lacks comprehensibility; second, transparency does not guarantee that social and ethical values are embedded in the system; third, computer code may be protected for proprietary reasons; fourth, transparency is not synonymous with accuracy and reliability (Bartneck et al., 2020; Burrell, 2016; Dourish, 2016). Moreover, transparency can breed mistrust when users misinterpret correct AI predictions, and overtrust when they fail to recognise incorrect ones (Schmidt et al., 2020).

Given these limitations, explainability—as part of transparency—is considered a more relevant tool for understanding how a given system arrived at a given result (Bartneck et al., 2020; Shin et al., 2022). It also concerns the decision-making process (Rai, 2020; Graziani et al., 2022) and supports the accountability of AI systems (Burkart & Huber, 2021). However, it does not guarantee the absence of hidden agendas or the reliability of results (Ferrario & Loi, 2022; Bryson, 2020). Furthermore, the quality of explanations provided by AI systems is crucial; poor or misleading explanations can lead to incorrect user decisions. Moreover, simply providing explanations is insufficient if users cannot understand or interpret them correctly, especially if the explanations are too technical or irrelevant (Schmidt et al., 2020).
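As one illustration of what a model-agnostic explanation can look like in practice, the sketch below (assuming scikit-learn) computes permutation importance, which estimates how much a model's score degrades when each feature is shuffled. This is only one technique among many, and, as noted above, its output still has to be interpreted correctly by the user.

```python
# Permutation importance: a simple, model-agnostic explanation of which
# features most affect a trained model's performance.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in zip(load_iris().feature_names, result.importances_mean):
    print(f"{name}: {imp:.3f}")  # higher value = larger score drop when shuffled
```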

Responsible journalism practices are built on the pillars of accuracy, fairness, objectivity and transparency. These principles are intertwined and form the basis of accountable journalism, which is committed to serving the public interest and fulfilling its moral duty to society (Newton et al., 2004). Accountability is crucial for building trust with audiences as end users of news information and is equally important for establishing trust between journalists as users of AI systems (Siau & Wang, 2018; Rai, 2020).

In journalism studies, trust is seen as the foundation of credibility and reliability in the dissemination of public information. Journalism derives its legitimacy from the public’s trust, which places a responsibility on journalists to provide accurate and trustworthy information. This trust is multifaceted and arises from the complex relationships between journalists, their sources and their audiences. It involves cognitive beliefs and behavioural intentions that reflect a nuanced interplay of perceptions of competence, benevolence, integrity and predictability (Grosser et al., 2016; Koliska et al., 2023; van Dalen, 2020). Trust is essential not only for journalism but also for fostering user acceptance and facilitating human-AI interactions (Bartneck et al., 2020; Jacovi et al., 2021; Siau & Wang, 2018).

Trust in AI and machine learning for journalism ensures that these technologies are designed and implemented in a way that promotes user and public acceptance (Toreini et al., 2020). This requires an understanding of how AI affects trust dynamics, which in turn requires ethical considerations in the development and deployment of AI to maintain and enhance public trust. In machine learning, however, the paradox of trusting the results of a machine learning system is that trustworthiness is not a prerequisite for trust: “Trust can exist in a model that is not trustworthy, and a trustworthy model does not necessarily gain trust” (Jacovi et al., 2021, p. 627).

Building trust between journalists and AI technology therefore requires a balance between AI-enhanced human tasks and AI-automated routines (Opdahl et al., 2023). It also requires trust in the data on which the system relies. Given that data quality is use- and context-dependent (Wang & Strong, 1996), from a fitness-for-use perspective, meeting the ethical principles of journalism means that data must be accurate, reliable and unbiased. AI systems also challenge these standards because data availability does not equal quality (Elouataoui et al., 2022) and because incomplete or discriminatory datasets are likely to introduce bias (Gudivada et al., 2017).

If there is a recognised need to merge AI-driven systems with journalistic values in line with professional practice (Broussard et al., 2019; Gutierrez Lopez et al., 2022), it should start with the system’s data source. From a journalistic perspective, it is important to remember that if the upstream data is biased or flawed, the system is likely to reproduce these biases and errors (Marconi & Siegman, 2017; Hansen et al., 2017). Trusting the data is also a matter of trusting its origin, recognising that the credibility of the source is a cornerstone of ethical journalism, as it also ensures the accuracy and reliability of the news (Reich, 2011). At the same time, the data used in machine learning often lacks transparency about its social and technical characteristics and potential biases (Miceli et al., 2021; Wu, 2020). This opacity extends to the machine learning training process, where it is difficult to identify the individuals behind the system, their values and the ethical decisions that underpin it. These considerations also apply to the data used to train the system, as it is created by humans with specific intentions for human use.

Building the accuracy-fairness-transparency (AFT) framework

In data science, data quality assessment is considered critical. It leads to actions aimed at improving overall quality by identifying erroneous data elements and understanding their impact on work processes (Cichy & Rass, 2019). From a user perspective, data quality indicators are assessed in terms of their fit with human needs or user requirements, through the aggregation of different information on data quality (Devillers et al., 2002; Cappiello et al., 2004). The assessment framework we have developed for AI-driven journalism is part of this tradition. It builds on findings on data and information quality assessment in the data science literature (e.g., Batini et al., 2009; Cichy & Rass, 2019; Fox et al., 1994; Pipino et al., 2002; Shanks, 1999; Wand & Wang, 1996) and on the core principles that underpin ethical journalism practices (Ward, 2015).

This framework is based on three ethical requirements of journalism: accuracy, given the need to respect the facts; fairness, to avoid biased or unbalanced data; and transparency, which is fundamental when working with AI systems. Because of the tensions surrounding the concept of objectivity, we have not included it in the framework. These three ethical principles are strongly linked to the broader concept of trust, as trusting the outcomes of AI systems should start with trusting the data that feeds them. To build the framework, we compared how the multidimensional concept of data quality is usually assessed with the three requirements defined in light of our literature review in journalism studies. We found that these three requirements correspond to the semiotic levels of data quality and that they all refer to dimensions combining formal and empirical characteristics to be assessed.

The principle of accuracy draws on the journalist’s expertise in the application domain and knowledge of the subject matter. The principle of fairness considers the context in which the data is used: identifying the data producer’s data management practices, tracing the data lifecycle from creation to validation, and considering the technical and journalistic implications of data use. The principle of transparency assumes that embedding journalistic values is a key factor in ensuring expertise and trustworthiness at every stage of data collection and pre-processing, an approach that requires a shared understanding or transfer of journalistic expertise.

Linking the ethical principles with the semiotic levels and the data quality dimensions makes it clear that the journalistic principle of accuracy is rooted in the pursuit of factual accuracy. This principle is reflected in the dimensions of accuracy, consistency, correctness and comprehensibility, which are essential to ensure the reliability and usability of the data. It requires application domain knowledge to deal with, for example, incorrect values or duplicates. The concept of fairness relates partly to the semantic level (completeness) and partly to the pragmatic level, and refers to the dimensions of timeliness, accessibility, objectivity, relevance and usability. It concerns the context of data production, validation, dissemination and use in a journalistic setting. Finally, the principle of transparency relates to the social level and includes the dimensions of provenance, credibility, reliability and verifiability of the data. However, it covers only part of what transparency in journalism entails, which also involves being transparent with readers about the reporting methods used, such as revealing how and why information was gathered and verified, in order to build trust and credibility with the audience.
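One possible way to operationalise these correspondences is to encode the framework as an audit checklist. The sketch below is our own illustration in Python; the questions paraphrase the dimensions discussed above, and the structure and scoring are hypothetical rather than prescribed by the framework.

```python
# An illustrative encoding of the AFT framework as a dataset audit checklist.
AFT_CHECKLIST = {
    "accuracy": [        # syntactic/semantic levels
        "Are values correct, consistent and free of duplicates?",
        "Is domain expertise available to spot incorrect values?",
    ],
    "fairness": [        # semantic/pragmatic levels
        "Is the dataset complete, timely, relevant and accessible?",
        "Is the context of production, validation and use documented?",
    ],
    "transparency": [    # social level
        "Are provenance, credibility and reliability of sources verifiable?",
        "Who annotated the data, and how were they instructed?",
    ],
}

def audit(answers: dict[str, list[bool]]) -> dict[str, float]:
    """Share of satisfied checks per principle (a crude illustration)."""
    return {principle: sum(a) / len(a) for principle, a in answers.items()}

print(audit({"accuracy": [True, False], "fairness": [True, True],
             "transparency": [False, False]}))
```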

The application of the framework can be either objective or subjective (Pipino et al., 2002), with the aim of identifying data quality issues and challenges before they occur, either broadly or in detail. However, its main limitation lies in the nature of information itself: as an empirical element, it is volatile and likely to evolve. As standards and knowledge evolve over time, yesterday’s references may not be the same as today’s. Applying the framework also requires a deeper understanding of the datasets, including consideration of the annotation process used for classification. Who were the annotators? How were they instructed? Without documented data, we cannot answer these fundamental questions. Ensuring these aspects is also linked to the need for human-led annotation, based either on expertise in the topic of the stories or on general journalism training (Torabi Asr & Taboada, 2019).

Another limitation is that the framework allows the identification of data quality issues without providing methods to resolve them, a task that falls within the scope of other research areas (Triguero et al., 2019). Nevertheless, it serves as a means of identifying, preventing and correcting primary data quality issues, fostering a critical perspective on data quality and contributing to the development of data literacy in journalism. It is also in line with the Data-Centric AI (DCAI) approach, which advocates a shift from a model-centric approach towards a focus on data quality and reliability, as encapsulated in the ‘garbage in, garbage out’ principle (Table 2). This perspective is particularly relevant given that machine learning research has primarily focused on training algorithms, overlooking potential data quality issues and unwanted errors in the data (Jarrahi et al., 2022; Whang et al., 2023; Zha et al., 2023).

Table 2 The AFT framework to assess data quality for AI-driven journalism

The AFT framework does not provide specific methods for improving data quality, but its strength lies in its grounding in journalistic standards. Evaluating the different dimensions of data quality from the perspective of journalistic practice reveals both alignment and implementation challenges. Dimensions such as verification, accuracy, objectivity, relevance, timeliness, usability, credibility and reliability are closely aligned with core journalistic practices. They are relatively easy to implement because they reflect journalists’ core values and daily activities, emphasising verification and critique. For example, the reliability dimension can be assessed through a rigorous and systematic examination of all sources used in the journalistic process (Steensen et al., 2022). Conversely, technical dimensions such as interoperability, standardisation and encoding are more challenging and require specific technical knowledge and tools that may be beyond the skills of many journalists. While ensuring data completeness and accessibility fits well with the comprehensive and source-based nature of journalism, practical challenges such as legal access and quantifying completeness may arise.

The AFT framework can be applied practically in automated fact-checking, where it can speed up a typically time-consuming process that is prone to data quality issues. These issues can arise from user-generated content, crowdsourced labelling methods or the use of outdated or incomplete datasets, all of which can compromise the reliability and accuracy of information used in journalistic contexts (Dierickx et al., 2023). As a generic solution, the framework can also be applied to automated image classification and selection. Ensuring accuracy in identifying and tagging images requires careful data validation and contextual understanding. The framework’s principles of accuracy, fairness and transparency can guide practitioners in selecting and verifying image datasets, addressing biases and ensuring that images used in news articles or reports are relevant, credible and representative of the intended narrative.
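As an illustration of the fact-checking use case, the sketch below (assuming pandas) screens a hypothetical fact-checking training set for duplicate claims (accuracy), missing source provenance (transparency) and outdated verdicts (timeliness, under fairness); all column names and thresholds are invented for the example.

```python
# AFT-style screening of a hypothetical fact-checking training set.
import pandas as pd

claims = pd.DataFrame({
    "claim":      ["X happened", "X happened", "Y is rising"],
    "verdict":    ["false", "false", "true"],
    "source_url": ["https://example.org/a", "https://example.org/a", None],
    "checked_on": pd.to_datetime(["2019-01-05", "2019-01-05", "2024-06-01"]),
})

dupes = claims.duplicated(subset=["claim"])                 # accuracy: duplicates
unsourced = claims["source_url"].isna()                     # transparency: provenance
stale = claims["checked_on"] < pd.Timestamp("2023-01-01")   # fairness: timeliness

print(claims[dupes | unsourced | stale])  # rows needing review before training
```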

In addition, the framework’s versatility allows it to be applied to a wide range of contexts and data types to ensure compliance with ethical standards and increase the reliability of AI-driven journalistic output. However, given that data quality, according to the fitness-for-use principle, refers to data that adapts to its intended use, the framework may need to be adapted for some specific cases or contexts.

Discussion and conclusion

The Accuracy-Fairness-Transparency (AFT) framework presented in this paper provides valuable guidance for machine learning developers working on journalism projects, effectively bridging technical considerations and ethical requirements. It has been designed as an adaptive and flexible tool that can be applied across the various forms that AI-driven journalism systems can take. Its strength lies in its consideration of journalistic requirements and ethical principles framed by accountability. From this perspective, the framework can help build bridges between communities of practice whose conceptions of journalism may differ (Sirén-Heikel et al., 2023). It thus facilitates the alignment of concepts between professional cultures that share common concerns but may interpret them differently due to their different epistemologies (Dierickx & Lindén, 2023).

The AFT framework provides a practical approach to merging AI systems with the ethical principles of journalism, responding to calls for better integration of journalistic values in technology (Komatsu et al., 2020). Although primarily tailored to machine learning developers working on AI-driven projects in newsrooms, it recognises the interdisciplinary nature of such projects and the potential involvement of journalists, whose expertise should be valued. This approach requires active engagement from journalists, which may not be feasible for all due to a lack of time or of interest in technical aspects. Yet their expertise in the reliability and credibility of data sources and their knowledge of the context should not be overlooked. This also highlights the need for more interdisciplinarity in the study of AI-driven tools in journalism, as a focus on system performance alone is insufficient from an end-user perspective. In practice, machine learning developers may focus on the technical aspects of the framework, while journalists concentrate on the ethical and social implications. Either way, integrating accuracy, fairness and transparency principles into machine learning systems encourages critical reflection on data sourcing and processing.

As end users of AI systems, journalists are not always aware of or informed about the origin of the data. This framework can therefore be used as a practical strategy to promote data and AI literacy among journalists and as a valuable tool for training programmes. It recognises the need for journalists to understand the role of AI in journalism and to develop a critical awareness of its implications (Deuze & Beckett, 2022). If data journalism requires data literacy because working with data draws on scientific methods (Bobkowski & Etheridge, 2023), the same is true for AI-driven journalism, where every process starts with data.

The AFT framework also highlights the benefits of a data-centric approach to AI in journalism, the main advantage of which lies in supporting the development of machine learning applications with smaller datasets. Another strength lies in the insight it offers into opaque systems through the data that feeds them, since the results of machine learning are directly influenced by the quality of the data at each stage of the process. For all these reasons, the framework can help build trustworthy systems. Although a thorough assessment requires a significant investment of time due to the meticulous scrutiny involved, the benefits far outweigh the costs, particularly in terms of accuracy and reliability.

Given that the relationship between journalists and AI-driven systems is based on trust, the data that feeds these systems must also be trusted. However, defining ‘good’ data in journalism remains challenging due to the multidimensional nature of quality. The quality of a dataset is intrinsically linked to expertise in the particular application domain and to an understanding of how the data was collected, validated and disseminated. Hence, assessing data quality for machine learning applications in journalism requires consideration of both the formal and the empirical aspects of a dataset, viewed from an ethical perspective.

The intersection of data quality, as approached in data and computer science, with journalistic standards shows that these fields share many common concerns. For example, the FAIR principles in data management promote the findability, accessibility, interoperability and reusability of data (Wilkinson et al., 2016). From an ethical perspective, data accuracy and validity, as well as unbiased data, are prerequisites for the ethical use of data (Saltz & Dewar, 2019). At the EU level, the Ethics Guidelines for Trustworthy AI emphasise that responsible and accountable AI systems rely on appropriate data governance as well as on transparency, diversity, non-discrimination and fairness of data (Floridi, 2019).

As already mentioned, the AFT framework potentially supports the development of machine learning applications with smaller datasets. However, it is not directly applicable to foundation models, such as large language models, because of the inherent data quality challenges posed by vast amounts of user-generated content, potentially biased sources and copyrighted material (Dwivedi et al., 2023). While licences and agreements, such as those between news publishers and OpenAI, can help mitigate these issues by providing access to verified sources, they do not address all data quality concerns. Considerations of timeliness and completeness also remain critical to ensuring the reliability of the data used. The complexity of these issues highlights the need for ongoing research into more sophisticated methods for addressing data quality in the most complex AI applications.