Error Spotting with Gradient Boosting: A Machine Learning-Based Application for Central Bank Data Quality
Csaba Burger and
Mihaly Berndt
Csaba Burger: Magyar Nemzeti Bank (the Central Bank of Hungary)
Mihaly Berndt: Clarity Consulting Kft
No 2023/148, MNB Occasional Papers from Magyar Nemzeti Bank (Central Bank of Hungary)
Abstract:
Supervised machine learning methods that are trained without error labels are increasingly popular for identifying potential data errors. Such algorithms rely on the tenet of a ‘ground truth’ in the data; in other words, they assume that the data are correct in the majority of cases. Points that deviate from the relationships learned from the data, i.e. outliers, are flagged as potential data errors. This paper implements an outlier-based error-spotting algorithm using gradient boosting, and presents a blueprint for the modelling pipeline. More specifically, it underpins three main modelling hypotheses with empirical evidence, which relate to (1) missing value imputation, (2) the choice of loss function and (3) the location of the error. To do so, it uses a cross-sectional view of the loan-to-value ratio and its related columns in the Credit Registry (Hitelregiszter) of the Central Bank of Hungary (MNB), and introduces a set of synthetic error types to test its hypotheses. First, the paper shows that gradient boosting is not materially affected by the choice of imputation method; replacement with a constant, the computationally most efficient option, is therefore recommended. Second, the Huber loss function, which is quadratic up to the Huber-slope parameter and linear above it, is better suited to coping with outlier values and is therefore better at capturing data errors. Finally, errors in the target variable are captured best, while errors in the predictors are hardly found at all. These empirical results may generalize to other cases, depending on data specificities, and the modelling pipeline described highlights the significant modelling decisions involved.
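The robustness of the Huber loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's pipeline: it fits only a constant rather than a gradient-boosted model, and the toy loan-to-value figures, the function names, the slope parameter `delta` and the flagging threshold of three times `delta` are all assumptions made for illustration. What it does share with gradient boosting is the pseudo-residual: under the Huber loss, the negative gradient is the residual clipped to the interval [-delta, delta], so a single extreme value cannot dominate the fit.

```python
from statistics import mean, median


def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def pseudo_residual(residual, delta=1.0):
    """Negative gradient of the Huber loss: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, residual))


def fit_constant_huber(values, delta=1.0, steps=500, lr=0.1):
    """Gradient descent for the constant that minimises the mean Huber loss."""
    c = median(values)  # robust starting point
    for _ in range(steps):
        c += lr * mean(pseudo_residual(v - c, delta) for v in values)
    return c


# Hypothetical toy data: six plausible loan-to-value ratios and one
# synthetic error (6.0), e.g. a misplaced decimal point.
ltv = [0.62, 0.58, 0.65, 0.60, 0.61, 0.59, 6.0]
delta = 0.1  # assumed Huber-slope parameter

c_squared = mean(ltv)                    # minimiser of the squared loss
c_huber = fit_constant_huber(ltv, delta)

# Flag points far from the robust fit as potential data errors.
# The threshold (3 * delta) is an arbitrary choice for this sketch.
flags = [abs(v - c_huber) > 3 * delta for v in ltv]

print(f"squared-loss fit: {c_squared:.3f}")
print(f"Huber-loss fit:   {c_huber:.3f}")
print("flagged values:  ", [v for v, f in zip(ltv, flags) if f])
```

In this toy case the squared-loss fit is dragged towards the erroneous value, whereas the Huber fit stays close to the bulk of the data, so the residual of the erroneous point stands out more clearly. Gradient boosting libraries that offer a Huber objective apply the same clipping to the residuals that each new tree is fitted to.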
Keywords: data quality; machine learning; gradient boosting; central banking; loss functions; missing values
JEL-codes: C5 C81 E58
Pages: 34 pages
Date: 2023
New Economics Papers: this item is included in nep-ban, nep-big, nep-cba, nep-cmp, nep-ecm, nep-mac and nep-mon
Downloads: (external link)
https://www.mnb.hu/en/publications/studies-publica ... al-bank-data-quality (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:mnb:opaper:2023/148