Comparison of Tree-Based Machine Learning Algorithms to Predict Reporting Behavior of Electronic Billing Machines
Figure 1. Pipeline of the study.
Figure 2. Relative importance of the top 10 features.
Figure 3. ROC AUC of different tree-based machine learning models before pruning.
Figure 4. ROC AUC of different tree-based machine learning models after pruning.
Abstract
1. Introduction
2. Motivating Example and Related Work
2.1. Motivating Example
2.2. Related Work
3. Proposed Approach
3.1. High-Level Study Pipeline
- EBM data collection: this stage gathers data from various sources, including historical performance information for the electronic billing machines, such as the EBM ID, activation period, taxpayer information, receipts, and so on.
- Data pre-processing: clean and pre-process the data to remove missing values, outliers, and irrelevant features. This process also involves feature extraction, data splitting, and so on.
- Model training/selection: choose an appropriate machine learning algorithm based on the nature of the problem and the type of data. In our case, we have considered four main tree-based algorithms: decision trees, random forest, gradient boost, and XGBoost. Train the selected algorithm on the pre-processed data, optimizing its parameters to improve its performance.
- Model evaluation: we evaluate the performance of each trained model using various metrics such as accuracy, precision, recall, F1-score, Log-loss, and AUC-ROC.
- Model pruning/feature selection: select the most relevant features that have a strong influence on the reporting behavior of the machines.
- Prediction: use the model in a real-world setting to predict the reporting behavior of each available electronic billing machine.
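The steps above can be sketched end to end with scikit-learn on synthetic data. This is a minimal illustration only: the dataset, features, and hyperparameters are stand-ins, not the study's actual EBM data or tuning.

```python
# Minimal sketch of the study pipeline on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Stand-in for the cleaned EBM dataset (pre-processing already done).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Model training/selection: three of the four tree-based families compared.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boost": GradientBoostingClassifier(random_state=42),
}

# Model evaluation: accuracy shown here; the study also uses precision,
# recall, F1, log loss, and AUC-ROC.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
```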
3.2. Machine Learning Algorithms
3.2.1. Decision Trees
3.2.2. Random Forest
- A subset of the training data, known as a bootstrap sample, is randomly selected to build the tree. A subset of features is also randomly selected.
- A decision tree is grown from the bootstrapped sample by recursively splitting the data into subsets based on the features that result in the largest reduction in impurity.
- The previous steps are repeated for each tree in the forest; the forest's final prediction aggregates the individual trees (e.g., by majority vote for classification).
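The bootstrap, feature-subsampling, and aggregation steps above can be sketched in pure Python (the function names and toy data are illustrative, not from the study):

```python
import random

def bootstrap_sample(X, y, rng):
    """Draw a bootstrap sample: n rows sampled with replacement."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]

def random_feature_subset(n_features, k, rng):
    """Randomly pick k of the available features for one tree."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Aggregate per-tree class predictions for classification."""
    return max(set(predictions), key=predictions.count)

rng = random.Random(0)
X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [0, 1, 1, 0]
Xb, yb = bootstrap_sample(X, y, rng)   # same size as the original data
feats = random_feature_subset(n_features=2, k=1, rng=rng)
```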
3.2.3. Gradient Boost
- F(x) is the final prediction for a given input x, computed as F(x) = F_0(x) + Σ_{m=1}^{M} F_m(x).
- F_0(x) is the first weak model, which is usually a simple constant or mean value.
- F_m(x) is the m-th weak model, for m = 1, 2, …, M.
- Σ is the summation symbol.
- M is the total number of weak models used in the gradient boost model.
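The additive form F(x) = F_0(x) + Σ F_m(x) can be illustrated directly. This is a toy sketch with hand-written weak models, not a fitted booster:

```python
def boosted_predict(x, f0, weak_models):
    """F(x) = F0(x) plus the sum of the M weak models' corrections."""
    return f0(x) + sum(f_m(x) for f_m in weak_models)

# Toy example: F0 is a constant (e.g., the target mean), and each
# successive weak model nudges the prediction toward the residual.
f0 = lambda x: 0.5
weak_models = [lambda x: 0.2 * x, lambda x: 0.1 * x]
pred = boosted_predict(2.0, f0, weak_models)  # 0.5 + 0.4 + 0.2 = 1.1
```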
3.2.4. Extreme Gradient Boosting (XGBoost)
- f(x) is the predicted value for input x, computed as f(x) = Σ_i w_i f_i(x);
- w_i is the weight assigned to the i-th decision tree;
- f_i(x) is the prediction of the i-th decision tree for input x.
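The weighted-ensemble form f(x) = Σ w_i f_i(x) can be illustrated the same way, with simple threshold rules standing in for fitted decision trees:

```python
def ensemble_predict(x, trees, weights):
    """f(x) = sum over trees of w_i * f_i(x)."""
    return sum(w * tree(x) for tree, w in zip(trees, weights))

# Toy "trees": one-split rules standing in for fitted decision trees.
trees = [
    lambda x: 1.0 if x > 0.5 else 0.0,
    lambda x: 1.0 if x > 1.5 else 0.0,
]
weights = [0.6, 0.4]
pred = ensemble_predict(1.0, trees, weights)  # 0.6 * 1 + 0.4 * 0 = 0.6
```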
3.3. Evaluation Metrics
3.3.1. Confusion Matrix
- False positive (FP) in the confusion matrix denotes findings in which the projected outcome is positive while the actual value is negative. This is also referred to as a Type 1 Error [34].
- False negative (FN) indicates findings in which a negative outcome is predicted, but the actual values are positive. This is classified as a Type 2 Error and just as harmful as a Type 1 Error [34].
- True negative (TN) results are those in which a negative outcome is predicted, and the actual values are also negative [33].
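These four counts can be computed in a few lines of Python (binary 0/1 labels assumed):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # Type 1 error
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # Type 2 error
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```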
3.3.2. Precision
3.3.3. Recall
3.3.4. F1 Score
3.3.5. Accuracy
3.3.6. Binary Cross-Entropy (Logarithmic Loss)
3.3.7. Area under the Receiver Operating Characteristic Curve (AUC-ROC)
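The metrics in Section 3.3 follow directly from the confusion-matrix counts. A self-contained sketch (binary labels; predicted probabilities for the log loss):

```python
import math

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

def accuracy(tp, tn, total):
    return (tp + tn) / total

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy; probabilities clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)
```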
4. Data Processing
4.1. Data Description
4.2. Data Cleaning
- An EBM is said to have reported if it has sent its updates at least once during a certain period.
- The aggregated data were then labeled with Boolean values, namely "1" meaning that the EBM reported, or "0" in case there was a missing reporting period after the time of activation.
- In order to facilitate the analysis, the time of activation was also converted into the period of activation following the algorithm shown in Listing 1.
- The labeling mechanism was also applied to the receipt information and total sales to avoid bias and ensure uniformity in the data. This means that an EBM which issued a receipt at least once during a certain period was labeled with a Boolean value "1" and "0" in case it did not.
- In order to balance the data (for instance, an EBM activated in period "40" may be compared with an EBM activated in the second period), all periods before a given EBM's activation were labeled with a neutral value of "0.5". This also applies to the receipts and total sales data.
- The data regarding the characteristics of the business were later merged with the entire cleaned dataset.
Listing 1. Period assignment pseudocode.
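The listing itself is not reproduced here; a minimal sketch of both the period assignment and the labeling scheme described above, assuming month-based periods counted from the study's first month (the start date below is a hypothetical placeholder):

```python
def activation_period(year, month, start_year=2013, start_month=1):
    """Map an activation date to a 1-based period index.

    start_year/start_month are hypothetical placeholders for the
    study's first observed month.
    """
    return (year - start_year) * 12 + (month - start_month) + 1

def label_periods(activation, reported, n_periods):
    """Label each period: 0.5 before activation, 1 if reported, else 0."""
    labels = []
    for p in range(1, n_periods + 1):
        if p < activation:
            labels.append(0.5)   # EBM not yet activated: neutral label
        elif p in reported:
            labels.append(1)     # reported at least once in this period
        else:
            labels.append(0)     # missing reporting period
    return labels
```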
4.3. Feature Extraction
- data[’repo_6back_all?’]: reported fully (6 over 6);
- data[’repo_6back_70%_up?’]: reported 4 to 5 inclusive;
- data[’repo_6back_50%_down?’]: reported 2 to 3 inclusive;
- data[’repo_6back_17%_down?’]: reported only once or less.
Listing 2. Six-month reporting example.
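A sketch of how the four flags above could be derived from the last six monthly reporting labels. The column names come from the text; the input layout (a list of 0/1 monthly labels) is an assumption:

```python
def six_month_flags(last_six):
    """Derive the four six-month reporting flags from 0/1 monthly labels."""
    n = sum(last_six)  # months reported out of the last six
    return {
        "repo_6back_all?": int(n == 6),          # reported fully (6 of 6)
        "repo_6back_70%_up?": int(4 <= n <= 5),  # reported 4 to 5 inclusive
        "repo_6back_50%_down?": int(2 <= n <= 3),
        "repo_6back_17%_down?": int(n <= 1),     # reported once or less
    }

flags = six_month_flags([1, 1, 0, 0, 0, 0])
```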
4.4. Data Splitting
5. Study Evaluation and Analysis
5.1. Research Questions
- RQ1: What is the best model among those proposed, and how effective is it in tackling the problem at hand?
- RQ2: To what extent could our proposed approach assist tax administrators in increasing taxpayer compliance?
- RQ3: To what extent does our proposed approach advance the state of research in comparison to existing approaches?
5.2. Model Performance: Answer to RQ1
5.2.1. Feature Importance
5.2.2. Model Pruning
6. Discussion
6.1. Approach Benefits: Answer to RQ2
6.2. Research Contribution: Answer to RQ3
6.3. Study Limitations
7. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
# | Feature Description | Feature Name | Importance (%) |
---|---|---|---|
1 | EBM activation period | period_act | 8.13 |
2 | Receipts of 56th month | rec_56 | 7.56 |
3 | Receipts of 62nd month | rec_62 | 5.26 |
4 | Receipts of 63rd month | rec_63 | 4.18 |
5 | Receipts of 64th month | rec_64 | 3.98 |
6 | Receipts of 55th month | rec_55 | 3.72 |
7 | Reporting info for 26th month | repo_26 | 3.62 |
8 | Reporting info for 33rd month | repo_33 | 3.29 |
9 | Reporting info for 36th month | repo_36 | 2.43 |
10 | Reporting info for 37th month | repo_37 | 2.37 |
11 | Reporting info for 44th month | repo_44 | 2.32 |
12 | Reporting info for 45th month | repo_45 | 2.23 |
13 | Reporting info for 47th month | repo_47 | 2.20 |
14 | Reporting info for 49th month | repo_49 | 2.02 |
15 | Reporting info for 52nd month | repo_52 | 2.00 |
16 | Reporting info for 54th month | repo_54 | 1.97 |
17 | Reporting info for 55th month | repo_55 | 1.79 |
18 | Reporting info for 56th month | repo_56 | 1.73 |
19 | Reporting info for 57th month | repo_57 | 1.60 |
20 | Reporting info for 58th month | repo_58 | 1.54 |
21 | Reporting info for 59th month | repo_59 | 1.32 |
22 | Reporting info for 60th month | repo_60 | 1.29 |
23 | Reporting info for 61st month | repo_61 | 1.26 |
24 | Reporting info for 62nd month | repo_62 | 0.93 |
25 | Reporting info for 63rd month | repo_63 | 0.92 |
26 | Reporting info for 64th month | repo_64 | 0.80 |
27 | Total sales info for 60th month | totSales_60 | 0.74 |
28 | Total sales info for 61st month | totSales_61 | 0.73 |
29 | Total sales info for 64th month | totSales_64 | 0.73 |
30 | Total sales info for 56th month | totSales_56 | 0.71 |
31 | Reported one month back?(ref: 65th month) | 1back_? | 0.70 |
32 | Reported two months back?(ref: 65th month) | repo_2back_all? | 0.63 |
33 | Reported 50% of the two months back?(ref: 65th month) | repo_2back_50%? | 0.61 |
34 | Did not report in the last 2 months back?(ref: 65th month) | repo_2back_none? | 0.60 |
35 | Reported all last 4 months back?(ref: 65th month) | repo_4back_all? | 0.50 |
36 | Did it report 50% or less for the last 4 months back?(ref: 65th month) | repo_4back_50%_up? | 0.49 |
37 | Did it report 25% or less for the last 4 months back?(ref: 65th month) | repo_4back_25%_down? | 0.48 |
38 | Did it report 17% or less for the last 4 months back?(ref: 65th month) | repo_4back_17%_down? | 0.45 |
39 | Is the business not individual? | NON INDIVIDUAL | 0.45 |
40 | The business is located in Kigali city | KIGALI CITY | 0.45 |
41 | Is the business small and non-individual? | Small_nonIndie | 0.44 |
References
- Cobham, A. Taxation Policy and Development. Available online: https://www.files.ethz.ch/isn/110040 (accessed on 1 April 2022).
- Casey, P.; Castro, P. Electronic Fiscal Devices (EFDs) An Empirical Study of their Impact on Taxpayer Compliance and Administrative Efficiency. IMF Work. Pap. 2015, 15, 56. [Google Scholar] [CrossRef]
- Steenbergen, V. Reaping the Benefits of Electronic Billing Machines Using Data-Driven Tools to Improve VAT Compliance; Working Paper; International Growth Centre: London, UK, 2017. [Google Scholar]
- Eissa, N.; Zeitlin, A. Using mobile technologies to increase VAT compliance in Rwanda. Unpublished Working Paper. 2014. Available online: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Using+mobile+technologies+to+increase+VAT+compliance+in+Rwanda&btnG= (accessed on 1 February 2023).
- Rwanda Revenue Authority. Tax Statistics Publication in Rwanda. Available online: https://www.rra.gov.rw/Publication/ (accessed on 1 July 2022).
- Botchey, F.E.; Qin, Z.; Hughes-Lartey, K. Mobile Money Fraud Prediction—A Cross-Case Analysis on the Efficiency of Support Vector Machines, Gradient Boosted Decision Trees, and Naïve Bayes Algorithms. Information 2020, 11, 383. [Google Scholar] [CrossRef]
- Andrade, J.P.A.; Paulucio, L.S.; Paixao, T.M.; Berriel, R.F.; Carneiro, T.C.J.; Carneiro, R.V.; De Souza, A.F.; Badue, C.; Oliveira-Santos, T. A machine learning-based system for financial fraud detection. In Proceedings of the Anais do XVIII Encontro Nacional de Inteligência Artificial e Computacional, SBC, online, 29 November 2021; pp. 165–176. [Google Scholar]
- Tang, P.; Qiu, W.; Huang, Z.; Chen, S.; Yan, M.; Lian, H.; Li, Z. Anomaly detection in electronic invoice systems based on machine learning. Inf. Sci. 2020, 535, 172–186. [Google Scholar] [CrossRef]
- Hu, P. Predicting and Improving Invoice-to-Cash Collection through Machine Learning. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2015. [Google Scholar]
- Siarka, P.; Chojnacka-Komorowska, A. Modern technologies for VAT fraud detection. In Fraud in Accounting and Taxation and Its Detection; Publishing House of Wroclaw University of Economics and Business: Wrocław, Poland, 2022; p. 95. [Google Scholar]
- Khurana, P.; Diwan, U. A comparison of psychological factors for tax compliance: Self employed versus salaried people. Int. J. Manag. Soc. Sci. 2014, 2, 107–115. [Google Scholar]
- Murphy, R. The Cost of Tax Abuse. A Briefing Paper on the Cost of Tax Evasion Worldwide. Available online: https://openaccess.city.ac.uk/id/eprint/16561/1/cost_of_tax_ (accessed on 1 February 2022).
- Jackson, B.; Milliron, V. Tax compliance research: Findings, problems and prospects. Int. J. Account. Lit. 1986, 5, 125–165. [Google Scholar]
- Riahi-Belkaoui, A. Relationship between tax compliance internationally and selected determinants of tax morale. J. Int. Account. Audit. Tax. 2004, 13, 135–143. [Google Scholar] [CrossRef]
- Trivedi, V.; Shehata, M.; Mestelman, S. Attitudes, Incentives and Tax Compliance; Department of Economics Working Papers; McMaster University: Hamilton, ON, USA, 2004. [Google Scholar]
- Saad, N. Tax Knowledge, Tax Complexity and Tax Compliance: Taxpayers’ View. Procedia-Soc. Behav. Sci. 2014, 109, 1069–1075. [Google Scholar] [CrossRef] [Green Version]
- Ngigi, E.W. The Effect of Electronic Tax Register System on the Duration of Value Added tax Audit in Kenya. Doctoral Dissertation, University of Nairobi, Nairobi, Kenya, 2011. [Google Scholar]
- Chege, J.M. The Impact of Using Electronic tax Register on Value Added Tax Compliance in Kenya: A case Study of Classified Hotels in Nairobi. Doctoral Dissertation, University of Nairobi, Nairobi, Kenya, 2010. [Google Scholar]
- Ikasu, E. Assessment of Challenges Facing the Implementation of Electronic Fiscal Devices (EFDs) in Revenue Collection in Tanzania. Int. J. Res. Bus. Technol. 2014, 5, 349. [Google Scholar] [CrossRef]
- Mascagni, G.; Monkam, N.; Nell, C. Unlocking the Potential of Administrative Data in Africa: Tax Compliance and Progressivity in Rwanda; International Centre for Tax & Development, Working Paper; International Centre for Tax & Development: Brighton, UK, 2016; Volume 56. [Google Scholar]
- Ranaldi, L.; Pucci, G. Knowing Knowledge: Epistemological Study of Knowledge in Transformers. Appl. Sci. 2023, 13, 677. [Google Scholar] [CrossRef]
- Murorunkwere, B.F.; Tuyishimire, O.; Haughton, D.; Nzabanita, J. Fraud detection using neural networks: A case study of income tax. Future Internet 2022, 14, 168. [Google Scholar] [CrossRef]
- Bel, N.; Bracons, G.; Anderberg, S. Finding Evidence of Fraudster Companies in the CEO’s Letter to Shareholders with Sentiment Analysis. Information 2021, 12, 307. [Google Scholar] [CrossRef]
- Humski, L.; Vrdoljak, B.; Skocir, Z. Concept, development and implementation of FER e-invoice system. In Proceedings of the SoftCOM 2012, 20th International Conference on Software, Telecommunications and Computer Networks, Split-Primosten, Croatia, 18–20 September 2012; pp. 1–5. [Google Scholar]
- Shao, P.E.; Dida, M. The Implementation of an Enhanced EFD System with an Embedded Tax Evasion Detection Features: A Case of Tanzania. J. Inf. Syst. Eng. Manag. 2020, 5. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Geron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2019. [Google Scholar]
- Rokach, L.; Maimon, O. Decision Trees. In Data Mining and Knowledge Discovery Handbook; Maimon, O., Rokach, L., Eds.; Springer: Boston, MA, USA, 2005; pp. 165–192. [Google Scholar] [CrossRef]
- Dangeti, P. Statistics for Machine Learning, 1st ed.; Packt Publishing, Limited: Birmingham, AL, USA, 2017. [Google Scholar]
- Liu, Y.; Wang, Y.; Zhang, J. New Machine Learning Algorithm: Random Forest. In Proceedings of the Information Computing and Applications, Chengde, China, 14–16 September 2012; Liu, B., Ma, M., Chang, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 246–252. [Google Scholar] [CrossRef]
- Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 2013, 7, 21. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Dhieb, N.; Ghazzai, H.; Besbes, H.; Massoud, Y. Extreme Gradient Boosting Machine Learning Algorithm For Safe Auto Insurance Operations. In Proceedings of the 2019 IEEE International Conference on Vehicular Electronics and Safety (ICVES), Cairo, Egypt, 4 September 2019; pp. 1–5. [Google Scholar] [CrossRef]
- Cortes, C.; Mohri, M.; Storcheus, D. Regularized Gradient Boosting. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
- Hossin, M.; Sulaiman, M.N. A Review on Evaluation Metrics for Data Classification Evaluations. Int. J. Data Min. Knowl. Manag. Process 2015, 5, 1–11. [Google Scholar] [CrossRef]
- Vujovic, Ž.D. Classification Model Evaluation Metrics. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 120670. [Google Scholar] [CrossRef]
- Kull, M.; Filho, T.S.; Flach, P. Beta calibration: A well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 20–22 April 2017; Singh, A., Zhu, J., Eds.; Volume 54, pp. 623–631. [Google Scholar]
Compliance Elements | % Change 2010–2017 |
---|---|
Late filing | −14 |
Non-filing | −20 |
VAT payment | +20 |
VAT collections | +732 |
Taxpayer sales | +737 |
Taxpayer registration in VAT | +346 |
Projected Class | Actual: True (1) | Actual: False (0) | Total |
---|---|---|---|
Positive (1) | TP | FP | TP + FP |
Negative (0) | FN | TN | FN + TN |
Total | TP + FN | FP + TN | N |
# | Model | Precision Score | Recall Score | F1 Score | Accuracy | Log Loss |
---|---|---|---|---|---|---|
1 | Random Forest | 0.904 | 0.900 | 0.902 | 0.903 | 0.334 |
2 | XGBoost | 0.897 | 0.894 | 0.895 | 0.897 | 0.271 |
3 | Gradient Boost | 0.891 | 0.885 | 0.887 | 0.889 | 0.367 |
4 | Decision Tree | 0.845 | 0.846 | 0.846 | 0.847 | 5.227 |
# | Model | Precision Score | Recall Score | F1 Score | Accuracy | Log Loss |
---|---|---|---|---|---|---|
1 | Random Forest | 0.924 | 0.921 | 0.922 | 0.923 | 0.409 |
2 | XGBoost | 0.914 | 0.909 | 0.911 | 0.912 | 0.311 |
3 | Gradient Boost | 0.911 | 0.906 | 0.908 | 0.909 | 0.230 |
4 | Decision Tree | 0.879 | 0.877 | 0.878 | 0.879 | 4.096 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Murorunkwere, B.F.; Ihirwe, J.F.; Kayijuka, I.; Nzabanita, J.; Haughton, D. Comparison of Tree-Based Machine Learning Algorithms to Predict Reporting Behavior of Electronic Billing Machines. Information 2023, 14, 140. https://doi.org/10.3390/info14030140