A Comprehensive Machine Learning Based Modeling of Income Tax Collection

Nghia Chung (1), Thi Mai Thom Do (2), Van Tai Pham (3), Thanh Tan Tran (4), Thi Phuong Thao Nguyen (5), Quang Huy Nguyen (6), Nguyen Van Nguyen (7), Canh Son Nguyen (8), Thanh Nam Dang (9)
(1) Maritime Manning and Training Center, Ho Chi Minh City University of Transport, Ho Chi Minh City, Vietnam
(2) Faculty of Financial Management, Vietnam Maritime University, Hai Phong, Vietnam
(3) Foreign Trade Faculty, College of Foreign Economic Relations, Ho Chi Minh City, Vietnam
(4) Foreign Trade Faculty, College of Foreign Economic Relations, Ho Chi Minh City, Vietnam
(5) Foreign Trade Faculty, College of Foreign Economic Relations, Ho Chi Minh City, Vietnam
(6) School of Economics and Law, Tra Vinh University, Tra Vinh, Vietnam
(7) School of Economics and Law, Tra Vinh University, Tra Vinh, Vietnam
(8) Faculty of Economics and Management, Dong Nai Technology University, Bien Hoa City, Vietnam
(9) Institute of Maritime, Ho Chi Minh City University of Transport, Ho Chi Minh City, Vietnam
How to cite (IJASEIT):
Chung, Nghia, et al. “A Comprehensive Machine Learning Based Modeling of Income Tax Collection”. International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 6, Dec. 2024, pp. 1896-05, doi:10.18517/ijaseit.14.6.19860.
Income tax is one of the most important sources of revenue for any country, so forecasting income tax collection is an important task. This work presents a machine learning method that uses Gross Domestic Product (GDP) and population data to forecast income tax collection. A violin plot of the data shows that population values are concentrated around the middle, GDP has a bimodal distribution, and income tax exhibits a pattern similar to that of the population. Several machine learning models were assessed for accuracy and generalization on both training and test data using Mean Squared Error (MSE), R-squared (R²), and Mean Absolute Percentage Error (MAPE). Random Forest attained a Train MSE of 25.94 and a Test MSE of 51.00, with a Train MAPE of 2.85% and a Test MAPE of 5.53%, indicating good performance but modest overfitting. Gradient Boosting achieved a near-perfect Train MSE of 0.04, but its higher Test MAPE of 6.89% suggests some overfitting. The Decision Tree fit the training data exactly (Train MSE of 0.00) but performed poorly on the test data, with a Test MSE of 180.50. With a Train MSE of 2.32 and a Test MSE of 25.49, CatBoost showed consistent accuracy across both datasets. Because it excelled in stability and generalization, it can be considered the best model for predicting income tax from GDP and population.
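As an illustration of the three metrics named in the abstract, the following minimal sketch implements the standard textbook definitions of MSE, R², and MAPE in plain Python and then ranks models by held-out error. It is not the authors' code: the function names and the toy values `y_true` and `y_pred` are hypothetical, and the `reported` dictionary only restates train/test MSE pairs quoted in the abstract (the Decision Tree value of 180.50 is read here as its test-set error, which is an assumption).

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Squared Error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Percentage Error in percent (y_true must be nonzero)."""
    ratios = (abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * sum(ratios) / len(y_true)


# Hypothetical toy values for illustration only; not data from the paper.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]

print(f"MSE  = {mse(y_true, y_pred):.2f}")
print(f"R^2  = {r2(y_true, y_pred):.4f}")
print(f"MAPE = {mape(y_true, y_pred):.2f}%")

# (train MSE, test MSE) pairs quoted in the abstract.
# Assumption: 180.50 is the Decision Tree's test MSE.
reported = {
    "Random Forest": (25.94, 51.00),
    "Decision Tree": (0.00, 180.50),
    "CatBoost": (2.32, 25.49),
}

# A large gap between test and train error is a common sign of overfitting.
gaps = {name: test - train for name, (train, test) in reported.items()}

# Pick the model with the lowest held-out (test) error.
best = min(reported, key=lambda name: reported[name][1])
print(f"Lowest test MSE: {best}")
```

Ranking by test error rather than training error reflects the abstract's emphasis on generalization: a model that fits the training set exactly can still have the largest gap between the two.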

This work is licensed under a Creative Commons Attribution 4.0 International License.
