Abstract
Chinese logistics address segmentation is a specialized subtask of address resolution and is challenging because of factors such as language, culture, user privacy, and business value. Although deep learning can overcome the heavy dependence of traditional segmentation methods on domain knowledge, it requires costly manual labeling. In this context, we propose a decision tree model based on regular expression boundaries that requires neither additional data nor manual labeling. First, unlike traditional methods that describe entire address elements, we construct a regular expression rule library (RERL) that describes only the boundaries of address elements. Second, a binary split attribute is defined according to a boundary matching algorithm based on the RERL. A decision tree model is then constructed based on the distribution pattern of address element types to segment an address, and its effectiveness is evaluated. The experimental results demonstrate the improvement achieved by our model and further show that our proposal can provide a high-quality labeled training set for deep learning models without any professional domain knowledge, even in low-resource scenarios.
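The boundary-based idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' RERL or their decision tree: the rule list `BOUNDARY_RULES`, the labels, and the helper names `match_boundary` and `segment` are hypothetical, and a fixed chain of yes/no tests stands in for a tree built from the distribution of address element types. It shows only how rules that describe element boundaries (rather than whole elements) can drive binary splits that cut an address into parts.

```python
import re

# Hypothetical boundary rules (not the paper's RERL): each pattern describes
# only the closing character(s) of an address element, never its full content.
BOUNDARY_RULES = [
    ("province", re.compile(r"省|自治区")),
    ("city", re.compile(r"市")),
    ("district", re.compile(r"区|县")),
    ("road", re.compile(r"路|街|大道")),
    ("number", re.compile(r"号")),
]


def match_boundary(text, pattern):
    """Binary split attribute: end index of the first boundary match, else None."""
    m = pattern.search(text)
    return m.end() if m else None


def segment(address):
    """Apply the rules in a fixed order; each test is a yes/no split."""
    segments = []
    rest = address
    for label, pattern in BOUNDARY_RULES:
        end = match_boundary(rest, pattern)
        if end is None:
            # "No" branch: this element type is absent, try the next rule.
            continue
        # "Yes" branch: cut the element off at its boundary.
        segments.append((label, rest[:end]))
        rest = rest[end:]
    if rest:
        # Whatever follows the last matched boundary is kept as free-form detail.
        segments.append(("detail", rest))
    return segments


if __name__ == "__main__":
    for label, text in segment("湖北省武汉市洪山区珞喻路129号"):
        print(label, text)
```

Because the rules are tried in a fixed administrative order, this sketch would mis-segment addresses whose element names contain a boundary character (for example a road name containing 市); handling such cases is where a tree informed by the distribution of element types would be expected to help.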
Acknowledgments
This work was supported by the National Key R&D Program of China (No. 2018YFB2100603). The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Ling, G., Xu, A., Wang, C. et al. REBDT: A regular expression boundary-based decision tree model for Chinese logistics address segmentation. Appl Intell 53, 6856–6872 (2023). https://doi.org/10.1007/s10489-022-03511-6
DOI: https://doi.org/10.1007/s10489-022-03511-6