DOI: 10.1145/3488560.3498430 · Research article · WSDM '22 Conference Proceedings

A Sequence-to-Sequence Model for Large-scale Chinese Abbreviation Database Construction

Published: 15 February 2022

Abstract

Abbreviations, which are common in daily communication, play an important role in natural language processing. Most existing studies treat Chinese abbreviation prediction as a sequence labeling problem. However, sequence labeling models usually ignore label dependencies: the label predicted for each character should be conditioned on the labels already predicted. In this paper, we formalize the Chinese abbreviation prediction task as a sequence generation problem and design a novel sequence-to-sequence model. To further boost performance, we propose a multi-level pre-trained model that incorporates character-, word-, and concept-level embeddings. To evaluate our methods, we automatically build a new dataset for Chinese abbreviation prediction that contains 81,351 pairs of full forms and abbreviations. Extensive experiments on a public dataset and the newly built dataset show that our model outperforms the state-of-the-art methods on both. More importantly, we build a large-scale database for a specific domain, i.e., life services at Meituan Inc., containing 4,134,142 pairs of full forms and abbreviations with an accuracy of about 82.7%. Online A/B testing on the Meituan and Dianping apps shows that Click-Through Rate increases by 0.59% and 0.86%, respectively, when the database is used in the search system. We have released our API at http://kw.fudan.edu.cn/ddemos/abbr/, which has served over 87k calls in 9 months.
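
To make the labeling-vs-generation distinction in the abstract concrete, here is a minimal sketch (function and label names are hypothetical, not from the paper) of how a full-form/abbreviation pair reduces to the per-character keep/drop labels that sequence-labeling baselines predict; the paper's model instead generates the abbreviation directly, so each output character can condition on the characters already emitted:

```python
def keep_drop_labels(full_form: str, abbreviation: str) -> list[str]:
    """Greedily align an abbreviation to its full form, labeling each
    character of the full form 'K' (kept in the abbreviation) or 'D'
    (dropped). Chinese abbreviations are typically character subsequences
    of the full form, which is what makes this labeling view possible."""
    labels = []
    i = 0  # index of the next abbreviation character to match
    for ch in full_form:
        if i < len(abbreviation) and ch == abbreviation[i]:
            labels.append('K')
            i += 1
        else:
            labels.append('D')
    if i != len(abbreviation):
        raise ValueError("abbreviation is not a subsequence of the full form")
    return labels

# 北京大学 ("Peking University") abbreviates to 北大
print(keep_drop_labels("北京大学", "北大"))  # ['K', 'D', 'K', 'D']
```

A labeling model scores each 'K'/'D' decision largely independently, whereas a generation model's choice of the next output character depends on the prefix produced so far, which is the dependency the paper exploits.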

Supplementary Material

MP4 File (WSDM-fp330.mp4)
The video covers four parts: motivation, methodology, experiments, and conclusion. In the motivation, we introduce the task with examples from Chinese news media and contrast Chinese abbreviations with English ones. In the methodology, noting the inadequacy of sequence labeling models, we formalize the Chinese abbreviation prediction task as a sequence generation problem and propose a novel seq2seq model together with a multi-level pre-trained model to address it. Extensive experiments show that our model outperforms the baselines on almost all metrics, demonstrating its effectiveness. Finally, we focus on the application in the field of e-commerce and describe the A/B testing in detail; the results demonstrate the model's positive contribution to real e-commerce.
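
The multi-level pre-trained model summarized above combines character-, word-, and concept-level information. A minimal sketch of concatenating the three embedding levels per input character (toy dimensions, random stand-in tables, and all lookup mappings are hypothetical illustrations, not the paper's actual pre-trained embeddings):

```python
import random

random.seed(0)

CHAR_DIM, WORD_DIM, CONCEPT_DIM = 4, 4, 4

def toy_table(keys, dim):
    """Stand-in for a pre-trained embedding table: key -> fixed random vector."""
    return {k: [random.uniform(-1, 1) for _ in range(dim)] for k in keys}

char_emb = toy_table(list("北京大学"), CHAR_DIM)
word_emb = toy_table(["北京", "大学"], WORD_DIM)
concept_emb = toy_table(["City", "Institution"], CONCEPT_DIM)

def multi_level_embed(chars, char_to_word, word_to_concept):
    """Build one input vector per character by concatenating its character-,
    word-, and concept-level embeddings, mirroring the multi-level input
    representation described in the abstract."""
    out = []
    for ch in chars:
        word = char_to_word[ch]          # word containing this character
        concept = word_to_concept[word]  # concept the word belongs to
        out.append(char_emb[ch] + word_emb[word] + concept_emb[concept])
    return out

vectors = multi_level_embed(
    "北京大学",
    {"北": "北京", "京": "北京", "大": "大学", "学": "大学"},
    {"北京": "City", "大学": "Institution"},
)
print(len(vectors), len(vectors[0]))  # 4 characters, each a 12-dim vector
```

The design point is that every character's representation carries its word and concept context before the encoder sees it, so segmentation and taxonomy knowledge need not be relearned from the abbreviation data alone.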




    Published In

    WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
    February 2022, 1690 pages
    ISBN: 9781450391320
    DOI: 10.1145/3488560

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. chinese abbreviation
    2. sequence-to-sequence model

    Qualifiers

    • Research-article

    Conference

    WSDM '22

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Article Metrics

    • Total Citations: 0
    • Total Downloads: 225
    • Downloads (last 12 months): 32
    • Downloads (last 6 weeks): 6

    Reflects downloads up to 11 Dec 2024
