Abstract
We propose an attribute value extraction method based on analysing snippets returned by a search engine. First, a pattern-based detector locates candidate attribute values in the snippets. Then a classifier predicts whether each candidate value is correct. Training this classifier requires only a few annotated <entity, attribute, value> triples, because sufficient training data can be generated automatically by matching these triples back to snippets and titles. Finally, since a correct value may appear in multiple snippets, the individual predictions are combined by voting to exploit this redundancy. Experiments on both Chinese and English corpora in the celebrity domain demonstrate the effectiveness of our method: with only 15 annotated <entity, attribute, value> triples, precision exceeds 85% for 7 of 12 attributes, and 11 of 12 attributes improve over a state-of-the-art method.
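The automatic labelling and voting steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `label_candidates` and `vote`, the exact-match labelling rule, and the simple majority-vote scheme are all assumptions, since the abstract does not specify how matching or voting is carried out. The pattern-based detector and the classifier are left out and represented only by their outputs.

```python
from collections import Counter


def label_candidates(entity, attribute, candidates, seed_triples):
    """Generate training labels by matching seed triples back to snippets.

    candidates: list of (snippet, value) pairs produced by some detector.
    A candidate is labelled True when its value equals the annotated value
    for this entity and attribute (exact match is an assumption here).
    """
    gold = {v for e, a, v in seed_triples if (e, a) == (entity, attribute)}
    return [(snippet, value, value in gold) for snippet, value in candidates]


def vote(predictions):
    """Combine per-snippet predictions by majority vote.

    predictions: list of (value, accepted) pairs, one per snippet occurrence,
    where accepted is the (hypothetical) classifier's yes/no decision.
    Returns the value accepted most often, or None if nothing was accepted.
    """
    counts = Counter(value for value, accepted in predictions if accepted)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Illustrative data only; not taken from the paper's corpora.
    seeds = [("Ada Lovelace", "birth_year", "1815")]
    cands = [
        ("Ada Lovelace (born 1815) was a mathematician", "1815"),
        ("Ada Lovelace died in 1852", "1852"),
    ]
    print(label_candidates("Ada Lovelace", "birth_year", cands, seeds))
    print(vote([("1815", True), ("1852", True), ("1815", True)]))
```

Counting only accepted predictions keeps the sketch simple; a weighted scheme that sums classifier confidence scores would be an equally plausible reading of "voting".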
This work is supported by NSFC Project 61075067 and the National Key Technology R&D Program (No. 2011BAH10B04-03).
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, X., Ge, T., Sui, Z. (2013). Learning to Extract Attribute Values from a Search Engine with Few Examples. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2013. Lecture Notes in Computer Science, vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_15
DOI: https://doi.org/10.1007/978-3-642-41491-6_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer Science (R0)