Abstract
Auction Web sites change quickly and hold vast amounts of poorly organized information, which makes that information difficult to digest. We develop a unified framework that automatically extracts product features and summarizes hot item features from multiple auction sites. To cope with irregular Web page layouts and the uncertainty involved, we formulate product feature extraction and hot item feature summarization as a single graph labeling problem using conditional random fields. This graphical model can capture the inter-dependence between neighbouring tokens in a Web page and between tokens in different Web pages, as well as other information such as hot item features across different auction sites. Extensive experiments on several real-world auction Web sites demonstrate the effectiveness of our framework.
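As a rough illustration of what labeling tokens with dependence between neighbours can look like, the sketch below decodes a toy linear-chain model with the Viterbi algorithm. It is not the framework described in the abstract: the label set, the `emission_score` heuristic, and the `TRANSITION` weights are all hypothetical and hand-set, whereas a conditional random field would learn such weights from data and, in the setting above, would also link tokens across pages and sites.

```python
from typing import Dict, List

# Hypothetical label set: a token either belongs to a product feature or not.
LABELS = ["FEATURE", "OTHER"]

# Hypothetical, hand-set transition scores. A trained CRF would learn these.
# Staying inside a feature span is rewarded, which models neighbour dependence.
TRANSITION = {
    ("FEATURE", "FEATURE"): 1.0,
    ("FEATURE", "OTHER"): -0.2,
    ("OTHER", "FEATURE"): -0.2,
    ("OTHER", "OTHER"): 0.3,
}


def emission_score(token: str, label: str) -> float:
    """Hypothetical per-token score: digits and unit-like words suggest a feature."""
    looks_like_feature = (
        any(ch.isdigit() for ch in token) or token.lower() in {"gb", "mp", "zoom"}
    )
    if label == "FEATURE":
        return 1.0 if looks_like_feature else -1.0
    return -0.5 if looks_like_feature else 0.5


def viterbi(tokens: List[str]) -> List[str]:
    """Return the highest-scoring label sequence under the toy chain model."""
    if not tokens:
        return []
    # best[t][label] is the best score of any path ending in `label` at step t.
    best: List[Dict[str, float]] = [
        {lab: emission_score(tokens[0], lab) for lab in LABELS}
    ]
    back: List[Dict[str, str]] = []
    for tok in tokens[1:]:
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANSITION[(p, cur)])
            scores[cur] = (
                best[-1][prev] + TRANSITION[(prev, cur)] + emission_score(tok, cur)
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label to recover the path.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    listing = ["Canon", "camera", "8", "MP", "3x", "zoom", "new"]
    for token, label in zip(listing, viterbi(listing)):
        print(f"{token}\t{label}")
```

The transition scores are what let a token's label depend on its neighbour's label; with only emission scores, each token would be labeled in isolation.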
Additional information
The work described in this paper is substantially supported by grants from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Nos: CUHK 4179/03E and CUHK4193/04E) and the Direct Grant of the Faculty of Engineering, CUHK (Project Codes: 2050363 and 2050391). This work is also affiliated with the Microsoft-CUHK Joint Laboratory for Human-centric Computing and Interface Technologies.
About this article
Cite this article
Wong, TL., Lam, W. Learning to extract and summarize hot item features from multiple auction web sites. Knowl Inf Syst 14, 143–160 (2008). https://doi.org/10.1007/s10115-007-0078-2
DOI: https://doi.org/10.1007/s10115-007-0078-2