DOI: 10.1145/3077136.3080694
short-paper

SogouT-16: A New Web Corpus to Embrace IR Research

Published: 07 August 2017

Abstract

A Web collection is essential for many kinds of Web-based research, such as Web information retrieval (IR), Web data mining, and corpus linguistics. However, collecting a large number of Web pages in a lab environment is usually expensive and time-consuming, so publicly available collections have become a necessity for this research. In this study, we present a Chinese Web collection, SogouT-16, which is the largest free-of-charge public Chinese Web collection so far. We provide a variety of descriptive characteristics of SogouT-16 and discuss its adoption in We Want Web, a newly designed ad hoc retrieval task at NTCIR-13. SogouT-16 also provides an online retrieval service and includes a number of auxiliary resources, among them a hyperlink structure graph, query logs, and word embeddings. We believe that SogouT-16 will provide new opportunities for novel investigations and applications in IR and other related communities.

Published In

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2017
1476 pages
ISBN: 9781450350228
DOI: 10.1145/3077136
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. search evaluation
  2. test collection
  3. web corpus

Qualifiers

  • Short-paper

Conference

SIGIR '17
Acceptance Rates

SIGIR '17 paper acceptance rate: 78 of 362 submissions (22%)
Overall acceptance rate: 792 of 3,983 submissions (20%)

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 0
Reflects downloads up to 07 Jan 2025

Citations

Cited By

  • (2022) Knowledge Graph-Based Semantic Ranking for Efficient Semantic Query. 2022 IEEE 10th International Conference on Computer Science and Network Technology (ICCSNT), pp. 75-79. DOI: 10.1109/ICCSNT56096.2022.9972953. Online publication date: 22-Oct-2022
  • (2022) Leveraging Document-Level and Query-Level Passage Cumulative Gain for Document Ranking. Journal of Computer Science and Technology 37:4, pp. 814-838. DOI: 10.1007/s11390-022-2031-y. Online publication date: 30-Jul-2022
  • (2021) Corpulyzer: A Novel Framework for Building Low Resource Language Corpora. IEEE Access 9, pp. 8546-8563. DOI: 10.1109/ACCESS.2021.3049793. Online publication date: 2021
  • (2020) Leveraging Passage-level Cumulative Gain for Document Ranking. Proceedings of The Web Conference 2020, pp. 2421-2431. DOI: 10.1145/3366423.3380305. Online publication date: 20-Apr-2020
  • (2019) Investigating Weak Supervision in Deep Ranking. Data and Information Management. DOI: 10.2478/dim-2019-0010. Online publication date: 11-Sep-2019
  • (2018) Query Suggestion with Feedback Memory Network. Proceedings of the 2018 World Wide Web Conference, pp. 1563-1571. DOI: 10.1145/3178876.3186068. Online publication date: 10-Apr-2018
  • (2018) ARARSS: A System for Constructing and Updating Arabic Textual Resources. Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018, pp. 261-269. DOI: 10.1007/978-3-319-99010-1_24. Online publication date: 29-Aug-2018
  • (2017) Training Deep Ranking Model with Weak Relevance Labels. Databases Theory and Applications, pp. 205-216. DOI: 10.1007/978-3-319-68155-9_16. Online publication date: 20-Sep-2017
