short-paper

SogouT-16: A New Web Corpus to Embrace IR Research

Authors:

Shaoping Ma

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 1233 - 1236

https://doi.org/10.1145/3077136.3080694

Published: 07 August 2017

Abstract

Web collections are essential for many areas of Web-based research, such as Web information retrieval (IR), Web data mining, and corpus linguistics. However, collecting Web pages at large scale in a lab environment is usually expensive and time-consuming, so publicly available collections have become a necessity for this research. In this study, we present SogouT-16, a Chinese Web collection that is, to date, the largest free-of-charge public Chinese Web collection. We describe a variety of characteristics of SogouT-16 and discuss its adoption in We Want Web, a newly designed ad-hoc retrieval task at NTCIR-13. SogouT-16 also provides an online retrieval service and includes a number of auxiliary resources, such as a hyperlink structure graph, query logs, and word embeddings. We believe that SogouT-16 will provide new opportunities for novel investigations and applications in IR and related communities.

Index Terms

SogouT-16: A New Web Corpus to Embrace IR Research
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
      1. Test collections
  2. World Wide Web
    1. Web mining
    2. Web searching and information discovery

Information & Contributors

Information

Published In

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval

August 2017

1476 pages

ISBN:9781450350228

DOI:10.1145/3077136

General Chairs:
Noriko Kando (National Institute of Informatics),
Tetsuya Sakai (Waseda University),
Hideo Joho (University of Tsukuba)

Program Chairs:
Hang Li (Huawei Noah's Ark Lab),
Arjen P. de Vries (Radboud University),
Ryen W. White (Microsoft Cortana)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

National Natural Science Foundation of China
National Key Basic Research Program

Conference

SIGIR '17

Sponsor:

SIGIR

SIGIR '17: The 40th International ACM SIGIR conference on research and development in Information Retrieval

August 7 - 11, 2017

Tokyo, Shinjuku, Japan

Acceptance Rates

SIGIR '17 Paper Acceptance Rate 78 of 362 submissions, 22%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 8
Total Downloads: 199
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 0

Reflects downloads up to 07 Jan 2025

Citations

Cited By

Wang X, Lin J, Ren C, Chen J (2022) Knowledge Graph-Based Semantic Ranking for Efficient Semantic Query. 2022 IEEE 10th International Conference on Computer Science and Network Technology (ICCSNT), 75-79. Online publication date: 22-Oct-2022
https://doi.org/10.1109/ICCSNT56096.2022.9972953
Wu Z, Liu Y, Mao J, Zhang M, Ma S (2022) Leveraging Document-Level and Query-Level Passage Cumulative Gain for Document Ranking. Journal of Computer Science and Technology 37:4, 814-838. Online publication date: 30-Jul-2022
https://doi.org/10.1007/s11390-022-2031-y
Tahir B, Mehmood M (2021) Corpulyzer: A Novel Framework for Building Low Resource Language Corpora. IEEE Access 9, 8546-8563. Online publication date: 2021
https://doi.org/10.1109/ACCESS.2021.3049793
Wu Z, Mao J, Liu Y, Zhan J, Zheng Y, Zhang M, Ma S (2020) Leveraging Passage-level Cumulative Gain for Document Ranking. Proceedings of The Web Conference 2020, 2421-2431. Online publication date: 20-Apr-2020
https://dl.acm.org/doi/10.1145/3366423.3380305
Zheng Y, Liu Y, Fan Z, Luo C, Ai Q, Zhang M, Ma S (2019) Investigating Weak Supervision in Deep Ranking. Data and Information Management. Online publication date: 11-Sep-2019
https://doi.org/10.2478/dim-2019-0010
Wu B, Xiong C, Sun M, Liu Z, Champin P, Gandon F, Médini L, Lalmas M, Ipeirotis P (2018) Query Suggestion with Feedback Memory Network. Proceedings of the 2018 World Wide Web Conference, 1563-1571. Online publication date: 10-Apr-2018
https://dl.acm.org/doi/10.1145/3178876.3186068
Al-Thubaity A, Alhoshan M (2018) ARARSS: A System for Constructing and Updating Arabic Textual Resources. Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018, 261-269. Online publication date: 29-Aug-2018
https://doi.org/10.1007/978-3-319-99010-1_24
Luo C, Zheng Y, Mao J, Liu Y, Zhang M, Ma S (2017) Training Deep Ranking Model with Weak Relevance Labels. Databases Theory and Applications, 205-216. Online publication date: 20-Sep-2017
https://doi.org/10.1007/978-3-319-68155-9_16
