DOI: 10.1145/3269206.3271745

GYANI: An Indexing Infrastructure for Knowledge-Centric Tasks

Published: 17 October 2018

Abstract

In this work, we describe GYANI (gyan means knowledge in Hindi), an indexing infrastructure for the search and analysis of large, semantically annotated document collections. Many knowledge-centric tasks, such as information extraction, question answering, and relationship extraction, depend on finding relevant sentences or text regions, which requires querying large annotated document collections interactively. However, no existing indexing infrastructure both scales to millions of documents and provides fast query execution times. To address this problem, we describe how to effectively index layers of annotations (e.g., part-of-speech tags, named entities, temporal expressions, and numerical values) that can be attached to sequences of words. Furthermore, we describe a query language that can express regular expressions over word sequences and semantic annotations, easing the search for sentences and text regions and enabling knowledge acquisition at scale. We build our infrastructure on a state-of-the-art distributed extensible record store. We extensively evaluate GYANI on two large news archives and the entire Wikipedia, together amounting to more than fifteen million documents. We observe that GYANI achieves speedups of more than 95x on information extraction, 53x on extracting answer candidates for questions, and 12x on the relationship extraction task.
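To make the idea of indexing annotation layers alongside words more concrete, here is a minimal in-memory sketch in Python. It is not GYANI's implementation or query language: the class name `LayeredIndex`, the layer names `word` and `ner`, and the toy sentences are all assumptions made for illustration. It also simplifies the setting by attaching each annotation to a single token and by supporting only fixed-length concatenation of pattern elements, not full regular expressions or a distributed record store.

```python
from collections import defaultdict


class LayeredIndex:
    """Toy positional index: (layer, value) -> {doc_id: [token positions]}.

    Hypothetical illustration only; not the data layout used by GYANI.
    """

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_id, layers):
        # `layers` maps a layer name to one value per token
        # (None means the token carries no annotation in that layer).
        for layer, values in layers.items():
            for pos, value in enumerate(values):
                if value is not None:
                    self.postings[(layer, value)][doc_id].append(pos)

    def match_sequence(self, pattern):
        """Return (doc_id, start) pairs where consecutive tokens match
        the given list of (layer, value) elements, one element per token."""
        first, rest = pattern[0], pattern[1:]
        hits = []
        for doc_id, starts in sorted(self.postings.get(first, {}).items()):
            # Positions of each later element in this document, as sets
            # so that membership checks are constant time.
            later = [set(self.postings.get(el, {}).get(doc_id, ()))
                     for el in rest]
            for start in starts:
                if all(start + offset in positions
                       for offset, positions in enumerate(later, start=1)):
                    hits.append((doc_id, start))
        return hits


index = LayeredIndex()
index.add_document(1, {
    "word": ["Alice", "was", "born", "in", "1990"],
    "ner": ["PERSON", None, None, None, "DATE"],
})
index.add_document(2, {
    "word": ["Paris", "was", "founded", "in", "antiquity"],
    "ner": ["LOCATION", None, None, None, None],
})

# Mixed pattern: a PERSON entity, the words "was born in", then a DATE.
pattern = [("ner", "PERSON"), ("word", "was"), ("word", "born"),
           ("word", "in"), ("ner", "DATE")]
print(index.match_sequence(pattern))  # [(1, 0)]
```

Keeping one posting list per (layer, value) pair lets a single query mix literal words with annotation types, which is the kind of combined pattern the abstract describes for locating candidate sentences.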


Cited By

  • (2024) Scalable Range Search over Temporal and Numerical Expressions. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, 91-100. DOI: 10.1145/3664190.3672509. Online publication date: 2-Aug-2024.
  • (2019) Structured Search in Annotated Document Collections. Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 794-797. DOI: 10.1145/3289600.3290618. Online publication date: 30-Jan-2019.
  • (2019) Efficient Retrieval of Knowledge Graph Fact Evidences. The Semantic Web: ESWC 2019 Satellite Events, 90-94. DOI: 10.1007/978-3-030-32327-1_18. Online publication date: 10-Oct-2019.


Published In

CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
October 2018
2362 pages
ISBN:9781450360142
DOI:10.1145/3269206

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. indexing infrastructure
  2. information extraction at scale
  3. knowledge-centric tasks
  4. semantic annotations

Qualifiers

  • Research-article

Conference

CIKM '18

Acceptance Rates

CIKM '18 Paper Acceptance Rate: 147 of 826 submissions, 18%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%


Article Metrics

  • Downloads (last 12 months): 9
  • Downloads (last 6 weeks): 1
Reflects downloads up to 22 Dec 2024
