Article
DOI: 10.1007/978-3-540-30586-6_24

Name discrimination by clustering similar contexts

Published: 13 February 2005

Abstract

It is relatively common for different people or organizations to share the same name. Given the increasing amount of information available online, this creates an ever-growing risk of finding misleading or incorrect information because of the confusion an ambiguous name causes. This paper presents an unsupervised approach that resolves name ambiguity by clustering the instances of a given name into groups, each of which is associated with a distinct underlying entity. The features we employ to represent the context of an ambiguous name are statistically significant bigrams that occur in the same context as the ambiguous name. From these features we create a co-occurrence matrix in which the rows and columns represent the first and second words of the bigrams, and the cells contain their log-likelihood scores. We then represent each context in which an ambiguous name appears with a second-order context vector, created by averaging the vectors from the co-occurrence matrix associated with the words that make up that context. This yields a high-dimensional “instance by word” matrix that is reduced to its most significant dimensions by Singular Value Decomposition (SVD). The different “meanings” of a name are discriminated by clustering these second-order context vectors with the method of Repeated Bisections. We evaluate this approach by conflating pairs of names found in a large corpus of text to create ambiguous pseudo-names. We find that our method is significantly more accurate than the majority classifier, and that the best results are obtained by having a small amount of local context to represent the instance, along with a larger amount of context for identifying features, or vice versa.
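
The first two steps described in the abstract (scoring bigrams by log-likelihood and averaging co-occurrence rows into a second-order context vector) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `min_score` cutoff, and the toy corpus are assumptions made for illustration, and the later SVD and Repeated Bisections steps are left out.

```python
import math
from collections import Counter


def log_likelihood(n11, n1p, np1, npp):
    """G^2 log-likelihood ratio for a 2x2 bigram contingency table.

    n11: count of the bigram (w1, w2)
    n1p: count of bigrams whose first word is w1
    np1: count of bigrams whose second word is w2
    npp: total number of bigrams
    """
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    expected = [
        n1p * np1 / npp,
        n1p * (npp - np1) / npp,
        (npp - n1p) * np1 / npp,
        (npp - n1p) * (npp - np1) / npp,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def cooccurrence_matrix(tokens, min_score=0.0):
    """Map first word -> {second word: log-likelihood score}.

    min_score is a hypothetical cutoff: bigrams scoring at or below it
    are dropped (the source does not state which threshold was used).
    """
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    first, second = Counter(), Counter()
    for (w1, w2), c in counts.items():
        first[w1] += c
        second[w2] += c
    matrix = {}
    for (w1, w2), c in counts.items():
        score = log_likelihood(c, first[w1], second[w2], total)
        if score > min_score:
            matrix.setdefault(w1, {})[w2] = score
    return matrix


def second_order_vector(context, matrix, columns):
    """Average the matrix rows of the context words that have a row."""
    rows = [matrix[w] for w in context if w in matrix]
    if not rows:
        return [0.0] * len(columns)
    return [sum(r.get(c, 0.0) for r in rows) / len(rows) for c in columns]


if __name__ == "__main__":
    # Toy corpus, purely illustrative.
    corpus = "the bank raised rates while the river bank flooded".split()
    matrix = cooccurrence_matrix(corpus)
    columns = sorted({w2 for row in matrix.values() for w2 in row})
    print(second_order_vector(["the", "river", "bank"], matrix, columns))
```

Stacking one such vector per instance of the ambiguous name gives the “instance by word” matrix that the abstract then reduces with SVD and clusters.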

Published In

CICLing'05: Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing
February 2005
826 pages
ISBN: 3540245235
Editor: Alexander Gelbukh

Publisher

Springer-Verlag

Berlin, Heidelberg

Cited By

  • (2019) ANDMC: An Algorithm for Author Name Disambiguation Based on Molecular Cross Clustering. Database Systems for Advanced Applications, pages 173-185. DOI: 10.1007/978-3-030-18590-9_12. Online publication date: 22-Apr-2019
  • (2018) A retrospective of knowledge graphs. Frontiers of Computer Science: Selected Publications from Chinese Universities, 12:1, pages 55-74. DOI: 10.1007/s11704-016-5228-9. Online publication date: 1-Feb-2018
  • (2017) A narrow-domain entity recognition method based on domain relevance measurement and context information. Proceedings of the International Conference on Web Intelligence, pages 623-628. DOI: 10.1145/3106426.3106470. Online publication date: 23-Aug-2017
  • (2017) Toponym disambiguation in historical documents using semantic and geographic features. Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, pages 175-180. DOI: 10.1145/3078081.3078099. Online publication date: 1-Jun-2017
  • (2016) CohEEL. Web Semantics: Science, Services and Agents on the World Wide Web, 37:C, pages 75-89. DOI: 10.1016/j.websem.2016.03.001. Online publication date: 1-Mar-2016
  • (2014) The TALP participation at ERD 2014. Proceedings of the first international workshop on Entity recognition & disambiguation, pages 89-94. DOI: 10.1145/2633211.2634359. Online publication date: 11-Jul-2014
  • (2013) Towards a fair comparison between name disambiguation approaches. Proceedings of the 10th Conference on Open Research Areas in Information Retrieval, pages 17-20. DOI: 10.5555/2491748.2491752. Online publication date: 15-May-2013
  • (2013) Domain-Independent Entity Coreference for Linking Ontology Instances. Journal of Data and Information Quality, 4:2, pages 1-29. DOI: 10.1145/2435221.2435223. Online publication date: 1-Mar-2013
  • (2012) An entity-topic model for entity linking. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 105-115. DOI: 10.5555/2390948.2390962. Online publication date: 12-Jul-2012
  • (2012) Named entity disambiguation in streaming data. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 815-824. DOI: 10.5555/2390524.2390639. Online publication date: 8-Jul-2012
