Matrix and double-array representations for efficient finite state tokenization
Diewald, 2022
- Document ID
- 5890568513607868029
- Author
- Diewald N
- Publication year
- 2022
- Publication venue
- Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
Snippet
This paper presents an algorithm and implementation for efficient tokenization of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is …
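As a rough illustration of the matrix idea mentioned in the snippet, the following minimal sketch drives a tokenizer from a transition matrix indexed by state and character class. It is not the paper's automaton or implementation: the three states, the three character classes, and the names `TRANSITIONS`, `char_class`, and `tokenize` are all assumptions made for this toy example, and the double-array representation is not shown.

```python
# Toy DFA tokenizer driven by a transition matrix (illustrative assumption,
# not the automaton described in the paper).

# Character classes: the columns of the matrix.
WS, WORD, PUNCT = 0, 1, 2

# States: the rows of the matrix.
BETWEEN, IN_WORD, IN_PUNCT = 0, 1, 2

# TRANSITIONS[state][char_class] gives the next state.
TRANSITIONS = [
    # WS       WORD     PUNCT
    [BETWEEN, IN_WORD, IN_PUNCT],  # from BETWEEN
    [BETWEEN, IN_WORD, IN_PUNCT],  # from IN_WORD
    [BETWEEN, IN_WORD, IN_PUNCT],  # from IN_PUNCT
]


def char_class(ch):
    """Map a character to its column in the transition matrix."""
    if ch.isspace():
        return WS
    if ch.isalnum():
        return WORD
    return PUNCT


def tokenize(text):
    """Split text into word tokens and punctuation runs."""
    tokens = []
    state = BETWEEN
    start = 0
    for i, ch in enumerate(text):
        nxt = TRANSITIONS[state][char_class(ch)]
        if nxt != state:
            # A state change closes the current token, if there is one.
            if state != BETWEEN:
                tokens.append(text[start:i])
            start = i
        state = nxt
    # Flush the token that is still open at the end of the input.
    if state != BETWEEN:
        tokens.append(text[start:])
    return tokens


if __name__ == "__main__":
    print(tokenize("Der Hund bellt..."))
```

Each input character costs one table lookup, which is the property that makes a matrix layout attractive; a real automaton for German would have far more states and symbols than this sketch.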
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2765—Recognition
- G06F17/277—Lexical analysis, e.g. tokenisation, collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/21—Text processing
- G06F17/22—Manipulating or registering by use of codes, e.g. in sequence of text characters
- G06F17/2217—Character encodings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2705—Parsing
- G06F17/271—Syntactic parsing, e.g. based on context-free grammar [CFG], unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2705—Parsing
- G06F17/2715—Statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/28—Processing or translating of natural language
- G06F17/2863—Processing of non-latin text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2795—Thesaurus; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/274—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/28—Processing or translating of natural language
- G06F17/2809—Data driven translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/21—Text processing
- G06F17/211—Formatting, i.e. changing of presentation of document
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/28—Processing or translating of natural language
- G06F17/2872—Rule based translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30943—Information retrieval; Database structures therefor; File system structures therefor details of database functions independent of the retrieved data type
- G06F17/30964—Querying
- G06F17/30979—Query processing
- G06F17/30985—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F2207/00—Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F2207/02—Indexing scheme relating to groups G06F7/02 - G06F7/026
- G06F2207/025—String search, i.e. pattern matching, e.g. find identical word or best match in a string
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US5610812A (en) | Contextual tagger utilizing deterministic finite state transducer | |
Sudkamp | Languages and machines: an introduction to the theory of computer science | |
Xue et al. | The penn chinese treebank: Phrase structure annotation of a large corpus | |
US7552051B2 (en) | Method and apparatus for mapping multiword expressions to identifiers using finite-state networks | |
US7191177B2 (en) | Keyword extracting device | |
US20060047500A1 (en) | Named entity recognition using compiler methods | |
US20060136196A1 (en) | Bi-dimensional rewriting rules for natural language processing | |
JPH0351020B2 (en) | ||
US20080208566A1 (en) | Automated word-form transformation and part of speech tag assignment | |
Antony et al. | Computational morphology and natural language parsing for Indian languages: a literature survey | |
US7346511B2 (en) | Method and apparatus for recognizing multiword expressions | |
Agbago et al. | Truecasing for the Portage system | |
Diewald | Matrix and double-array representations for efficient finite state tokenization | |
Zaman et al. | Leveraging Bidirectionl LSTM with CRFs for Pashto tagging | |
Zhou et al. | A hybrid approach to Chinese word segmentation around CRFs | |
Broda et al. | Towards a set of general purpose morphosyntactic tools for Polish | |
Hu et al. | Chinese named entity recognition with CRFs: Two levels | |
Nguyen et al. | An in-depth analysis of OCR errors for unconstrained Vietnamese handwriting | |
Saini et al. | Shahmukhi to Gurmukhi transliteration system: A corpus based approach | |
KR100283100B1 (en) | Statistical Application Extraction Method and Method for Massive Coral | |
JP2002351870A (en) | Method for analyzing morpheme | |
AlGahtani et al. | Joint Arabic segmentation and part-of-speech tagging | |
Shokrollahi-Far | Self-Organizing Computational Efficiency in Quranic Grammar | |
EP1429257B1 (en) | Method and apparatus for recognizing multiword expressions | |
Sassano | Using a partially annotated corpus to build a dependency parser for Japanese |