Personal copy of libexttextcat repository
giuliopaci/libexttextcat
libexttextcat is an N-Gram-Based Text Categorization library primarily intended for language guessing.

Fundamentally this is an adaptation of the wise-guys libtextcat, extended to be UTF-8 aware. See README.libtextcat for details.

Quickstart: language guesser

Assuming that you have successfully compiled the library, you need some language models to start guessing languages. A collection of over 70 language models, mostly derived from Gertjan van Noord's "TextCat" package, is bundled with a matching configuration file in the langclass directory:

* cd langclass/
* ../src/testtextcat conf.txt

Paste some text onto the command line, and watch it get classified.

Using the API:

Classifying the language of a text buffer can be as easy as:

    #include "textcat.h"
    ...
    void *h = textcat_Init( "conf.txt" );
    ...
    printf( "Language: %s\n", textcat_Classify(h, buffer, 400) );
    ...
    textcat_Done(h);

Creating your own fingerprints:

The createfp program allows you to easily create your own document fingerprints. Just feed it an example document on standard input, and store the standard output:

    % createfp < mydocument.txt > myfingerprint.txt

Put the names of your fingerprints in a configuration file, add some IDs, and you're ready to classify.

Performance tuning:

This library was made with efficiency in mind. There are a couple of parameters you may wish to tweak if you intend to use it for tasks other than language guessing.

The most important thing is buffer size. For reliable language guessing the classifier needs only a few hundred bytes at most, so don't feed it 100KB of text unless you are creating a fingerprint.

If you insist on feeding the classifier lots of text, try fiddling with TABLEPOW, which determines the size of the hash table that is used to store the n-grams. Making it too small will result in many hash table clashes; making it too large will cause wild memory behaviour. Both are bad for performance.
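To get a feel for the TABLEPOW trade-off described above, the following is a minimal sketch, not the library's code. It assumes TABLEPOW is the base-2 logarithm of the table size (an assumption inferred from the name), and it uses FNV-1a as a stand-in hash; the library's own hash function may well differ. The helpers fnv1a and count_clashes are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* FNV-1a (32-bit). A stand-in only: not necessarily the hash the library uses. */
static uint32_t fnv1a(const char *s, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Hypothetical helper: hash `count` distinct keys into a table of
 * (1 << tablepow) slots and return how many keys landed on a slot that
 * was already occupied. Assumes the table size is a power of two, so a
 * bit mask can replace the modulo. Returns (size_t)-1 if allocation fails.
 */
static size_t count_clashes(unsigned tablepow, size_t count)
{
    assert(tablepow < 31);

    size_t size = (size_t)1 << tablepow;
    size_t mask = size - 1;
    size_t clashes = 0;
    unsigned char *used = calloc(size, 1);

    if (used == NULL)
        return (size_t)-1;

    for (size_t i = 0; i < count; i++) {
        char key[24];
        /* Decimal strings "0", "1", ... serve as distinct synthetic keys. */
        int len = snprintf(key, sizeof key, "%zu", i);
        size_t slot = fnv1a(key, (size_t)len) & mask;

        if (used[slot])
            clashes++;      /* slot already taken by an earlier key */
        else
            used[slot] = 1; /* first key to claim this slot */
    }

    free(used);
    return clashes;
}
```

Calling count_clashes(8, 2000) and count_clashes(13, 2000) side by side shows the effect: with only 256 slots, at least 1744 of the 2000 keys must clash, while the larger table costs 32 times the memory to reduce that number. The real n-gram distribution of a text will behave differently from these synthetic keys, so treat the counts as illustrative only.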
Putting the most probable models at the top of the list in your config file improves performance, because this will raise the threshold for likely candidates more quickly.

Since the running time of the classifier is roughly linear in the number of models, you should consider how many models you really need. In the case of language guessing: do you really want to recognize every language ever invented?

Acknowledgements

UTF-8 conversion and adaptation for OpenOffice.org: Jocelyn Merand
Original libTextCat: Frank Scheelen & Rob de Wit at wise-guys.nl
Language models: copyright Gertjan van Noord

References:

[1] The document that started it all can be downloaded at John M. Trenkle's site: N-Gram-Based Text Categorization
    http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz

[2] The Perl implementation by Gertjan van Noord (code + language models), downloadable from his website:
    http://odur.let.rug.nl/~vannoord/TextCat/

[3] Original libtextcat implementation:
    http://software.wise-guys.nl/libtextcat/

Contact:

Questions or patches can be directed to libreoffice@lists.freedesktop.org.
Bugs can be directed to https://bugs.freedesktop.org