This is an individual project based on a Naive Bayesian Spell Checker. This project is an extension on an lab question of CS2180 Artificial Intelligence Lab.
The default Dataset used is the War and Peace book by Leo Tolstoy from Project Gutenburg Website (https://www.gutenberg.org/cache/epub/2600/pg2600.txt). However, an option is provided of using own dataset.
How the Spell Checker works:
- A frequency table (in form of a dictionary) is prepared from the dataset. This serves two purposes: i. Every unique word from the dataset is treated as valid word. ii. Each word has a frequency associated with it. This is used in calculating the Probability of its occurence (Higher the frequency, higher its probability of being referred).
- The input to the class is your word file (in txt format). The spell checker traverses through each word (assumes that it is misspelled), and performs a search for the most likely valid word (which can be the same word)
- Essentially it is a Bayesian Classification. It classifies the word into different valid words which can be obtained from it (via permuting/deletion/insertion/replacement), finds the probability of that classification, and chooses the one which maximum probability.
- What makes it Naive is the fact that we assume that the occurence of each word is independent of other words. Although this is a imperfect assumption, it provides a decent prediction.
- The conditional probability function requires a few base variables, which have a default value set. However, you can provide your own values for them.
The Alternate valid words (or classifications) are obtained as following:
- Cyclic Permutation: O(n)
- Swapping one character : O(n^2)
- Deleting one character : O(n)
- Inserting one character : O(n^2)
- Replacing one character : O(n^2) There are two sets, Fw and Gw, with Fw being the set of all variations of a word, and Gw being the set of all variations of the elements in Fw. Further changes in terms of more characters can be made to increase the accuracy, but keeping in mind the speed of the checker, it is not recommended.