Implementation of the averaged perceptron for POS tagging and NER (NLP).
Part -1: Multiclass perceptron. Includes 3 files for preprocessing, learning, and classification.
- percepinputpreprocessing.py:
Generates feature vectors for the training data. The training data must be provided in the following format: the class name followed by the text content of that class, all on one line. The training data can contain multiple lines, and each line holds the data from one class or one file.
The module takes 2 arguments. The output file contains the features extracted from the training file.
- perceplearn.py:
Generates the trained model for classification. The output of the module is a model file for the training data, which contains a vector of features with their associated weights for each class. The module takes 3 arguments: the feature file generated by preprocessing the training data, the file in which the trained model is to be stored, and the -h option followed by the path to a development set. The -h option indicates that a development set is provided during training so that the accuracy can be checked after each training iteration. The development set has the same format as the training set.
- percepclassify.py:
Classifies a given document into a class based on the generated model. The module takes 1 argument: the model file generated by learning from the training data. The text to be classified is provided through STDIN, and the class to which the content belongs is printed to STDOUT.
ex: cat testfile | python3 percepclassify.py
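The three steps above (feature extraction, learning, classification) can be sketched roughly as follows. This is a minimal sketch and not the code in the three modules: it assumes bag-of-words count features and a space between the class name and the text, and the function names extract_features, score, classify, and train are hypothetical. The averaging shown here is the simplest variant (sum the weights after every example, then divide by the number of examples seen).

```python
from collections import Counter, defaultdict


def extract_features(line):
    """Split 'CLASSNAME text ...' into (label, bag-of-words counts)."""
    label, _, text = line.strip().partition(" ")
    return label, Counter(text.split())


def score(weights, feats):
    """Dot product of one class's weights with a feature-count vector."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


def classify(model, feats):
    """Return the highest-scoring class (ties go to the first name in sorted order)."""
    return max(sorted(model), key=lambda c: score(model[c], feats))


def train(lines, iterations=10):
    """Averaged multiclass perceptron over lines in the assumed format."""
    data = [extract_features(line) for line in lines if line.strip()]
    classes = sorted({label for label, _ in data})
    w = {c: defaultdict(float) for c in classes}      # current weights
    total = {c: defaultdict(float) for c in classes}  # running sum of weights
    steps = 0
    for _ in range(iterations):
        for label, feats in data:
            guess = classify(w, feats)
            if guess != label:
                # Standard perceptron update: promote the true class,
                # demote the wrongly predicted class.
                for f, v in feats.items():
                    w[label][f] += v
                    w[guess][f] -= v
            # Naive averaging: add the current weights to the running sum.
            for c in classes:
                for f, v in w[c].items():
                    total[c][f] += v
            steps += 1
    return {c: {f: v / steps for f, v in total[c].items()} for c in classes}
```

Summing every weight after every example is slow on large feature sets; it is written this way only to keep the averaging step easy to read.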
Part -2: POS tagger using the multiclass perceptron. Includes 3 files for preprocessing, learning, and classification.
- pospreprocessing.py:
Generates features for learning. The input is a text file containing the training data as a collection of slash-separated word/tag tokens. The argument list is the same as in part1-(1).
- poslearn.py:
Generates the model from the training data. The argument list is the same as in part1-(2). The development data has the same format as the training data.
- postag.py:
Outputs the POS-tagged text in the same format as the training data. The argument list and the input text follow the same format as in part1-(3).
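To reuse the multiclass perceptron for tagging, each token has to become one classification example whose class is its tag. The sketch below shows one way to do that. It assumes word/TAG tokens separated by whitespace and uses the previous, current, and next word as features; the actual feature set used by pospreprocessing.py is not stated here, and the name pos_examples and the boundary markers are hypothetical.

```python
def pos_examples(line):
    """Turn one 'word/TAG word/TAG ...' line into (tag, features) pairs."""
    # rsplit keeps any slashes inside the word itself and splits off only the tag.
    pairs = [token.rsplit("/", 1) for token in line.split()]
    # Pad with boundary markers so the first and last words have neighbours.
    words = ["<s>"] + [word for word, _ in pairs] + ["</s>"]
    examples = []
    for i, (_, tag) in enumerate(pairs, start=1):
        feats = [
            "prev:" + words[i - 1],
            "cur:" + words[i],
            "next:" + words[i + 1],
        ]
        examples.append((tag, feats))
    return examples
```

Writing each pair out as the tag followed by its features on one line gives the same "class name followed by content" layout that Part 1 expects.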
Part -3: Named entity recognition using the multiclass perceptron. Includes 3 files for preprocessing, learning, and classification.
- nerpreprocessing.py:
Generates features for learning. The input is a text file containing the training data as a collection of tokens with three slash-separated fields, the last of which is the NER tag. The argument list is the same as in part1-(1).
- nelearn.py:
Generates the model from the training data. The argument list is the same as in part1-(2). The development data has the same format as the training data.
- netag.py:
Outputs the input text with NER tags added. The input tokens contain only the first two slash-separated fields; the output appends the NER tag to each token so that it looks the same as the training data. The argument list is the same as in part1-(3).
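The tagging step can be pictured as follows: every input token is turned into features, a tag is predicted, and the tag is appended after another slash. This is only a sketch under assumptions. The function name tag_line, the context features, and the predict callable (a stand-in for the trained perceptron) are hypothetical, and the tag names in the usage example are purely illustrative.

```python
def tag_line(line, predict):
    """Append a predicted tag to every whitespace-separated token of a line.

    predict is any callable that maps a list of feature strings to a tag.
    """
    tokens = line.split()
    # Pad with boundary markers so the first and last tokens have neighbours.
    padded = ["<s>"] + tokens + ["</s>"]
    tagged = []
    for i, token in enumerate(tokens, start=1):
        feats = [
            "prev:" + padded[i - 1],
            "cur:" + token,
            "next:" + padded[i + 1],
        ]
        tagged.append(token + "/" + predict(feats))
    return " ".join(tagged)


def dummy_predict(feats):
    """Toy stand-in for a trained model, used only to show the output shape."""
    return "B-PER" if feats[1].startswith("cur:John") else "O"


print(tag_line("John/NNP runs/VBZ", dummy_predict))
```

With the toy predictor above, each two-field input token comes back with a third field, matching the layout of the training data.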
Enhancements to be worked on:
- Limiting the number of training iterations by computing the error rate after each iteration.
- Improving the accuracy of NER by considering additional features for learning.
- Speed.
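The first enhancement, stopping training based on the error rate, could look roughly like the sketch below. It is one possible approach and not something the current modules do: training stops once the development-set error has failed to improve for a set number of iterations. The names train_with_early_stopping, run_epoch, dev_error, and patience are all hypothetical.

```python
def train_with_early_stopping(run_epoch, dev_error, max_iterations=30, patience=2):
    """Run training epochs until the dev error stops improving.

    run_epoch performs one pass over the training data.
    dev_error returns the current error rate on the development set.
    Returns (best_iteration, best_error).
    """
    best_error = float("inf")
    best_iteration = 0
    stale = 0  # consecutive iterations without improvement
    for iteration in range(1, max_iterations + 1):
        run_epoch()
        error = dev_error()
        if error < best_error:
            best_error, best_iteration, stale = error, iteration, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_iteration, best_error
```

In practice the model weights from the best iteration would also need to be saved, so that the stored model file matches the reported error rate.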