This repository is part of the analysis done for my master's thesis. The thesis aims to identify groups of fine-grained edited tokens in the revision history of a Wikipedia article. To achieve this research goal, we break it into the following steps:
- We define and identify fine-grained edit tokens, which we call Change Objects.
- We transform Change Objects into Change Vectors of fixed dimension using pre-trained word vectors.
- We create groups of Change Objects by clustering Change Vectors.
- We evaluate the groups of fine-grained Change Objects.
- We compare our algorithm for identifying groups of Change Objects with the algorithm proposed by Bykau et al.
All analysis steps are released as IPython notebooks. We also release examples of the cluster groups created for the article John Logie Baird in the example notebook.
After cloning this repository, get the word vectors from fastText: download the English word vectors and create a directory `/wordvectors` in the root directory where the code is cloned. Save the word vectors in this directory; they are used in the next steps.
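A minimal sketch of loading the downloaded vectors with gensim. The file name `wiki-news-300d-1M.vec` is an assumption; use whichever English `.vec` release you saved in `/wordvectors`:

```python
# Sketch: load the downloaded English fastText vectors with gensim.
# The file name wiki-news-300d-1M.vec is an assumption; use whichever
# English .vec release you saved in /wordvectors.
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format(
    "wordvectors/wiki-news-300d-1M.vec", binary=False
)
print(word_vectors["edit"].shape)  # (300,) -- fastText vectors are 300-dimensional
```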
Inside the wikiconflict directory, create the directory `/data` for storing all the results of the analysis. Inside `/data`, create the following subdirectories; each of them stores data at a different stage of processing (a one-step setup sketch follows the list):
- `/content`
- `/change objects`
- `/change_vector`
- `/bykau_change_object`
- `/annotation`
- `/pre_evaluation`
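```python
# Sketch: create the /data directory tree used by the notebooks.
from pathlib import Path

for name in ["content", "change objects", "change_vector",
             "bykau_change_object", "annotation", "pre_evaluation"]:
    Path("data", name).mkdir(parents=True, exist_ok=True)
```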
As the analysis operates on the edit history of a single article, all steps are performed on a target Wikipedia article. Follow the steps listed below to re-run the analysis:
We use tokens from the WikiWho API to identify edited tokens. Therefore, the first step is to download the complete content of the target article. The tokenised content of the article can be downloaded using the corresponding notebook and is saved in the `/data/content` directory for the next steps of the analysis.
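A minimal sketch of such a download, assuming the WikiWho `all_content` endpoint of the v1.0.0-beta API; verify the endpoint and query parameters against the current WikiWho documentation:

```python
# Sketch: fetch the tokenised content of the target article from the
# WikiWho API and store it under /data/content. The endpoint path and
# query parameters follow the v1.0.0-beta API and may need adjusting.
import json
import requests

article = "John_Logie_Baird"
url = f"https://api.wikiwho.net/en/api/v1.0.0-beta/all_content/{article}/"
response = requests.get(url, params={"o_rev_id": "true", "editor": "true",
                                     "token_id": "true", "out": "true",
                                     "in": "true"})
response.raise_for_status()

with open(f"data/content/{article}.json", "w") as f:
    json.dump(response.json(), f)
```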
From the edited tokens downloaded into the `/data/content` directory, we create Change Objects using the notebook `2_create_change_object-v2.ipynb`. This notebook saves the identified Change Objects in the directory `/data/change objects`.
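Conceptually, a Change Object pairs the edited tokens in a gap with their surrounding context. The field names below are illustrative only; the authoritative schema is defined in the notebook:

```python
# Illustrative structure of a Change Object; the field names are
# hypothetical, the authoritative schema is defined in
# 2_create_change_object-v2.ipynb.
from dataclasses import dataclass
from typing import List

@dataclass
class ChangeObject:
    removed_tokens: List[str]   # tokens deleted in the gap between two revisions
    inserted_tokens: List[str]  # tokens added in the same gap
    left_context: List[str]     # unchanged tokens immediately before the gap
    right_context: List[str]    # unchanged tokens immediately after the gap
    relative_position: float    # position of the gap in the article, in [0, 1]
```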
The next step is to transform the Change Objects stored in `/data/change objects` into 600-dimensional Change Vectors using the pre-trained word vectors downloaded from fastText. The notebook `3_create_change_vector-v3.ipynb` creates different Change Vectors corresponding to different values of the parameter `context_length`; all of them are saved in the `/data/change_vector` directory. A Change Vector is built from the neighbouring tokens of a Change Object, i.e. the tokens to the left and right of the gap of the Change Object. The number of neighbouring tokens taken from the left and right of the gap is called `context_length`. Using the pre-trained word vectors already downloaded into `wordvectors`, we average the vectors corresponding to the words in the neighbouring context. As different values of `context_length` give different neighbouring tokens, we get different Change Vectors for the same Change Object.
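A plausible reading of the 600 dimensions is that the averaged left-context and right-context vectors (300 dimensions each for fastText) are concatenated; a sketch under that assumption:

```python
# Sketch: build one Change Vector from a Change Object. Assumes 300-d
# fastText vectors; concatenating the two 300-d context averages into
# a 600-d vector is an inference from the stated dimensionality.
import numpy as np

def change_vector(change_obj, word_vectors, context_length):
    def mean_vector(tokens):
        vecs = [word_vectors[t] for t in tokens if t in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(300)

    left = mean_vector(change_obj.left_context[-context_length:])   # tokens before the gap
    right = mean_vector(change_obj.right_context[:context_length])  # tokens after the gap
    return np.concatenate([left, right])  # shape: (600,)
```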
The Change Vectors saved in `/data/change_vector` for different values of `context_length` are used to create DBSCAN clusters. These clusters are first evaluated using both intrinsic and extrinsic mechanisms. We compare our clusters to those created by a re-implementation of Bykau et al. DBSCAN has two parameters, `eps` and `min_samples`, which combined with `context_length` give us three parameters of the model against which we evaluate and analyse its performance.
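A minimal clustering sketch with scikit-learn; the file name, `eps` and `min_samples` below are placeholders, not the settings used in the thesis:

```python
# Sketch: cluster the Change Vectors with DBSCAN for one parameter
# setting. File name and parameter values are placeholders.
import numpy as np
from sklearn.cluster import DBSCAN

vectors = np.load("data/change_vector/change_vectors_ctx4.npy")  # hypothetical file
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(vectors)
print(f"{labels.max() + 1} clusters, {np.sum(labels == -1)} noise points")
```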
We use the 16 articles in the small and big article lists for intrinsic evaluation and for the comparison with Bykau et al.
Finally, extrinsic evaluation is done against the gold standard created by annotating the Change Objects in the edit history of the Wikipedia article John Logie Baird.
Being a density-based clustering algorithm, DBSCAN identifies clusters of unequal size. We first analyse the cluster-size distribution using various descriptive statistics and compute the Gini coefficient of the cluster-size distribution.
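The Gini coefficient can be computed directly from the sorted cluster sizes; a sketch with example sizes only:

```python
# Sketch: Gini coefficient of the cluster-size distribution
# (0 = all clusters equal in size, near 1 = a few clusters dominate).
import numpy as np

def gini(sizes):
    x = np.sort(np.asarray(sizes, dtype=float))
    n = len(x)
    index = np.arange(1, n + 1)
    return np.sum((2 * index - n - 1) * x) / (n * np.sum(x))

print(gini([120, 40, 7, 3, 2]))  # example cluster sizes only
```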
To investigate the Change Object groups further intrinsically, we propose several measures based on the assumption that a cluster of vectors created by averaging the word tokens immediately before and after the Change Object should contain similar word tokens and come from similar relative positions in the article. To quantify the variety of words in a cluster, we define token entropy over the edited tokens in the gaps of its Change Objects. Similarly, to quantify the relative positions of the Change Objects in a cluster, we define relative position entropy. All of the intrinsic evaluation analysis is done in the notebook `4_1_clustering-dbscan-intrinsic-evaluation-all.ipynb`.
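Both measures reduce to a Shannon entropy over a within-cluster distribution; a sketch of the token-entropy side (the exact definition, e.g. any normalisation, is in the notebook):

```python
# Sketch: Shannon entropy of the gap tokens within one cluster: 0 when
# every Change Object edits the same token, higher for mixed clusters.
import math
from collections import Counter

def token_entropy(gap_tokens):
    counts = Counter(gap_tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(token_entropy(["infobox", "infobox", "category"]))  # ~0.918 bits
```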
To run the intrinsic evaluation for all 16 articles, we use the script `3_dbscan_intrinsic.py`. The results of the intrinsic evaluation are saved in the `/pre_evaluation` directory. We create visualisations of these results using the [notebook](./notebooks/6_2_Plots%20%28Response%20Variables%29.ipynb).
### Comparison with Bykau et al.
First, we re-implement the algorithm from the paper by Bykau et al. The analysis of the optimisation and clustering suggested by Bykau et al. is done in `5_1_reproduce_fine_grained`. We run this re-implemented algorithm on all the Change Objects saved in the `/data/change objects` directory and save the Change Object groups it creates in `/bykau_change_object`.
### Agreement of our clusters with Bykau et al.
We compute the Fowlkes–Mallows index between the clusters created by our model and those created by Bykau et al. using `5_3_fowlkes_intercluster`.
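A minimal sketch of this comparison using scikit-learn's implementation of the index; the label arrays are hypothetical:

```python
# Sketch: Fowlkes-Mallows agreement between our DBSCAN labels and the
# Bykau et al. labels for the same Change Objects (arrays are hypothetical).
from sklearn.metrics import fowlkes_mallows_score

dbscan_labels = [0, 0, 1, 1, 2]
bykau_labels = [0, 0, 1, 2, 2]
print(fowlkes_mallows_score(dbscan_labels, bykau_labels))
```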
Extrinsic evaluation is done using V-measure analysis, an entropy-based measure consisting of homogeneity, completeness and the V-measure itself. We evaluate our model extrinsically against a gold-standard dataset prepared by human annotators. The extrinsic evaluation is done in `4_2_clustering-dbscan-extrinsic-evaluation-v-measure.ipynb` for the annotations of the article John Logie Baird. Evaluation is done with respect to the three parameters of the model: `context_length`, `eps` and `min_samples`.
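A minimal sketch of the V-measure computation with scikit-learn; the label arrays are illustrative:

```python
# Sketch: homogeneity, completeness and V-measure of predicted clusters
# against the gold-standard annotation labels (arrays are illustrative).
from sklearn.metrics import homogeneity_completeness_v_measure

gold_labels = [0, 0, 1, 1, 2, 2]       # human annotation of Change Objects
predicted_labels = [0, 0, 1, 1, 1, 2]  # DBSCAN output for one parameter setting
h, c, v = homogeneity_completeness_v_measure(gold_labels, predicted_labels)
print(f"homogeneity={h:.2f} completeness={c:.2f} v-measure={v:.2f}")
```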
### Extrinsic evaluation against Bykau et al.
We further evaluate the Bykau et al. clusters using the annotations of John Logie Baird. To give a fair comparison, we implement Bykau et al. both with and without optimisation. `5_2_1_compare-with-bykau_with_optimizations-v-measure.ipynb` evaluates Bykau et al. with optimisation, whereas `5_2_2_compare-with-bykau-without_optimizations-vmeasure.ipynb` evaluates it without optimisation.