bk2vec

Injecting background knowledge into the word vectors

How to train word vectors

Place your texts under the corpora folder. The script assumes one text per line and splits sentences on the period character.
Launch TextReader.py. It trains word vectors and saves them to the model.w2v file.
Feel free to implement a proper script that takes parameters instead of relying on the hardcoded ones.
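
As a starting point for such a parameterized script, here is a minimal sketch. It is not the code in TextReader.py: the function names, the command-line flags, and their defaults (other than the corpora folder and model.w2v file names mentioned above) are hypothetical, and the training step assumes gensim 4.x is installed, which this repository may not use.

```python
import argparse
import os


def split_sentences(text):
    """Split one text on '.' and tokenize each sentence on whitespace."""
    sentences = []
    for chunk in text.split("."):
        tokens = chunk.split()
        if tokens:  # skip empty pieces produced by trailing or repeated dots
            sentences.append(tokens)
    return sentences


def read_corpora(folder):
    """Collect tokenized sentences from every file in folder (one text per line)."""
    sentences = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                sentences.extend(split_sentences(line))
    return sentences


def train(sentences, output, size, window, min_count):
    """Train and save word vectors.

    Assumption: gensim >= 4.0 is available; the repository's own
    training code may work differently.
    """
    from gensim.models import Word2Vec  # imported lazily, only when training

    model = Word2Vec(
        sentences=sentences,
        vector_size=size,
        window=window,
        min_count=min_count,
    )
    model.save(output)


def parse_args(argv=None):
    """Hypothetical command-line flags replacing the hardcoded parameters."""
    parser = argparse.ArgumentParser(description="Train word vectors")
    parser.add_argument("--corpora", default="corpora")
    parser.add_argument("--output", default="model.w2v")
    parser.add_argument("--size", type=int, default=100)
    parser.add_argument("--window", type=int, default=5)
    parser.add_argument("--min-count", type=int, default=5)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    sentences = read_corpora(args.corpora)
    train(sentences, args.output, args.size, args.window, args.min_count)
```

Calling `main(["--corpora", "corpora", "--output", "model.w2v"])` would read every file in the folder and write the model. The sentences are held in a list because gensim iterates over the corpus more than once; for large corpora a restartable iterator would be a better fit.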

How to extract texts from the wikipedia dump

java -cp thewikimachine.jar org.fbk.cit.hlt.thewikimachine.xmldump.WikipediaTextExtractor -d <path-to-dump.xml> -o <path-to-output-directory> -t <number-of-threads>
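
If you prefer to drive the extraction from Python, the command above can be wrapped as in the sketch below. The helper names are hypothetical. Only the jar name, the class name, and the -d, -o, and -t flags come from this README, and the jar is assumed to be in the current working directory.

```python
import subprocess

# Class name taken from the command in this README.
EXTRACTOR_CLASS = "org.fbk.cit.hlt.thewikimachine.xmldump.WikipediaTextExtractor"


def build_extractor_command(dump_path, output_dir, threads, jar="thewikimachine.jar"):
    """Build the argument list for the Wikipedia text extractor."""
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return [
        "java", "-cp", jar, EXTRACTOR_CLASS,
        "-d", dump_path,
        "-o", output_dir,
        "-t", str(threads),
    ]


def run_extractor(dump_path, output_dir, threads):
    """Run the extractor and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_extractor_command(dump_path, output_dir, threads), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the dump or output path contains spaces.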

