8000 GitHub - huberemanuel/porttagger: A Brazilian Portuguese part-of-speech tagger according to Universal Dependencies
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

huberemanuel/porttagger

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Porttagger - Brazilian Portuguese Part-of-Speech Tagger

This repository contains the code necessary to reproduce results of my master's dissertation "Etiquetagem morfossintática multigênero para o português do Brasil segundo o modelo Universal Dependencies".

If just want to use the created tagger without downloading anything, you can play with it in a live demo available at Hugginface Spaces 🤗.

The main goal of this work is to create a robust multigenre part-of-speech tagger for Brazillian Portuguese. In the porttinari folder can be found the training notebooks to reproduce the evaluation of seven tagging methods on the Porttinari-base corpus. You can find the artifacts of each experiment in google drive for the CNCSR, Meta-Bilstm, Stanza, and UDPipe 2 models. For the Transformer-based models, the results of the experiments are stored at wandb.

After experimentation and the selection of the BERTimbau-base model in the evaluation process in a single corpus, the multigenre evaluation was made in three corpora with distintict genres with similar annotation guidelines:

  • Porttinari-base: A subset of the Porttinari treebank and contains 8,420 news sentences and 168,400 tokens.
  • DANTEStocks: A total of 4,048 Brazilian stock market tweets and 81,048 tokens.
  • PetroGold: A total of 8,946 academic texts from the oil & gas domain and 250,905 tokens.

The notebook for reproducibility is the same one used in the Porttinari-base (single corpus) experiment, therefore, it can be located in the porttinari/notebooks dir. The predictions and confusion matrices can be found at the multigenre/outputs folder.

For error analysis notebooks, they can be found in the notebooks folder.

This project was supported by Centro de Inteligência Artificial (C4AI-USP), Fundação de Apoio à Pesquisa do Estado de São Paulo (FAPESP), IBM Corporation, and Motorola through the Ministério da Ciência e Tecnologia (MCTI).


About

A Brazilian Portuguese part-of-speech tagger according to Universal Dependencies

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published
0