Porttagger - Brazilian Portuguese Part-of-Speech Tagger

This repository contains the code necessary to reproduce results of my master's dissertation "Etiquetagem morfossintática multigênero para o português do Brasil segundo o modelo Universal Dependencies".

If just want to use the created tagger without downloading anything, you can play with it in a live demo available at Hugginface Spaces 🤗.

The main goal of this work is to create a robust multigenre part-of-speech tagger for Brazillian Portuguese. In the porttinari folder can be found the training notebooks to reproduce the evaluation of seven tagging methods on the Porttinari-base corpus. You can find the artifacts of each experiment in google drive for the CNCSR, Meta-Bilstm, Stanza, and UDPipe 2 models. For the Transformer-based models, the results of the experiments are stored at wandb.

After experimentation and the selection of the BERTimbau-base model in the evaluation process in a single corpus, the multigenre evaluation was made in three corpora with distintict genres with similar annotation guidelines:

Porttinari-base: A subset of the Porttinari treebank and contains 8,420 news sentences and 168,400 tokens.
DANTEStocks: A total of 4,048 Brazilian stock market tweets and 81,048 tokens.
PetroGold: A total of 8,946 academic texts from the oil & gas domain and 250,905 tokens.

The notebook for reproducibility is the same one used in the Porttinari-base (single corpus) experiment, therefore, it can be located in the porttinari/notebooks dir. The predictions and confusion matrices can be found at the multigenre/outputs folder.

For error analysis notebooks, they can be found in the notebooks folder.

This project was supported by Centro de Inteligência Artificial (C4AI-USP), Fundação de Apoio à Pesquisa do Estado de São Paulo (FAPESP), IBM Corporation, and Motorola through the Ministério da Ciência e Tecnologia (MCTI).

NILC (Núcleo Interinstitucional de Linguística Computacional) log
55DD
o

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
assets		assets
data		data
multigenre/outputs		multigenre/outputs
notebooks		notebooks
porttinari		porttinari
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Porttagger - Brazilian Portuguese Part-of-Speech Tagger

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

huberemanuel/porttagger

Folders and files

Latest commit

History

Repository files navigation

Porttagger - Brazilian Portuguese Part-of-Speech Tagger

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages