Language resources
In the course of our R&D activities, and as instrumental assets for the execution of our projects, we have developed, or are developing, the following language resources:
BDCamões Collection of Portuguese Literary Documents
Collection of literary documents in Portuguese with almost 4 million tokens, to support research in language technology and digital humanities. The words in these documents are automatically annotated with morphological information, and the sentences with constituency and dependency syntactic analyses. Distributed via PORTULAN CLARIN.
LX-DSemVectors
Distributional semantic representations of Portuguese words (aka word embeddings). Distributed via GitHub.
LX-4WAnalogies
Test set of four-word analogies for evaluating distributional semantic representations of Portuguese words (aka word embeddings). Distributed via GitHub.
TimeBankPT
Portuguese corpus with rich temporal annotations that follow the TimeML conventions. It annotates not only temporal expressions but also events and temporal relations. This corpus is the result of translating and adapting to Portuguese the English corpus used in the first TempEval challenge. Distributed via META-SHARE (search for "PT").
DeepBankPT
Bank of deep grammatical representations, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with their fully fledged grammatical representations, according to an HPSG grammar. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English. Distributed via META-SHARE (search for "PT").
LogicalFormBankPT
Bank of logical forms, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with logical forms representing their meaning. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English. Distributed via META-SHARE (search for "PT").
DependencyBankPT
Dependency bank of Portuguese, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with graphs representing grammatical dependencies, whose arcs are decorated with grammatical functions and semantic roles. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English. Distributed via META-SHARE (search for "PT").
PropBankPT
PropBank of Portuguese, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with syntactic constituency trees decorated with grammatical functions and semantic roles. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English. Distributed via META-SHARE (search for "PT").
TreebankPT
Treebank of Portuguese, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with syntactic constituency trees. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English. Distributed via META-SHARE (search for "PT").
CINTIL-QATreeBank
Corpus of Portuguese interrogative and imperative sentences. This treebank takes declarative sentences from the pre-existing CINTIL-Treebank and manually transforms their syntactic structure into non-declarative counterparts: interrogative and imperative clauses. Distributed via META-SHARE (search for "CINTIL").
CINTIL-Definitions
Corpus of Portuguese definitions. Annotated corpus (POS tags and morphological information) with an additional layer of annotation marking definitions. Distributed via META-SHARE (search for "CINTIL").
CINTIL-DeepBank
Bank of deep grammatical representations: corpus of Portuguese sentences annotated with their fully fledged grammatical representations, according to an HPSG grammar. Distributed via META-SHARE (search for "CINTIL").
CINTIL-LogicalFormBank
Bank of logical forms: corpus of Portuguese sentences annotated with logical forms representing their meaning. Distributed via META-SHARE (search for "CINTIL").
CINTIL-DependencyBank
Dependency bank of Portuguese: corpus of Portuguese sentences annotated with graphs representing grammatical dependencies, whose arcs are decorated with grammatical functions and semantic roles. Distributed via META-SHARE (search for "CINTIL").
CINTIL-DependencyBank PREMIUM
Dependency bank of Portuguese: corpus of Portuguese sentences annotated with graphs representing grammatical dependencies, whose arcs are decorated with grammatical functions and semantic roles. Distributed via META-SHARE (search for "CINTIL").
CINTIL-PropBank
PropBank of Portuguese: corpus of Portuguese sentences annotated with syntactic constituency trees decorated with grammatical functions and semantic roles. Distributed via META-SHARE (search for "CINTIL").
CINTIL-Treebank
Treebank of Portuguese: corpus of Portuguese sentences annotated with syntactic constituency trees. Distributed via META-SHARE (search for "CINTIL").
CINTIL-WordSenses
CINTIL extended by annotating word tokens with the identifiers of the concepts (synsets) they express, with these identifiers belonging to the MWNPT-International WordNet of Portuguese. Distributed via META-SHARE (search for "CINTIL").
CINTIL-NamedEntities
CINTIL extended by annotating named entities, manually disambiguated and linked to the appropriate pages in the Portuguese DBpedia. Distributed via META-SHARE (search for "CINTIL").
CINTIL-Corpus Internacional do Português
High-quality, linguistically interpreted corpus of Portuguese with 1 million tokens, accurately hand-tagged with respect to part of speech (POS), inflection and named entities. Developed and maintained in cooperation with CLUL-Centro de Linguística da Universidade de Lisboa.
CINTIL Concordancer
Advanced, freely available online concordancer for the CINTIL corpus. Developed and maintained in cooperation with CLUL-Centro de Linguística da Universidade de Lisboa.
CINTIL TagSet
Exhaustive set of part-of-speech tags for Portuguese, including coverage of transcriptions of verbal productions. This is the tagset used in the annotation of the CINTIL corpus.
CINTIL Annotation Manual
Companion manual for the CINTIL corpus, with explicit guidelines for annotation and interpretation.
LX-VerbalInflections
Collection of the verb forms of Portuguese verbs, each associated with information on its inflection features.
LX-Abbreviations
Collection of Portuguese abbreviations of different types. Each type of abbreviation is manually divided and annotated with grammatical categories, gender and number, and, finally, with the respective full expression. Distributed via META-SHARE (search for "LX").
LX-StopWords
List of 2631 Portuguese words of 51 types. The words are grouped into three broad classes (closed classes, open classes, and multi-word units), arranged according to their morpho-syntactic category and inflectional feature values. Distributed via META-SHARE (search for "LX").
MWNPT-International WordNet of Portuguese
WordNet of Portuguese, developed in cooperation with the MultiWordNet project of FBK-Foundation Bruno Kessler, from Trento, Italy.
QTLeap WSD/NED Multilingual Parallel Corpora
QTLeap Multilingual Parallel Corpora extended with two layers of automatic annotation: named entities, disambiguated and linked to the appropriate pages in DBpedia, and word tokens, annotated with the identifiers of the concepts (synsets) they express, with these identifiers belonging to wordnets. Distributed via META-SHARE (search for "QTLeap").
QTLeap Multilingual Parallel Corpora
Collection of queries and their replies, collected from a chat service that supports troubleshooting in the Information Technology domain, and their translations into Portuguese, English, German, Spanish, Basque, Dutch, Bulgarian and Czech. Distributed via META-SHARE (search for "QTLeap").
Nexing Corpus
Corpus with the transcriptions of syllogistic reasoning protocols.
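Usage sketch for LX-4WAnalogies and LX-DSemVectors
A four-word analogy test set such as LX-4WAnalogies is commonly used to score word embeddings such as LX-DSemVectors: given three words of an analogy, the fourth is predicted by vector arithmetic and compared with the expected answer. The following is only a minimal sketch of that common procedure, not the evaluation script of either resource. The function names, the toy vectors and the toy analogy are all hypothetical, and the actual file formats of the two resources are not assumed here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(embeddings, a, b, c):
    """Predict d in 'a is to b as c is to d' via the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query words are excluded from the candidates, as is customary.
    candidates = (w for w in embeddings if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


def analogy_accuracy(embeddings, quadruples):
    """Share of analogies solved, over those whose four words all have vectors."""
    covered = [q for q in quadruples if all(w in embeddings for w in q)]
    if not covered:
        return 0.0
    correct = sum(solve_analogy(embeddings, a, b, c) == d
                  for a, b, c, d in covered)
    return correct / len(covered)


# Hypothetical toy data, for illustration only (not taken from the resources).
toy_embeddings = {
    "homem": [1.0, 0.0, 0.0],
    "mulher": [1.0, 1.0, 0.0],
    "rei": [1.0, 0.0, 1.0],
    "rainha": [1.0, 1.0, 1.0],
    "mesa": [0.0, 0.2, -1.0],
}
toy_quadruples = [("homem", "mulher", "rei", "rainha")]

if __name__ == "__main__":
    print(analogy_accuracy(toy_embeddings, toy_quadruples))
```

With real data, the dictionary would be filled from the embeddings file and the list of quadruples from the test set. Analogies containing out-of-vocabulary words are skipped here, which is one possible design choice; counting them as errors is another.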