Language resources

In the course of our R&D activities, and as instrumental assets for the execution of our projects, we have developed or are developing the following language resources:



BDCamões Collection of Portuguese Literary Documents

Collection of literary documents in Portuguese with almost 4 million tokens, to support research in language technology and digital humanities. The words in these documents are automatically annotated with morphological information, and the sentences with constituency and dependency syntactic analyses. Distributed via PORTULAN CLARIN.


LX-DSemVectors

Distributional semantic representations of Portuguese words (also known as word embeddings). Distributed via GitHub.


LX-4WAnalogies

Test set of four-word analogies for evaluating distributional semantic representations of Portuguese words (also known as word embeddings). Distributed via GitHub.
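
As an illustration of how a four-word analogy test set is typically used, the sketch below scores a set of word vectors by solving "a is to b as c is to ?" with vector offsets and cosine similarity. The toy vectors, the function names and the quadruple format are assumptions made for this example only; they do not reflect the actual file formats of LX-DSemVectors or LX-4WAnalogies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def analogy_accuracy(vectors, quadruples):
    """Fraction of (a, b, c, d) quadruples solved correctly.

    Quadruples containing out-of-vocabulary words are skipped.
    """
    covered = [q for q in quadruples if all(w in vectors for w in q)]
    if not covered:
        return 0.0
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in covered)
    return correct / len(covered)

# Hypothetical 3-dimensional toy vectors, invented for this example.
TOY_VECTORS = {
    "rei": [1.0, 1.0, 0.0],
    "rainha": [1.0, 0.0, 1.0],
    "homem": [0.0, 1.0, 0.0],
    "mulher": [0.0, 0.0, 1.0],
    "casa": [0.5, 0.5, 0.5],
}

# Hypothetical test item: homem is to rei as mulher is to rainha.
TOY_QUADRUPLES = [("homem", "rei", "mulher", "rainha")]

if __name__ == "__main__":
    print(analogy_accuracy(TOY_VECTORS, TOY_QUADRUPLES))
```

Excluding the three query words from the candidates is a common convention in this kind of evaluation, since the offset vector often lies closest to one of its own inputs.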


TimeBankPT

Portuguese corpus annotated with rich temporal annotations, adopting the TimeML conventions. It includes annotations not only of temporal expressions but also of events and temporal relations. This corpus is the result of translating and adapting the English corpus used in the first TempEval challenge to the Portuguese language. Distributed via META-SHARE (search for "PT").
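
To give an idea of what the three annotation layers look like in practice, the sketch below reads events, temporal expressions and temporal relations from a small TimeML-style fragment. The fragment is invented for this example, and the attribute names follow common TimeML usage (EVENT, TIMEX3, TLINK); the actual TimeBankPT files may be organized differently.

```python
import xml.etree.ElementTree as ET

# Invented TimeML-style fragment; not taken from TimeBankPT.
SAMPLE = """<TimeML>
<s>A empresa <EVENT eid="e1" class="OCCURRENCE">anunciou</EVENT> os resultados <TIMEX3 tid="t1" type="DATE" value="1989-11-02">ontem</TIMEX3>.</s>
<TLINK lid="l1" relType="OVERLAP" eventID="e1" relatedToTime="t1"/>
</TimeML>"""

def extract_annotations(xml_text):
    """Return (events, timexes, links) found in a TimeML-style document."""
    root = ET.fromstring(xml_text)
    # Event id -> annotated event word.
    events = {e.get("eid"): e.text for e in root.iter("EVENT")}
    # Temporal expression id -> (surface text, normalized value).
    timexes = {t.get("tid"): (t.text, t.get("value"))
               for t in root.iter("TIMEX3")}
    # Temporal relations as (event id, relation type, time id) triples.
    links = [(l.get("eventID"), l.get("relType"), l.get("relatedToTime"))
             for l in root.iter("TLINK")]
    return events, timexes, links

if __name__ == "__main__":
    events, timexes, links = extract_annotations(SAMPLE)
    for event_id, relation, time_id in links:
        print(events[event_id], relation, timexes[time_id][0])
```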


DeepBankPT

Bank of deep grammatical representations, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with their fully fledged grammatical representations, according to an HPSG grammar. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English. Distributed via META-SHARE (search for "PT").


LogicalFormBankPT

Bank of logical forms, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with logical forms representing their meaning. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English. Distributed via META-SHARE (search for "PT").


DependencyBankPT

Dependency bank of Portuguese, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with graphs representing grammatical dependencies, whose arcs are decorated with grammatical functions and semantic roles. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English. Distributed via META-SHARE (search for "PT").


PropBankPT

PropBank of Portuguese, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with trees representing syntactic constituency, decorated with grammatical functions and semantic roles. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English. Distributed via META-SHARE (search for "PT").


TreebankPT

Treebank of Portuguese, sentence-aligned with the Penn Treebank of English: corpus of Portuguese sentences annotated with trees of syntactic constituency. The raw text corpus results from the translation into Portuguese of the WSJ corpus of English. Distributed via META-SHARE (search for "PT").


CINTIL-QATreeBank

Corpus of Portuguese interrogative and imperative sentences. This treebank includes declarative sentences from the pre-existing CINTIL-Treebank whose syntactic structures were manually transformed into their non-declarative counterparts: interrogative and imperative clauses. Distributed via META-SHARE (search for "CINTIL").


CINTIL-Definitions

Corpus of Portuguese definitions. Collection of annotated texts (POS tags and morphological information) with an additional layer of annotation marking definitions. Distributed via META-SHARE (search for "CINTIL").


CINTIL-DeepBank

Bank of deep grammatical representations: corpus of Portuguese sentences annotated with their fully fledged grammatical representations, according to an HPSG grammar. Distributed via META-SHARE (search for "CINTIL").


CINTIL-LogicalFormBank

Bank of logical forms: corpus of Portuguese sentences annotated with logical forms representing their meaning. Distributed via META-SHARE (search for "CINTIL").


CINTIL-DependencyBank

Dependency bank of Portuguese: corpus of Portuguese sentences annotated with graphs representing grammatical dependencies, whose arcs are decorated with grammatical functions and semantic roles. Distributed via META-SHARE (search for "CINTIL").


CINTIL-DependencyBank PREMIUM

Dependency bank of Portuguese: corpus of Portuguese sentences annotated with graphs representing grammatical dependencies, whose arcs are decorated with grammatical functions and semantic roles. Distributed via META-SHARE (search for "CINTIL").


CINTIL-PropBank

PropBank of Portuguese: corpus of Portuguese sentences annotated with trees representing syntactic constituency decorated with grammatical functions and semantic roles. Distributed via META-SHARE (search for "CINTIL").


CINTIL-Treebank

Treebank of Portuguese: corpus of Portuguese sentences annotated with trees of syntactic constituency. Distributed via META-SHARE (search for "CINTIL").


CINTIL-WordSenses

CINTIL extended with the annotation of word tokens with the identifiers of the concepts (synsets) they express, with these identifiers belonging to the MWNPT-International WordNet of Portuguese. Distributed via META-SHARE (search for "CINTIL").


CINTIL-NamedEntities

CINTIL extended with the annotation of named entities, manually disambiguated and annotated with links to the appropriate pages in the Portuguese DBpedia. Distributed via META-SHARE (search for "CINTIL").


CINTIL-Corpus Internacional do Português

High-quality, linguistically interpreted, 1 million token corpus of Portuguese, accurately hand-tagged with respect to POS, inflection and NER. Developed and maintained in cooperation with CLUL-Centro de Linguística da Universidade de Lisboa.


CINTIL Concordancer

Advanced, freely available online concordancer for the CINTIL corpus. Developed and maintained in cooperation with CLUL-Centro de Linguística da Universidade de Lisboa.


CINTIL TagSet

Exhaustive set of part of speech tags for Portuguese, including coverage of transcriptions of verbal productions. This is the tagset used in the annotation of the CINTIL corpus.


CINTIL Annotation Manual

Companion manual for the CINTIL corpus, with explicit guidelines for annotation and interpretation.


LX-VerbalInflections

Collection of the verb forms of Portuguese verbs, associated with information on their respective inflection features.


LX-Abbreviations

Collection of Portuguese abbreviations of different types. Each type of abbreviation is manually divided and annotated with grammatical categories, gender and number, and, finally, with the respective full expression. Distributed via META-SHARE (search for "LX").


LX-StopWords

List of Portuguese words comprising 2631 words of 51 types. The words are grouped into three broad classes (closed classes, open classes, and multi-word units), arranged according to their morpho-syntactic category and inflectional feature values. Distributed via META-SHARE (search for "LX").


MWNPT-International WordNet of Portuguese

WordNet of Portuguese, developed in cooperation with the MultiWordnet project of FBK-Foundation Bruno Kessler, in Trento, Italy.


QTLeap WSD/NED Multilingual Parallel Corpora

QTLeap Multilingual Parallel Corpora extended with the annotation of named entities, automatically disambiguated and annotated with links to the appropriate pages in DBpedia, and with the automatic annotation of word tokens with the identifiers of the concepts (synsets) they express, with these identifiers belonging to wordnets. Distributed via META-SHARE (search for "QTLeap").


QTLeap Multilingual Parallel Corpora

Collection of queries and their respective replies, collected as they occurred in a chat service that supports troubleshooting in the domain of Information Technology, together with their translations into Portuguese, English, German, Spanish, Basque, Dutch, Bulgarian and Czech. Distributed via META-SHARE (search for "QTLeap").


Nexing Corpus

Corpus with the transcriptions of syllogistic reasoning protocols.