TimeBankPT
TimeBankPT is a corpus of Portuguese text with annotations about time.
The annotation scheme used is similar to TimeML.
TimeBankPT is the result of adapting the English corpus used in the first TempEval challenge to the Portuguese language.
Citation
The preferred citation is Costa and Branco (2012).
Further details about the corpus can be found in the following publications:
- Costa, Francisco and Branco, António. 2010. Temporal Information Processing of a New Language: Fast Porting with Minimal Resources. In ACL 2010: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
- Costa, Francisco and Branco, António. 2012. TimeBankPT: A TimeML Annotated Corpus of Portuguese. In Proceedings of LREC 2012.
- Costa, Francisco. to appear. Processing Temporal Information in Unstructured Documents. Ph.D. thesis, Universidade de Lisboa, Lisbon.
Features
Some of the features of TimeBankPT:
- It uses the new Portuguese spelling (official document describing it, Wikipedia article).
- It was automatically checked for errors using reasoning code.
- It contains around 70,000 words of text, divided into a training set and a test set.
- It contains annotations for events, temporal expressions and temporal relations.
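The three annotation layers listed above (events, temporal expressions, temporal relations) can be illustrated with a minimal sketch. It assumes inline XML markup using the TimeML element names `EVENT`, `TIMEX3` and `TLINK`; the sample sentence, the `TempEval` root element and the attribute values are invented for illustration and are not taken from TimeBankPT, whose actual file layout may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample, NOT taken from TimeBankPT. Element and attribute names
# follow TimeML conventions; the corpus files themselves may differ.
# Sentence gloss: "The company announced the results yesterday."
SAMPLE = """<TempEval>
<s>A empresa <EVENT eid="e1" class="REPORTING">anunciou</EVENT> os resultados
<TIMEX3 tid="t1" type="DATE" value="2012-12-12">ontem</TIMEX3>.</s>
<TLINK lid="l1" relType="OVERLAP" eventID="e1" relatedToTime="t1"/>
</TempEval>"""


def count_annotations(xml_text):
    """Count event, temporal expression and temporal relation elements."""
    root = ET.fromstring(xml_text)
    tags = Counter(element.tag for element in root.iter())
    return {name: tags.get(name, 0) for name in ("EVENT", "TIMEX3", "TLINK")}


counts = count_annotations(SAMPLE)
print(counts)
```

Counting elements this way is one possible route to per-layer totals like those reported in the size table, provided the real files use comparable markup.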
Size of TimeBankPT
 | Train | Test |
Sentences | 2,281 | 351 |
Word tokens (according to white space) | 60,782 | 8,920 |
Word tokens (splitting contractions and detaching punctuation) | 68,351 | 9,829 |
Events | 6,790 | 1,097 |
Temporal Expressions | 1,244 | 165 |
Temporal Relations | 5,781 | 758 |
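As a quick check on the figures above, the short sketch below sums the two columns of the size table and computes the share held by the second one. It assumes the first column is the train set and the second the test set, which matches the description in the feature list but is not labeled in the table itself; the dictionary keys are shorthand introduced here.

```python
# Counts copied from the size table as (train, test) pairs.
# Assumption: first column = train set, second column = test set.
SIZES = {
    "Sentences": (2281, 351),
    "Word tokens (white space)": (60782, 8920),
    "Word tokens (split contractions, detached punctuation)": (68351, 9829),
    "Events": (6790, 1097),
    "Temporal expressions": (1244, 165),
    "Temporal relations": (5781, 758),
}


def summarize(sizes):
    """Return {row name: (total, test share)} for each row of the table."""
    summary = {}
    for name, (train, test) in sizes.items():
        total = train + test
        summary[name] = (total, test / total)
    return summary


SUMMARY = summarize(SIZES)
for name, (total, share) in SUMMARY.items():
    print(f"{name}: total={total:,}, test share={share:.1%}")
```

The white-space token total comes to 69,702, which is consistent with the "around 70,000 words" stated in the feature list.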
License
Coming soon.
Example
This short text from TimeBankPT illustrates the kind of annotated content the corpus contains.
Download
Version 1 of TimeBankPT is available for download.
Last update: December 13, 2012