Part-of-speech tagging with maximum entropy and distributional similarity features in a subregional corpus of Spanish

Authors:
Rico-Sulayes, Antonio
Saldívar-Arreola, Rafael
Rábago-Tánori, Álvaro
Resource type:
Journal article
Publication date:
2017
Institution:
Universidad del Valle
Repository:
Repositorio Digital Univalle
Language:
eng
OAI Identifier:
oai:bibliotecadigital.univalle.edu.co:10893/18151
Online access:
https://hdl.handle.net/10893/18151
Keywords:
Corpus etiquetado
Español mexicano
Etiquetado gramatical estocástico
Tagged corpus
Mexican Spanish
Stochastic POS tagging
Rights
closedAccess
License
http://purl.org/coar/access_right/c_14cb
Description
Summary: (Eng) With the primary objective of automatically tagging grammatical categories in a collection of unstructured text designed to support a series of linguistic tasks, this research used two first-generation automatic taggers for Spanish. The taggers were applied to the Corpus del Habla de Baja California (CHBC), which covers a sub-region of Mexico. The two taggers, one based on the Maximum Entropy principle and the other adding distributional similarity features to that statistical model, were introduced only recently, and no accuracy range has been reported for them. This article therefore had a second objective: to evaluate the language models underlying these taggers and provide a verified accuracy figure for them. To achieve these two objectives, the article proposes a reduced tagset. Applied to a sample of around 11,000 words and more than 12,500 grammatical tags across two genres (written text and transcribed oral speech), the two taggers, the Maximum Entropy one and the one that adds distributional similarity features to it, obtained accuracies of 97.2% and 97.4%, respectively. When these figures are compared with the 97.1% standard obtained among human annotators, the results of both taggers appear competitive, even though they were applied to an external data collection on which they had not been previously trained or calibrated. This is particularly important because, under such experimental conditions, tagger performance has been found to deteriorate.
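
The evaluation described above amounts to comparing automatic tags against a manually annotated gold sample token by token. The Python sketch below illustrates that kind of accuracy computation only; the reduced tagset labels, the gold sample, and the baseline tagger are illustrative assumptions, not the authors' actual data or tooling.

```python
# Hedged sketch (not the authors' code): token-level accuracy of an automatic
# POS tagger against a manually annotated gold sample, which is how figures
# such as the 97.2% / 97.4% reported in the summary above are typically computed.

from typing import Callable, List, Tuple

TaggedSample = List[Tuple[str, str]]  # (token, POS tag) pairs


def tagging_accuracy(
    gold: TaggedSample,
    tag_tokens: Callable[[List[str]], List[str]],
) -> float:
    """Fraction of tokens whose automatic tag matches the gold tag."""
    tokens = [token for token, _ in gold]
    predicted = tag_tokens(tokens)
    correct = sum(
        gold_tag == pred_tag
        for (_, gold_tag), pred_tag in zip(gold, predicted)
    )
    return correct / len(gold)


if __name__ == "__main__":
    # Tiny illustrative gold sample with hypothetical reduced-tagset labels.
    gold_sample = [
        ("la", "DET"), ("casa", "NOUN"), ("es", "VERB"),
        ("muy", "ADV"), ("grande", "ADJ"),
    ]

    # Placeholder for the MaxEnt (or MaxEnt + distributional similarity)
    # tagger under evaluation; here a trivial baseline that tags every
    # token as NOUN, purely to make the sketch runnable.
    def baseline_tagger(tokens: List[str]) -> List[str]:
        return ["NOUN" for _ in tokens]

    print(f"accuracy: {tagging_accuracy(gold_sample, baseline_tagger):.1%}")
```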