An automatic approach to generate corpus in Spanish

Authors:
Resource type:
Publication date:
2018
Institution:
Universidad Tecnológica de Bolívar
Repository:
Repositorio Institucional UTB
Language:
eng
OAI Identifier:
oai:repositorio.utb.edu.co:20.500.12585/8916
Online access:
https://hdl.handle.net/20.500.12585/8916
Keywords:
Corpus
Knowledge extraction
Computational linguistics
Natural language processing
Text mining
Data mining
Extraction
Natural language processing systems
Tellurium compounds
Websites
Automatic approaches
Digital information
Linguistic resources
Propagation algorithm
Wikipedia
Linguistics
Rights
restrictedAccess
License
http://creativecommons.org/licenses/by-nc-nd/4.0/
Description
Summary: A corpus is an indispensable linguistic resource for any application of natural language processing. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generate a corpus from digital information sources such as Wikipedia and web pages. Information is extracted from Wikipedia by delimiting the domain, using a propagation algorithm to determine the categories associated with a domain region and a set of seeds to delimit the search. Information is extracted from web pages efficiently by determining the patterns associated with the structure of each page, in order to define the quality of the extraction. © Springer Nature Switzerland AG 2018.
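
The summary does not spell out the propagation algorithm, so the following is only a minimal sketch of one plausible reading: starting from a set of seed categories, it walks a category graph breadth-first and stops at a fixed depth to delimit the domain. The function name `propagate_categories`, the `max_depth` parameter, and the toy category graph are all assumptions made for illustration and are not taken from the paper. A real run would presumably load the category links from Wikipedia rather than from a hand-written dictionary.

```python
from collections import deque


def propagate_categories(subcategories, seeds, max_depth):
    """Return {category: depth} for categories reachable from the seeds.

    Hypothetical helper: `subcategories` maps a category name to the list
    of its direct subcategories. Propagation stops at `max_depth` hops,
    which is the assumed way of delimiting the domain region.
    """
    depth_of = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        category = queue.popleft()
        depth = depth_of[category]
        if depth == max_depth:
            continue  # boundary of the domain region: do not expand further
        for child in subcategories.get(category, []):
            if child not in depth_of:  # skip categories already reached
                depth_of[child] = depth + 1
                queue.append(child)
    return depth_of


# Toy category graph, invented for this example only.
toy_graph = {
    "Medicina": ["Enfermedades", "Farmacología"],
    "Enfermedades": ["Enfermedades infecciosas"],
    "Enfermedades infecciosas": ["Virología"],
    "Farmacología": [],
}

domain = propagate_categories(toy_graph, ["Medicina"], max_depth=2)
```

With `max_depth=2`, the seed, its direct subcategories, and their subcategories are kept, while anything further away (here "Virología") falls outside the domain. A breadth-first walk is used in this sketch because it records the shortest distance from a seed, which makes the depth cut-off easy to reason about.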