An automatic approach to generate corpus in Spanish

A corpus is an indispensable linguistic resource for any application of natural language processing. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generate corpus from digital information sources such as Wikipe...


Authors: Puertas Del Castillo, Edwin Alexander

Resource type: Conference

Publication date: 2018

Institution: Universidad Tecnológica de Bolívar

Repository: Repositorio Institucional UTB

Language: eng

id	UTB2_c2a42ebab902e899d27864e627f506a3
oai_identifier_str	oai:repositorio.utb.edu.co:20.500.12585/8916
network_acronym_str	UTB2
network_name_str	Repositorio Institucional UTB
repository_id_str
dc.title.none.fl_str_mv	An automatic approach to generate corpus in Spanish
dc.creator.fl_str_mv	Puertas Del Castillo, Edwin Alexander
dc.contributor.author.none.fl_str_mv	Puertas Del Castillo, Edwin Alexander
dc.contributor.editor.none.fl_str_mv	Martínez Santos, Juan Carlos; Serrano Castañeda, Jairo Enrique
dc.subject.keywords.none.fl_str_mv	Corpus; Knowledge extraction; Linguistic computational; Natural language processing; Text mining; Data mining; Extraction; Natural language processing systems; Tellurium compounds; Websites; Automatic approaches; Digital information; Linguistic resources; Propagation algorithm; Wikipedia; Linguistics
description	A corpus is an indispensable linguistic resource for any application of natural language processing. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generating a corpus from digital information sources such as Wikipedia and web pages. Information is extracted from Wikipedia by delimiting the domain, using a propagation algorithm to determine the categories associated with a domain region and a set of seeds to delimit the search. Information is extracted from web pages efficiently by determining the patterns associated with the structure of each page, with the purpose of defining the quality of the extraction. © Springer Nature Switzerland AG 2018.
publishDate	2018
dc.date.issued.none.fl_str_mv	2018
dc.date.accessioned.none.fl_str_mv	2020-03-26T16:32:36Z
dc.date.available.none.fl_str_mv	2020-03-26T16:32:36Z
dc.type.none.fl_str_mv	Conference
dc.type.coarversion.fl_str_mv	http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.coar.fl_str_mv	http://purl.org/coar/resource_type/c_c94f
dc.type.driver.none.fl_str_mv	info:eu-repo/semantics/conferenceObject
dc.type.hasversion.none.fl_str_mv	info:eu-repo/semantics/publishedVersion
status_str	publishedVersion
dc.identifier.citation.none.fl_str_mv	Communications in Computer and Information Science; Vol. 885, pp. 150-161
dc.identifier.isbn.none.fl_str_mv	9783319989976
dc.identifier.issn.none.fl_str_mv	18650929
dc.identifier.uri.none.fl_str_mv	https://hdl.handle.net/20.500.12585/8916
dc.identifier.doi.none.fl_str_mv	10.1007/978-3-319-98998-3_12
dc.identifier.instname.none.fl_str_mv	Universidad Tecnológica de Bolívar
dc.identifier.reponame.none.fl_str_mv	Repositorio UTB
dc.identifier.orcid.none.fl_str_mv	57202285682; 8738428200; 57194828933; 57203852380
url	https://hdl.handle.net/20.500.12585/8916
dc.language.iso.none.fl_str_mv	eng
language	eng
dc.relation.conferencedate.none.fl_str_mv	26 September 2018 through 28 September 2018
dc.rights.coar.fl_str_mv	http://purl.org/coar/access_right/c_16ec
dc.rights.uri.none.fl_str_mv	http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights.accessrights.none.fl_str_mv	info:eu-repo/semantics/restrictedAccess
dc.rights.cc.none.fl_str_mv	Attribution-NonCommercial 4.0 International
eu_rights_str_mv	restrictedAccess
dc.format.medium.none.fl_str_mv	Electronic resource
dc.format.mimetype.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Springer Verlag
publisher.none.fl_str_mv	Springer Verlag
dc.source.none.fl_str_mv	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85054377708&doi=10.1007%2f978-3-319-98998-3_12&partnerID=40&md5=d8689ca7ab863965c5539711ded485c1
institution	Universidad Tecnológica de Bolívar
dc.source.event.none.fl_str_mv	13th Colombian Conference on Computing, CCC 2018
bitstream.url.fl_str_mv	https://repositorio.utb.edu.co/bitstreams/f302c765-69b2-4566-9913-8b80304cb94c/download
bitstream.checksum.fl_str_mv	0cb0f101a8d16897fb46fc914d3d7043
bitstream.checksumAlgorithm.fl_str_mv	MD5
repository.name.fl_str_mv	Repositorio Digital Universidad Tecnológica de Bolívar
repository.mail.fl_str_mv	bdigital@metabiblioteca.com
_version_	1858228405623848960
spelling	Puertas Del Castillo, Edwin Alexander; Alvarado Valencia, Jorge Andrés; Moreno Sandoval, L.G.; Pomares Quimbaya, A.; Martínez Santos, Juan Carlos; Serrano Castañeda, Jairo Enrique. Pontificia Universidad Javeriana. Acknowledgements: The tool presented was carried out within the construction of research capabilities of the Center for Excellence and Appropriation in Big Data and Data Analytics (CAOBA), led by the Pontificia Universidad Javeriana, funded by the Ministry of Information Technologies and Telecommunications of the Republic of Colombia (MinTIC).
