ScraCOVID-19: Plataforma informativa de contenido digital mediante Scraping y almacenamiento NoSQL
Introduction: Keeping the community informed about the recent pandemic caused by COVID-19 has become a necessity, which makes reliable communication channels and accurate, evidence-based information indispensable. Objectives: The main objective of this work is ...
- Authors:
-
Sánchez Paipilla, Ariel Guillermo
Durán Vaca, Mónica Katherine
González Amarillo, Angela María
Ballesteros Ricaurte, Javier Antonio
- Resource type:
- Journal article
- Publication date:
- 2020
- Institution:
- Corporación Universidad de la Costa
- Repository:
- REDICUC - Repositorio CUC
- Language:
- spa
- Keywords:
- data analysis
NoSQL Database
digital communication
web page
information extraction
análisis de datos
bases de datos NoSQL
comunicación digital
página web
extracción de información
- Rights
- openAccess
- License
- INGE CUC - 2020
id |
RCUC2_0ea1912f93ef05b8021e758c0037f65a |
---|---|
oai_identifier_str |
oai:repositorio.cuc.edu.co:11323/12303 |
network_acronym_str |
RCUC2 |
network_name_str |
REDICUC - Repositorio CUC |
repository_id_str |
|
dc.title.spa.fl_str_mv |
ScraCOVID-19: Plataforma informativa de contenido digital mediante Scraping y almacenamiento NoSQL |
dc.title.translated.eng.fl_str_mv |
ScraCOVID-19: Digital content information platform through Scraping and NoSQL storage |
dc.creator.fl_str_mv |
Sánchez Paipilla, Ariel Guillermo; Durán Vaca, Mónica Katherine; González Amarillo, Angela María; Ballesteros Ricaurte, Javier Antonio |
dc.contributor.author.spa.fl_str_mv |
Sánchez Paipilla, Ariel Guillermo; Durán Vaca, Mónica Katherine; González Amarillo, Angela María; Ballesteros Ricaurte, Javier Antonio |
dc.subject.eng.fl_str_mv |
data analysis; NoSQL Database; digital communication; web page; information extraction |
dc.subject.spa.fl_str_mv |
análisis de datos; bases de datos NoSQL; comunicación digital; página web; extracción de información |
description |
Introduction: Keeping the community informed about the recent pandemic caused by COVID-19 has become a necessity, which makes reliable communication channels and accurate, evidence-based information indispensable. Objectives: The main objective of this work is to create ScraCOVID-19, a digital content web platform dedicated to providing quick access to up-to-date news. As a case study, four nationally licensed digital media outlets are used. The news is presented in summarized form so that readers can, according to their interests, read it through filters such as unemployment, education, abuse, corruption, and discrimination. Methodology: ScraCOVID-19 is built with the scraping extraction technique, using BeautifulSoup, a library that extracts information in HTML format from multiple websites, with the Python programming language. Results: A categorization model is described that extracts useful information in order to classify it into categories by referring to the URLs. Conclusions: By combining extraction techniques with storage tools for unstructured data, information is obtained from different web pages and all the collected data are managed in a single dynamically generated website. |
publishDate |
2020 |
dc.date.accessioned.none.fl_str_mv |
2020-04-30 00:00:00 2024-04-09T20:21:22Z |
dc.date.available.none.fl_str_mv |
2020-04-30 00:00:00 2024-04-09T20:21:22Z |
dc.date.issued.none.fl_str_mv |
2020-04-30 |
dc.type.spa.fl_str_mv |
Journal article |
dc.type.coar.spa.fl_str_mv |
http://purl.org/coar/resource_type/c_6501 http://purl.org/coar/resource_type/c_2df8fbb1 |
dc.type.content.spa.fl_str_mv |
Text |
dc.type.driver.spa.fl_str_mv |
info:eu-repo/semantics/article |
dc.type.local.eng.fl_str_mv |
Journal article |
dc.type.redcol.spa.fl_str_mv |
http://purl.org/redcol/resource_type/ART |
dc.type.version.spa.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.coarversion.spa.fl_str_mv |
http://purl.org/coar/version/c_970fb48d4fbd8a85 |
format |
http://purl.org/coar/resource_type/c_6501 |
status_str |
publishedVersion |
dc.identifier.issn.none.fl_str_mv |
0122-6517 |
dc.identifier.uri.none.fl_str_mv |
https://hdl.handle.net/11323/12303 |
dc.identifier.url.none.fl_str_mv |
https://doi.org/10.17981/ingecuc.16.2.2020.18 |
dc.identifier.doi.none.fl_str_mv |
10.17981/ingecuc.16.2.2020.18 |
dc.identifier.eissn.none.fl_str_mv |
2382-4700 |
dc.language.iso.spa.fl_str_mv |
spa |
language |
spa |
dc.relation.ispartofjournal.spa.fl_str_mv |
Inge Cuc |
dc.relation.citationendpage.none.fl_str_mv |
237 |
dc.relation.citationstartpage.none.fl_str_mv |
229 |
dc.relation.citationissue.spa.fl_str_mv |
2 |
dc.relation.citationvolume.spa.fl_str_mv |
16 |
dc.relation.bitstream.none.fl_str_mv |
https://revistascientificas.cuc.edu.co/ingecuc/article/download/3280/3018 https://revistascientificas.cuc.edu.co/ingecuc/article/download/3280/3543 https://revistascientificas.cuc.edu.co/ingecuc/article/download/3280/3572 |
dc.relation.citationedition.spa.fl_str_mv |
No. 2, Year 2020: (July-December) |
dc.rights.spa.fl_str_mv |
INGE CUC - 2020 |
dc.rights.uri.spa.fl_str_mv |
http://creativecommons.org/licenses/by-nc-nd/4.0 |
dc.rights.accessrights.spa.fl_str_mv |
info:eu-repo/semantics/openAccess |
dc.rights.coar.spa.fl_str_mv |
http://purl.org/coar/access_right/c_abf2 |
dc.format.mimetype.spa.fl_str_mv |
application/pdf text/html application/xml |
dc.publisher.spa.fl_str_mv |
Universidad de la Costa |
dc.source.spa.fl_str_mv |
https://revistascientificas.cuc.edu.co/ingecuc/article/view/3280 |
institution |
Corporación Universidad de la Costa |
bitstream.url.fl_str_mv |
https://repositorio.cuc.edu.co/bitstreams/065bba61-bc61-4b0c-8215-f47186e47a90/download |
bitstream.checksum.fl_str_mv |
43417916d413958e31c157583e72a409 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 |
repository.name.fl_str_mv |
Repositorio de la Universidad de la Costa CUC |
repository.mail.fl_str_mv |
repdigital@cuc.edu.co |
_version_ |
1811760710504415232 |
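The abstract describes two steps: extracting information from HTML pages, then classifying each news item into a category by referring to its URL. The sketch below illustrates that flow under stated assumptions and is not the authors' implementation. It uses the standard-library `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages. The keyword map, the sample HTML, the `example.com` base URL, and the names `LinkCollector`, `categorize`, and `extract_documents` are all hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical keyword map: category name -> substrings looked for in the URL path.
# The five categories come from the abstract; the Spanish slugs are assumptions.
CATEGORY_KEYWORDS = {
    "unemployment": ("desempleo",),
    "education": ("educacion",),
    "abuse": ("maltrato",),
    "corruption": ("corrupcion",),
    "discrimination": ("discriminacion",),
}


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def categorize(url):
    """Return the first category whose keyword appears in the URL path, else None."""
    path = urlparse(url).path.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in path for keyword in keywords):
            return category
    return None


def extract_documents(html, base_url):
    """Turn the links of one page into schema-free documents for a document store."""
    parser = LinkCollector()
    parser.feed(html)
    documents = []
    for href, title in parser.links:
        url = urljoin(base_url, href)
        category = categorize(url)
        if category is not None:
            documents.append({"title": title, "url": url, "category": category})
    return documents


# Made-up page fragment used only to exercise the functions above.
SAMPLE_HTML = """
<ul>
  <li><a href="/economia/desempleo-sube-durante-la-cuarentena">Desempleo sube</a></li>
  <li><a href="/deportes/resultados">Resultados</a></li>
  <li><a href="/educacion/clases-virtuales">Clases virtuales</a></li>
</ul>
"""

if __name__ == "__main__":
    docs = extract_documents(SAMPLE_HTML, "https://example.com")
    print(json.dumps(docs, indent=2, ensure_ascii=False))
```

Each resulting dictionary is a self-describing document with no fixed schema, which is the shape a document-oriented NoSQL store typically accepts. The record does not say which NoSQL database the platform uses, so the sketch stops at building the documents.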