An information retrieval strategy for large multimodal data collections involving source code and natural language

Source code repositories store data from software products, including the evolution of the source code, requirements, bugs, and communication between developers. Source code repositories have been growing rapidly in recent years, and with them the need to extract information fr...

Authors:
Baquero Vargas, Juan Felipe
Resource type:
Publication date:
2019
Institution:
Universidad Nacional de Colombia
Repository:
Universidad Nacional de Colombia
Language:
spa
OAI Identifier:
oai:repositorio.unal.edu.co:unal/76556
Online access:
https://repositorio.unal.edu.co/handle/unal/76556
http://bdigital.unal.edu.co/73062/
Keywords:
Stack Overflow
source code analysis
Duplication detection
Predicting programming language
Análisis de código fuente
Detección de duplicados
Predecir el lenguaje de programación
Rights
openAccess
License
Attribution-NonCommercial 4.0 International
id UNACIONAL2_58476dc6cf5ecfffe596250f7fa4e6c7
oai_identifier_str oai:repositorio.unal.edu.co:unal/76556
network_acronym_str UNACIONAL2
network_name_str Universidad Nacional de Colombia
repository_id_str
dc.title.spa.fl_str_mv An information retrieval strategy for large multimodal data collections involving source code and natural language
dc.creator.fl_str_mv Baquero Vargas, Juan Felipe
dc.contributor.author.spa.fl_str_mv Baquero Vargas, Juan Felipe
dc.contributor.spa.fl_str_mv González Osorio, Fabio Augusto
Restrepo Calle, Felipe
dc.subject.proposal.spa.fl_str_mv Stack Overflow
source code analysis
Duplication detection
Predicting programming language
Análisis de código fuente
Detección de duplicados
Predecir el lenguaje de programación
description Source code repositories store data from software products, including the evolution of the source code, requirements, bugs, and communication between developers. These repositories have grown rapidly in recent years, and with them the need to extract information from them. One repository that is growing in both usage and content is Stack Overflow (SO), one of the largest question-answering sites, used by thousands of developers every day. On SO, developers can ask any question related to a programming issue, and other users answer it. SO thus offers a repository containing both source code and natural language, with thousands of samples and the possibility of combining both sources to extract useful information that is not apparent at a glance. In this thesis, we explore how to represent source code and natural language and how to combine these representations. We try to understand how SO users talk about programming languages and how similar these languages are to one another based on how users talk about them. Finally, we provide tools for building an information retrieval strategy by identifying duplicate posts.
publishDate 2019
dc.date.issued.spa.fl_str_mv 2019-07-03
dc.date.accessioned.spa.fl_str_mv 2020-03-30T06:22:16Z
dc.date.available.spa.fl_str_mv 2020-03-30T06:22:16Z
dc.type.spa.fl_str_mv Master's thesis (Trabajo de grado - Maestría)
dc.type.driver.spa.fl_str_mv info:eu-repo/semantics/masterThesis
dc.type.version.spa.fl_str_mv info:eu-repo/semantics/acceptedVersion
dc.type.content.spa.fl_str_mv Text
dc.type.redcol.spa.fl_str_mv http://purl.org/redcol/resource_type/TM
status_str acceptedVersion
dc.identifier.uri.none.fl_str_mv https://repositorio.unal.edu.co/handle/unal/76556
dc.identifier.eprints.spa.fl_str_mv http://bdigital.unal.edu.co/73062/
url https://repositorio.unal.edu.co/handle/unal/76556
http://bdigital.unal.edu.co/73062/
dc.language.iso.spa.fl_str_mv spa
language spa
dc.relation.ispartof.spa.fl_str_mv Universidad Nacional de Colombia Sede Bogotá Facultad de Ingeniería Departamento de Ingeniería de Sistemas e Industrial Ingeniería de Sistemas
Ingeniería de Sistemas
dc.relation.haspart.spa.fl_str_mv 0 Generalidades / Computer science, information and general works
6 Tecnología (ciencias aplicadas) / Technology
62 Ingeniería y operaciones afines / Engineering
dc.relation.references.spa.fl_str_mv Baquero Vargas, Juan Felipe (2019) An information retrieval strategy for large multimodal data collections involving source code and natural language. Master's thesis, Universidad Nacional de Colombia - Sede Bogotá.
dc.rights.spa.fl_str_mv Rights reserved - Universidad Nacional de Colombia
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.license.spa.fl_str_mv Attribution-NonCommercial 4.0 International
dc.rights.uri.spa.fl_str_mv http://creativecommons.org/licenses/by-nc/4.0/
dc.rights.accessrights.spa.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.mimetype.spa.fl_str_mv application/pdf
institution Universidad Nacional de Colombia
bitstream.url.fl_str_mv https://repositorio.unal.edu.co/bitstream/unal/76556/1/Tesis_Maestra_JFBV__Universidad_Nacional_de_Colombia.pdf
https://repositorio.unal.edu.co/bitstream/unal/76556/2/Tesis_Maestra_JFBV__Universidad_Nacional_de_Colombia.pdf.jpg
bitstream.checksum.fl_str_mv b4d7f67ed4300ad6e047fbdbc37a1542
4d9ee26fac36bc8b0a9dfe2f9119d586
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositorio Institucional Universidad Nacional de Colombia
repository.mail.fl_str_mv repositorio_nal@unal.edu.co
_version_ 1806886304339722240
description (Spanish abstract, translated) Software repositories store data about software products: data related to the evolution of source code, software requirements, bug reports, and communication between developers. Software repositories have grown rapidly in recent years, and with them the need to extract meaningful information from them. One interesting software repository is Stack Overflow (SO), one of the largest question-answering sites, used by thousands of software developers in their daily work. On SO, developers can ask any question related to programming and software, and other users answer it. Like SO, there are many software repositories containing source code and text, with millions of samples and the possibility of combining both sources to extract information that is not visible at a glance. In this thesis, we explore how to represent source code and natural language and how to combine these representations. We try to understand how SO users talk about a programming language and how similar programming languages are based on how users talk about them, and, finally, to provide tools for building an information retrieval strategy for identifying duplicate posts.