An information retrieval strategy for large multimodal data collections involving source code and natural language


Authors:
Baquero Vargas, Juan Felipe
Resource type:
Publication date:
2019
Institution:
Universidad Nacional de Colombia
Repository:
Universidad Nacional de Colombia
Language:
spa
OAI Identifier:
oai:repositorio.unal.edu.co:unal/76556
Online access:
https://repositorio.unal.edu.co/handle/unal/76556
http://bdigital.unal.edu.co/73062/
Keywords:
Stack Overflow
source code analysis
Duplication detection
Predicting programming language
Análisis de código fuente
Detección de duplicados
Predecir el lenguaje de programación
Rights
openAccess
License
Attribution-NonCommercial 4.0 International
Description
Summary: Source code repositories store data from software products, including the evolution of the source code, requirements, bugs, and communication between developers. These repositories have grown rapidly in recent years, and with them the need to extract information from them. One repository that is growing in both usage and content is Stack Overflow (SO), a website that hosts one of the largest question-and-answer communities, used by thousands of developers every day. On SO, developers can ask any question related to a programming issue, and other users answer it. The result is a repository that contains both source code and natural language, with thousands of samples and the possibility of combining both sources to extract useful information that is not apparent at first glance. In this thesis, we explore how to represent source code and natural language and how to combine these representations. We study how SO users talk about programming languages and how similar those languages are to one another based on how users talk about them. Finally, we provide tools for building an information retrieval strategy by identifying duplicate posts.