Automatic multi-label categorization of Java applications using Dependency graphs

Automatic approaches for categorization of software repositories are increasingly gaining acceptance because they reduce manual effort and can produce high-quality results. Most of the existing approaches have strongly relied on supervised machine learning (which requires a set of predefined categor...


Authors:
Vargas Baldrich, Santiago
Resource type:
Publication date:
2015
Institution:
Universidad Nacional de Colombia
Repository:
Universidad Nacional de Colombia
Language:
spa
OAI Identifier:
oai:repositorio.unal.edu.co:unal/53794
Online access:
https://repositorio.unal.edu.co/handle/unal/53794
http://bdigital.unal.edu.co/48450/
Keywords:
0 Generalidades / Computer science, information and general works
Closed-source
Open-source
Software categorization
Machine learning
Código propietario
Código abierto
Categorización de software
Aprendizaje de máquina
Rights
openAccess
License
Attribution-NonCommercial 4.0 International
id UNACIONAL2_82066b4ebba18abcf8e3a73fbfb4dd21
oai_identifier_str oai:repositorio.unal.edu.co:unal/53794
network_acronym_str UNACIONAL2
network_name_str Universidad Nacional de Colombia
repository_id_str
dc.title.spa.fl_str_mv Automatic multi-label categorization of Java applications using Dependency graphs
dc.creator.fl_str_mv Vargas Baldrich, Santiago
dc.contributor.advisor.spa.fl_str_mv Aponte Melo, Jairo Hernán (Thesis advisor)
dc.contributor.author.spa.fl_str_mv Vargas Baldrich, Santiago
dc.contributor.spa.fl_str_mv Linares Vásquez, Mario
dc.subject.ddc.spa.fl_str_mv 0 Generalidades / Computer science, information and general works
dc.subject.proposal.spa.fl_str_mv Closed-source
Open-source
Software categorization
Machine learning
Código propietario
Código abierto
Categorización de software
Aprendizaje de máquina
description Automatic approaches for categorization of software repositories are increasingly gaining acceptance because they reduce manual effort and can produce high-quality results. Most of the existing approaches have strongly relied on supervised machine learning (which requires a set of predefined categories to be used as training data) and have used source code, comments, API calls and other sources to obtain information about the projects to be categorized. We consider that existing approaches have two weaknesses that can have major implications for the categorization results and have not been addressed together: the assumption of unrestricted access to source code and the use of predefined sets of categories. Therefore, we present Sally: a novel, unsupervised, multi-label automatic categorization model that obtains meaningful categories without depending on access to source code or on the existence of predefined categories, by leveraging information obtained from the projects in the categorization corpus and the dependency relations between them. We performed two experiments in which we compared Sally to the categorization strategies of two widely used websites and to MUDABlue, a categorization model proposed by Kawaguchi et al. that we consider to be a good baseline. Additionally, we assessed the proposed model by conducting a survey with 14 developers with a wide range of programming experience, and we developed a web application to make the proposed model available to potential users.
publishDate 2015
dc.date.issued.spa.fl_str_mv 2015-05-13
dc.date.accessioned.spa.fl_str_mv 2019-06-29T18:25:19Z
dc.date.available.spa.fl_str_mv 2019-06-29T18:25:19Z
dc.type.spa.fl_str_mv Trabajo de grado - Maestría
dc.type.driver.spa.fl_str_mv info:eu-repo/semantics/masterThesis
dc.type.version.spa.fl_str_mv info:eu-repo/semantics/acceptedVersion
dc.type.content.spa.fl_str_mv Text
dc.type.redcol.spa.fl_str_mv http://purl.org/redcol/resource_type/TM
status_str acceptedVersion
dc.identifier.uri.none.fl_str_mv https://repositorio.unal.edu.co/handle/unal/53794
dc.identifier.eprints.spa.fl_str_mv http://bdigital.unal.edu.co/48450/
url https://repositorio.unal.edu.co/handle/unal/53794
http://bdigital.unal.edu.co/48450/
dc.language.iso.spa.fl_str_mv spa
language spa
dc.relation.ispartof.spa.fl_str_mv Universidad Nacional de Colombia Sede Bogotá Facultad de Ingeniería Departamento de Ingeniería de Sistemas e Industrial Ingeniería de Sistemas
Ingeniería de Sistemas
dc.relation.references.spa.fl_str_mv Vargas Baldrich, Santiago (2015) Automatic multi-label categorization of Java applications using Dependency graphs. Maestría thesis, Universidad Nacional de Colombia.
dc.rights.spa.fl_str_mv Derechos reservados - Universidad Nacional de Colombia
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.license.spa.fl_str_mv Atribución-NoComercial 4.0 Internacional
dc.rights.uri.spa.fl_str_mv http://creativecommons.org/licenses/by-nc/4.0/
dc.rights.accessrights.spa.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.mimetype.spa.fl_str_mv application/pdf
institution Universidad Nacional de Colombia
bitstream.url.fl_str_mv https://repositorio.unal.edu.co/bitstream/unal/53794/1/1020748671.2015.pdf
https://repositorio.unal.edu.co/bitstream/unal/53794/2/1020748671.2015.pdf.jpg
bitstream.checksum.fl_str_mv bc08b5077a6b6ddc481fd60e837016de
b54c6a26d31b9a25322e3f95bac47d16
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositorio Institucional Universidad Nacional de Colombia
repository.mail.fl_str_mv repositorio_nal@unal.edu.co
_version_ 1814089896359362560