Automatic multi-label categorization of Java applications using Dependency graphs

Automatic approaches for categorization of software repositories are increasingly gaining acceptance because they reduce manual effort and can produce high quality results. Most of the existing approaches have strongly relied on supervised machine learning {which requires a set of predefined categor...

Full description

Autores:
Vargas Baldrich, Santiago
Tipo de recurso:
Fecha de publicación:
2015
Institución:
Universidad Nacional de Colombia
Repositorio:
Universidad Nacional de Colombia
Idioma:
spa
OAI Identifier:
oai:repositorio.unal.edu.co:unal/53794
Acceso en línea:
https://repositorio.unal.edu.co/handle/unal/53794
http://bdigital.unal.edu.co/48450/
Palabra clave:
0 Generalidades / Computer science, information and general works
Closed-source
Open-source
Software categorization
Machine learning
Código propietario
Código abierto
Categorización de software
Aprendizaje de máquina
Rights
openAccess
License
Atribución-NoComercial 4.0 Internacional
Description
Summary:Automatic approaches for categorization of software repositories are increasingly gaining acceptance because they reduce manual effort and can produce high quality results. Most of the existing approaches have strongly relied on supervised machine learning {which requires a set of predefined categories to be used as training data{ and have used source code, comments, API Calls and other sources to obtain information about the projects to be categorized. We consider that existing approaches have weaknesses that can have major implications on the categorization results and haven't been solved at the same time, namely the assumption of non-restricted access to source code and the use of predefined sets of categories. Therefore, we present Sally: a novel, unsupervised and multi-label automatic categorization model that is able to obtain meaningful categories without depending on access to source code nor the existence of predefined categories by leveraging on information obtained from the projects in the categorization corpus and the dependency relations between them. We performed two experiments in which we compared Sally to the categorization strategies of two widely used websites and to MUDABlue, a categorization model proposed by Kawaguchi et al. that we consider to be a good baseline. Additionally, we assessed the proposed model by conducting a survey with 14 developers with a wide range of programming experience and developed a web application to make the proposed model available to potential users.