Automatic multi-label categorization of Java applications using Dependency graphs
- Authors:
- Vargas Baldrich, Santiago
- Resource type:
- Publication date:
- 2015
- Institution:
- Universidad Nacional de Colombia
- Repository:
- Universidad Nacional de Colombia
- Language:
- spa
- OAI Identifier:
- oai:repositorio.unal.edu.co:unal/53794
- Online access:
- https://repositorio.unal.edu.co/handle/unal/53794
http://bdigital.unal.edu.co/48450/
- Keywords:
- 0 Generalidades / Computer science, information and general works
Closed-source
Open-source
Software categorization
Machine learning
Código propietario
Código abierto
Categorización de software
Aprendizaje de máquina
- Rights
- openAccess
- License
- Attribution-NonCommercial 4.0 International
Summary: Automatic approaches for categorizing software repositories are gaining acceptance because they reduce manual effort and can produce high-quality results. Most existing approaches rely heavily on supervised machine learning (which requires a set of predefined categories to be used as training data) and use source code, comments, API calls and other sources to obtain information about the projects to be categorized. We consider that existing approaches have two weaknesses that can have major implications for the categorization results and that have not been addressed at the same time: the assumption of unrestricted access to source code and the use of predefined sets of categories. Therefore, we present Sally, a novel, unsupervised, multi-label automatic categorization model that obtains meaningful categories without depending on access to source code or on the existence of predefined categories, by leveraging information obtained from the projects in the categorization corpus and the dependency relations between them. We performed two experiments in which we compared Sally to the categorization strategies of two widely used websites and to MUDABlue, a categorization model proposed by Kawaguchi et al. that we consider a good baseline. Additionally, we assessed the proposed model by conducting a survey with 14 developers with a wide range of programming experience, and we developed a web application to make the proposed model available to potential users.