A model for automatic categorization of software applications using non-parametric clustering and bytecode analysis

Authors:
Escobar Avila, Javier Ricardo
Resource type:
Publication date:
2015
Institution:
Universidad Nacional de Colombia
Repository:
Universidad Nacional de Colombia
Language:
spa
OAI Identifier:
oai:repositorio.unal.edu.co:unal/54862
Online access:
https://repositorio.unal.edu.co/handle/unal/54862
http://bdigital.unal.edu.co/50071/
Keywords:
0 Generalities / Computer science, information and general works
62 Engineering and allied operations / Engineering
Software categorization
Categorización de software
Bytecode
Non-parametric clustering
Automatic labeling
Clustering no paramétrico
Etiquetado automático
Rights
openAccess
License
Attribution-NonCommercial 4.0 International
Description
Summary: Automatic software categorization is the task of assigning software systems or libraries to categories based on their functionality. Correctly assigning these categories is essential to ensure that developers can easily retrieve relevant libraries from large repositories. State-of-the-art approaches rely on the semantics reflected by identifiers and comments in the source code of the libraries to determine their category. However, these approaches fail when the source code of the libraries is not available. In this document, we describe a novel approach for the automatic categorization of Java libraries that needs only the bytecode of a library to determine its category. We show that the approach, based on Dirichlet Process Clustering with automatic labeling, successfully categorizes libraries from the Apache Foundation Repository.
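
Illustrative sketch: this record does not describe the feature extraction or the inference procedure used in the thesis. To show why a non-parametric method does not need the number of categories fixed in advance, the Python sketch below uses DP-means, a hard-assignment simplification derived from Dirichlet process mixtures. It is a stand-in for illustration, not the author's method. The function `dp_means`, the threshold `lam`, and the toy feature vectors (made-up counts per library) are all assumptions.

```python
from math import dist


def dp_means(points, lam, max_iter=100):
    """Hard clustering that opens a new cluster whenever a point lies
    farther than `lam` (squared Euclidean distance) from every centroid."""
    # Start with a single cluster at the global mean.
    centroids = [tuple(sum(c) / len(points) for c in zip(*points))]
    assignments = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            sq = [dist(p, c) ** 2 for c in centroids]
            best = min(range(len(centroids)), key=sq.__getitem__)
            if sq[best] > lam:
                # No existing cluster is close enough: create one here.
                centroids.append(tuple(p))
                best = len(centroids) - 1
            if assignments[i] != best:
                assignments[i] = best
                changed = True
        # Recompute centroids and drop clusters that ended up empty.
        members = {}
        for i, k in enumerate(assignments):
            members.setdefault(k, []).append(points[i])
        order = sorted(members)
        centroids = [
            tuple(sum(c) / len(members[k]) for c in zip(*members[k]))
            for k in order
        ]
        remap = {k: j for j, k in enumerate(order)}
        assignments = [remap[k] for k in assignments]
        if not changed:
            break
    return assignments, centroids


# Hypothetical toy data: one made-up 3-dimensional count vector per library.
libraries = [(9, 1, 0), (8, 2, 0), (1, 9, 1), (0, 8, 2), (0, 1, 9), (1, 0, 8)]
labels, centers = dp_means(libraries, lam=20.0)
print(labels, len(centers))
```

In this sketch the number of clusters is an outcome of the data and the threshold: a small `lam` yields many fine-grained groups, while a very large one collapses everything into a single cluster. A full Dirichlet process mixture replaces the hard threshold with a probabilistic prior over partitions.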