Measuring Representativeness Using Covering Array Principles

Representativeness is an important data quality characteristic in data science processes; a data sample is said to be representative when it reflects a larger group as accurately as possible. Low representativeness in the data can lead to biased models. Hence, this s...


Authors:
Resource type:
Publication date:
2023
Institution:
Universidad Pedagógica y Tecnológica de Colombia
Repository:
RiUPTC: Repositorio Institucional UPTC
Language:
eng
OAI Identifier:
oai:repositorio.uptc.edu.co:001/14367
Online access:
https://revistas.uptc.edu.co/index.php/ingenieria/article/view/15314
https://repositorio.uptc.edu.co/handle/001/14367
Keywords:
algoritmos de clasificación
arreglos de cobertura
calidad de datos
conjuntos de datos
representatividad de datos
classification algorithms
coverage arrays
data quality
data sets
data representativeness
Rights
License
Copyright (c) 2023 Alexander Castro-Romero, Carlos-Alberto Cobos-Lozada
id REPOUPTC2_2d8fa851868f714dc839322a44adc7ed
oai_identifier_str oai:repositorio.uptc.edu.co:001/14367
network_acronym_str REPOUPTC2
network_name_str RiUPTC: Repositorio Institucional UPTC
repository_id_str
dc.title.en-US.fl_str_mv Measuring Representativeness Using Covering Array Principles
dc.title.es-ES.fl_str_mv Medición de la representatividad utilizando principios de la matriz de cobertura
dc.subject.es-ES.fl_str_mv algoritmos de clasificación
arreglos de cobertura
calidad de datos
conjuntos de datos
representatividad de datos
dc.subject.en-US.fl_str_mv classification algorithms
coverage arrays
data quality
data sets
data representativeness
description Representativeness is an important data quality characteristic in data science processes; a data sample is said to be representative when it reflects a larger group as accurately as possible. Low representativeness in the data can lead to biased models. This study therefore presents the elements of a new model for measuring representativeness based on the "P Matrix", a testing element of the mathematical objects known as covering arrays. To test the model, an experiment was proposed in which a data set is divided into training and test subsets using two sampling strategies, random and stratified, and the resulting representativeness values are compared. If the data division is adequate, the two sampling strategies should yield similar representativeness indices. The model was implemented in a software prototype using Python (for data processing) and Vue (for data visualization); this version of the model can only analyze binary data sets for now. To test the model, the "Wines" dataset (UC Irvine Machine Learning Repository) was adjusted. The conclusion is that both sampling strategies produce similar representativeness results for this dataset. Although this result is predictable, it is clear that adequate representativeness of the data is important when generating the training and test subsets. As future work, we plan to extend the model to categorical data and explore more complex datasets.
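The experiment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not define the "P Matrix", so the sketch substitutes a simple pairwise (strength-2) combination coverage ratio, in the spirit of covering arrays, as a stand-in representativeness index. All function names, the synthetic binary data, and the 30% test fraction are assumptions made for illustration only.

```python
import random
from itertools import combinations


def covered_pairs(rows):
    """Return the set of (col_i, col_j, val_i, val_j) combinations present in rows."""
    seen = set()
    for row in rows:
        for i, j in combinations(range(len(row)), 2):
            seen.add((i, j, row[i], row[j]))
    return seen


def representativeness(sample, population):
    """Stand-in index (assumption): fraction of the population's pairwise
    value combinations that also appear in the sample."""
    target = covered_pairs(population)
    if not target:
        return 1.0
    return len(covered_pairs(sample) & target) / len(target)


def random_split(rows, test_fraction, rng):
    """Shuffle a copy of rows and cut it into (train, test)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * test_fraction))
    return shuffled[cut:], shuffled[:cut]


def stratified_split(rows, labels, test_fraction, rng):
    """Apply random_split inside each label group so class proportions are kept."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    train, test = [], []
    for group in groups.values():
        g_train, g_test = random_split(group, test_fraction, rng)
        train.extend(g_train)
        test.extend(g_test)
    return train, test


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic binary data set (hypothetical; not the "Wines" data).
    rows = [tuple(rng.randint(0, 1) for _ in range(5)) for _ in range(200)]
    labels = [row[0] ^ row[1] for row in rows]

    splits = {
        "random": random_split(rows, 0.3, rng),
        "stratified": stratified_split(rows, labels, 0.3, rng),
    }
    for name, (train, test) in splits.items():
        print(name,
              "train:", round(representativeness(train, rows), 3),
              "test:", round(representativeness(test, rows), 3))
```

Under this stand-in index, a value of 1.0 means the subset contains every pairwise value combination found in the full data set. Comparing the values obtained for the random and stratified splits mirrors the comparison the abstract describes.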
publishDate 2023
dc.date.accessioned.none.fl_str_mv 2024-07-05T19:12:10Z
dc.date.available.none.fl_str_mv 2024-07-05T19:12:10Z
dc.date.none.fl_str_mv 2023-09-30
dc.type.none.fl_str_mv info:eu-repo/semantics/article
dc.type.coar.fl_str_mv http://purl.org/coar/resource_type/c_2df8fbb1
dc.type.coarversion.fl_str_mv http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.version.spa.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.coarversion.spa.fl_str_mv http://purl.org/coar/version/c_970fb48d4fbd8a396
status_str publishedVersion
dc.identifier.none.fl_str_mv https://revistas.uptc.edu.co/index.php/ingenieria/article/view/15314
dc.identifier.uri.none.fl_str_mv https://repositorio.uptc.edu.co/handle/001/14367
dc.language.none.fl_str_mv eng
dc.language.iso.spa.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv https://revistas.uptc.edu.co/index.php/ingenieria/article/view/15314/13578
https://revistas.uptc.edu.co/index.php/ingenieria/article/view/15314/13816
dc.rights.en-US.fl_str_mv Copyright (c) 2023 Alexander Castro-Romero, Carlos-Alberto Cobos-Lozada
http://creativecommons.org/licenses/by/4.0
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.coar.spa.fl_str_mv http://purl.org/coar/access_right/c_abf313
dc.format.none.fl_str_mv application/pdf
text/xml
dc.publisher.en-US.fl_str_mv Universidad Pedagógica y Tecnológica de Colombia
dc.source.en-US.fl_str_mv Revista Facultad de Ingeniería; Vol. 32 No. 65 (2023): July-September 2023 (Continuous Publication); e15314
dc.source.es-ES.fl_str_mv Revista Facultad de Ingeniería; Vol. 32 Núm. 65 (2023): Julio-Septiembre 2023 (Publicación Continua); e15314
dc.source.none.fl_str_mv 2357-5328
0121-1129
institution Universidad Pedagógica y Tecnológica de Colombia
repository.name.fl_str_mv Repositorio Institucional UPTC
repository.mail.fl_str_mv repositorio.uptc@uptc.edu.co
_version_ 1839633871890546688