A systematic review of data quality issues in knowledge discovery tasks

Data volumes are growing rapidly because organizations continuously capture the collective amount of data to improve decision-making. The most fundamental challenge is exploring these large volumes of data and extracting useful knowledge for...

Authors: Corrales, David Camilo
Ledezma, Agapito Ismael
Corrales, Juan Carlos

Resource type: Journal article

Publication date: 2016

Institution: Universidad de Medellín

Repository: Repositorio UDEM

Language: eng

id	REPOUDEM2_62e7b68bcf6b426a3661fc681180ed18
oai_identifier_str	oai:repository.udem.edu.co:11407/3550
network_acronym_str	REPOUDEM2
network_name_str	Repositorio UDEM
repository_id_str
dc.title.spa.fl_str_mv	Una revisión sistemática de problemas de calidad en los datos en tareas de descubrimiento de conocimiento A systematic review of data quality issues in knowledge discovery tasks
title	Una revisión sistemática de problemas de calidad en los datos en tareas de descubrimiento de conocimiento
spellingShingle	Una revisión sistemática de problemas de calidad en los datos en tareas de descubrimiento de conocimiento Heterogeneity Outliers noise Inconsistency Incompleteness Amount of data Redundancy Timeliness Heterogeneidad Valores atípicos Ruido Inconsistencia Valores perdidos Cantidad de datos Redundancia Oportunidad
title_short	Una revisión sistemática de problemas de calidad en los datos en tareas de descubrimiento de conocimiento
title_full	Una revisión sistemática de problemas de calidad en los datos en tareas de descubrimiento de conocimiento
title_fullStr	Una revisión sistemática de problemas de calidad en los datos en tareas de descubrimiento de conocimiento
title_full_unstemmed	Una revisión sistemática de problemas de calidad en los datos en tareas de descubrimiento de conocimiento
title_sort	Una revisión sistemática de problemas de calidad en los datos en tareas de descubrimiento de conocimiento
dc.creator.fl_str_mv	Corrales, David Camilo Ledezma, Agapito Ismael Corrales, Juan Carlos
dc.contributor.author.none.fl_str_mv	Corrales, David Camilo Ledezma, Agapito Ismael Corrales, Juan Carlos
dc.subject.spa.fl_str_mv	Heterogeneity Outliers noise Inconsistency Incompleteness Amount of data Redundancy Timeliness Heterogeneidad Valores atípicos Ruido Inconsistencia Valores perdidos Cantidad de datos Redundancia Oportunidad
topic	Heterogeneity Outliers noise Inconsistency Incompleteness Amount of data Redundancy Timeliness Heterogeneidad Valores atípicos Ruido Inconsistencia Valores perdidos Cantidad de datos Redundancia Oportunidad
description	Data volumes are growing rapidly because organizations continuously capture the collective amount of data to improve decision-making. The most fundamental challenge is exploring these large volumes of data and extracting useful knowledge for future actions through knowledge discovery tasks; however, much of the data is of poor quality. We present a systematic review of data quality issues in knowledge discovery and a case study applied to the agricultural disease known as coffee rust.
publishDate	2016
dc.date.created.none.fl_str_mv	2016-06-30
dc.date.accessioned.none.fl_str_mv	2017-06-29T22:22:36Z
dc.date.available.none.fl_str_mv	2017-06-29T22:22:36Z
dc.type.eng.fl_str_mv	Article
dc.type.coar.fl_str_mv	http://purl.org/coar/resource_type/c_2df8fbb1
dc.type.coarversion.fl_str_mv	http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.coar.none.fl_str_mv	http://purl.org/coar/resource_type/c_6501
dc.type.local.spa.fl_str_mv	Artículo científico
dc.type.driver.none.fl_str_mv	info:eu-repo/semantics/article
format	http://purl.org/coar/resource_type/c_6501
dc.identifier.issn.none.fl_str_mv	1692-3324
dc.identifier.uri.none.fl_str_mv	http://hdl.handle.net/11407/3550
dc.identifier.doi.none.fl_str_mv	http://dx.doi.org/10.22395/rium.v15n28a7
dc.identifier.eissn.none.fl_str_mv	2248-4094
dc.identifier.reponame.spa.fl_str_mv	reponame:Repositorio Institucional Universidad de Medellín
dc.identifier.repourl.none.fl_str_mv	repourl:https://repository.udem.edu.co/
dc.identifier.instname.spa.fl_str_mv	instname:Universidad de Medellín
identifier_str_mv	1692-3324 http://dx.doi.org/10.22395/rium.v15n28a7 2248-4094 reponame:Repositorio Institucional Universidad de Medellín repourl:https://repository.udem.edu.co/ instname:Universidad de Medellín
url	http://hdl.handle.net/11407/3550
dc.language.iso.none.fl_str_mv	eng
language	eng
dc.relation.uri.none.fl_str_mv	http://revistas.udem.edu.co/index.php/ingenierias/article/view/1066
dc.relation.citationvolume.none.fl_str_mv	15
dc.relation.citationissue.none.fl_str_mv	28
dc.relation.citationstartpage.none.fl_str_mv	125
dc.relation.citationendpage.none.fl_str_mv	150
dc.relation.ispartofjournal.spa.fl_str_mv	Revista Ingenierías Universidad de Medellín
dc.rights.coar.fl_str_mv	http://purl.org/coar/access_right/c_abf2
dc.rights.uri.*.fl_str_mv	http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.creativecommons.*.fl_str_mv	Attribution-NonCommercial-ShareAlike 4.0 International
rights_invalid_str_mv	http://creativecommons.org/licenses/by-nc-sa/4.0/ Attribution-NonCommercial-ShareAlike 4.0 International http://purl.org/coar/access_right/c_abf2
dc.format.extent.spa.fl_str_mv	p. 125-150
dc.format.medium.spa.fl_str_mv	Electrónico
dc.format.mimetype.none.fl_str_mv	application/pdf
dc.coverage.spa.fl_str_mv	Lat: 06 15 00 N degrees minutes Lat: 6.2500 decimal degrees Long: 075 36 00 W degrees minutes Long: -75.6000 decimal degrees
dc.publisher.spa.fl_str_mv	Universidad de Medellín
dc.publisher.faculty.spa.fl_str_mv	Facultad de Ingenierías
dc.publisher.place.spa.fl_str_mv	Medellín
dc.source.spa.fl_str_mv	Revista Ingenierías Universidad de Medellín; Vol. 15, núm. 28 (2016) 2248-4094 1692-3324
institution	Universidad de Medellín
bitstream.url.fl_str_mv	http://repository.udem.edu.co/bitstream/11407/3550/3/Revista_Ingenierias_UdeM_277.pdf.jpg http://repository.udem.edu.co/bitstream/11407/3550/1/Articulo.html http://repository.udem.edu.co/bitstream/11407/3550/2/Revista_Ingenierias_UdeM_277.pdf
bitstream.checksum.fl_str_mv	ee4199aa9bd9873b6af7bb0ad40f64ff 4807fedbf2fd1bd99750d44b3cb5e466 d9dc4208c6eefe798bf1213569e887ab
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5
repository.name.fl_str_mv	Repositorio Institucional Universidad de Medellin
repository.mail.fl_str_mv	repositorio@udem.edu.co
_version_	1851059147738447872
spelling	Corrales, David Camilo Ledezma, Agapito Ismael Corrales, Juan Carlos Corrales, David Camilo; Universidad del Cauca - Universidad Carlos III de Madrid Ledezma, Agapito Ismael; Universidad Carlos III de Madrid Corrales, Juan Carlos 2017-06-29T22:22:36Z 2017-06-29T22:22:36Z 2016-06-30 1692-3324 http://hdl.handle.net/11407/3550 http://dx.doi.org/10.22395/rium.v15n28a7 2248-4094 reponame:Repositorio Institucional Universidad de Medellín repourl:https://repository.udem.edu.co/ instname:Universidad de Medellín The volume of data is growing because organizations are continuously capturing the collective amount of data for a better decision-making process. The most fundamental challenge is to explore the large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; nevertheless, much of the data is of poor quality. We present a systematic review of the data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust. p. 125-150 Electrónico application/pdf eng Universidad de Medellín Facultad de Ingenierías Medellín http://revistas.udem.edu.co/index.php/ingenierias/article/view/1066 15 28 125 150
Khalid, “Building a Memetic Algorithm Based Support Vector Machine for Imbalaced Classification,” in 2011 Fifth International Conference on Genetic and Evolutionary Computing (ICGEC), 2011, pp. 389-392.T. Z. Tan, G. S. Ng, and C. Quek, “Complementary Learning Fuzzy Neural Network: An Approach to Imbalanced Dataset,” in International Joint Conference on Neural Networks, 2007. IJCNN 2007, 2007, pp. 2306-2311.G. Y. Wong, F. H. F. Leung, and S.-H. Ling, “An under-sampling method based on fuzzy logic for large imbalanced dataset,” in 2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2014, pp. 1248-1252.J. A. Olvera-López, J. A. Carrasco-Ochoa, J. F. Martínez-Trinidad, and J. Kittler, “A review of instance selection methods,” Artif. Intell. Rev., vol. 34, no. 2, pp. 133-143, May 2010.S. Khalid, T. Khalil, and S. Nasreen, “A survey of feature selection and feature extraction techniques in machine learning,” in Science and Information Conference (SAI), 2014, 2014, pp. 372-378.G. Kalpana, R. P. Kumar, and T. Ravi, “Classifier based duplicate record elimination for query results from web databases,” in Trendz in Information Sciences Computing (TISC), 2010, 2010, pp. 50-53.B. Martins, H. Galhardas, and N. Goncalves, “Using Random Forest classifiers to detect duplicate gazetteer records,” in 2012 7th Iberian Conference on Information Systems and Technologies (CISTI), 2012, pp. 1-4.Y. Pei, J. Xu, Z. Cen, and J. Sun, “IKMC: An Improved K-Medoids Clustering Method for Near-Duplicated Records Detection,” in International Conference on Computational Intelligence and Software Engineering, 2009. CiSE 2009, 2009, pp. 1-4.X. Mansheng, L. Youshi, and Z. Xiaoqi, “A property optimization method in support of approximately duplicated records detecting,” in IEEE International Conference on Intelligent Computing and Intelligent Systems, 2009. ICIS 2009, 2009, vol. 3, pp. 118-122.L. D. Avendaño-Valencia, J. D. Martínez-Vargas, E. Giraldo, and G. 
Castellanos-Domíngue, “Reduction of irrelevant and redundant data from TFRs for EEG signal classification,” Conf. Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. IEEE Eng. Med. Biol. Soc. Annu. Conf., vol. 2010, pp. 4010-4013, 2010.Q. Hua, M. Xiang, and F. Sun, “An optimal feature selection method for approximately duplicate records detecting,” in 2010 The 2nd IEEE International Conference on Information Management and Engineering (ICIME), 2010, pp. 446-450.M. Finger and F. S. Da Silva, “Temporal data obsolescence: modelling problems,” in Fifth International Workshop on Temporal Representation and Reasoning, 1998. Proceedings, 1998, pp. 45-50.A. Maydanchik, Data Quality Assessment. Technics Publications, 2007.J. Debenham, “Knowledge Decay in a Normalised Knowledge Base,” in Database and Expert Systems Applications, M. Ibrahim, J. Küng, and N. Revell, Eds. Springer Berlin Heidelberg, 2000, pp. 417-426.G. Cormode, V. Shkapenyuk, D. Srivastava, and B. Xu, “Forward Decay: A Practical Time Decay Model for Streaming Systems,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, Washington, DC, USA, 2009, pp. 138-149.M. Placide and Y. Lasheng, “Information Decay in Building Predictive Models Using Temporal Data,” in 2010 International Symposium on Information Science and Engineering (ISISE), 2010, pp. 458-462.M. E. Cintra, C. A. A. Meira, M. C. Monard, H. A. Camargo, and L. H. A. Rodrigues, “The use of fuzzy decision trees for coffee rust warning in Brazilian crops,” in 2011 11th International Conference on Intelligent Systems Design and Applications (ISDA), 2011, pp. 1347-1352.D. C. Corrales, A. J. P. Q, C. León, A. Figueroa, and J. C. Corrales, “Early warning system for coffee rust disease based on error correcting output codes: a proposal,” Rev. Ing. Univ. Medellín, vol. 13, no. 25, 2014.D. C. Corrales, A. Ledezma, A. J. P. Q, J. Hoyos, A. Figueroa, and J. C. 
Corrales, “A new dataset for coffee rust detection in Colombian crops base on classifiers,” Sist. Telemática, vol. 12, no. 29, pp. 9-23, Jun. 2014.D. C. C. Corrales, J. C. Corrales, and A. Figueroa-Casas, “Toward detecting crop diseases and pest by supervised learning,” Ing. Univ., vol. 19, no. 1, 2015.D. C. Corrales, A. Figueroa, A. Ledezma, and J. C. Corrales, “An Empirical Multi-classifier for Coffee Rust Detection in Colombian Crops,” in Computational Science and Its Applications -- ICCSA 2015, O. Gervasi, B. Murgante, S. Misra, M. L. Gavrilova, A. M. A. C. Rocha, C. Torre, D. Taniar, and B. O. Apduhan, Eds. Springer International Publishing, 2015, pp. 60-74.Revista Ingenierías Universidad de Medellínhttp://creativecommons.org/licenses/by-nc-sa/4.0/Attribution-NonCommercial-ShareAlike 4.0 Internationalhttp://purl.org/coar/access_right/c_abf2Revista Ingenierías Universidad de Medellín; Vol. 15, núm. 28 (2016)2248-40941692-3324HeterogeneityOutliersnoiseInconsistencyIncompletenessAmount of dataRedundancyTimelinessHeterogeneidadValores atípicosRuidoInconsistenciaValores perdidosCantidad de datosRedundanciaOportunidadUna revisión sistemática de problemas de calidad en los datos en tareas de descubrimiento de conocimientoA systematic review of data quality issues in knowledge discovery tasksArticlehttp://purl.org/coar/resource_type/c_6501http://purl.org/coar/resource_type/c_2df8fbb1Artículo científicoinfo:eu-repo/semantics/articlehttp://purl.org/coar/version/c_970fb48d4fbd8a85Comunidad Universidad de MedellínLat: 06 15 00 N degrees minutes Lat: 6.2500 decimal degreesLong: 075 36 00 W degrees minutes Long: -75.6000 decimal degreesTHUMBNAILRevista_Ingenierias_UdeM_277.pdf.jpgRevista_Ingenierias_UdeM_277.pdf.jpgIM 
Thumbnailimage/jpeg7204http://repository.udem.edu.co/bitstream/11407/3550/3/Revista_Ingenierias_UdeM_277.pdf.jpgee4199aa9bd9873b6af7bb0ad40f64ffMD53ORIGINALArticulo.htmltext/html497http://repository.udem.edu.co/bitstream/11407/3550/1/Articulo.html4807fedbf2fd1bd99750d44b3cb5e466MD51Revista_Ingenierias_UdeM_277.pdfRevista_Ingenierias_UdeM_277.pdfapplication/pdf1810323http://repository.udem.edu.co/bitstream/11407/3550/2/Revista_Ingenierias_UdeM_277.pdfd9dc4208c6eefe798bf1213569e887abMD5211407/3550oai:repository.udem.edu.co:11407/35502021-05-14 14:28:18.834Repositorio Institucional Universidad de Medellinrepositorio@udem.edu.co

Similar publications