IMMAN: free software for information theory-based chemometric analysis

Abstract: The features and theoretical background of a new and free computational program for chemometric analysis denominated IMMAN (acronym for Information theory-based CheMoMetrics ANalysis) are presented. This is multi-platform software developed in the Java programming language, designed with a...


Authors:
Resource type:
Publication date:
2015
Institution:
Universidad Tecnológica de Bolívar
Repository:
Repositorio Institucional UTB
Language:
eng
OAI Identifier:
oai:repositorio.utb.edu.co:20.500.12585/9015
Online access:
https://hdl.handle.net/20.500.12585/9015
Keywords:
Chemometric analysis
Classification
Computational program
Feature selection
IMMAN
Information-theoretic function
Algorithm
Software
Theoretical model
Algorithms
Models, Theoretical
Software
Rights
restrictedAccess
License
http://creativecommons.org/licenses/by-nc-nd/4.0/
id UTB2_ee6ea57c2f49c0d56f8d3492f6704104
oai_identifier_str oai:repositorio.utb.edu.co:20.500.12585/9015
network_acronym_str UTB2
network_name_str Repositorio Institucional UTB
repository_id_str
dc.title.none.fl_str_mv IMMAN: free software for information theory-based chemometric analysis
dc.subject.keywords.none.fl_str_mv Chemometric analysis
Classification
Computational program
Feature selection
IMMAN
Information-theoretic function
Algorithm
Software
Theoretical model
Algorithms
Models, Theoretical
Software
description Abstract: The features and theoretical background of a new, free computational program for chemometric analysis named IMMAN (acronym for Information theory-based CheMoMetrics ANalysis) are presented. This is multi-platform software developed in the Java programming language, designed with a remarkably user-friendly graphical interface for the computation of a collection of information-theoretic functions adapted for rank-based unsupervised and supervised feature selection tasks. A total of 20 feature selection parameters are presented, with the unsupervised and supervised frameworks represented by 10 approaches each. Several information-theoretic parameters traditionally used as molecular descriptors (MDs) are adapted for use as unsupervised rank-based feature selection methods. In addition, a generalization scheme for the previously defined differential Shannon’s entropy is discussed, and the Jeffreys information measure is introduced for supervised feature selection. Moreover, well-known information-theoretic feature selection parameters, such as information gain, gain ratio, and symmetrical uncertainty, are incorporated into the IMMAN software (http://mobiosd-hub.com/imman-soft/), following an equal-interval discretization approach. IMMAN offers data pre-processing functionalities, such as missing-value processing, dataset partitioning, and browsing. Single-parameter or ensemble (multi-criteria) ranking options are also provided. Consequently, this software is suitable for tasks such as dimensionality reduction, feature ranking, and comparative diversity analysis of data matrices. Simple examples of applications performed with this program are presented. A comparative study between IMMAN and WEKA feature selection tools using the Arcene dataset was performed, demonstrating similar behavior. In addition, it is shown that the use of IMMAN unsupervised feature selection methods improves the performance of both IMMAN and WEKA supervised algorithms. © 2015, Springer International Publishing Switzerland.
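The supervised measures named in the abstract (information gain, gain ratio, and symmetrical uncertainty after equal-interval discretization) can be illustrated with a short sketch. This is not IMMAN's code or API: IMMAN is a Java GUI program, whereas the block below is plain Python, and every function name, the `n_bins` parameter, and its default value are assumptions made for illustration only. It uses the textbook definitions IG = H(C) - H(C|X), GR = IG / H(X), and SU = 2·IG / (H(C) + H(X)).

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy, in bits, of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def equal_interval_bins(values, n_bins=3):
    """Assign each value to one of n_bins equal-width intervals over [min, max].

    n_bins is an illustrative assumption; the abstract does not state a value.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature falls entirely into a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def info_scores(values, classes, n_bins=3):
    """Information gain, gain ratio, and symmetrical uncertainty of one feature."""
    bins = equal_interval_bins(values, n_bins)
    n = len(bins)
    h_class = entropy(classes)
    h_feature = entropy(bins)
    # Conditional entropy H(class | feature): class entropy inside each bin,
    # weighted by the fraction of samples that fall into that bin.
    h_cond = 0.0
    for b, count in Counter(bins).items():
        in_bin = [c for x, c in zip(bins, classes) if x == b]
        h_cond += (count / n) * entropy(in_bin)
    gain = h_class - h_cond
    total = h_class + h_feature
    return {
        "information_gain": gain,
        "gain_ratio": gain / h_feature if h_feature > 0 else 0.0,
        "symmetrical_uncertainty": 2 * gain / total if total > 0 else 0.0,
    }


def rank_features(columns, classes, key="information_gain", n_bins=3):
    """Return feature names sorted from highest to lowest score."""
    scores = {
        name: info_scores(vals, classes, n_bins)[key]
        for name, vals in columns.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rank_features` with a dictionary that maps feature names to value lists, plus a parallel list of class labels, returns the names ordered by the chosen score, which corresponds to the single-parameter ranking mode mentioned in the abstract. An ensemble ranking would need an additional rule for combining several such orderings, which the abstract does not specify.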
publishDate 2015
dc.date.issued.none.fl_str_mv 2015
dc.date.accessioned.none.fl_str_mv 2020-03-26T16:32:46Z
dc.date.available.none.fl_str_mv 2020-03-26T16:32:46Z
dc.type.coarversion.fl_str_mv http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.coar.fl_str_mv http://purl.org/coar/resource_type/c_2df8fbb1
dc.type.driver.none.fl_str_mv info:eu-repo/semantics/article
dc.type.hasVersion.none.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.spa.none.fl_str_mv Article
status_str publishedVersion
dc.identifier.citation.none.fl_str_mv Molecular Diversity; Vol. 19, No. 2; pp. 305-319
dc.identifier.issn.none.fl_str_mv 13811991
dc.identifier.uri.none.fl_str_mv https://hdl.handle.net/20.500.12585/9015
dc.identifier.doi.none.fl_str_mv 10.1007/s11030-014-9565-z
dc.identifier.instname.none.fl_str_mv Universidad Tecnológica de Bolívar
dc.identifier.reponame.none.fl_str_mv Repositorio UTB
dc.identifier.orcid.none.fl_str_mv 56497011800
55363486500
55665599200
56189852800
56191215400
6701762262
url https://hdl.handle.net/20.500.12585/9015
dc.language.iso.none.fl_str_mv eng
language eng
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_16ec
dc.rights.uri.none.fl_str_mv http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights.accessRights.none.fl_str_mv info:eu-repo/semantics/restrictedAccess
dc.rights.cc.none.fl_str_mv Attribution-NonCommercial 4.0 International
eu_rights_str_mv restrictedAccess
dc.format.medium.none.fl_str_mv Electronic resource
dc.format.mimetype.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Kluwer Academic Publishers
dc.source.none.fl_str_mv https://www.scopus.com/inward/record.uri?eid=2-s2.0-84937517073&doi=10.1007%2fs11030-014-9565-z&partnerID=40&md5=bebd134ed45279902c02db40eaa3b28c
institution Universidad Tecnológica de Bolívar
bitstream.url.fl_str_mv https://repositorio.utb.edu.co/bitstream/20.500.12585/9015/1/MiniProdInv.png
bitstream.checksum.fl_str_mv 0cb0f101a8d16897fb46fc914d3d7043
bitstream.checksumAlgorithm.fl_str_mv MD5
repository.name.fl_str_mv Repositorio Institucional UTB
repository.mail.fl_str_mv repositorioutb@utb.edu.co
_version_ 1808397589728460800
spelling Authors: Urias R.W.P.; Barigye S.J.; Marrero-Ponce Y.; García-Jacas C.R.; Valdes-Martiní J.R.; Perez-Gimenez F.
Funding: Conselho Nacional de Desenvolvimento Científico e Tecnológico, CNPq