Relational Agreement Measures for Similarity Searching of Cheminformatic Data Sets

Research on similarity searching of cheminformatic data sets has focused on fingerprint-based similarity measures. However, nominal scales are the least informative of all metric scales, which increases the number of tied similarity scores and decreases the effectiveness of retrieval engines. Tanimoto's...
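
The abstract of this record refers to Tanimoto's coefficient on fingerprints, to tied similarity scores, and to group fusion-based searching. The sketch below is only an illustration of those ideas, not the authors' method: it assumes binary fingerprints stored as sets of "on" bit positions, and it assumes the MAX fusion rule (a common choice in the group-fusion literature; the record does not say which rule the paper uses). The names `tanimoto`, `group_fusion_max`, `rank`, and the toy molecules are hypothetical. The record does not give the formulas of the RA coefficients, so they are not shown.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: a fingerprint is the set of bit positions that are switched on.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) coefficient: shared on-bits over on-bits in either."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def group_fusion_max(references: List[Fingerprint], candidate: Fingerprint) -> float:
    """Fuse the similarities to several reference actives with the MAX rule (assumed)."""
    return max(tanimoto(ref, candidate) for ref in references)


def rank(references: List[Fingerprint],
         database: Dict[str, Fingerprint]) -> List[Tuple[str, float]]:
    """Score every database compound and sort by decreasing fused similarity."""
    scored = [(name, group_fusion_max(references, fp)) for name, fp in database.items()]
    # Ties are broken alphabetically only to make the output deterministic.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy data (hypothetical): two reference actives and three database compounds.
references = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]
database = {
    "mol_a": frozenset({1, 2, 3}),
    "mol_b": frozenset({2, 3, 4}),
    "mol_c": frozenset({5, 6}),
}

ranking = rank(references, database)
for name, score in ranking:
    print(f"{name}\t{score:.2f}")
```

In this toy run, mol_a and mol_b receive the same fused score even though their fingerprints differ, which is the kind of tie on binary (nominal) descriptors that the abstract describes.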


Authors: Rivera-Borroto O.M.; García-De La Vega J.M.; Marrero-Ponce Y.; Grau R.

Resource type: Article

Publication date: 2016

Institution: Universidad Tecnológica de Bolívar

Repository: Repositorio Institucional UTB

Language: eng

id	UTB2_8d09c54b88cf2713f224d3cd06185a1d
oai_identifier_str	oai:repositorio.utb.edu.co:20.500.12585/9004
network_acronym_str	UTB2
network_name_str	Repositorio Institucional UTB
repository_id_str
dc.title.none.fl_str_mv	Relational Agreement Measures for Similarity Searching of Cheminformatic Data Sets
dc.subject.keywords.none.fl_str_mv	Chemistry; Reliability; Similarity measures; Sorting and searching; Benchmarking; Nearest neighbor search; Four-nearest-neighbors; Molecular interpretation; No free lunch theorem; Performance metrics; Proximity measure; Similarity measure; Similarity searching; Population statistics; Algorithm; Chemical database; Data mining; Information science; Procedures; Algorithms; Databases, Chemical; Informatics
description	Research on similarity searching of cheminformatic data sets has focused on fingerprint-based similarity measures. However, nominal scales are the least informative of all metric scales, which increases the number of tied similarity scores and decreases the effectiveness of retrieval engines. Tanimoto's coefficient has been claimed to be the most prominent measure for this task. Nevertheless, this field is far from exhausted, since the no free lunch theorem from computer science predicts that "no similarity measure has overall superiority over the population of data sets". We introduce 12 relational agreement (RA) coefficients for seven metric scales, which are integrated within a group fusion-based similarity searching algorithm. These similarity measures are compared with a reference panel of 21 proximity quantifiers over 17 benchmark data sets (MUV), using informative descriptors, a feature selection stage, a suitable performance metric, and powerful comparison tests. In this stage, RA coefficients perform favourably with respect to the state-of-the-art proximity measures. Afterward, the RA-based method outperforms four other nearest neighbor searching algorithms over the same data domains. In a third validation stage, RA measures are successfully applied to the virtual screening of the NCI data set. Finally, we discuss a possible molecular interpretation for these similarity variants. © 2016 IEEE.
publishDate	2016
dc.date.issued.none.fl_str_mv	2016
dc.date.accessioned.none.fl_str_mv	2020-03-26T16:32:45Z
dc.date.available.none.fl_str_mv	2020-03-26T16:32:45Z
dc.type.coarversion.fl_str_mv	http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.coar.fl_str_mv	http://purl.org/coar/resource_type/c_2df8fbb1
dc.type.driver.none.fl_str_mv	info:eu-repo/semantics/article
dc.type.hasVersion.none.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.spa.none.fl_str_mv	Article
status_str	publishedVersion
dc.identifier.citation.none.fl_str_mv	IEEE/ACM Transactions on Computational Biology and Bioinformatics; Vol. 13, No. 1; pp. 158-167
dc.identifier.issn.none.fl_str_mv	15455963
dc.identifier.uri.none.fl_str_mv	https://hdl.handle.net/20.500.12585/9004
dc.identifier.doi.none.fl_str_mv	10.1109/TCBB.2015.2424435
dc.identifier.instname.none.fl_str_mv	Universidad Tecnológica de Bolívar
dc.identifier.reponame.none.fl_str_mv	Repositorio UTB
dc.identifier.orcid.none.fl_str_mv	24436944800 57188713140 55665599200 57193746355
url	https://hdl.handle.net/20.500.12585/9004
dc.language.iso.none.fl_str_mv	eng
language	eng
dc.rights.coar.fl_str_mv	http://purl.org/coar/access_right/c_16ec
dc.rights.uri.none.fl_str_mv	http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights.accessRights.none.fl_str_mv	info:eu-repo/semantics/restrictedAccess
dc.rights.cc.none.fl_str_mv	Attribution-NonCommercial 4.0 International
eu_rights_str_mv	restrictedAccess
dc.format.medium.none.fl_str_mv	Electronic resource
dc.format.mimetype.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Institute of Electrical and Electronics Engineers Inc.
dc.source.none.fl_str_mv	https://www.scopus.com/inward/record.uri?eid=2-s2.0-84962028690&doi=10.1109%2fTCBB.2015.2424435&partnerID=40&md5=fbef0edaa9b5080d13f6b2c9480cf72b
institution	Universidad Tecnológica de Bolívar
bitstream.url.fl_str_mv	https://repositorio.utb.edu.co/bitstream/20.500.12585/9004/1/MiniProdInv.png
bitstream.checksum.fl_str_mv	0cb0f101a8d16897fb46fc914d3d7043
bitstream.checksumAlgorithm.fl_str_mv	MD5
repository.name.fl_str_mv	Repositorio Institucional UTB
repository.mail.fl_str_mv	repositorioutb@utb.edu.co
_version_	1837010869490286592
spelling	Rivera-Borroto O.M.; García-De La Vega J.M.; Marrero-Ponce Y.; Grau R.
