Relational Agreement Measures for Similarity Searching of Cheminformatic Data Sets

Research on similarity searching of cheminformatic data sets has focused on fingerprint-based similarity measures. However, nominal scales are the least informative of all metric scales, which increases the number of tied similarity scores and decreases the effectiveness of retrieval engines. Tanimoto...


Authors:
Rivera-Borroto O.M.
García-De La Vega J.M.
Marrero-Ponce Y.
Grau R.
Resource type:
Article
Publication date:
2016
Institution:
Universidad Tecnológica de Bolívar
Repository:
Repositorio Institucional UTB
Language:
eng
OAI Identifier:
oai:repositorio.utb.edu.co:20.500.12585/9004
Online access:
https://hdl.handle.net/20.500.12585/9004
Keywords:
Chemistry
Reliability
Similarity measures
Sorting and searching
Benchmarking
Nearest neighbor search
Four-nearest-neighbors
Molecular interpretation
No free lunch theorem
Performance metrics
Proximity measure
Similarity measure
Similarity Searching
Population statistics
Algorithm
Chemical database
Data mining
Information science
Procedures
Algorithms
Databases, Chemical
Informatics
Rights
restrictedAccess
License
http://creativecommons.org/licenses/by-nc-nd/4.0/
id UTB2_8d09c54b88cf2713f224d3cd06185a1d
oai_identifier_str oai:repositorio.utb.edu.co:20.500.12585/9004
network_acronym_str UTB2
network_name_str Repositorio Institucional UTB
repository_id_str
description Research on similarity searching of cheminformatic data sets has focused on fingerprint-based similarity measures. However, nominal scales are the least informative of all metric scales, which increases the number of tied similarity scores and decreases the effectiveness of retrieval engines. Tanimoto's coefficient has been claimed to be the most prominent measure for this task. Nevertheless, this field is far from being exhausted, since the no free lunch theorem from computer science predicts that "no similarity measure has overall superiority over the population of data sets". We introduce 12 relational agreement (RA) coefficients for seven metric scales, which are integrated within a group fusion-based similarity searching algorithm. These similarity measures are compared to a reference panel of 21 proximity quantifiers over 17 benchmark data sets (MUV), using informative descriptors, a feature selection stage, a suitable performance metric, and powerful comparison tests. In this stage, RA coefficients perform favourably with respect to the state-of-the-art proximity measures. Afterward, the RA-based method outperforms four other nearest neighbor searching algorithms over the same data domains. In a third validation stage, RA measures are successfully applied to the virtual screening of the NCI data set. Finally, we discuss a possible molecular interpretation for these similarity variants. © 2016 IEEE.
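The description names Tanimoto's coefficient and group fusion without showing either. The following is a minimal Python sketch of both on binary fingerprints, under stated assumptions: fingerprints are represented as sets of "on" bit indices, the MAX rule is used as one common fusion choice, and the function names and toy molecules are hypothetical. It does not implement the RA coefficients, which this record does not define, and it is not taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two binary fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0  # convention assumed here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def max_group_fusion(references: List[Set[int]], candidate: Set[int]) -> float:
    """Fuse a candidate's scores against several references with the MAX rule (assumed)."""
    return max(tanimoto(ref, candidate) for ref in references)


def rank(references: List[Set[int]], database: Dict[str, Set[int]]) -> List[Tuple[str, float]]:
    """Rank database entries by fused similarity, highest first."""
    scores = {name: max_group_fusion(references, fp) for name, fp in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): two reference actives and three database molecules.
references = [{1, 2, 3, 4}, {2, 3, 5}]
database = {"mol_a": {1, 2, 3}, "mol_b": {2, 3, 4}, "mol_c": {7, 8}}
ranking = rank(references, database)
```

In this toy run `mol_a` and `mol_b` receive the same fused score, which illustrates the tied similarity scores that the description attributes to nominal (binary) scales.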
publishDate 2016
dc.date.issued.none.fl_str_mv 2016
dc.date.accessioned.none.fl_str_mv 2020-03-26T16:32:45Z
dc.date.available.none.fl_str_mv 2020-03-26T16:32:45Z
dc.type.coarversion.fl_str_mv http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.coar.fl_str_mv http://purl.org/coar/resource_type/c_2df8fbb1
dc.type.driver.none.fl_str_mv info:eu-repo/semantics/article
dc.type.hasVersion.none.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.spa.none.fl_str_mv Article
status_str publishedVersion
dc.identifier.citation.none.fl_str_mv IEEE/ACM Transactions on Computational Biology and Bioinformatics; Vol. 13, No. 1; pp. 158-167
dc.identifier.issn.none.fl_str_mv 15455963
dc.identifier.uri.none.fl_str_mv https://hdl.handle.net/20.500.12585/9004
dc.identifier.doi.none.fl_str_mv 10.1109/TCBB.2015.2424435
dc.identifier.instname.none.fl_str_mv Universidad Tecnológica de Bolívar
dc.identifier.reponame.none.fl_str_mv Repositorio UTB
dc.identifier.orcid.none.fl_str_mv 24436944800
57188713140
55665599200
57193746355
url https://hdl.handle.net/20.500.12585/9004
dc.language.iso.none.fl_str_mv eng
language eng
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_16ec
dc.rights.uri.none.fl_str_mv http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights.accessRights.none.fl_str_mv info:eu-repo/semantics/restrictedAccess
dc.rights.cc.none.fl_str_mv Attribution-NonCommercial 4.0 International
eu_rights_str_mv restrictedAccess
dc.format.medium.none.fl_str_mv Electronic resource
dc.format.mimetype.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Institute of Electrical and Electronics Engineers Inc.
dc.source.none.fl_str_mv https://www.scopus.com/inward/record.uri?eid=2-s2.0-84962028690&doi=10.1109%2fTCBB.2015.2424435&partnerID=40&md5=fbef0edaa9b5080d13f6b2c9480cf72b
institution Universidad Tecnológica de Bolívar
bitstream.url.fl_str_mv https://repositorio.utb.edu.co/bitstream/20.500.12585/9004/1/MiniProdInv.png
bitstream.checksum.fl_str_mv 0cb0f101a8d16897fb46fc914d3d7043
bitstream.checksumAlgorithm.fl_str_mv MD5
repository.name.fl_str_mv Repositorio Institucional UTB
repository.mail.fl_str_mv repositorioutb@utb.edu.co
_version_ 1814021805621379072
spelling Rivera-Borroto O.M.
García-De La Vega J.M.
Marrero-Ponce Y.
Grau R.
, MayHinselmann, G., Rosenbaum, L., Jahn, A., Fechner, N., Ostermann, C., Zell, A., Large-scale learning of structure-activity relationships using a linear support vector machine and problem-specific metrics (2011) J. Chem. Inf. Model, 51 (2), pp. 203-213. , FebGardiner, E.J., Holliday, J.D., O'Dowd, C., Willett, P., Effectiveness of 2D fingerprints for scaffold hopping (2011) Future Med. Chem., 3 (4), pp. 405-414. , MarAhmed, A., Saeed, F., Salim, N., Abdo, A., Condorcet and borda count fusion method for ligand-based virtual screening (2014) J. Cheminf., 6 (1), p. 10Duesbury, E.V., Holliday, J., Willett, P., Maximum common substructure-based data fusion in similarity searching (2015) J. Chem. Inf. Model, 55 (2), pp. 222-230Hallgren, K.A., Computing inter-rater reliability for observational data: An overview and tutorial (2012) Quant. Meth. Psych., 8 (1), pp. 23-34. , JanWillett, P., Combination of similarity rankings using data fusion (2013) J. Chem. Inf. Model, 53 (1), pp. 1-10. , JanCao, Y., Jiang, T., Girke, T., Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing (2010) Bioinformatics, 26 (7), pp. 953-959. , Aprhttp://purl.org/coar/resource_type/c_6501THUMBNAILMiniProdInv.pngMiniProdInv.pngimage/png23941https://repositorio.utb.edu.co/bitstream/20.500.12585/9004/1/MiniProdInv.png0cb0f101a8d16897fb46fc914d3d7043MD5120.500.12585/9004oai:repositorio.utb.edu.co:20.500.12585/90042021-02-02 14:21:18.405Repositorio Institucional UTBrepositorioutb@utb.edu.co