How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity

Background: Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when such parameters are...


Authors:
Leal, Wilmer
Llanos, Eugenio J.
Restrepo, Guillermo
Suárez, Carlos F.
Patarroyo, Manuel Elkin
Resource type:
article
Publication date:
2016
Institution:
Universidad del Rosario
Repository:
Repositorio EdocUR - U. Rosario
Language:
eng
OAI Identifier:
oai:repository.urosario.edu.co:10336/22285
Online access:
https://doi.org/10.1186/s13321-016-0114-x
https://repository.urosario.edu.co/handle/10336/22285
Keywords:
Cluster frequency
Cluster stability
Dendrogram
Hierarchical cluster analysis (HCA)
Molecular descriptor
Ties in proximity
Rights
License
Open (Full Text)
id EDOCUR2_9be069ed1263c1463d442d95f34dfb06
oai_identifier_str oai:repository.urosario.edu.co:10336/22285
network_acronym_str EDOCUR2
network_name_str Repositorio EdocUR - U. Rosario
repository_id_str
dc.title.spa.fl_str_mv How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity
dc.subject.keyword.spa.fl_str_mv Cluster frequency
Cluster stability
Dendrogram
Hierarchical cluster analysis (HCA)
Molecular descriptor
Ties in proximity
description Background: Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when such parameters are fixed, ties in proximity (i.e. two clusters equidistant from a third one) may produce several different dendrograms, having different possible clustering patterns (different classifications). This situation is usually disregarded and conclusions are based on a single result, leading to questions concerning the permanence of clusters in all the resulting dendrograms; this happens, for example, when using HCA for grouping molecular descriptors to select the least similar ones in QSAR studies. Results: Representing dendrograms in graph theoretical terms allowed us to introduce four measures of cluster frequency in a canonical way, and use them to calculate cluster frequencies over the set of all possible dendrograms, taking all ties in proximity into account. A toy example of well separated clusters was used, as well as a set of 1666 molecular descriptors calculated for a group of molecules having hepatotoxic activity, to show how our functions may be used for studying the effect of ties in HCA. Such functions were not restricted to the tie case; the possibility of using them to derive cluster stability measurements on arbitrary sets of dendrograms having the same leaves is discussed, e.g. dendrograms from variations of HCA parameters. It was found that ties occurred frequently, some yielding tens of thousands of dendrograms, even for small data sets. Conclusions: Our approach was able to detect trends in clustering patterns by offering a simple way of measuring their frequency, which is often very low. This would imply that inferences and models based on descriptor classifications (e.g. QSAR) are likely to be biased, thereby requiring an assessment of their reliability. Moreover, any classification of molecular descriptors is likely to be far from unique. Our results highlight the need for evaluating the effect of ties on clustering patterns before classification results can be used accurately. © 2016 Leal et al.
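The effect of ties described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce their four frequency measures. It assumes complete linkage, integer distances (so that ties can be compared exactly), and a single simple frequency: the fraction of distinct dendrograms in which a cluster appears. The function names `linkage_distance`, `all_dendrograms` and `cluster_frequencies`, and the four-point toy data, are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def linkage_distance(a, b, dist, method="complete"):
    """Distance between clusters a and b (frozensets of leaf labels)."""
    pair_dists = [dist[frozenset((x, y))] for x in a for y in b]
    return max(pair_dists) if method == "complete" else min(pair_dists)


def all_dendrograms(leaves, dist, method="complete"):
    """Enumerate every dendrogram obtained by resolving ties in all possible ways.

    Each dendrogram is stored as a frozenset of the clusters it forms.
    The search is exponential, so this is only suitable for toy data.
    """
    results = set()

    def recurse(clusters, formed):
        if len(clusters) == 1:
            results.add(frozenset(formed))
            return
        pairs = list(combinations(clusters, 2))
        dists = [linkage_distance(a, b, dist, method) for a, b in pairs]
        best = min(dists)
        # Follow every pair that attains the minimum distance (a tie in proximity).
        for (a, b), d in zip(pairs, dists):
            if d == best:
                merged = a | b
                rest = [c for c in clusters if c not in (a, b)]
                recurse(rest + [merged], formed + [merged])

    recurse([frozenset([x]) for x in leaves], [])
    return results


def cluster_frequencies(dendrograms):
    """Fraction of dendrograms in which each cluster occurs."""
    total = len(dendrograms)
    counts = {}
    for dendrogram in dendrograms:
        for cluster in dendrogram:
            counts[cluster] = counts.get(cluster, 0) + 1
    return {cluster: Fraction(n, total) for cluster, n in counts.items()}


# Toy data (assumed): four points on a line; d(a,b) = d(b,c) = 1 is a tie.
positions = {"a": 0, "b": 1, "c": 2, "d": 5}
dist = {
    frozenset((x, y)): abs(positions[x] - positions[y])
    for x, y in combinations(positions, 2)
}
dendrograms = all_dendrograms(list(positions), dist)
freqs = cluster_frequencies(dendrograms)
```

In this toy case the single tie yields two dendrograms: the cluster {a, b} appears in one and {b, c} in the other, so each has frequency 1/2, while {a, b, c} appears in both. A conclusion drawn from only one of the two runs would treat a cluster as established when it occurs in half of the possible results.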
publishDate 2016
dc.date.created.spa.fl_str_mv 2016
dc.date.accessioned.none.fl_str_mv 2020-05-25T23:55:59Z
dc.date.available.none.fl_str_mv 2020-05-25T23:55:59Z
dc.type.eng.fl_str_mv article
dc.type.coarversion.fl_str_mv http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.coar.fl_str_mv http://purl.org/coar/resource_type/c_6501
dc.type.spa.spa.fl_str_mv Artículo
dc.identifier.doi.none.fl_str_mv https://doi.org/10.1186/s13321-016-0114-x
dc.identifier.issn.none.fl_str_mv 17582946
dc.identifier.uri.none.fl_str_mv https://repository.urosario.edu.co/handle/10336/22285
url https://doi.org/10.1186/s13321-016-0114-x
https://repository.urosario.edu.co/handle/10336/22285
identifier_str_mv 17582946
dc.language.iso.spa.fl_str_mv eng
language eng
dc.relation.citationIssue.none.fl_str_mv No. 1
dc.relation.citationTitle.none.fl_str_mv Journal of Cheminformatics
dc.relation.citationVolume.none.fl_str_mv Vol. 8
dc.relation.ispartof.spa.fl_str_mv Journal of Cheminformatics, ISSN:17582946, Vol.8, No.1 (2016)
dc.relation.uri.spa.fl_str_mv https://www.scopus.com/inward/record.uri?eid=2-s2.0-84958102688&doi=10.1186%2fs13321-016-0114-x&partnerID=40&md5=605590ad3b3b3c9de85624edec80f43a
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.acceso.spa.fl_str_mv Abierto (Texto Completo)
rights_invalid_str_mv Abierto (Texto Completo)
http://purl.org/coar/access_right/c_abf2
dc.format.mimetype.none.fl_str_mv application/pdf
dc.publisher.spa.fl_str_mv BioMed Central Ltd.
institution Universidad del Rosario
dc.source.instname.spa.fl_str_mv instname:Universidad del Rosario
dc.source.reponame.spa.fl_str_mv reponame:Repositorio Institucional EdocUR
bitstream.url.fl_str_mv https://repository.urosario.edu.co/bitstreams/12545f80-1fa3-4979-826e-c61c65fff2f3/download
https://repository.urosario.edu.co/bitstreams/6db03d49-e327-4469-8412-f7d9d4357bc6/download
https://repository.urosario.edu.co/bitstreams/18c6a794-3335-409a-a15b-6ccc9a712b3b/download
bitstream.checksum.fl_str_mv 39bd63e1b2cb5941b59b3dd083382fe6
9e31387adf420a209e5c6231c207d291
8743acf01d3229c53461d0695cffd399
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Repositorio institucional EdocUR
repository.mail.fl_str_mv edocur@urosario.edu.co
_version_ 1808390938186219520
author Leal, Wilmer
Llanos, Eugenio J.
Restrepo, Guillermo
Suárez, Carlos F.
Patarroyo, Manuel Elkin
bitstream ORIGINAL s13321-016-0114-x.pdf application/pdf 8500224
bitstream TEXT s13321-016-0114-x.pdf.txt Extracted text text/plain 68765
bitstream THUMBNAIL s13321-016-0114-x.pdf.jpg Generated Thumbnail image/jpeg 4673