Modelo de un Meta Buscador que Realiza Agrupación de Documentos Web, Enriquecido con una Taxonomía, Ontologías e Información del Usuario

Authors:
Cobos Lozada, Carlos Alberto
Resource type:
Doctoral thesis
Publication date:
2013
Institution:
Universidad Nacional de Colombia
Repository:
Universidad Nacional de Colombia
Language:
spa
OAI Identifier:
oai:repositorio.unal.edu.co:unal/52281
Online access:
https://repositorio.unal.edu.co/handle/unal/52281
http://bdigital.unal.edu.co/46605/
Keywords:
0 Generalidades / Computer science, information and general works
62 Ingeniería y operaciones afines / Engineering
Clustering search results
Web clustering engine
Taxonomies
Ontologies
Memetic algorithm
Global-best harmony search
Balanced Bayesian information criterion
Cuckoo search
Hyper-heuristic approach
User modeling
Meta-search engine
Personalized information retrieval
Semantic search engine
Agrupación de resultados web
Motor que agrupa documentos web
Taxonomías
Ontologías
Algoritmos meméticos
Mejor búsqueda armónica global
Criterio bayesiano de información balanceado
Búsqueda cucú
Enfoque híper heurístico
Modelamiento de usuario
Meta buscador
Recuperación de información personalizada
Motor de búsqueda semántica
Rights
openAccess
License
Attribution-NonCommercial 4.0 International
Description
Summary: The central theme of this Ph.D. thesis is effective web search. The author combines the strengths of thematic indices, traditional web search engines, and meta web search engines while avoiding the weaknesses each one shows when operating in isolation. A general taxonomy of knowledge, ontologies, and user information (user profile and user feedback) are combined with the clustering of web results in a meta-search model that presents the user with only the most relevant results (documents), thereby reducing the time users spend on searches. The proposed model has five main components. The first supports expansion of the user's query based on the semantic relationships (extracted from ontologies organized in a taxonomic hierarchy) among the terms that each user has stored in their profile. The second acquires search results from traditional web search engines (Google, Yahoo! and Bing). The third pre-processes documents and generates two representations of them, one based on the vector space model and another based on frequent phrases. The fourth constructs and labels clusters, using three heuristic algorithms that cluster on the vector space representation of the results and label on the frequent phrase representation. The fifth visualizes the resulting clusters: it presents the search results organized into thematic groups (folders) and updates the user profile based on user feedback (relevant or not relevant).
The cluster construction and labeling component is supported by three new heuristic algorithms based on the following global search strategies: global-best harmony search, cuckoo search, and a genetic algorithm. Each algorithm uses K-means as a local search improvement strategy. A new fitness function, called the Balanced Bayesian Information Criterion, guides the evolution process of these algorithms and is proposed using a genetic programming approach. A hyper-heuristic framework is also presented and used to evaluate a wide set of heuristics that can be applied to the problem of web result clustering. The model and the algorithms are evaluated on synthetic data sets (from traditional repositories) and on answers provided by a real population of users. The evaluation relies on traditional validation metrics from the information retrieval field (precision, recall, F-measure, accuracy, and fall-out) and on user satisfaction measures: utility of each cluster, precision of the allocation of documents to each cluster and their order, quality of the labels for each cluster, and the Subtopic Search Length under k document sufficiency (SSLk) measure, which assesses how easily users can use the clustering results. The results are compared against those of other state-of-the-art algorithms, among them Bisecting K-means, STC, and Lingo.
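
For readers unfamiliar with the information retrieval metrics named in the summary, the following is a minimal sketch of their standard textbook definitions, computed from the counts of a binary relevance confusion matrix. It is not taken from the thesis: the names `Confusion` and `ir_metrics` are hypothetical, and the balanced F-measure (F1) is assumed because the summary does not state a weighting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for a binary relevant / not-relevant judgment (hypothetical helper)."""
    tp: int  # relevant documents that were retrieved
    fp: int  # non-relevant documents that were retrieved
    fn: int  # relevant documents that were missed
    tn: int  # non-relevant documents that were correctly left out


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ir_metrics(c: Confusion) -> dict[str, float]:
    """Compute the standard definitions of the five metrics listed in the summary."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    # Assumption: balanced F-measure (harmonic mean of precision and recall).
    f_measure = _ratio(2 * precision * recall, precision + recall)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)
    # Fall-out is the share of non-relevant documents that were retrieved.
    fall_out = _ratio(c.fp, c.fp + c.tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "fall_out": fall_out,
    }


if __name__ == "__main__":
    for name, value in ir_metrics(Confusion(tp=8, fp=2, fn=2, tn=8)).items():
        print(f"{name}: {value:.3f}")
```

In a clustering setting, the counts would typically come from comparing the documents assigned to a cluster with the documents judged relevant to that cluster's topic; how the thesis derives them is not specified in this record.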