An ontology-based information extractor for data-rich documents in the information technology domain

This paper presents an information extraction method, suitable for data-rich documents, based on the knowledge represented in a domain ontology. The extractor combines a fuzzy string matcher and a word sense disambiguation (WSD) algorithm. The fuzzy string matcher finds mentions of terms combining c...

Authors:
Jiménez Vargas, Sergio Gonzalo
González Osorio, Fabio Augusto
Resource type:
Journal article
Publication date:
2008
Institution:
Universidad Nacional de Colombia
Repository:
Universidad Nacional de Colombia
Language:
spa
OAI Identifier:
oai:repositorio.unal.edu.co:unal/24330
Online access:
https://repositorio.unal.edu.co/handle/unal/24330
http://bdigital.unal.edu.co/15367/
Keywords:
Knowledge Management
Information Extraction
Ontologies
Fuzzy String Searching
Word Sense Disambiguation
Semantic Relatedness
Rights
openAccess
License
Attribution-NonCommercial 4.0 International
id UNACIONAL2_3322b9c16bdf039b9271230225e5462b
oai_identifier_str oai:repositorio.unal.edu.co:unal/24330
network_acronym_str UNACIONAL2
network_name_str Universidad Nacional de Colombia
repository_id_str
dc.title.spa.fl_str_mv An ontology-based information extractor for data-rich documents in the information technology domain
title An ontology-based information extractor for data-rich documents in the information technology domain
spellingShingle An ontology-based information extractor for data-rich documents in the information technology domain
Knowledge Management
Information Extraction
Ontologies
Fuzzy String Searching
Word Sense Disambiguation
Semantic Relatedness
dc.creator.fl_str_mv Jiménez Vargas, Sergio Gonzalo
González Osorio, Fabio Augusto
dc.contributor.author.spa.fl_str_mv Jiménez Vargas, Sergio Gonzalo
González Osorio, Fabio Augusto
dc.subject.proposal.spa.fl_str_mv Knowledge Management
Information Extraction
Ontologies
Fuzzy String Searching
Word Sense Disambiguation
Semantic Relatedness
topic Knowledge Management
Information Extraction
Ontologies
Fuzzy String Searching
Word Sense Disambiguation
Semantic Relatedness
description This paper presents an information extraction method for data-rich documents, based on the knowledge represented in a domain ontology. The extractor combines a fuzzy string matcher with a word sense disambiguation (WSD) algorithm. The fuzzy string matcher finds mentions of terms by combining character-level and token-level similarity measures, which lets it handle non-standardized acronyms and inconsistent abbreviation styles. We propose a new prefix-sensitive character-level edit distance, called root distance, and a token-level similarity algorithm for fuzzy acronym detection. Additionally, a WSD strategy using an ontology-based semantic relatedness measure resolves the inherent ambiguity of some entities. The WSD module finds a combination of senses across the whole document that optimizes the document's semantic coherence. Our approach seems suitable for extracting information from data-rich documents that each describe only one main object (i.e. a product). The results showed a precision of 78.9% with 99.5% recall using documents and an ontology from the laptop computer domain.
publishDate 2008
dc.date.issued.spa.fl_str_mv 2008
dc.date.accessioned.spa.fl_str_mv 2019-06-25T22:35:59Z
dc.date.available.spa.fl_str_mv 2019-06-25T22:35:59Z
dc.type.spa.fl_str_mv Artículo de revista
dc.type.coar.fl_str_mv http://purl.org/coar/resource_type/c_2df8fbb1
dc.type.driver.spa.fl_str_mv info:eu-repo/semantics/article
dc.type.version.spa.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.coar.spa.fl_str_mv http://purl.org/coar/resource_type/c_6501
dc.type.coarversion.spa.fl_str_mv http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.content.spa.fl_str_mv Text
dc.type.redcol.spa.fl_str_mv http://purl.org/redcol/resource_type/ART
format http://purl.org/coar/resource_type/c_6501
status_str publishedVersion
dc.identifier.uri.none.fl_str_mv https://repositorio.unal.edu.co/handle/unal/24330
dc.identifier.eprints.spa.fl_str_mv http://bdigital.unal.edu.co/15367/
url https://repositorio.unal.edu.co/handle/unal/24330
http://bdigital.unal.edu.co/15367/
dc.language.iso.spa.fl_str_mv spa
language spa
dc.relation.spa.fl_str_mv http://revistas.unal.edu.co/index.php/avances/article/view/9972
dc.relation.ispartof.spa.fl_str_mv Universidad Nacional de Colombia Revistas electrónicas UN Avances en Sistemas e Informática
Avances en Sistemas e Informática
dc.relation.ispartofseries.none.fl_str_mv Avances en Sistemas e Informática; Vol. 5, núm. 1 (2008) 1909-0056 1657-7663
dc.relation.references.spa.fl_str_mv Jiménez Vargas, Sergio Gonzalo and González Osorio, Fabio Augusto (2008) An ontology-based information extractor for data-rich documents in the information technology domain. Avances en Sistemas e Informática; Vol. 5, núm. 1 (2008) 1909-0056 1657-7663.
dc.rights.spa.fl_str_mv Derechos reservados - Universidad Nacional de Colombia
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.license.spa.fl_str_mv Atribución-NoComercial 4.0 Internacional
dc.rights.uri.spa.fl_str_mv http://creativecommons.org/licenses/by-nc/4.0/
dc.rights.accessrights.spa.fl_str_mv info:eu-repo/semantics/openAccess
rights_invalid_str_mv Atribución-NoComercial 4.0 Internacional
Derechos reservados - Universidad Nacional de Colombia
http://creativecommons.org/licenses/by-nc/4.0/
http://purl.org/coar/access_right/c_abf2
eu_rights_str_mv openAccess
dc.format.mimetype.spa.fl_str_mv application/pdf
dc.publisher.spa.fl_str_mv Universidad Nacional de Colombia -Sede Medellín
institution Universidad Nacional de Colombia
bitstream.url.fl_str_mv https://repositorio.unal.edu.co/bitstream/unal/24330/1/9972-18047-1-PB.pdf
https://repositorio.unal.edu.co/bitstream/unal/24330/2/9972-18047-1-PB.pdf.jpg
bitstream.checksum.fl_str_mv 02a308e3de33481fb6b16798f0260a0e
0e8c3950bc9dac8bc7aa605c238a9cee
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositorio Institucional Universidad Nacional de Colombia
repository.mail.fl_str_mv repositorio_nal@unal.edu.co
_version_ 1814089318532120576