An ontology-based information extractor for data-rich documents in the information technology domain
This paper presents an information extraction method, suitable for data-rich documents, based on the knowledge represented in a domain ontology. The extractor combines a fuzzy string matcher and a word sense disambiguation (WSD) algorithm. The fuzzy string matcher finds mentions of terms combining c...
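As a rough illustration of the idea in the abstract, the sketch below matches a document mention against ontology terms by combining a character-level similarity with a token-level acronym check. It is not the authors' method: `difflib.SequenceMatcher` stands in for the paper's root distance, whose definition is not given in this record, and the function names, the sample terms, and the 0.8 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence, Tuple


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1].

    Stand-in measure (assumption): NOT the paper's root distance.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_acronym_of(candidate: str, term: str) -> bool:
    """Token-level check: do the letters of `candidate` spell the
    initials of the tokens in `term`? Punctuation such as dots is ignored."""
    letters = [c for c in candidate.lower() if c.isalnum()]
    initials = [tok[0].lower() for tok in term.split()]
    return len(initials) > 1 and letters == initials


def match_term(
    mention: str, ontology_terms: Sequence[str], threshold: float = 0.8
) -> Optional[Tuple[str, float]]:
    """Return (term, score) for the best-scoring ontology term,
    or None when no term reaches the (hypothetical) threshold."""
    best = None
    for term in ontology_terms:
        if is_acronym_of(mention, term):
            score = 1.0
        else:
            score = char_similarity(mention, term)
        if score >= threshold and (best is None or score > best[1]):
            best = (term, score)
    return best


# Hypothetical ontology terms from a laptop domain.
terms = ["random access memory", "hard disk drive", "processor"]
print(match_term("R.A.M.", terms))    # matched through the acronym check
print(match_term("procesor", terms))  # matched through character similarity
print(match_term("keyboard", terms))  # no term is close enough
```

With these sample terms, the dotted acronym is resolved by the token-level check, the misspelling by the character-level measure, and the unrelated word yields no match. A prefix-sensitive distance such as the one the paper proposes would presumably handle abbreviations like "proc." better than the stand-in used here.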
- Authors:
- Jiménez Vargas, Sergio Gonzalo
González Osorio, Fabio Augusto
- Resource type:
- Journal article
- Publication date:
- 2008
- Institution:
- Universidad Nacional de Colombia
- Repository:
- Universidad Nacional de Colombia
- Language:
- spa
- OAI Identifier:
- oai:repositorio.unal.edu.co:unal/24330
- Online access:
- https://repositorio.unal.edu.co/handle/unal/24330
http://bdigital.unal.edu.co/15367/
- Keywords:
- Knowledge Management
Information Extraction
Ontologies
Fuzzy String Searching
Word Sense Disambiguation
Semantic Relatedness
- Rights
- openAccess
- License
- Attribution-NonCommercial 4.0 International
id |
UNACIONAL2_3322b9c16bdf039b9271230225e5462b |
---|---|
oai_identifier_str |
oai:repositorio.unal.edu.co:unal/24330 |
network_acronym_str |
UNACIONAL2 |
network_name_str |
Universidad Nacional de Colombia |
repository_id_str |
|
dc.title.spa.fl_str_mv |
An ontology-based information extractor for data-rich documents in the information technology domain |
dc.creator.fl_str_mv |
Jiménez Vargas, Sergio Gonzalo; González Osorio, Fabio Augusto |
dc.subject.proposal.spa.fl_str_mv |
Knowledge Management; Information Extraction; Ontologies; Fuzzy String Searching; Word Sense Disambiguation; Semantic Relatedness |
description |
This paper presents an information extraction method, suitable for data-rich documents, based on the knowledge represented in a domain ontology. The extractor combines a fuzzy string matcher and a word sense disambiguation (WSD) algorithm. The fuzzy string matcher finds mentions of terms by combining character-level and token-level similarity measures, which lets it deal with non-standardized acronyms and inconsistent abbreviation styles. We propose a new character-level edit distance sensitive to prefixes, called root distance, and a token-level similarity algorithm for fuzzy acronym detection. Additionally, a WSD strategy using an ontology-based semantic relatedness measure is used to resolve the inherent ambiguity of some entities. The WSD module finds a combination of senses over the whole document that optimizes the document's semantic coherence. Our approach seems suitable for extracting information from data-rich documents that describe only one main object (i.e., a product) per document. The results showed a precision of 78.9% with 99.5% recall using documents and an ontology from the laptop computer domain. |
dc.date.issued.spa.fl_str_mv |
2008 |
dc.date.accessioned.spa.fl_str_mv |
2019-06-25T22:35:59Z |
dc.date.available.spa.fl_str_mv |
2019-06-25T22:35:59Z |
dc.type.spa.fl_str_mv |
Journal article |
dc.type.coar.fl_str_mv |
http://purl.org/coar/resource_type/c_2df8fbb1 |
dc.type.driver.spa.fl_str_mv |
info:eu-repo/semantics/article |
dc.type.version.spa.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.coar.spa.fl_str_mv |
http://purl.org/coar/resource_type/c_6501 |
dc.type.coarversion.spa.fl_str_mv |
http://purl.org/coar/version/c_970fb48d4fbd8a85 |
dc.type.content.spa.fl_str_mv |
Text |
dc.type.redcol.spa.fl_str_mv |
http://purl.org/redcol/resource_type/ART |
format |
http://purl.org/coar/resource_type/c_6501 |
status_str |
publishedVersion |
dc.identifier.uri.none.fl_str_mv |
https://repositorio.unal.edu.co/handle/unal/24330 |
dc.identifier.eprints.spa.fl_str_mv |
http://bdigital.unal.edu.co/15367/ |
url |
https://repositorio.unal.edu.co/handle/unal/24330 http://bdigital.unal.edu.co/15367/ |
dc.language.iso.spa.fl_str_mv |
spa |
language |
spa |
dc.relation.spa.fl_str_mv |
http://revistas.unal.edu.co/index.php/avances/article/view/9972 |
dc.relation.ispartof.spa.fl_str_mv |
Universidad Nacional de Colombia; Revistas electrónicas UN; Avances en Sistemas e Informática |
dc.relation.ispartofseries.none.fl_str_mv |
Avances en Sistemas e Informática; Vol. 5, no. 1 (2008) 1909-0056 1657-7663 |
dc.relation.references.spa.fl_str_mv |
Jiménez Vargas, Sergio Gonzalo and González Osorio, Fabio Augusto (2008) An ontology-based information extractor for data-rich documents in the information technology domain. Avances en Sistemas e Informática; Vol. 5, no. 1 (2008) 1909-0056 1657-7663. |
dc.rights.spa.fl_str_mv |
All rights reserved - Universidad Nacional de Colombia |
dc.rights.coar.fl_str_mv |
http://purl.org/coar/access_right/c_abf2 |
dc.rights.license.spa.fl_str_mv |
Attribution-NonCommercial 4.0 International |
dc.rights.uri.spa.fl_str_mv |
http://creativecommons.org/licenses/by-nc/4.0/ |
dc.rights.accessrights.spa.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.mimetype.spa.fl_str_mv |
application/pdf |
dc.publisher.spa.fl_str_mv |
Universidad Nacional de Colombia - Sede Medellín |
institution |
Universidad Nacional de Colombia |
bitstream.url.fl_str_mv |
https://repositorio.unal.edu.co/bitstream/unal/24330/1/9972-18047-1-PB.pdf https://repositorio.unal.edu.co/bitstream/unal/24330/2/9972-18047-1-PB.pdf.jpg |
bitstream.checksum.fl_str_mv |
02a308e3de33481fb6b16798f0260a0e 0e8c3950bc9dac8bc7aa605c238a9cee |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 |
repository.name.fl_str_mv |
Repositorio Institucional Universidad Nacional de Colombia |
repository.mail.fl_str_mv |
repositorio_nal@unal.edu.co |
_version_ |
1814089318532120576 |
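
The description field in the record above says the WSD module picks a combination of senses that optimizes the semantic coherence of the whole document. A minimal sketch of that idea follows, under loud assumptions: the toy ontology, the path-based relatedness measure, and the exhaustive search are all hypothetical choices made for illustration, since the record does not state which relatedness measure or search strategy the paper actually uses.

```python
from itertools import combinations, product

# Hypothetical toy ontology (child -> parent); not taken from the paper.
PARENT = {
    "Laptop": None,
    "Memory": "Laptop",
    "HardDisk": "Laptop",
    "MemorySize": "Memory",
    "MemoryType": "Memory",
    "DiskCapacity": "HardDisk",
}


def path_to_root(node):
    """List of nodes from `node` up to the ontology root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def relatedness(a, b):
    """Assumed measure: 1 / (1 + number of tree edges between a and b)."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(n for n in pa if n in pb)  # lowest common ancestor
    distance = pa.index(common) + pb.index(common)
    return 1.0 / (1.0 + distance)


def coherence(assignment):
    """Sum of pairwise relatedness over all senses chosen for the document."""
    return sum(relatedness(a, b) for a, b in combinations(assignment, 2))


def disambiguate(candidates):
    """Pick one sense per mention so that document coherence is maximal.

    Exhaustive search: exponential in the number of ambiguous mentions,
    so only suitable for small examples.
    """
    return max(product(*candidates), key=coherence)


# Each inner list holds the candidate senses of one mention, e.g. the
# mention "2 GB" could be a memory size or a disk capacity.
mentions = [["Memory"], ["MemorySize", "DiskCapacity"], ["MemoryType"]]
print(disambiguate(mentions))
```

In this example the ambiguous middle mention resolves to `MemorySize`, because the surrounding mentions are both memory-related and that choice gives the higher total relatedness.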