An ontology-based information extractor for data-rich documents in the information technology domain
- Authors:
- Jiménez Vargas, Sergio Gonzalo
González Osorio, Fabio Augusto
- Resource type:
- Journal article
- Publication date:
- 2008
- Institution:
- Universidad Nacional de Colombia
- Repository:
- Universidad Nacional de Colombia
- Language:
- spa
- OAI Identifier:
- oai:repositorio.unal.edu.co:unal/24330
- Online access:
- https://repositorio.unal.edu.co/handle/unal/24330
http://bdigital.unal.edu.co/15367/
- Keywords:
- Knowledge Management
Information Extraction
Ontologies
Fuzzy String Searching
Word Sense Disambiguation
Semantic Relatedness
- Rights
- openAccess
- License
- Attribution-NonCommercial 4.0 International
Summary: This paper presents an information extraction method, suitable for data-rich documents, based on the knowledge represented in a domain ontology. The extractor combines a fuzzy string matcher and a word sense disambiguation (WSD) algorithm. The fuzzy string matcher finds mentions of terms by combining character-level and token-level similarity measures, which lets it handle non-standardized acronyms and inconsistent abbreviation styles. We propose a new character-level edit distance sensitive to prefixes, called root distance, and a token-level similarity algorithm for fuzzy acronym detection. Additionally, a WSD strategy using an ontology-based semantic relatedness measure is used to resolve the inherent ambiguity of some entities. The WSD module finds a combination of senses over the entire document that optimizes the document's semantic coherence. Our approach seems suitable for extracting information from data-rich documents that describe only one main object (i.e., a product) per document. The results showed a precision of 78.9% with 99.5% recall using documents and an ontology from the laptop computer domain.
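The idea of combining character-level and token-level similarity, as described in the summary, can be illustrated with a minimal Python sketch. This is not the authors' method: the summary does not define root distance or the acronym detection algorithm, so the sketch assumes plain Levenshtein distance at the character level and a Monge-Elkan-style average at the token level. The function names and the sample terms are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1.

    Assumption: stands in for the paper's root distance, which the
    summary does not define.
    """
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_similarity(a: str, b: str) -> float:
    """Edit distance normalized into [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def token_similarity(term: str, mention: str) -> float:
    """Average, over the term's tokens, of the best character-level match
    found among the mention's tokens (a Monge-Elkan-style combination)."""
    term_tokens = term.lower().split()
    mention_tokens = mention.lower().split()
    if not term_tokens or not mention_tokens:
        return 0.0
    best = [max(char_similarity(t, m) for m in mention_tokens)
            for t in term_tokens]
    return sum(best) / len(best)
```

Under these assumptions, `token_similarity("hard disk", "hard dsk")` scores an abbreviated mention against a hypothetical ontology term. A real extractor would also need a decision threshold and the acronym handling that the paper describes.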