First I will give a little context on ontologies and ontology learning. Given the time constraints, going through the tools one by one is not feasible, so I will give an overview of the techniques commonly used across them. I broke the tools into the following categories (explain categories). Then I will cover some possible future work in the conclusion.
Ontologies capture concepts and the relationships between them. Relationships are both taxonomic (the class hierarchy) and non-taxonomic (all other relationships). Example concepts: medicine, disease. Example taxonomic relationships: doctor and patient are each a subclass of person. Example non-taxonomic relationships: the relationship between medicine and disease, or between symptoms and disease. Manual ontology building often requires an expert, and experts disagree, so it is not practical for large-scale ontologies; the same applies to maintenance. Reusing, maintaining, and combining existing ontologies has the same problems.
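The concept/relationship split above can be sketched as a minimal data structure. This is an illustrative toy, not any tool's internal representation, using the medicine/disease examples from the slide:

```python
# A toy ontology: concepts plus taxonomic (is-a) and non-taxonomic relations.
# All data here is illustrative.
concepts = {"person", "doctor", "patient", "medicine", "disease", "symptom"}

# Taxonomic relations form the class hierarchy (subclass -> superclass).
taxonomic = {"doctor": "person", "patient": "person"}

# Non-taxonomic relations are everything else, stored as labeled edges.
non_taxonomic = [
    ("medicine", "treats", "disease"),
    ("symptom", "indicates", "disease"),
]

def is_a(concept, ancestor):
    """Walk the taxonomy upward to test a subclass relationship."""
    while concept in taxonomic:
        concept = taxonomic[concept]
        if concept == ancestor:
            return True
    return False
```

The split matters because the two relation kinds are learned by different techniques: clustering tends to produce the taxonomy, while co-occurrence analysis and patterns produce the non-taxonomic edges.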
Ontology learning aims to automate the ontology creation process. It uses techniques from other fields such as machine learning and inductive programming (clustering, rule-based methods). It is still a long way from being fully automatic and usable at large scale by novices; it requires validation and input from the user throughout the process.
ASIUM: processes the input text for syntactic structure, creates initial clusters using word frequency, then runs a clustering algorithm with user validation at each layer of the clustering.
HASTI: one of the most robust end-to-end tools. Preprocesses the text using NLP, uses a modular architecture for each knowledge base and extraction unit, uses templates to extract the knowledge, and uses other machine learning techniques such as clustering to maintain and add to the ontology.
LTG: preprocesses the text using NLP techniques, then runs a series of algorithms on the parsed output.
Mo'K workbench: preprocesses the text using NLP techniques, then runs a series of algorithms on the parsed output.
OntoGen: similar to ASIUM. NLP is run on the input documents, then unsupervised and supervised learning algorithms are run to generate suggestions.
OntoLT: NLP is run to create an XML version of the input with NLP annotations; XSLT-based rules are then run to extract the knowledge in the input.
SOAT: NLP is run on the input text; rules are used to start with a root concept and grow the taxonomy from it.
SVETLAN: preprocesses the text using NLP techniques, then runs a hierarchical clustering algorithm to create a taxonomy using syntactic and semantic similarity.
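The clustering idea shared by ASIUM, OntoGen, and SVETLAN can be roughly sketched as: group terms whose syntactic contexts (e.g. the verbs they occur with) are similar. This is a simplified stand-in, assuming Jaccard similarity over context sets and greedy single-link merging rather than any tool's actual measure, with hypothetical data:

```python
def jaccard(a, b):
    """Similarity of two context sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_terms(contexts, threshold=0.5):
    """Greedy agglomerative clustering: repeatedly merge the two most
    similar clusters until no pair reaches the threshold.  Each cluster
    is a (set_of_terms, union_of_contexts) pair."""
    clusters = [({term}, set(ctx)) for term, ctx in contexts.items()]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = jaccard(clusters[i][1], clusters[j][1])
                if s >= threshold and (best is None or s > best[0]):
                    best = (s, i, j)
        if best is None:
            return sorted(sorted(terms) for terms, _ in clusters)
        _, i, j = best
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

# Terms keyed by the verbs they co-occur with (hypothetical data).
contexts = {
    "doctor":  {"treats", "examines", "prescribes"},
    "nurse":   {"treats", "examines"},
    "aspirin": {"relieves", "prescribed"},
}
```

In the real tools the user validates each merge (ASIUM validates every layer), which is exactly where the heavy interaction cost noted below comes from.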
Cooperative learning approach: these tools check with the user at each step for validation, or suggest actions for the user to take.
ASIUM: requires a lot of user interaction for validation at each step in the clustering.
HASTI: could allow more user adaptation, like the other tools.
LTG: an earlier tool that does not actually output an ontology.
Mo'K workbench: constrains the types of learning algorithms that can be used.
OntoGen: requires a lot of user interaction for validation at each step in the clustering.
OntoLT: requires hand-crafted XSLT rules and operators to build the ontology.
SOAT: requires a large amount of "high quality input" for domain coverage and concept learning.
SVETLAN: more of a support tool to learn a basic taxonomy than an ontology learning tool.
Very similar to the first group, but these tools also use the structure present in existing ontologies, taxonomies, and knowledge bases. They preprocess the input text and use the same types of learning algorithms.
OntoEdit/KAON/Text-To-Onto/Text2Onto: modular architecture. NLP processing of the input text; an algorithm library is used to run several algorithms on the input, some of which use information from input ontologies. The algorithms' output is in a standard format, so it can be combined to create a meta-learner.
OntoLearn: extracts terminology and then uses the input to determine which terms are used only in that domain. Creates a concept forest for those terms using the input ontology and inductive learning rules, adds the concept forest back to the ontology, and trims the ontology to represent only that domain. More robust than the DODDLE tools because it uses semantic interpretation to associate an appropriate concept identifier with each term in the ontology.
ONTOTEXT: a rule-based approach to learn ontology elements and extract the knowledge.
TFIDF: NLP processing of the input. Extracts single-word terms, learns multi-word terms and identifies patterns, then extracts related terms by applying the learned patterns to the corpus and returns to the previous step. Patterns are learned using existing patterns in the input ontology.
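The TFIDF tool takes its name from the standard term-weighting scheme. A minimal sketch of that scheme (not the tool's own implementation) for scoring single-word term candidates, where high scores mark words frequent in one document but rare across the corpus:

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each word of each tokenized document by tf * idf,
    with idf = log(N / document_frequency).  Higher scores suggest
    better domain-term candidates."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores

# Hypothetical tokenized corpus.
docs = [
    ["disease", "symptom", "the", "the"],
    ["medicine", "disease", "the"],
    ["weather", "the"],
]
```

Note how "the", appearing in every document, gets idf = log(1) = 0 and drops out automatically, which is why this weighting is a common first filter for terminology extraction.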
OntoEdit/KAON/Text-To-Onto/Text2Onto: one of the most promising tools.
OntoLearn: depends on enough of the ontology's domain being represented in the input ontology, taxonomy, or knowledge base.
ONTOTEXT: requires a large number of hand-created rules to learn even a small part of the ontology.
TFIDF: depends on enough of the ontology's domain being represented in the input ontology, taxonomy, or knowledge base.
DODDLE and DODDLE II focus on building a hierarchically structured set of domain terms. They create an initial ontology by doing text matching to add domain terms to the dictionary, then trim the initial ontology by determining which parts to carry over to the final ontology, as well as looking for inconsistent relationships and badly balanced sub-trees. DODDLE II adds non-taxonomic relationship discovery, using word space and word co-occurrence to determine the strength of word relations, with word-vector-based similarity measures.
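DODDLE II's co-occurrence idea can be sketched as: build a vector for each word from the words it appears alongside, then compare vectors with cosine similarity to estimate relation strength. A simplified illustration with hypothetical data, assuming the context window is a whole sentence:

```python
import math
from collections import Counter

def cooccurrence_vectors(sentences):
    """Each word's vector counts the other words sharing a sentence with it."""
    vectors = {}
    for sent in sentences:
        for w in sent:
            vectors.setdefault(w, Counter()).update(x for x in sent if x != w)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical tokenized sentences.
sentences = [
    ["medicine", "treats", "disease"],
    ["drug", "treats", "disease"],
    ["weather", "changes"],
]
```

Words that occur in similar contexts ("medicine" and "drug" above) end up with near-identical vectors, which is the signal used to propose a non-taxonomic relation between them.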
DODDLE and DODDLE II are highly dependent on the ability of simple text matching to map domain terms to the general dictionary when creating the domain ontology. They could use more sophisticated matching techniques; the problem can be alleviated if a domain-specific dictionary is available.
Two subclasses in this group: enhancing an existing ontology, or merging two ontologies.
OntoBuilder and WebKB: create an initial simple ontology from the input and then augment it using further web page input. They rely heavily on HTML page structure to find important information. WebKB is based on supervised machine learning.
SyndiKate: uses the grammatical information available in the input text to learn new terms and add them to the input ontology; assigns terms to the ontology using an iterative labeling technique.
GLUE: creates mappings between two ontologies. Uses probability distributions and learns classifiers to map between the two ontologies.
OntoLift: requires the database owner to describe the structure of the underlying data and the types of queries that can be issued to the database; uses mapping rules for the mapping.
OntoMerge: translates both ontologies to a semantic representation and uses bridging axioms to map between the two ontologies.
Prompt and Chimaera: do the simple text-based matching for the user, create a list of suggested mappings, and allow the user to validate them; they also check for inconsistencies in added mappings. Useful for ontology maintenance.
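The simple text-based matching that Prompt and Chimaera automate for the user can be sketched with stdlib string similarity. The threshold and class names here are purely illustrative, and the output is a suggestion list for the user to validate, mirroring the tools' workflow:

```python
from difflib import SequenceMatcher

def suggest_mappings(names_a, names_b, threshold=0.8):
    """Suggest candidate concept mappings between two ontologies by
    comparing class names; the user still validates each suggestion."""
    suggestions = []
    for a in names_a:
        for b in names_b:
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= threshold:
                suggestions.append((a, b, round(ratio, 2)))
    return sorted(suggestions, key=lambda s: -s[2])

# Hypothetical class names from two ontologies being merged.
onto_a = ["Doctor", "Patient", "Medicine"]
onto_b = ["doctors", "patient_record", "Drug"]
```

The limits of this are visible in the example: "Medicine" and "Drug" are the same concept but share no surface text, so pure string matching misses them, which is why GLUE-style learned classifiers and OntoMerge-style bridging axioms exist.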
OntoBuilder and WebKB: make assumptions about the input text being web based, and coverage depends on the input web pages.
SyndiKate: requires a large amount of input data.
GLUE: only creates one-to-one mappings.
OntoLift: requires a lot of upfront work by the data providers.
OntoMerge: adds an extra translation layer between the syntactic and semantic layers.
Prompt and Chimaera: require a lot of user interaction.
Selecting a tool: the most important consideration for a user selecting a tool is the type of input they have available; the next most important is the type of knowledge learned (concepts, taxonomic relationships, and non-taxonomic relationships). These priorities could be reversed if you are willing to create the types of input needed. A user will probably need to combine several tools to get what is needed, which leads to the workbench approach.
Suggested future work - Workbench: the best approach seen in the tools covered is the workbench approach, which allows using multiple tools. Ideally this would be an easily extended algorithm library that uses a common, combinable representation and can take in new algorithms as they are created. The workbench will also probably need to validate the output from the various steps and algorithms, which leads to the next point.
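The "common combinable representation" idea can be sketched as each algorithm emitting (element, confidence) pairs that the workbench merges, here by averaging confidence across the algorithms that proposed each element. This is an illustrative design with hypothetical data, not Text2Onto's actual combination scheme:

```python
from collections import defaultdict

def combine(results):
    """Meta-learner step: merge (element, confidence) outputs from several
    algorithms by averaging confidence per proposed element."""
    scores = defaultdict(list)
    for algorithm_output in results:
        for element, confidence in algorithm_output:
            scores[element].append(confidence)
    return {e: sum(c) / len(c) for e, c in scores.items()}

# Hypothetical outputs from two extraction algorithms in the library.
tfidf_out = [("disease", 0.9), ("weather", 0.2)]
pattern_out = [("disease", 0.7)]
```

Because every algorithm speaks the same (element, confidence) format, a new algorithm can be dropped into the library without changing the combination step, which is the extensibility property the workbench approach depends on.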
Suggested future work - Validation: validation for ontologies (manually created or machine learned) is still an open research area, and it needs to be resolved if ontologies are to be trusted for important tasks. Borrowing from machine learning validation: precision is used to measure the correctness of an ontology by looking for false positives, and recall is used to measure the coverage of the ontology. Validation is an important open area of research in ontology learning, as noted by all of the papers. It can also help solve other problems, such as the high need for user interaction, which could be replaced with automated validation techniques.
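The precision and recall measures described can be computed against a gold-standard ontology. A minimal sketch, assuming learned and gold relations are comparable sets of triples (the data is hypothetical):

```python
def precision_recall(learned, gold):
    """Precision: fraction of learned facts that are correct
    (penalizes false positives).  Recall: fraction of gold facts
    that were learned (measures coverage)."""
    tp = len(learned & gold)  # true positives
    precision = tp / len(learned) if learned else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

learned = {("doctor", "is-a", "person"), ("disease", "is-a", "person")}
gold = {("doctor", "is-a", "person"), ("patient", "is-a", "person")}
```

The practical obstacle, which keeps this an open area, is that a gold-standard ontology rarely exists for the domains where ontology learning is most needed.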
Fully automated solution: the workbench approach will help, as will validation support to replace the user in validating the steps of the process. Semantic web: even though it still has a long way to go, the hope for its possibilities is strong, and the beginnings of it are being seen.