Data Mining Ontologies
Data Mining Optimization (DMOP) Ontology and Knowledge Base
The DMOP ontology was designed to support algorithm, model, and workflow selection, with a view to optimizing the data mining (DM) process and the quality of the mined hypotheses (models or pattern sets). The terminological box (TBox), or the ontology proper, provides a unified conceptual framework for modeling DM tasks, datasets, algorithms, hypotheses, workflows, and performance metrics, as well as their relationships. The assertional box (ABox) contains facts describing instances of these concepts (e.g., specific DM algorithms). Together, these two boxes form the DMOP Knowledge Base, which can be viewed as a compendium of the collective expertise of the data mining community. The DMOP Knowledge Base is available on:
In the e-LICO system, the information distilled in the DMOP ontology and knowledge base is used to drive the self-improvement of the DM process planner. Whereas the DMWF ontology supports the planner in its task of generating candidate workflows for a given mining task, the DMOP ontology supports the meta-miner, whose role is to analyse past DM experiments in order to build predictive models for ranking these workflows. The meta-miner is a semantic meta-miner: it draws its analytical power not only from metadata describing past experiments but also from background knowledge stored in the DMOP ontology and knowledge base. These metadata are stored in RDF triple stores whose schemas are based on DMOP's conceptual framework. The figure shows the architecture of DMOP and its satellite databases, which describe the ingredients of DM experiments: operators (OPER-DB), datasets (DSET-DBs), workflows (WFLO-DB), and experimental parameters and results (DMEX-DBs).
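To make the idea of experiment metadata stored as triples more concrete, here is a minimal sketch in plain Python. It is not the e-LICO implementation: the in-memory set stands in for a real RDF triple store, and every identifier (the ex: names, the usesOperator and onDataset properties, the match and runs_using_class helpers) is a hypothetical placeholder rather than actual DMOP vocabulary. It shows schema-level statements, instance-level statements, and experiment records sharing one store, so that a query over past runs can draw on background knowledge about the operators involved.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# All identifiers below are hypothetical placeholders, not DMOP vocabulary.
STORE: Set[Triple] = {
    # Schema-level (TBox-style) statements about concepts
    ("ex:ClassificationAlgorithm", "rdfs:subClassOf", "ex:Algorithm"),
    ("ex:ClusteringAlgorithm", "rdfs:subClassOf", "ex:Algorithm"),
    # Instance-level (ABox-style) statements about specific algorithms
    ("ex:algoA", "rdf:type", "ex:ClassificationAlgorithm"),
    ("ex:algoB", "rdf:type", "ex:ClusteringAlgorithm"),
    # Metadata describing past experiments
    ("ex:run1", "ex:usesOperator", "ex:algoA"),
    ("ex:run1", "ex:onDataset", "ex:datasetX"),
    ("ex:run2", "ex:usesOperator", "ex:algoB"),
    ("ex:run2", "ex:onDataset", "ex:datasetX"),
}


def match(
    store: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def runs_using_class(store: Set[Triple], cls: str) -> List[str]:
    """Find experiment runs whose operator is a direct instance of cls."""
    instances = {subj for subj, _, _ in match(store, p="rdf:type", o=cls)}
    return sorted(
        run
        for run, _, operator in match(store, p="ex:usesOperator")
        if operator in instances
    )


if __name__ == "__main__":
    print(runs_using_class(STORE, "ex:ClassificationAlgorithm"))
```

With this toy data, the query for ex:ClassificationAlgorithm returns only ex:run1. The sketch does no subclass reasoning; a real deployment would more likely rely on an RDF store with SPARQL and an OWL reasoner.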
A distinctive feature of DMOP is its in-depth characterization of the major ingredients of the data mining process: datasets, algorithms, and learned hypotheses. Datasets are described using statistical and information-theoretic measures, as well as geometric complexity measures that suggest the potential difficulty of analysing them. Inductive paradigms and algorithms are modeled in terms of their implicit assumptions, their optimization strategies, their capabilities (e.g., the ability to handle classification costs or instance weights), and their resilience to data flaws such as noise or missing values. Mined hypotheses are characterized by their structural complexity, interpretability, average performance in a given application domain, and, for classifiers, the type of decision boundaries induced in the instance space.
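As an illustration of the information-theoretic dataset measures mentioned above, the following sketch computes two commonly used ones for a dataset with nominal features: the entropy of the class attribute and the mutual information between each feature and the class. The choice of these two measures, the function names, and the toy data are assumptions made for illustration; the text does not specify which measures DMOP records or how they are computed.

```python
from collections import Counter
from math import log2
from typing import Dict, Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of a sequence of discrete values."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def mutual_information(
    feature: Sequence[Hashable], target: Sequence[Hashable]
) -> float:
    """I(feature; target) = H(feature) + H(target) - H(feature, target)."""
    joint = list(zip(feature, target))
    return entropy(feature) + entropy(target) - entropy(joint)


def describe(
    features: Dict[str, Sequence[Hashable]], target: Sequence[Hashable]
) -> dict:
    """Collect a few simple characteristics of a nominal dataset."""
    return {
        "n_instances": len(target),
        "n_classes": len(set(target)),
        "class_entropy": entropy(target),
        "mutual_information": {
            name: mutual_information(column, target)
            for name, column in features.items()
        },
    }


if __name__ == "__main__":
    # Toy data, invented for this example only.
    toy_features = {
        "outlook": ["sunny", "sunny", "rain", "rain"],
        "windy": ["yes", "no", "yes", "no"],
    }
    toy_target = ["stay", "stay", "go", "go"]
    print(describe(toy_features, toy_target))
```

In the toy data, outlook determines the class completely (mutual information of 1 bit, equal to the class entropy), while windy carries no information about it (0 bits). Descriptors of this kind are what a meta-miner can use to relate dataset properties to the past performance of workflows.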
Due to the magnitude of the task and the constant evolution of data mining technology, DMOP is, and will remain, open to extension and revision. For this reason, it was designed to be a community undertaking. To ensure sustained and collective development of DMOP and similar ontologies, we have created the DMO Foundry, a web portal that provides a collaborative ontology development platform for data mining ontologies. As the Foundry's pilot ontology, DMOP can be explored online via the Foundry's DMO Browser (http://www.dmo-foundry.org/DMOBrowser/); it can also be downloaded from http://www.dmo-foundry.org/download-DMOP for offline inspection in any OWL browser or editor, such as Protégé 4.1.