Machine Learning and Inference Laboratory, College of Public Health
The Machine Learning and Inference (MLI) Laboratory conducts fundamental and experimental research on the development of intelligent systems capable of advanced forms of learning, inference, and knowledge generation, and applies them to real-world problems.
Major research areas include:
- theory and computational models of learning and inference
- data mining and knowledge discovery
- machine learning and natural induction
- inductive databases and knowledge scouts
- behavior modeling and computer intrusion detection
- non-Darwinian evolutionary computation
- multistrategy learning and knowledge mining
- intelligent systems for education
- models of human plausible reasoning
- machine vision with learning capabilities
Browsing Machine Learning and Inference Laboratory, College of Public Health by Subject "Data mining"
Now showing 1 - 8 of 8
Item: An Adjustable Description Quality Measure for Pattern Discovery in Large Databases Using the AQ Methodology (2000-03). Kaufman, Kenneth A.; Michalski, Ryszard S.
In concept learning and data mining tasks, the learner is typically faced with a choice of many possible hypotheses or patterns characterizing the input data. If one can assume that the training data contain no noise, then the primary conditions a hypothesis must satisfy are consistency and completeness with regard to the data. In real-world applications, however, data are often noisy, and the insistence on full completeness and consistency of the hypothesis is no longer valid. In such situations, the problem is to determine a hypothesis that represents the best trade-off between completeness and consistency. This paper presents an approach to this problem in which a learner seeks rules optimizing a rule quality criterion that combines rule coverage (a measure of completeness) and training accuracy (a measure of inconsistency). These factors are combined into a single rule quality measure through a lexicographical evaluation functional (LEF). The method has been implemented in the AQ18 learning system for natural induction and pattern discovery, and compared with several other methods. Experiments have shown that the proposed method can be easily tailored to different problems and, by modifying the parameters of the rule quality criterion, can simulate different rule learners.

Item: Attributional Calculus: A Logic and Representation Language for Natural Induction (2004-04). Michalski, Ryszard S.
Attributional calculus (AC) is a typed logic system that combines elements of propositional logic, predicate calculus, and multiple-valued logic for the purpose of natural induction. By natural induction is meant a form of inductive learning that generates hypotheses in human-oriented forms, that is, forms that appear natural to people and are easy to understand and relate to human knowledge.
To serve this goal, AC includes non-conventional logic operators and forms that can make logic expressions simpler and more closely related to the equivalent natural language descriptions. AC has two forms, basic and extended, each of which can be bare or annotated. The extended form adds more operators to the basic form, and the annotated form includes parameters characterizing statistical properties of bare expressions. AC has two interpretation schemas, strict and flexible. The strict schema interprets AC expressions as true-false valued, and the flexible schema as continuously valued. Conventional decision rules, association rules, decision trees, and n-of-m rules can all be viewed as special cases of attributional rules. Attributional rules can be directly translated to natural language, and visualized using concept association graphs and general logic diagrams. AC stems from Variable-Valued Logic 1 (VL1), and is intended to serve as a concept description language in advanced AQ inductive learning programs. To provide motivation and background for AC, the first part of the paper presents basic ideas and assumptions underlying concept learning.

Item: Building Knowledge Scouts Using KGL Metalanguage (2000). Michalski, Ryszard S.; Kaufman, Kenneth A.
Knowledge scouts are software agents that autonomously search for and synthesize user-oriented knowledge (target knowledge) in large local or distributed databases. A knowledge generation metalanguage, KGL, is used to create scripts defining such knowledge scouts. Knowledge scouts operate in an inductive database, by which we mean a database system in which conventional data and knowledge management operators are integrated with a wide range of data mining and inductive inference operators.
Discovered knowledge is represented in two forms: (1) attributional rules, which are rules in attributional calculus, a logic-based language between propositional and predicate calculus, and (2) association graphs, which graphically and abstractly represent relations expressed by the rules. These graphs can depict multi-argument relationships among different concepts, with a visual indication of the relative strength of each dependency. The presented ideas are illustrated by two simple knowledge scouts, one that seeks relations among lifestyles, environmental conditions, symptoms, and diseases in a large medical database, and another that searches for patterns of children's behavior in the National Youth Survey database. The preliminary results indicate a high potential utility of the presented methodology as a tool for deriving knowledge from databases.

Item: Generating Alternative Hypotheses in AQ Learning (2004-12). Michalski, Ryszard S.
In many areas of application of machine learning and data mining, it is desirable to generate alternative inductive hypotheses from the given data. The Aq-ALT or, briefly, ALT method, presented in this paper, generates alternative hypotheses in two phases. The first phase proceeds according to the standard Aq algorithm, but each star generation process produces not just one best complex, but rather a collection of complexes, called the elite. This phase ends when the union of best complexes constitutes a complete and consistent cover of the target set, called the primary hypothesis. The second phase derives alternative hypotheses by multiplying out the disjunctions of symbols representing complexes in each elite, and creating an irredundant DNF expression. Individual terms in this expression determine alternative hypotheses. These hypotheses are ranked according to a given hypothesis evaluation criterion, LEFh, and the alt best hypotheses are selected, where alt is a parameter provided to the program.
The method is extended to the inconsistent covering problem by introducing an event membership probability function. The selected hypotheses can be used as alternative generalizations of the data, or arranged into an ensemble of classifiers to perform a form of boosting. The ALT method is general, and can thus be employed not only in concept learning, but also for generating alternative solutions to any general covering problem.

Item: Initial Considerations toward Knowledge Mining (2004-10). Kaufman, Kenneth A.; Michalski, Ryszard S.
In view of the tremendous production of computer data worldwide, there is a strong need for new powerful tools that can automatically generate useful knowledge from a variety of data, and present it in human-oriented forms. In efforts to satisfy this need, researchers have been exploring ideas and methods developed in machine learning, statistical data analysis, data mining, text mining, data visualization, pattern recognition, and related fields. The first part of this paper is a compendium of ideas on the applicability of symbolic machine learning and logical data analysis methods toward this goal. The second part outlines a multistrategy methodology for an emerging research direction, called knowledge mining, by which we mean the derivation of high-level concepts and descriptions from data through symbolic reasoning involving both data and relevant background knowledge.
The effective use of background as well as previously created knowledge in reasoning about new data makes it possible for the knowledge mining system to derive useful new knowledge not only from large amounts of data, but also from limited and weakly relevant data.

Item: Multitype Pattern Discovery Via AQ21: A Brief Description of the Method and Its Novel Features (2006-06). Wojtusiak, Janusz; Michalski, Ryszard S.; Kaufman, Kenneth A.; Pietrzykowski, Jaroslaw
The AQ21 program seeks different types of patterns in data and represents them in human-oriented forms resembling natural language descriptions. Because of the latter feature, it is called a natural induction program. This feature is achieved by employing a highly expressive representation language, Attributional Calculus, which combines aspects of propositional, predicate, and multi-valued logic for the purpose of supporting pattern discovery and inductive learning. This paper briefly describes the pattern discovery mode in AQ21 and several novel abilities seamlessly integrated into it, specifically, the abilities to discover different types of attributional patterns depending on the parameter settings, to optimize patterns according to a large number of different pattern quality criteria, to learn rules with exceptions, to determine optimized sets of alternative hypotheses generalizing the same data, and to handle data with missing, irrelevant, and/or not-applicable meta-values. The discovered patterns are expressed in the form of attributional rules that are directly interpretable in natural language and can be visualized using either general logic diagrams or concept association graphs.
The described program features are illustrated by a sample of pattern discovery problems.

Item: Natural Induction and Conceptual Clustering: A Review of Applications (2006-06). Michalski, Ryszard S.; Kaufman, Kenneth A.; Pietrzykowski, Jaroslaw; Wojtusiak, Janusz; Mitchell, Scott; Seeman, Doug
Natural induction and conceptual clustering are two methodologies pioneered by the GMU Machine Learning and Inference Laboratory for discovering conceptual relationships in data and presenting them in forms easy for people to interpret and understand. The first methodology is for supervised learning (learning from examples) and the second for unsupervised learning (clustering). Examples of their application to a wide range of practical domains are presented, including bioinformatics, medicine, agriculture, volcanology, demographics, intrusion detection and computer user modeling, manufacturing, civil engineering, optimization of functions of a very large number of variables (100-1000), design of complex engineering systems, tax fraud detection, and musicology. Most of the results were obtained by applying our recent natural induction program, AQ21, which is downloadable from http://www.mli.gmu.edu/msoftware.html. To give the reader a quick insight into the differences between natural induction as implemented in AQ21 and some well-known learning methods, such as those implemented in C4.5, RIPPER, and CN2, as well as between conceptual clustering and conventional clustering, Sections 15 and 16 describe results from applying all these methods to very simple, designed problems.

Item: The AQ18 System for Machine Learning and Data Mining: An Implementation and User's Guide (2000-03). Michalski, Ryszard S.; Kaufman, Kenneth A.
This report is a comprehensive user's guide for AQ18, an environment for natural induction, machine learning, and knowledge discovery.
By natural induction is meant a form of inductive inference that strives to induce data descriptions that are most natural and comprehensible to people. This feature is achieved by employing a highly expressive description language (attributional calculus). Along with a learning module for determining attributional rulesets from examples, or for incrementally improving previously learned rulesets with new examples, AQ18 also incorporates a ruleset testing module (ATEST) and a module for selecting the best attributes for a given learning problem (PROMISE).
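Several of the abstracts above refer to a lexicographical evaluation functional (LEF) for trading off rule coverage against training accuracy. The following is a minimal illustrative sketch of the general lexicographic-with-tolerance idea those abstracts describe; it is not the MLI Laboratory's code, and all names (`Rule`, `lef_select`, the sample scores) are hypothetical.

```python
# Illustrative sketch of a lexicographical evaluation functional (LEF):
# criteria are applied in order, and each keeps only the candidates whose
# score falls within a tolerance of the best score on that criterion.
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    coverage: float   # fraction of positive examples covered (completeness)
    accuracy: float   # fraction of covered examples that are positive (consistency)

def lef_select(rules, criteria):
    """Filter rules through a LEF: a sequence of (key_function, tolerance) pairs.

    Survivors of criterion i (those within `tolerance` of the best score)
    are passed on to criterion i+1; the final survivors are returned.
    """
    candidates = list(rules)
    for key, tol in criteria:
        best = max(key(r) for r in candidates)
        candidates = [r for r in candidates if key(r) >= best - tol]
    return candidates

rules = [
    Rule("r1", coverage=0.90, accuracy=0.80),
    Rule("r2", coverage=0.88, accuracy=0.95),  # within coverage tolerance of r1, more accurate
    Rule("r3", coverage=0.60, accuracy=0.99),
]
# Coverage first (tolerance 0.05), then accuracy (strict):
winners = lef_select(rules, [(lambda r: r.coverage, 0.05),
                             (lambda r: r.accuracy, 0.0)])
print([r.name for r in winners])  # -> ['r2']
```

With a zero tolerance on the first criterion the LEF degenerates to a plain lexicographic ordering; nonzero tolerances are what let a slightly less complete but much more consistent rule win, which is the trade-off behavior the AQ18 abstract describes.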