An experimental comparison of methods for handling incomplete data in learning parameters of Bayesian networks



Authors:
Agnieszka Onisko
Bialystok University of Technology
Institute of Computer Science
Bialystok, 15-351, Poland
e-mail: aonisko@ii.pb.bialystok.pl

Marek J. Druzdzel
Decision Systems Laboratory
School of Information Sciences
and Intelligent Systems Program
University of Pittsburgh
e-mail: marek@sis.pitt.edu

Hanna Wasyluk
The Medical Center of Postgraduate Education
Warsaw, Marymoncka 99, Poland
e-mail: hwasyluk@cmkp.edu.pl

Abstract:
Missing values of attributes in data sets, also referred to as incomplete data, pose difficulties in learning tasks such as classification, data mining, or learning the structure and numerical parameters of a Bayesian network. Because incomplete data are so common in practice, many methods have been proposed to deal with them, but few studies compare their performance. The HEPAR II project presents an excellent opportunity to test experimentally how these methods perform on a real data set. We briefly review several popular methods for handling incomplete data and then compare them on the task of learning the conditional probability distributions of a Bayesian network model, using the resulting diagnostic accuracy as the comparison criterion. Although substituting "normal" values for missing attribute values seemed to perform best, we observed only small differences in performance among the studied methods.
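
To make the idea of substituting "normal" values concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes a toy network with a single parent node and a single child node, estimates the conditional probability table by relative frequencies, and contrasts skipping incomplete records with filling in an assumed "normal" state. The names `impute_normal`, `learn_cpt`, and `normal_values`, and the four records, are hypothetical and invented purely for illustration.

```python
from collections import Counter, defaultdict

MISSING = None  # assumed marker for an unobserved attribute value


def impute_normal(records, normal_values):
    """Replace each missing attribute with its assumed 'normal' state."""
    return [
        {attr: (normal_values[attr] if val is MISSING else val)
         for attr, val in rec.items()}
        for rec in records
    ]


def learn_cpt(records, child, parent):
    """Estimate P(child | parent) by relative frequencies.

    Records in which either value is missing are skipped.
    """
    counts = defaultdict(Counter)
    for rec in records:
        if rec[child] is MISSING or rec[parent] is MISSING:
            continue
        counts[rec[parent]][rec[child]] += 1
    return {
        p: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for p, ctr in counts.items()
    }


# Made-up toy records: one disorder node and one finding node.
records = [
    {"disorder": "present", "finding": "elevated"},
    {"disorder": "present", "finding": MISSING},
    {"disorder": "absent", "finding": "normal"},
    {"disorder": "absent", "finding": MISSING},
]
normal_values = {"finding": "normal"}  # hypothetical "normal" state

# Strategy 1: ignore incomplete records.
cpt_dropped = learn_cpt(records, child="finding", parent="disorder")

# Strategy 2: substitute the "normal" value, then count.
cpt_imputed = learn_cpt(
    impute_normal(records, normal_values), child="finding", parent="disorder"
)

print(cpt_dropped["present"])  # {'elevated': 1.0}
print(cpt_imputed["present"])  # {'elevated': 0.5, 'normal': 0.5}
```

On this toy data the two strategies yield different estimates of P(finding | disorder = present), which is the kind of difference a comparison by diagnostic accuracy would then evaluate.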

The full paper is available in Compressed PostScript (93KB) and PDF (109KB) formats.

marek@sis.pitt.edu / Last update: 11 May 2005