I.E. 1062/2062: DATA MINING
(Fall 2011-2012)
INSTRUCTOR:
Dr. Jayant Rajgopal
1039 Benedum Hall
Telephone No. 624-9840
e-mail: rajgopal@pitt.edu
URL for this web page: http://www.pitt.edu/~jrclass/datamining/
LECTURES:
Room 1021 Benedum Hall
Tue./Thu.: 9:30 AM - 10:45 AM
TEXT:
Introduction to Data Mining, by Pang-Ning Tan,
Michael Steinbach and Vipin Kumar, Addison-Wesley,
Boston (2006).
NOTES:
Class Notes are required. Please download, print,
and bring them to class with you: Notes (zipped folder - password required)
REFERENCES:
The following references will all be on reserve in the Engineering
Library:
Data Mining: Practical Machine Learning Tools and
Techniques (2nd Ed.), by Ian H. Witten and Eibe
Frank, Morgan Kaufmann Publishers, San Francisco (2005).
Data Mining: Concepts and Techniques, by Jiawei
Han and Micheline Kamber, Morgan Kaufmann
Publishers, San Francisco (2006).
Data Mining Techniques for Marketing, Sales and Customer
Support, by Michael J. A. Berry and Gordon
Linoff, John Wiley & Sons, New York (2004).
SOFTWARE:
Overview of IBM SPSS
Modeler 14.2 (previously known as Clementine)
CONTENT:
This is an introductory course in Data Mining. The
objective is to introduce the student to the area of data mining
and to survey the important techniques associated with the
subject. The tentative list of topics includes data mining
applications, an overview of data warehousing, inputs and data
preparation, knowledge representation, evaluation of learning, and
techniques for classification, association and clustering.
Specific algorithms include decision trees; Bayesian learning;
covering algorithms for classification rules; instance-based
learning; backpropagation and artificial neural networks; the Apriori
algorithm; FP-growth; the k-means method; and agglomerative,
divisive and probability-based clustering. Please refer to
the link alongside for a tentative list of
topics.
GRADING:
Grades are based on two open-book examinations, a term paper,
homework and class discussions.
A very famous data miner...
HOMEWORK & ANNOUNCEMENTS
- September 20, 2011: Here is a copy of the Excel file with the
cell-phone data that we discussed in class yesterday. Also, here
is Homework 1 - due Tuesday, Sep. 27. (Sep. 28: SOLUTIONS)
- September 28, 2011: Here is a copy of the Excel file with the
cell-phone count data that we used with the C5.0 method in class
yesterday. Also, posted above are solutions to HW1.
- September 29, 2011: Here is Homework 2 - due Thursday, Oct. 06.
Please start early on this! (Oct. 09: SOLUTIONS)
- October 10, 2011: Here is a copy of the Excel file for the
Classification Rules exercise I went over in class.
- October 13, 2011: Here is Homework 3 - due Thursday, Oct. 20.
(Oct. 20: SOLUTIONS)
- October 20, 2011: The mid-term exam
will be on Tuesday, October 25 and will cover everything
we have seen in class up to and including classification
methods. Basically, this means you should study:
  - The introductory material: lecture notes + Chapter 1 in
    the text,
  - Inputs and outputs: lecture notes + Sections 2.1-2.4 in the
    text (Sections 2.1-2.3, parts of 2.4, 3.1-3.5, 3.8, 3.9
    in the book by Witten & Frank)
  - Classification (1-R, Naive Bayesian, Decision Trees,
    Classification Rules, Neural Networks, Instance Based
    Learning): lecture notes + Sections 4.1-4.3, parts of 4.4,
    parts of 5.1, 5.2, parts of 5.3, and parts of 5.4
    (Sections 4.1-4.4, parts of 6.1, 6.2 and 6.4 in the book
    by Witten & Frank)
- October 27, 2011: Here are Solutions to the
Mid-term exam. Please download and print them
out so that you can compare them with your answers when I
hand back the tests. Also, Homework
4 - due Thursday, November 3: Question 17, Section 5.10 of
the text (pp. 322-323). (Nov. 04: SOLUTIONS)
- October 31, 2011: Here is the Excel
file for the material on ROC curves covered in class
last Thursday.
- November 04, 2011: If you are looking for applications of data
mining for your term paper, possible sources include the
journals International Journal of Production Research
and Data Mining and Knowledge Discovery, and the
Proceedings of the Annual ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. There
are also many articles and surveys in the popular press and
less technical journals. Basically, I'm looking for a
brief paper that provides details on a real-world
application of data mining: what was the problem
considered, why was data mining a possible solution, what
data was required and where it was obtained, what technique
was used, what level of implementation of results took
place, what were the benefits (if any) that were obtained,
etc. I expect it will be between 5 and 10 pages long.
- November 08, 2011: Here is Homework 5 -
due Tuesday, Nov. 15. (Nov. 16: SOLUTIONS)
- November 16, 2011: Here are links to applets that
demonstrate clustering via k-means
- November 18, 2011: Here is
the Excel file for clustering via the k-prototypes
method that I used in class yesterday.
- November 29, 2011: Here is Homework
6 - due Thursday, December 08 (Dec. 10: SOLUTIONS). Also, here are the links to the
Java applets for hierarchical clustering:
- December 7, 2011: Here are the Excel files I used in
class for hierarchical, incremental and probabilistic
clustering: AGNES, COBWEB, Expectation-Maximization.
- December 8, 2011: The final exam will be on
Monday, December 12 at 10 A.M. and will
have a format similar to the mid-term. It will
cover:
  - Evaluation of Learning: class notes + Sections 4.5 and 5.7
    of the text; (Sections 5.1-5.4 of the book by Witten & Frank)
  - Association Rules Mining / Market Basket Analysis: class
    notes + Chapter 6 (most of Sections 6.1 to 6.3) of the text.
  - Clustering: class notes + Chapter 8 (most of Sections 8.1
    through 8.3), & most of Section 9.2.2 of the text;
    (Sections 4.8, 6.6 to page 266 in the book by Witten & Frank)
DATA FILES
cancer.csv
zoo.csv
zoo1.csv
zoo2.csv
datamine.csv
hwdata.csv
datamine-symbolic.csv
usage.csv