Scaling Theory

IS 2065

Fall 2001 (02-1)

Instructor: Dr. Stephen Hirtle
Office: B201 IS Building
Office Phone: 624-9434
Email: sch@sis.pitt.edu
Office Hours: Wednesday 1:30-3:00pm or by appointment
Class Meets: Thursday, 3:00-5:50pm, 501 IS Bldg
Prerequisite: IS 2060 or permission of instructor

All available information about this course may be found at: http://www.pitt.edu/~hirtle/is2065.html

  Overview. Scaling techniques have become increasingly popular with the growth of data mining and knowledge discovery.  This term, the focus will be on the use of scaling techniques for data mining.

The course will provide an important foundation for further study in diverse areas, such as information retrieval, cognitive science, and marketing. The techniques discussed are also the foundation of many modern data mining techniques. In addition, the course will count as one of the two required statistics courses for the Information Science Track of the DIST PhD program.

Materials. The primary text for the term is
 

J. Han and M. Kamber.
Data Mining: Concepts and Techniques.
Morgan Kaufmann, 2000.


A complete set of PowerPoint slides to accompany the text can be found at: http://www.cs.sfu.ca/~han/DM_Book.html. Additional readings will be assigned each week to complement the text. Students should read both the text and the readings each week. However, given your own level of knowledge and interest, you will be free to focus on either the text or the readings. The material that is not the focus of your study may be skimmed. Both text and readings will be discussed each week during the lectures. Generally, the text will be covered in the first half of the lecture and the readings in the second half.

Assignments will require accounts on icarus.sis, unixs.cis, and vms.cis. See me if you have trouble getting an account on any of these machines. We will be using several packages and programs during the semester, including S-plus and SPSS. Additional links related to the class can be found at the following sites:


Evaluation. Evaluation will occur through a combination of three short papers and a term project. The papers can be one of two types: a review or an analysis. A review will describe a current problem in the data mining field and discuss various proposed solutions, including any solutions that you might suggest. An analytical paper will consist of a short, written analysis of a data set using one or more techniques. The papers will be limited to 5 pages of text, plus supporting graphs and tables. For each paper, there will be a set of guidelines/topics that will be distributed two weeks before the paper is due. Each paper will count for 50 points. A late paper will lose 2 points for each day it is late. No paper will be accepted more than 7 days after the due date. All papers must be completed independently. You must submit at least one paper of each type at some point in the term.

In addition to the short papers, there will be a separate term project, which will be similar to the papers in style, but will include a general discussion and must cover at least two of the topic areas. The written part of the project will be limited to 8 pages of text. The project will also be presented orally during the last class meeting. The entire project will be worth 100 points, including 10 points for the oral presentation. Any extenuating circumstances that would result in missing the final deadline must be discussed in advance with the instructor. The oral presentation cannot be made up.

Special circumstances. If you have a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and the Office of Disability Resources and Services, 216 William Pitt Union, (412-648-7890/TTY:412-383-7355) as early as possible in the term. DRS will verify your disability and determine reasonable accommodations for this course. In addition, you should be aware that my office is up a short flight of stairs. If this is problematic, I am happy to arrange a meeting in an accessible location at any time.

Course Readings

Introduction; 8/30
Chapter 1

Data Warehousing; 9/6
Chapter 2
HFW+96a; AAD+96; HAC+99; Han98

Intro to Splus; 9/13
Links to Splus Documentation, organized by OS; Splus Introduction; Brief intro to Splus

(Note: no class on 9/20 -- COSIT Meeting)

Data Preprocessing; 9/27
Chapter 3

Data Mining Primitives; 10/4
Chapter 4
Paper 1 due today
Fre99

Concept Description; 10/11
Chapter 5
KN98; KN97; GCB+97

Mining Association Rules; 10/18
Chapter 6
RMS98

Classification; 10/25
Chapter 7
Paper 2 due today

Cluster Analysis; 11/1 & 11/8
Chapter 8

Complex data types; 11/15
Chapter 9
Latent Semantic Indexing; Varenius Report on Spatial Data-Mining

Applications; 11/29
Chapter 10
Paper 3 due today
UIC Terabyte Data Mining Project

Discussion; 12/6

Class Presentations; 12/13
Final Paper due

Other References | CSNA | KD Nuggets | Citeseer