Data Mining - SMU

DATA MINING
Part I
IIIT Allahabad

Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Dallas, Texas 75275, USA
[email protected]

Some slides extracted from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.
Support provided by Fulbright Grant and IIIT Allahabad.

Data Mining Outline
- Part I: Introduction (19/1 - 20/1)
- Part II: Classification (24/1 - 27/1)
- Part III: Clustering (31/1 - 3/2)
- Part IV: Association Rules (7/2 - 10/2)
- Part V: Applications (14/2 - 17/2)

Class Structure
- Each class is two hours
- Tuesday/Wednesday: presentation
- Thursday/Friday: lab

Data Mining Part I: Introduction Outline
Goal: Provide an overview of data mining.
- Lecture
  - Define data mining
  - Data mining vs. databases
  - Basic data mining tasks
  - Data mining development
  - Data mining issues
- Lab
  - Download XLMiner and Weka
  - Analyze simple dataset

Introduction
- Data is growing at a phenomenal rate
- Users expect more sophisticated information
- How? UNCOVER HIDDEN INFORMATION: DATA MINING

Data Mining Definition
- Finding hidden information in a database
- Fit data to a model
- Similar terms
  - Exploratory data analysis
  - Data driven discovery
  - Deductive learning

Data Mining Algorithm
- Objective: Fit data to a model
  - Descriptive
  - Predictive
- Preference: Technique to choose the best model
- Search: Technique to search the data
  - "Query"

Database Processing vs. Data Mining Processing
- Database processing
  - Query: Well defined; SQL
  - Data: Operational data
  - Output: Precise; Subset of database
- Data mining processing
  - Query: Poorly defined; No precise query language
  - Data: Not operational data
  - Output: Fuzzy; Not a subset of database

Query Examples
- Database
  - Find all credit applicants with last name of Smith.
  - Identify customers who have purchased more than $10,000 in the last month.
  - Find all customers who have purchased milk.
- Data Mining
  - Find all credit applicants who are poor credit risks. (classification)
  - Identify customers with similar buying habits. (clustering)
  - Find all items which are frequently purchased with milk. (association rules)

Basic Data Mining Tasks
- Classification maps data into predefined groups or classes
  - Supervised learning
  - Prediction
  - Regression
- Clustering groups similar data together into clusters
  - Unsupervised learning
  - Segmentation
  - Partitioning

Basic Data Mining Tasks (contd)
- Link Analysis uncovers relationships among data
  - Affinity Analysis
  - Association Rules
- Sequential Analysis determines sequential patterns

CLASSIFICATION
Assign data into predefined groups or classes.

But it isn't Magic
- You must know what you are looking for
- You must know how to look for it
- Suppose you knew that a specific cave had gold:
  - What would you look for?
  - How would you look for it?
  - Might need an expert miner

"If it looks like a duck, walks like a duck, and quacks like a duck, then it's a duck."
"If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it's a terrorist."
- Description: Classification (Profiling)
- Behavior: Clustering (Similarity)
- Associations: Link Analysis

Classification Ex: Grading
Decision tree on the score x:
- x >= 90: A
- x < 90 and x >= 80: B
- x < 80 and x >= 70: C
- x < 70 and x >= 60: D
- x < 60: F

Katydids and Grasshoppers
Given a collection of annotated data (in this case five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
(c) Eamonn Keogh, [email protected]
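The grading tree from the "Classification Ex: Grading" slide can be sketched as a chain of threshold tests. This is a minimal illustration, not code from the lecture: the function name `grade` and the sample scores are hypothetical, and the D/F boundary is assumed to be 60 (the extracted slide text shows an inconsistent "<50 >=60" at that node).

```python
def grade(x: float) -> str:
    """Walk the grading decision tree from the root (x >= 90?) downwards."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:  # assumed boundary between D and F
        return "D"
    return "F"


# Hypothetical scores, one per leaf of the tree.
scores = [95, 83, 71, 65, 42]
grades = [grade(s) for s in scores]
print(grades)
```

Each score follows exactly one path through the tree, which is what "maps data into predefined groups or classes" means in practice.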

The classification problem can now be expressed as: given a training database, predict the class label of a previously unseen instance.

Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid

Previously unseen instance:
11          5.1              7.0               ???????
(c) Eamonn Keogh, [email protected]

[Figure: scatter plot of Antenna Length (1 to 10) against Abdomen Length (1 to 10), with Grasshoppers and Katydids marked] (c) Eamonn Keogh, [email protected]

Facial Recognition
(c) Eamonn Keogh, [email protected]
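One way to label the unseen insect (instance 11) from the training table is a k-nearest-neighbour vote. The slides do not name a classifier at this point, so k-NN is swapped in here purely for illustration; the function name `knn_classify` and the choice k = 3 are assumptions.

```python
from collections import Counter
from math import dist

# Training database from the slide: (abdomen length, antennae length, class).
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def knn_classify(abdomen: float, antennae: float, k: int = 3) -> str:
    """Return the majority class among the k closest training instances."""
    query = (abdomen, antennae)
    neighbours = sorted(TRAINING, key=lambda row: dist(query, row[:2]))[:k]
    votes = Counter(label for _, _, label in neighbours)
    return votes.most_common(1)[0][0]


# Instance 11 from the slide: abdomen 5.1, antennae 7.0.
predicted = knn_classify(5.1, 7.0)
print(predicted)
```

With these ten training rows the two closest neighbours of instance 11 are insects 7 and 5, both Katydids, so the vote comes out as Katydid.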

Handwriting Recognition
[Figure: George Washington Manuscript, with a time-series plot (vertical axis 0 to 1, horizontal axis 0 to 450)] (c) Eamonn Keogh, [email protected]

Anomaly Detection

CLUSTERING
Partition data into previously undefined groups.

What is Similarity?
(c) Eamonn Keogh, [email protected]

Two Types of Clustering
- Hierarchical
- Partitional
(c) Eamonn Keogh, [email protected]

Hierarchical Clustering Example: Iris Data Set
- Versicolor
- Setosa
- Virginica
The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188.
Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland.

,9171,1541283,00.html

Microarray Data Analysis
- Each probe location associated with gene
- Color indicates degree of gene expression
- Compare different samples (normal/disease)
- Track same sample over time
- Questions
  - Which genes are related to this disease?
  - Which genes behave in a similar manner?
  - What is the function of a gene?
- Clustering
  - Hierarchical
  - K-means

Microarray Data - Clustering
"Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004.

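The microarray slide lists K-means as one way to find genes that behave in a similar manner. A minimal sketch of the idea follows, assuming each gene is described by a hypothetical two-value expression profile (normal sample, disease sample). The data, the function name `kmeans`, and the naive "first k points" initialisation are all assumptions made for illustration.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # assignments have stopped changing
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical expression profiles: (level in normal sample, level in disease sample).
low = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8)]
high = [(5.0, 5.2), (4.8, 5.1), (5.2, 4.9)]
centroids, clusters = kmeans(low + high, k=2)
for members in clusters:
    print(members)
```

K-means is a partitional method: unlike hierarchical clustering it needs the number of clusters up front, and its result can depend on the starting centroids.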
ASSOCIATION RULES / LINK ANALYSIS
Find relationships between data.

ASSOCIATION RULES EXAMPLES
- People who buy diapers also buy beer
- If gene A is highly expressed in this disease then gene A is also expressed
- Relationships between people
- Book Stores
- Department Stores
- Advertising
- Product Placement

Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
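A rule such as "people who buy diapers also buy beer" is usually judged by its support and confidence. These two measures are standard in association rule mining but are not defined on the slides above, so treat the definitions in the comments as background. The transactions and function names below are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]

rule_support = support(transactions, {"diapers", "beer"})           # 2 of 5 baskets
rule_confidence = confidence(transactions, {"diapers"}, {"beer"})   # 2 of 3 diaper baskets
print(rule_support, rule_confidence)
```

Unlike a database query, the analyst does not name the items in advance; an association rule algorithm searches over item combinations and reports those whose support and confidence clear chosen thresholds.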

Joshua Benton and Holly K. Hacker, "At Charters, Cheating's off the Charts," Dallas Morning News, June 4, 2007.
- No/Little Cheating
- Rampant Cheating

Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman, and Hsinchun Chen, "Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network," Lecture Notes in Computer Science.

Ex: Stock Market Analysis
Example: Stock Market
- Predict future values
- Determine similar patterns over time
- Classify behavior

Ex: Stock Market Analysis

Data Mining vs. KDD
- Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
- Data Mining: use of algorithms to extract the information and patterns derived by the KDD process.

KDD Process
Modified from [FPSS96C]
- Selection: Obtain data from various sources.
- Preprocessing: Cleanse data.
- Transformation: Convert to common format. Transform to new format.
- Data Mining: Obtain desired results.
- Interpretation/Evaluation: Present results to user in meaningful manner.

KDD Process Ex: Web Log
- Selection: Select log data (dates and locations) to use
- Preprocessing: Remove identifying URLs; remove error logs
- Transformation: Sessionize (sort and group)
- Data Mining: Identify and count patterns; construct data structure
- Interpretation/Evaluation: Identify and display frequently accessed sequences.
- Potential User Applications: Cache prediction; Personalization

Related Topics
- Databases
- OLTP
- OLAP
- Information Retrieval
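The web-log example of the KDD process ("KDD Process Ex: Web Log") can be walked through on a tiny log. This is a sketch under stated assumptions: the log records and their layout are hypothetical, each visitor is treated as a single session, and "patterns" are simplified to pairs of consecutively visited pages.

```python
from collections import Counter, defaultdict

# Selection: hypothetical log records (visitor, timestamp, url, HTTP status).
LOG = [
    ("u1", 1, "/home", 200),
    ("u1", 2, "/products", 200),
    ("u2", 1, "/home", 200),
    ("u2", 2, "/missing", 404),
    ("u2", 3, "/products", 200),
    ("u1", 3, "/cart", 200),
    ("u3", 5, "/home", 200),
    ("u3", 6, "/products", 200),
]

# Preprocessing: drop error entries (status 400 and above).
clean = [record for record in LOG if record[3] < 400]

# Transformation: sessionize, i.e. sort by visitor and time, then group by visitor.
sessions = defaultdict(list)
for visitor, _, url, _ in sorted(clean, key=lambda r: (r[0], r[1])):
    sessions[visitor].append(url)

# Data mining: count each pair of consecutively visited pages.
pair_counts = Counter()
for pages in sessions.values():
    pair_counts.update(zip(pages, pages[1:]))

# Interpretation/Evaluation: report pairs seen in at least two sessions.
frequent = [pair for pair, count in pair_counts.most_common() if count >= 2]
print(frequent)
```

A frequent pair such as home followed by products is the kind of result that could drive the cache prediction or personalization applications named on the slide.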

DB & OLTP Systems
- Schema: (ID, Name, Address, Salary, JobNo)
- Data Model: ER, Relational
- Transaction
- Query: SELECT Name FROM T WHERE Salary > 100000
- DM: Only imprecise queries

Classification/Prediction is Fuzzy
[Figure: Accept and Reject regions plotted against loan amount (Loan Amnt); two panels, Simple and Fuzzy]

Information Retrieval
- Information Retrieval (IR): retrieving desired information from textual data.
- Library Science
- Digital Libraries
- Web Search Engines
- Traditionally keyword based
- Sample query: Find all documents about data mining.
- DM: Similarity measures; mine text/Web data.

Information Retrieval (contd)
- Similarity: measure of how close a query is to a document.
- Documents which are close enough are retrieved.
- Metrics:
  - Precision = |Relevant and Retrieved| / |Retrieved|
  - Recall = |Relevant and Retrieved| / |Relevant|
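The two metrics translate directly into set operations. In this small sketch the document identifiers and the function name `precision_recall` are hypothetical; returning 0.0 for an empty set is a convention chosen here, not something the slide specifies.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query from two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # |Relevant and Retrieved|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result of the sample query "Find all documents about data mining."
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 2/4) and two of the three relevant documents were retrieved (recall 2/3). Retrieving more documents tends to raise recall at the cost of precision.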

IR Query Result Measures and Classification
[Figure: two panels, IR and Classification]

OLAP
- Online Analytic Processing (OLAP): provides more complex queries than OLTP.
- OnLine Transaction Processing (OLTP): traditional database/transaction processing.
- Dimensional data; cube view
- Visualization of operations:
  - Slice: examine sub-cube.
  - Dice: rotate cube to look at another dimension.
  - Roll Up/Drill Down
- DM: May use OLAP queries.

DM vs. Related Topics

Area      Query     Data              Results   Output
DB/OLTP   Precise   Database          Precise   DB Objects or Aggregation
IR        Precise   Documents         Vague     Documents
OLAP      Analysis  Multidimensional  Precise   DB Objects or Aggregation
DM        Vague     Preprocessed      Vague     KDD Objects

Data Mining Development
- Similarity Measures
- Relational Data Model
- SQL
- Association Rule Algorithms
- Data Warehousing
- Scalability Techniques
- Hierarchical Clustering
- IR Systems
- Imprecise Queries
- Textual Data
- Web Search Engines
- Bayes Theorem
- Regression Analysis
- EM Algorithm
- K-Means Clustering
- Time Series Analysis
- Algorithm Design Techniques
- Algorithm Analysis
- Data Structures
- Neural Networks
- Decision Tree Algorithms

KDD Issues
- Human Interaction
- Overfitting
- Outliers
- Interpretation
- Visualization
- Large Datasets
- High Dimensionality

Overfitting
Suppose we want to predict whether an individual is short, medium, or tall in height. What is wrong with this data?

Name        Gender  Height  Output
Mary        F       1.6     Short
Maggie      F       1.9     Medium
Martha      F       1.88    Medium
Stephanie   F       1.7     Short
Bob         M       1.85    Medium
Kathy       F       1.6     Short
George      M       1.7     Short
Debbie      F       1.8     Medium
Todd        M       1.95    Medium
Kim         F       1.9     Medium
Amy         F       1.8     Medium
Wynette     F       1.75    Medium

KDD Issues (contd)
- Multimedia Data
- Missing Data
- Irrelevant Data
- Noisy Data
- Changing Data
- Integration
- Application

WARNING
- With data mining you don't always know what you are looking for.
- There is not one right answer.
- The data you are using is noisy.
- Data Mining is a very applied discipline.
- A data mining course provides you tools to use to analyze data.
- Experience provides you knowledge of how to use these tools.

er=32236

Social Implications of DM
- Privacy
- Profiling
- Unauthorized use
- Invalid results and claims

Data Mining Metrics
- Usefulness
- Return on Investment (ROI)
- Accuracy
- Space/Time

Visualization Techniques
- Graphical
- Geometric
- Icon-based
- Pixel-based
- Hierarchical
- Hybrid

Models Based on Summarization
- Visualization: Frequency distribution, mean, variance, median, mode, etc.
- Box Plot: [figure]
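The summary measures named on this slide can be computed with Python's standard `statistics` module. As an illustration, the sketch below reuses the Height column from the Overfitting table earlier in the deck; using the sample variance and the module's default quartile method are choices made here, not something the slide prescribes.

```python
import statistics

# Height column from the Overfitting table.
heights = [1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75]

mean = statistics.mean(heights)
median = statistics.median(heights)
variance = statistics.variance(heights)   # sample variance (divides by n - 1)
modes = statistics.multimode(heights)     # every value tied for most frequent

# Five-number summary, the numbers a box plot draws.
q1, q2, q3 = statistics.quantiles(heights, n=4)
five_numbers = (min(heights), q1, q2, q3, max(heights))

print(mean, median, variance, modes)
print(five_numbers)
```

Several heights occur twice in this column, so `multimode` reports more than one mode. That is a small reminder that a single summary number can hide the shape of the data, which is why the slide pairs these measures with plots.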

Scatter Diagram
[figure]

DM Tools
- XLMiner: Easy add-in to Excel
- Weka: Open Source; Visualization, Functionality, Interface
- SAS (JMP): Commercial Product
- SPSS: Commercial Product
- MATLAB: Statistical/Math Applications
- R: Programming
