Formato Base dei Dati - UCLA

Extending DSMS for Data Stream Mining CS240B Notes by Carlo Zaniolo UCLA CSD 1 Data Streams Continuous, unbounded, rapid, time-varying streams of data elements Occur in a variety of modern applications

Network monitoring and traffic engineering Sensor networks, RFID tags Telecom call records Financial applications Web logs and click-streams Manufacturing processes DSMS = Data Stream Management System 2 Many Research Projects Amazon/Cougar (Cornell) sensors Aurora (Brown/MIT) sensor monitoring, dataflow Hancock (AT&T) Telecom streams Niagara (OGI/Wisconsin) Internet DBs & XML

OpenCQ (Georgia) triggers, view maintenance Stream (Stanford) general-purpose DSMS Tapestry (Xerox) pubish/subscribe filtering Telegraph (Berkeley) adaptive engine for sensors Gigascope: AT&T Labs Network Monitoring Stream Mill (UCLA) - power & extensibility 3 Technology Challenges Data Models Relational Streams--but XML streams important too Tuple Time-Stamping Order is important Windows Query Languages: Extensions of SQL or XQUERY To support continuous (i.e., persistent) queries on transient datareversal of roles. Blocking operators excluded

Query Plans: New execution models (main memory oriented) Optimized scheduling for response time or memory Quality of Services (QoS) & Approximation Synopses Sampling Load shedding. 4 Commercial Developments Several Startups

Streambase, Coral8, Apama, and Truviso. Oracle and DBMS companies Publish/subscribe Complex Event Processing (CEP) Limitations: only simple applicationse.g. continuous queries expressed in SQL No Support for Data Stream Mining queries. 5 Data Stream Mining Many applications: click stream analysis, intrusion detection,... Many fast & light algorithms developed for stream mining.

Ensembles, Moment, SWIM, etc. Analyst should be able to focus on high-level mining tasks. Leaving QoS and lower-level issues to the system. Integration of mining methods into Data Stream Management Systems (DSMS) is required Many research challenges. Stream Mill Miner (SMM) is the first DSMS designed for that.

6 Data Stream Management Systems (DSMS) Data stream mining applications so far ignored by DSMS although A. DSMS technology is required for data stream mining QoS, query scheduling, synopses, sampling, windows, ... B. But supporting DM applications is difficult since current DSMS only support simple query languages based on SQL. Conclusion: either a shotgun wedding ... or a research breakthrough is needed here! 7 A Difficult Problem: the Inductive DBMS

Experience Initial attempts to support mining queries in relational DBMS: Unsuccessful OR-DBMS do not fare much better [Sarawagi 98]. In 1996 the high-road approach by Imielinski & Mannila who called for a quantum leap in functionality: High-level declarative languages for DM . Extensions for query processing and optimization. The research area of Inductive DBMS was thus born

Inspired DMQL, Mine Rule, MSQL, etc. Suffer from limited generality and performance issues. 8 DBMS Vendors Vendors have taken a `low-road approach. A library of mining functions using a cache-mining approach IBM DB2 Intelligent Miner Oracle Data Miner MS OLE DB for DM: mining models

Closed systems, Lacking in coverage and user-extensibility. Not as popular as dedicated, stand-alone mining systems, such as Weka 9 Weka A comprehensive set of mining algorithms, and tools. Generic algorithms over arbitrary data sets. Independent on the number of columns in tables. Open and extensible system based on Java. These are the features that we want in our SMM

starting from SQL rather than Java! Not an easy task ...why? 10 SMM Contributions Build on Stream Mill DSMS and its SQL-based continuous query language and enabling technology. Language and System Extensions: Genericity, Extensibility, and Performance A suite of stream mining algorithms.

Existing ones and Newly developed in this projecte.g., SWIM. High level mining model for better Usability Control of mining process. 11 From SQL to Online Mining in SMM: step by step Nave Bayesian Classifier (NBC).

Important and frequently used. Schema-specific NBC. Simple to express in SQL by count, sum aggregates. But a generci NBC is still preferable. Genericity: one function independent of number columns involved. Schema independence in SQL? 12 Genericity Weka Arrays of type real. SMM

Verticalization. Similar arrays, but in tables. Built-in table function to reduce any table to this form. Thus, generic UDAs work with this schema. And further improvements are also supported in SMM 13

Extensibility? Most mining tasks cannot be implemented in SQL. Solution: Define complex functions by User Defined Aggregates (UDAs) Complex mining tasks can be viewed as aggregates UDAs Natively defined in SQL make the language computationally complete [Wang 04] Turing-complete over static data Non-blocking complete over data streams Natural extensions to support windows and delta computations for data streams [Bai 06]

UDAs can be defined in a PL, for better performance 14 Windowed UDA Example Continuous Count WINDOW AGGREGATE sum(val REAL):REAL { TABLE state (tot real); INITIALIZE: { INSERT INTO state VALUES(val); } ITERATE: { UPDATE state SET tot = tot + val; } EXPIRE: { UPDATE state SET tot = tot oldest().val; } /* No TERMINATE state */ } For efficient

differential computation 15 Online Mining in SMM UDAs Invoked with standard SQL:2003 syntax of OLAP functions. SELECT learn(ts.Column, ts.Value, t.dec) OVER (ROWS 1000 PRECEDING) FROM trainingstream AS t, TABLE (verticalize(Outlook, Temp, Humidity, Wind)) AS ts Powerful framework: Concept drifts-shifts Association rule mining

16 The Slide Construct A window can be divided into panes (called a slide) Tumbling windows when the size of the slide is equal or larger than that of the window The slide/window combination is great for data stream mining. Simple construct added to support slides in UDAs Allowed us to build a flexible and efficient library of data stream mining UDAs 17 SMM Contributions Build on Stream Mill DSMS and its SQL-based

continuous query language and enabling technology. Language and System Extensions: Genericity, Extensibility, and Performance A suite of stream mining algorithms. Existing ones and Newly developed in this projecte.g., SWIM. High level mining model for better

Usability Control of mining process. 18 Association Rule Mining SWIM [Mozafari 08] Maintaining frequent patterns over large windows with slides. Differentially computes frequent patterns as slides enter (expire out of) the window. Uses efficient Verifiers based on conditional counting. Trade-off between Delay and Performance Performance gain over existing algorithms. 19

SWIM (Sliding Window Incremental Miner) If pattern p is freq in a window, it must be freq in at least one of its slides -- keep a union of freq patterns of all slides (PT) Expired S4 New S5 S6 W4

W5 Count/Update frequencies PT PT = F5 F4 U U F6 F5 U F7 F6 . Mine Count/Update frequencies Add F7 to PT

PT S7 Mining Alg. Prune PT 20 Concept Drifts/ShiftsComplex Processes Ensemble based methods. Weighted bagging [Wang 03], adaptive boosting [Chu 04], inductive transfer [Forman 06]. Generic support, e.g. adaptive boosting (below).

21 Built-in Online Mining Algorithms In SMM Online classifiers Nave Bayesian Decision Tree K-nearest Neighbor Online clustering DBScan [Ester 96] IncDBScan Windowed K-means* DenStream* [Cao 06] CluStream Already supported Association rule mining

Approximate frequent items SWIM [Mozafari 08] Moment [Chi 04] AFPIM Time series/sequence queries SQL-TS [Sadri 01] Many more To be supported 22 SMM Contributions Build on Stream Mill DSMS and its SQL-based continuous query language and enabling technology.

Language and System Extensions: Genericity, Extensibility, and Performance A suite of stream mining algorithms. Existing ones and Newly developed in this projecte.g., SWIM. High level mining model for better

Usability Control of mining process. 23 Usability? Complex SQL queries to invoke built-in and user-defined mining algorithms. An open and extensible system Most analysts would prefer using highlevel mining language that supports uniform invocation of built-in and userdefined mining algorithms (no SQL required) describes the workflow of the mining process Is also open and extensible to incorporate newly defined mining algorithms.

24 Example: Defining a Mining Model CREATE MODEL TYPE NaiveBayesianClassifier { SHAREDTABLES (DescriptorTbl), Learn (UDA LearnNaiveBayesian, WINDOW TRUE, PARTABLES(), % names of param tables required by the method PARAMETERS() % additional parameters to be specified for input ), Classify (UDA ClassifyNaiveBayesian, WINDOW TRUE, PARTABLES(), PARAMETERS() ) }; 25 Example: Using a Mining Model

Creating an instance: CREATE MODEL INSTANCE NaiveBayesianInstance AS NaiveBayesianClassifier; Uniform invocation of mining tasks: RUN NaiveBayesianInstance.Learn WITH TrainingSet; 26 Performance SMM Vs. Weka NBC and decision tree classifier Datasets [UCI] Iris: 5 attributes Heart disease: 13 attributes Overhead of integrating algorithms into SMM The SWIM algorithm standalone vs.

integrated Dataset [IBM Quest] Trans len 20, Pattern len 5, Tuples 50K 27 Comparison with Weka: NBC-Iris 28 Comparison with Weka: NBC-HD 29 Comparison with Weka: Decision Tree - Iris 30

Integration Overhead: Integrated SWIM vs. Standalone SWIM 31 The Stream Mill System One server, multiple clients Server (on Linux): hosts the ESL language and manages storage and continuous queries Client (Java based GUI): allows the user to specify streams, queries, etc. 32 Conclusion SMM integrates new solutions for several

difficult problems: Usability by high-level mining models Extensibility by user-defined mining models that call on UDAs with windows Suite of built-in data stream mining UDAs Generic mining UDAs by Verticalization & other techniques Performance SMM is the first of its kind: more and better systems will follow in its footsteps. 33

Future Work Faster & lighter mining algorithms E.g. online algorithms for clustering Integration of other mining algorithms Data flow in mining models Similar solution for databases 34 Thank you! 35 References [Arasu 04] Arvind Arasu and Jennifer Widom. Resource sharing in continuous sliding-window aggregates. In VLDB, pages 336347, 2004.

[Babcock 02] B. Babcock, S. Babu, M. Datar, R. Motawani, and J. Widom. Models and issues in data stream systems. In PODS, 2002. [Bai 06] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, and Carlo Zaniolo. A data stream language and system designed for power and extensibility. In CIKM, pages 337346, 2006. [Cao 06] F Cao, M Ester, W Qian, and A Zhou, Density-based Clustering over an Evolving Data Stream with Noise, To appear in Proceedings of SIAM 2006. [Chi 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM04), November 2004. [Chu 04] F. Chu and C. Zaniolo. Fast and light boosting for adaptive mining of data streams. In PAKDD, volume 3056, 2004. [Ester 96] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Second International Conference on Knowledge Discovery and Data Mining, pages 226231, 1996. [Forman 06] George Forman. Tackling concept drift by temporal inductive

transfer. In SIGIR, pages 252259, 2006. 36 References [Imielinski 96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Commun. ACM, 39(11):58 64, 1996. [Law 04] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Data models and query language for data streams. In VLDB, pages 492 503, 2004. [Mozafari 08] Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo. Verifying and mining frequent patterns from large windows over data streams. In International Conference on Data Engineering (ICDE), 2008. [Sadri 01] Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in database systems. In PODS, Santa Barbara, CA, May 2001. [Sarawagi 98] S. Sarawagi, S. Thomas, and R. Agrawal.

Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD, 1998. [UCI-MLR] [Wang 03] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining conceptdrifting data streams using ensemble classifiers. In SIGKDD, 2003. 37

Recently Viewed Presentations

  • Public Opinion and Political Socialization 1 0 Zhang

    Public Opinion and Political Socialization 1 0 Zhang

    Bettmann/Corbis. Is polling always accurate? 10.1. Not only did advance polls in 1948 predict the Republican nominee Thomas E. Dewey would defeat Democratic incumbent President Harry S Truman, but on the basis of early and incomplete vote tallies, some newspapers'...
  • Welcome to Starry Monday at Otterbein Astronomy Lecture

    Welcome to Starry Monday at Otterbein Astronomy Lecture

    Incomplete understanding of gravitation Modified Newtonian Dynamics (MOND) Nonsymmetric gravity General relativity What General Relativity tells us The more mass there is in the universe, the more the expansion of the cosmos slows down So the game is: Mass vs....
  • Satan: The Accuser (Revelation 12:7-17) 0 1 7

    Satan: The Accuser (Revelation 12:7-17) 0 1 7

    Eagles' Wings. Exodus 19:4 You yourselves have seen what I did to Egypt, and how I carried you on eagles' wings and brought you to myself. Isaiah 40:31 But those who hope in the Lord will renew their strength.
  • The Legal and Ethical Environment of Business 8-1

    The Legal and Ethical Environment of Business 8-1

    IIPM Other titles: Arial ヒラギノ角ゴ Pro W3 Trebuchet MS Calibri FWK presenatation template 2_Office Theme 4_Office Theme 3_Office Theme The Legal and Ethical Environment of Business Chapter 8 The Property System Learning Objectives Learning Objectives Introduction Personal Property ...
  • Warmup 10/11  How long did the Roman Empire

    Warmup 10/11 How long did the Roman Empire

    Rome's republican beginnings were often seen as proof that republics can work and be great. Rome's model for republican government was the inspiration for many of the governments of today. Civil Law. The Roman legal system is called "Civil Law"...
  • NPS Program Implementation

    NPS Program Implementation

    Central Coast Regional Water Board Alfred Wanger ... CCC Evaluation Results Oakland Santa Cruz Ventura Newport Beach Attendee Affiliation N = 370 * * What the partnership has done: Working to develop a strategy to get the right kind of...
  • A Red version! - University of Ottawa

    A Red version! - University of Ottawa

    Kadriye Ercikan, UBC, Marielle Simon, uOttawa, Maria Elena Oliveri, UBC & Normand Dufour, MÉLS Canadian Society for the Study of Education Canadian Education Researchers Association May-June 2010, Concordia University, Montreal
  • IR AND RE VERBS - Sault Schools

    IR AND RE VERBS - Sault Schools

    OTHER IR VERBS Here is a list of other regular IR verbs Choisir = to choose Grossir = to gain weight Maigrir = to lose weight Grandir = to grow Obéir = to obey Réussir à = to succeed, to...