Machine Learning, 2010. 10. 6-8, Byoung-Tak Zhang (CBIT), http://bi.snu.ac.kr/
Day 1 (10/6): Neural Nets, Decision Trees, Support Vector Machines
Day 2 (10/7): Self-Organizing Maps, Clustering Algorithms, Reinforcement Learning, Evolutionary Learning
Day 3 (10/8): Bayesian Networks, Markov Random Fields, Particle Filters
2 (c) 2009-2010 SNU Biointelligence Laboratory, http://bi.snu.ac.kr/
Preview questions: What is a Neural Net (NN)? What is a Decision Tree (DT)? What is a k-Nearest Neighbor (kNN) classifier? What is a Support Vector Machine (SVM)? How do NN, DT, kNN, and SVM differ?
3 (c) 2009-2010 SNU Biointelligence Laboratory, http://bi.snu.ac.kr/
Approaches to Artificial Intelligence: Symbolic AI - Rule-Based Systems; Connectionist AI - Neural Networks
Evolutionary AI Genetic Algorithms Molecular AI: DNA Computing 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 4 Research Areas and Approaches Research Artificial Intelligence Learning Algorithms Inference Mechanisms Knowledge Representation Intelligent System Architecture Application Intelligent Agents Information Retrieval
Electronic Commerce Data Mining Bioinformatics Natural Language Proc. Expert Systems Paradigm Rationalism (Logical) Empiricism (Statistical) Connectionism (Neural) Evolutionary (Genetic) Biological (Molecular) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 5 Day 1 1. Concept of Machine Learning Learning: Definition
Learning is the improvement of performance in some environment through the acquisition of knowledge resulting from experience in that environment; equivalently, the improvement of behavior on some performance task through acquisition of knowledge based on partial task experience. 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 7
Neural Network (MLP): Error Backpropagation. Weight update: w_i ← w_i + Δw_i, with Δw_i = −η ∂E/∂w_i.
[Figure: information propagates forward from inputs x1, x2, x3 through the weights; the output o = f(x) is compared with the target, and the error E is propagated back to the weights w_i.]
Error on training example d: E_d(w) = (1/2) Σ_{k ∈ outputs} (t_k − o_k)²
Input layer: scaling function. Hidden layer: activation function. Output layer: activation function. 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 8
Application Example: Autonomous Land Vehicle (ALV). NN learns to steer an autonomous vehicle. 960 input units, 4 hidden units, 30 output units
Driving at speeds up to 70 miles per hour. ALVINN System: image from a forward-mounted camera; weight values for one of the hidden units. 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 9
DARPA Grand Challenge [Sebastian Thrun, Stanley & Junior, Stanford Univ.]
Stanford won the 2005 Grand Challenge (Stanley) and took 2nd place in the 2007 Urban Challenge (Junior). 2005: 175, 10. 2007: 96 km, 6.
Video: DARPA Grand Challenge: Final Part 1. 2009, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 10
Stanford Autonomous Helicopter - Airshow #2: http://www.youtube.com/watch?v=VCdxqn0fcnE (source: Andrew Ng, Stanford Univ.). An RC helicopter learns aerobatic flight from demonstrations by an expert RC pilot. 2008, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 11
Reinforcement Learning (RL) (source: Willow Garage). PR2 Robot Cleans Up: http://www.youtube.com/watch?v=gYqfa-YtvW4&feature=related. PR2 Robot Plays Pool: http://www.youtube.com/watch?v=mgHUNfqIhAc&feature=related. PR2 Robot of Willow Garage.
12 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/
(source: WWW2006, Baluja) Zoom-in regions for mobile Web browsing are predicted with a Decision Tree.
[Figure: a web page divided into a 3 × 3 grid of regions 1-9; the decision tree selects region 5 to zoom into.]
13 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/ (source: Thore Graepel, MS Research Cambridge) Whole-audience Control of a Racing Game: http://www.youtube.com/watch?v=NS_L3Yyv2RI. Drivatar: Microsoft's driving AI in the XBOX 360 game Forza 2.
The Future of Racing Games - Research in Cambridge, UK: http://www.youtube.com/watch?v=TaUyzlKKu-E
Drivatar learns a driver's style from demonstrations (Imitation Approach). 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 14
(source: Nature Reviews Neuroscience, 2006) Brain-imaging-based lie detection (Lie Detector). 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 15
Machine Learning: Three Tasks
Supervised Learning: estimate an unknown mapping from known input and target output pairs. Learn f_w from the training set D = {(x, y)} such that f_w(x) ≈ y = f(x). Classification: y is discrete. Regression: y is continuous.
Unsupervised Learning: only input values are provided. Learn f_w from D = {(x)} such that f_w(x) ≈ x. Compression. Clustering.
Reinforcement Learning: not targets, but rewards (critiques) are provided. Learn a heuristic function h_w(x, a, c) from D = {(x, a, c)}. Action selection. Policy learning.
2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 16
Robotics
DARPA Urban Challenge. Computer Vision: Drivatar AI. Neuroscience. HIV. Web: spam filtering, mobile Web browsing. Bioinformatics:
Connectomics. 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 17
[Table: learning paradigms and representative methods (labels originally in Korean): symbolic - Prolog, Version Space, Inductive Logic Programming (ILP), If-Then rules, AQ; neural - sigmoid units, RBF, SVM; evolutionary - Lisp; probabilistic - HMM] 18
Symbolic Learning: Version Space Learning, Case-Based Learning
Neural Learning: Multilayer Perceptrons, Self-Organizing Maps, Support Vector Machines, Kernel Machines, Latent Variable Models, Generative Topographic Mapping
Evolutionary Learning: Evolution Strategies, Evolutionary Programming, Genetic Algorithms, Genetic Programming, Molecular Programming
Probabilistic Learning: Bayesian Networks, Helmholtz Machines, Markov Random Fields, Hypernetworks
Other Methods: Decision Trees, Reinforcement Learning, Boosting Algorithms, Mixture of Experts, Independent Component Analysis
2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 19
[List of ML application areas (originally in Korean), including HCI, DNA analysis, and customer relationship management (CRM); source: 2007] 20
Day 1 2. Neural Network Learning
From Biological Neuron to Artificial Neuron Dendrite Cell Body Axon From Biology to Artificial Neural Nets Properties of Artificial Neural Networks A network of artificial neurons Characteristics
Nonlinear I/O mapping. Adaptivity. Generalization ability. Fault-tolerance (graceful degradation). Biological analogy.
Problems Appropriate for Neural Networks: many training examples are available; outputs can be discrete, continuous-valued, or vectors of these; training examples may contain noise
Tolerant to long training time Fast execution time Not necessary to explain the prediction results Example Applications NETtalk [Sejnowski] Inputs: English text Output: Spoken phonemes Phoneme recognition [Waibel] Inputs: wave form features Outputs: b, c, d,
Robot control [Pomerleau]. Inputs: perceived features. Outputs: steering control.
Application: Autonomous Land Vehicle (ALV). NN learns to steer an autonomous vehicle. 960 input units, 4 hidden units, 30 output units. Driving at speeds up to 70 miles per hour. ALVINN System: image of a forward-mounted camera. Weight values
for one of the hidden units Application: Data Recorrection by a Hopfield Network original target data Recorrected data after 10 iterations corrupted input data Recorrected data after 20 iterations Fully recorrected data after
35 iterations.
Perceptron and Gradient Descent Algorithm. (Simple) Perceptron: a neural net with a single neuron. Perceptron = a linear threshold unit (LTU). Note: linear perceptron = linear unit (see below). Input: a vector of real values. Output: 1 or -1 (binary). Activation function: threshold function.
Linearly Separable vs. Linearly Nonseparable: (a) decision surface for a linearly separable set of examples (correctly classified by a straight line); (b) a set of training examples that is not linearly separable.
Perceptron Training Rule. Note: the output value o is +1 or -1 (not a real). The perceptron rule is a learning rule for a threshold unit. Conditions for convergence: the training examples are linearly separable, and the learning rate is sufficiently small.
Delta Rule: Least Mean Square (LMS) Error. Linear unit (linear perceptron). Note: the output value o is a real value (not binary). The delta rule is a learning rule for an unthresholded perceptron (i.e. linear unit); it is a gradient-descent rule.
Gradient Descent Method. Delta Rule for Error Minimization: w_i ← w_i + Δw_i, where Δw_i = −η ∂E/∂w_i = η Σ_{d ∈ D} (t_d − o_d) x_id.
Perceptron Learning Algorithm. Multilayer Perceptron. Multilayer Perceptrons
[Figure: MLP with inputs x1 … xn, a hidden layer, and outputs y1 … ym: input layer, hidden layer, output layer.]
E_d(w) = (1/2) Σ_{k ∈ outputs} (t_k − o_k)². x: input vector; y: output vector. Supervised learning. Gradient search. Noise immunity. Learning algorithm: backpropagation, Δw_ji = −η ∂E_d/∂w_ji. 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 38 Multilayer Networks and Their Decision Boundaries
Decision regions of a multilayer feedforward network. The network was trained to recognize 1 of 10 vowel sounds occurring in the context h_d. The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network outputs correspond to the 10 possible vowel sounds.
Differentiable Threshold Unit. Sigmoid function: nonlinear, differentiable.
Error Function for BP: E(w) = (1/2) Σ_{d ∈ D} Σ_{k ∈ outputs} (t_kd − o_kd)²
E is defined as a sum of the squared errors over all the output units k for all the training examples d. The error surface can have multiple local minima: convergence toward some local minimum is guaranteed, but there is no guarantee of reaching the global minimum.
Backpropagation Learning Algorithm for MLP. Adding Momentum. Original weight update rule for BP:
Δw_ji(n) = η δ_j x_ji. Adding momentum: Δw_ji(n) = η δ_j x_ji + α Δw_ji(n − 1), with 0 ≤ α < 1. This helps to escape small local minima in the error surface and speeds up convergence.
Derivation of the BP Rule. Notations: x_ji: the ith input to unit j; w_ji: the weight associated with the ith input to unit j; net_j: the weighted sum of inputs for unit j; o_j: the output computed by unit j; t_j: the target output for unit j; σ: the sigmoid function; outputs: the set of units in the final layer of the network; Downstream(j): the set of units whose immediate inputs include the output of unit j.
Error measure: E_d(w) = (1/2) Σ_{k ∈ outputs} (t_k − o_k)²
Gradient descent: Δw_ji = −η ∂E_d/∂w_ji
Chain rule: ∂E_d/∂w_ji = (∂E_d/∂net_j)(∂net_j/∂w_ji) = (∂E_d/∂net_j) x_ji
Case 1: Rule for Output Unit Weights.
Step 1: ∂E_d/∂net_j = (∂E_d/∂o_j)(∂o_j/∂net_j), where net_j = Σ_i w_ji x_ji
Step 2: ∂E_d/∂o_j = ∂/∂o_j [(1/2) Σ_{k ∈ outputs} (t_k − o_k)²] = −(t_j − o_j)
Step 3: ∂o_j/∂net_j = ∂σ(net_j)/∂net_j = o_j (1 − o_j)
All together: Δw_ji = η (t_j − o_j) o_j (1 − o_j) x_ji
Case 2: Rule for Hidden Unit Weights.
Step 1: ∂E_d/∂net_j = Σ_{k ∈ Downstream(j)} (∂E_d/∂net_k)(∂net_k/∂o_j)(∂o_j/∂net_j) = Σ_{k ∈ Downstream(j)} (−δ_k) w_kj o_j (1 − o_j), where δ_k = −∂E_d/∂net_k
Thus: Δw_ji = η δ_j x_ji, where δ_j = o_j (1 − o_j) Σ_{k ∈ Downstream(j)} δ_k w_kj
BP for MLP: revisited. Hidden Layer Representations. BP has an ability to discover useful intermediate representations at the hidden unit layers inside the networks which capture properties of the input space that are most relevant to learning the target function. When more layers of units are used in the network, more complex features can be invented. But the representations of the hidden layers
are very hard to understand for humans.
Hidden Layer Representation for Identity Function. The evolving sum of squared errors for each of the eight output units as the number of training iterations (epochs) increases. The evolving hidden layer representation for the input string 01000000. The evolving weights for one of the three hidden units.
Generalization and Overfitting
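The 8-3-8 identity-function experiment above can be reproduced with a compact backpropagation sketch. This is a sketch only: the 8-3-8 architecture and sigmoid units follow the slides, but the learning rate, epoch count, weight ranges, and seed are illustrative, and no momentum term is used.

```python
import math, random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# 8-3-8 network; each weight vector carries a trailing bias term.
N_IN, N_HID, N_OUT = 8, 3, 8
w_hid = [[random.uniform(-0.1, 0.1) for _ in range(N_IN + 1)] for _ in range(N_HID)]
w_out = [[random.uniform(-0.1, 0.1) for _ in range(N_HID + 1)] for _ in range(N_OUT)]

def forward(x):
    h = [sigmoid(sum(w[i] * xi for i, xi in enumerate(x)) + w[-1]) for w in w_hid]
    o = [sigmoid(sum(w[i] * hi for i, hi in enumerate(h)) + w[-1]) for w in w_out]
    return h, o

def train_step(x, t, eta=0.3):
    h, o = forward(x)
    # Output-unit deltas: delta_k = o_k (1 - o_k)(t_k - o_k)
    d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Hidden-unit deltas: delta_j = h_j (1 - h_j) sum_k delta_k w_kj
    d_hid = [hj * (1 - hj) * sum(d_out[k] * w_out[k][j] for k in range(N_OUT))
             for j, hj in enumerate(h)]
    for k in range(N_OUT):
        for j in range(N_HID):
            w_out[k][j] += eta * d_out[k] * h[j]
        w_out[k][-1] += eta * d_out[k]
    for j in range(N_HID):
        for i in range(N_IN):
            w_hid[j][i] += eta * d_hid[j] * x[i]
        w_hid[j][-1] += eta * d_hid[j]

# The eight one-hot patterns; the target equals the input (identity function).
patterns = [[1.0 if i == j else 0.0 for i in range(8)] for j in range(8)]

def total_error():
    return sum(sum((t - o) ** 2 for t, o in zip(p, forward(p)[1])) for p in patterns)

before = total_error()
for _ in range(2000):
    for p in patterns:
        train_step(p, p)
print(before > total_error())  # error decreases with training
```

With enough epochs the three hidden units tend to develop a compact, roughly binary encoding of the eight inputs, which is the phenomenon the slides illustrate.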
Continuing training until the training error falls below some predetermined threshold is a poor strategy, since BP is susceptible to overfitting. We need to measure the generalization accuracy over a validation set (distinct from the training set). Two different types of overfitting: (1) the generalization error first decreases, then increases, even as the training error continues to decrease; (2) the generalization error decreases, then increases, then decreases again, while the training error continues to decrease. Two Kinds of Overfitting Phenomena. Techniques for Overcoming the Overfitting Problem
Weight decay: decrease each weight by some small factor during each iteration. This is equivalent to modifying the definition of E to include a penalty term corresponding to the total magnitude of the network weights. The motivation is to keep weight values small, to bias learning against complex decision surfaces.
k-fold cross-validation: cross-validation is performed k different times, each time using a different partitioning of the data into training and validation sets; the results are averaged over the k runs.
Designing an Artificial Neural Network for Face Recognition Application. ANN for Face Recognition
A 960 × 3 × 4 network is trained on gray-level images of faces to predict whether a person is looking to their left, right, ahead, or up.
Problem Definition. Possible learning tasks: classifying camera images of faces of people in various poses; direction, identity, gender, ...
Data: 624 grayscale images for 20 different people, 32 images per person, varying the person's expression (happy, sad, angry, neutral) and direction (left, right, straight ahead, up), with and without sunglasses. Resolution of images: 120 × 128, each pixel with a grayscale intensity between 0 (black) and 255 (white).
Task: learning the direction in which the person is facing.
Factors for ANN Design in the Face Recognition Task: input encoding, output encoding, network graph structure, other learning algorithm parameters
Input Coding for Face Recognition. Possible solutions:
1. Extract key features using preprocessing: edges, regions of uniform intensity, other local image features. Defect: high preprocessing cost, variable number of features.
2. Coarse resolution: encode the image as a fixed set of 30 × 32 pixel intensity values, with one network input per pixel. The 30 × 32 pixel image is a coarse-resolution summary of the original 120 × 128 pixel image. Coarse resolution reduces the number of inputs and weights to a much more manageable size, thereby reducing computational demands.
Output Coding for Face Recognition. Possible coding schemes: one output unit with multiple threshold values, or multiple output units with a single threshold value. One-unit scheme: assign 0.2, 0.4, 0.6, 0.8 to encode the four-way classification. Multiple-units scheme (1-of-n output encoding): use four distinct output units, each representing one of the four possible face directions, with the highest-valued output taken as the network prediction
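The 1-of-n output encoding described above can be sketched in a few lines. The helper names and the direction ordering are hypothetical; the soft 0.9/0.1 targets follow the recommendation in these slides.

```python
# 1-of-n output encoding for the four face directions (hypothetical helper names).
DIRECTIONS = ["left", "right", "straight", "up"]

def encode_target(direction):
    # Soft targets (0.9 / 0.1) keep the sigmoid weights finite,
    # unlike hard 1 / 0 targets.
    return [0.9 if d == direction else 0.1 for d in DIRECTIONS]

def decode_prediction(outputs):
    # Network prediction = highest-valued output unit;
    # the top-1 / top-2 gap serves as a confidence measure.
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    confidence = outputs[ranked[0]] - outputs[ranked[1]]
    return DIRECTIONS[ranked[0]], confidence

print(encode_target("left"))                     # [0.9, 0.1, 0.1, 0.1]
print(decode_prediction([0.25, 0.75, 0.5, 0.1]))  # ('right', 0.25)
```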
Output Coding for Face Recognition. Advantages of the 1-of-n output encoding scheme: it provides more degrees of freedom to the network for representing the target function, and the difference between the highest-valued output and the second-highest can be used as a measure of confidence in the network prediction. Target values for the output units in the 1-of-n encoding scheme: <1, 0, 0, 0> vs. <0.9, 0.1, 0.1, 0.1>. <1, 0, 0, 0> will force the weights to grow without bound; with <0.9, 0.1, 0.1, 0.1> the network will have finite weights. Network Structure for Face Recognition
One hidden layer vs. more hidden layers. How many hidden units should be used? Using 3 hidden units: test accuracy for the face data = 90%, training time = 5 min on a Sun Sparc 5. Using 30 hidden units: test accuracy = 91.5%, training time = 1 hour on a Sun Sparc 5.
Other Parameters for Face Recognition: learning rate = 0.3; momentum = 0.3; weight initialization: small random values near 0; number of iterations: cross-validation. After every 50 iterations, the performance of the
network was evaluated over the validation set. The final selected network is the one with the highest accuracy over the validation set.
Day 1 3. Decision Tree Learning
Decision Trees. [Figure: a tree over word attributes (quicktime, computer, clinton, space, unix) with no/yes edges leading to NO/YES class labels.] Learning Algorithm: C4.5. Nodes: attributes. Edges: values. Terminal nodes: class labels.
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) Entropy(S_v)
2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
67 Main Idea Classification by Partitioning Example Space Goal : Approximating discrete-valued target functions Appropriate Problems Examples are represented by attribute-value pairs. The target function has discrete output value. Disjunctive description may be required. The training data may contain missing attribute values. SNU Center for Bioinformation Technology (CBIT) 68
Example Problem (Play Tennis)
Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
Example Space:
(Outlook = Sunny & Humidity = High) → No
(Outlook = Sunny & Humidity = Normal) → Yes
(Outlook = Overcast) → Yes
(Outlook = Rain & Wind = Weak) → Yes
(Outlook = Rain & Wind = Strong) → No
SNU Center for Bioinformation Technology (CBIT) 70
Decision Tree Representation. [Tree: Outlook: Sunny → Humidity (High → NO, Normal → YES); Overcast → YES; Rain → Wind (Strong → NO, Weak → YES).] SNU Center for Bioinformation Technology (CBIT) 71
Basic Decision Tree Learning. Which Attribute is Best? Select the attribute that is most useful for
classifying examples. Quantitative Measure: Information Gain. For attribute A, relative to a collection of data D:
Gain(D, A) = Entropy(D) − Σ_{v ∈ Values(A)} (|D_v|/|D|) Entropy(D_v)
Expected reduction of entropy. SNU Center for Bioinformation Technology (CBIT) 72
Entropy: impurity of an arbitrary collection of examples. Entropy(D) = − Σ_{i=1}^{c} p_i log₂ p_i: the minimum number of bits of information needed to encode the classification of an arbitrary member of D. [Figure: Entropy(S) vs. the proportion p of positive examples, from 0.0 to 1.0.] SNU Center for Bioinformation Technology (CBIT)
73 Constructing Decision Tree. SNU Center for Bioinformation Technology (CBIT) 74
Example: Play Tennis (1). Entropy of D: Entropy(D) = Entropy([9+, 5−]) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940
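The entropy value above can be checked in a few lines (a quick sketch, not part of the original slides):

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 log 0 is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(9, 5), 3))  # 0.94
```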
SNU Center for Bioinformation Technology (CBIT) 75
Example: Play Tennis (2). Attribute Wind: D = [9+, 5−] (E = 0.940), D_weak = [6+, 2−] (E = 0.811), D_strong = [3+, 3−] (E = 1.0).
Gain(D, Wind) = Entropy(D) − Σ_{v ∈ {weak, strong}} (|D_v|/|D|) Entropy(D_v) = 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048
SNU Center for Bioinformation Technology (CBIT) 76
Example: Play Tennis (3). Attribute Humidity: D_high = [3+, 4−] (E = 0.985), D_normal = [6+, 1−] (E = 0.592).
Gain(D, Humidity) = Entropy(D) − Σ_{v ∈ {high, normal}} (|D_v|/|D|) Entropy(D_v) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
SNU Center for Bioinformation Technology (CBIT) 77
Example: Play Tennis (4). Best attribute? Gain(D, Outlook) = 0.246, Gain(D, Humidity) = 0.151, Gain(D, Wind) = 0.048, Gain(D, Temperature) = 0.029. Outlook is selected. [9+, 5−] (E = 0.940) splits on Outlook: Sunny → [2+, 3−] (D1, D2, D8, D9, D11); Overcast → [4+, 0−] (D3, D7, D12, D13): YES; Rain → [3+, 2−] (D4, D5, D6, D10, D14). SNU Center for Bioinformation Technology (CBIT)
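The gain comparison above can be reproduced directly from the PlayTennis table (a sketch; the printed values match the slides up to rounding of the intermediate entropies):

```python
import math

# PlayTennis data: (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    n = len(rows)
    result = 0.0
    for label in ("Yes", "No"):
        p = sum(1 for r in rows if r[-1] == label) / n
        if p:
            result -= p * math.log2(p)
    return result

def gain(rows, attr):
    # Gain(D, A) = Entropy(D) - sum_v |D_v|/|D| * Entropy(D_v)
    i = ATTRS[attr]
    g = entropy(rows)
    for v in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

for a in ATTRS:
    print(a, round(gain(DATA, a), 3))
# Outlook ≈ 0.247, Humidity ≈ 0.152, Wind ≈ 0.048, Temperature ≈ 0.029,
# so Outlook is chosen as the root, as in the slides.
```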
78 Example: Play Tennis (5). Entropy of D_sunny: Entropy(D_sunny) = Entropy([2+, 3−]) = −(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.971
Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
SNU Center for Bioinformation Technology (CBIT) 79
Example: Play Tennis (6). Attribute Wind: D_weak = [1+, 2−] (E = 0.918), D_strong = [1+, 1−] (E = 1.0).
Gain(D_sunny, Wind) = Entropy(D_sunny) − Σ_{v ∈ {weak, strong}} (|D_v|/|D|) Entropy(D_v) = 0.971 − (3/5)·0.918 − (2/5)·1.00 = 0.020
SNU Center for Bioinformation Technology (CBIT) 80
Example: Play Tennis (7). Attribute Humidity: D_high = [0+, 3−] (E = 0.00), D_normal = [2+, 0−] (E = 0.00).
Gain(D_sunny, Humidity) = Entropy(D_sunny) − Σ_{v ∈ {high, normal}} (|D_v|/|D|) Entropy(D_v) = 0.971 − (3/5)·0.00 − (2/5)·0.00 = 0.971
SNU Center for Bioinformation Technology (CBIT) 81
Example: Play Tennis (8). Best attribute? Gain(D_sunny, Humidity) = 0.971, Gain(D_sunny, Temperature) = 0.571, Gain(D_sunny, Wind) = 0.020. Humidity is selected. [Tree so far: Outlook: Sunny → Humidity (High → NO, Normal → YES); Overcast → YES; Rain → [3+, 2−] (D4, D5, D6, D10, D14).] SNU Center for Bioinformation Technology (CBIT) 82
Example: Play Tennis (9). Entropy of D_rain:
Entropy(D_rain) = Entropy([3+, 2−]) = −(3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.971
Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D10 | Rain | Mild | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
SNU Center for Bioinformation Technology (CBIT) 83
Example: Play Tennis (10). Attribute Wind: D_weak = [3+, 0−] (E = 0.00), D_strong = [0+, 2−] (E = 0.00).
Gain(D_rain, Wind) = Entropy(D_rain) − Σ_{v ∈ {weak, strong}} (|D_v|/|D|) Entropy(D_v) = 0.971 − (3/5)·0.00 − (2/5)·0.00 = 0.971
SNU Center for Bioinformation Technology (CBIT) 84
Example: Play Tennis (11). Attribute Humidity: D_high = [1+, 1−] (E = 1.00), D_normal = [2+, 1−] (E = 0.918).
Gain(D_rain, Humidity) = Entropy(D_rain) − Σ_{v ∈ {high, normal}} (|D_v|/|D|) Entropy(D_v) = 0.971 − (2/5)·1.00 − (3/5)·0.918 = 0.020
SNU Center for Bioinformation Technology (CBIT) 85
Example: Play Tennis (12). Best attribute? Gain(D_rain, Wind) = 0.971, Gain(D_rain, Humidity) = 0.020, Gain(D_rain, Temperature) = 0.020. Wind is selected. [Final tree: Outlook: Sunny → Humidity (High → NO, Normal → YES); Overcast → YES; Rain → Wind (Strong → NO, Weak → YES).] SNU Center for Bioinformation Technology (CBIT) 86
Avoiding Overfitting Data. Definition:
Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the data if there exists some alternative hypothesis h′ ∈ H such that h has smaller error than h′ over the training examples, but h′ has a smaller error than h over the entire distribution of instances. Occam's Razor: prefer the simplest hypothesis that fits the data. SNU Center for Bioinformation Technology (CBIT) 87
Avoiding Overfitting Data. SNU Center for Bioinformation Technology (CBIT) 88
Solutions to Overfitting: 1. Partition examples into training, test, and validation sets. 2. Use all data for training, but apply a statistical test to
estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set. 3. Use an explicit measure of the complexity for encoding the training examples and the decision tree, halting growth of the tree when this encoding is minimized. SNU Center for Bioinformation Technology (CBIT) 89
Summary: Decision trees provide a practical method for concept learning and for learning other discrete-valued functions. ID3 searches a complete hypothesis space. Overfitting is an important issue in decision tree learning. SNU Center for Bioinformation Technology (CBIT) 90 Day 1
4-5. Plan for ML Exercise. Day 1: Classification. Program: Weka. Agenda: classification by Neural Network (NN) and Decision Tree (DT). Day 2: Clustering. Program: Genesis. Agenda: k-means / hierarchical clustering, SOM. Day 3: Bayesian Network. Program: GeNIe. Agenda: designing / learning / inference in Bayesian networks. (C) 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Day 1 Classification. The Tool for Practice: Weka 3, Data Mining Software in Java. A collection of machine learning algorithms for data mining tasks. With Weka you can do data pre-processing, feature selection, classification, regression, clustering, association rules, and visualization. Weka is open source software issued under the GNU General Public License. How to get it: http://www.cs.waikato.ac.nz/ml/weka/ or just search for Weka in Google. (C) 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Classification Using Weka - Problems. Pima Indians Diabetes (2-class problem): Pima Indians have the highest prevalence of diabetes in the world; we will build classification models that diagnose whether the patient shows signs of diabetes. Handwritten Digit Recognition (10-class problem): the MNIST database of handwritten digits contains digits written by office workers and students; we will build a recognition model based on classifiers with a reduced subset of MNIST. (C) 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Classification Using Weka - Algorithms: Neural Network (Multilayer Perceptron), Decision Tree. (C) 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Classification Using Weka: load a file that contains the training data by clicking the Open file button (ARFF or CSV formats are readable); click the Classify tab; click the Choose button and select the weka function MultilayerPerceptron; set parameters for the MLP; set parameters for testing; click Start to begin learning. (C) 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Classification Using Weka: options for testing the trained model; result output: various measures appear. (C) 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Day 1 6. k-Nearest Neighbor
Different Learning Methods. Eager learning: explicit description of the target function on the whole training set. Instance-based learning: learning = storing all training instances; classification = assigning a target function to a new instance
Referred to as lazy learning Kinds of instance-based learning K-nearest neighbor algorithm Locally weighted regression Case-based reasoning Local vs. Distributed Representation Lazy Learning? Eager Learning 100 K-Nearest Neighbor Features All instances correspond to points in an n-dimensional Euclidean space Classification is delayed till a new instance arrives
Classification is done by comparing feature vectors of the different points. The target function may be discrete or real-valued.
k-Nearest Neighbor Classifier (kNN): Memory-Based Learning. [Figure: a query pattern (x, y=?) is matched against the learning set D = {(x1, y1=1), (x2, y2=1), (x3, y3=0), (x4, y4=0), (x5, y5=1), (x6, y6=0), …, (xN, yN=1)}. With k = 5, the match set contains three 1s vs. two 0s, so y = 1.] 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 102
kNN as a Neural Network. [Figure: stored examples (xi, yi) form the hidden layer between the input vector xq and the output yq.] 103
k-Nearest Neighbor (kNN). Training: for each training example (x, y), simply store it in D. Classification: given a query instance x_q, find the k instances from D that are nearest to x_q and return ŷ(x_q) = argmax_v Σ_{i=1}^{k} δ(v, y_i), where δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise. Memory-based or case-based learning. 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 104 Generalizing kNN
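Before generalizing it, the plain kNN majority vote described above can be sketched in a few lines (the toy learning set is illustrative, arranged to mirror the three-1s-vs.-two-0s example):

```python
import math
from collections import Counter

def knn_classify(query, data, k=5):
    """Majority vote among the k training points nearest to `query`.
    `data` is a list of (vector, label) pairs."""
    neighbors = sorted(data, key=lambda xy: math.dist(query, xy[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy learning set: 1-D points with binary labels.
D = [((0.0,), 1), ((0.2,), 1), ((0.4,), 1), ((1.0,), 0), ((1.2,), 0),
     ((2.0,), 0), ((2.2,), 1)]
print(knn_classify((0.3,), D, k=5))  # 1 — three 1s vs. two 0s among the 5 nearest
```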
Divide the input space into local regions and learn simple (constant/linear) models in each patch. Unsupervised: competitive, online clustering. Supervised: radial-basis functions, mixture of experts. 105
Radial-Basis Function Network: locally-tuned units. Output: y^t = Σ_{h=1}^{H} w_h p_h^t + w_0, where the hidden-unit activations are p_h^t = exp(−‖x^t − m_h‖² / (2 s_h²)). 106
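A minimal sketch of the locally-tuned RBF computation above (the centers m_h, spreads s_h, and weights are illustrative values, not trained ones):

```python
import math

def rbf_output(x, centers, spreads, weights, bias):
    """y = sum_h w_h * exp(-||x - m_h||^2 / (2 s_h^2)) + w_0."""
    y = bias
    for m, s, w in zip(centers, spreads, weights):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, m))
        y += w * math.exp(-sq_dist / (2 * s ** 2))
    return y

# Two hypothetical Gaussian units on a 2-D input.
centers = [(0.0, 0.0), (1.0, 1.0)]
spreads = [0.5, 0.5]
weights = [1.0, -1.0]
print(rbf_output((0.0, 0.0), centers, spreads, weights, bias=0.0))
```

Each unit responds strongly only near its own center, which is what makes the model local: at (0, 0) the first unit fires at full strength while the second is nearly silent.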
Training RBF. Hybrid learning: first layer (centers and spreads) by unsupervised k-means; second layer (weights) by supervised gradient descent. Alternatively, fully supervised training (Broomhead and Lowe, 1988; Moody and Darken, 1989). 107
Regression: E({m_h, s_h, w_ih} | X) = (1/2) Σ_t Σ_i (r_i^t − y_i^t)², with y_i^t = Σ_{h=1}^{H} w_ih p_h^t + w_i0.
Gradient-descent updates: Δw_ih = η Σ_t (r_i^t − y_i^t) p_h^t; Δm_hj = η Σ_t [Σ_i (r_i^t − y_i^t) w_ih] p_h^t (x_j^t − m_hj) / s_h²; Δs_h = η Σ_t [Σ_i (r_i^t − y_i^t) w_ih] p_h^t ‖x^t − m_h‖² / s_h³. 108
Classification: E({m_h, s_h, w_ih} | X) = −Σ_t Σ_i r_i^t log y_i^t, with y_i^t = exp(Σ_h w_ih p_h^t + w_i0) / Σ_k exp(Σ_h w_kh p_h^t + w_k0). 109
Rules and Exceptions: y^t = Σ_{h=1}^{H} w_h p_h^t + v^T x^t + v_0. The local units capture exceptions; the linear term v^T x^t + v_0 acts as the default rule. 110
Rule-Based Knowledge: IF (x1 ≈ a AND x2 ≈ b) OR (x3 ≈ c) THEN y = 0.1 is encoded as p_1 = exp(−(x1 − a)² / (2 s_1²)) · exp(−(x2 − b)² / (2 s_2²)) with w_1 = 0.1, and p_2 = exp(−(x3 − c)² / (2 s_3²)) with w_2 = 0.1. Incorporation of prior knowledge (before training); rule extraction (after training) (Tresp et al., 1997). Fuzzy membership functions and fuzzy rules. 111
Day 1 7. Support Vector Machines
Support Vector Machines. [Figure: SVM as a two-layer network: inputs x1 … xn feed a hidden layer of m inner-product kernels K(x, x_1), K(x, x_2), …, K(x, x_m); the output neuron y combines them with bias b.]
Type of SVM and its inner-product kernel K(x, x_i): polynomial learning machine: (1 + x^T x_i)^p; radial-basis function network: exp(−(1/(2σ²)) ‖x − x_i‖²); two-layer perceptron: tanh(β₀ x^T x_i + β₁).
2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 113
Support Vector Machines. The line that maximizes the minimum margin is a good bet. The model class of hyper-planes
with a margin of m has a low VC dimension if m is big. This maximum-margin separator is determined by a subset of the datapoints; datapoints in this subset are called support vectors. It will be useful computationally if only a small fraction of the datapoints are support vectors, because we use the support vectors to decide which side of the separator a test case is on. [Figure: the support vectors are indicated by circles around them.]
Training a linear SVM. To find the maximum-margin separator, we have to solve the following optimization problem: w·x_c + b ≥ 1 for positive cases, w·x_c + b ≤ −1 for negative cases, and ‖w‖² is as small as possible. This is tricky but it's a convex problem: there is only one optimum, and we can find it without fiddling with learning rates or weight decay or early stopping. Don't worry about the optimization problem; it has been solved. It's called quadratic programming. It takes time proportional to N², which is really bad for very big datasets, so for big datasets we end up doing approximate optimization!
Testing a linear SVM. The separator is defined as the set of points for which w·x + b = 0; so if w·x_c + b > 0, say it's a positive case, and if w·x_c + b < 0, say it's a negative case.
Introducing slack variables. Slack variables ξ_c are constrained to be non-negative. When they are greater than zero, they allow us to cheat by putting the plane closer to the datapoint than the margin, so we need to minimize the amount of cheating. This means we have to pick a value for lambda (this sounds familiar!): w·x_c + b ≥ 1 − ξ_c for positive cases, w·x_c + b ≤ −1 + ξ_c for negative cases, with ξ_c ≥ 0 for all c, and ‖w‖²/2 + λ Σ_c ξ_c as small as possible.
A picture of the best plane with a slack variable. How to make a plane curved
Fitting hyperplanes as separators is mathematically easy; the mathematics is linear. By replacing the raw input variables with a much larger set of features, we get a nice property: a planar separator in the high-dimensional space of feature vectors is a curved separator in the low-dimensional space of the raw input variables. [Figure: a planar separator in a 20-D feature space projected back to the original 2-D space.]
A potential problem and a magic solution. If we map the input vectors into a very high-dimensional feature space, surely the task of finding the maximum-margin separator becomes computationally intractable? The mathematics is all linear, which is good, but the vectors have a huge number of components, so taking the scalar product of two vectors is very expensive. The way to keep things tractable is to use the kernel trick. The kernel trick makes your brain hurt when you first learn about it, but it's actually very simple.
The kernel trick. For many mappings from a low-D space to a high-D space, there is a simple operation on two vectors in the low-D space that can be used to compute the scalar product of their two images in the high-D space: K(x_a, x_b) = φ(x_a)·φ(x_b). Letting the kernel do the work replaces doing the scalar product in the obvious way
[Figure: the mapping φ takes x_a and x_b in the low-D space to φ(x_a) and φ(x_b) in the high-D space; the kernel computes their inner product directly.] Kernel Examples: 1. Gaussian kernel. 2. Polynomial kernel.
The classification rule. The final classification rule is quite simple: bias + Σ_{s ∈ SV} w_s K(x_test, x_s) > 0, where SV is the set of support vectors. All the cleverness goes into selecting the support vectors that maximize the margin and computing the weight to use on each support vector. We also need to choose a good kernel function, and we may need to choose a lambda for dealing with non-separable cases.
Some commonly used kernels: polynomial: K(x, y) = (x·y + 1)^p; Gaussian radial basis function: K(x, y) = e^(−‖x − y‖² / 2σ²); neural net: K(x, y) = tanh(k x·y − δ). The parameters are chosen by the user. For the neural network kernel, there is one hidden unit per support vector, so the process of fitting the maximum-margin hyperplane decides how many hidden units to use. Also, it may violate Mercer's condition. Support Vector Machines are Perceptrons!
SVMs use each training case, x, to define a feature K(x, ·), where K is chosen by the user. So the user designs the features. Then they do feature selection by picking the support vectors, and they learn how to weight the features by solving a big optimization problem. So an SVM is just a very clever way to train a standard perceptron. All of the things that a perceptron cannot do cannot be done by SVMs (but it's been a long time since 1969, so people have forgotten this). Supplement: SVM as a Kernel Machine. Kernel Methods Approach. The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space:
The expectation is that the feature space has a much higher dimension than the input space.
Form of the Functions. So kernel methods use linear functions in a feature space; for regression this is the function itself, while classification requires thresholding.
Controlling generalisation. The critical method of controlling generalisation is to force a large margin on the training data: Support Vector Machines.
SVM optimization addresses the generalization issue, but not the computational cost of dealing with large vectors.
Complexity problem. Let's apply the quadratic example to a 20 × 30 image: 600 pixels give approximately 180000 dimensions! It would be computationally infeasible to work in this space.
Dual Representation. Suppose the weight vector is a linear combination of the training examples: then we can evaluate the inner product with a new example.
Learning the dual variables. Since any component orthogonal to the space spanned by the training data has no effect, there is a general result that weight vectors have a dual representation: the representer theorem. Hence, we can reformulate algorithms to learn dual variables rather than the weight vector directly.
Dual Form of SVM. The dual form of the SVM can also be derived by taking the dual optimisation problem! This gives: Note that the threshold must be determined from border examples
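The dual form depends on the data only through inner products, which is what lets a kernel stand in for them. A quick numerical check with the quadratic kernel (the explicit 3-D feature map is illustrative): the shortcut K(x, z) = (x·z)² equals the inner product of the explicit degree-2 feature vectors.

```python
from itertools import product

def phi(x):
    """Explicit quadratic feature map: all degree-2 monomials x_i * x_j."""
    return [xi * xj for xi, xj in product(x, repeat=2)]

def kernel(x, z):
    """Shortcut: inner product in feature space = (x . z) ** 2."""
    return sum(xi * zi for xi, zi in zip(x, z)) ** 2

x, z = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(explicit, kernel(x, z))  # identical values
```

This is exactly the efficiency point made below: the kernel needs one input-space inner product plus a square, instead of an inner product over all quadratic features.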
Using Kernels. The critical observation is that, again, only inner products are used. Suppose that we now have a shortcut method of computing the feature-space inner product. Then we do not need to explicitly compute the feature vectors, either in training or in testing.
Kernel example. As an example, consider the quadratic feature map; here we have a shortcut: the feature-space inner product is the square of the input-space inner product.
Efficiency. Hence, in the pixel example, rather than work with 180000-dimensional vectors, we compute a 600-dimensional inner product and then square the result! We can even work in infinite-dimensional spaces, e.g. using the Gaussian kernel. Constraints on the kernel: there is a restriction on the function (cf. Mercer's condition); this restriction, for any training set, is enough to guarantee the function is a kernel. What Have We Achieved?
We replaced the problem of choosing a neural network architecture by that of defining a kernel. Arguably more natural to define, though the restriction is a bit unnatural. Not a silver bullet, as fit with the data is key. Can be applied to non-vectorial (or high-dimensional) data. We gained more flexible regularization/generalization control, and a convex optimization problem, i.e. NO local minima! However, choosing the right kernel remains a design issue.
Summary (Day 1): Neural Net (NN). Decision Tree (DT). k-Nearest Neighbor (kNN). Support Vector Machine (SVM). How do NN, DT, kNN, and SVM differ? 140 (c) 2009-2010 SNU Biointelligence Laboratory, http://bi.snu.ac.kr/