LEARNING CHAPTER 18.4-18.4.1 VALIDATION, and 18.6 REGRESSION ONWARDS

18.4 EVALUATION
Stationarity assumption: real-world data come from the same distribution as the training data
Independent and identically distributed (iid): each datum, in training or in the real world, is drawn independently from that same distribution
Non-stationarity or non-iid: the data change over time, so what was learned before is no longer useful
Cross-validation: divide the data set into two groups, training and validation, to estimate the rate of successful classification on data not used for training
Measure of validation: the error rate on the validation set
An ML algorithm has many parameters: e.g., the model (order of the polynomial), the learning rate, etc.
Fine-tune those parameters with a WRAPPER algorithm, by repeated validation: repeat cross-validation on random splits of the available data set
k-fold cross-validation: 1/k of the data set is the validation set; repeat k times, holding out a different 1/k each time
k = n, for data set size n, is leave-one-out cross-validation
Peeking: after k-fold cross-validation, the ML algorithm may overfit the known data (the validation set is part of it) and do worse in real use (note: that may also mean iid is not true)
ML competitions hold out real test data, but groups still "cheat" by repeatedly submitting fine-tuned code

18.4.1 MODEL SELECTION
Decision tree model: selection of the best features
Linear regression or classifier: degree of the polynomial
Error rates are optimized (minimized) over the model + learning parameters

18.4.3 REGULARIZATION
An additional term in the optimization besides the error rate, penalizing model complexity

============================================

18.6 LINEAR REGRESSION
Least-square-error, or L2-norm, minimization
Linear model y = mx + c: has a closed-form solution, Eq 18.3
Gradient descent gets iteratively closer to the solution: determine a direction in each iteration and update (m, c); a small sketch follows this section
The step size may be updated in each iteration, held constant, or follow a fixed schedule
Note: the "direction" is determined by whether the error is positive or negative (useful for understanding the classifier later)
Multivariate linear regression: the model is y = w0 + w1*x1 + w2*x2 + ... + wn*xn, for variables x1, x2, ..., xn
The closed-form solution is a matrix formulation obtained by setting the partial derivatives to zero
Gradient descent extends to Eq 18.6
Note: some dimensions may be irrelevant, wi ~ 0 for some xi
To eliminate irrelevant dimensions: add to the error function a penalty term for "complexity" (a norm of the weights wi) times lambda or alpha (a hyperparameter to be tuned)
The L1 norm (sum of absolute values) is the better complexity term: it gives a "sparse model" that minimizes the number of dimensions used (Fig 18.14 p722)
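To make the gradient-descent recipe above concrete, here is a minimal Python sketch for the univariate model y = m*x + c. It is my own illustration rather than the book's code; the function name, the learning rate alpha, the iteration count, and the toy data are all assumptions chosen for the example.

def gradient_descent(xs, ys, alpha=0.01, iterations=5000):
    # Batch gradient descent on the squared error of the model y = m*x + c.
    m, c = 0.0, 0.0                                   # start from an arbitrary hypothesis
    n = len(xs)
    for _ in range(iterations):
        # Per-example error; its sign determines the direction of each update.
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        grad_m = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (2.0 / n) * sum(errors)
        m -= alpha * grad_m                           # step opposite to the gradient
        c -= alpha * grad_c
    return m, c

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]                      # data lying exactly on y = 2x + 1
print(gradient_descent(xs, ys))                       # approaches (2.0, 1.0)

The same loop extends to the multivariate case of Eq 18.6 by keeping one weight per input dimension and updating each with its own partial derivative.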
=================================

18.6.3 LINEAR CLASSIFIER
Predicting a continuous y is no longer the objective
The objective is to learn a boolean function h_w(x) = 1 or 0: data point x is in the class or not
Training problem: a set of (x, y) is given, where the x are data points and now y = 1 or 0; find an h_w(x) that models y
Test: given a data point x, predict whether it is in the class or not (compute h_w(x))
h_w(x) is no longer a line expected to pass through the data samples as in regression, but one that separates them onto two sides of the line
Model: h_w(x) = 1 if w0 + w1*x1 + w2*x2 + ... >= 0, and 0 otherwise
Finding the line is very similar to regression: optimize the weights, just as we optimized (m, c) above
Rewrite the model as a vector product: w.x >= 0, with w = (w0, w1, w2, ...)-transpose and x = (x0, x1, x2, ...)-transpose, where x0 is a dummy input fixed at 1
So h_w(x) = 1 when w.x >= 0, otherwise h_w(x) = 0
Gradient descent (for a linear separator), Eq 18.7 p724, the perceptron learning rule:
    wi <- wi + alpha*(y - h_w(x))*xi, for 0 <= i <= n, updated from iteration to iteration
Note: y here is also boolean, 1 or 0, from the training set
Training cases (a sketch of this update follows this section):
(1) Correct prediction, y = h_w(x): w stays the same
(2) False negative, y = 1 but h_w(x) = 0: increase wi for each positive xi, decrease it for each negative xi
(3) False positive, y = 0 but h_w(x) = 1: decrease wi for each positive xi, increase it for each negative xi
Logistic regression: use a sigmoid h_w(x) rather than the boolean (step) function above
Formulation: Eq 18.8 p727
The continuous-valued h_w(x) may be interpreted as the probability of being in the class
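A minimal sketch of the perceptron learning rule above, wi <- wi + alpha*(y - h_w(x))*xi. The function names, the fixed learning rate and epoch count, and the AND-function training set are assumptions chosen for illustration.

def h(w, x):
    # Threshold (step) hypothesis: 1 if w.x >= 0, else 0; x[0] is the dummy input x0 = 1.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def perceptron_train(examples, alpha=0.1, epochs=100):
    n = len(examples[0][0])                  # number of inputs, including the dummy x0
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            error = y - h(w, x)              # 0 if correct, +1 on a false negative, -1 on a false positive
            w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w

# Example: the boolean AND function is linearly separable; each x is (x0=1, x1, x2).
examples = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
w = perceptron_train(examples)
print([h(w, x) for x, _ in examples])        # should reproduce [0, 0, 0, 1]

Swapping the hard threshold for a sigmoid, with the correspondingly adjusted update of Eq 18.8, gives the logistic-regression version.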
======================

18.7 ARTIFICIAL NEURAL NETWORK
Perceptron output fed to multiple other perceptrons, Fig 18.19 p728
Different types of h_w() may be used
Layers of perceptrons: input -> hidden -> hidden -> ... -> output (classifying) layer
Two essential components: the architecture, and the weight-update algorithm
Types of architectures:
    Feed-forward network: a simple directed acyclic graph
    Recurrent network: has feedback loops
A single-layer perceptron network cannot "learn" the XOR function (or boolean sum), Fig 18.21 p730
Multi-layer feed-forward network: multiple layers can coordinate to create a complex classification space built from multiple linear boundaries, Fig 18.23 p732
Back-propagation: trace how the error propagates backward; take the total error del_E, distribute it (weighted) over the nodes of the previous layer so that each node knows its own "error", propagate that error recursively backward all the way to the input layer, then update the weights
Weight-update rule: Eq 18.11 p733, with delta_k as above
Weighted error propagation: Eq 18.12 p733

=====================

NON-PARAMETRIC CLASSIFIERS
Data-driven, no parametrized model assumed
Trivial algorithm: table lookup; for each example, return the closest match from past "training" data = 1-nearest-neighbor lookup
Needs a distance measure between samples, e.g., the L2 norm
More generally: the k-nearest-neighbor algorithm, Fig 18.26 p738 (a sketch follows the non-parametric regression notes below)
Note: the query point runs over the full continuous space to create the gray classifier regions in the figure
The sample space becomes sparse in a high-dimensional space; one needs a large radius to get any hit, which means outliers are included among the k
Alternative: do not return any match if no sample is found within a fixed distance
Curse of dimensionality = too many dimensions compared to the number of samples
Search for the k neighbors is O(N), for N examples
Faster than O(N): the k-d tree, a binary search tree over k dimensions [here k is the number of dimensions, and the d stands for "dimensional"]; search is O(log N)
    http://pointclouds.org/documentation/tutorials/kdtree_search.php
    https://en.wikipedia.org/wiki/K-d_tree
LSH, or locality-sensitive hashing: if two points are close in space, then so are their projections onto any dimension
Note: the reverse is not necessarily true
LSH concept: hash the data to random projections onto only some dimensions, not all; the probability of a true hit increases with the number of projections used
Search is faster than a k-d tree, possibly O(1) per projection, O(p) for p projected dimensions

NON-PARAMETRIC REGRESSION
Same idea as above: return the average (or some other aggregate measure) of only the k nearest neighbors
Fig 18.28a-c p742: linear, 1 and 3 neighbors
Locally "weighted" regression, Fig 18.28d p742: weighted over the k points, using a "kernel" function as in Fig 18.29 p743
How to find the weights w*? Minimize the kernel-weighted squared error ("kernel times distance"); refer to the equation at the bottom of p743
The resulting prediction is h(xq) = dot product of w* with the query point xq
Note: the neighborhood is not necessarily a fixed k; rather it depends on the kernel width
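A minimal sketch of k-nearest-neighbor classification (majority vote) and regression (average of the neighbors), using a plain O(N) scan with the L2 distance rather than a k-d tree or LSH. Function names and the toy data are illustrative assumptions.

from collections import Counter
from math import dist                         # Euclidean (L2) distance, Python 3.8+

def k_nearest(examples, xq, k):
    # examples: list of (point, label-or-value) pairs; return the k closest to the query xq.
    return sorted(examples, key=lambda ex: dist(ex[0], xq))[:k]

def knn_classify(examples, xq, k=3):
    # Majority vote over the labels of the k nearest neighbors.
    votes = Counter(label for _, label in k_nearest(examples, xq, k))
    return votes.most_common(1)[0][0]

def knn_regress(examples, xq, k=3):
    # Average of the k nearest target values (the non-parametric regression above).
    return sum(y for _, y in k_nearest(examples, xq, k)) / k

train = [((0, 0), 'a'), ((0, 1), 'a'), ((1, 0), 'a'), ((5, 5), 'b'), ((5, 6), 'b'), ((6, 5), 'b')]
print(knn_classify(train, (1, 1)))            # 'a'
print(knn_classify(train, (5, 4)))            # 'b'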
====================

UNSUPERVISED LEARNING
The data are not labelled
The problem is no longer to predict a single output from a training set
The problem is to divide the given data into groups based on their attributes: CLUSTERING
Clustering is also data compression, e.g., 100 data points into 3 groups
Needs a measure of distance between each pair of data points
A clustering algorithm works with such distance measures
k-means clustering: https://en.wikipedia.org/wiki/K-means_clustering
1. Throw k darts into the data space as starting cluster centers
2. "Classify" each data point according to its closest cluster center
3. The new cluster center is the mean position of each group of data
4. Loop over steps 2-3 until the cluster centers stop updating
5. Return the cluster centers and the final classification of each data point (the partition)
k-median clustering: use the median instead of the mean in the above algorithm
Many other clustering algorithms (e.g., Bayesian clustering, fuzzy c-means clustering, hierarchical clustering) and other unsupervised learning algorithms exist
The problem is NP-hard

==================== SYLLABUS FOR FALL 2017: UP TO MARGIN BELOW =======

SUPPORT VECTOR MACHINE
A classifier based on a strong theory (CLT = Computational Learning Theory)
A step in the direction of clustering: maximum-margin separators between the classes (typically two)
The "support vectors" are the examples that determine the separator between the classes
Uses artificially created (possibly non-linear) dimensions to find a linear separator: the "kernel trick"
Non-parametric: no model, so potentially all examples (the training set) need to be stored; in practice only a few examples are needed, those around the margin, and these "close to the separating line" examples are the support vectors
Key concept in SVM: some examples are more important than others (for the classification task)
Fig 18.30 p745: any of the 3 classification lines in Fig (a) will do, but the three circled points in Fig (b) are more important; they can create "separators" for better learning
However, black and white dots may end up on both sides of a candidate separator
Solution: find the separator furthest from the "support" example points closest to the hypothesized separator: the maximum-margin separator
Margin: the width of the area between the two boundary lines
The separator itself is w.x + b = 0 (b is a scalar)
Classification labels: y = +1 or -1 for each data point (not 1 or 0 as before)
In practice a dual representation is used: Eq 18.13 p746 is maximized over the alphas, which are the "dual" variables, and w = Sum-over-j(alpha_j * y_j * x_j) can be recovered afterwards
Classification hypothesis (after training) for a given data point: Eq 18.14 p746; only the "support vectors" x from the training data (those closest to the separator) are needed
When the data are not linearly separable: create artificial dimensions, Eq 18.15 p746
Fig 18.31 p747: how the "kernel trick" creates linear separation (see the sketch after this section)
For N data points, N-1 dimensions are sufficient; also, a circle -> 4-D and an ellipse -> 5-D, including y
No one cares what the high-dimensional space means geometrically, and how to create kernels is not a topic for this class, for now
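A minimal sketch of the idea behind the kernel trick of Fig 18.31: points that no straight line can separate in 2-D become linearly separable once an artificial feature is added. The particular lifted feature x1^2 + x2^2 is my own choice for this circular example, not Eq 18.15 itself.

def lift(point):
    # Map (x1, x2) to (x1, x2, x1^2 + x2^2): the extra axis is the squared distance from the origin.
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

# Class 0 lies inside the unit circle, class 1 outside; no line in the 2-D plane separates them.
inside = [(0.0, 0.0), (0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]
outside = [(2.0, 0.0), (0.0, 2.0), (-1.5, 1.5), (1.2, -1.2)]

# In the lifted 3-D space the plane x3 = 1 (a linear separator there) splits the two classes.
assert all(lift(p)[2] < 1.0 for p in inside)
assert all(lift(p)[2] > 1.0 for p in outside)
print("separable by the plane x3 = 1 after lifting")

An SVM would then find a maximum-margin linear separator in such a lifted space; the kernel trick lets it do so without ever computing the lifted coordinates explicitly.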
==================== NOT IN SYLLABUS FOR SPRING 2018 ======

ENSEMBLE LEARNING
A sort of voting among multiple classifiers: Fig 18.32 p749
An easy example: multiple decision trees, one from each cross-validation stage
Output classification on a given x: take a vote
Ideal case: each generated hypothesis is independent, so voting will work
Second-pass idea: OK, the hypotheses are not independent, but you now have a hypothesis space; learn on that space for the final classification, Fig 18.32 p749; the result is more expressive or discriminating than any single hypothesis
Boosting: a weighted training set, initialized with w = 1 for all examples
Correctly predicted example -> decrease its weight, otherwise increase it (so that future training does a better job on it): this is "boosting"
Generate K hypotheses, for a given K
Each hypothesis is weighted according to its success rate: Fig 18.33 p750
Using all K hypotheses, instead of only the final one, improves the accuracy of prediction
Needs each hypothesis to predict better than 50%
It is better to have several not-very-good hypotheses in the ensemble than a single over-learned one
AdaBoost: Fig 18.34 p751

UNSUPERVISED LEARNING / CLUSTERING
No target output value to predict, only a set of data
Group the data by their "proximity"
Needs a distance function
Types of clustering (wiki):
Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.
Centroid models: for example, the k-means algorithm represents each cluster by a single mean vector.
Distribution models: clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm.
Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.
Subspace models: in biclustering (also known as co-clustering or two-mode clustering), clusters are modeled with both cluster members and relevant attributes.
Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.
Graph-based models: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered a prototypical form of cluster. Relaxations of the complete-connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm.
Neural models: the most well-known unsupervised neural network is the self-organizing map; these models can usually be characterized as similar to one or more of the above models, including subspace models when neural networks implement a form of Principal Component Analysis or Independent Component Analysis.
Hard / soft (fuzzy) clustering

K-MEANS CLUSTERING
Start with arbitrary k cluster points in the data space
Do while not converged:
    Group the data by their proximity to each of those cluster points
    Find each group's mean and reassign the means as the new cluster points
https://en.wikipedia.org/wiki/K-means_clustering
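A minimal sketch of the k-means loop above, assuming Euclidean distance, random initial centers drawn from the data points, and a cap on the number of iterations. The names and the toy points are illustrative.

import random
from math import dist                                  # Euclidean distance, Python 3.8+

def k_means(points, k, max_iters=100):
    centers = random.sample(points, k)                 # step 1: arbitrary starting cluster centers
    clusters = []
    for _ in range(max_iters):
        # Step 2: assign each point to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[i].append(p)
        # Step 3: the new center is the mean position of each group (keep the old one if a group is empty).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:                     # step 4: stop when the centers no longer move
            break
        centers = new_centers
    return centers, clusters                           # step 5: centers and the final partition

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(points, k=2)
print(centers)                                         # two centers, one near each group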