LEARNING CHAPTER 18.4-18.4.1 VALIDATION, and 18.6 REGRESSION ONWARDS

18.4 EVALUATION
Stationarity assumption: real-world data come from the same distribution as the training data
Independent and identically distributed (iid): each datum, in training or in the real world, is drawn independently from that same distribution
Non-stationarity or non-iid: the data change over time, so what was learned before is no longer useful
Cross-validation: divide the data set into two groups, training and validation, to estimate the rate of successful classification on data not used for training
Measure of validation: the error rate on the validation set
An ML algorithm has many parameters: e.g., the model (order of the polynomial), the learning rate, etc.
Fine-tune those parameters with a WRAPPER algorithm, by repeated validation: repeat cross-validation on random splits of the available data set
k-fold cross-validation: 1/k of the data set is the validation set; repeat k times, holding out a different 1/k each time
k = n, for data set size n, is leave-one-out cross-validation
Peeking: after k-fold cross-validation, the ML algorithm may overfit the known data (the validation set is part of it) and do worse in real use (note: that may also mean iid is not true)
ML competitions hold out real test data, but groups still "cheat" by repeatedly submitting fine-tuned code

18.4.1 MODEL SELECTION
Decision tree model: selection of the best features
Linear regression or classifier: degree of the polynomial
Error rates are optimized (minimized) over the model + learning parameters

18.4.3 REGULARIZATION
An additional term in the optimization besides the error rate, penalizing model complexity

============================================

18.6 LINEAR REGRESSION
Least-square-error, or L2-norm, minimization
Linear model y = mx + c: has a closed-form solution, Eq 18.3
Gradient descent gets iteratively closer to the solution: determine a direction in each iteration and update (m, c); a small sketch follows this section
The step size may be updated in each iteration, held constant, or follow a fixed schedule
Note: the "direction" is determined by whether the error is positive or negative (useful for understanding the classifier later)
Multivariate linear regression: the model is y = w0 + w1*x1 + w2*x2 + ... + wn*xn, for variables x1, x2, ..., xn
The closed-form solution is a matrix formulation obtained by setting the partial derivatives to zero
Gradient descent extends to Eq 18.6
Note: some dimensions may be irrelevant, wi ~ 0 for some xi
To eliminate irrelevant dimensions: add to the error function a penalty term for "complexity" (a norm of the weights wi) times lambda or alpha (a hyperparameter to be tuned)
The L1 norm (sum of absolute values) is the better complexity term: it gives a "sparse model" that minimizes the number of dimensions used (Fig 18.14 p722)
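To make the gradient-descent recipe above concrete, here is a minimal Python sketch for the univariate model y = m*x + c. It is my own illustration rather than the book's code; the function name, the learning rate alpha, the iteration count, and the toy data are all assumptions chosen for the example.

def gradient_descent(xs, ys, alpha=0.01, iterations=5000):
    # Batch gradient descent on the squared error of the model y = m*x + c.
    m, c = 0.0, 0.0                                   # start from an arbitrary hypothesis
    n = len(xs)
    for _ in range(iterations):
        # Per-example error; its sign determines the direction of each update.
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        grad_m = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (2.0 / n) * sum(errors)
        m -= alpha * grad_m                           # step opposite to the gradient
        c -= alpha * grad_c
    return m, c

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]                      # data lying exactly on y = 2x + 1
print(gradient_descent(xs, ys))                       # approaches (2.0, 1.0)

The same loop extends to the multivariate case of Eq 18.6 by keeping one weight per input dimension and updating each with its own partial derivative.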
=================================

18.6.3 LINEAR CLASSIFIER
Predicting a continuous y is no longer the objective
The objective is to learn a boolean function h_w(x) = 1 or 0: data point x is in the class or not
Training problem: a set of (x, y) is given, where the x are data points and now y = 1 or 0; find an h_w(x) that models y
Test: given a data point x, predict whether it is in the class or not (compute h_w(x))
h_w(x) is no longer a line expected to pass through the data samples as in regression, but one that separates them onto two sides of the line
Model: h_w(x) = 1 if w0 + w1*x1 + w2*x2 + ... >= 0, and 0 otherwise
Finding the line is very similar to regression: optimize the weights, just as we optimized (m, c) above
Rewrite the model as a vector product: w.x >= 0, with w = (w0, w1, w2, ...)-transpose and x = (x0, x1, x2, ...)-transpose, where x0 is a dummy input fixed at 1
So h_w(x) = 1 when w.x >= 0, otherwise h_w(x) = 0
Gradient descent (for a linear separator), Eq 18.7 p724, the perceptron learning rule:
    wi <- wi + alpha*(y - h_w(x))*xi, for 0 <= i <= n, updated from iteration to iteration
Note: y here is also boolean, 1 or 0, from the training set
Training cases (a sketch of this update follows this section):
(1) Correct prediction, y = h_w(x): w stays the same
(2) False negative, y = 1 but h_w(x) = 0: increase wi for each positive xi, decrease it for each negative xi
(3) False positive, y = 0 but h_w(x) = 1: decrease wi for each positive xi, increase it for each negative xi
Logistic regression: use a sigmoid h_w(x) rather than the boolean (step) function above
Formulation: Eq 18.8 p727
The continuous-valued h_w(x) may be interpreted as the probability of being in the class
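A minimal sketch of the perceptron learning rule above, wi <- wi + alpha*(y - h_w(x))*xi. The function names, the fixed learning rate and epoch count, and the AND-function training set are assumptions chosen for illustration.

def h(w, x):
    # Threshold (step) hypothesis: 1 if w.x >= 0, else 0; x[0] is the dummy input x0 = 1.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def perceptron_train(examples, alpha=0.1, epochs=100):
    n = len(examples[0][0])                  # number of inputs, including the dummy x0
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            error = y - h(w, x)              # 0 if correct, +1 on a false negative, -1 on a false positive
            w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w

# Example: the boolean AND function is linearly separable; each x is (x0=1, x1, x2).
examples = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
w = perceptron_train(examples)
print([h(w, x) for x, _ in examples])        # should reproduce [0, 0, 0, 1]

Swapping the hard threshold for a sigmoid, with the correspondingly adjusted update of Eq 18.8, gives the logistic-regression version.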
======================

18.7 ARTIFICIAL NEURAL NETWORK
Perceptron output fed to multiple other perceptrons, Fig 18.19 p728
Different types of h_w() may be used
Layers of perceptrons: input -> hidden -> hidden -> ... -> output (classifying) layer
Two essential components: the architecture, and the weight-update algorithm
Types of architectures:
    Feed-forward network: a simple directed acyclic graph
    Recurrent network: has feedback loops
A single-layer perceptron network cannot "learn" the XOR function (or boolean sum), Fig 18.21 p730
Multi-layer feed-forward network: multiple layers can coordinate to create a complex classification space built from multiple linear boundaries, Fig 18.23 p732
Back-propagation: trace how the error propagates backward; take the total error del_E, distribute it (weighted) over the nodes of the previous layer so that each node knows its own "error", propagate that error recursively backward all the way to the input layer, then update the weights
Weight-update rule: Eq 18.11 p733, with delta_k as above
Weighted error propagation: Eq 18.12 p733

=====================

NON-PARAMETRIC CLASSIFIERS
Data-driven, no parametrized model assumed
Trivial algorithm: table lookup; for each example, return the closest match from past "training" data = 1-nearest-neighbor lookup
Needs a distance measure between samples, e.g., the L2 norm
More generally: the k-nearest-neighbor algorithm, Fig 18.26 p738 (a sketch follows the non-parametric regression notes below)
Note: the query point runs over the full continuous space to create the gray classifier regions in the figure
The sample space becomes sparse in a high-dimensional space; one needs a large radius to get any hit, which means outliers are included among the k
Alternative: do not return any match if no sample is found within a fixed distance
Curse of dimensionality = too many dimensions compared to the number of samples
Search for the k neighbors is O(N), for N examples
Faster than O(N): the k-d tree, a binary search tree over k dimensions [here k is the number of dimensions, and the d stands for "dimensional"]; search is O(log N)
    http://pointclouds.org/documentation/tutorials/kdtree_search.php
    https://en.wikipedia.org/wiki/K-d_tree
LSH, or locality-sensitive hashing: if two points are close in space, then so are their projections onto any dimension
Note: the reverse is not necessarily true
LSH concept: hash the data to random projections onto only some dimensions, not all; the probability of a true hit increases with the number of projections used
Search is faster than a k-d tree, possibly O(1) per projection, O(p) for p projected dimensions

NON-PARAMETRIC REGRESSION
Same idea as above: return the average (or some other aggregate measure) of only the k nearest neighbors
Fig 18.28a-c p742: linear, 1 and 3 neighbors
Locally "weighted" regression, Fig 18.28d p742: weighted over the k points, using a "kernel" function as in Fig 18.29 p743
How to find the weights w*? Minimize the kernel-weighted squared error ("kernel times distance"); refer to the equation at the bottom of p743
The resulting prediction is h(xq) = dot product of w* with the query point xq
Note: the neighborhood is not necessarily a fixed k; rather it depends on the kernel width
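A minimal sketch of k-nearest-neighbor classification (majority vote) and regression (average of the neighbors), using a plain O(N) scan with the L2 distance rather than a k-d tree or LSH. Function names and the toy data are illustrative assumptions.

from collections import Counter
from math import dist                         # Euclidean (L2) distance, Python 3.8+

def k_nearest(examples, xq, k):
    # examples: list of (point, label-or-value) pairs; return the k closest to the query xq.
    return sorted(examples, key=lambda ex: dist(ex[0], xq))[:k]

def knn_classify(examples, xq, k=3):
    # Majority vote over the labels of the k nearest neighbors.
    votes = Counter(label for _, label in k_nearest(examples, xq, k))
    return votes.most_common(1)[0][0]

def knn_regress(examples, xq, k=3):
    # Average of the k nearest target values (the non-parametric regression above).
    return sum(y for _, y in k_nearest(examples, xq, k)) / k

train = [((0, 0), 'a'), ((0, 1), 'a'), ((1, 0), 'a'), ((5, 5), 'b'), ((5, 6), 'b'), ((6, 5), 'b')]
print(knn_classify(train, (1, 1)))            # 'a'
print(knn_classify(train, (5, 4)))            # 'b'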
====================

UNSUPERVISED LEARNING
The data are not labelled
The problem is no longer to predict a single output from a training set
The problem is to divide the given data into groups based on their attributes: CLUSTERING
Clustering is also data compression, e.g., 100 data points into 3 groups
Needs a measure of distance between each pair of data points
A clustering algorithm works with such distance measures
k-means clustering: https://en.wikipedia.org/wiki/K-means_clustering
1. Throw k darts into the data space as starting cluster centers
2. "Classify" each data point according to its closest cluster center
3. The new cluster center is the mean position of each group of data
4. Loop over steps 2-3 until the cluster centers stop updating
5. Return the cluster centers and the final classification of each data point (the partition)
k-median clustering: use the median instead of the mean in the above algorithm
Many other clustering algorithms (e.g., Bayesian clustering, fuzzy c-means clustering, hierarchical clustering) and other unsupervised learning algorithms exist
The problem is NP-hard

==================== SYLLABUS FOR FALL 2017: UP TO MARGIN BELOW =======

SUPPORT VECTOR MACHINE
A classifier based on a strong theory (CLT = Computational Learning Theory)
A step in the direction of clustering: maximum-margin separators between the classes (typically two)
The "support vectors" are the examples that determine the separator between the classes
Uses artificially created (possibly non-linear) dimensions to find a linear separator: the "kernel trick"
Non-parametric: no model, so potentially all examples (the training set) need to be stored; in practice only a few examples are needed, those around the margin, and these "close to the separating line" examples are the support vectors
Key concept in SVM: some examples are more important than others (for the classification task)
Fig 18.30 p745: any of the 3 classification lines in Fig (a) will do, but the three circled points in Fig (b) are more important; they can create "separators" for better learning
However, black and white dots may end up on both sides of a candidate separator
Solution: find the separator furthest from the "support" example points closest to the hypothesized separator: the maximum-margin separator
Margin: the width of the area between the two boundary lines
The separator itself is w.x + b = 0 (b is a scalar)
Classification labels: y = +1 or -1 for each data point (not 1 or 0 as before)
In practice a dual representation is used: Eq 18.13 p746 is maximized over the alphas, which are the "dual" variables, and w = Sum-over-j(alpha_j * y_j * x_j) can be recovered afterwards
Classification hypothesis (after training) for a given data point: Eq 18.14 p746; only the "support vectors" x from the training data (those closest to the separator) are needed
When the data are not linearly separable: create artificial dimensions, Eq 18.15 p746
Fig 18.31 p747: how the "kernel trick" creates linear separation (see the sketch after this section)
For N data points, N-1 dimensions are sufficient; also, a circle -> 4-D and an ellipse -> 5-D, including y
No one cares what the high-dimensional space means geometrically, and how to create kernels is not a topic for this class, for now
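A minimal sketch of the idea behind the kernel trick of Fig 18.31: points that no straight line can separate in 2-D become linearly separable once an artificial feature is added. The particular lifted feature x1^2 + x2^2 is my own choice for this circular example, not Eq 18.15 itself.

def lift(point):
    # Map (x1, x2) to (x1, x2, x1^2 + x2^2): the extra axis is the squared distance from the origin.
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

# Class 0 lies inside the unit circle, class 1 outside; no line in the 2-D plane separates them.
inside = [(0.0, 0.0), (0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]
outside = [(2.0, 0.0), (0.0, 2.0), (-1.5, 1.5), (1.2, -1.2)]

# In the lifted 3-D space the plane x3 = 1 (a linear separator there) splits the two classes.
assert all(lift(p)[2] < 1.0 for p in inside)
assert all(lift(p)[2] > 1.0 for p in outside)
print("separable by the plane x3 = 1 after lifting")

An SVM would then find a maximum-margin linear separator in such a lifted space; the kernel trick lets it do so without ever computing the lifted coordinates explicitly.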
==================== NOT IN SYLLABUS FOR SPRING 2018 ======

ENSEMBLE LEARNING
A sort of voting among multiple classifiers: Fig 18.32 p749
An easy example: multiple decision trees, one from each cross-validation stage
Output classification on a given x: take a vote
Ideal case: each generated hypothesis is independent, so voting will work
Second-pass idea: OK, the hypotheses are not independent, but you now have a hypothesis space; learn on that space for the final classification, Fig 18.32 p749; the result is more expressive or discriminating than any single hypothesis
Boosting: a weighted training set, initialized with w = 1 for all examples
Correctly predicted example -> decrease its weight, otherwise increase it (so that future training does a better job on it): this is "boosting"
Generate K hypotheses, for a given K
Each hypothesis is weighted according to its success rate: Fig 18.33 p750
Using all K hypotheses, instead of only the final one, improves the accuracy of prediction
Needs each hypothesis to predict better than 50%
It is better to have several not-very-good hypotheses in the ensemble than a single over-learned one
AdaBoost: Fig 18.34 p751

UNSUPERVISED LEARNING / CLUSTERING
No target output value to predict, only a set of data
Group the data by their "proximity"
Needs a distance function
Types of clustering (wiki):
Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.
Centroid models: for example, the k-means algorithm represents each cluster by a single mean vector.
Distribution models: clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm.
Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.
Subspace models: in biclustering (also known as co-clustering or two-mode clustering), clusters are modeled with both cluster members and relevant attributes.
Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.
Graph-based models: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered a prototypical form of cluster. Relaxations of the complete-connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm.
Neural models: the most well-known unsupervised neural network is the self-organizing map; these models can usually be characterized as similar to one or more of the above models, including subspace models when neural networks implement a form of Principal Component Analysis or Independent Component Analysis.
Hard / soft (fuzzy) clustering

K-MEANS CLUSTERING
Start with arbitrary k cluster points in the data space
Do while not converged:
    Group the data by their proximity to each of those cluster points
    Find each group's mean and reassign the means as the new cluster points
https://en.wikipedia.org/wiki/K-means_clustering
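A minimal sketch of the k-means loop above, assuming Euclidean distance, random initial centers drawn from the data points, and a cap on the number of iterations. The names and the toy points are illustrative.

import random
from math import dist                                  # Euclidean distance, Python 3.8+

def k_means(points, k, max_iters=100):
    centers = random.sample(points, k)                 # step 1: arbitrary starting cluster centers
    clusters = []
    for _ in range(max_iters):
        # Step 2: assign each point to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[i].append(p)
        # Step 3: the new center is the mean position of each group (keep the old one if a group is empty).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:                     # step 4: stop when the centers no longer move
            break
        centers = new_centers
    return centers, clusters                           # step 5: centers and the final partition

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(points, k=2)
print(centers)                                         # two centers, one near each group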