# Feature Generation and Decision Tree Classification

## Overfitting

Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. It generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.

Suppose we need to solve a classification problem, and we are not sure whether we should use the simple linear classifier or the simple quadratic classifier. How do we decide which to use? We do cross-validation and choose the best one.

- Simple linear classifier: 81% accuracy
- Simple quadratic classifier: 99% accuracy
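The cross-validation step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the one-dimensional data set, the two deliberately mislabeled points, and the two candidate models (`fit_threshold` as a stand-in for a simple model, `fit_nearest_neighbor` as a stand-in for a more flexible one) are all hypothetical and are not the linear and quadratic classifiers from the slides.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(fit, xs, ys, k=5):
    """Fraction of examples classified correctly when each is held out once."""
    correct = 0
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit(train_x, train_y)          # train on the other folds only
        correct += sum(predict(xs[i]) == ys[i] for i in fold)
    return correct / len(xs)


def fit_threshold(train_x, train_y):
    """Hypothetical simple model: one cut halfway between the two class means."""
    zeros = [x for x, y in zip(train_x, train_y) if y == 0]
    ones = [x for x, y in zip(train_x, train_y) if y == 1]
    mean0, mean1 = sum(zeros) / len(zeros), sum(ones) / len(ones)
    cut = (mean0 + mean1) / 2
    if mean1 >= mean0:
        return lambda x: 1 if x > cut else 0
    return lambda x: 0 if x > cut else 1


def fit_nearest_neighbor(train_x, train_y):
    """Hypothetical flexible model: copy the label of the closest training point."""
    pairs = list(zip(train_x, train_y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Hypothetical data: class 0 sits below 5, class 1 above 5,
# except the last two points, which are deliberately mislabeled noise.
xs = [1.0, 9.0, 2.0, 8.0, 3.0, 7.0, 4.0, 6.0, 3.5, 6.5]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]

candidates = {"threshold": fit_threshold, "nearest_neighbor": fit_nearest_neighbor}
scores = {name: cross_val_accuracy(fit, xs, ys, k=5) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

On this toy data the flexible model chases the two noisy labels and scores worse on held-out points, which is the behavior cross-validation is meant to expose.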


In other cases the gap is much smaller:

- Simple linear classifier: 96% accuracy
- Simple quadratic classifier: 97% accuracy

This problem is greatly exacerbated by having too little data:

- Simple linear classifier: 90% accuracy
- Simple quadratic classifier: 95% accuracy

What happens as we have more and more training examples? The accuracy for all models goes up! The chance of making a mistake goes down, and the cost of the mistake (if made) goes down.

| Panel | Simple linear | Simple quadratic |
|---|---|---|
| 1 | 70% | 90% |
| 2 | 90% | 95% |
| 3 | 99% | 99% |

One solution: charge a penalty for complex models. For example, for the simple {polynomial} classifier, we could charge 1% for every increase in the degree of the polynomial.

| Classifier | Accuracy | Penalty | Accuracy minus penalty |
|---|---|---|---|
| Simple linear | 90.5% | 0 | 90.5% |
| Simple quadratic | 97.0% | 1 | 96.0% |
| Simple cubic | 97.05% | 2 | 95.05% |

There are more principled ways to charge penalties. In particular, there is a technique called Minimum Description Length (MDL).
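The 1%-per-degree penalty can be written out as a short calculation. This is a minimal sketch using the three accuracies quoted above; the function name `penalized_score` and the dictionary layout are illustrative choices, not part of the original slides.

```python
def penalized_score(accuracy_pct, degree, penalty_per_degree=1.0):
    """Accuracy in percent, minus a charge for every degree above linear."""
    return accuracy_pct - penalty_per_degree * (degree - 1)


# (accuracy in percent, polynomial degree) as quoted in the text
candidates = {
    "linear": (90.5, 1),
    "quadratic": (97.0, 2),
    "cubic": (97.05, 3),
}

scores = {name: penalized_score(acc, deg) for name, (acc, deg) in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("selected:", best)
```

With this penalty the quadratic classifier is selected: the cubic's extra 0.05% of raw accuracy does not pay for its extra degree.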

Suppose you have a four-feature problem, and you want to search over feature subsets. It happens to be the case that features 2 and 3 are all you need, and the other features are random.

## Feature Generation

We have seen that we are given features. Suppose that using these features we cannot get satisfactory accuracy results. So far, we have two tricks:

- Ask for more features
- Remove irrelevant or redundant features

There is another possibility: feature generation. Feature generation refers to any technique to make new features from existing features.

Recall pigeon problem 2, and assume we are using the linear classifier. The examples of class A are (4, 4), (5, 5), (6, 6), and (3, 3). Using both features works poorly, using just X works poorly, and using just Y works poorly.

Solution: create a new feature Z:

$$Z = |X - Y|$$

Recall the example of Japanese and Irish names. It was a teaching example to show that nearest neighbor (NN) could use any distance measure. It would not really work very well unless we had lots more data.

- Japanese names: AIKO, AIMI, AINA, AIRI, AKANE, AKEMI, AKI, AKIKO, AKIO, AKIRA, AMI, AOI, ARATA, ASUKA
- Irish names: ABERCROMBIE, ABERNETHY, ACKART, ACKERMAN, ACKERS, ACKLAND, ACTON, ADAIR, ADLAM, ADOLPH, AFFLECK, ALVIN, AMMADON
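Both generated features used in this section, Z = |X - Y| for the pigeon problem and Z = number of vowels / word length for the names, are one-line computations. The sketch below assumes plain tuples and uppercase strings as inputs; the class B points are hypothetical, since only the class A examples are given in the text.

```python
VOWELS = set("AEIOU")


def abs_difference(x, y):
    """Generated feature Z = |X - Y| for the pigeon problem."""
    return abs(x - y)


def vowel_ratio(name):
    """Generated feature Z = number of vowels / word length for a name."""
    return sum(ch in VOWELS for ch in name.upper()) / len(name)


class_a = [(4, 4), (5, 5), (6, 6), (3, 3)]       # class A examples from the text
class_b = [(5, 2.5), (2, 5), (5, 3), (2.5, 3)]   # hypothetical class B examples

print([abs_difference(x, y) for x, y in class_a])
print([abs_difference(x, y) for x, y in class_b])

for name in ["AIKO", "AKANE", "ACKERMAN", "ABERCROMBIE"]:
    print(name, round(vowel_ratio(name), 3))
```

Every class A example maps to Z = 0, so a single threshold on Z can separate it from any class whose two values differ, which the original X and Y features could not do on their own with a linear classifier.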

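Z = number of vowels / word length, where the vowels are A, E, I, O, U.

| Name | Class | Z |
|---|---|---|
| AIKO | Japanese | 0.75 |
| AIMI | Japanese | 0.75 |
| AINA | Japanese | 0.75 |
| AIRI | Japanese | 0.75 |
| AKANE | Japanese | 0.6 |
| AKEMI | Japanese | 0.6 |
| ABERCROMBIE | Irish | 0.45 |
| ABERNETHY | Irish | 0.33 |
| ACKART | Irish | 0.33 |
| ACKERMAN | Irish | 0.375 |
| ACKERS | Irish | 0.33 |
| ACKLAND | Irish | 0.29 |
| ACTON | Irish | 0.40 |

I have a box of apples. If Pr(X = good) = p, then Pr(X = bad) = 1 - p, and the entropy of X is given by

$$H(X) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

The binary entropy function is 0 when the box is all bad (p = 0) or all good (p = 1), and attains its maximum value when p = 0.5.

## Decision Tree Classifier

Decision tree classifiers are closely associated with the work of Ross Quinlan. The example tree tests two insect measurements, Abdomen Length and Antenna Length:

- Abdomen Length > 7.1?
  - yes: Katydid
  - no: Antenna Length > 6.0?
    - yes: Katydid
    - no: Grasshopper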
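A tree like this is just nested tests, so it can be written as a small function. The thresholds 7.1 and 6.0 come from the text; the function name and the sample measurements are hypothetical, and the assignment of leaves to the yes/no branches is read off the flattened figure, so treat it as an assumption.

```python
def classify_insect(abdomen_length, antenna_length):
    """Walk the two-level insect tree from the root to a leaf."""
    if abdomen_length > 7.1:        # root test
        return "Katydid"
    if antenna_length > 6.0:        # second test, reached only on the "no" branch
        return "Katydid"
    return "Grasshopper"


# Hypothetical measurements, one per leaf
print(classify_insect(8.0, 3.0))
print(classify_insect(5.0, 7.0))
print(classify_insect(5.0, 4.0))
```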

Decision trees predate computers.

[Figure: an insect identification key with yes/no tests ("Antennae shorter than body?", "3 tarsi?", "Foretibia has ears?") leading to Grasshopper, Cricket, Katydids, and Camel Cricket.]

A decision tree is a flow-chart-like tree structure:

- Internal node denotes a test on an attribute
- Branch represents an outcome of the test
- Leaf nodes represent class labels or class distribution

Decision tree generation consists of two phases:

- Tree construction
  - At start, all the training examples are at the root
  - Partition examples recursively based on selected attributes
- Tree pruning
  - Identify and remove branches that reflect noise or outliers

Use of a decision tree: classifying an unknown sample. Test the attribute values of the sample against the decision tree.

## Decision Tree Classification

Basic algorithm (a greedy algorithm):

- Tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they can be discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:

- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
- There are no samples left

How do we construct the decision tree?
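The greedy procedure and its stopping conditions can be sketched as a short recursive function. This is a minimal illustration in the spirit of the basic algorithm, not a reference implementation: it assumes categorical attributes stored in dictionaries, uses information gain as the selection measure, and runs on a small hypothetical data set. Because it only branches on attribute values seen in the data, the "no samples left" case shows up at prediction time, where an unseen value falls back to the node's majority class.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Expected reduction in entropy from partitioning on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a nested dict (internal node)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:      # stop: all samples in one class
        return labels[0]
    if not attrs:                  # stop: no attributes left, majority vote
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {"attr": best, "default": majority, "children": {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node


def classify(tree, row):
    """Follow the tests from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree


# Hypothetical, already-discretized training data
rows = [
    {"weight": "heavy", "hair": "short"},
    {"weight": "heavy", "hair": "long"},
    {"weight": "heavy", "hair": "long"},
    {"weight": "light", "hair": "short"},
    {"weight": "light", "hair": "long"},
    {"weight": "light", "hair": "long"},
]
labels = ["M", "M", "M", "M", "F", "F"]

tree = build_tree(rows, labels, ["weight", "hair"])
print(tree)
print(classify(tree, {"weight": "light", "hair": "long"}))
```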

## Information Gain as a Splitting Criterion

Select the attribute with the highest information gain (information gain is the expected reduction in entropy).

Assume there are two classes, P and N. Let the set of examples S contain p elements of class P and n elements of class N. The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

where 0 log(0) is defined as 0.

Information gain in decision tree induction: assume that using attribute A, the current set will be partitioned into child sets S_1, ..., S_v, where S_i contains p_i elements of class P and n_i elements of class N. The expected information still needed after branching on A is

$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

and the encoding information that would be gained by branching on A is

$$\mathrm{Gain}(A) = I(p, n) - E(A)$$

Note: entropy is at its minimum if the collection of objects is completely uniform.
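These definitions can be checked numerically. The sketch below takes class counts directly and applies them to the Hair Length, Weight, and Age splits worked through in this section (a parent set of 4 females and 5 males). The function names and the tuple representation of counts are illustrative assumptions.

```python
import math


def entropy(*counts):
    """Entropy in bits of a class distribution given as raw counts.

    Zero counts are skipped, which implements the 0 * log(0) = 0 convention.
    """
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child sets."""
    total = sum(parent)
    expected = sum(sum(child) / total * entropy(*child) for child in children)
    return entropy(*parent) - expected


parent = (4, 5)  # (females, males)
splits = {
    "Hair Length <= 5": [(1, 3), (3, 2)],
    "Weight <= 160": [(4, 1), (0, 4)],
    "Age <= 40": [(3, 3), (1, 2)],
}

for name, children in splits.items():
    print(f"Gain({name}) = {information_gain(parent, children):.4f}")
```

A 50/50 split of two classes gives `entropy(1, 1) == 1.0`, the maximum for two classes, and a pure set such as `entropy(4, 0)` gives 0.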

Let us try splitting on Hair Length (test: Hair Length <= 5):

- Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
- Entropy(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
- Entropy(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710
- Gain(Hair Length <= 5) = 0.9911 - (4/9 × 0.8113 + 5/9 × 0.9710) = 0.0911

Let us try splitting on Weight (test: Weight <= 160):

- Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
- Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
- Entropy(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0
- Gain(Weight <= 160) = 0.9911 - (5/9 × 0.7219 + 4/9 × 0) = 0.5900

Let us try splitting on Age (test: Age <= 40):

- Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
- Entropy(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
- Entropy(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183
- Gain(Age <= 40) = 0.9911 - (6/9 × 1 + 3/9 × 0.9183) = 0.0183

Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the people at or under 160 are not perfectly classified. So we simply recurse! This time we find that we can split on Hair Length, and we are done!

- Weight <= 160?
  - no: Male
  - yes: Hair Length <= 2?
    - yes: Male
    - no: Female

We don't need to keep the data around, just the test conditions. How would new people be classified? Each one is simply run through the tests.

It is trivial to convert decision trees to rules. Rules to classify males/females:

- If Weight is greater than 160, classify as Male
- Else if Hair Length is less than or equal to 2, classify as Male
- Else classify as Female

## Summary of Classification

We have seen several major classification techniques: simple linear classifier, nearest neighbor, decision tree. There are other techniques: neural networks, support vector machines, genetic algorithms. In general, there is no one best classifier for all problems. You have to consider what you hope to achieve, and the data itself.

Let us now move on to the other classic problem of data mining and machine learning: clustering.
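The three classification rules read off the Weight / Hair Length tree translate directly into code. This is a minimal sketch; the function name and the sample inputs are hypothetical, and the units are whatever the original measurements used, which the text does not state.

```python
def classify_person(weight, hair_length):
    """Apply the rules derived from the Weight / Hair Length tree."""
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"


# Hypothetical people, one per rule
print(classify_person(weight=180, hair_length=8))
print(classify_person(weight=150, hair_length=1))
print(classify_person(weight=150, hair_length=6))
```

Only the two test conditions are needed at prediction time, which is why the training data itself can be discarded once the tree is built.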
