COS 424: Interacting with Data
Lecturer: Robert Schapire          Lecture # 4
Scribe: Anirudh Badam              February 15, 2007

The previous lecture defined the Nearest Neighbor algorithm and discussed how it suffers from the curse of dimensionality: as the number of dimensions increases, the Nearest Neighbor algorithm performs worse and worse. To better understand the curse of dimensionality with regard to the Nearest Neighbor algorithm, one must understand what higher dimensions look like. The following discussion demonstrates how higher dimensions (n >> 3) are qualitatively different from lower dimensions (2 or 3).

1 Higher Dimensions

1.1 Hypersphere Volume

Imagine a hypersphere of unit radius in n dimensions; call it S_n. Now imagine the smallest hypercube that contains the sphere within its volume; call this cube C_n (the side of this cube would be 2 (why?)). Now imagine a smaller cube c_n of side length 1 at the center of S_n, placed so that the centers of S_n, C_n and c_n coincide and the edges of the cubes are parallel. Figure 1 shows the setting for n = 2 and Figure 2 shows it for n = 3. Now let us look at the volume of this hypersphere.

For n = 2, 3 we see that the smaller cube is contained entirely in the sphere, and intuitively we would expect this to remain true in higher dimensions. Table 1 lists the volumes of c_n, S_n and C_n for some values of n. The results show that as n increases, the volume of S_n peaks (around n = 5) and then shrinks toward zero in the limit, even though C_n grows without bound. (Observation: the distance of a corner point of c_n from the center is

    sqrt( Σ_{i=1..n} (1/2)² ) = sqrt(n/4),

which is greater than one, i.e., the corner lies outside the hypersphere, for n > 4.)

Figure 1: In 2 Dimensions        Figure 2: In 3 Dimensions

    n     Volume of c_n    Volume of S_n     Volume of C_n
    2     1                π ≈ 3.14          4
    3     1                4π/3 ≈ 4.19       8
    4     1                4.93              16
    5     1                5.26              32
    6     1                5.16              64
    7     1                4.72              128
    8     1                4.06              256
    ..    ..               ..                ..
    ∞     1                0                 ∞

Table 1: Hypersphere Volume

1.2 Shell Volume

Imagine a hypershell H_n of unit outer radius and shell thickness ε. Figure 3 depicts such a shell in 2 dimensions. The volume of a hypersphere of radius r in n dimensions is k_n · r^n (why?), where k_n is a constant depending only on n. Hence,

    Volume of H_n = k_n · (1^n − (1 − ε)^n)

    Volume of H_n / Volume of S_n = k_n · (1^n − (1 − ε)^n) / (k_n · 1^n) = 1 − (1 − ε)^n → 1 as n → ∞

Figure 3: Hypershell in 2 Dimensions

1.3 Conclusion

The results in the previous sections mean that in higher dimensions the points are "as far as they can be" even in the average case (why?)! This means that the estimate Nearest Neighbor makes in higher dimensions can be wrong, in the sense that our intuition of a test point being similar to its nearest training point breaks down because the distances are huge. Recalling that the three requirements of a good classifier are enough data, low training error and simplicity, we can say that Nearest Neighbor has no control over the first one. The second one can hold in cases of smaller dimensions (in fact training error = 0 for n = 2, 3). Nearest Neighbor can also be quite complex! As was demonstrated in class, it can have a very complex separation between what it would label as positive and negative (the area distribution of a certain Nearest Neighbor classifier was shown in class to demonstrate this).
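As a quick numerical companion to Table 1, the shell-volume ratio, and the average-distance remark above, here is a small Python sketch (added for illustration, not part of the lecture). It uses the standard closed form k_n = π^(n/2) / Γ(n/2 + 1) for the constant of Section 1.2, and a Monte Carlo estimate of the mean distance between random points in the unit cube.

```python
import math
import random


def sphere_volume(n: int, r: float = 1.0) -> float:
    """Volume of an n-dimensional ball: k_n * r^n with k_n = pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n


def shell_fraction(n: int, eps: float) -> float:
    """Fraction of the unit ball's volume lying in the outer shell of thickness eps."""
    return 1.0 - (1.0 - eps) ** n


def avg_pairwise_distance(n: int, num_points: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean distance between random points in the unit cube [0,1]^n."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(n)] for _ in range(num_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    print("  n    vol S_n     vol C_n   shell frac (eps=0.1)   avg dist in [0,1]^n")
    for n in (2, 3, 5, 8, 20, 100):
        print(f"{n:3d}   {sphere_volume(n):8.4f}   {2.0 ** n:9.3g}"
              f"   {shell_fraction(n, 0.1):20.4f}   {avg_pairwise_distance(n):19.3f}")
```

Running it shows the ball volume peaking near n = 5 before collapsing toward zero, the shell fraction approaching 1, and the average inter-point distance in the unit cube continuing to grow with n, which is exactly the qualitative picture described above.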
2 Decision Tree

2.1 An interesting example

This section introduces a new kind of classifier. A decision tree is a tree in which each internal node makes a decision and each outgoing edge is followed according to the outcome of that decision; the process stops when a node that is a label (a leaf) is reached. Let us consider the following example. Table 2 shows the training data for a classifier and Table 3 shows the test data. The items of apparel of a person, such as tie, cape, mask, etc., are called the features/attributes of the data (people in this case). Figure 4 shows a decision tree based on this training data. The node marked tie decides whether the person wears a tie or not; based on this decision, the person is sent further down either the left or the right edge of the tree.

                sex      smokes   tie    mask   cape   ears   class
    Batman      male     no       no     yes    yes    yes    good
    Robin       male     no       no     yes    yes    yes    good
    Alfred      male     no       yes    no     no     no     good
    Penguin     male     yes      yes    no     no     no     bad
    Catwoman    female   no       no     yes    no     yes    bad
    Joker       male     no       no     no     no     no     bad

Table 2: Training Data

                sex      smokes   tie    mask   cape   ears   class
    Batgirl     female   no       no     yes    yes    yes    ???
    Riddler     male     no       no     yes    no     no     ???

Table 3: Test Data

Figure 4: Decision Tree (root: tie; the N branch leads to a cape test and the Y branch to a smokes test; cape = N → bad, cape = Y → good; smokes = N → good, smokes = Y → bad)

Thus, it is clear that Batgirl is labeled as good while Riddler is labeled as bad by this decision tree. (Not a bad classifier, but everybody in the class would be labeled bad by this tree (no tie, no cape).) Nevertheless, how did we build this tree?

2.2 Building a Decision Tree

It is clear that it is always possible to build a decision tree which fits the entire training data (why?). Hence, the objective must be to build the smallest possible decision tree which classifies the entire training data correctly. The following discussion describes a recursive procedure for finding a decision tree. Although this technique does not explicitly try to find the smallest decision tree consistent with the training data, it does find a decision tree that fits the entire training data. At a very high level, the algorithm is: figure out what to place at a particular node, then recursively apply the same procedure to find the children of that node.

Let us first figure out what to put at the root. There are six possibilities in all. Figure 5 shows how choosing tie as the root node would split the data, and also how choosing cape as the root would split it. Now we need to decide which is the better choice. By "better" choice we mean the one which splits the data to a "better" extent. It is easy to see that choosing tie would split the data so that on each side the number of bad labels equals the number of good labels. On the contrary, choosing cape would split it better, in the sense that the right-hand side has all good people and the left side has only one good person. This intuition can be formalized in the following manner.

Figure 5: Choice of Feature for Root (splitting on tie sends Batman +, Robin +, Joker −, Catwoman − down the N branch and Alfred +, Penguin − down the Y branch, with impurities I(1/2) and I(1/2); splitting on cape sends Alfred +, Penguin −, Catwoman −, Joker − down the N branch and Batman +, Robin + down the Y branch, with impurities I(1/4) and I(1))

Define an impurity function to gauge a split of the data. The best feature to choose is the one that leads to less impurity.
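As an illustrative sketch of this idea, the snippet below scores each candidate root on the Table 2 data using Gini impurity, a common impurity function (not necessarily the exact I(·) used in lecture), weighting each child node by its share of the examples. The ordering it produces agrees with the informal argument above: cape gives a lower weighted impurity than tie.

```python
from collections import Counter

# Training data from Table 2 (feature dict, class label).
TRAIN = [
    ({"sex": "male",   "smokes": "no",  "tie": "no",  "mask": "yes", "cape": "yes", "ears": "yes"}, "good"),  # Batman
    ({"sex": "male",   "smokes": "no",  "tie": "no",  "mask": "yes", "cape": "yes", "ears": "yes"}, "good"),  # Robin
    ({"sex": "male",   "smokes": "no",  "tie": "yes", "mask": "no",  "cape": "no",  "ears": "no"},  "good"),  # Alfred
    ({"sex": "male",   "smokes": "yes", "tie": "yes", "mask": "no",  "cape": "no",  "ears": "no"},  "bad"),   # Penguin
    ({"sex": "female", "smokes": "no",  "tie": "no",  "mask": "yes", "cape": "no",  "ears": "yes"}, "bad"),   # Catwoman
    ({"sex": "male",   "smokes": "no",  "tie": "no",  "mask": "no",  "cape": "no",  "ears": "no"},  "bad"),   # Joker
]


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2 (0 means the node is pure)."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(feature):
    """Size-weighted average impurity of the children produced by splitting on `feature`."""
    groups = {}
    for x, label in TRAIN:
        groups.setdefault(x[feature], []).append(label)
    n = len(TRAIN)
    return sum(len(labels) / n * gini(labels) for labels in groups.values())


if __name__ == "__main__":
    # Lower is better: cape comes out ahead of tie, matching the discussion of Figure 5.
    for feature in ("sex", "smokes", "tie", "mask", "cape", "ears"):
        print(f"{feature:>7}: weighted impurity = {split_impurity(feature):.3f}")
```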

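Returning to Figure 4, the finished tree is simple enough to encode directly as nested conditionals; a minimal sketch (feature names and values as in Tables 2 and 3) reproduces the labels claimed in Section 2.1 for the test data.

```python
def figure4_tree(person: dict) -> str:
    """Decision tree of Figure 4: test tie first, then cape (if no tie) or smokes (if tie)."""
    if person["tie"] == "no":
        return "good" if person["cape"] == "yes" else "bad"
    return "good" if person["smokes"] == "no" else "bad"


# Test data from Table 3; only the features the tree actually examines matter here.
batgirl = {"sex": "female", "smokes": "no", "tie": "no", "mask": "yes", "cape": "yes", "ears": "yes"}
riddler = {"sex": "male",   "smokes": "no", "tie": "no", "mask": "yes", "cape": "no",  "ears": "no"}

print("Batgirl ->", figure4_tree(batgirl))  # good
print("Riddler ->", figure4_tree(riddler))  # bad
```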
