Slide 2: In general, it is hard to define learning, as we don't really know what learning is.

Slide 3: In this example, the learning agent is something that learns by observing and interacting with the environment.

Slide 4: Magic == lots of hacks. You need a hypothesis representation that is general enough to express what you want to learn, but specific enough that the search for the correct hypothesis isn't too large.

Slide 10: Given any number of points, it is possible to find a hypothesis that is consistent with every training point. However, we want the hypothesis to also predict points that were not training points. This slide is an example of overfitting: yes, the hypothesis is consistent with every training example, but it is a poor predictor of points not in the training data set.

Slide 11: There is an implicit bias in the hypothesis's form. For example, if your hypothesis is a 4th-degree polynomial, where you are learning the coefficient of each term, you can only represent curves of degree 4 and less. Ockham's razor: when two hypotheses perform similarly, prefer the simpler one (i.e., if a line and a curve both fit the points, prefer the line).

Slide 12: How should data be represented? Here, each row is one example and each column is an attribute. Each attribute takes a value: Boolean attributes take true/false, etc. Some attributes are discrete, some are continuous. As you'll see later on, continuous-valued attributes can complicate things.

Slide 13: A learner should learn that the type, the director, and your mood are important to whether or not you liked the movie, while the company is not so important.

Slide 15: This is a decision tree to decide whether or not you are going to wait for a table at a restaurant. Here, hypotheses are in the form of decision trees. There are many possible decision trees (hypotheses).
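A decision-tree hypothesis of the kind described above can be sketched as a nested structure whose internal nodes test an attribute and whose leaves hold a class label. The attribute names below follow the movie example; the specific tree shape and values are invented for illustration:

```python
# A decision-tree hypothesis as nested dicts: internal nodes test an
# attribute, leaves are class labels. The attributes (type, mood) come
# from the movie example; the structure and values are made up.
tree = {
    "attribute": "type",
    "branches": {
        "comedy": "liked",
        "drama": {
            "attribute": "mood",
            "branches": {"happy": "liked", "tired": "disliked"},
        },
        "horror": "disliked",
    },
}

def classify(tree, example):
    """Walk from the root, following the branch matching the example's
    value for each tested attribute, until reaching a leaf label."""
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attribute"]]]
    return tree
```

For instance, `classify(tree, {"type": "drama", "mood": "happy"})` returns `"liked"`. Learning then amounts to searching this space of trees for a simple one consistent with the training data.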
You are searching for the simplest decision tree (hypothesis) that is most consistent with the training data. Is this example the simplest?

Slide 17: Trying to classify examples as P or G based on their attributes.

Slide 18: If tree height is what determines which hypothesis is simpler, there is no simpler, consistent hypothesis.

Slide 23: Any attribute splits the data (though some groups might be empty). Here, the type attribute splits the data into three groups. The optimal attribute would split the data into pure nodes. Here, it would be great if just knowing the type of movie told you whether or not the movie was liked.

Slide 24: High entropy because all the information is mixed up; it is chaotic.

Slide 25: Low entropy because the information is organized and orderly. Know the formula for entropy, it is very, very useful! For class probabilities p_i, the entropy is H = -sum_i p_i log2(p_i).

Slide 26: Log base conversion from base a to base b: log_a(x) = log_b(x) / log_b(a).

Slide 28: Imagine having a basket full of red and black balls. If p is the probability of a ball being black, then when everything is red, p is 0 and the entropy is 0. When everything is black, p is 1 and the entropy is 0. If half are red and half are black, p is 0.5 and the entropy is 1.

Slide 29: 4 positive examples and 8 negative examples. The entropy is still very high.

Slide 30: Split on the attribute that gives the most information gain (i.e., reduces the entropy the most). This is not necessarily optimal, but it is a good heuristic.

Slide 34: You need some default answer for missing data.

Slide 35: Learn which values of each attribute are the best to split on.
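The entropy values and the information-gain heuristic above can be checked numerically. A minimal sketch (the function names and the count-pair format are my own; only the entropy and gain formulas come from the notes):

```python
import math

def entropy(p):
    """Binary entropy in bits for a class probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node (all one class) has zero entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(pos, neg, splits):
    """Entropy reduction from splitting a node with `pos` positive and
    `neg` negative examples into groups, each a (pos, neg) count pair."""
    total = pos + neg
    before = entropy(pos / total)
    after = sum((p + n) / total * entropy(p / (p + n))
                for p, n in splits if p + n > 0)
    return before - after

print(entropy(0.5))       # half red, half black: entropy is 1.0
print(entropy(4 / 12))    # 4 positive, 8 negative: about 0.918, still high
```

As the slide-29 example shows, even a 4-to-8 split is close to maximally uncertain (0.918 of 1 bit), which is why pure nodes are the goal: an attribute whose split yields only pure groups has an information gain equal to the parent's full entropy.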