**Unformatted text preview:**

More Learning Theory Last TimeLearning TheoryHaussler, 1988Sample ComplexityPAC BoundsPAC BoundsPAC BoundsPAC BoundsPAC BoundsPAC LearningPAC LearningVC DimensionVC DimensionVC DimensionVC DimensionVC DimensionVC DimensionVC DimensionVC DimensionVC DimensionVC DimensionVC DimensionVC DimensionVC DimensionVC DimensionVC DimensionAxis Parallel RectanglesAxis Parallel RectanglesAxis Parallel RectanglesExamplesExamplesExamplesExamplesPAC Bounds with VC DimensionMore Learning TheoryNicholas RuozziUniversity of Texas at DallasBased on the slides of Vibhav Gogate and David SontagLast Time• Probably approximately correct (PAC)• The only reasonable expectation of a learner is that with high probability it learns a close approximation to the target concept• Specify two small parameters, 0 < 𝜖𝜖, 0 < 𝛿𝛿 < 1• 𝜖𝜖 is the error of the approximation• (1 −𝛿𝛿) is the probability that, given 𝑀𝑀 i.i.d. samples, our learning algorithm produces a classifier with error at most 𝜖𝜖2Learning Theory• We use the observed data in order to learn a classifier• Want to know how far the learned classifier deviates from the (unknown) underlying distribution• With too few samples, we will with high probability learn a classifier whose true error is quite high even though it may be a perfect classifier for the observed data• As we see more samples, we pick a classifier from the hypothesis space with low training error & hope that it also has low true error • Want this to be true with high probability – can we bound how many samples that we need?3Haussler, 1988• What we proved last time:Theorem: For a finite hypothesis space, 𝐻𝐻, with 𝑀𝑀 i.i.d. samples, and 0 < 𝜖𝜖 < 1, the probability that any consistent classifier has true error larger than 𝜖𝜖 is at most 𝐻𝐻 𝑒𝑒−𝜖𝜖𝜖𝜖• We can turn this into a sample complexity bound4Sample Complexity• Let 𝛿𝛿 be an upper bound on the desired probability of not 𝜖𝜖-exhausting the sample space• The probability that the version space is not 𝜖𝜖-exhausted is at most 𝐻𝐻 𝑒𝑒−𝜖𝜖𝜖𝜖≤ 𝛿𝛿• Solving for 𝑀𝑀 yields𝑀𝑀 ≥ −1𝜖𝜖ln𝛿𝛿𝐻𝐻= ln |𝐻𝐻| + ln1𝛿𝛿/𝜖𝜖5PAC BoundsTheorem: For a finite hypothesis space H, 𝑀𝑀 i.i.d. samples, and 0 < 𝜖𝜖 < 1, the probability that true error of any of the best classifiers (i.e., lowest training error) is larger than its training error plus 𝜖𝜖 is at most |𝐻𝐻|𝑒𝑒−2𝜖𝜖𝜖𝜖2• Sample complexity (for desired 𝛿𝛿 ≥ |𝐻𝐻|𝑒𝑒−2𝜖𝜖𝜖𝜖2)𝑀𝑀 ≥ ln 𝐻𝐻 + ln1𝛿𝛿/2𝜖𝜖26PAC Bounds• If we require that the previous error is bounded above by a fixed 𝛿𝛿, then with probability (1 −𝛿𝛿), for all ℎ ∈ 𝐻𝐻𝜖𝜖ℎ≤ 𝜖𝜖ℎ𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡+12𝑀𝑀ln |𝐻𝐻| + ln1𝛿𝛿• Follows from Chernoff bound|𝐻𝐻|𝑒𝑒−2𝜖𝜖𝜖𝜖2≤ 𝛿𝛿�ℎ∈𝐻𝐻𝑝𝑝 𝜖𝜖ℎ−𝜖𝜖ℎ𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡≥ 𝜖𝜖 ≤ |𝐻𝐻|𝑒𝑒−2𝜖𝜖𝜖𝜖2≤ 𝛿𝛿7“bias” “variance”PAC Bounds• If we require that the previous error is bounded above by 𝛿𝛿, then with probability (1 −𝛿𝛿), for all ℎ ∈ 𝐻𝐻𝜖𝜖ℎ≤ 𝜖𝜖ℎ𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡+12𝑀𝑀ln |𝐻𝐻| + ln1𝛿𝛿• Follows from Chernoff bound𝜖𝜖 ≥12𝑀𝑀ln |𝐻𝐻| + ln1𝛿𝛿�ℎ∈𝐻𝐻𝑝𝑝 𝜖𝜖ℎ−𝜖𝜖ℎ𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡≥ 𝜖𝜖 ≤ |𝐻𝐻|𝑒𝑒−2𝜖𝜖𝜖𝜖2≤ 𝛿𝛿8“bias” “variance”PAC Bounds• If we require that the previous error is bounded above by 𝛿𝛿, then with probability (1 −𝛿𝛿), for all ℎ ∈ 𝐻𝐻𝜖𝜖ℎ≤ 𝜖𝜖ℎ𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡+12𝑀𝑀ln |𝐻𝐻| + ln1𝛿𝛿• For small |𝐻𝐻|• High bias (may not be enough hypotheses to choose from)• Low variance9“bias” “variance”PAC Bounds• If we require that the previous error is bounded above by 𝛿𝛿, then with probability (1 −𝛿𝛿), for all ℎ ∈ 𝐻𝐻𝜖𝜖ℎ≤ 𝜖𝜖ℎ𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡+12𝑀𝑀ln |𝐻𝐻| + ln1𝛿𝛿• For large |𝐻𝐻|• Low bias (lots of good hypotheses)• High variance10“bias” “variance”PAC Learning• Given:• Set of data 𝑋𝑋• Hypothesis space 𝐻𝐻• Set of target concepts 𝐶𝐶• Training instances from unknown probability distribution over 𝑋𝑋 of the form (𝑥𝑥, 𝑐𝑐𝑥𝑥 )• Goal:• Learn the target concept 𝑐𝑐 ∈ C11PAC Learning• Given:• A concept class 𝐶𝐶 over 𝑛𝑛 instances from the set 𝑋𝑋• A learner 𝐿𝐿 with hypothesis space 𝐻𝐻• Two constants, 𝜖𝜖, 𝛿𝛿 ∈ (0,12)• 𝐶𝐶 is said to be PAC learnable by 𝐿𝐿 using 𝐻𝐻 iff for all distributions over 𝑋𝑋, learner 𝐿𝐿 by sampling 𝑛𝑛 instances, will with probability at least 1 −𝛿𝛿 outputs a hypothesis ℎ ∈ H such that• 𝜖𝜖ℎ≤ 𝜖𝜖• Running time is polynomial in 1𝜖𝜖,1𝛿𝛿, 𝑛𝑛, 𝑠𝑠𝑠𝑠𝑠𝑠𝑒𝑒(𝑐𝑐)12VC Dimension• Our analysis for the finite case was based on |𝐻𝐻|• If 𝐻𝐻 isn’t finite, this translates into infinite sample complexity• We can derive a different notion of complexity for infinite hypothesis spaces by considering only the number of points that can be correctly classified by some member of 𝐻𝐻• We will only consider the binary classification case for now13VC Dimension• What is the largest number of data points in 1-D that can be correctly classified by a linear separator (regardless of their assigned labels)?• 2 points:14Yes!VC Dimension• What is the largest number of data points in 1-D that can be correctly classified by a linear separator (regardless of their assigned labels)?• 2 points:15Yes!VC Dimension• What is the largest number of data points in 1-D that can be correctly classified by a linear separator (regardless of their assigned labels)?• 2 points:16Yes!VC Dimension• What is the largest number of data points in 1-D that can be correctly classified by a linear separator (regardless of their assigned labels)?• 3 points:17Yes!VC Dimension• What is the largest number of data

View Full Document