Active Learning
9.520 Class 22, 03 May 2006
Claire Monteleoni
MIT CSAIL

Outline
- Motivation
- Historical framework: query learning
- Current framework: selective sampling
- Some recent results
- Open problems

Active learning motivation
- Machine learning applications, e.g.:
  - Medical diagnosis
  - Document/webpage classification
  - Speech recognition
- Unlabeled data is abundant, but labels are expensive.
- Active learning is a useful model here: it allows for intelligent choices of which examples to label.
- Label-complexity, the number of labeled examples required to learn via active learning, can be much lower than the PAC sample complexity!

Supervised learning
Given access to labeled data (drawn i.i.d. from an unknown underlying distribution P), we want to learn a classifier chosen from hypothesis class H with misclassification rate < ε.
Sample complexity is characterized by d = VC dimension of H. If the data is separable, roughly d/ε labeled samples are needed.
Slide credit: Sanjoy Dasgupta

Active learning
In many situations unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?
Slide credit: S. Dasgupta

Active learning variants
There are several models of active learning:
- Query learning (a.k.a. membership queries)
- Selective sampling
- Active model selection
- Experiment design
Various evaluation frameworks:
- Regret minimization
- Minimize label-complexity to reach a fixed error rate
- Label-efficiency (fixed label budget)
We focus on classification, though active learning for regression exists too.

Membership queries
Earliest model of active learning in theory work [Angluin 1992].
- X = space of possible inputs, like {0,1}^n
- H = class of hypotheses
- Target concept h* ∈ H is to be identified exactly.
- You can ask for the label of any point in X: no unlabeled data.

H_0 = H
For t = 1, 2, …:
  pick a point x ∈ X and query its label h*(x)
  let H_t = all hypotheses in H_{t-1} consistent with (x, h*(x))

What is the minimum number of "membership queries" needed to reduce H to just {h*}?
Slide credit: S. Dasgupta

Membership queries: example
X = {0,1}^n, H = ANDs of positive literals, like x_1 ∧ x_3 ∧ x_10.

S = { } (set of AND positions)
For i = 1 to n:
  ask for the label of (1,…,1,0,1,…,1) [0 at position i]
  if negative: S = S ∪ {i}

Total: n queries.
General idea: synthesize highly informative points. Each query cuts the version space (the set of consistent hypotheses) in half.
Slide credit: S. Dasgupta

Problem
Many results exist in this framework, even for complicated hypothesis classes. But:
- [Baum and Lang, 1991] tried fitting a neural net to handwritten characters. The synthetic instances created were incomprehensible to humans!
- [Lewis and Gale, 1992] tried training text classifiers: "an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher."
Slide credit: S. Dasgupta

Selective sampling [Cohn, Atlas & Ladner, 1992]
- Given: a pool (or stream) of unlabeled examples x, drawn i.i.d. from the input distribution.
- The learner may request labels on examples in the pool/stream.
- (Noiseless) oracle access to the correct labels y.
- Constant cost per label.
The error of any classifier h is measured on distribution P: err(h) = P(h(x) ≠ y).
Goal: minimize the label-complexity of learning the concept to a fixed accuracy.

Can adaptive querying really help?
[CAL92, D04]: Threshold functions on the real line: h_w(x) = 1(x ≥ w), H = {h_w : w ∈ R}.
- Start with 1/ε unlabeled points.
- Binary search: just log(1/ε) labels are needed, from which the rest can be inferred! Exponential improvement in sample complexity.
[Figure: points on a line, labeled − to the left of threshold w and + to the right]
Slide credit: S. Dasgupta

More general hypothesis classes
For a general hypothesis class with VC dimension d, is a "generalized binary search" possible?
- Random choice of queries: d/ε labels
- Perfect binary search: d log(1/ε) labels
Where in this large range does the label complexity of active learning lie? We have already handled linear separators in 1-d.
Slide credit: S. Dasgupta

[1] Uncertainty sampling
Maintain a single hypothesis, based on the labels seen so far. Query the point about which this hypothesis is most "uncertain".
Problem: the confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class.
[Figure: 1-d example of labeled points (−/+) with a query point X]
Slide credit: S. Dasgupta

[2] Region of uncertainty
Suppose the data lie on a circle in R²; hypotheses are linear separators (the spaces X and H are superimposed).
- Current version space: the portion of H consistent with the labels so far.
- "Region of uncertainty": the part of data space about which there is still some uncertainty (i.e. disagreement within the version space).
[Figure: current version space and region of uncertainty in data space]
Slide credit: S. Dasgupta

[2] Region of uncertainty
Data and hypothesis spaces, superimposed (both are the surface of the unit sphere in R^d).
Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
[Figure: current version space and region of uncertainty in data space]
Slide credit: S. Dasgupta

[2] Region of uncertainty
The number of labels needed depends on H and also on P.
Special case: H = {linear separators in R^d}, P = uniform distribution over the unit sphere.
Theorem [Balcan, Beygelzimer & Langford, ICML '06]: Õ(d² log 1/ε) labels are needed to reach a hypothesis with error rate < ε.
Supervised learning:
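The exponential improvement from adaptive querying on 1-d thresholds ("Can adaptive querying really help?") is easy to simulate. The sketch below is not code from the lecture; the function and variable names are my own. It draws 1/ε unlabeled points, then binary-searches for the threshold, so only O(log 1/ε) labels are queried while every remaining label is inferred.

```python
import random

def binary_search_active(points, oracle):
    """Infer the labels of all sorted points while querying only O(log n).

    points: sorted list of 1-d inputs; oracle(x) returns 1(x >= w)
    for an unknown threshold w. Returns (labels, num_queries).
    """
    lo, hi = 0, len(points)      # invariant: points[:lo] negative, points[hi:] positive
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(points[mid]):  # positive: threshold is at or left of mid
            hi = mid
        else:                    # negative: threshold is right of mid
            lo = mid + 1
    labels = [0] * lo + [1] * (len(points) - lo)
    return labels, queries

# Example: 1/eps = 1024 unlabeled points, unknown threshold w
random.seed(0)
w = 0.37
points = sorted(random.random() for _ in range(1024))
labels, queries = binary_search_active(points, lambda x: int(x >= w))
assert labels == [int(x >= w) for x in points]  # all 1024 labels recovered
print(queries)  # about log2(1024) = 10 queries, vs. 1024 for passive labeling
```

The same halving idea is what "generalized binary search" tries to extend to richer hypothesis classes, where no such clean ordering of the data exists.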
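The membership-query example earlier (exactly identifying an AND of positive literals with n queries) can also be sketched in code. This is a minimal illustration under the slide's setup, with hypothetical function names; positions are 0-indexed here.

```python
def learn_and_of_literals(n, oracle):
    """Exactly identify an AND-of-positive-literals concept over {0,1}^n.

    oracle(x) answers a membership query, returning h*(x) for any x.
    Query the all-ones vector with a single 0 at position i: the answer
    is negative exactly when literal x_i appears in the target AND.
    Uses exactly n queries, as on the slide.
    """
    S = set()  # recovered set of AND positions
    for i in range(n):
        x = tuple(1 if j != i else 0 for j in range(n))
        if not oracle(x):  # flipping bit i to 0 made the concept false,
            S.add(i)       # so x_i must be one of the AND-ed literals
    return S

# Target concept: the AND of literals at positions {1, 3, 10}
target = {1, 3, 10}
oracle = lambda x: all(x[i] == 1 for i in target)
assert learn_and_of_literals(12, oracle) == target
```

Note that each query here is a synthesized point, which is exactly what makes membership queries powerful in theory and problematic in practice (the Baum–Lang and Lewis–Gale observations above).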