Active Learning
9.520 Class 22, 03 May 2006
Claire Monteleoni
MIT CSAIL

Outline
- Motivation
- Historical framework: query learning
- Current framework: selective sampling
- Some recent results
- Open problems

Active learning motivation
- Machine learning applications, e.g. medical diagnosis, document/webpage classification, speech recognition.
- Unlabeled data is abundant, but labels are expensive.
- Active learning is a useful model here: it allows intelligent choices of which examples to label.
- Label-complexity, the number of labeled examples required to learn via active learning, can be much lower than the PAC sample complexity!

Supervised learning
- Given access to labeled data (drawn i.i.d. from an unknown underlying distribution P), we want to learn a classifier chosen from a hypothesis class H, with misclassification rate < ε.
- Sample complexity is characterized by d = VC dimension of H.
- If the data is separable, roughly d/ε labeled samples are needed.
Slide credit: Sanjoy Dasgupta

Active learning
- In many situations unlabeled data is easy to come by, but there is a charge for each label.
- What is the minimum number of labels needed to achieve the target error rate?
Slide credit: S. Dasgupta

Active learning variants
- There are several models of active learning: query learning (a.k.a. membership queries), selective sampling, active model selection, and experiment design.
- There are various evaluation frameworks: regret minimization; minimizing label-complexity to reach a fixed error rate; label-efficiency (fixed label budget).
- We focus on classification, though active learning for regression exists too.

Membership queries
- The earliest model of active learning in theory work [Angluin 1992].
- X = space of possible inputs, like {0,1}^n; H = class of hypotheses; target concept h* ∈ H is to be identified exactly.
- You can ask for the label of any point in X: there is no unlabeled data.
- H_0 = H. For t = 1, 2, ...:
  - pick a point x ∈ X and query its label h*(x)
  - let H_t = all hypotheses in H_{t-1} consistent with (x, h*(x))
- What is the minimum number of "membership queries" needed to reduce H to just {h*}?
Slide credit: S. Dasgupta

Membership queries: example
- X = {0,1}^n; H = ANDs of positive literals, like x1 ∧ x3 ∧ x10.
- S = { } (the set of AND positions). For i = 1 to n:
  - ask for the label of (1,...,1,0,1,...,1) [0 at position i]
  - if negative: S = S ∪ {i}
- Total: n queries.
- General idea: synthesize highly informative points. Each query cuts the version space, the set of consistent hypotheses, in half.
Slide credit: S. Dasgupta
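The example above is concrete enough to run. Here is a minimal Python sketch of it; the function and oracle names are illustrative rather than from the lecture, and `oracle` stands in for label access to the target concept h*.

    def learn_conjunction(n, oracle):
        """Identify an AND of positive literals over {0,1}^n using n
        membership queries; oracle(x) returns the target label h*(x)."""
        S = set()
        for i in range(n):
            # Synthesize the informative point: all ones, with a 0 at position i.
            x = [1] * n
            x[i] = 0
            if not oracle(tuple(x)):
                # Negative answer: literal i must appear in the target AND.
                S.add(i)
        return S

    # Hypothetical target: x1 AND x3 (0-indexed positions 0 and 2).
    target = {0, 2}
    oracle = lambda x: all(x[i] == 1 for i in target)
    assert learn_conjunction(5, oracle) == target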
Problem
- There are many results in this framework, even for complicated hypothesis classes, but the synthesized queries can be unusable in practice:
- [Baum and Lang, 1991] tried fitting a neural net to handwritten characters; the synthetic instances created were incomprehensible to humans!
- [Lewis and Gale, 1992] tried training text classifiers: "an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher."
Slide credit: S. Dasgupta

Selective sampling [Cohn, Atlas & Ladner, 1992]
- Given: a pool (or stream) of unlabeled examples x, drawn i.i.d. from the input distribution.
- The learner may request labels on examples in the pool/stream, with (noiseless) oracle access to the correct labels y, at a constant cost per label.
- The error of any classifier h is measured on the distribution P: err(h) = P(h(x) ≠ y).
- Goal: minimize the label-complexity of learning the concept to a fixed accuracy.

Can adaptive querying really help?
- [CAL92, D04]: threshold functions on the real line: h_w(x) = 1(x ≥ w), H = {h_w : w ∈ R}. [Figure: the real line with threshold w; points left of w are labeled -, points right of w are labeled +.]
- Start with 1/ε unlabeled points.
- Binary search needs just log 1/ε labels, from which the rest can be inferred: an exponential improvement in sample complexity. (A sketch follows below.)
Slide credit: S. Dasgupta
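A minimal Python sketch of this binary-search argument, assuming a noiseless threshold target; the names and constants are illustrative, not from the lecture.

    import random

    def active_learn_threshold(pool, oracle):
        """Binary-search a pool of unlabeled points, spending only
        O(log |pool|) label queries; the remaining labels are inferred."""
        pts = sorted(pool)
        lo, hi = 0, len(pts)       # invariant: first + point lies in pts[lo:hi]
        queries = 0
        while lo < hi:
            mid = (lo + hi) // 2
            queries += 1
            if oracle(pts[mid]):   # labeled +: boundary is at or left of mid
                hi = mid
            else:                  # labeled -: boundary is right of mid
                lo = mid + 1
        # All pool labels are now known: pts[:lo] are -, pts[lo:] are +.
        return (pts[lo] if lo < len(pts) else None), queries

    w = 0.37                                          # unknown threshold
    pool = [random.random() for _ in range(1000)]     # think: 1/ε draws
    w_hat, q = active_learn_threshold(pool, lambda x: x >= w)
    print(f"boundary near {w_hat:.3f} after only {q} label queries")  # q ≈ 10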


More general hypothesis classes
- For a general hypothesis class with VC dimension d, is a "generalized binary search" possible?
- Random choice of queries: d/ε labels. Perfect binary search: d log 1/ε labels.
- Where in this large range does the label complexity of active learning lie?
- We've already handled linear separators in 1-d...
Slide credit: S. Dasgupta

[1] Uncertainty sampling
- Maintain a single hypothesis, based on the labels seen so far.
- Query the point about which this hypothesis is most "uncertain". (A sketch follows below.)
- Problem: the confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class. [Figure: a 1-d data set, - - - - - - - + + + + + - -, with the query point marked X.]
Slide credit: S. Dasgupta
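A minimal sketch of one uncertainty-sampling step. scikit-learn's LogisticRegression is used purely for illustration here; the lecture does not prescribe a particular base learner.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def uncertainty_sampling_step(model, X_labeled, y_labeled, X_pool):
        """Fit the single current hypothesis, then return the index of the
        pool point it is least sure about (probability closest to 1/2)."""
        model.fit(X_labeled, y_labeled)
        probs = model.predict_proba(X_pool)[:, 1]
        return int(np.argmin(np.abs(probs - 0.5)))

    X_lab = np.array([[0.1], [0.9]]); y_lab = np.array([0, 1])
    X_pool = np.array([[0.2], [0.5], [0.8]])
    i = uncertainty_sampling_step(LogisticRegression(), X_lab, y_lab, X_pool)
    # i == 1: the pool point nearest the current decision boundary is queried next.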

[2] Region of uncertainty
- Suppose the data lie on a circle in R² and the hypotheses are linear separators. [Figure: the spaces X and H superimposed, showing the current version space and the region of uncertainty in data space.]
- Current version space: the portion of H consistent with the labels so far.
- "Region of uncertainty" = the part of data space about which there is still some uncertainty (i.e., disagreement within the version space).
- More generally, the data and hypothesis spaces, superimposed, are both the surface of the unit sphere in R^d.
- Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query. (A sketch follows below.)
- The number of labels needed depends on H and also on P.
- Special case: H = {linear separators in R^d}, P = uniform distribution over the unit sphere.
- Theorem [Balcan, Beygelzimer & Langford, ICML '06]: Õ(d² log 1/ε) labels suffice to reach a hypothesis with error rate < ε, versus roughly d/ε labels for supervised learning.
Slide credit: S. Dasgupta
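A minimal sketch of the [CAL92] membership test. Tracking the exact region of uncertainty is intractable in general, so this sketch decides disagreement by checking whether the labeled data stay linearly separable under each forced label for the candidate point; that surrogate, and the use of a hard-margin SVM, are implementation conveniences assumed here, not part of the original algorithm.

    import numpy as np
    from sklearn.svm import SVC

    def in_region_of_uncertainty(x, X, y):
        """x is uncertain iff some consistent separator labels it + and
        another labels it -, i.e. both forced labelings stay separable."""
        for forced in (0, 1):
            Xa, ya = np.vstack([X, x]), np.append(y, forced)
            if len(np.unique(ya)) < 2:
                continue                       # one class only: trivially consistent
            clf = SVC(kernel="linear", C=1e6)  # hard-margin surrogate
            clf.fit(Xa, ya)
            if (clf.predict(Xa) != ya).any():  # no consistent separator exists
                return False
        return True

    # CAL loop: draw unlabeled points; query (at random) only those falling in
    # the region of uncertainty -- labels outside it are already determined.
    X = np.array([[0.0, 1.0], [1.0, 0.0]]); y = np.array([0, 1])
    print(in_region_of_uncertainty(np.array([0.9, 0.9]), X, y))  # True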
