Towards Best Effort Information Extraction iFlex Warren Shen Pedro DeRose Robert McCann AnHai Doan Raghu Ramakrishnan By Akanksha Dayal Topics 1 2 3 4 5 6 7 8 Introduction Approximate IE programs Representing Approximate Data Approximate Query Processing The Next Effort Assistant Empirical Evaluation Conclusion Future Work The Idea User U uses a declarative language and writes and initial approximate IE program P using an approximate Query Processor that may fetch an approximate result U then examines that result and figures out if further refinement is required iFlex Approach Traditional Approaches Precise IE Information Extraction programs Precise IE results Traditional Approaches Drawbacks Difficult to execute partially specified IE programs and obtain meaningful results which produces a long debug loop Long time before first meaningful results are obtained impractical for time sensitive operation May be waste of effort as in most cases approximate results suffice and can be produced quickly How iFlex Addresses the drawbacks Relaxes precise IE requirement in an effort to have a best effort Information extraction User writes a quick initial approximate IE program P with possible real world semantics and evaluates it using approximate query processor to quickly arrive at approximate results Example Problem For a given 500 web pages with 1 house per page user U wants to find out all house with price exceeding 500 000 and whose high school is Lincoln Using iFlex The user U writes P initial approximate IE program specifying constraints like price should be numeric and maybe just one number greater than 500 000 with high school as Lincoln 9 search results U stops 200 result pages U redefines P for further refinement The Flexible Approach User knows when the program is under defined or over defined by looking at the result set U asks Next effort Assistant to suggest what to focus on For example look for the prices in bold More time spent writing program iterating high precision results Challenge Language for writing approximate IE programs For approximate IE programs we can extend a precise IE language like UIMA GATE Lixto or Xlog 1 Xlog Extension of Datalog Datalog variant for declarative IE programs 2 Alog To write approximate IE rules specify types of approximation in values Xlog Xlog Program P consists of multiple rules Rules are of form p q1 q2 qn where p qi are predicates head body resp Each predicate atom in a rule is associated with a relational table Extentional Predicate when table is provided to P Intensional Predicate when table needs to be computed using the rules in P Example Xlog program that extracts houses with price over 500000 and area above 4500 f2 top high school R1 houses x p a h housePages x extractHouses x p a h R2 schools s schoolPages y extractSchools y s R3 q x p a h houses a p a h schools s p 500000 a 4500 approxMatch h s where x documents p price a area h top high school 2 predicates extractHouses x p a h extractSchools y s 1 p function approxMatch h s true iff 2 text spans are similar tf idf Example cont d Assuming housePages consists of 2 documents x1 x2 and schoolPages of 2 documents y1 y2 R1 tuple extracted from x1 x1 351000 2750 vanhise High R1 tuple extracted from x2 x2 619000 4700 Bask High R2 3 tuples y1 Bask High y1 Franklin y1 Vanhise R3 x2 619000 4700 Bask High Xlog Usage Limitations Decompose task into subtasks making up ppredicates and p functions Then implement these made up predicates functions using a procedural language Significant amount of time spent in writing procedures Refinement aggravates the whole scene Best effort IE with Alog Start off by initial Xlog program called Skeleton list of IE predicates no procedures associated with them U implements them only partially So for a predicate q U partially writes one or more description rules U executes the so called partial IE programs Example S1 houses x p a h housePages x extractHouses x p a h S2 schools s schoolPages y extractSchools y s S3 Q x p a h houses x p a h schools s p 500000 a 4500 approxMatch h s S4 extractHouses x p a h from x p from x a from x h numeric p yes numeric a yes S5 extractSchools y s from y s bold font s yes Example cont d Vocabulary Description rules r Declare a set of domain constraints about the textual features of the attributes extracted q q1 qn similiarity to Xlog rule q is an IE predicate and defines a relation Domain Constraints A domain constraint f a v feature f of any text span that is a value for attribute a must take value v Writing Safe Description Rules A description rule is safe if each non input variable in the head also appears in the body either in intentional or extensional predicate or as in output variable in an IE predicate For example S1 extractHouses x p a h numeric p yes numeric a yes p a no indication where they are extracted from iFlex provides a built in from x y Safe Description Rule from x y extract all subspans of y from doc x e g S2 extractHouses x p a h from x p from x h from x a numeric p yes numeric a yes Providing Ways to Declare the Approximation Types Two common types of approximation supported 1 Existence of a tuple Existence annotation this indicates that each tuple in the relation R produced by a rule r may or may not exist 2 Value of an attribute of a tuple Attribute annotation this indicates that an attribute takes a value from a given set but we do not know which value U annotates an Alog rule to indicate these approx types Example S1 houses x p a h housePages x extractHouses x p a h S2 schools s schoolPages y extractSchools y s S3 Q x p a h houses x p a h schools s p 500000 a 4500 approxMatch h s S4 extractHouses x p a h from x p from x a from x h numeric p yes numeric a yes S5 extractSchools y s from y s bold font s yes S 1 houses x p a h housePages x extractHouses x p a h S 2 schools s schoolPages y extractSchools y s S 3 Q x p a h houses x p a h schools s p 500000 a 4500 approxMatch h s Example cont d Formal Definitions Definition 1 Existence Annotation Let p be the head of a rule r that produces relation R under the normal Xlog semantics Then adding an existence annotation to r means replacing p with p This produces a rule r that defines a set of possible relations R which is the powerset of the tuples in R Definition 2 Attribute Annotation Suppose the head of rule r is p a1 ai an Then annotating attribute ai means replacing the head with p a1 ai an to produce rule r Suppose that r defines relation R under the
View Full Document