Guest Lecture Large Scale Deep Web Integration Exploring and Querying Structured Data on the Deep Web Zhen Zhang What you will learn in this lecture What is deep Web Why information integration on deep Web What are integration paradigms What are technical challenges What are proposed solutions 2 The Web becomes increasing dynamic Static Static surface surface Web Web Dynamic Dynamic deep deep Web Web 3 Data are hidden behind query forms 4 Bring data up front An evidence Google Base 5 6 The Deep Web Databases on the Web How to enable effective access to the deep Web 7 Survey the frontier BrightPlanet com March 2000 Bergman00 Overlap analysis of search engines n0 nb na N Estimated 43 000 96 000 deep Web sites Content size 500 times that of surface Web 8 Survey the frontier UIUC MetaQuerier April 2004 ChangHL 04 Macro Deep Web at large Data Automatically sampled 1 million IPs Micro per source specific characteristics Data Manually collected sources 8 representative domains 494 sources Airfare 53 Autos 102 Books 69 CarRentals 24 Hotels 38 Jobs 55 Movies 78 MusicRecords 75 Available at http metaquerier cs uiuc edu repository 9 We want to observe How many deep Web sources are out there How many structured databases 348 000 structured 102 000 text 3 1 How do search engines cover them 307 000 sites 450 000 DBs 1 258 000 interfaces Google covered 5 fresh and 21 state objects InvisibleWeb com covered 7 8 sources How hidden are they CarRental 0 Airfares 4 MusicRec Books Movies 80 10 Google s Recent Survey courtesy Jayant Madhavan 11 System Example Applications 12 Vertical Search Engines Warehousing approach e g Libra Academic Search NieZW 05 courtesy Integrating information from multiple types of sources Ranking papers conferences and authors for a given query Handling structured queries MSRA Web Database Web Database Web Database Web Database Web Database Journal Homepage PDF PS DOC Conf Homepage Auhtor Homepage 13 On the fly Meta querying Systems e g WISE HeMYW03 MetaQuerier ChangHZ05 MetaQuerier UIUC FIND sources Cars com Amazon com db of dbs Apartments com 411localte com QUERY sources unified query interface 14 What needs to be done Technical Challenges Source Modeling Selection Schema Matching Source Querying Crawling and Obj Ranking Data Extraction 15 Technical Challenges 1 Source Modeling Selection How to describe a source and find right sources for query answering 16 Source Modeling Selection for Large Scale Integration Focus Discovery of sources Focus Extraction of source models Focused crawling to collect query interfaces BarbosaF05 ChangHZ05 Hidden grammar based parsing ZhangHC04 Proximity based extraction HeMY 04 Classification to align with given taxonomy HessK03 Kushmerick03 Focus Organization of sources and query routing Offline clustering HeTC04 PengMH 04 Online search for query routing KabraLC05 17 Form Extraction the Problem Output all the conditions for each Grouping elements into query conditions Tagging elements with their semantic roles attributeoperator value 18 Form Extraction Parsing Approach ZhangHC04 A hidden syntactic model exist Observation Interfaces share patterns of presentation Hypothesis Interface Creation Now the problem Given query capabilities Gramma r how to find 19 Best Effort Visual Language Parsing Framework Input HTML query form 2P Grammar Productions Preferences Tokenizer Layout Engine BE Parser X Ambiguit y Resolutio n Error Handling Output semantic structure 20 Form Extraction Clustering Approach HessK03 Kushmerick03 Concept A form as a Bayesian network Training Estimate the Bayesian probabilities Classification Max likelihood predictions given terms 21 Technical Challenges 2 Schema Matching How to match the schematic structures between sources 22 Schema Matching for Large Scale Integration Focus Matching large number of interface schemas often in a holistic way Statistical model discovery HeC03 correlation mining HeCH04 HeC05 Subject Category Business Query probing WangWL 04 Fiction Clustering HeMY 03 WuYD 04 Type Computer Categor Corpus assisted MadhavanBD 05 yBusiness Web assisted WuDY06 Fiction Computer Focus Constructing unified interfaces As a global generative model HeC03 Cluster merge select HeMY 03 23 WISE Integrator Cluster Merge Represent HeMY 03 24 WISE Integrator Cluster Merge Represent HeMY 03 Matching attributes Synonymous label WordNet string similarity Compatible value domains enum values or type Constructing integrated interface form initial empty until all attributes covered take one attribute find clusters it belongs to select a representative and merge values put representative to the interface if not there 25 Statistical Schema Matching MGS A hidden statistical model exist HeC03 HeCH04 HeC05 Observation Schemas share tendencies of attribute usage M 1 Hypothesis C1 C2 C3 2 1 1 2 1 3 1 3 Concepts Attributes 4 5 1 4 Statistical Model Schema Generation Now the problem Given M how to find attribute matchings M 26 Technical Challenges 3 Source Querying Crawling Search How to query a source How to crawl all objects and to search them 27 Source Querying for Large Scale Integration Metaquerying model Focus On the fly Querying 1 Vertical search engine model Focus Source crawling to collect objects 2 MetaQuerier Query Assistant ZhangHC05 Form submission by query generation selection e g RaghavanG01 WuWLM06 Focus Object search and ranking NieZW 05 28 On the fly Querying ZhangHC05 Type locality based Predicate Target template P Source predicate s Translation X Type Recognizer Domain Specific Handler Predicat e Mapper Text Numeric Datetime Handler Handler Handler Target Predicate t Correspondences occur within localities Translation by type handler 29 Source Crawling by Query Selection WuWL 06 Author Title Category Ullman Complier System Ullman Data Mining Application Ullman Han Automata Data Mining Theory Application System Application Han Ullman Theory Automata Data Mining Conceptually the DB as a graph Compiler Node Attributes Edge Occurrence relationship Crawling is transformed into graph traversal problem Find a set of nodes N in the graph G such that for every node i in G there exists a node j in N j i And the summation of the cost of nodes in N should be minimum 30 Object Ranking Object Relationship Graph NieZW 05 Popularity Propagation Factor for each type of relationship link Popularity of an object is also affected by the popularity of the Web pages containing the object 31 Technical Challenges 4 Data Extraction How to extract
View Full Document