The Cimple Project on Community Information Management
AnHai Doan
University of Wisconsin-Madison

The CIM Problem
- Numerous online communities: database researchers, movie fans, legal professionals, bioinformatics, enterprise intranets, tech support groups
- Each community: many data sources, many members
  - Database community: home pages, project pages, DBworld, DBLP, conference pages
  - Movie fan community: review sites, movie home pages, theatre listings
  - Legal profession community: law firm home pages

The CIM Problem
- Members often want to discover, query, and monitor information in the community
- Database community
  - What is new in the past week in the database community?
  - Any interesting connection between researchers X and Y?
  - Find all citations of this paper in the past one week on the Web
  - What are current hot topics?
  - Who has moved where?
- Legal profession community
  - Which lawyers have moved where?
  - Which law firms have taken on which cases?

The CIM Problem
- To address such needs, build data portals
  - starting out topic-based, now structured data portals
  - DBLP, Citeseer, IMDB, GlobalSpec, etc.
- Limitations of current solutions
  - built mostly by hand: labor intensive, error prone
  - hard to port solutions
  - few services other than browsing and keyword search

Cimple Project (Wisconsin, Yahoo Research)
- Develop generic solutions to create structured data portals via extraction + integration + mass collaboration
[Figure: Cimple pipeline. Sources: researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, Web pages, text documents. Example extracted fact: Jim Gray, give-talk, SIGMOD-04. Services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Users: personalize system, provide feedback.]

The Research Team
- Faculty / Vice President: AnHai Doan, Raghu Ramakrishnan
- Current students: Pedro DeRose, Warren Shen, Fei Chen, Yoonkyong Lee, Doug Burdick, Mayssam Sayyadian, Xiaoyong Chai, Ting Chen

Prototype System: DBLife
- Integrate data of the DB research community
- 1164 data sources
- Crawled daily: 11000 pages, 160 MB / day

Data Extraction

Data Integration
[Figure: mentions "Raghu Ramakrishnan"; "co-authors = A. Doan, Divesh Srivastava"]

Resulting ER Graph
[Figure: ER graph. Nodes: paper "Proactive Re-optimization"; people Shivnath Babu, Pedro Bizarro, Jennifer Widom, David DeWitt; conference SIGMOD 2005. Edge labels: write, coauthor, advise, PC-member, PC-Chair.]

Querying the ER Graph
- Query: "David DeWitt" "Jennifer Widom"
[Figure: three numbered answers, each a subgraph connecting David DeWitt and Jennifer Widom. Labels shown: coauthor; PC-member and PC-Chair of SIGMOD 2005; advise and coauthor edges through Shivnath Babu.]

Provide Services
- DBLife system

Mass Collaboration: Example 1
- Picture is removed if enough users vote "no"

Mass Collaboration Meets Jeff Naughton
- Jeffrey F. Naughton swears that this is David J. DeWitt

Mass Collaboration: Example 2
- Community Wikipedia, backed up by a structured underlying database

What We Have Done
- Define the CIM problem
  - understand it a little bit
  - start to talk about it in the DB community [SIGMOD-06 tutorial, IEEE DEB-06, CIDR-07]
- Build DBLife
  - helps clarify research issues
  - live at dblife.cs.wisc.edu; latest stuff at dblife-labs.cs.wisc.edu
- Start some preliminary research [ICDE-07a, ICDE-07b]

What We Would Like to Do Next
- Release DBLife as a research / education tool
  - possible service to the DB community
  - demo of CIM systems
  - benchmark / challenge for data integration / extraction
- Develop and release a generic Cimple platform
  - anyone can use it to build structured data portals
- Build CimBase, a hosting service
  - anyone can specify a structured portal on CimBase
  - we will build and host it
- Continue research, expand team, build alliance

Research Challenges (1)
[Figure: Cimple pipeline, as on the Cimple Project slide]
- Information extraction
- Data integration
- Mass collaboration

Research Challenges (2)
[Figure: Cimple pipeline, as on the Cimple Project slide]
- Exploiting extracted data
- Handling uncertainty, provenance, explanation
- Dealing with evolving data (versioning, temporal data)

Research Challenges (3)
[Figure: Cimple pipeline, as on the Cimple Project slide]
- What is the right architecture?
- What is the right data model / storage?
- How to build continuously running systems?
- How to build massively scalable hosting services?
- How to build a generic CIM platform?

Rest of the Talk
- The CIM problem
- The Cimple solution approach
- What we have done / plan to do
- Research challenges
  - information extraction
  - data integration (focus on entity matching)
  - mass collaboration
- Broader perspectives

Declarative IE
- Current IE research
  - develops learning / rule-based solutions [SIGMOD-06 tutorial]
  - focuses largely on improving accuracy
[Figure: sample text with the fragments "DECLARATIVE IE", "Dr. R. Ramakrishnan", "This is a fun topic"]
- Real-world IE applications
  - glue multiple such solutions together using Perl
  - serious problems: hard to develop, understand, debug, and optimize

Example in DBLife
- Find conference name in raw text

    # Regular expressions to construct the pattern to extract conference names

    # These are subordinate patterns
    my $wordOrdinals = "(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
    my $numberOrdinals = "(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
    my $ordinals = "(?:$wordOrdinals|$numberOrdinals)";
    my $confTypes = "(?:Conference|Workshop|Symposium)";
    # A word starting with a capital letter and ending with 0 or more spaces
    my $words = "(?:[A-Z]\\w+\\s*)";
    # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
    my $confDescriptors = "(?:international\\s+|[A-Z]+\\s+)";
    my $connectors = "(?:on|of)";
    # Conference abbreviations like "(SIGMOD'06)"
    my $abbreviations = "(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))";

    # The actual pattern we search for. A typical conference name this pattern will find is
    # "3rd International Conference on Blah Blah Blah (ICBBB-05)"
    my $fullNamePattern = "((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";

    # Given a $dbworldMessage, look for the conference pattern
    lookForPattern($dbworldMessage, $fullNamePattern);

    # In a given file, look for occurrences of pattern
    # pattern is
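The composition idea in the DBLife example (small named sub-patterns glued into one full conference-name pattern) can be sketched in a few lines. This is an illustrative Python sketch, not the DBLife code: the deck's fragment is Perl, and the names `FULL_NAME` and `find_conference_names`, the exact quantifiers, and the shortened ordinal list are all assumptions made here.

```python
import re

# Sub-patterns, built up the way the slide builds them. All names and the
# exact quantifiers below are assumptions made for this sketch.
WORD_ORDINALS = r"(?i:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth)"
NUMBER_ORDINALS = r"(?:\d*(?:1st|2nd|3rd|\dth))"   # 1st, 3rd, 11th, 22nd
ORDINALS = rf"(?:{WORD_ORDINALS}|{NUMBER_ORDINALS})"
CONF_TYPES = r"(?:Conference|Workshop|Symposium)"
WORDS = r"(?:[A-Z]\w+\s+)"                         # capitalized word plus trailing spaces
CONNECTORS = r"(?:on|of)"
ABBREVIATION = r"(?:\([A-Z]\w+[\W\s]*?\d\d+\))"    # e.g. (SIGMOD'06)

# Full pattern: [ordinal] [capitalized words] type [on/of topic] [(ABBR-05)]
FULL_NAME = re.compile(
    rf"(?P<name>(?:{ORDINALS}\s+)?{WORDS}*{CONF_TYPES}"
    rf"(?:\s+{CONNECTORS}\s+[^\n\r(]*?)?"   # optional "on <topic>"
    rf"(?:\s*{ABBREVIATION})?)"             # optional abbreviation in parentheses
    r"(?=\s*(?:[\n\r.<]|$))"                # name must end at a line break, '.', '<', or end of text
)


def find_conference_names(text: str) -> list[str]:
    """Return every substring of `text` that matches the conference-name pattern."""
    return [m.group("name") for m in FULL_NAME.finditer(text)]


if __name__ == "__main__":
    message = (
        "Call for papers\n"
        "3rd International Conference on Blah Blah Blah (ICBBB-05)\n"
        "Deadline: soon"
    )
    for name in find_conference_names(message):
        print(name)
```

Even in this small form, the pattern is hard to read and to debug once the pieces are glued together, which is the difficulty the Declarative IE slide points at.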