MASSACHUSETTS INSTITUTE OF TECHNOLOGY
SLOAN SCHOOL OF MANAGEMENT

15.565 Integrating Information Systems: Technology, Strategy, and Organizational Factors
15.578 Global Information Systems: Communications & Connectivity Among Information Systems

Spring 2002
Lecture 17

WEB AS A DATABASE: EVOLUTION OF XML

Traditional Use of the Web: For direct human usage
Focus: Entertainment
[Diagram labels: Browser; “raw information”; Web sites A, B, C]

New Use of the Web: Program intermediaries
Focus: Productivity
Examples:
• Extract just mortgage rate information
• Compare mortgage rates offered by multiple sources
• Build cumulative database of mortgage rates over time
• Compare cumulative rates with previously stored, alert to new highs and lows
[Diagram labels: Web sites A, B, C; Web Wrapper and Context Mediator Technology; Programs for: Enhancing, Analyzing, Consolidating, Processing; Internal data bases; Repository; Exception reports]

MIT Sloan COntext INterchange (COIN) Project

OUTPUT PROCESSING
- ODBC Driver
- Web-Publishing

CONTEXT MEDIATION
* Automatic conflict detection and conversion
- Derived data
- Source selection
- Source attribution

INPUT PROCESSING
* Automatic web wrapping
- Semi-structured text
- Multi-source query plan and execution

[Other diagram labels: Receivers; Applications; Browsers; Data bases; TRUSTED AGENTS; Sources; Web Pages]

APPLICATIONS: Financial services, electronic commerce, asset visibility, in-transit visibility.

Example Semi-structured Web data: Intel SEC filing

Cameleon Architecture
[Diagram labels: Data; SQL; Authentication, S; Core; Simple SQL Query & Output Format; Output in Desired Format; Retrieval; HTTP Client; Query Handling; Web or Database; Spec File Parsing; Extraction; Spec Files; Regular Expression Engine; Relational Front End; Planner; Optimizer; Executioner; Registry; Application]

CIA Fact Book Example
http://www.odci.gov/cia/publications/factbook/geos/sn.html

CAMELEON QUERY:
Select capital, location, coordinates, totalarea, climate, population, GDP
from cia where Country="Singapore"

CAMELEON RESULTS:
Record 1
CAPITAL      Singapore
LOCATION     Southeastern Asia, islands between Malaysia and Indonesia
COORDINATES  1 22 N, 103 48 E
TOTALAREA    647.5 sq km
CLIMATE      tropical; hot, humid, rainy; no pronounced rainy or dry seasons; thunderstorms occur on 40% of all days (67% of days in April)
POPULATION   4,151,264 (July 2000 est.)
GDP          $98 billion (1999 est.)

Spec file for CIA Fact Book (partial)

#Relation=cia
#Source=http://www.odci.gov/cia/publications/factbook/country.html
#Attribute=Link#String
#Begin=Top\s*of\s*Page
#Pattern=<LI><FONT SIZE=-1><a href="([^"]*)">#Country#</a></font>
#End=</[Bb][oO][dD][yY]>
#Source=http://www.odci.gov/cia/publications/factbook/#Link#
#Attribute=Telephone#String
#Begin=Telephones:
#Pattern=</b>\s*([\0-\377]*?)\s*<p>
#End=Telephone system:
#Attribute=Background#String
#Begin=Background:
#Pattern=</b>\s*([\0-\377]*?)\s*<
#End=Location:
#Attribute=Location#String
#Begin=Location:
#Pattern=</b>\s*([\0-\377]*?)\s*<p>
#End=Geographic\s*coordinates:
. . .

Regular Expressions Used in Spec Files

• *   Match 0 or more times (greedy).
• *?  Match 0 or more times (non-greedy).
• +   Match 1 or more times (greedy).
• ?   Match 0 or 1 time (greedy).

Greedy quantifiers such as * match as much as possible, whereas non-greedy quantifiers stop at the minimum match.

Example: <b> hello </b> <i>lovely </i> <b> world </b>
<b>(.*) </b> would match ‘hello </b> <i>lovely </i> <b> world’
whereas <b>(.*?) </b> would match ‘hello’ and ‘world’

• . matches everything except \n
• [\0-\377] matches everything
• ^ matches the beginning of a string or line
• [^ a character] matches everything except the specified character. For instance [^<] matches anything but <
• $ matches the end of a string or line
• \s matches a whitespace character
• \S matches a non-whitespace character
• \d matches a digit
• Expressions within parentheses are saved.

Sample Application
[Diagram labels: Research Analyst or Trader; Spreadsheet; Legacy Application; WWW; Text Application; Manual Data Movement]

Providing Integrated Data and Analysis
• Equity prices - TIBCO (Real-time feed)
• SEC Filings - EDGAR (Web based - Internet)
• News - Reuters, Newswire and Business wire (Web based - Internet)
• Merrill Research Reports (Text based - Merrill Intranet)
• Market updates - Merrill Lynch home page (Web based - Internet)

Spreadsheet Interface

XML – The Silver Bullet ?
• XML is (according to press reports …)
  – “HTML on steroids” ?
  – “a Rosetta Stone” ?
  – “a universal way to translate data” ?
  – “a miraculous way to” … information integration ?
  – “a silver bullet” ?

XML What is it?
• XML - EXtensible Markup Language
• Meta language for defining a markup language
• Based on SGML - Standard Generalized Markup Language
• Data model for syntax for structuring data
• Can define tags at will
• Can nest document structures to arbitrary levels of complexity
• Can use Document Type Definition (DTD)
• Many other members of “family”:
  – XSL, XSLT, XLL, XML-Query, etc.

XML Does help create structured Web pages

Feature        HTML               XML
Extensibility  Fixed set of tags  Extensible set of tags
Tag purpose    Presentation       Content
Views          Single             Multiple (XSL)
Orientation    Documents          Documents + semi-structured data
Search         Keyword            Keyword plus field-sensitive queries

Example: HTML Compared with XML

HTML *
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
. . .
<BODY topmargin=18 leftmargin=6 bgcolor="#ffffff" link="#0000ee" VLINK="#551A8B" ALINK="#ff0000">
<pre><font size=2>
              Regular   Our
              Price     Price
Palm Pilot V  329.00    236.00   In stock
</font></pre>
<table cellpadding=0 cellspacing=0 border=0>
<tr><td align=left valign=middle width=455 nowrap height=20>
<tr><td align=left valign=top nowrap width=455>
<font size=1 face="helvetica,arial">
. . .
</BODY>
</HTML>

* The HTML is normally much messier, with lots more formatting details and <tr> and <td> tags for defining tables and tab positions.

XML
<XML>
<ProductInfo>
<Product> Palm Pilot V </Product>
<RegularPrice> 329.00 </RegularPrice>
<OurPrice> 236.00 </OurPrice>
<InStock> yes </InStock>
</ProductInfo>
</XML>

XML Why do we need it?
• W3C wanted to get out of the tag creation business
• Separate data from presentation
  – Use of a style sheet instead of “hardcoded” HTML formatting
  – Flexibility / scalability / extensibility
[Diagram labels: XML Page; Netscape browser (default); Custom browser /
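
The regular-expression slide and the CIA Fact Book spec file can be tried out with a short Python sketch. The first part reruns the greedy versus non-greedy example from the slide. The second part is only an assumption about how a spec entry might be applied: #Begin and #End are taken to delimit a region of the page, and #Pattern is applied inside that region. The helper name `extract` and the sample `page` string are hypothetical, and the real Cameleon engine may work differently.

```python
import re

# Part 1: greedy vs. non-greedy, using the example string from the slide.
text = "<b> hello </b> <i>lovely </i> <b> world </b>"

# Greedy: (.*) extends to the last " </b>" in the string.
greedy = re.findall(r"<b>(.*) </b>", text)

# Non-greedy: (.*?) stops at the first " </b>" after each <b>.
lazy = re.findall(r"<b>(.*?) </b>", text)

print(greedy)
print(lazy)


# Part 2: spec-file style extraction (assumed semantics, not Cameleon's code).
def extract(page, begin, pattern, end):
    """Return group 1 of `pattern` found between `begin` and `end`, or None."""
    start = re.search(begin, page)
    if start is None:
        return None
    rest = page[start.end():]
    stop = re.search(end, rest)
    # If the end marker is missing, search the remainder of the page.
    region = rest if stop is None else rest[:stop.start()]
    found = re.search(pattern, region)
    if found is None:
        return None
    # Collapse line breaks and runs of spaces in the captured text.
    return " ".join(found.group(1).split())


# Hypothetical page fragment, built from the values shown in the results slide.
page = (
    "<b>Location:</b> Southeastern Asia, islands between\n"
    "Malaysia and Indonesia <p>\n"
    "<b>Geographic coordinates:</b> 1 22 N, 103 48 E <p>\n"
)

# Begin, Pattern and End copied from the Location attribute of the spec file.
location = extract(
    page,
    r"Location:",
    r"</b>\s*([\0-\377]*?)\s*<p>",
    r"Geographic\s*coordinates:",
)
print(location)
```

Note that the captured groups in the first part keep the space that follows <b>, since the patterns from the slide do not consume it. In the second part, [\0-\377] is used rather than . because the Location value spans a line break.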

