CSC309: Web Programming, Greg Wilson

1. Web Programming: Client-Side HTTP
Greg Wilson
[email protected]
2005

2. Small Pieces, Loosely Joined
Unix command line was the world's first component object model
Allowed programmers to build small pieces, then connect them in arbitrary ways
Key features:
  Low barriers to entry
  Common data format
  Communication protocol

3. …Loosely Joined
The web succeeded (in part) because it followed the same model
  Data format: HTML (now XML)
  Communication protocol: HTTP
This lecture looks at how to use HTTP to get data over the web
Next one looks at how to provide information
Next week, we'll look at what happens in between

4. HTTP Review
Most common protocol on the web is HTTP
Runs on top of TCP/IP, which provides a reliable stream connection between two endpoints
HTTP cycle:
  Client makes connection
  Sends request (request line, headers, body)
  Server sends response (similar format)
  Connection is closed
Cycle may repeat many times to display one logical page

5. Fetching Pages
Opening sockets, constructing HTTP requests, parsing responses, etc. is tedious
So most languages provide a library for doing it
Python: urllib.urlopen(URL) does what your browser would do:
  Parse the URL to figure out who to talk to, and what to ask for
  Construct request
  Give calling code something that looks like a file handle so that it can read the response

6. …Fetching Pages
    import urllib
    input = urllib.urlopen("http://www.third-bit.com/greeting.html")
    lines = input.readlines()
    input.close()
    for line in lines[:5]:
        print line,

7. …Fetching Pages
Note: readlines() wouldn't do the right thing if the URL referred to an image
  Use read() to grab bytes in that case
Up to the client to do the right thing!

8. Baby Spiders
Example: make a list of the links in a web page
  The first step in building a web spider that can explore the internet on its own
  That, and a search engine, and you're Google
Fetch the page, then parse it to extract the links
  Should use DOM, but many web pages are badly formatted
  Use regular expressions instead

9. …Baby Spiders
    import urllib, re
    from sets import Set
    input = urllib.urlopen("http://www.third-bit.com/index.html")
    page = input.read()
    input.close()
    links = re.findall(r'href=\"[^\"]+\"', page)
    temp = Set()
    for x in links:
        temp.add(x[6:-1])
    links = list(temp)
    links.sort()
    for x in links:
        print x

10. Passing Parameters
Sometimes want to provide extra information as part of a URL
  E.g., to specify search terms to Google
Add parameters to the URL
  http://www.google.com?q=Python searches for pages related to Python
  "?" separates parameters from the rest of the URL
  Each parameter is name=value
  Multiple parameters separated by "&"
  Space replaced by "+"

11. URL Encoding
But what if you want to include "?" or "&" as part of a URL?
Encode special characters
  Yes, it's another escaping mechanism…
Use %XX, where XX is the hex character code
  %3D  =
  %3B  ;
  %2C  ,
  %2B  +
  %3A  :
  %2F  /
  %40  @
  %3F  ?
  %26  &
  %25  %

12. …Passing Parameters
Example: to search Google for "grade = A+":
  http://www.google.ca?q=grade+%3D+A%2B
Helper functions:
  urllib.quote(str) replaces special characters
  urllib.unquote(str) converts back
  urllib.urlencode(params) takes a list of pairs, or a dictionary, and constructs the entire query parameter string

13. Web Services
Suppose you want to write a script that actually does search Google
  Construct a URL: easy
  Send it and read response: no problem
  Parse the response: hm… there's a lot of junk on the page…
Many first-generation web applications relied on screen scraping
  Problem: whenever the web site changes its layout, the application has to be rewritten

14. …Web Services
A proto-solution is to give clients information twice
  Once in the page body for humans to read
  Once in the "meta" headers for machines to read
Next step in evolution:
  Client says, "I want machine-readable XML, not human-readable HTML"
  Much easier to parse
  Much less likely to change over time
A form of remote procedure call

15. Let the Shouting Begin
Two camps:
  Use existing HTTP for request/response, or
  Use a new protocol specifically for web services
Most popular new protocol today is SOAP
  Simple Object Access Protocol
  Despite its name, it's anything but simple
  Credentials, foreign objects, blah blah blah
  There are libraries to hide the details…
  …but debugging can be a nightmare

16. Let the Shouting Begin
Two camps:
  Use existing HTTP for request/response
    Representational State Transfer (REST)
  Use a new protocol specifically for web services
Most popular new protocol today is SOAP
  Simple Object Access Protocol
  Despite its name, it's anything but simple
  Local proxies for remote objects
    Like database abstraction layers
  Debugging can be a nightmare

17. Amazon
Amazon was one of the first big players to define a web API
You need a license key in order to use it
  Free keys restrict you to one request per second
Use functions in amazon.py module to search by various criteria
  Result is a list of objects that match the criteria
Can now maintain a wishlist programmatically

18. …Amazon
    import sys, amazon

    # Format multiple authors' names nicely.
    def prettyName(arg):
        if type(arg) in (list, tuple):
            arg = ', '.join(arg[:-1]) + ' and ' + arg[-1]
        return arg

    if __name__ == '__main__':

        # Get information.
        key, asin = sys.argv[1], sys.argv[2]
        amazon.setLicense(key)
        items = amazon.searchByASIN(asin)

19. …Amazon
        # Handle errors.
        if not items:
            print 'Nothing found for', asin
            sys.exit(1)  # nothing to display, so stop before indexing items
        if len(items) > 1:
            print len(items), 'items found for', asin

        # Display information.
        item = items[0]
        productName = item.ProductName
        ourPrice = item.OurPrice
        authors = prettyName(item.Authors.Author)
        print '%s: %s (%s)' % (authors, productName, ourPrice)

20. Everybody Can Play
You can write similar code to talk to:
  Google
  FedEx
  eBay/PayPal
  And on, and on…
Question 1 of Exercise 2 will ask you to do this
  Odds are good your next employer will as well

21. Summary
Human activities have natural timescales
  Sip of coffee, fresh pot, tomorrow, sometime…
Real revolutions occur when we move something from one category to another
  Spreadsheets
  Desktop publishing
Web services make it possible for ordinary programmers to create distributed applications without heroic effort
So,
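
As a rough illustration of the Passing Parameters and URL Encoding slides, the sketch below builds an encoded query string for the "grade = A+" search. It assumes Python 3, where the helpers the slides call urllib.quote, urllib.unquote and urllib.urlencode live in urllib.parse; quote_plus and unquote_plus are used because they also handle the space-to-"+" convention. build_search_url is a hypothetical helper name introduced only for this example.

```python
from urllib.parse import quote_plus, unquote_plus, urlencode


def build_search_url(base, params):
    """Return base + '?' + the URL-encoded query string for params.

    params may be a dictionary or a list of (name, value) pairs.
    Hypothetical helper, not part of urllib.
    """
    return base + "?" + urlencode(params)


# Encode a single value: space becomes '+', '=' becomes %3D, '+' becomes %2B.
encoded = quote_plus("grade = A+")
print(encoded)                # grade+%3D+A%2B

# Decoding restores the original text.
print(unquote_plus(encoded))  # grade = A+

# Build a whole query string from name/value pairs.
url = build_search_url("http://www.google.ca", {"q": "grade = A+"})
print(url)                    # http://www.google.ca?q=grade+%3D+A%2B
```

Letting urlencode assemble the whole query string avoids escaping each value by hand, which matters once a value itself contains "?", "&" or "=".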
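
The link-extraction step from the Baby Spiders slides can be sketched without touching the network. This is a minimal version under stated assumptions: it is written for Python 3, it uses the built-in set in place of sets.Set, and it uses a capture group in the regular expression in place of the x[6:-1] slice. The sample HTML string and the names HREF_PATTERN and extract_links are made up for illustration.

```python
import re

# Capture only the text between the quotes after href=.
HREF_PATTERN = re.compile(r'href="([^"]+)"')


def extract_links(page):
    """Return the sorted, de-duplicated href values found in page."""
    return sorted(set(HREF_PATTERN.findall(page)))


# Hypothetical page content standing in for a fetched document.
sample = (
    '<a href="b.html">B</a> '
    '<a href="a.html">A</a> '
    '<a href="b.html">B again</a>'
)

for link in extract_links(sample):
    print(link)
# a.html
# b.html
```

Like the regular expression on the slide, this only sees double-quoted href attributes; links written with single quotes or no quotes are missed, which is the price of not using a real parser.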