UIUC Event Search: A Crawling Calendar

Samyong Jeong, JaeHun Jung, Chaeeun Lee, Edward Lee, Andrzej Palczewski
University of Illinois at Urbana-Champaign
201 N. Goodwin Ave., Urbana, IL 61801-2302
+1 217 333 3426
{jeong6, jung28, clee82, eslee3, palczews}@uiuc.edu

1. INTRODUCTION
Imagine being able to effortlessly search through all your favorite calendars on the Internet and find cool events for the upcoming weekend. Imagine a query interface so powerful that, with a few clicks, you could find out both when that fascinating seminar with delicious pizza is going to occur and what plays are being shown tonight at the local theatre for your date with that cute coworker. This was the motivation behind UIUC Event Search: to take all the data from the University of Illinois at Urbana-Champaign's calendars and combine it into one accessible calendar.

2. MOTIVATION
2.1 Navigation
Because the UIUC calendars are split up, they are difficult to navigate. Most people lose interest after searching several separate calendars and eventually give up, missing out on events they otherwise would have attended. Since campus events exist to build a strong community, to help people meet new friends, and to have fun, higher attendance is always a positive thing. By combining the events, more people will be able to attend the events they are interested in, a positive outcome for everyone.

2.2 Interface
In addition to being difficult to navigate as a distributed system, the user interface of the UIUC calendars is rather poor. It is difficult to see at a glance which events occur on which date, because there is no calendar view listing the events under each date. Instead, the user must read through all the events for the entire month, or create a separate query for each day of interest. UIUC Event Search addresses this problem with an intuitive, easy-to-use interface that makes calendar viewing a much more pleasant experience.

2.3 Subscription
Finally, we wanted to allow users
to subscribe to their favorite events and receive email updates. This gives them the opportunity to be notified of new events that might interest them without even having to visit the web site. The automated system lets users subscribe by event type; for instance, a user might choose to be notified of all seminars, dances, and sports games. This especially helps people who are too busy to search for things to do, but who would love to come if something interesting drops their way.

3. METHODS
3.1 Database
3.1.1 Design
The database was first created in SQL Server 2000 using SQL scripts. It consists of five tables: the site table, event table, user table, user-event table, and log table. All tables are designed with an auto-numbered ID as the first field and primary key. This guarantees no duplicates and allows every record to be referenced individually. All field names are prefixed with the table name; for example, the event URL is E_URL. Finally, all rules of data normalization are applied to make sure the data is stored as compactly and flexibly as possible.

3.1.2 Site
The site table contains all the information pertaining to each site to be crawled. It holds the URL we will crawl, as well as an additional field into which the program can serialize any data it wants. This makes the site table extensible for any programming necessary to crawl the data. Unfortunately, this field never came into play, since the parsing engine parses all calendars with the same global configuration. Still, if in the future we needed to add new calendars with different configuration files, this would be the place to do it.

3.1.3 Event
The event table has a record for every event in the database. It contains separate fields for start date, end date, location, description, email, etc. The start and end dates are required, and events that occur at only one point in time should have the start date equal to the end date. This is necessary for the web front end to function well, as users can query by start
date and end date on the web site. Events are also displayed according to their occurrence date.

3.1.4 User
The user table contains the user information, and the user-event table contains one entry for every event category the user is subscribed to. We designed the database this way so that a user can subscribe to an arbitrarily large number of event categories.

3.1.5 Log
Finally, the log table contains a record for each error that occurs in one of our programs. When the parser is unable to parse a page, it writes an entry to the log table; from this we can deduce which site is causing the problem and why. Since all of our programs run unsupervised, if we did not record errors we would be unable to debug them.

3.2 Crawler
3.2.1 Design
The crawler searches for and retrieves site information related to the categories the user wants to know about. Because searching all the web sites can take a long time, the crawler usually runs in the background on the server. We implemented this search-and-retrieve functionality as a Java program. As a prerequisite, the server must sit in the DMZ, which is not screened off for security reasons, and the sites the crawler visits must of course be reachable from that zone. Crawlers are usually written in C or Java, so we chose Java together with JSP, using Apache Tomcat as the web application server. The crawler first searches the web sites related to a specific category, and then follows hyperlinked pages recursively until a page contains no further hyperlinks. Our project focuses on event information from the university web sites in order to build the event database, so the crawler only needs to search the university web sites, especially the calendar pages. The resulting list of pages the crawler finds can be very large; a university calendar sometimes contains over 1,000 events. For RoadRunner, which takes the crawler's results and extracts event information from each page, the web site information is
downloaded to the server as files, because RoadRunner accepts files as its input. RoadRunner reads a specific directory (e.g., c:\event) in which the crawler creates one file per page found during the search. However, the crawler does not write page information and create files for every site it visits; it only creates files for the lowest-level hyperlinked pages, those containing no further calendar or event hyperlinks. Also, before creating the files
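The leaf-page dump described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name `EventCrawler`, the regex-based link extraction, the output directory, and the file-naming scheme are all assumptions, and the real crawler would fetch each hyperlink over HTTP before recursing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the crawl-and-dump step. All names here
// (EventCrawler, the output directory, the file naming) are assumptions.
public class EventCrawler {
    private static final Pattern LINK =
        Pattern.compile("<a\\s+[^>]*href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Collect the href target of every anchor tag on a page.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // A page with no outgoing hyperlinks is a "leaf": dump it to disk so
    // the downstream extractor (RoadRunner) can read it as a file.
    public static void crawl(String url, String html, Path outDir) throws IOException {
        List<String> links = extractLinks(html);
        if (links.isEmpty()) {
            Files.createDirectories(outDir);
            String name = url.replaceAll("[^A-Za-z0-9]", "_") + ".html";
            Files.writeString(outDir.resolve(name), html);
        }
        // In the real crawler, each extracted link would be fetched over
        // HTTP and crawl() would be called recursively on the result.
    }

    public static void main(String[] args) throws IOException {
        String leaf = "<html><body>CS seminar, 4 pm, Siebel Center</body></html>";
        crawl("http://illinois.edu/calendar/event1", leaf, Path.of("event_pages"));
        System.out.println(Files.exists(Path.of("event_pages")));
    }
}
```

Only leaf pages are written, which keeps the dump directory small even when the crawl itself visits many intermediate index pages.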