SL Language Manual
CS W4115: Programming Languages and Translators
Professor Stephen A. Edwards
Computer Science Department
Submitted by Majid Khan {UID = mk2759, Email = [email protected]}

1. Introduction

Search is still an emerging market, and it lacks tools that make search applications (e.g. search engines, price matching engines, meta crawlers) faster and easier to build. This search language would let users create text search applications with less effort. The scope of this project is limited to text search in HTML and plain text files.

A typical search application should provide:

1) Ability to define and maintain repositories.
2) Ability to understand document formats and manipulate data inside repositories.
3) Ability to gather statistical data based on repository data and store it for later use.
4) Ability to perform operations on the statistical data and produce results.

1.1 Glossary

Repository: A repository is a hierarchy of directories containing documents.

Document: HTML and plain text files are the file types used as documents. Only files with the extensions html, htm and txt, and files without an extension (which are treated as plain text), are processed by this language.

Data manipulation: A multi-step process, as follows:

1) Parse the data and extract words.
2) Perform stemming on these words (http://en.wikipedia.org/wiki/Stemming).
3) Compute the frequencies of the stemmed words and create an inverted vector table from this data.
4) Calculate the term frequency–inverse document frequency (tf-idf) for every stemmed word.
5) Store this data for future use.

Search: A search needs two items: a query string and a repository on which data manipulation has already been performed. A query is a series of alphanumeric words separated by white space. The search returns a list of documents and, for each document, a number that represents the closeness of the query to that document.

2. Lexical conventions

There are six kinds of tokens: identifiers, keywords, constants, strings, expression operators, and separators. White space is ignored (it is not converted to any token), but it may serve as a token separator. A token is the longest consecutive string not separated by white space or another declared separator, e.g. comma, semicolon, braces, new line, space, tab.

2.1 Comments

C-style multiline comments are used, i.e. anything between /* and */ is a comment. Comments do not nest: /* /* */ is one comment, while /* /* */ */ is an error because the outermost */ is extra.

2.2 Identifiers

An identifier is a sequence of letters and digits; the first character must be alphabetic. Identifiers are case insensitive. An identifier may be of any length (although an implementation could impose a limit), and every character of an identifier is significant.

2.3 Keywords

int null float if else for each string name value document documentlist operation

2.4 Types and Literals

There are several kinds of constants, as follows:

2.4.1 Integer

An integer is a sequence of digits, optionally preceded by a '+' or '-' character. An integer must be in the range -2147483648 to 2147483647.

2.4.2 Floats

A float consists of an integer part, a decimal point, and a fraction part. The integer and fraction parts both consist of a sequence of digits. Either the integer part or the fraction part (not both) may be missing.

2.4.3 String

A string is a sequence of characters surrounded by double quotes `"`. A string has the type arrayofcharacters (see below) and refers to an area of storage initialized with the given characters. Inside a string, a double quote character `"` must be preceded by another double quote `"`. There is no character constant; a string of length 1 serves that purpose.

2.4.4 Name Value

A name value is an assignment of a string constant to an identifier. A name value cannot exist on its own: it cannot be declared as a variable in the declaration section of a program. It can only be defined when an operation is invoked. The name part is predefined for built-in functions and can be defined by the user for custom functions. Its value follows a specific pattern (if time permits, I will try to introduce regular expressions to match this value). The only difference between a name value and an ordinary variable passed to a function is that a regular expression can be checked against the name value inside the function to accept or reject the value (this feature may not be implemented). Each operation or data declaration defines its own required name value types. Once a name value is in scope, i.e. inside a function that has name value parameters, it behaves like an identifier and can be used as an lvalue if needed.

2.4.5 Repository

Repository is the base data type that holds information about a data source. It can be defined in two ways: created from scratch, or produced by an operation on existing repositories. The repository data type needs three name value fields: "location", "include" and "type".

Name value "location" contains a value that points to a directory structure (the compiler does not validate that the string names a valid directory).

Name value "include" contains a string of comma-separated wildcard file patterns. The default value, if none is provided, is "*.html,*.txt,*", which means that all files with the extension .html or .txt, and all files without an extension, in all subdirectories are considered documents.

Name value "type" contains the method that is used to produce statistical data for search.

Example initialization:

    repository arep is location="c:\adirectory\anotherDirectory" include="*.html,*.txt,*.htm,*" type="cosine";

2.4.6 Query

Query is the base data type that holds the search string, the repository reference, and the place where the search result is stored. It has two name values, "query" and "store". Name value "query" is the string that is searched for in the repository, and "store" is the name of the file where the query results are stored (if the file already exists, it is overwritten).

Example initialization:

    performquery q on arep with query="search this data please" store="search.query" reference=docList;

2.4.7 Document and DocumentList

The Document type is outside the scope of the current project (as mentioned earlier, I will add it if time permits). A query result can be returned as a collection of Document values, which is a built-in type called DocumentList. A for each loop
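
The data manipulation and search steps listed in the glossary (section 1.1) can be illustrated with a short Python sketch. It is not the SL implementation: the function names, the crude suffix-stripping `stem` (a stand-in for a real stemmer such as Porter's), the particular tf-idf weighting, and the use of cosine similarity as the "closeness" number (suggested by type="cosine" in the repository example) are all assumptions made for illustration. The sample repository is hypothetical and is held in memory rather than read from a directory.

```python
import math
import re
from collections import Counter, defaultdict


def stem(word):
    """Crude suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Parse text into lowercase alphanumeric words, then stem each word."""
    return [stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(docs):
    """Return (inverted, idf); inverted[term][doc_id] is the tf-idf weight."""
    counts = {doc_id: Counter(terms(text)) for doc_id, text in docs.items()}
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(t for c in counts.values() for t in c)
    idf = {t: math.log(len(docs) / df) for t, df in doc_freq.items()}
    inverted = defaultdict(dict)
    for doc_id, c in counts.items():
        total = sum(c.values())
        for t, n in c.items():
            inverted[t][doc_id] = (n / total) * idf[t]
    return inverted, idf


def search(query, inverted, idf):
    """Return (doc_id, score) pairs sorted by cosine similarity, best first."""
    q_counts = Counter(t for t in terms(query) if t in idf)
    q_total = sum(q_counts.values())
    q_vec = {t: (n / q_total) * idf[t] for t, n in q_counts.items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))

    # Squared length of every document vector, taken over all of its terms.
    norms = defaultdict(float)
    for postings in inverted.values():
        for doc_id, w in postings.items():
            norms[doc_id] += w * w

    # Dot product of the query vector with each document that shares a term.
    dots = defaultdict(float)
    for t, qw in q_vec.items():
        for doc_id, w in inverted[t].items():
            dots[doc_id] += qw * w

    scores = {
        d: dot / (q_norm * math.sqrt(norms[d]))
        for d, dot in dots.items()
        if dot > 0
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical in-memory repository: file name -> document text.
sample_docs = {
    "a.txt": "searching text files",
    "b.txt": "price matching engines",
    "c.html": "text search engines search text",
}
inverted, idf = build_index(sample_docs)
results = search("search text", inverted, idf)
```

Here `results` plays the role of the document list returned by a search: each entry pairs a document with its closeness score. Storing the index for later use (step 5) is omitted; serializing `inverted` and `idf` to a file would be one way to do it.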
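
The lexical rules for comments (2.1), identifiers (2.2), integers (2.4.1) and floats (2.4.2) can also be written down as regular expressions. The Python sketch below is one possible reading of those rules, not the SL scanner itself; all names in it are hypothetical. It does not handle comment delimiters that appear inside string literals, and it treats only a stray `*/` as an error, since the manual does not say what happens to an unterminated `/*`.

```python
import re

INT_RE = re.compile(r"[+-]?[0-9]+")               # optional sign, then digits
FLOAT_RE = re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+")  # one side of '.' may be empty
IDENT_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*")     # must start with a letter
COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)   # shortest /* ... */ span

INT_MIN, INT_MAX = -2147483648, 2147483647


def is_integer(text):
    """True if text is an integer literal within the 32-bit range of 2.4.1."""
    return INT_RE.fullmatch(text) is not None and INT_MIN <= int(text) <= INT_MAX


def is_float(text):
    """True if text is a float literal as described in 2.4.2."""
    return FLOAT_RE.fullmatch(text) is not None


def normalize_identifier(text):
    """Identifiers are case insensitive, so compare them in lowercase."""
    if IDENT_RE.fullmatch(text) is None:
        raise ValueError("not an identifier: " + text)
    return text.lower()


def strip_comments(source):
    """Replace each comment with a space; comments do not nest."""
    stripped = COMMENT_RE.sub(" ", source)
    if "*/" in stripped:
        raise SyntaxError("extra */ outside of any comment")
    return stripped
```

Because the comment pattern is non-greedy, `/* /* */` is consumed as a single comment, and the second `*/` in `/* /* */ */` is left over and reported as an error, which matches the behaviour described in section 2.1.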