Columbia COMS W4115 - SL Language Manual


SL Language Manual
CS W4115: Programming Languages and Translators
Professor Stephen A. Edwards
Computer Science Department
Submitted by Majid Khan {UID = mk2759, Email = [email protected] }

1. Introduction

Search engines are still an emerging market, and there are few tools that make it faster and easier to build search applications such as search engines, price-matching engines, and meta crawlers. This search language gives the user the capability to create text search applications with less effort. The scope of this project is limited to text search in HTML and plain text files.

A typical search application should provide:
1) The ability to define and maintain repositories.
2) The ability to understand document formats and manipulate data inside repositories.
3) The ability to gather statistical data from repository data and store it for later use.
4) The ability to perform operations on the statistical data and produce results.

1.1 Glossary

Repository – A hierarchy of directories containing documents.

Document – An HTML or plain text file. Only files with the extensions html, htm, and txt, and files without an extension (which are treated as plain text), are processed by this language.

Data manipulation – A multi-step process, as follows:
1) Parse the data and extract words.
2) Perform stemming on these words (http://en.wikipedia.org/wiki/Stemming).
3) Get the frequencies of the stemmed words and create an inverted vector table from this data.
4) Calculate the term frequency–inverse document frequency (tf-idf) for every stemmed word.
5) Store this data for future use.

Search – A search needs two items: a query string and a repository on which data manipulation has already been performed. A query is a series of alphanumeric words separated by white space. A search returns a list of documents and also a number that represents the closeness of the query to the document.

2. Lexical conventions

There are six kinds of tokens: identifiers, keywords, constants, strings, expression operators, and separators. White space is ignored (it is not converted to any token), but it may serve as a token separator. A token is the longest consecutive string of characters not broken by white space or another declared separator, e.g. comma, semicolon, braces, new line, space, tab.

2.1 Comments

C-style multiline comments are used, i.e. anything between /* and */ is a comment. Comments do not nest: /* /* */ is one comment, while /* /* */ */ is an error because the outermost */ is extra.

2.2 Identifiers

An identifier is a sequence of letters and digits; the first character must be alphabetic. Identifiers are case insensitive. An identifier may be of any length, although an implementation could limit this, and every character of an identifier is significant.

2.3 Keywords

int null float if else for each string name value document documentlist operation

2.4 Types and Literals

There are several kinds of constants, as follows:

2.4.1 Integer

An integer is a sequence of digits, optionally preceded by a '+' or '-' character. An integer must be in the range -2147483648 to 2147483647.

2.4.2 Floats

A float consists of an integer part, a decimal point, and a fraction part. The integer and fraction parts both consist of a sequence of digits. Either the integer part or the fraction part (not both) may be missing.

2.4.3 String

A string is a sequence of characters surrounded by double quotes `"`. A string has the type arrayofcharacters (see below) and refers to an area of storage initialized with the given characters. In a string, the character `"` (double quote) must be preceded by another `"` (double quote). There is no character constant; a string of size 1 serves the same purpose.

2.4.4 Name Value

A name value is an assignment of a string constant to an identifier. A name value cannot exist on its own (it cannot be declared as a variable).
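The literal rules in sections 2.2 and 2.4.1 through 2.4.3 can be sketched as regular expressions. The following Python sketch is an illustration only, not part of SL: the names `classify` and `same_identifier` are hypothetical, keywords are not distinguished from identifiers, floats are assumed to be unsigned because the manual mentions no sign for them, and an embedded double quote is assumed to be written as two consecutive double quotes.

```python
import re

# Hypothetical patterns written from the prose rules; not taken from an SL implementation.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")   # letters and digits, first is alphabetic
INTEGER = re.compile(r"[+-]?[0-9]+")               # optional sign, then digits
# Either the integer part or the fraction part may be missing, but not both.
FLOAT = re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+")
# Assumption: a double quote inside a string is written as "".
STRING = re.compile(r'"(?:[^"]|"")*"')

INT_MIN, INT_MAX = -2147483648, 2147483647


def classify(text):
    """Return the kind of literal or identifier `text` is, or None if no rule fits."""
    if FLOAT.fullmatch(text):
        return "float"
    if INTEGER.fullmatch(text):
        # Section 2.4.1 limits integers to the 32-bit signed range.
        return "integer" if INT_MIN <= int(text) <= INT_MAX else None
    if STRING.fullmatch(text):
        return "string"
    if IDENTIFIER.fullmatch(text):
        return "identifier"
    return None


def same_identifier(a, b):
    """Identifiers are case insensitive, so compare them after lowercasing."""
    return a.lower() == b.lower()
```

A real scanner would also have to apply the longest-match rule from section 2 and skip comments; this sketch only classifies a token that has already been split out.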
A name value can only be defined when an operation is invoked. The name part is predefined for built-in functions and can be defined by the user for custom functions. Its value has specific patterns (if time permits, I will try to introduce regular expressions to match this value). The only difference between a name value and an ordinary variable passed to a function is that, inside the function, a regular expression can be checked against the name value to accept or reject it (this feature may not be implemented). Each operation or data declaration defines its own required name value types. A name value cannot be declared on its own in the declaration section of a program. Once a name value is in scope, i.e. inside a function that has name value parameters, it behaves like an identifier and can become an lvalue if needed.

2.4.5 Repository

Repository is the base data type that holds information about a data source. It can be defined in two ways: from scratch, or from an operation on existing repositories. The repository data type needs three name value fields: "location", "include" and "type".

Name value "location" contains a value that points to a directory structure (the compiler does not validate that the string names a valid directory).

Name value "include" contains a string of comma-separated wildcard file patterns. If it is not provided, the default value is "*.html,*.txt,*", which means that all files in all subdirectories that have the extension .html or .txt, or no extension, are considered documents.

Name value "type" contains the method used to produce the statistical data for search.

Example initialization:

repository arep is location="c:\adirectory\anotherDirectory" include="*.html,*.txt,*.htm,*" type="cosine";

2.4.6 Query

Query is the base data type that holds the search string, the repository reference, and the location where the search result is stored. It has two name values, "query" and "store". Name value "query" is the string being searched for in the repository, and "store" is the name of the file where the query results are stored (if the file already exists, it is overwritten).

Example initialization:

performquery q on arep with query="search this data please" store="search.query" reference=docList;

2.4.7 Document and DocumentList

The Document type is not in the scope of the current project (as I mentioned earlier, I will add it if time permits). A query result can be returned as a collection of Document, which is a built-in type called DocumentList. A for each loop
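The data manipulation steps in the glossary, together with the "cosine" repository type and the closeness number returned by a search, suggest tf-idf vectors ranked by cosine similarity. The Python sketch below illustrates that idea under stated assumptions and is not the SL implementation: the function names are hypothetical, stemming is replaced by simple lowercasing, idf is taken as log(N / df), and documents are passed in as an in-memory dictionary rather than read from a repository directory.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Stand-in for parsing plus stemming: lowercase alphanumeric words only.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs maps a document name to its text. Returns (idf table, tf-idf vectors)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    df = Counter()                      # number of documents containing each term
    for term_counts in counts.values():
        df.update(term_counts.keys())
    n = len(docs)
    idf = {term: math.log(n / d) for term, d in df.items()}   # assumed idf formula
    vectors = {
        name: {t: f * idf[t] for t, f in term_counts.items()}
        for name, term_counts in counts.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, idf, vectors):
    """Return (document name, closeness) pairs, best match first, zero scores dropped."""
    q = {t: f * idf.get(t, 0.0) for t, f in Counter(tokenize(query)).items()}
    scored = [(name, cosine(q, vec)) for name, vec in vectors.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)
```

For example, after `idf, vectors = build_index(docs)`, a call such as `search("paris flights", idf, vectors)` returns the matching document names with their closeness values, which corresponds to the list of documents and the number described under Search in the glossary.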