Columbia COMS W4115 - Search language (SL)


Search language (SL)
White paper

CS W4115: Programming Languages and Translators
Professor Stephen A. Edwards
Computer Science Department, Columbia University
Fall 2007

Submitted by Majid Khan {UID = mk2759, Email = [email protected]}

Objective

Search engines are still an emerging market, and there are few tools that make building search applications (for example, search engines) faster and easier. This search language would let users create search engines or search-engine-type applications with less effort. The scope of this project is limited to text search in HTML and plain text files.

Brief overview

Any search language needs to deal with at least three items:

1) Fetch data from repositories and understand document formats.
2) Manipulate data and store the statistically manipulated data for later use.
3) Provide a search capability that returns ranked search results.

Repository – A repository is a hierarchy of directories containing documents.

Document – HTML and plain text files are the document types. Only files with the extensions html, htm, and txt, and files without an extension (treated as plain text), are processed by this language.

Data manipulation – This is a multi-step process:

1) Parse the data and extract the words.
2) Perform stemming on these words (http://en.wikipedia.org/wiki/Stemming).
3) Get the frequencies of the stemmed words and create an inverted vector table from this data.
4) Calculate the term frequency–inverse document frequency (tf-idf) for every stemmed word.
5) Store this data for future use.

Search – A search needs two items: a query string and a repository on which data manipulation has already been performed. A query is a series of alphanumeric words separated by white space.
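The data manipulation and search steps described above can be sketched in Python. This is only an illustration under stated assumptions, not the SL implementation: the function names (stem, tokenize, build_index, tfidf_vectors, cosine, search) are hypothetical, the stemmer is a crude suffix stripper standing in for a real one such as Paice, the idf formula log(N / df) is one common variant, and documents are passed in as a dictionary instead of being read from a repository directory.

```python
import math
import re
from collections import Counter, defaultdict


def stem(word):
    # Placeholder suffix stripper (an assumption); a real stemmer such as
    # Paice is far more elaborate and is not reproduced here.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    # Lowercase, split into alphanumeric words, then stem each word.
    return [stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(docs):
    # docs maps a document path to its text.
    # Returns the inverted table (term -> {path: count}) and the idf per term.
    inverted = defaultdict(dict)
    for path, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            inverted[term][path] = count
    n_docs = len(docs)
    idf = {term: math.log(n_docs / len(postings))
           for term, postings in inverted.items()}
    return inverted, idf


def tfidf_vectors(inverted, idf):
    # One sparse tf-idf vector (term -> weight) per document.
    vectors = defaultdict(dict)
    for term, postings in inverted.items():
        for path, count in postings.items():
            vectors[path][term] = count * idf[term]
    return vectors


def cosine(a, b):
    # Cosine similarity of two sparse vectors; 0.0 if either has zero length.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, inverted, idf):
    # Rank documents by cosine similarity between query and document vectors.
    vectors = tfidf_vectors(inverted, idf)
    q_counts = Counter(tokenize(query))
    q_vec = {t: c * idf[t] for t, c in q_counts.items() if t in idf}
    scored = [(path, cosine(q_vec, vec)) for path, vec in vectors.items()]
    return sorted((s for s in scored if s[1] > 0),
                  key=lambda s: s[1], reverse=True)


docs = {
    "a.txt": "searching data files",
    "b.txt": "cooking recipes",
    "c.txt": "data about cooking",
}
inverted, idf = build_index(docs)
results = search("search data", inverted, idf)
for path, weight in results:
    print(path, round(weight, 3))
```

In this sketch the inverted table and the idf values are kept in memory. Storing them for later use, as the data manipulation steps require, would mean serializing both structures to a file.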
A search returns a list of documents, each with a number that represents how close the query is to the document.

A sample program

    repository arep is
        location="c:\adirectory\anotherDirectory"
        include="*.html,*.txt,*.htm,*"
        type="cosine";

    process arep with
        location="c:\adirectory\anotherDirectory"
        stemmer="paice"
        overwrite="true"
        filename="arep.rep";

    performquery q on arep with
        query="search this data please"
        store="search.query"

Output

The output returned by performquery is stored in a file. This file contains the list of files returned by the search, each with its result weight. The higher the weight, the closer the match.

    pathFromRepositoryRoot\filename1 weight1
    pathFromRepositoryRoot\filename2 weight2
    pathFromRepositoryRoot\filename3 weight3

Summary

This language is a primitive step towards a language that would make search engine creation, design, and development easier. The language described above is the initial scope of this project; if time permits (which I really doubt), I would try to add a search for familiar pages and also try to introduce two types, i.e. document and document

