Stella: An Environment for Experimental Machine Learning
Antonio Kantek
CVN Student

1. Introduction
1. Machine Learning
Machine Learning covers a variety of topics. In this work, Machine Learning (or simply ML) stands for data analysis using computational statistics. Data analysis can be understood as the process of extracting a set of patterns (knowledge) from raw data (information). Extracting information from raw data can be understood as running a SQL query against a relational database (a task related to low-level operational support, e.g. select all customers whose birthday is 10/31/75). Extracting a set of patterns from a dataset means creating ML models that discover rules and associations hidden in the dataset (e.g. if customer is male and age < 15 then buy videogame xyz, with confidence = 80% and support = 75%). Implementing ML models is both an iterative and an interactive (semi-automatic) process.

There are three main types of ML models: i) classifiers (both nominal and numeric attribute predictors), ii) clustering and iii) association rule learners. They all share the same input: a dataset composed of instances. Each instance (or row) is composed of an array of numbers, strings or dates.

A classifier (also known as Classification Learning) is an ML model that predicts what will happen in new (previously unseen) data. As an example, consider a medical diagnosis, where a classifier predicts whether or not a patient has a given disease. The outcome (the class to predict) can be a nominal one (like buy or not buy) or a numeric quantity. Bayesian Networks [1] and C4.5 [2] are both well-known types of classifiers.

Clustering is the second type of ML model. It is similar to Classification Learning, but the attribute to predict is defined by the model itself (rather than by the user). By doing that, the model is able to group similar classes (the class represents the relationship between predictor attributes and the goal attribute values).
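The support and confidence figures quoted for the example rule in this section can be made concrete with a small sketch. It is written in Python rather than Stella, and it assumes the common definitions: support is the fraction of all instances that satisfy both the "if" and the "then" part of the rule, and confidence is the fraction of instances satisfying the "if" part that also satisfy the "then" part. The function name `rule_metrics`, the attribute names and the toy data are all hypothetical and are not taken from Stella's API.

```python
from typing import Callable, Dict, List

Instance = Dict[str, object]  # one dataset row: attribute name -> value


def rule_metrics(
    dataset: List[Instance],
    antecedent: Callable[[Instance], bool],
    consequent: Callable[[Instance], bool],
) -> Dict[str, float]:
    """Return support and confidence of the rule 'if antecedent then consequent'."""
    if not dataset:
        raise ValueError("dataset is empty")
    # Instances that satisfy the "if" part of the rule.
    matches_if = [row for row in dataset if antecedent(row)]
    # Of those, the instances that also satisfy the "then" part.
    matches_both = [row for row in matches_if if consequent(row)]
    support = len(matches_both) / len(dataset)
    # Assumption: a rule whose "if" part never fires gets confidence 0.
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return {"support": support, "confidence": confidence}


# Hypothetical toy data, invented for illustration only.
customers = [
    {"sex": "male", "age": 12, "buys_xyz": True},
    {"sex": "male", "age": 14, "buys_xyz": True},
    {"sex": "male", "age": 13, "buys_xyz": False},
    {"sex": "female", "age": 30, "buys_xyz": False},
]

metrics = rule_metrics(
    customers,
    antecedent=lambda c: c["sex"] == "male" and c["age"] < 15,
    consequent=lambda c: c["buys_xyz"],
)
print(metrics)
```

On this toy dataset the rule fires for three of the four customers and is correct for two of them, so support is 2/4 and confidence is 2/3. Under these definitions support can never exceed confidence, which is consistent with the 75% and 80% quoted in the example rule.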
Some algorithms for clustering (like K-Means [3]) use the concept of geometric distance between instances in order to group the closest ones.

The last type of ML model is the Association Rule learner, which is related to structural data description rather than class prediction. Rules are commonly represented as if-then rules. A Decision Tree is a data structure composed by joining several if-then rules. In a Decision Tree, each internal node contains a rule for a predictor attribute (e.g. Attribute: customer age < 15 yes/no) and each leaf node represents the instance classification (e.g. Decision: customer buys yes/no). Apriori [4] is a popular algorithm for association rule extraction.

2. Motivation
The process of discovering patterns in data is a semi-automatic (empirical) process. There is no universally best algorithm across all datasets (datasets differ in their attribute types: some have more numerical attributes while others have more nominal attributes). Some algorithms perform well on one type of dataset and disastrously on another. They are biased according to the type of data to process.

Stella is a dynamic language for ML model implementation. A dynamic language is the best tool for experimental computing: it provides a simple way to load and unload data structures (e.g. easy dynamic class loading), so models are easily implemented and tested. You should be able to run a piece of code as easily as running a SQL script.

The two main approaches for data analysis using ML are specialized query languages and frameworks written in general-purpose languages like C/C++ and Java. Commercial database products like Microsoft SQL Server provide some sort of data mining query language [5]. This is a very limited solution, since users cannot build their own models.
OR-Objects [6], Oracle Java Data Mining (OJDM) [7] and Weka (Waikato Environment for Knowledge Analysis) [8] are examples of the second approach (frameworks for ML). Weka is a superb, well-documented generic framework for ML, and it is written in Java. Java is not as static as C++, but it is still not dynamic enough. It is not possible to load and unload classes in Java without dealing with ClassLoader issues.

Stella is a Domain Specific Language for ML model implementation (and testing). It is an Object Oriented Language (but not Object Oriented Obsessed like Smalltalk). Stella's API is composed of two parts: a small generic API (e.g. generic types like Integer, Double, String, Date and Object) and an extensive ML API (e.g. classes like DataSet, Classifier, ClassifierEvaluation, Instance, and so on). Besides that, the language offers declarative constructions for some tedious tasks and, of course, good array manipulation support.

2. Language Reference
1. Constants, Enumerations and Functions
Constants contain immutable values. Functions in Stella are defined in the same way as in (non-OOP) procedural languages like Pascal or C. Constants and functions are the easiest way to declare and implement mathematical functions. Some common functions will be natively implemented in Java. I/O is also done by functions. An enumeration, as in C, defines a sequence of elements.

Examples of constants, enumerations and functions:

    enum AttributeType { NOMINAL, NUMERIC, DATE };
    constant double SMALL := 1.e-6;
    constant double NORMAL_DISTRIBUTION := sqrt(2 * PI);
    function void out(Object obj); // Console output function
    function Object fout(Object obj, String file); // File output function
    function double min(double[] doubles) {
        check notNull(doubles, "doubles is null");
        check notEmpty(doubles, "doubles is empty");
        double min; // Default value for numbers is NaN
        foreach(doubles[i]) {
            if (min == NaN || min > doubles[i]) {
                min := doubles[i];
            }
        }
        ^ min;
    }

2. Classes and Objects
The main API is composed of a few classes and some special constructions to deal with numbers and arrays. All classes inherit from class Object (you do not need to specify that). Object is a virtual class (the only one), and no one can directly create an instance of Object. The main API does not have direct support for metaclasses, introspection and reflection. All methods starting with # are class methods (static methods in Java). Instances of string, date and number are immutable objects.

Overview of the main classes (some methods are missing):

    class Object {
        Object deepCopy();
        Object shalowCopy();
        boolean equals(Object obj);
        String