Stella: An Environment for Experimental Machine Learning
Antonio Kantek
CVN Student

1 Introduction

1.1 Machine Learning

Machine Learning covers a variety of topics. In this work, Machine Learning (or simply ML) stands for data analysis using computational statistics. Data analysis can be understood as the process of extracting a set of patterns (knowledge) from raw data (information). Extracting information from raw data can be understood as running a SQL query against a relational database, a task related to low-level operational support (e.g., select all customers where birthday is 10/31/75). Extracting a set of patterns from a dataset means creating ML models that discover rules and associations hidden in datasets (e.g., "If customer is male and age 15 then buy videogame xyz", with confidence 80% and support 75%). Implementing ML models is both an iterative and an interactive (semi-automatic) process.

There are three main types of ML models: (i) classifiers (both nominal and numeric attribute predictors), (ii) clustering and (iii) association rule learners. They all share the same input: a dataset composed of instances. Each instance (or row) is composed of an array of numbers, strings or dates.

A classifier, also known as Classification Learning, is an ML model that predicts what will happen in new, previously unseen data. As an example, consider a medical diagnosis in which a classifier predicts whether or not a patient has a given disease. The outcome (class) to predict can be a nominal one, like buy or not buy, or a numeric quantity. Bayesian Networks [1] and C4.5 [2] are both well-known types of classifiers.

Clustering is the second type of ML model. It is similar to Classification Learning, but the attribute to predict is defined by the model itself rather than by the user. By doing that, the model is able to group similar classes (the class represents the relationship between the predictor attributes and the goal attribute values). Some algorithms for clustering, like K-Means [3], use the concept of geometric distance between
instances in order to group the closest ones.

The last type of ML model is the Association Rule, which is related to structural data description instead of class prediction. Rules are commonly represented as if-then rules. A Decision Tree is a data structure composed by joining several if-then rules. In a Decision Tree, each internal node contains a rule for a predictor attribute (e.g., Attribute: customer age 15? yes/no) and each leaf node represents the instance classification (e.g., Decision: customer buy? yes/no). Apriori [4] is a popular algorithm for association rule extraction.

1.2 Motivation

The process of discovering patterns in data is a semi-automatic, empirical process. There is no universally best algorithm across all datasets. Datasets differ in their attribute types: some have more numerical attributes, while others have more nominal attributes. Some algorithms perform well on one type of dataset and disastrously on another. They are biased according to the type of data they process.

Stella is a dynamic language for ML model implementation. A dynamic language is the best tool for experimental computing. Dynamic languages provide a simple way to load and unload data structures (e.g., easy dynamic class loading). Models are easily implemented and tested. You should be able to run a piece of code as easily as running a SQL script.

The two main approaches to data analysis using ML are specialized query languages and frameworks written in general-purpose languages like C, C++ and Java. Commercial database products like Microsoft SQL Server provide some sort of data mining query language [5]. This is a very limited solution, since the user cannot build his own models. OR-Objects [6], Oracle Java Data Mining (OJDM) [7] and Weka (Waikato Environment for Knowledge Analysis) [8] are examples of the second approach: frameworks for ML. Weka is a superb, well-documented generic framework for ML, and it is written in Java. Java is not as static as C, but it is still not dynamic enough. It is
not possible to load and unload classes in Java without dealing with ClassLoader issues.

Stella is a Domain-Specific Language for ML model implementation and testing. It is an Object-Oriented Language, but not Object-Oriented obsessed like Smalltalk. Stella's API is composed of two parts: a small generic API (e.g., generic types like Integer, Double, String, Date and Object) and an extensive ML API (e.g., classes like DataSet, Classifier, ClassifierEvaluation, Instance and so on). Besides that, the language offers declarative constructions for some tedious tasks and, of course, good array manipulation support.

2 Language Reference

2.1 Constants, Enumerations and Functions

Constants contain immutable values. Functions in Stella are defined in the same way as in non-OOP procedural languages like Pascal or C. Constants and functions are the easiest way to declare and implement mathematical functions. Some common functions will be natively implemented in Java. I/O is also done by functions. An enumeration, as in C, defines a sequence of elements.

Examples of constants, enumerations and functions:

    enum AttributeType { NOMINAL, NUMERIC, DATE }

    constant double SMALL = 1e-6
    constant double NORMAL_DISTRIBUTION = sqrt(2 * PI)

    function void out(Object obj)                  // Console output function
    function Object fout(Object obj, String file)  // File output function

    function double min(double[] doubles) {
        check notNull(doubles), "doubles is null"
        check notEmpty(doubles), "doubles is empty"
        double min    // Default value for numbers is NaN
        foreach doubles i {
            if (min == NaN || min > doubles[i]) min = doubles[i]
        }
        min
    }

2.2 Classes and Objects

The main API is composed of a few classes and some special constructions for dealing with numbers and arrays. All classes inherit from class Object; you do not need to specify that. Object is a virtual class (the only one), and no one can directly create an instance of Object. The main API does not have direct support for meta-class introspection and reflection. All methods whose names start with a special prefix character are class methods (static methods in Java). Instances of string, date
and number are immutable objects.

Overview of the main classes (some methods are missing):

    class Object
        Object deepCopy()
        Object shallowCopy()
        boolean equals(Object obj)
        String toString()

    class String
        int length()
        String concat(String aString)

    class Boolean
        boolean parse(String aString)

    class Number

    class Double subclass Number
        double parseAsDouble(String aString)

    class Integer subclass Number
        int parseAsInt(String aString)

    class Long subclass Number
        long parseAsLong(String aString)

    class Date
        boolean before(Date aDate)
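
The difference between the deepCopy and shallowCopy methods listed on Object can be illustrated with Python's standard copy module. This is only an analogy under stated assumptions: the Instance class below is a hypothetical stand-in for a dataset row, not Stella's actual Instance class, and Stella's own copy semantics may differ in detail.

```python
import copy


class Instance:
    """Hypothetical stand-in for a dataset row (not Stella's Instance class)."""

    def __init__(self, values):
        # Mutable list of attribute values: numbers, strings or dates.
        self.values = values


original = Instance([15.0, "male", "xyz"])

# Shallow copy: a new Instance object that shares the same underlying list.
shallow = copy.copy(original)

# Deep copy: a new Instance object with its own independent list.
deep = copy.deepcopy(original)

# Mutate the original's list in place.
original.values[0] = 16.0

# The shallow copy observes the change; the deep copy does not.
print(shallow.values[0])
print(deep.values[0])
```

For immutable values such as strings, dates and numbers, the distinction matters less, since a shared immutable value cannot be modified through either copy. It becomes relevant for mutable containers such as arrays of attribute values.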