Classification

edit

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

Classification is a machine learning process for predicting the class or category of a given data point in a dataset. Typical examples of classification problems are predicting loan risk, classifying music, or detecting cancer in a DNA sequence. In the first case, for example, our dataset consists of data on loan applicants that covers investment history, employment status, debit status, and so on. Based on historical data, the classification analysis predicts whether it is safe or risky to lend money to a given loan applicant. In the second case, the data we have represents songs and the analysis – based on the features of the data points – classifies the songs as hip-hop, country, classical, or any other genres available in the set of categories we have. Therefore, classification is for predicting discrete, categorical values, unlike regression analysis which predicts continuous, numerical values.

From the perspective of the possible output, there are two types of classification: binary and multi-class classification. In binary classification the variable you want to predict has only two potential values. The loan example above is a binary classification problem where the two potential outputs are safe or risky. The music classification problem is an example of multi-class classification where there are many different potential outputs; one for every possible music genre. In the 7.5.2 version of the Elastic Stack, you can perform only binary classification analysis.

Feature variables

edit

When you perform classification, you must identify a subset of fields that you want to use to create a model for predicting another field value. We refer to these fields as feature variables and dependent variable, respectively. Feature variables are the values that the dependent variable value depends on. There are three different types of feature variables that you can use with our classification algorithm: numerical, categorical, and boolean. Arrays are not supported in the feature variable fields.

Training the classification model

edit

Classification – just like regression – is a supervised machine learning process. It means that you need to supply a labeled training dataset that has some feature variables and a dependent variable. The classification algorithm learns the relationships between the features and the dependent variable. Once you’ve trained the model on your training dataset, you can reuse the knowledge that the model has learned about the relationships between the data points to classify new data. Your training dataset should be approximately balanced which means the number of data points belonging to the various classes should not be widely different, otherwise the classification analysis may not provide the best predictions. Read Imbalanced class sizes affect classification performance to learn more.

Classification algorithms
edit

The ensemble algorithm that we use in the Elastic Stack is a type of boosting called boosted tree regression model which combines multiple weak models into a composite one. We use decision trees to learn to predict the probability that a data point belongs to a certain class.

Measuring model performance

edit

You can measure how well the model has performed on your dataset by using the classification evaluation type of the evaluate data frame analytics API. The metric that the evaluation provides you is the multi-class confusion matrix which tells you how many times a given data point that belongs to a given class was classified correctly and incorrectly. In other words, how many times your data point that belongs to the X class was mistakenly classified as Y.

Another crucial measurement is how well your model performs on unseen data points. To assess how well the trained model will perform on data it has never seen before, you must set aside a proportion of the training dataset for testing. This split of the dataset is the testing dataset. Once the model has been trained, you can let the model predict the value of the data points it has never seen before and compare the prediction to the actual value by using the evaluate data frame analytics API.