Create data frame analytics jobs API

This functionality is in beta and is subject to change. The design and code are less mature than official GA features and are provided as-is with no warranties. Beta features are not subject to the support SLA of official GA features.

Creates a new data frame analytics job. The API accepts a PutDataFrameAnalyticsRequest object as a request and returns a PutDataFrameAnalyticsResponse.

Request

A PutDataFrameAnalyticsRequest requires the following argument:

PutDataFrameAnalyticsRequest request = new PutDataFrameAnalyticsRequest(config); 

The configuration of the data frame analytics job to create

Data frame analytics configuration

The DataFrameAnalyticsConfig object contains all the details about the data frame analytics job configuration and contains the following arguments:

DataFrameAnalyticsConfig config = DataFrameAnalyticsConfig.builder()
    .setId("my-analytics-config") 
    .setSource(sourceConfig) 
    .setDest(destConfig) 
    .setAnalysis(outlierDetection) 
    .setAnalyzedFields(analyzedFields) 
    .setModelMemoryLimit(new ByteSizeValue(5, ByteSizeUnit.MB)) 
    .setDescription("this is an example description") 
    .setMaxNumThreads(1) 
    .build();

The data frame analytics job ID

The source index and query from which to gather data

The destination index

The analysis to be performed

The fields to be included in / excluded from the analysis

The memory limit for the model created as part of the analysis process

Optionally, a human-readable description

The maximum number of threads to be used by the analysis. Defaults to 1.

SourceConfig

The index and the query from which to collect data.

DataFrameAnalyticsSource sourceConfig = DataFrameAnalyticsSource.builder() 
    .setIndex("put-test-source-index") 
    .setQueryConfig(queryConfig) 
    .setSourceFiltering(new FetchSourceContext(true,
        new String[] { "included_field_1", "included_field_2" },
        new String[] { "excluded_field" })) 
    .build();

Constructing a new DataFrameAnalyticsSource

The source index

The query from which to gather the data. If query is not set, a match_all query is used by default.

Source filtering to select which fields will exist in the destination index.

QueryConfig

The query with which to select data from the source.

QueryConfig queryConfig = new QueryConfig(new MatchAllQueryBuilder());

DestinationConfig

The index to which data should be written by the data frame analytics job.

DataFrameAnalyticsDest destConfig = DataFrameAnalyticsDest.builder() 
    .setIndex("put-test-dest-index") 
    .build();

Constructing a new DataFrameAnalyticsDest

The destination index

Analysis

The analysis to be performed. Currently, the supported analyses include: OutlierDetection, Classification, Regression.

Outlier detection

OutlierDetection analysis can be created in one of two ways:

DataFrameAnalysis outlierDetection = org.elasticsearch.client.ml.dataframe.OutlierDetection.createDefault(); 

Constructing a new OutlierDetection object with the default strategy to determine outliers

or

DataFrameAnalysis outlierDetectionCustomized = org.elasticsearch.client.ml.dataframe.OutlierDetection.builder() 
    .setMethod(org.elasticsearch.client.ml.dataframe.OutlierDetection.Method.DISTANCE_KNN) 
    .setNNeighbors(5) 
    .setFeatureInfluenceThreshold(0.1) 
    .setComputeFeatureInfluence(true) 
    .setOutlierFraction(0.05) 
    .setStandardizationEnabled(true) 
    .build();

Constructing a new OutlierDetection object

The method used to perform the analysis

Number of neighbors taken into account during analysis

The minimum outlier_score required to compute feature influence

Whether to compute feature influence

The proportion of the data set that is assumed to be outlying prior to outlier detection

Whether to apply standardization to feature values

Classification

Classification analysis requires the dependent_variable to be set and accepts a number of optional parameters:

DataFrameAnalysis classification = Classification.builder("my_dependent_variable") 
    .setLambda(1.0) 
    .setGamma(5.5) 
    .setEta(0.1) 
    .setMaxTrees(50) 
    .setFeatureBagFraction(0.4) 
    .setNumTopFeatureImportanceValues(3) 
    .setPredictionFieldName("my_prediction_field_name") 
    .setTrainingPercent(50.0) 
    .setRandomizeSeed(1234L) 
    .setClassAssignmentObjective(Classification.ClassAssignmentObjective.MAXIMIZE_ACCURACY) 
    .setNumTopClasses(1) 
    .setFeatureProcessors(Arrays.asList(OneHotEncoding.builder("categorical_feature") 
        .addOneHot("cat", "cat_column")
        .build()))
    .setAlpha(1.0) 
    .setEtaGrowthRatePerTree(1.0) 
    .setSoftTreeDepthLimit(1.0) 
    .setSoftTreeDepthTolerance(1.0) 
    .setDownsampleFactor(0.5) 
    .setMaxOptimizationRoundsPerHyperparameter(3) 
    .setEarlyStoppingEnabled(true) 
    .build();

Constructing a new Classification builder object with the required dependent variable

The lambda regularization parameter. A non-negative double.

The gamma regularization parameter. A non-negative double.

The applied shrinkage. A double in [0.001, 1].

The maximum number of trees the forest is allowed to contain. An integer in [1, 2000].

The fraction of features which will be used when selecting a random bag for each candidate split. A double in (0, 1].

If set, feature importance for the top most important features will be computed.

The name of the prediction field in the results object.

The percentage of training-eligible rows to be used in training. Defaults to 100%.

The seed to be used by the random generator that picks which rows are used in training.

The optimization objective to target when assigning class labels. Defaults to maximize_minimum_recall.

The number of top classes (or -1 which denotes all classes) to be reported in the results. Defaults to 2.

Custom feature processors that will create new features for analysis from the included document fields. Note, automatic categorical feature encoding still occurs for all features.

The alpha regularization parameter. A non-negative double.

The growth rate of the shrinkage parameter. A double in [0.5, 2.0].

The soft tree depth limit. A non-negative double.

The soft tree depth tolerance. Controls how much the soft tree depth limit is respected. A double greater than or equal to 0.01.

The amount by which to downsample the data for stochastic gradient estimates. A double in (0, 1.0].

The maximum number of optimization rounds used per hyperparameter during hyperparameter optimization. An integer in [0, 20].

Whether to enable early stopping, which ends the training process when it is no longer finding better models.
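The ranges listed above can be checked on the client side before the request is sent. The following is a minimal sketch under stated assumptions: HyperparameterRanges and its methods are hypothetical names introduced for illustration, not part of the Elasticsearch client, and the parameter labels are taken from the setter names in the builder example.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper, not part of the Elasticsearch client: checks
 * hyperparameter values against the ranges documented above.
 */
final class HyperparameterRanges {

    private final List<String> errors = new ArrayList<>();

    /** Records an error unless min <= value <= max (NaN is always rejected). */
    HyperparameterRanges closed(String name, double value, double min, double max) {
        if (!(value >= min && value <= max)) {
            errors.add(name + " must be in [" + min + ", " + max + "], got " + value);
        }
        return this;
    }

    /** Records an error unless min < value <= max (NaN is always rejected). */
    HyperparameterRanges openClosed(String name, double value, double min, double max) {
        if (!(value > min && value <= max)) {
            errors.add(name + " must be in (" + min + ", " + max + "], got " + value);
        }
        return this;
    }

    /** Records an error unless value >= min (NaN is always rejected). */
    HyperparameterRanges atLeast(String name, double value, double min) {
        if (!(value >= min)) {
            errors.add(name + " must be at least " + min + ", got " + value);
        }
        return this;
    }

    /** Returns the collected messages; an empty list means every check passed. */
    List<String> errors() {
        return errors;
    }

    public static void main(String[] args) {
        List<String> problems = new HyperparameterRanges()
            .atLeast("lambda", 1.0, 0.0)
            .closed("eta", 0.1, 0.001, 1.0)
            .closed("maxTrees", 50, 1, 2000)
            .openClosed("featureBagFraction", 0.4, 0.0, 1.0)
            .closed("etaGrowthRatePerTree", 1.0, 0.5, 2.0)
            .atLeast("softTreeDepthTolerance", 1.0, 0.01)
            .openClosed("downsampleFactor", 0.5, 0.0, 1.0)
            .closed("maxOptimizationRoundsPerHyperparameter", 3, 0, 20)
            .errors();
        problems.forEach(System.out::println);
    }
}
```

A value outside its range, such as an eta of 5.5, would be reported in the returned list rather than being discovered only when the server rejects the job.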

Regression

Regression analysis requires the dependent_variable to be set and accepts a number of optional parameters:

DataFrameAnalysis regression = org.elasticsearch.client.ml.dataframe.Regression.builder("my_dependent_variable") 
    .setLambda(1.0) 
    .setGamma(5.5) 
    .setEta(0.1) 
    .setMaxTrees(50) 
    .setFeatureBagFraction(0.4) 
    .setNumTopFeatureImportanceValues(3) 
    .setPredictionFieldName("my_prediction_field_name") 
    .setTrainingPercent(50.0) 
    .setRandomizeSeed(1234L) 
    .setLossFunction(Regression.LossFunction.MSE) 
    .setLossFunctionParameter(1.0) 
    .setFeatureProcessors(Arrays.asList(OneHotEncoding.builder("categorical_feature") 
        .addOneHot("cat", "cat_column")
        .build()))
    .setAlpha(1.0) 
    .setEtaGrowthRatePerTree(1.0) 
    .setSoftTreeDepthLimit(1.0) 
    .setSoftTreeDepthTolerance(1.0) 
    .setDownsampleFactor(0.5) 
    .setMaxOptimizationRoundsPerHyperparameter(3) 
    .setEarlyStoppingEnabled(true) 
    .build();

Constructing a new Regression builder object with the required dependent variable

The lambda regularization parameter. A non-negative double.

The gamma regularization parameter. A non-negative double.

The applied shrinkage. A double in [0.001, 1].

The maximum number of trees the forest is allowed to contain. An integer in [1, 2000].

The fraction of features which will be used when selecting a random bag for each candidate split. A double in (0, 1].

If set, feature importance for the top most important features will be computed.

The name of the prediction field in the results object.

The percentage of training-eligible rows to be used in training. Defaults to 100%.

The seed to be used by the random generator that picks which rows are used in training.

The loss function used for regression. Defaults to mse.

An optional parameter to the loss function.

Custom feature processors that will create new features for analysis from the included document fields. Note, automatic categorical feature encoding still occurs for all features.

The alpha regularization parameter. A non-negative double.

The growth rate of the shrinkage parameter. A double in [0.5, 2.0].

The soft tree depth limit. A non-negative double.

The soft tree depth tolerance. Controls how much the soft tree depth limit is respected. A double greater than or equal to 0.01.

The amount by which to downsample the data for stochastic gradient estimates. A double in (0, 1.0].

The maximum number of optimization rounds used per hyperparameter during hyperparameter optimization. An integer in [0, 20].

Whether to enable early stopping, which ends the training process when it is no longer finding better models.

Analyzed fields

A FetchSourceContext object containing the fields to be included in or excluded from the analysis.

FetchSourceContext analyzedFields =
    new FetchSourceContext(
        true,
        new String[] { "included_field_1", "included_field_2" },
        new String[] { "excluded_field" });

Synchronous execution

When executing a PutDataFrameAnalyticsRequest in the following manner, the client waits for the PutDataFrameAnalyticsResponse to be returned before continuing with code execution:

PutDataFrameAnalyticsResponse response = client.machineLearning().putDataFrameAnalytics(request, RequestOptions.DEFAULT);

Synchronous calls may throw an IOException if the high-level REST client fails to parse the REST response, if the request times out, or in similar cases where no response comes back from the server.

In cases where the server returns a 4xx or 5xx error code, the high-level client tries to parse the response body error details instead and then throws a generic ElasticsearchException and adds the original ResponseException as a suppressed exception to it.
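The suppressed-exception mechanism described above is standard Java (Throwable.addSuppressed and Throwable.getSuppressed). The sketch below illustrates it with stand-in types so that it compiles without the client on the classpath: RawResponseException and ServerErrorException are hypothetical placeholders for ResponseException and ElasticsearchException, and the wrap and describe helpers are assumptions for illustration, not client methods.

```java
import java.io.IOException;

/** Illustration only: the nested exception types are stand-ins, not the real client classes. */
final class ErrorHandlingSketch {

    /** Stand-in for the low-level exception that carries the raw HTTP error response. */
    static final class RawResponseException extends Exception {
        RawResponseException(String message) {
            super(message);
        }
    }

    /** Stand-in for the generic exception thrown for 4xx and 5xx responses. */
    static final class ServerErrorException extends RuntimeException {
        ServerErrorException(String message) {
            super(message);
        }
    }

    /** Wraps the parsed error details and keeps the original exception as a suppressed one. */
    static ServerErrorException wrap(RawResponseException original, String parsedReason) {
        ServerErrorException wrapped = new ServerErrorException(parsedReason);
        wrapped.addSuppressed(original);
        return wrapped;
    }

    /** Builds a log-style description that distinguishes the two failure kinds. */
    static String describe(Exception e) {
        if (e instanceof ServerErrorException) {
            Throwable[] suppressed = e.getSuppressed();
            String original = suppressed.length > 0 ? suppressed[0].getMessage() : "none";
            return "server error: " + e.getMessage() + " (original: " + original + ")";
        }
        if (e instanceof IOException) {
            return "no usable response: " + e.getMessage();
        }
        return "unexpected: " + e.getMessage();
    }
}
```

A caller following this pattern would typically catch the server-error exception first to inspect the parsed details, and treat an IOException as a transport-level problem that may be worth retrying.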

Asynchronous execution

Executing a PutDataFrameAnalyticsRequest can also be done in an asynchronous fashion so that the client can return directly. Users need to specify how the response or potential failures will be handled by passing the request and a listener to the asynchronous put-data-frame-analytics method:

client.machineLearning().putDataFrameAnalyticsAsync(request, RequestOptions.DEFAULT, listener); 

The PutDataFrameAnalyticsRequest to execute and the ActionListener to use when the execution completes

The asynchronous method does not block and returns immediately. Once it is completed the ActionListener is called back using the onResponse method if the execution successfully completed or using the onFailure method if it failed. Failure scenarios and expected exceptions are the same as in the synchronous execution case.

A typical listener for put-data-frame-analytics looks like:

ActionListener<PutDataFrameAnalyticsResponse> listener = new ActionListener<PutDataFrameAnalyticsResponse>() {
    @Override
    public void onResponse(PutDataFrameAnalyticsResponse response) {
        
    }

    @Override
    public void onFailure(Exception e) {
        
    }
};

Called when the execution is successfully completed.

Called when the whole PutDataFrameAnalyticsRequest fails.
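Callers who prefer futures over callbacks can bridge the two. The following is a minimal sketch, assuming a stand-in Listener interface that mirrors the onResponse and onFailure shape of the listener above so the example compiles without the client library; ListenerFutureSketch and completing are hypothetical names, not client APIs.

```java
import java.util.concurrent.CompletableFuture;

/** Illustration only: bridges a callback-style listener to a CompletableFuture. */
final class ListenerFutureSketch {

    /** Stand-in with the same two callbacks as the listener shown above. */
    interface Listener<T> {
        void onResponse(T response);

        void onFailure(Exception e);
    }

    /** Returns a listener that completes the given future from whichever callback fires. */
    static <T> Listener<T> completing(CompletableFuture<T> future) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                // Successful execution: hand the response to whoever waits on the future.
                future.complete(response);
            }

            @Override
            public void onFailure(Exception e) {
                // Failed execution: propagate the exception through the future.
                future.completeExceptionally(e);
            }
        };
    }
}
```

With the real client, the same idea would apply by implementing ActionListener<PutDataFrameAnalyticsResponse> in place of the stand-in interface and passing it to putDataFrameAnalyticsAsync.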

Response

The returned PutDataFrameAnalyticsResponse contains the newly created data frame analytics job.

DataFrameAnalyticsConfig createdConfig = response.getConfig();