Data Source

Feature List

Next, you can configure your model further, starting with Data Columns, by clicking the "Feature List" icon:

In Data Columns, you can do everything related to the features in your dataset: add a new feature built from existing features, search for a feature, edit your features, and much more. The sections below explain these functions in more detail:

Search Feature

You can search among your features by providing a keyword, and TAZI will return the features that match it:

Actions

Samples

You can see a random sample drawn from your feature column when you click Samples:

Above, the pop-up list shows some random values that the feature Payment Amount can take.

Edit

Clicking this opens the Edit Feature window. TAZI has two different editing windows, depending on the feature type. For a numerical feature you will see the left window below; for a categorical feature, the right one:

Name: You can change the feature name by editing this field.
Feature Type: TAZI automatically detects the data type of the feature and sets the type for you. However, you can change it to another preset type such as Int, Double, Float or String.
Imputing Strategy: Only visible when the feature is numerical (Integer, Double, Float etc.). When your feature column has missing data, you can choose the strategy used to impute the missing values.
Default Value: If the imputing strategy is set to None, this value is used wherever the feature column has a missing value.
Description: You can edit the description of your feature.
Feature Kind: TAZI detects the feature kind automatically, but you can change it if you want. Accepted values are Discrete and Continuous.
TARGET: Enabling this makes the feature the target label.
Business KPI: Setting this to TRUE makes the feature a Business KPI.
Encoded: Only visible when the feature is categorical. It applies label encoding to the feature. The default is 10 categories.
Ignore: When Ignore is set to TRUE, TAZI won't use the feature in machine learning while processing the data.
Keep: When enabled, the feature is used throughout the modeling process even if it does not contribute significantly to the model.
Null OK: If this is set to FALSE, the model skips an instance entirely when its value for this feature is missing. If it is set to TRUE, the model imputes the missing value when an imputing strategy is provided; otherwise, the default value is used.
Passthru: If enabled, the feature itself won't be used in the model. However, it may still be used to create other features.
Metric Feature: The model can be represented by using this metric group feature with group by.
Metric Group Feature: A numeric feature whose accumulated value is calculated for each segment. Two metric features can be defined in the data columns. Their sums, and the ratio obtained by dividing one sum by the other, are displayed near the explanation of the related segment in the model.
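The interplay of Null OK, Imputing Strategy and Default Value described above can be sketched in Python. This is only an illustration of the rule as stated, not TAZI's implementation; the function name, the strategy names ("mean", "median") and the use of None for a missing value are assumptions made for the example.

```python
from statistics import mean, median

# Hypothetical strategy names; the options TAZI actually offers may differ.
IMPUTERS = {"mean": mean, "median": median}


def resolve_missing(values, null_ok, strategy=None, default=None):
    """Apply the Null OK rule to one feature column (None marks a missing value).

    Returns (row_index, value) pairs for the rows that are kept.
    """
    observed = [v for v in values if v is not None]
    kept = []
    for i, v in enumerate(values):
        if v is not None:
            kept.append((i, v))
        elif not null_ok:
            continue  # Null OK = FALSE: skip this instance entirely
        elif strategy is not None:
            kept.append((i, IMPUTERS[strategy](observed)))  # impute the value
        else:
            kept.append((i, default))  # no strategy: fall back to the default
    return kept


column = [10.0, None, 30.0]
print(resolve_missing(column, null_ok=False))
print(resolve_missing(column, null_ok=True, strategy="mean"))
print(resolve_missing(column, null_ok=True, default=0.0))
```

With Null OK set to FALSE the second row is dropped; with it set to TRUE the gap is filled either by the chosen strategy or by the default value.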

Edit Engineered Specifications

You can create additional features based on your dataset by clicking Edit Engineered Specifications. These engineered features can be used when your model is run.

Statistics: If the feature is numerical, you can aggregate data points with an aggregation function such as mean or standard deviation over a specified window:
Window: An integer, 0 or greater. Selected statistics of a feature are calculated automatically and added as new features, and you can add multiple statistics over multiple windows for the same feature. For example, if you set the window to 50, the selected statistics are calculated over the last 50 records from the current record and saved to new column(s) as engineered features. Note: If you give 0 as the window length, running statistics are calculated from the beginning, i.e. there is no window.
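As a rough illustration of the window semantics, the sketch below computes a statistic for each record in plain Python. It assumes that the window ends at the current record and includes it, which the text does not state explicitly; the function name is hypothetical.

```python
from statistics import mean


def windowed_stat(values, window, stat=mean):
    """Return one statistic per record, computed over the trailing window.

    Assumption: the window ends at the current record and includes it.
    A window of 0 means a running statistic from the first record.
    """
    if window < 0:
        raise ValueError("window must be 0 or greater")
    result = []
    for i in range(len(values)):
        start = 0 if window == 0 else max(0, i - window + 1)
        result.append(stat(values[start:i + 1]))
    return result


amounts = [10.0, 20.0, 30.0, 40.0]
print(windowed_stat(amounts, window=2))  # rolling mean over 2 records
print(windowed_stat(amounts, window=0))  # running mean from the beginning
```

Passing another function, for example `statistics.pstdev`, as `stat` gives a different statistic over the same window.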

By default, mean will be used as a rolling average of your dataset with the selected window size. You can change the aggregate function or add more from the drop-down list. The contents of this list depend on the feature type. If the feature is categorical, you'll see a list like the one below:

Here, TAZI will create a new feature that, for each row in the dataset, counts how many of the last 10 instances have the same value. You can also choose to count the unique values over the window and use that count as a new feature.
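The two categorical statistics can be pictured with a short Python sketch. It assumes the window includes the current row and that "same values" means occurrences of the current row's value; the function names and the sample payment methods are made up for illustration.

```python
def _window_start(i, window):
    # A window of 0 is treated as "from the beginning", as for numerical statistics.
    return 0 if window == 0 else max(0, i - window + 1)


def same_value_count(values, window):
    """For each row, count how often its value occurs in the trailing window."""
    return [
        values[_window_start(i, window):i + 1].count(current)
        for i, current in enumerate(values)
    ]


def unique_value_count(values, window):
    """For each row, count the distinct values in the trailing window."""
    return [
        len(set(values[_window_start(i, window):i + 1]))
        for i in range(len(values))
    ]


methods = ["card", "cash", "card", "card", "wire"]
print(same_value_count(methods, window=3))
print(unique_value_count(methods, window=3))
```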

History: You can add multiple numeric values here to save previous values of the feature as new features. For example, if you set the values to 10 and 20, two columns will be added to the input dataset, holding the values from 10 and 20 records earlier.
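A History setting behaves like a set of lag columns. The Python sketch below shows the idea; filling rows that do not yet have enough history with None is an assumption, since the text does not say how TAZI treats them, and the function name is hypothetical.

```python
def add_history(values, lags):
    """Return {lag: column}, where column[i] is the value `lag` records before row i.

    Assumption: rows without enough history are filled with None.
    """
    return {
        lag: [values[i - lag] if i >= lag else None for i in range(len(values))]
        for lag in lags
    }


amounts = [5, 6, 7, 8]
print(add_history(amounts, lags=[1, 2]))
```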

Time Cycles: This can only be used for date-time features. Let's say you click the drop-down list and choose DayOfMonth. This means that for every instance of the feature, TAZI will extract the day of the month and use it as a new feature. The same is true for every item on that list. You can choose multiple items, and TAZI will create a new feature for each one you selected.
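In Python terms, a time cycle is a component extracted from each timestamp. The sketch below is an illustration only: DayOfMonth is the one item named in the text, while the Month and HourOfDay entries and the function name are assumptions added to show several selections at once.

```python
from datetime import datetime

# Only DayOfMonth is named in the text; the other two entries are hypothetical.
CYCLES = {
    "DayOfMonth": lambda ts: ts.day,
    "Month": lambda ts: ts.month,
    "HourOfDay": lambda ts: ts.hour,
}


def time_cycles(timestamps, selected):
    """Build one new column per selected cycle."""
    return {name: [CYCLES[name](ts) for ts in timestamps] for name in selected}


stamps = [datetime(2023, 1, 31, 9, 30), datetime(2023, 2, 1, 18, 0)]
print(time_cycles(stamps, ["DayOfMonth", "HourOfDay"]))
```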

Note: You won't see these engineered features in the data columns as new features. However, TAZI will use them just like any other feature when your model is run. When an engineered specification is applied to a feature, you'll see label(s) on that feature indicating which kind of specification TAZI will compute.

Remove

You can choose to remove any feature from your Data Columns by clicking Remove:

Note that this action is irreversible: you have to load the dataset again if you later decide to use the removed feature. If you only want to keep a feature out of the training phase of your model, you can Ignore it instead.

Add Feature

Clicking the circled button above will direct you to a window where you can create your own feature from scratch. You can choose a Name, a Feature Type (Integer, Double, String etc.) and a Default Value (to be used if there is a run-time exception). Most importantly, you can write an expression for the feature by clicking the Edit button circled below:

You will be directed to the Edit Expression page, where you can transform your features into new features. TAZI lets you engineer new features with a custom expression, just as you would in a programming language. These expressions combine your features, operators and functions, and are no different from arithmetic or programming expressions. You can write an expression manually or build it by dragging and dropping elements from the Features and Operators tabs. You can also search for any feature in the search bar:

The Operators tab consists of unary and binary operators, functions and values (such as pi).
The Features tab holds every feature in your dataset.
The Expression bar lets you write custom expressions manually. You can also see the expressions you have created by drag-and-drop here and edit them manually.
Validation results show whether you have written a valid expression. If not, you can check the error(s) by clicking the information icon.

Overview of Expressions

Syntax for accessing other feature values: $(some feature name)
Syntax for accessing a restricted set of functions and values: #(symbol name)
Custom expressions may return any type, as long as feature type and kind match.
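To make the two reference forms concrete, the following Python sketch pulls feature names and symbol names out of an expression string with regular expressions. It is not part of TAZI; it assumes that names never contain parentheses, and the function name is hypothetical.

```python
import re

# $(...) refers to a feature, #(...) to a function or value.
# Assumption: names themselves contain no parentheses.
FEATURE_REF = re.compile(r"\$\(([^()]+)\)")
SYMBOL_REF = re.compile(r"#\(([^()]+)\)")


def references(expression):
    """Return (feature names, symbol names) used in an expression string."""
    return FEATURE_REF.findall(expression), SYMBOL_REF.findall(expression)


print(references("#(log)($(Payment Amount))"))
```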

Example 1: Let's start with a simple expression by just using the unary log function:

#(log)($(Payment Amount))

First, you select the log function from the Operators tab and drop it onto the main pane. After that, you select the feature you want to transform (in this case Payment Amount) and drop it into the block of the log function. The log function computes the base-2 logarithm of every value of the feature, and this feature transformation will be created and used dynamically during the training of your model.

You can see that the expression is valid. If you used a feature of String type instead, the same expression would yield an error, since you cannot compute the logarithm of a categorical variable.
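For readers who think in code, a rough Python equivalent of Example 1 is sketched below. It takes the logarithm to be base 2, as described above. Treating zero or negative amounts as a run-time exception that falls back to the Default Value is an assumption about TAZI's behavior, and the function name is hypothetical.

```python
import math


def log_feature(values, default=None):
    """Base-2 logarithm of each value, mirroring #(log)($(Payment Amount)).

    Assumption: a zero or negative value counts as a run-time exception,
    so the feature's Default Value is used in its place.
    """
    result = []
    for v in values:
        try:
            result.append(math.log2(v))
        except ValueError:  # math domain error for v <= 0
            result.append(default)
    return result


print(log_feature([1, 8, 64]))
print(log_feature([1, 8, 0], default=0.0))
```

Passing a string such as "abc" makes `math.log2` raise a TypeError, which corresponds to the validation error for String features.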

Example 2: Now let's create a more complex expression. Suppose that we want a categorical feature that takes one of two values, Category 1 or Category 2, depending on whether a condition is met:

if ($(Payment Amount) >= 50) "Category 1" else "Category 2"

For every value of Payment Amount, the expression above creates a String value that is either Category 1 or Category 2, depending on the condition. TAZI will use those String values as a new feature when our model is run.
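The same rule written as a small Python function, purely for comparison with the TAZI expression; the function and parameter names are illustrative.

```python
def payment_category(amount, threshold=50):
    """Mirror of: if ($(Payment Amount) >= 50) "Category 1" else "Category 2"."""
    return "Category 1" if amount >= threshold else "Category 2"


print([payment_category(a) for a in [20, 50, 75.5]])
```

Note that a value exactly equal to 50 falls into Category 1, because the condition uses >=.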

The same TAZI expression can be built with the drag-and-drop method. First, we select the if_else condition block from the Operators tab and drop it onto the main pane:

Then, we fill in the blocks as shown below. Note that text and number blocks are used as containers (or variables) for String and Integer values, respectively:

After our expression is validated, we can hit the Submit button. Now that we can see the expression we constructed, we can continue filling out the remaining empty fields:

Finally, after clicking the Add button, our newly engineered feature can be seen in the Data Columns:

Running Mode

You can access the running mode by clicking the running mode icon.

You can run the model in either batch mode or continuous mode.

Additional Parameters

You can find the Additional Parameters icon in the Data Source part of the Configuration Map.

You can edit the additional parameters.

Turbo Train

Turbo Train is a configurable random selection mechanism that chooses which instances to train with.
Turbo train in characterization

When turbo training is enabled, a subset of the training data is used for training.

It works for all input types and use cases (classification, regression etc.).
Sampling is done randomly, but it's repeatable (it has a seed).
It does not depend on any of the training data values.
The sampling rate is calculated dynamically; it does not change after it's set.
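The properties listed above (random, repeatable through a seed, independent of data values) can be illustrated with a seeded sampler over row indices. This is a sketch, not TAZI's sampler; the function name and the default seed are assumptions.

```python
import random


def sample_indices(total, sample_size, seed=42):
    """Pick row indices at random; the same seed always gives the same subset.

    Only row positions are used, so the choice does not depend on data values.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(total), min(sample_size, total)))


first = sample_indices(1000, 10)
second = sample_indices(1000, 10)
print(first == second)  # repeatable: both runs pick the same rows
```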

The sampling rate is calculated from the train data count (C) and three thresholds:
Threshold 1 (200K)
Threshold 2 (1M)
Threshold 3 (5M)

condition	training
C < 200K	TAZI uses ALL instances (turbo training disabled)
200K <= C < 1M	TAZI uses only 200K instances
1M <= C < 5M	TAZI uses X instances where X is the mapping from [1M, 5M) to [200K, 1M)
5M <= C	TAZI uses 5M instances
train data count unknown	TAZI uses first 200K instances, and then uses 10% of the remaining data
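The table can be read as a function from the train data count to the number of instances used. The Python sketch below follows the rows literally. The text only says that [1M, 5M) is mapped to [200K, 1M), so the linear interpolation is an assumption, as are the function names and the way the unknown-count case is modeled as a per-record decision.

```python
import random

T1, T2, T3 = 200_000, 1_000_000, 5_000_000  # thresholds from the table


def turbo_train_size(count):
    """Number of instances used for a known train data count C."""
    if count < T1:
        return count  # turbo training disabled: use all instances
    if count < T2:
        return T1
    if count < T3:
        # Assumption: linear mapping of [1M, 5M) onto [200K, 1M).
        return T1 + (count - T2) * (T2 - T1) // (T3 - T2)
    return T3  # last row of the table, taken literally


def keep_when_count_unknown(index, rng, rate=0.10):
    """Unknown count: keep the first 200K records, then about 10% of the rest."""
    return index < T1 or rng.random() < rate


for c in (150_000, 500_000, 3_000_000, 8_000_000):
    print(c, turbo_train_size(c))

rng = random.Random(0)
print(keep_when_count_unknown(10, rng))
```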