
AI Algorithms

In the AI Algorithms part of the Configuration Map, you can choose machine learning algorithms and combine different types of models based on your criteria.

When the configuration type is set to classification, a combination of a supervised model and a deep neural network is selected by default. For regression, only a deep neural network is selected automatically.

You can choose any of the models available in TAZI and adjust all of their parameters if you think the defaults do not suit your problem. Now, let’s take a closer look at the models TAZI provides:

Machine Learning Models:

Online Decision Tree, Semi-supervised Online Decision Tree, Deep Neural Network, XGBoost, RandomForest, and LightGBM are the machine learning models TAZI uses to train on your dataset.

  • ∞: Continuous ML models are indicated by the "∞" icon in the upper-left corner of the model boxes.
  • Edit: Configures the status and weight of the model.

  • Active: Shows Yes if the model is active, otherwise No.

  • Weight: Determines the relative effect of the model on the combiner. The Combiner (explained in detail later) uses this weight when combining the models you choose. The weight is also shown at the top right of the model box.

  • Remove: Removes the model.

  • Additional Parameters: Configures the model parameters. The Online Decision Tree Model, Semi-supervised Online Decision Tree Model, and Explanation Model share the same parameter names, listed below; a configuration sketch follows the list:

  • Boosting: When enabled, a random forest algorithm is used; the number of estimators and the fraction of features used can also be configured.

  • Max Levels: Maximum depth of the tree. A higher number of levels increases model complexity.

  • Impurity Threshold: Impurity threshold for a node to be split. Default value: 0. Increase for a simpler model: 0.001, ..., 0.045.

  • Instance Threshold: Instance threshold for a node to be split. Default value: 300. Increase for a simpler model: 30, ..., 1000.

  • Instance Threshold Classify: Instance threshold for a leaf to use its own label as opposed to its parent's label.

  • Leaf Split Threshold: Minimum percentage of the minority class on a node for it to be split. Default: 0.05.

  • Training Bucket Size: Number of training instances to wait for before updating the model. Increase for faster model training, though increasing it may reduce accuracy.

  • Lambda Decay: Aging factor; related to the parameter forgetHorizonInstanceCount.

  • Alpha Confidence: Must be less than both 0.5 - impurityThreshold and impurityThreshold.

  • Gamma Multiplier: Scales how large the entropy difference between features must be to count as significant.

  • Signature Precision: Precision of the signature shown in the output.

  • Max Unique Values: Maximum number of unique values kept in histograms. Default value: 10000. Reduce for a simpler model.

  • Classification Method: Classification method applied in the nodes; one of majority, naive bayes, or confusion cost. Choose majority for a simpler model.

  • Min Node Count: Minimum node size of the tree. It implicitly sets the depth of the tree.
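
For orientation, a hypothetical parameter set that nudges the Online Decision Tree toward a simpler model might look like the sketch below. The keys mirror the parameter names above; the exact identifiers and value ranges in your TAZI version may differ:

    # Hypothetical values chosen in the "simpler model" direction
    # suggested by the parameter descriptions above.
    simpler_tree_params = {
        "max_levels": 5,                  # shallower tree
        "impurity_threshold": 0.01,       # higher => fewer splits
        "instance_threshold": 1000,       # higher => fewer splits
        "leaf_split_threshold": 0.05,     # default
        "max_unique_values": 1000,        # smaller histograms
        "classification_method": "majority",
    }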

XGBoost, RandomForest, and LightGBM take the parameters of their underlying libraries, documented at the links below:

  • https://xgboost.readthedocs.io
  • https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
  • https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
  • https://lightgbm.readthedocs.io/en/latest/
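
As a rough orientation, the same estimators can be exercised directly through the public APIs of xgboost, scikit-learn, and lightgbm. This is a minimal sketch outside TAZI, not TAZI code; the parameter values are illustrative only:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # The constructor arguments below correspond to the Additional
    # Parameters TAZI exposes for these models.
    for est in (RandomForestClassifier(n_estimators=100, random_state=0),
                XGBClassifier(n_estimators=100, max_depth=6),
                LGBMClassifier(n_estimators=100, num_leaves=31)):
        est.fit(X, y)
        print(type(est).__name__, est.score(X, y))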

Deep Neural Network parameters are listed below, followed by an illustrative sketch:

  • Learning Rate: Learning rate used for optimization (e.g. a value in [0.0001, 0.01]). Values that are too low cause slow learning; values that are too high may prevent convergence.
  • Momentum: Momentum helps the neural network weights converge faster if the weight change is consistently in the same direction across consecutive instances (e.g. 0.95, 0.98, 0.99).
  • Number of Hidden Layers: Number of hidden layers in the neural network.
  • Layer Size: Number of units in each hidden layer.
  • Drop Out: Use for better generalization or if the data is noisy. If true, some units are randomly dropped (set to zero) during training so that the network still performs even when those units are absent.
  • Shuffle: (Batch Training Only) Shuffles the training data.
  • Batch Size: The model is updated using this many instances at a time.
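
TAZI trains its network internally, but the parameters above map onto standard neural network concepts. A minimal Keras sketch, purely to illustrate how they interact (the architecture and values here are assumptions, not TAZI internals):

    import tensorflow as tf

    n_hidden, layer_size, use_dropout = 2, 64, True  # Number of Hidden Layers, Layer Size, Drop Out

    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(20,)))           # assumed 20 input features
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(layer_size, activation="relu"))
        if use_dropout:
            model.add(tf.keras.layers.Dropout(0.2))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

    # Learning Rate and Momentum belong to the optimizer
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.95),
                  loss="binary_crossentropy")
    # Batch Size and Shuffle are fit-time arguments:
    # model.fit(X, y, batch_size=32, shuffle=True)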

KMeans Model parameters are listed below:

  • N_microCluster: Number of clusters.
  • threshold: Distance from the cluster mean/median beyond which an instance is declared an outlier.
  • uniqueLabelCount:
  • instanceSummarySize: Given the effectively infinite number of instances in the stream, how many instances to keep as reference points.
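
The threshold rule can be pictured as follows: after clustering, an instance whose distance to its cluster center exceeds the cutoff is flagged as an outlier. A rough batch sketch with scikit-learn (TAZI's streaming micro-cluster implementation differs; this only illustrates the distance rule):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))

    km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)  # cf. N_microCluster
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    cutoff = dist.mean() + 3 * dist.std()                        # cf. threshold (illustrative)
    print("outliers:", int((dist > cutoff).sum()))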

Custom Python/R Model Coding

TAZI empowers users with the ability to code their own models using Python/R. Follow the steps below to craft your own model:

How to use:

1- Navigate to the configuration step where you create a new model or modify an existing one to include algorithms.

2- Click the '+' button on the configuration screen to add a new algorithm.

3- Choose either Python or R model.

4- Enable the "Provide Train/Save/Load Codes" option in the configuration to open the sections where you enter your code.

5- Sample code for each section is provided below. After filling in these sections according to your preferences, save and proceed to run your model.

  • Initialization:

    import numpy as np
    import pandas as pd
    import dill
    import joblib
    import tempfile  # used by the Save section below
    
    
    from lightgbm import LGBMClassifier, LGBMRegressor, Booster
    
    from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline, FeatureUnion
    
    trained = False
    classes = []
    model = None
    target = 'label'
    is_regression = False
    
    def predict_proba(model, X, n_classes):
      # Booster.predict returns class probabilities directly for multiclass
      # problems, and the positive-class probability for binary problems.
      result = model.predict(X)
      if n_classes > 2:
        return result
      else:
        return np.vstack((1. - result, result)).transpose()
    
    def return_prediction(df, pre_pipe, is_regression, model, classes):
      # Apply the fitted preprocessing pipeline, then format predictions
      # the way TAZI expects: one dict per instance.
      X = pre_pipe.transform(df)
      if is_regression:
        return [{'regression': num} for num in model.predict(X)]
      else:
        return [dict(zip(classes, probas)) for probas in predict_proba(model, X, len(classes))]
    

  • Prediction:

    if not trained:
      raise Exception("model not trained!")
    else:
      # tazi_df holds the instances to score; tazi_pred is read back by TAZI
      tazi_pred = return_prediction(tazi_df, pre_pipe, is_regression, model, classes)
    

  • Train:

    defaults = {
      "boosting_type" : "gbdt",
      "num_leaves" : 31,
      "max_depth" : -1,
      "learning_rate" : 0.1,
      "n_estimators" : 100,
      "subsample_for_bin" : 200000,
      "min_split_gain" : 0.0,
      "min_child_weight" : 0.001,
      "min_child_samples" : 20,
      "subsample" : 1.0,
      "subsample_freq" : 0,
      "colsample_bytree" : 1.0,
      "reg_alpha" : 0.0,
      "reg_lambda" : 0.0,
      "random_state" : 42,
      "n_jobs" : -1,
      "importance_type" : "split"
    }
    model_params = dict(defaults)  # copy so the defaults stay untouched
    if TAZI__CLASS_WEIGHT is not None:
      model_params['class_weight'] = TAZI__CLASS_WEIGHT
    
    df = tazi_df # tazi_df is provided by TAZI and holds the training data
    
    num_cols = list(df[df.columns.difference([target])].select_dtypes(include='number').columns)
    cat_cols = list(df[df.columns.difference([target])].select_dtypes(include='object').columns)
    num_selector = ('selector', FunctionTransformer(lambda X: X[num_cols]))
    cat_selector = ('selector', FunctionTransformer(lambda X: X[cat_cols]))
    
    num_pipe = Pipeline([
      num_selector,
      ('imputer', SimpleImputer(strategy='mean')),
      ('scaler', StandardScaler())
    ]) if len(num_cols) > 0 else Pipeline([num_selector])
    
    cat_pipe = Pipeline([
      cat_selector,
      ('imputer', SimpleImputer(strategy='most_frequent')),
      ('encoder', OneHotEncoder(sparse=False, handle_unknown='ignore'))  # on scikit-learn >= 1.2, use sparse_output=False
    ]) if len(cat_cols) > 0 else Pipeline([cat_selector])
    
    pre_pipe = FeatureUnion([
      ('num', num_pipe),
      ('cat', cat_pipe)
    ])
    
    pre_pipe.fit(df)
    
    X = pre_pipe.transform(df)
    # keep the raw numeric target for regression; stringify labels for classification
    y = df[target] if is_regression else df[target].map(str)
    
    if 'class_weight' in model_params:
      # remove classes not present in train labels
      model_params['class_weight'] = {k: model_params['class_weight'][k] for k in list(y.unique())}
    
    model = LGBMRegressor(**model_params) if is_regression else LGBMClassifier(**model_params)
    model.fit(X, y)
    
    if not is_regression:
      classes = list(map(str, model.classes_))
    model = model.booster_ # from now on, use booster only
    
    df = None  # release the reference to the training data
    trained = True
    
    TAZI__TRAIN_RESULT = {'shape': tazi_df.shape, 'model_params': model_params}
    

  • Save:

    TAZI__FILES = {}
    if model:
      temp1 = tempfile.mktemp()
      temp2 = tempfile.mktemp()
    
      # LightGBM boosters have their own serialization format
      model.save_model(temp1)
    
      others = {
        'pre_pipe': pre_pipe,
        'classes': classes,
        'num_cols': num_cols,
        'cat_cols': cat_cols
      }
      # dill (rather than joblib/pickle) is needed here because pre_pipe
      # contains lambda-based FunctionTransformers, which plain pickle
      # cannot serialize
      with open(temp2, 'wb') as f:
        dill.dump(others, f)
    
      TAZI__FILES = {
        'model': temp1,
        'others': temp2
      }
    

  • Load:

    model = Booster(model_file=TAZI__FILES['model'])
    with open(TAZI__FILES['others'], 'rb') as f:
      others = dill.load(f)  # matches the dill.dump in the Save section
    classes = others['classes']
    num_cols = others['num_cols']
    cat_cols = others['cat_cols']
    pre_pipe = others['pre_pipe']
    trained = True
    

Add New Model

When you click the + button, you can choose which model to add to the Combiner:

Here, besides the models explained above, you can also add a Python Model. When you click it, you are directed to a new window where you can add your own machine learning model implemented in Python:

  1. Model Name: Type the name you want to give your Python model.
  2. Template: TAZI provides templates so you can easily embed your own model in the platform:
      • Categorical
      • DateTime
      • Numerical/Categorical (Detailed Explanation)
      • Numerical
      • Regression
  3. Libraries: You can see all of the libraries that you can use in your own model.
  4. Files: You can upload a model that you created in another environment.
  5. Input Info: All of the features in your dataset can be seen here.
  6. Output Info: Gives you the necessary information on how to embed your model in TAZI.
  7. Initialization: Initialize your setup by restoring the uploaded model file. You also declare the types of your variables here, and you can do any extra preprocessing before making predictions.
  8. Prediction: Use the data source you provided earlier and apply your Python model to it, following the naming conventions TAZI requires. A minimal sketch follows this list.
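
For orientation, a minimal Initialization/Prediction pair in the spirit of these templates might look like the sketch below. It assumes the same conventions as the sample code earlier (tazi_df for incoming data, tazi_pred for results, TAZI__FILES for uploaded files); the file key and the regression output format are illustrative assumptions:

    # Initialization: restore a model uploaded under Files
    import joblib
    model = joblib.load(TAZI__FILES['my_model'])  # 'my_model' is a hypothetical key

    # Prediction: score the incoming instances
    tazi_pred = [{'regression': p} for p in model.predict(tazi_df)]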

Combiner

The Combiner combines the results of the individual models into a single prediction.

When you click Combiner Parameters, another window opens where you can adjust its settings:


The Model Combination list shows the ways the Combiner can combine the individual model results:

  • Bayesian: Computes the ensemble result according to the Bayesian model combination method.
  • Accuracy Based: Computes the ensemble result according to the accuracies of the classifiers (accuracy weighted by the actual label class weights).
  • Fixed: You can set fixed weights manually, as in the sketch below.
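
As an illustration of how fixed weights act, the ensemble probability is a weight-normalized average of the per-model probabilities. A minimal sketch (model names and weights are hypothetical; TAZI performs this internally):

    import numpy as np

    # Per-model class probabilities for classes ['no', 'yes'] (hypothetical)
    predictions = {'online_tree': np.array([0.30, 0.70]),
                   'dnn':         np.array([0.45, 0.55])}
    weights = {'online_tree': 1.0, 'dnn': 2.0}  # the Weight field on each model box

    combined = sum(weights[m] * p for m, p in predictions.items()) / sum(weights.values())
    print(combined)  # -> [0.4 0.6]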

Explanation Model


TAZI looks at the results of the Combiner and fits a new model to the predictions it makes. This Explanation Model can help you understand how the combined model makes predictions depending on the feature set and feature values. Hence, the Explanation Model is used to understand your model better and may give the business side of your company insight for taking more suitable actions. It does so, as stated, by interpreting the results of the model(s).
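
Conceptually this is a surrogate model: a simple, interpretable model trained on the ensemble's predictions rather than on the ground-truth labels. A generic sketch of the idea (TAZI uses its own tree model for this; the scikit-learn code below is purely illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

    ensemble = RandomForestClassifier(random_state=0).fit(X, y)  # stands in for the Combiner
    surrogate = DecisionTreeClassifier(max_depth=3).fit(X, ensemble.predict(X))
    print(export_text(surrogate))  # readable rules approximating the ensemble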

The success of the explanation model depends heavily on the data you are using and its complexity. For complex data sources with thousands of rows and hundreds of features, a simpler explanation model gives the best results in terms of both performance and micro-segmentation. For simple data sources such as the breast cancer dataset, however, the more complex the explanation model, the better.

Edit: You can activate or deactivate the explanation model.

Additional Parameters: Configures the model parameters. They are identical to the Online Decision Tree parameters listed under Machine Learning Models above.