
Data Profiler

Data profiling is the first step in a machine learning pipeline, and it is crucial for gaining better insight into your dataset. TAZI Profiler is a preprocessing data profiling tool that examines the data source you provide and gives key recommendations to ensure that the dataset is formatted properly and data quality issues are resolved.

Profiler helps you discover anomalies in your dataset through exploratory data analysis. It plots histograms and reports key statistics for your features, while also helping you understand the underlying relationships between them.

After giving you the full picture of your dataset, it offers recommendations that you can accept and apply to your dataset before feeding it to the machine learning algorithms.

You can start TAZI Profiler by clicking the Start Profiler button. A window will pop up to let you set profiler parameters:

  • Instance Count is the number of rows you want the Profiler to process. Default value is the size of your dataset.
  • If you disable Work in Synch Mode, profiling runs as a background process. If it is enabled, you must wait for profiling to complete before proceeding to the next step.

  • When you set Enable Composite Recommendations to TRUE, Profiler will create composite features from the existing feature set. However, profiling may take longer depending on the size of your dataset.
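Conceptually, a composite feature is a new column derived from existing ones. As a purely illustrative sketch (the column names and the combination rule below are hypothetical, not TAZI's actual logic):

```python
# Illustrative only: deriving a "composite" feature from two existing columns.
rows = [
    {"income": 50000, "debt": 10000},
    {"income": 80000, "debt": 40000},
]

for row in rows:
    # Hypothetical composite: a debt-to-income ratio, the kind of derived
    # feature a profiler might propose from the existing feature set.
    row["debt_to_income"] = row["debt"] / row["income"]

print([r["debt_to_income"] for r in rows])  # [0.2, 0.5]
```

Because composites are computed over every row, enabling them multiplies the work the Profiler does, which is why profiling may take longer on large datasets.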

After you click the Submit button, data profiling will start and when it is done, you'll be directed to the main page of the Profiler:

1. Profiler shows some information and statistics for each feature:

  • Relevance: Strong relevance of a feature indicates that the feature is always necessary for an optimal subset; it cannot be removed without affecting the original conditional class distribution. Weak relevance suggests that the feature is not always necessary but may become necessary for an optimal subset under certain conditions. Relevance is between 0 and 0.5.

  • Entropy: Entropy is the measure of impurity, disorder, or uncertainty in a set of data points, and it controls how a Decision Tree decides to split the data. It affects how a Decision Tree draws its boundaries.

  • Processed Count: Number of instances that are processed.

  • Empty Count: Shows how many instances are empty for that feature.

  • Malformed Count: Instance count of malformed data that cannot be read or correctly processed.

  • Number of Unique Values: The count of unique values of the feature.

  • Minimum: Minimum value of that feature.

  • Maximum: Maximum value of that feature.

  • Mean: The "average" number; found by adding all data points and dividing by the number of data points.

  • Median: The middle number; found by ordering all data points and picking out the one in the middle (or if there are two middle numbers, taking the mean of those two numbers).

  • Mode: The most frequent number—that is, the number that occurs the highest number of times. If more than one value occurs most often, there is a tie for the mode.

  • Standard Deviation: Standard deviation is a measure of the spread of the distribution. It shows how the feature values deviate from its mean. If the standard deviation is large, this means the spread is big.

  • Skewness: Skewness is asymmetry in a statistical distribution, in which the curve appears distorted or skewed either to the left or to the right. Skewness can be quantified to define the extent to which a distribution differs from a normal distribution. In a normal distribution, the graph appears as a classical, symmetrical "bell-shaped curve". The mean, or average, and the mode, or maximum point on the curve, are equal.

  • Kurtosis: Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. Data sets with high kurtosis tend to have heavy tails, or outliers; data sets with low kurtosis tend to have light tails, or a lack of outliers. A uniform distribution is the extreme case.
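Several of the statistics above can be sketched in plain Python. This is a minimal illustration following the standard textbook formulas, not TAZI's internal implementation:

```python
import math
from collections import Counter
from statistics import mean, median, mode, pstdev

def entropy(values):
    """Shannon entropy (in bits) of a feature's value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def skewness(values):
    """Fisher-Pearson skewness: 0 for a perfectly symmetric distribution."""
    m, s, n = mean(values), pstdev(values), len(values)
    return sum(((x - m) / s) ** 3 for x in values) / n

def kurtosis(values):
    """Excess kurtosis: ~0 for a normal distribution, negative for light tails."""
    m, s, n = mean(values), pstdev(values), len(values)
    return sum(((x - m) / s) ** 4 for x in values) / n - 3

feature = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(mean(feature), median(feature), mode(feature))  # 3 3 3
print(entropy(feature))    # higher entropy = more disorder
print(skewness(feature))   # ~0 for this symmetric sample
print(kurtosis(feature))   # negative: lighter tails than a normal
```

Running a profiler over a column amounts to computing this kind of summary for every feature, plus the parse counts (processed, empty, malformed) gathered while reading the data.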

2. You can sort the profiling results according to a criterion. The default is the relevance of the feature.

3. By clicking Show all recommendations, you'll be directed to the recommendations page. You can select any number of recommendations and click APPLY to apply the recommended transformations to the feature(s). The information icon next to the APPLY button shows the reason why that particular recommendation was given for the feature.

4. Show occurrence matrix displays the unique value counts and the percentage of each feature's values.

5. You can select and view individual recommendations.

After you have seen the profiler results and accepted the recommendations that you deem necessary, you can close the main page of the Profiler and continue configuring your Business Model.

Profiler Comparison

The profiler comparison feature enables users to compare the profile of the data used to train a model with the profile of a dataset designated for testing, so that changes within the data can be observed. How to use: assuming the user has access to two datasets, labeled retention_7k and retention_3k, the following explains how to employ this feature:

1- Access the trained model starting with the base data we use (retention_7k).

Note: Upon the completion of a run, a new profiler result is automatically created based on the training data used to train the model.

2- A test run is initiated from within our main model as described below.

3- A short while after clicking the 'Submit' button, a test model is automatically generated and begins, as shown in the screen below.

4- Once the test model is completed, the 'Profiler' button specified below is clicked to transition to the profiler screen.

5- Click the 'Start Profiler Model' button.

6- With the base profiler selected, a new profiler is run for the test model.

  • Click on 'SHOW ADVANCED OPTIONS'.
  • On this screen, we can see the profiler results that were run after the 'Main' model. The profiler result used for comparison will be selected as the 'Base Profiler'.
  • Click Start to run the profiler for the test model data.
  • Using 'Start from' and 'Count', a certain portion of the data can be profiled and compared.

7- After the profiler run for the test model is completed, click the Compare button as indicated below.

8- The profiler run for the test model is selected as the target and the Compare button is pressed.

9- The results can be observed on this screen. As shown below, we can observe drift for the 'Credit Score' feature in the profiler comparison of retention_7k and retention_3k.

On the comparison screen, there are two charts available. The chart on the left is a scatter plot modeling the relationship between the drift score and the feature importance for each feature. The chart on the right displays a histogram for the selected feature, where each bin contains the corresponding values from the base and target datasets. Users can explore the combined histogram of any feature either by clicking on the Feature dropdown and selecting the desired feature, or by clicking the corresponding point on the scatter plot.

By clicking on Advanced Details, users can explore a more in-depth analysis of each feature comparison.

The first section displays the missing and additional features in the two datasets. Drift metrics and statistical comparisons will not be calculated for these features.
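Conceptually, this first check is a set difference over the two feature lists. A small sketch with illustrative column names:

```python
# Feature names of the two profiler results (column names are illustrative).
base_features = {"Age", "Credit Score", "Suspension", "Region"}
target_features = {"Age", "Credit Score", "Suspension", "Channel"}

missing = base_features - target_features     # in base, absent from target
additional = target_features - base_features  # new in target

# Only the intersection takes part in drift and statistical comparison.
common = base_features & target_features
print(missing)     # {'Region'}
print(additional)  # {'Channel'}
```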

In the Feature Comparison section, there is a table of all the common features and the fundamental statistical differences observed between the two datasets (profiler results).

You can find the description of each column below:

  • name: Name of the feature that is being compared.
  • type: Whether the feature is a discrete or continuous feature.
  • critical_count_change: If a critical threshold is specified (e.g. 20%) when comparing two profiler results, this column displays the number of metrics (e.g. entropy change, minimum value change, number of unique values) that exhibit a greater change than the specified threshold for that feature.
  • kl_divergence: Drift metric that measures the relative entropy. For further details, you can visit the following links:
    https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8
    https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained
    https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
  • malformed_change: Change (%) in the percentage of malformed (unparseable) instances of a feature.
  • entropy_change: Change (%) in the entropy of the feature.
  • processed_count_change: Change (%) in the number of successfully parsed instances.
  • empty_count_change: Change (%) in the number of instances with empty values.
  • unique_value_change: Change (%) in the number of unique values.
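To make the kl_divergence column concrete, here is a minimal sketch of KL divergence over two discrete value distributions (standard definition, not TAZI's exact implementation; the smoothing constant is an illustrative choice):

```python
import math
from collections import Counter

def kl_divergence(base, target, eps=1e-9):
    """D_KL(base || target): relative entropy in bits.
    0 when the two distributions are identical; grows as they diverge."""
    p, q = Counter(base), Counter(target)
    n_p, n_q = len(base), len(target)
    total = 0.0
    for value in set(base) | set(target):
        pc = p[value] / n_p
        qc = max(q[value] / n_q, eps)  # smooth zero bins so the ratio stays finite
        if pc > 0:
            total += pc * math.log2(pc / qc)
    return total

same = ["a", "a", "b", "b"]
drifted = ["a", "a", "a", "b"]
print(kl_divergence(same, same))     # 0.0 — identical distributions
print(kl_divergence(same, drifted))  # > 0 — the distribution has shifted
```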

After clicking on a feature, users can see a more in-depth explanation of each metric comparison.

  • referenced_value refers to the calculated metric in the base profiler result
  • target_value refers to the calculated metric in the target profiler result

If a target is specified for both datasets, a relevance comparison will also be available.

For the Suspension feature, the unique value count was 52 in the base dataset and 67 in the target dataset. Since the percent change (32%) surpasses the specified threshold, the metric is highlighted in red.

The following comparisons depend on the type of the feature being compared (continuous or discrete). For discrete features, the most common values are compared. The number of most common values to compare can be set with the Number of most frequent values parameter in the advanced options when instantiating the comparison.

In this case, the 20 most common values are compared. Missing Values are values that appear among the top 20 most common values in the target dataset but not in the top 20 for the base dataset, and vice versa for the New Values.
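The top-N comparison can be sketched with collections.Counter (the values and the small N here are illustrative):

```python
from collections import Counter

def top_n_values(values, n):
    """The set of the n most frequent values of a discrete feature."""
    return {value for value, _ in Counter(values).most_common(n)}

base = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
target = ["A"] * 4 + ["B"] * 4 + ["D"] * 2

base_top = top_n_values(base, 3)      # {'A', 'B', 'C'}
target_top = top_n_values(target, 3)  # {'A', 'B', 'D'}

missing_values = target_top - base_top  # in target's top N, not in base's
new_values = base_top - target_top      # in base's top N, not in target's
print(missing_values)  # {'D'}
print(new_values)      # {'C'}
```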

After clicking on a continuous feature, alongside the comparison of the base metrics, users can also examine the comparison of numeric metrics.

The full list of detailed comparisons for continuous features: