Implementation

scikit-learn compatible

We rolled our own scikit-learn compatible estimator following the "Rolling your own estimator" instructions and using the project template provided by scikit-learn.

We are trying to be consistent with scikit-learn’s decision tree and ensemble modules.
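Compatibility with the scikit-learn API can be verified with scikit-learn's check_estimator() test suite. A minimal sketch, shown here against scikit-learn's own DecisionTreeClassifier as a placeholder for a custom estimator:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.estimator_checks import check_estimator

# check_estimator runs scikit-learn's full API compliance suite against an
# estimator instance; replace DecisionTreeClassifier with the custom estimator
check_estimator(DecisionTreeClassifier())
```

If the estimator violates any API convention, check_estimator() raises an exception pointing at the failing check.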

Exceptions

We used class_balance as the hyperparameter name instead of class_weight.

The class_weight hyperparameter name is recognized by check_estimator(), which then runs the test check_class_weight_classifiers(). That test uses the dict form of the parameter and, for a decision tree, requires the min_weight_fraction_leaf hyperparameter to be implemented in order to pass.
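For reference, the dict form exercised by that check looks like this in scikit-learn's own tree classifier (shown purely as an illustration of the interaction, not of our estimator):

```python
from sklearn.tree import DecisionTreeClassifier

# the dict form of class_weight that check_class_weight_classifiers()
# exercises; min_weight_fraction_leaf operates on these sample weights
clf = DecisionTreeClassifier(
    class_weight={0: 1.0, 1: 10.0},  # weight class 1 ten times heavier
    min_weight_fraction_leaf=0.1,    # each leaf must hold >= 10% of total weight
)
clf.fit([[0], [0], [1], [1]], [0, 0, 1, 1])
```

Because the leaf constraint is expressed as a fraction of the total sample weight, it can only be honored if the tree implementation tracks weighted sample counts, which is why the check demands min_weight_fraction_leaf.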

We provide and use the same random number generator in Python as in our C++ implementation.
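The generator itself is defined in the C++ code. To illustrate why sharing it matters, here is a minimal Lehmer-style LCG with std::minstd_rand parameters (an assumed stand-in for illustration, not necessarily the generator we ship): ported line by line, it produces bit-identical sequences in both languages.

```python
class MinstdRand:
    """Lehmer-style LCG with std::minstd_rand parameters; an illustrative
    stand-in showing how a C++ generator can be mirrored exactly in Python."""

    MODULUS = 2147483647   # 2**31 - 1
    MULTIPLIER = 48271

    def __init__(self, seed=1):
        self.state = seed

    def next(self):
        # identical integer arithmetic in C++ and Python yields identical streams
        self.state = (self.state * self.MULTIPLIER) % self.MODULUS
        return self.state
```

Seeding both implementations identically then makes training runs reproducible across the two languages.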

Basic Concepts

The basic concepts used to implement the classifiers, including the stack, the samples LUT with in-place partitioning, and the incremental histogram updates, are based on:

  1. Louppe, Understanding Random Forests, PhD Thesis, 2014
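The samples LUT idea can be sketched as follows (a simplified illustration, not our actual code): each tree node owns a contiguous slice of an index array, and splitting a node only swaps indices within that slice, so no sample data is ever copied. The per-class histograms can then be updated incrementally as indices move between the two halves.

```python
def partition_in_place(samples, start, end, X, feature, threshold):
    """Partition the samples LUT slice [start, end) in place so that indices
    of samples with X[i][feature] <= threshold end up in [start, pos) and the
    remaining indices in [pos, end). Returns the split position pos."""
    pos = start
    for i in range(start, end):
        if X[samples[i]][feature] <= threshold:
            # swap the qualifying index into the left part of the slice
            samples[i], samples[pos] = samples[pos], samples[i]
            pos += 1
    return pos
```

The left child then owns [start, pos) and the right child [pos, end), and both can recurse on their slices independently.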

Not Missing At Random (NMAR)

The probability of an instance having a missing value for a feature may depend on the value of that feature.

Training: The split criterion considers missing values as another category, and samples with missing values are passed to either the left or the right child, depending on which option provides the best split.

Testing: If the split criterion includes missing values, a missing value is dealt with accordingly (passed to the left or right child). If the split criterion does not include missing values, a missing value at a split criterion is dealt with by combining the results from both children proportionally to the number of samples that were passed to the children during training (same as MCAR, Missing Completely At Random).

Note that the number of samples that are passed to the children represents the feature’s estimated probability distribution for the particular missing value based on the training data.
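The two testing cases above can be sketched with a hypothetical node representation (the dict layout and field names are illustrative assumptions, not our actual data structures):

```python
import math

def predict_proba(node, x):
    """Return class probabilities for sample x from a (hypothetical) tree.
    Leaf nodes carry 'proba'; internal nodes carry 'feature', 'threshold',
    children, their training sample counts, and 'missing_left' when the
    split itself routes missing values (NMAR case)."""
    if node["leaf"]:
        return node["proba"]
    v = x[node["feature"]]
    if not math.isnan(v):
        child = node["left"] if v <= node["threshold"] else node["right"]
        return predict_proba(child, x)
    if node["missing_left"] is not None:
        # NMAR: the split learned during training where missing values go
        child = node["left"] if node["missing_left"] else node["right"]
        return predict_proba(child, x)
    # MCAR fallback: combine both children, weighted by the number of
    # training samples each child received
    n = node["n_left"] + node["n_right"]
    wl, wr = node["n_left"] / n, node["n_right"] / n
    pl = predict_proba(node["left"], x)
    pr = predict_proba(node["right"], x)
    return [wl * a + wr * b for a, b in zip(pl, pr)]
```

In the MCAR fallback, the weights n_left / n and n_right / n are exactly the estimated probability distribution mentioned above.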