Implementation

scikit-learn compatible

We rolled our own scikit-learn compatible estimator, following scikit-learn’s “Rolling your own estimator” instructions and using the project template it provides.

We are trying to be consistent with scikit-learn’s decision tree and ensemble modules.

Exceptions

We use class_balance as the hyperparameter name instead of class_weight.

check_estimator() recognizes the class_weight hyperparameter name and then runs the test check_class_weight_classifiers(). This test uses the dict parameter and, for a decision tree, requires the “min_weight_fraction_leaf” hyperparameter to be implemented in order to pass.

Only class_weights (not sample_weights) are used. For multi-output problems handled by a single model, class_weights are calculated and applied separately for each output.

By contrast, scikit-learn’s compute_class_weight() and compute_sample_weight() functions multiply the per-output sample_weights of a multi-output problem together into a single sample_weight.
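To make the difference concrete, here is a minimal pure-Python sketch, not the library's actual code. It assumes the "balanced" weighting formula n_samples / (n_classes * class_count); the helper name balanced_class_weights and the toy labels are hypothetical. It computes class weights separately per output and, for comparison, the single multiplied sample weight.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Balanced weights for one output: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical toy data: four samples, two outputs per sample.
y = [[0, 1],
     [0, 1],
     [0, 0],
     [1, 1]]
n_outputs = 2

# Per-output class weights, kept separate for each output.
per_output = [balanced_class_weights([row[k] for row in y])
              for k in range(n_outputs)]

# For comparison: one combined weight per sample, the product over outputs.
combined = [per_output[0][row[0]] * per_output[1][row[1]] for row in y]
```

With separate weights, a class that is rare in one output keeps its full weight for that output, regardless of how common the sample's class is in the other output.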

We override the score() function instead of using the inherited score() function, which relies on scikit-learn’s metrics module.

Scikit-learn’s metrics module does not support the “multiclass-multioutput” format.
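As an illustration only, one plausible way to score “multiclass-multioutput” predictions is to compute the accuracy of each output separately and average the results. The function name mean_accuracy_per_output and this definition are assumptions, not necessarily what the overridden score() does.

```python
def mean_accuracy_per_output(y_true, y_pred):
    """Average of the per-output accuracies (assumed scoring rule)."""
    n_samples = len(y_true)
    n_outputs = len(y_true[0])
    accuracies = []
    for k in range(n_outputs):
        # Count samples whose k-th output was predicted correctly.
        hits = sum(1 for t, p in zip(y_true, y_pred) if t[k] == p[k])
        accuracies.append(hits / n_samples)
    return sum(accuracies) / n_outputs


# Hypothetical toy data: four samples, two multiclass outputs.
y_true = [[0, 2], [1, 1], [2, 0], [0, 1]]
y_pred = [[0, 2], [1, 0], [2, 0], [1, 1]]
score = mean_accuracy_per_output(y_true, y_pred)
```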

We expose the Random Number Generator of our C++ implementation in Python and use it there as well.

Cython (Python bindings for C++ library)

To allow all data to be passed by reference (not copied) between Python and C++, multi-dimensional arrays in Python, such as X, y and class_weights, are handled as one-dimensional arrays in C++. The downside is that this is not a very elegant programming style for our C++ library.
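The indexing that results from this design can be sketched in Python. The example assumes a row-major (C-contiguous) layout, which is an assumption about the library; the helper at() is hypothetical. It only illustrates the index arithmetic, since building the flat array here does copy the data.

```python
from array import array

# Hypothetical 2-D feature matrix: 3 samples, 2 features.
n_samples, n_features = 3, 2
X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

# One contiguous buffer of doubles in row-major order.
X_flat = array("d", (value for row in X for value in row))


def at(flat, i, j, n_cols):
    """Element (i, j) of a row-major 2-D array stored as a flat buffer."""
    return flat[i * n_cols + j]


last_value = at(X_flat, 2, 1, n_features)  # same element as X[2][1]
```

The C++ side then needs only a pointer to the buffer plus the dimensions, at the cost of writing the i * n_cols + j arithmetic by hand.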

Basic Concepts

The basic concepts for the implementation of the classifiers, including the stack, the samples LUT (look-up table) with in-place partitioning, and incremental histogram updates, are based on:

  1. Louppe, Understanding Random Forests, PhD Thesis, 2014
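The samples LUT with in-place partitioning mentioned above can be sketched as follows. This is a simplified illustration under assumed names (partition_samples, feature_values), not the library's code: each node owns a slice of a shared index array, and a split reorders that slice so that the left child's samples come first.

```python
def partition_samples(samples, start, end, feature_values, threshold):
    """Reorder samples[start:end] in place around a threshold.

    Indices whose feature value is <= threshold are moved to the front.
    Returns the position where the right child's slice begins.
    """
    i, j = start, end
    while i < j:
        if feature_values[samples[i]] <= threshold:
            i += 1  # already on the left side
        else:
            j -= 1  # swap to the end of the slice
            samples[i], samples[j] = samples[j], samples[i]
    return i


# Hypothetical toy data: one feature for five samples.
feature_values = [0.9, 0.1, 0.5, 0.7, 0.3]
samples = [0, 1, 2, 3, 4]  # the look-up table of sample indices

split = partition_samples(samples, 0, len(samples), feature_values, 0.5)
left_child = samples[:split]   # indices with value <= 0.5
right_child = samples[split:]  # indices with value > 0.5
```

Because only indices are swapped, the training data itself is never copied, and each child node is fully described by a (start, end) pair into the same array.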

Not Missing At Random (NMAR)

The probability of an instance having a missing value for a feature may depend on the value of that feature.

Training: The split criterion treats missing values as an additional category. Samples with missing values are passed to either the left or the right child, whichever provides the better split.
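A minimal sketch of this choice for a single candidate threshold follows. It assumes Gini impurity as the split criterion and None as the missing-value marker; both are illustrative assumptions, as is the function name best_missing_direction.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split, weighted by the size of each child."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def best_missing_direction(values, labels, threshold):
    """Try sending missing values left, then right; keep the purer split."""
    pairs = list(zip(values, labels))
    left = [y for v, y in pairs if v is not None and v <= threshold]
    right = [y for v, y in pairs if v is not None and v > threshold]
    missing = [y for v, y in pairs if v is None]
    to_left = weighted_gini(left + missing, right)
    to_right = weighted_gini(left, right + missing)
    return ("left", to_left) if to_left <= to_right else ("right", to_right)


# Hypothetical toy data: missing values co-occur with class 1.
values = [1.0, 2.0, None, 8.0, 9.0, None]
labels = [0, 0, 1, 1, 1, 1]
direction, impurity = best_missing_direction(values, labels, 5.0)
```

In this toy data the missing values carry class information, so routing them to the right child yields a pure split, which is the situation NMAR handling is meant to exploit.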

Testing: If the split criterion includes missing values, a missing value is handled accordingly (passed to the left or the right child). If the split criterion does not include missing values, a missing value at that split is handled by combining the results from both children in proportion to the number of samples that were passed to each child during training (the same as for MCAR, Missing Completely At Random).

Note that the proportion of training samples passed to each child serves as an estimate, based on the training data, of the feature’s probability distribution for the particular missing value.
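The test-time behavior described above can be sketched with a small recursive predictor. The Node fields (n_left, n_right, missing_goes) and the use of None for a missing value are hypothetical names chosen for this illustration, not the library's data structures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    proba: Optional[List[float]] = None  # class probabilities (leaf only)
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    n_left: int = 0    # training samples sent to the left child
    n_right: int = 0   # training samples sent to the right child
    missing_goes: Optional[str] = None  # "left", "right", or None


def predict_proba(node, x):
    if node.left is None:  # leaf node
        return node.proba
    value = x[node.feature]
    if value is None:  # missing value at this split
        if node.missing_goes == "left":
            return predict_proba(node.left, x)
        if node.missing_goes == "right":
            return predict_proba(node.right, x)
        # Split saw no missing values in training: blend both children
        # in proportion to their training sample counts.
        p_left = predict_proba(node.left, x)
        p_right = predict_proba(node.right, x)
        n = node.n_left + node.n_right
        return [(node.n_left * a + node.n_right * b) / n
                for a, b in zip(p_left, p_right)]
    child = node.left if value <= node.threshold else node.right
    return predict_proba(child, x)


# Hypothetical tree: one split, 30 training samples left and 10 right.
root = Node(feature=0, threshold=5.0,
            left=Node(proba=[1.0, 0.0]), right=Node(proba=[0.0, 1.0]),
            n_left=30, n_right=10)
blended = predict_proba(root, [None])
```

Here the blended result weights the left leaf by 30/40 and the right leaf by 10/40, reflecting how the training samples were distributed at that split.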