koho.sklearn.DecisionTreeClassifier

class koho.sklearn.DecisionTreeClassifier(class_balance='balanced', max_depth=None, max_features=None, max_thresholds=None, missing_values=None, random_state=None)[source]

A decision tree classifier.

Parameters
  • class_balance (str 'balanced' or None, optional (default='balanced')) –

    Weighting of the classes.

    • If ‘balanced’, then the values of y are used to automatically adjust class weights inversely proportional to class frequencies in the input data.

    • If None, all classes are supposed to have weight one.

  • max_depth (int or None, optional (default=None)) –

    The maximum depth of the tree.

    The depth of the tree is expanded until the specified maximum depth of the tree is reached, all leaves are pure, or no further impurity improvement can be achieved.

    • If None, the maximum depth of the tree is set to max long (2^31 - 1).

  • max_features (int, float, str or None, optional (default=None)) –

    Note: only to be used by Decision Forest

    The number of random features to consider when looking for the best split at each node.

    • If int, then consider max_features features.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered.

    • If ‘auto’, then max_features = sqrt(n_features).

    • If ‘sqrt’, then max_features = sqrt(n_features).

    • If ‘log2’, then max_features = log2(n_features).

    • If None, then max_features = n_features considering all features in random order.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, continuing through all features if necessary, even if this means effectively inspecting more than max_features features.

    Decision Tree: max_features = None and max_thresholds = None

    Random Tree: max_features < n_features and max_thresholds = None

  • max_thresholds (int 1 or None, optional (default=None)) –

    Note: only to be used by Decision Forest

    The number of random thresholds to consider when looking for the best split at each node.

    • If 1, then consider one random threshold, following the Extremely Randomized Trees (Extra-Trees) formulation.

    • If None, then all thresholds, based on the mid-point of the node samples, are considered.

    Extremely Randomized Trees (ET): max_thresholds = 1

    Totally Randomized Trees: max_features = 1 and max_thresholds = 1, very similar to Perfect Random Trees (PERT).

  • missing_values (str 'NMAR' or None, optional (default=None)) –

    Handling of missing values.

    • If ‘NMAR’ (Not Missing At Random), missing values are handled natively. During training, the split criterion considers missing values as another category, and samples with missing values are passed to either the left or the right child depending on which option provides the best split. During testing, if the split criterion includes missing values, a missing value is passed to the left or right child accordingly; if the split criterion does not include missing values, a missing value is dealt with by combining the results from both children, weighted proportionally to the number of samples passed to each child during training.

    • If None, an error is raised if one of the features has a missing value. An option is to use imputation (fill-in) of missing values prior to using the decision tree classifier.

  • random_state (int or None, optional (default=None)) –

    A random state to control the pseudo-random number generation and the reproducibility of fit().

    • If int, random_state is the seed used by the random number generator;

    • If None, the random number generator is seeded with the current system time.
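A minimal usage sketch, assuming the koho package is installed (the iris data comes from scikit-learn; all constructor arguments shown are documented above):

    from sklearn.datasets import load_iris
    from koho.sklearn import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Defaults: balanced class weights, unlimited depth, all features
    # and all mid-point thresholds considered at each split, and an
    # error raised on missing values.
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X, y)
    print(clf.score(X, y))  # mean accuracy on the training data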

Attributes
  • n_outputs_ (int) – The number of outputs (multi-output).

  • classes_ (list of variable size arrays, shape = [n_classes for each output]) – The class labels for each output.

  • n_classes_ (list of int) – The number of classes for each output.

  • n_features_ (int) – The number of features.

  • max_features_ (int) – The inferred value of max_features.

  • tree_ (tree object) – The underlying estimator.

  • feature_importances_ (array, shape = [n_features]) – The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
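Once fit() has run, these attributes can be inspected directly; a short sketch, continuing from the fitted clf in the iris example above:

    print(clf.n_outputs_)            # 1 for a single-output y
    print(clf.n_classes_)            # classes per output, e.g. [3]
    print(clf.classes_)              # class labels per output
    print(clf.n_features_)           # 4 for iris
    print(clf.feature_importances_)  # normalized, sums to 1.0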

__init__(class_balance='balanced', max_depth=None, max_features=None, max_thresholds=None, missing_values=None, random_state=None)[source]

Create a new decision tree classifier and initialize it with hyperparameters.

export_graphviz(feature_names=None, class_names=None, rotate=False)[source]

Export the decision tree in GraphViz dot format.

Parameters
  • feature_names (list of str, optional (default=None)) – Names of each of the features.

  • class_names (list of str, optional (default=None)) – Names of each of the classes in ascending numerical order. Classes are represented as integers: 0, 1, … (n_classes-1). If y was provided as class labels rather than integers, those labels need to be passed here as class_names.

  • rotate (bool, optional (default=False)) – When set to True, orient tree left to right rather than top-down.

Returns

dot_data – String representation of the decision tree classifier in GraphViz dot format.

Return type

str
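A sketch of rendering the exported dot string; the graphviz Python package used here is an assumption, not part of koho (the feature and class names are those of the iris example above):

    import graphviz  # third-party renderer, assumed installed

    dot_data = clf.export_graphviz(
        feature_names=['sepal length', 'sepal width',
                       'petal length', 'petal width'],
        class_names=['setosa', 'versicolor', 'virginica'],
        rotate=False)
    graphviz.Source(dot_data).render('iris_tree')  # writes iris_tree.pdf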

export_text()[source]

Export the decision tree in a simple text format.

Returns

data – String representation of the decision tree classifier in a simple text format.

Return type

str
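For a quick look at the tree structure without GraphViz, continuing from the fitted clf above:

    print(clf.export_text())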

property feature_importances_

Get feature importances from the decision tree.

fit(X, y)[source]

Build a decision tree classifier from the training data.

Parameters
  • X (array, shape = [n_samples, n_features]) – The training input samples.

  • y (array, shape = [n_samples] or [n_samples, n_outputs]) – The target class labels corresponding to the training input samples.

Returns

self – Returns self.

Return type

object
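fit() also accepts multi-output targets, as the y shape above indicates; a minimal sketch with made-up data:

    import numpy as np
    from koho.sklearn import DecisionTreeClassifier

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([[0, 1],   # y has shape [n_samples, n_outputs]
                  [0, 1],
                  [1, 0],
                  [1, 0]])
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(clf.n_outputs_)  # 2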

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any
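For example, continuing from any instantiated clf with default hyperparameters:

    params = clf.get_params()
    print(params['max_depth'], params['class_balance'])  # None balanced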

predict(X)[source]

Predict classes for the test data.

Parameters

X (array, shape = [n_samples, n_features]) – The test input samples.

Returns

y – The predicted classes for the test input samples.

Return type

array, shape = [n_samples] or [n_samples, n_outputs]
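Continuing from the fitted iris clf above:

    print(clf.predict(X[:3]))  # predicted class labels, shape (3,)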

predict_proba(X)[source]

Predict class probabilities for the test data.

Parameters

X (array, shape = [n_samples, n_features]) – The test input samples.

Returns

p – The predicted class probabilities for the test input samples.

Return type

array, shape = [n_samples x n_classes] or [n_samples x n_outputs x n_classes_max]
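Continuing from the fitted iris clf above:

    proba = clf.predict_proba(X[:3])
    print(proba.shape)  # (3, n_classes) for a single-output problem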

score(X, y)[source]

Returns the mean accuracy on the given test data and labels.

scikit-learn has no metrics support for the “multiclass-multioutput” format, so score() is implemented here directly.

In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that each label set be correctly predicted for each sample.

Parameters
  • X (array-like, shape = (n_samples, n_features)) – Test samples.

  • y (array-like, shape = (n_samples) or (n_samples, n_outputs)) – True labels for X.

Returns

score – Mean accuracy of self.predict(X) with respect to y.

Return type

float
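A sketch of held-out evaluation (train_test_split and the iris data come from scikit-learn):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from koho.sklearn import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # mean accuracy on the test set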

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns

self – Returns self.

Return type

object
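For example, the form used by hyperparameter searches, continuing from any instantiated clf:

    clf.set_params(max_depth=3, random_state=42)  # returns clf itself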