User Guide

koho (Hawaiian word for ‘to estimate’) is a Decision Forest C++ library with a scikit-learn compatible Python interface.

Classification
Numerical (dense) data
Missing values (Not Missing At Random (NMAR))
Class balancing
Multi-Class
Multi-Output (single model)
Build order: depth first
Impurity criteria: gini
n Decision Trees with soft voting
Split a. features: best over k (incl. all) random features
Split b. thresholds: 1 random or all thresholds
Stop criteria: max depth, (pure, no improvement)
Bagging (Bootstrap AGGregatING) with out-of-bag estimates
Important Features
Export Graph
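The feature list above names gini as the impurity criterion. As a rough illustration of what that criterion measures, here is a minimal pure-Python sketch of Gini impurity and of the size-weighted impurity of a binary split. The function names are hypothetical, and the sketch ignores class balancing and missing values, so it should not be read as the library's implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k**2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted average Gini impurity of the two sides of a split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / n


# Three equally frequent classes, as in the iris dataset (50 samples each).
root = [0] * 50 + [1] * 50 + [2] * 50
print(gini_impurity(root))

# A split that isolates class 0 leaves one pure side and one mixed side.
print(split_impurity([0] * 50, [1] * 50 + [2] * 50))
```

A split is attractive when its weighted impurity is well below the impurity of the parent node; a pure node has impurity 0.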

Python

We provide a scikit-learn compatible Python interface.

Classification

The koho library provides the following classifiers:

DecisionTreeClassifier
DecisionForestClassifier

We use the iris dataset provided by scikit-learn for illustration purposes.

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target

>>> from koho.sklearn import DecisionTreeClassifier, DecisionForestClassifier
>>> clf = DecisionForestClassifier(random_state=0)

Decision Tree: max_features=None and max_thresholds=None
Random Tree: max_features<n_features and max_thresholds=None
Extremely Randomized Trees (ET): max_thresholds=1
Totally Randomized Trees: max_features=1 and max_thresholds=1 (very similar to Perfect Random Trees (PERT))
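The mapping above from tree variant to max_features and max_thresholds can be captured in a small lookup helper. This is only a sketch: the helper name and the variant keys are hypothetical, and using the square root of n_features for the Random Tree case is an assumption (for more than one feature it is one of many values below n_features).

```python
def variant_params(variant, n_features):
    """Return the max_features / max_thresholds settings for a tree variant.

    The variant keys are hypothetical names chosen for this sketch; the
    parameter values follow the list above.
    """
    presets = {
        "decision_tree": {"max_features": None, "max_thresholds": None},
        # Assumption: sqrt(n_features) as one possible value < n_features.
        "random_tree": {"max_features": max(1, int(n_features ** 0.5)),
                        "max_thresholds": None},
        "extra_trees": {"max_thresholds": 1},
        "totally_random": {"max_features": 1, "max_thresholds": 1},
    }
    # Return a copy so callers cannot modify the presets.
    return dict(presets[variant])


# The iris dataset has 4 features.
print(variant_params("random_tree", n_features=4))
print(variant_params("totally_random", n_features=4))
```

The returned dictionary could then be passed as keyword arguments, for example DecisionForestClassifier(random_state=0, **variant_params("extra_trees", 4)).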

Training

>>> clf.fit(X, y)
DecisionForestClassifier(bootstrap=False, class_balance='balanced',
         max_depth=3, max_features='auto', max_thresholds=None,
         missing_values=None, n_estimators=100, n_jobs=None,
         oob_score=False, random_state=0)
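The parameter listing above includes bootstrap and oob_score. As background on what bagging with out-of-bag estimates refers to, the following pure-Python sketch draws one bootstrap sample and collects the samples that were never drawn. The function name is hypothetical and the sketch says nothing about how the library samples internally.

```python
import random


def bootstrap_split(n_samples, seed=0):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    # Out-of-bag samples are those that were never drawn for this tree.
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_split(150, seed=0)
print(len(in_bag), len(set(in_bag)), len(out_of_bag))
```

Each tree of a bagged forest is trained on its own in-bag sample, and its out-of-bag samples act as held-out data for that tree. On average roughly 37% of the samples (about 1/e) are out-of-bag for any given tree.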

Feature Importances

>>> feature_importances = clf.feature_importances_
>>> print(feature_importances)
[0.09045256 0.00816573 0.38807981 0.5133019]
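The raw array is easier to read when paired with feature names. The sketch below ranks the values printed above against the iris feature names (the same strings as iris.feature_names in scikit-learn); the numbers are copied from the output above rather than recomputed.

```python
feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]
# Values copied from the printed feature_importances above.
importances = [0.09045256, 0.00816573, 0.38807981, 0.5133019]

# Sort (name, importance) pairs from most to least important.
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)

for name, value in ranked:
    print(f"{name:<20s} {value:.4f}")
```

In this run the two petal measurements carry about 90% of the total importance, and the four values sum to 1.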

Visualize Trees

Export a tree in graphviz format and visualize it using graphviz:

$: conda install python-graphviz

>>> import graphviz
>>> tree_idx = 0
>>> dot_data = clf.estimators_[tree_idx].export_graphviz(
...         feature_names=iris.feature_names,
...         class_names=iris.target_names,
...         rotate=True)
>>> graph = graphviz.Source(dot_data)
>>> graph

Convert the tree to different file formats (e.g. pdf, png):

>>> graph.render("iris", format='pdf')
iris.pdf

Export a tree in a compact textual format:

>>> t = clf.estimators_[tree_idx].export_text()
>>> print(t)
0 X[3]<=0.8 [50, 50, 50]; 0->1; 0->2; 1 [50, 0, 0]; 2 X[3]<=1.75 [0, 50, 50]; 2->3; 2->6; 3 X[2]<=4.95 [0, 49, 5]; 3->4; 3->5; 4 [0, 47, 1]; 5 [0, 2, 4]; 6 X[3]<=1.85 [0, 1, 45]; 6->7; 6->8; 7 [0, 1, 11]; 8 [0, 0, 34];
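The compact format can be read back into a simple node and edge structure. The parser below is a sketch whose grammar is inferred only from the single output line above (semicolon-separated items that are either "id [test] [class counts]" nodes or "parent->child" edges); the function name is hypothetical and other koho outputs may need adjustments.

```python
import re

# Node item: id, optional "X[feature]<=threshold" test, class counts.
NODE_RE = re.compile(
    r"^(\d+)\s+(?:X\[(\d+)\]<=([0-9.eE+-]+)\s+)?\[([^\]]*)\]$")
# Edge item: "parent->child".
EDGE_RE = re.compile(r"^(\d+)->(\d+)$")


def parse_compact_tree(text):
    """Parse the compact text format into (nodes, children) dictionaries."""
    nodes, children = {}, {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        edge = EDGE_RE.match(item)
        if edge:
            parent, child = int(edge.group(1)), int(edge.group(2))
            children.setdefault(parent, []).append(child)
            continue
        node = NODE_RE.match(item)
        if node is None:
            raise ValueError(f"unrecognized item: {item!r}")
        node_id, feature, threshold, counts = node.groups()
        nodes[int(node_id)] = {
            "feature": None if feature is None else int(feature),
            "threshold": None if threshold is None else float(threshold),
            "counts": [float(c) for c in counts.split(",")],
        }
    return nodes, children


text = ("0 X[3]<=0.8 [50, 50, 50]; 0->1; 0->2; 1 [50, 0, 0]; "
        "2 X[3]<=1.75 [0, 50, 50]; 2->3; 2->6; 3 X[2]<=4.95 [0, 49, 5]; "
        "3->4; 3->5; 4 [0, 47, 1]; 5 [0, 2, 4]; 6 X[3]<=1.85 [0, 1, 45]; "
        "6->7; 6->8; 7 [0, 1, 11]; 8 [0, 0, 34];")
nodes, children = parse_compact_tree(text)

# Leaves are nodes that never appear as a parent.
leaves = sorted(n for n in nodes if n not in children)
print(leaves)
```

For the tree above, the class counts of the leaves add up to the counts at the root, which is a quick sanity check on the parse.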

Persistence

>>> import pickle
>>> with open("clf.pkl", "wb") as f:
...     pickle.dump(clf, f)
>>> with open("clf.pkl", "rb") as f:
...     clf2 = pickle.load(f)

Classification

>>> c = clf2.predict(X)
>>> print(c)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

>>> cp = clf2.predict_proba(X)
>>> print(cp)
[[1.         0.         0.        ]
 [1.         0.         0.        ]
 [1.         0.         0.        ]
 ...
 [0.         0.01935722 0.98064278]
 [0.         0.01935722 0.98064278]
 [0.         0.09155897 0.90844103]]
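The feature list describes the forest as n decision trees with soft voting. A minimal sketch of that idea, assuming soft voting means averaging the per-tree class probabilities and picking the class with the highest mean, is shown next to hard (majority) voting for contrast. The probabilities are made-up values, and the function names are hypothetical.

```python
def soft_vote(per_tree_proba):
    """Average class probabilities over trees; return (mean, winning class)."""
    n_trees = len(per_tree_proba)
    n_classes = len(per_tree_proba[0])
    mean = [sum(p[k] for p in per_tree_proba) / n_trees
            for k in range(n_classes)]
    return mean, max(range(n_classes), key=mean.__getitem__)


def hard_vote(per_tree_proba):
    """Each tree votes for its most probable class; the majority wins."""
    n_classes = len(per_tree_proba[0])
    votes = [0] * n_classes
    for p in per_tree_proba:
        votes[max(range(n_classes), key=p.__getitem__)] += 1
    return max(range(n_classes), key=votes.__getitem__)


# Made-up probabilities for one sample from three trees (two classes).
per_tree = [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]]
mean, soft_label = soft_vote(per_tree)
print(mean, soft_label, hard_vote(per_tree))
```

Here one confident tree outweighs two hesitant ones under soft voting (class 0), while majority voting picks class 1.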

Testing

>>> score = clf2.score(X, y)
>>> print("Score: %f" % score)
Score: 0.966667
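For scikit-learn classifiers, score conventionally returns the mean accuracy, and the value above is consistent with that reading: 0.966667 corresponds to 145 correct predictions out of 150. The sketch below computes accuracy by hand on stand-in data; the positions of the five errors are hypothetical and chosen only to reproduce the ratio.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Stand-in labels: 150 samples, three classes of 50 samples each.
y_true = [0] * 50 + [1] * 50 + [2] * 50
y_pred = list(y_true)
# Hypothetical error positions: flip five predictions.
for i in (70, 106, 119, 133, 134):
    y_pred[i] = 2 if y_true[i] == 1 else 1

print("Score: %f" % accuracy(y_true, y_pred))
```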

scikit-learn’s ecosystem

Pipeline

>>> from sklearn.pipeline import make_pipeline
>>> pipe = make_pipeline(DecisionForestClassifier(random_state=0))
>>> pipe.fit(X, y)
>>> pipe.predict(X)
>>> pipe.predict_proba(X)
>>> score = pipe.score(X, y)
>>> print("Score: %f" % score)
Score: 0.966667

Grid Search

>>> from sklearn.model_selection import GridSearchCV
>>> parameters = [{'n_estimators': [10, 20],
...                'bootstrap': [False, True],
...                'max_features': [None, 1],
...                'max_thresholds': [None, 1]}]
>>> grid_search = GridSearchCV(DecisionForestClassifier(random_state=0), parameters)
>>> grid_search.fit(X, y)
>>> print(grid_search.best_params_)
{'bootstrap': False, 'max_features': None, 'max_thresholds': 1, 'n_estimators': 10}
>>> clf = DecisionForestClassifier(random_state=0)
>>> clf.set_params(**grid_search.best_params_)
>>> clf.fit(X, y)
>>> score = clf.score(X, y)
>>> print("Score: %f" % score)
Score: 0.966667
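To see how many candidates the grid above contains, the parameter combinations can be enumerated by hand. This sketch uses only the standard library and the grid values from the example; the helper name is hypothetical, and it illustrates the Cartesian product that a grid search walks through rather than how GridSearchCV is implemented.

```python
from itertools import product

# Same grid as in the example above (a single dictionary of value lists).
parameters = {'n_estimators': [10, 20],
              'bootstrap': [False, True],
              'max_features': [None, 1],
              'max_thresholds': [None, 1]}


def expand_grid(grid):
    """Return every combination of parameter values as a list of dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(parameters)
print(len(candidates))
print(candidates[0])
```

With two values for each of four parameters there are 2 * 2 * 2 * 2 = 16 candidates, each of which is fit once per cross-validation fold.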

Parallel Processing (joblib + dask)

Install and set up dask:

$: conda install dask distributed

>>> from dask.distributed import Client
>>> client = Client()

>>> clf = DecisionForestClassifier(random_state=0)
>>> from joblib import parallel_backend
>>> with parallel_backend('dask', n_jobs=-1):  # 'loky' when not using dask
...     clf.fit(X, y)
...     score = clf.score(X, y)
>>> print("Score: %f" % score)
Score: 0.966667

View progress with dask:

Open the dashboard in a web browser (e.g. Firefox): http://localhost:8787/status

C++

We provide a C++ library.