Clustering Tree Learner (tree)

ClusteringTreeLearner is an implementation of classification and regression trees, based on the SimpleTreeLearner. It is implemented in C++ for speed and low memory usage. Clustering trees work by splitting the data into clusters based on attribute values; at each node, the attribute that provides the optimal split according to a measure is chosen. The default measure in this implementation is the Euclidean distance between the centroids of the resulting clusters, which the learner tries to maximize. Additional measures are also implemented; more information on them can be found in the parameter description.
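For illustration, the default measure can be sketched in a few lines of Python. This is a simplified, standalone computation of the Euclidean distance between cluster centroids; the function names and data below are ours, not part of the library:

def centroid(cluster):
    # component-wise mean of the (multi-target) value vectors in a cluster
    n = float(len(cluster))
    return [sum(values) / n for values in zip(*cluster)]

def inter_distance(left, right):
    # Euclidean distance between the centroids of the two clusters
    # produced by a candidate split; the learner maximizes this value
    c1, c2 = centroid(left), centroid(right)
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

print inter_distance([[0.0, 0.0], [0.0, 2.0]], [[3.0, 4.0], [3.0, 6.0]])  # 5.0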

The implementation is based on the article by Blockeel et al. [1].

ClusteringTreeLearner was developed for speeding up the construction of random forests, but can also be used as a standalone tree learner.

class Orange.multitarget.tree.ClusteringTreeLearner
min_majority

Minimal proportion of the majority class value that each of the class variables has to reach for induction to stop (used only in classification).

min_MSE

Minimal mean squared error that each of the class variables has to reach for induction to stop (used only in regression).

min_instances

Minimal number of instances in leaves. The instance count is weighted.

max_depth

Maximal depth of the tree.

method

The method used when choosing split attributes while building the tree. The parameter should be supplied either as an integer (from 0 to 3) or as Orange.multitarget.tree. followed by the name of the measure (as shown in the examples below). Possible choices are:

  • inter_distance (default) - Euclidean distance between centroids of clusters
  • intra_distance - average Euclidean distance of each member of a cluster to the centroid of that cluster
  • silhouette - silhouette measure (http://en.wikipedia.org/wiki/Silhouette_(clustering)) calculated with Euclidean distances between cluster centroids instead of between every member of a cluster; a rough sketch is given after this list
  • gini_index - calculates the Gini-gain index; should be used with nominal class variables
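The silhouette variant above is only loosely specified, so the following is a rough sketch of one plausible centroid-based formulation, in which a(i) is the distance of an instance to its own cluster's centroid and b(i) the distance to the other cluster's centroid. This is our reading, not necessarily the exact formula used by the C++ implementation:

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def centroid(cluster):
    n = float(len(cluster))
    return [sum(values) / n for values in zip(*cluster)]

def silhouette(left, right):
    # s(i) = (b(i) - a(i)) / max(a(i), b(i)), averaged over all instances
    scores = []
    for own, other in ((left, right), (right, left)):
        c_own, c_other = centroid(own), centroid(other)
        for instance in own:
            a = euclidean(instance, c_own)
            b = euclidean(instance, c_other)
            d = max(a, b)
            scores.append((b - a) / d if d > 0 else 0.0)
    return sum(scores) / len(scores)

As with the other distance-based measures, a larger silhouette value indicates a better split.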
skip_prob

At every split an attribute will be skipped with probability skip_prob. Useful for building random forests.

random_generator

Provide your own Orange.misc.Random.
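Neither skip_prob nor random_generator appears in the examples below, so here is a minimal sketch of how the two might be combined when building randomized trees. The parameter values are illustrative, and we assume Orange.misc.Random accepts an initial seed:

import Orange

# skip roughly half of the candidate attributes at each split and fix
# the random seed so that repeated runs build the same trees (values
# here are illustrative, not recommendations)
rnd_tree = Orange.multitarget.tree.ClusteringTreeLearner(
	skip_prob = 0.5, random_generator = Orange.misc.Random(42),
	name = "randomized CT")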

Examples

ClusteringTreeLearner can be used on its own or as a base learner in a random forest; examples of both usages are shown below.

import Orange
data = Orange.data.Table('multitarget:bridges.tab')

# a baseline that predicts the majority value of each class variable
majority = Orange.multitarget.binary.BinaryRelevanceLearner(
	learner = Orange.classification.majority.MajorityLearner, name = "Majority")

clust_tree = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_majority = 0.6, min_instances = 5, 
	method = Orange.multitarget.tree.inter_distance, name = "CT inter dist")

# we can use different distance measuring methods
ct2 = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_majority = 0.6, min_instances = 5, 
	method = Orange.multitarget.tree.intra_distance, name = "CT intra dist")

ct3 = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_majority = 0.6, min_instances = 5, 
	method = Orange.multitarget.tree.silhouette, name = "CT silhouette")

# Gini index should be used when working with nominal class variables
ct4 = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_majority = 0.6, min_instances = 5, 
	method = Orange.multitarget.tree.gini_index, name = "CT gini index")


# forests work better if trees are pruned less
forest_tree = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_majority = 1.0, min_instances = 3)
clust_forest = Orange.ensemble.forest.RandomForestLearner(
	base_learner = forest_tree, trees = 50, name = "Clustering Forest")

learners = [ majority, clust_tree, ct2, ct3, ct4, clust_forest ]

results = Orange.evaluation.testing.cross_validation(learners, data, folds=5)

print "Classification - bridges.tab"
print "%17s  %6s  %8s  %8s" % ("Learner", "LogLoss", "Mean Acc", "Glob Acc")
for i in range(len(learners)):
    print "%17s  %1.4f    %1.4f    %1.4f" % (learners[i].name,
    Orange.multitarget.scoring.mt_average_score(results, Orange.evaluation.scoring.logloss)[i],
    Orange.multitarget.scoring.mt_mean_accuracy(results)[i],
    Orange.multitarget.scoring.mt_global_accuracy(results)[i])

# regression uses a different parameter for pruning - min_MSE instead of min_majority
clust_tree = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_MSE = 0.05, min_instances = 5, name = "Clustering Tree")

forest_tree = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_MSE = 0.06, min_instances = 3)
clust_forest = Orange.ensemble.forest.RandomForestLearner(
	base_learner = forest_tree, trees = 50, name = "Clustering Forest")

learners = [ majority, clust_tree, clust_forest ]

data = Orange.data.Table('multitarget-synthetic.tab')
results = Orange.evaluation.testing.cross_validation(learners, data, folds=5)

print "Regression - multitarget-synthetic.tab"
print "%17s  %6s " % ("Learner", "RMSE")
for i in range(len(learners)):
    print "%17s  %1.4f  " % (learners[i].name,
    Orange.multitarget.scoring.mt_average_score(results, Orange.evaluation.scoring.RMSE)[i])
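Besides cross-validation, a learner can of course be applied directly. A minimal sketch, assuming the usual Orange convention that a multi-target classifier called on an instance returns one predicted value per class variable:

# build a single model on the whole data set and predict all target
# values of the first instance
model = clust_tree(data)
print model(data[0])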

References

[1] H. Blockeel, L. De Raedt, and J. Ramon, “Top-Down Induction of Clustering Trees”, in Proceedings of the Fifteenth International Conference on Machine Learning (ICML ’98), pp. 55-63, 1998.
