Clustering Tree Learner (tree)

ClusteringTreeLearner is an implementation of classification and regression trees, based on the SimpleTreeLearner. It is implemented in C++ for speed and low memory usage. Clustering trees work by splitting the data into clusters based on attribute values; at each node, the attribute that provides the optimal split according to a measure is chosen. The default measure in this implementation is the Euclidean distance between the centroids of the resulting clusters, which the learner tries to maximize. Additional measures are also implemented; more information on them can be found in the parameter description.
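For illustration, the default measure can be sketched in a few lines of Python. This is a simplified, standalone computation of the Euclidean distance between cluster centroids; the function names and data below are ours, not part of the library:

def centroid(cluster):
    # component-wise mean of the (multi-target) value vectors in a cluster
    n = float(len(cluster))
    return [sum(values) / n for values in zip(*cluster)]

def inter_distance(left, right):
    # Euclidean distance between the centroids of the two clusters
    # produced by a candidate split; the learner maximizes this value
    c1, c2 = centroid(left), centroid(right)
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

print inter_distance([[0.0, 0.0], [0.0, 2.0]], [[3.0, 4.0], [3.0, 6.0]])  # 5.0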

The implementation is based on the article by Blockeel et al. [1].

ClusteringTreeLearner was developed for speeding up the construction of random forests, but can also be used as a standalone tree learner.

class Orange.multitarget.tree.ClusteringTreeLearner
min_majority

Minimal proportion of the majority class value that each of the class variables has to reach for induction to stop (used only in classification).

min_MSE

Minimal mean squared error that each of the class variables has to reach for induction to stop (used only in regression).

min_instances

Minimal number of instances in leaves. The instance count is weighted.

max_depth

Maximal depth of the tree.

method

The method used when choosing split attributes while building the tree. The parameter should be supplied either as an integer (from 0 to 3) or as Orange.multitarget.tree. followed by the name of the measure (as shown in the examples below). Possible choices are:

  • inter_distance (default) - Euclidean distance between centroids of clusters
  • intra_distance - average Euclidean distance of each member of a cluster to the centroid of that cluster
  • silhouette - silhouette measure (http://en.wikipedia.org/wiki/Silhouette_(clustering)) calculated with Euclidean distances between cluster centroids instead of between every member of a cluster; a rough sketch is given after this list
  • gini_index - calculates the Gini-gain index; should be used with nominal class variables
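The silhouette variant above is only loosely specified, so the following is a rough sketch of one plausible centroid-based formulation, in which a(i) is the distance of an instance to its own cluster's centroid and b(i) the distance to the other cluster's centroid. This is our reading, not necessarily the exact formula used by the C++ implementation:

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def centroid(cluster):
    n = float(len(cluster))
    return [sum(values) / n for values in zip(*cluster)]

def silhouette(left, right):
    # s(i) = (b(i) - a(i)) / max(a(i), b(i)), averaged over all instances
    scores = []
    for own, other in ((left, right), (right, left)):
        c_own, c_other = centroid(own), centroid(other)
        for instance in own:
            a = euclidean(instance, c_own)
            b = euclidean(instance, c_other)
            d = max(a, b)
            scores.append((b - a) / d if d > 0 else 0.0)
    return sum(scores) / len(scores)

As with the other distance-based measures, a larger silhouette value indicates a better split.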
skip_prob

At every split an attribute will be skipped with probability skip_prob. Useful for building random forests.

random_generator

Provide your own Orange.misc.Random.
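Neither skip_prob nor random_generator appears in the examples below, so here is a minimal sketch of how the two might be combined when building randomized trees. The parameter values are illustrative, and we assume Orange.misc.Random accepts an initial seed:

import Orange

# skip roughly half of the candidate attributes at each split and fix
# the random seed so that repeated runs build the same trees (values
# here are illustrative, not recommendations)
rnd_tree = Orange.multitarget.tree.ClusteringTreeLearner(
	skip_prob = 0.5, random_generator = Orange.misc.Random(42),
	name = "randomized CT")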

Examples

ClusteringTreeLearner can be used on its own or as a base learner in a random forest; examples of both usages are shown below.

import Orange
data = Orange.data.Table('multitarget:bridges.tab')

# a baseline that predicts the majority value of each class variable
majority = Orange.multitarget.binary.BinaryRelevanceLearner(
	learner = Orange.classification.majority.MajorityLearner, name = "Majority")

clust_tree = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_majority = 0.6, min_instances = 5, 
	method = Orange.multitarget.tree.inter_distance, name = "CT inter dist")

# we can use different distance measuring methods
ct2 = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_majority = 0.6, min_instances = 5, 
	method = Orange.multitarget.tree.intra_distance, name = "CT intra dist")

ct3 = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_majority = 0.6, min_instances = 5, 
	method = Orange.multitarget.tree.silhouette, name = "CT silhouette")

# Gini index should be used when working with nominal class variables
ct4 = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_majority = 0.6, min_instances = 5, 
	method = Orange.multitarget.tree.gini_index, name = "CT gini index")


# forests work better if trees are pruned less
forest_tree = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_majority = 1.0, min_instances = 3)
clust_forest = Orange.ensemble.forest.RandomForestLearner(
	base_learner = forest_tree, trees = 50, name = "Clustering Forest")

learners = [ majority, clust_tree, ct2, ct3, ct4, clust_forest ]

results = Orange.evaluation.testing.cross_validation(learners, data, folds=5)

print "Classification - bridges.tab"
print "%17s  %6s  %8s  %8s" % ("Learner", "LogLoss", "Mean Acc", "Glob Acc")
for i in range(len(learners)):
    print "%17s  %1.4f    %1.4f    %1.4f" % (learners[i].name,
    Orange.multitarget.scoring.mt_average_score(results, Orange.evaluation.scoring.logloss)[i],
    Orange.multitarget.scoring.mt_mean_accuracy(results)[i],
    Orange.multitarget.scoring.mt_global_accuracy(results)[i])

# regression uses a different parameter for pruning - min_MSE instead of min_majority
clust_tree = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_MSE = 0.05, min_instances = 5, name = "Clustering Tree")

forest_tree = Orange.multitarget.tree.ClusteringTreeLearner(
	max_depth = 50, min_MSE = 0.06, min_instances = 3)
clust_forest = Orange.ensemble.forest.RandomForestLearner(
	base_learner = forest_tree, trees = 50, name = "Clustering Forest")

learners = [ majority, clust_tree, clust_forest ]

data = Orange.data.Table('multitarget-synthetic.tab')
results = Orange.evaluation.testing.cross_validation(learners, data, folds=5)

print "Regression - multitarget-synthetic.tab"
print "%17s  %6s " % ("Learner", "RMSE")
for i in range(len(learners)):
    print "%17s  %1.4f  " % (learners[i].name,
    Orange.multitarget.scoring.mt_average_score(results, Orange.evaluation.scoring.RMSE)[i])
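Besides cross-validation, a learner can of course be applied directly. A minimal sketch, assuming the usual Orange convention that a multi-target classifier called on an instance returns one predicted value per class variable:

# build a single model on the whole data set and predict all target
# values of the first instance
model = clust_tree(data)
print model(data[0])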

References

[1] H. Blockeel, L. De Raedt, and J. Ramon, “Top-Down Induction of Clustering Trees”, in Proceedings of the Fifteenth International Conference on Machine Learning (ICML ’98), pp. 55-63, 1998.
