Discussion:
[Scikit-learn-general] BIRCH: merge subclusters
Dženan Softić
2016-02-07 20:12:21 UTC
Hi,

I am doing some experiments with BIRCH. When BIRCH finishes, I would
like to merge subclusters based on some criteria. I am doing this
by calling the "merge_subcluster" method on the subcluster that I want to
merge into, passing it the subcluster object of the second cluster:

cluster1.merge_subcluster(cluster2, self.threshold)

It seems to work, since it correctly updates N, LS, SS (n_samples,
linear_sum, squared_sum). What is left is to remove the merged
subcluster (cluster2) from the subclusters list and to update the
centroids:

ind = leaf.subclusters_.index(cluster1)  # index of the centroid to update
ind_remove = leaf.subclusters_.index(cluster2)  # index of the merged cluster to remove
leaf.init_centroids_[ind] = cluster1.centroid_  # update the centroid
leaf.init_sq_norm_[ind] = cluster1.sq_norm_
leaf.centroids_ = np.delete(leaf.centroids_, ind_remove, 0)  # remove the centroid of cluster2
self.root_.init_centroids_ = np.delete(self.root_.init_centroids_, ind_remove, 0)  # remove the centroid from the root
leaf.subclusters_.remove(cluster2)  # remove the merged cluster itself
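
For reference, the bookkeeping behind such a merge can be sketched in plain NumPy. This is only an illustration of the additive clustering-feature (CF) update for N, LS and SS, not scikit-learn's internal code; the helper names (cf_from_points, merge_cf, centroid, radius) are hypothetical and exist only for this example.

```python
import numpy as np

def cf_from_points(points):
    """Build a CF triple (N, LS, SS) from raw points (hypothetical helper)."""
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), float((points ** 2).sum())

def merge_cf(cf_a, cf_b):
    """CF vectors are additive: merging two subclusters sums each component."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

def centroid(cf):
    """Centroid of a subcluster is LS / N."""
    n, ls, _ = cf
    return ls / n

def radius(cf):
    """Root mean squared distance to the centroid: sqrt(SS / N - ||LS / N||^2)."""
    n, ls, ss = cf
    c = ls / n
    # Clamp at zero to guard against tiny negative values from rounding.
    return float(np.sqrt(max(ss / n - c @ c, 0.0)))

a = cf_from_points([[0.0, 0.0], [2.0, 0.0]])
b = cf_from_points([[4.0, 0.0], [6.0, 0.0]])
merged = merge_cf(a, b)

print(merged[0], centroid(merged), radius(merged))
```

Because the centroid and radius are derived entirely from (N, LS, SS), any cached centroid or squared norm that was computed before the merge has to be refreshed afterwards, which is what the index updates above are trying to do.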

I am not sure I am doing it the right way. Any suggestion/comment
would be very much appreciated.

Thanks,
Dzeno
Joel Nothman
2016-02-07 20:58:03 UTC
It's not clear *why* you're doing this. The model will automatically
recluster the subclusters after identifying them, as long as you specify
either a number of clusters or a clustering model to the n_clusters
parameter. Can you fit this post-processing into that "final clustering"
framework?
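
A minimal sketch of the three settings of the n_clusters parameter mentioned above, using only the public Birch API. The data set, threshold and cluster counts are made-up values for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs

# Three well-separated blobs (illustrative data only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=0.5,
    random_state=0,
)

# n_clusters as an integer: the subclusters are reclustered into 3 groups.
brc_int = Birch(threshold=0.5, n_clusters=3).fit(X)

# n_clusters as a clustering model: that model is fit on the subcluster centroids.
brc_model = Birch(
    threshold=0.5, n_clusters=AgglomerativeClustering(n_clusters=3)
).fit(X)

# n_clusters=None: no final clustering step, the subclusters are returned as-is.
brc_none = Birch(threshold=0.5, n_clusters=None).fit(X)

n_int = len(np.unique(brc_int.labels_))
n_model = len(np.unique(brc_model.labels_))
n_none = len(np.unique(brc_none.labels_))
print(n_int, n_model, n_none)
```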
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Dženan Softić
2016-02-07 23:02:42 UTC
Hi,

Thank you for your reply. My aim is not to use the global clustering step,
but rather to use BIRCH for online clustering (of a possibly infinite
stream). I was also trying to set the BIRCH threshold automatically. To do
so, I use the Gap Statistic (which I implemented on top of Apache Spark) on
a certain 'window' of the data stream, and I am able to produce a BIRCH
threshold with high accuracy (based on the tests I have done so far). Since
BIRCH is highly dependent on the order of the data, and because of the way
I am setting the threshold, there is a certain possibility that some
clusters have to be merged (while the possibility of having to split
clusters is small). To keep it "online", I want to merge those clusters in
an "intermediate" step if there is a need. So basically I want to do the
merging, if needed, in "partial_fit", before I proceed with the next batch
and a possibly modified threshold.

That is why I can't use global clustering with a predefined number of
clusters or another clustering model. I hope this makes sense now.
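
To make the setting concrete, here is a rough sketch of the online use described above, restricted to the public API: Birch with n_clusters=None, fed batch by batch through partial_fit. The synthetic stream, batch size and threshold are assumptions for illustration, and the merging step itself is not shown.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
true_centers = np.array([[0.0, 0.0], [10.0, 10.0]])  # made-up stream sources

# n_clusters=None skips the global clustering step, so only subclusters are kept.
brc = Birch(threshold=1.0, n_clusters=None)

for _ in range(5):
    # Simulate one batch of the stream: 100 noisy points around the two sources.
    idx = rng.randint(0, 2, size=100)
    batch = true_centers[idx] + rng.normal(scale=0.3, size=(100, 2))
    brc.partial_fit(batch)
    # An intermediate merging step would go here, between batches.
    n_sub = brc.subcluster_centers_.shape[0]

labels = brc.predict(batch)
print(n_sub, labels.shape)
```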

Thanks again.
Dzeno