Introduction

The *k*-means clustering method that measures data similarity by the Pearson correlation distance has been widely used for processing large-scale biological data and has proven useful in gaining insights into various aspects of life science. Despite the popularity of the Pearson correlation distance, optimization methods tailored to this distance remain largely unexplored. Although effective pruning methods based on the triangle inequality have been developed for the Euclidean distance, they are not applicable here because the Pearson correlation distance does not satisfy the triangle inequality.
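
The failure of the triangle inequality can be checked numerically. The sketch below is only an illustration, not part of the distributed source code: it assumes the common definition of the Pearson correlation distance as 1 - r, where r is the Pearson correlation coefficient (this page does not state the exact definition used), and the function name `pearson_distance` and the three example vectors are hypothetical.

```python
import math


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y.

    Assumption: the distance is defined as 1 - r; other definitions exist.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Center both vectors around their means.
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    # Correlation = dot product of centered vectors / product of their norms.
    cov = sum(p * q for p, q in zip(dx, dy))
    norm_x = math.sqrt(sum(p * p for p in dx))
    norm_y = math.sqrt(sum(q * q for q in dy))
    return 1.0 - cov / (norm_x * norm_y)


# Hypothetical example vectors: a and c are uncorrelated, b = a + c lies between them.
a = [1.0, -1.0, 0.0]
b = [2.0, 0.0, -2.0]
c = [1.0, 1.0, -2.0]

d_ab = pearson_distance(a, b)
d_bc = pearson_distance(b, c)
d_ac = pearson_distance(a, c)

# The triangle inequality would require d_ac <= d_ab + d_bc.
violates = d_ac > d_ab + d_bc
print(d_ab, d_bc, d_ac, violates)
```

Here the direct distance between a and c is larger than the detour through b, so a bound of the form d(a, c) <= d(a, b) + d(b, c) cannot be used to skip distance computations under this definition.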

To boost *k*-means clustering for the Pearson correlation distance, we propose a novel method that prunes a significant amount of unnecessary computation while retaining the final solution. Our algorithm can also be applied to *k*-median clustering.

Source code

Manual and source code gzip

Publication

"Boosting *k*-means clustering for the Pearson correlation distance"

submitted

Data used in the paper

Nucleosome positioning signal data (K562 cells)

K562_nuc_signal_d5.gz K562_nuc_signal_d10.gz K562_nuc_signal_d20.gz K562_nuc_signal_d50.gz K562_nuc_signal_d101.gz K562_nuc_signal_d201.gz K562_nuc_signal_d501.gz K562_nuc_signal_d1001.gz K562_nuc_signal_d2001.gz

Morishita Laboratory, Graduate School of Frontier Sciences, the University of Tokyo

Kazuki Ichikawa (e-mail: ichikawa@cb.k.u-tokyo.ac.jp)

Last updated 2013/05/13