The k-means clustering method that measures data similarity by the Pearson correlation distance has been widely used for processing large-scale biological data and has proven useful in gaining insights into various aspects of life science. Despite the popularity of the Pearson correlation distance, optimization methods tailored to this distance remain largely unexplored. Although effective pruning methods based on the triangle inequality have been developed for the Euclidean distance, they are not applicable here because the Pearson correlation distance does not satisfy the triangle inequality.
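To make the triangle-inequality point concrete, the following minimal sketch computes the Pearson correlation distance for three small vectors and checks the inequality. It assumes the common definition d(x, y) = 1 - r(x, y), where r is the Pearson correlation coefficient; the function name `pearson_distance` and the example vectors are illustrative and are not taken from the paper's code.

```python
import math

def pearson_distance(x, y):
    """Pearson correlation distance, assumed here to be 1 - r(x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cx = [a - mean_x for a in x]  # centered copy of x
    cy = [b - mean_y for b in y]  # centered copy of y
    norm_x = math.sqrt(sum(a * a for a in cx))
    norm_y = math.sqrt(sum(b * b for b in cy))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    r = sum(a * b for a, b in zip(cx, cy)) / (norm_x * norm_y)
    return 1.0 - r

# x and z are uncorrelated (r = 0); y lies "halfway" between them.
x = [1.0, -1.0, 0.0]
z = [1.0, 1.0, -2.0]
y = [a / math.sqrt(2) + b / math.sqrt(6) for a, b in zip(x, z)]

d_xy = pearson_distance(x, y)
d_yz = pearson_distance(y, z)
d_xz = pearson_distance(x, z)

# The triangle inequality would require d_xz <= d_xy + d_yz.
violated = d_xz > d_xy + d_yz
print(d_xy, d_yz, d_xz, violated)
```

In this example the direct distance between x and z exceeds the sum of the two distances through y, so bounds derived from the triangle inequality cannot be used safely with this distance.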
To boost k-means clustering for the Pearson correlation distance, we propose a novel method that prunes a significant amount of unnecessary computation while leaving the final solution unchanged. Our algorithm can also be applied to k-median clustering.
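For context, the computation that pruning targets in k-means is typically the assignment step, where a naive implementation evaluates the distance from every point to every centroid. The sketch below shows only that unpruned baseline, again assuming d = 1 - r; it is not the proposed pruning method, and `pearson_distance` and `assign_naive` are hypothetical names used for illustration.

```python
import math

def pearson_distance(x, y):
    """Pearson correlation distance, assumed here to be 1 - r(x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cx = [a - mean_x for a in x]
    cy = [b - mean_y for b in y]
    norm_x = math.sqrt(sum(a * a for a in cx))
    norm_y = math.sqrt(sum(b * b for b in cy))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - sum(a * b for a, b in zip(cx, cy)) / (norm_x * norm_y)

def assign_naive(points, centroids):
    """Assign each point to its nearest centroid using n * k distance evaluations."""
    labels = []
    for p in points:
        dists = [pearson_distance(p, c) for c in centroids]
        labels.append(dists.index(min(dists)))  # index of the closest centroid
    return labels

# Toy data: two increasing profiles and one decreasing profile.
points = [
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 5.0],
]
centroids = [
    [0.0, 1.0, 2.0, 3.0],  # increasing trend
    [3.0, 2.0, 1.0, 0.0],  # decreasing trend
]
labels = assign_naive(points, centroids)
print(labels)
```

A pruning method that retains the final solution would have to return exactly the same labels as this exhaustive assignment while skipping some of the distance evaluations.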
"Boosting k-means clustering for the Pearson correlation distance"
Data used in the published paper
Nucleosome positioning signal data (K562 cells)
K562_nuc_signal_d5.txt.gz
K562_nuc_signal_d10.txt.gz
K562_nuc_signal_d20.txt.gz
K562_nuc_signal_d50.txt.gz
K562_nuc_signal_d101.txt.gz
K562_nuc_signal_d201.txt.gz
K562_nuc_signal_d501.txt.gz
K562_nuc_signal_d1001.txt.gz
K562_nuc_signal_d2001.txt.gz