Cluster Points


This website describes the setup and functionality of a plugin written for the QGIS Processing Toolbox.

The plugin can be downloaded here and the source code is available here.

Introduction

Cluster Points offers a set of clustering tools for automatically grouping a point layer in QGIS, minimizing the intra-group distances and maximizing the between-group distances. There are two inherently different algorithms the user may choose from: K-Means clustering, which randomly initializes the cluster centers and reassigns cluster members until the centers stop moving, and agglomerative hierarchical clustering, which starts with as many clusters as there are points and gradually merges individual clusters according to a chosen link function.
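The K-Means idea described above (random initialization, repeated reassignment until the centers stop moving) can be sketched in a few lines of Python. This is an illustrative sketch, not the plugin's actual implementation:

```python
import numpy as np

def kmeans(points, k, seed=0, max_iter=100):
    """Minimal K-Means sketch: random centers, reassign until stable."""
    rng = np.random.default_rng(seed)
    # randomly initialize the cluster centers from the input points
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each point to its nearest center (Euclidean distance)
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute the centers; stop once they no longer move
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The fixed `seed` makes repeated runs reproducible, which is why the plugin exposes a random-seed option for this clustering type.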

Cluster Points works with input GIS files (Shapefile or GeoPackage); as output, a new field Cluster_ID is appended to the attribute table.

The plugin can also incorporate prescribed attribute fields into the clustering process. To this end, the user defines the percentage contribution of these attribute fields as a whole relative to the geographical coordinates.

Cluster Points is free software and is offered without guarantee or warranty. You can redistribute it and/or modify it under the terms of version 3 of the GNU General Public License as published by the Free Software Foundation. Bug reports and suggestions are welcome on the issue tracker.

This plugin was started during the project phase of a GIS-Analyst training course in Berlin (GIS-Trainer). I acknowledge the assistance of the GIS-Trainer tutors and my classmates Juliane, Bennet and Sebastian.



User Manual

Clustering

The Clustering tool offers spatial clustering of a point layer based on the mutual distances between points: inter-cluster distances are maximized while intra-cluster distances are minimized. The user always needs to define the number of clusters sought (minimum 2) and to choose between the Euclidean distance and the Manhattan distance for the cluster computation. Since the K-Means algorithm relies on a random initialization, a random seed can be specified for this clustering type to guarantee reproducible results. For hierarchical clustering, a linkage, i.e. link function, must be specified, which determines how distances are measured between individual clusters.
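The two distance metrics the tool offers differ in how they measure the separation of two points; a minimal sketch:

```python
import math

def euclidean(p, q):
    # straight-line ("as the crow flies") distance
    return math.hypot(p[0] - q[0], p[1] - q[1])

def manhattan(p, q):
    # sum of absolute coordinate differences (grid distance)
    return abs(p[0] - q[0]) + abs(p[1] - q[1])
```

For the pair (0, 0) and (3, 4), the Euclidean distance is 5 while the Manhattan distance is 7; Manhattan generally yields larger values and can produce different cluster boundaries.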

Two inherently different clustering types are available:

- K-Means clustering (with random initialization of the cluster centers)
- agglomerative hierarchical clustering (with a selectable link function)

All versions output the new field/attribute Cluster_ID, appended to the input layer to indicate the cluster membership of individual points. As an additional output, the Fuzzy C-Means algorithm also appends the new field/attribute Cluster_%, which lists the cluster membership probabilities in the order of the IDs in Cluster_ID.

Incorporation of attribute fields

If any information other than the geographical coordinates is to be considered, the user may select available numerical fields (attribute fields) from the input layer to incorporate into the clustering process. The user then specifies the percentage contribution of these fields as a whole, in addition to or instead of the geographical coordinates.

If the percentage contribution is set to 0, the clustering considers only the geographical coordinates and is purely spatial. If it is set to 100, the clustering disregards the geographical coordinates and considers only the selected fields. For percentages between 0 and 100, the clustering is based on a mix of geographical coordinates and the selected fields. In essence, the percentage determines whether the clustering is location-based (low percentage) or attribute-based (high percentage). Before they are weighted by the given percentage, the numerical fields are standardized to have the same standard deviation as the (horizontal) geographical distances.
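One way to realize this standardize-then-weight scheme is sketched below. The exact weighting inside the plugin may differ; this is only an illustration of the behavior the manual describes:

```python
import numpy as np

def weighted_features(coords, attrs, pct):
    """Sketch of mixing coordinates with attribute fields (assumed scheme).

    coords: (n, 2) array of geographical coordinates
    attrs:  (n, m) array of numerical attribute fields
    pct:    percentage contribution of the attributes (0..100)
    """
    # standardize each attribute to the spread of the coordinates
    attrs = attrs / attrs.std(axis=0) * coords.std()
    w = pct / 100.0
    # pct = 0  -> purely spatial; pct = 100 -> purely attribute-based
    return np.hstack([coords * (1 - w), attrs * w])
```

Clustering the returned feature matrix with any distance-based algorithm then reproduces the three regimes described above.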



Example usage

To illustrate the functionality of the plugin, consider the customer addresses of a company in Berlin (customer locations shown on the map; polygons delineate the Berlin districts). Assume the company plans to install 8 new logistics centers across the city and needs to know the optimum locations that minimize the distances between the logistics centers and the 875 nearby customers.

To find the optimum locations, the Clustering tool is run to find groups of customers that are close to each other. The user sets 8 as the target number of clusters and runs the algorithm (in this case K-Means). In the original version, the output is a point layer with the same number of points as the input, with the new field Cluster_ID appended to the attribute table and the cluster members highlighted in color according to their Cluster_ID. Later versions simply append the field Cluster_ID to the input layer (color highlighting is done manually, if required). Finally, the cluster centers may be found by a standard centroid estimation (not shown here).
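The standard centroid estimation mentioned above is just the mean of the coordinates per Cluster_ID; a small sketch:

```python
from collections import defaultdict

def cluster_centroids(points, cluster_ids):
    """Standard centroid estimation: mean x/y per Cluster_ID."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), cid in zip(points, cluster_ids):
        s = sums[cid]
        s[0] += x
        s[1] += y
        s[2] += 1
    # divide the coordinate sums by the member count of each cluster
    return {cid: (sx / n, sy / n) for cid, (sx, sy, n) in sums.items()}
```

In the logistics example, the 8 centroids computed this way would be the candidate locations for the new logistics centers.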

Cluster feature illustration

To speed up the computation of hierarchical clustering with the Lance-Williams distance updates, the customers are first assigned to 81 cluster features (with a radius derived from a user-defined percentile of the sample distances). The hierarchical clustering is then performed only on these 81 cluster features. Finally, each customer is assigned the Cluster_ID of its cluster feature. Below, the centroids of the cluster features are displayed with larger markers, whereas the initial points, i.e. individual customers, are displayed with smaller markers and connected to their cluster features by straight lines.
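The Lance-Williams scheme updates the distance from a newly merged cluster (i ∪ j) to any other cluster k directly from the three pairwise distances, without revisiting the original points; the standard recurrence, with coefficients for two common linkages, can be sketched as:

```python
def lance_williams(d_ik, d_jk, d_ij, ai, aj, beta, gamma):
    """Distance from the merged cluster (i U j) to cluster k."""
    return ai * d_ik + aj * d_jk + beta * d_ij + gamma * abs(d_ik - d_jk)

def single_link(d_ik, d_jk, d_ij):
    # ai = aj = 0.5, beta = 0, gamma = -0.5  ->  min(d_ik, d_jk)
    return lance_williams(d_ik, d_jk, d_ij, 0.5, 0.5, 0.0, -0.5)

def complete_link(d_ik, d_jk, d_ij):
    # ai = aj = 0.5, beta = 0, gamma = +0.5  ->  max(d_ik, d_jk)
    return lance_williams(d_ik, d_jk, d_ij, 0.5, 0.5, 0.0, 0.5)
```

Because each merge only touches the distance matrix, running the recurrence on 81 cluster features instead of 875 individual customers cuts the cost of every update step substantially.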