Author: Darryl Chia


Cluster analysis, or "clustering", is the task of grouping a set of objects or data points into groups, known as clusters, so that the members of each cluster are more similar to one another than to the members of other clusters.

How it Works

Cluster analysis is a form of exploratory data mining and a common technique in statistical data analysis; it is also used in many other fields.
Clustering is based on the idea of classes: conceptually meaningful groups of objects that share common characteristics. It boils down to dividing the information into groups (clustering) and assigning meaning to each group (classification). The definition of a cluster is imprecise, and what counts as a meaningful group varies with the subject matter and the data being analysed.

Clustering isn't tied to a specific algorithm; many different algorithms can be used, sometimes in conjunction, and they differ in what they treat as a cluster. Clusters usually appear as groups of data points in close proximity to each other, as dense areas of the data space, or as points that fit particular intervals or statistical distributions. Clustering is an iterative optimisation task: the algorithm adjusts the grouping of many objects at once, uses the result of each iteration as input to the next, and refines the grouping through trial and error. Pre-processing the data beforehand is an essential part of reducing the workload of cluster analysis.
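The iterative refinement described above can be illustrated with k-means, one widely used clustering algorithm. These notes do not name a specific algorithm, so the choice of k-means, the sample points and the starting centroids are all assumptions made for illustration. This is a minimal sketch, not a production implementation.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Sketch of k-means (Lloyd's algorithm) for points given as coordinate tuples."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # nothing moved, so further iterations would change nothing
        centroids = new_centroids
    return labels, centroids

# Made-up sample data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Each pass through the loop is one round of trial and error: the grouping is guessed, the centroids are corrected, and the process stops once a pass no longer changes anything.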

Clusters can be organised by connectivity, where the distance between objects, and the shape those distances form, determines how objects are grouped and how they relate to each other. In this model, the significance of the relationship between two points depends on the space between them. Clusters can also be organised by density, where a cluster is a dense region of points, or by distribution, where a cluster is the set of points that fit the same mathematical distribution.
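The difference between the connectivity view and the density view can be sketched in a few lines. The example below is an assumption-laden illustration: the function names, the sample points and the 1.5 distance threshold are all made up. The first function links any points that are within a maximum gap of each other (equivalent to cutting a single-linkage hierarchy at that distance), and the second simply counts how crowded the area around each point is.

```python
import math

def single_linkage(points, max_gap):
    """Connectivity view: points joined by a chain of gaps <= max_gap share a cluster."""
    labels = [None] * len(points)
    cluster = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue  # already reached from an earlier point
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] is None and math.dist(points[i], points[j]) <= max_gap:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

def neighbour_counts(points, radius):
    """Density view: how many other points lie within radius of each point."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]

# Made-up sample data laid out along a line for easy checking by hand.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (30, 0)]
linked = single_linkage(points, max_gap=1.5)
density = neighbour_counts(points, radius=1.5)
print(linked)
print(density)
```

In the connectivity view, the first and third points end up in the same cluster even though they are further apart than the threshold, because the second point links them. In the density view, the isolated last point has no neighbours at all, which is the kind of signal density-based methods use to treat a point as noise.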

Two examples of cluster diagrams, one organised by linkage (connectivity) and the other by density

Applications specific to the Case Study

Clustering has applications in business analytics, particularly when one wants to organise Big Data into meaningful structures that can be understood easily. Clustering can also break large heterogeneous populations (diverse in character and content) into smaller homogeneous groups (alike in character), which makes the data easier to interpret when making business decisions. Cluster analysis can also be used to measure the degree of association that objects have with one another, which can inform the business about likely customer behaviours. This helps businesses understand their target market.

For the Case Study, grocers can use clustering to segment their loyalty card customers into groups based on similarities in shopping behaviour and other factors, and then tailor their appeal to each group. Cluster analysis only discovers patterns; it cannot explain or interpret them, so companies still need to interpret the results correctly.
In particular, clustering can reveal how brand selectivity differs between segments of buyers according to attributes such as location, size, brand, flavour and price. It can also be used to detect anomalies in behaviour.
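As a rough illustration of segmentation and anomaly detection, the sketch below assigns loyalty card customers to the nearest of two segment centres and flags anyone who is far from both. Everything in it is hypothetical: the segment names and centres, the two features (visits per month and average basket spend), the scaling divisors, the distance threshold and the customer records are invented for the example and do not come from the Case Study. It also assumes the segment centres were already found by an earlier clustering run.

```python
import math

# Hypothetical segment centres as (visits per month, average basket spend).
# Assumed to be the output of an earlier clustering run.
SEGMENTS = {
    "frequent small baskets": (12.0, 20.0),
    "weekly family shop": (4.0, 110.0),
}

def scale(record):
    """Put both features on a comparable scale so spend does not dominate distance.

    The divisors are illustrative assumptions, not derived from real data.
    """
    visits, basket = record
    return (visits / 10.0, basket / 100.0)

def assign_segment(record, max_distance=0.5):
    """Return the nearest segment name, or 'anomaly' if no segment is close enough."""
    scaled = scale(record)
    name, centre = min(
        SEGMENTS.items(), key=lambda item: math.dist(scaled, scale(item[1]))
    )
    if math.dist(scaled, scale(centre)) > max_distance:
        return "anomaly"
    return name

# Hypothetical loyalty card records keyed by a made-up card ID.
customers = {
    "card_001": (11, 25.0),
    "card_002": (5, 95.0),
    "card_003": (1, 400.0),
}
result = {card: assign_segment(record) for card, record in customers.items()}
print(result)
```

The code only reports which group each customer resembles. Deciding what a segment means, or why a customer is an anomaly, is still left to the business.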

Links to Social and Ethical Issues

Please note that not all issues need to be addressed. Please add the URL or source of any examples to support your suggestion. It may be helpful to RANK the issues in the THIRD column.

Social & Ethical Issue
Examples that specifically link to the concept and/or definition in the Case Study
1.1 Reliability and integrity
The primary issue with clustering is the reliability and integrity of the data that is
recorded and organised. If the data isn't entered or, more likely in this case, measured
properly, it can't be categorised accurately, and any predictions made from the
clusters will be inaccurate as well.

1.2 Security

1.3 Privacy and anonymity

1.4 Intellectual property

1.5 Authenticity

1.6 The digital divide and equality of access
Given the geography of the case study, another likely issue is access: clustering
software and the means to gather the amount of data necessary may be out of reach
for those living in these situations.

1.7 Surveillance

1.8 Globalization and cultural diversity

1.9 Policies

1.10 Standards and protocols

1.11 People and machines

1.12 Digital citizenship

References and resources