Cluster analysis finds groups of similar respondents, where respondents are considered similar if the differences between their ratings are relatively small. As an example, look at the plot below. It is a scatterplot showing data for 18 respondents on two numeric variables. The dots fall into two visibly separate groups. A correctly performed cluster analysis of this data would identify the same two groups that you can see. These groups are then called clusters.
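To make the idea concrete, here is a minimal sketch of one common clustering method, k-means, applied to 18 two-variable points. The coordinates are invented for illustration (they are not the data in the plot), and k-means is used only as an example technique; it is not the method used later in this chapter.

```python
import math

# Hypothetical data: 18 respondents measured on two numeric variables.
points = [
    (1.0, 1.2), (1.3, 0.9), (0.8, 1.1), (1.1, 1.5), (1.4, 1.3),
    (0.9, 0.8), (1.2, 1.0), (1.5, 1.1), (1.0, 1.4),
    (4.0, 4.2), (4.3, 3.9), (3.8, 4.1), (4.1, 4.5), (4.4, 4.3),
    (3.9, 3.8), (4.2, 4.0), (4.5, 4.1), (4.0, 4.4),
]

def kmeans_two_groups(data, max_iter=100):
    """Plain k-means with k=2 and a deterministic starting point."""
    # Start from the first point and the point farthest from it.
    first = data[0]
    second = max(data, key=lambda p: math.dist(p, first))
    centres = [first, second]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each respondent joins the nearest centre.
        new_labels = [
            min(range(2), key=lambda k: math.dist(p, centres[k]))
            for p in data
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for k in range(2):
            members = [p for p, lab in zip(data, labels) if lab == k]
            if members:
                centres[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centres

labels, centres = kmeans_two_groups(points)
print(labels)
```

With well-separated groups like these, the nine low-scoring respondents end up in one cluster and the nine high-scoring respondents in the other, matching what the eye sees.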
Typically we have many more than two variables, which makes it difficult to visualize the data and identify the clusters by eye. It is in such situations that we need to use cluster analysis. Consider the data below, taken from SPSS, which shows the responses of 20 respondents to 7 variables. Can you see a pattern? Perhaps you can find one if you work hard. However, these 20 respondents are from a total sample of 498 respondents, and few people could read a table showing all of their data and identify any clusters. This is why cluster analysis (or, better yet, Latent Class Analysis) is used to find clusters in the data. SurveyAnalysis.org has a detailed discussion of how cluster analysis works.
This example analyzes the data that was previewed at the end of the previous section. You can download the SPSS data file and questionnaire if you wish to replicate the example. The data contains seven variables measuring attitudes towards mobile phones, where respondents were asked to give their degree of agreement/disagreement with the following statements:
- Technology is fascinating
- I am often surprised by the size of bills
- I find it difficult to determine best deal
- I spent a lot of time shopping for best deal
- I closely monitor the time I spend on the phone
- Cost is a factor when deciding where to SMS or phone
- I try to keep calls short and to the point
The scale used was:
- Strongly agree
- Agree a little
- Neither
- Disagree a little
- Strongly disagree
- DON’T KNOW
Consequently, before it could be analyzed, the data needed to be prepared in two ways:
- People who said DON'T KNOW were assigned missing value codes in the corresponding variables.
- The data was recoded so that a value of 1 was assigned to Strongly disagree, 2 to Disagree a little and so on up to 5. Note that these precise values are arbitrary as there is no good reason why these categories should be assigned values from 1 to 5 with a space of 1 between each scale point. While arbitrary, this is, nevertheless, standard practice.
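The two preparation steps can be sketched in a few lines of Python. This is only an illustration: the `recode` function and the sample answers are hypothetical, and the codes 3 to 5 follow the "and so on" in the description above (Neither = 3, Agree a little = 4, Strongly agree = 5). Missing values are represented here by `None`.

```python
# Numeric codes for the agreement scale, following the recoding described above.
SCALE = {
    "Strongly disagree": 1,
    "Disagree a little": 2,
    "Neither": 3,
    "Agree a little": 4,
    "Strongly agree": 5,
}

def recode(answer):
    """Return the 1-5 code for a scale label, or None (missing) for DON'T KNOW."""
    cleaned = answer.strip()
    # Treat straight and curly apostrophes alike.
    if cleaned.upper().replace("’", "'") == "DON'T KNOW":
        return None
    return SCALE[cleaned]

# Hypothetical raw answers from one respondent to the seven statements.
raw = [
    "Strongly agree", "Neither", "DON'T KNOW", "Agree a little",
    "Disagree a little", "Strongly disagree", "Agree a little",
]
coded = [recode(a) for a in raw]
print(coded)
```

Any later calculation, such as a cluster average, then has to skip the `None` entries rather than treat them as numbers.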
SPSS's Two Step Cluster analysis routine, which is the best of the cluster analysis techniques that are available in SPSS,[note 1] recommends the following five-cluster solution.
The top row of the table shows the sizes of the clusters. We can see that approximately 25% of the sample is in the first cluster, 22% in the second and so on.
The variables are then listed underneath each cluster in the order of their importance in determining cluster membership. Looking at cluster 1 we can see that the variable shown at the top is Difficult to determine best deal. Its average value is 1.92 and, as a 2 represents Disagree a little, we can conclude that people in cluster 1 find it relatively easy to identify the best deal. Looking elsewhere in the table we can see that all the other clusters found it hard to find the best deal, as their averages are much higher than 1.92. Similarly, if we look at cluster 2 we can see that the most important determinant of membership of this cluster was the level of agreement with the statement Closely monitors time on phone.
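The two basic ingredients of such a table, the cluster sizes and the average rating of each variable within each cluster, are simple to compute once every respondent has a cluster label. The sketch below uses four invented respondents and two of the variables; the `profile` function is hypothetical and does not reproduce SPSS's importance ordering.

```python
from collections import defaultdict

# Hypothetical (cluster, ratings) rows; None marks a missing value.
rows = [
    (1, {"Difficult to determine best deal": 2, "Closely monitors time on phone": 4}),
    (1, {"Difficult to determine best deal": 1, "Closely monitors time on phone": None}),
    (2, {"Difficult to determine best deal": 4, "Closely monitors time on phone": 5}),
    (2, {"Difficult to determine best deal": 5, "Closely monitors time on phone": 4}),
]

def profile(rows):
    """Return cluster sizes (as % of sample) and per-cluster variable means."""
    sizes = defaultdict(int)
    values = defaultdict(lambda: defaultdict(list))
    for cluster, ratings in rows:
        sizes[cluster] += 1
        for variable, rating in ratings.items():
            if rating is not None:  # skip missing values
                values[cluster][variable].append(rating)
    shares = {c: 100 * n / len(rows) for c, n in sizes.items()}
    means = {
        c: {v: sum(r) / len(r) for v, r in per_variable.items()}
        for c, per_variable in values.items()
    }
    return shares, means

shares, means = profile(rows)
print(shares)
print(means)
```

A low mean on Difficult to determine best deal within a cluster is read exactly as in the text: members of that cluster tend to disagree with the statement.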
In a real-world study the basic process from this point would be to:
- Create a summary of the unique aspects of each of the clusters.
- Run crosstabs of the clusters against other variables of interest to see if there are any further interesting relationships.
- Examine alternative cluster analysis solutions. Generally it is useful to explore solutions with fewer clusters than those that are automatically suggested. In this case, a five cluster solution was automatically identified so it would be advisable to also review solutions with four, three and two clusters. (Note that there is no truly scientific method for determining the number of clusters and that all automated methods for selecting the number of clusters really do is identify the maximum number that is likely to be sensible.)
- Come up with evocative names for each of the clusters (e.g., "The big talkers").
[note 1] Two Step is regarded as the best of these techniques because it can also handle categorical variables and automatically recommends a number of clusters.
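The crosstab step in the process above amounts to counting how many members of each cluster fall into each category of another variable. A minimal sketch follows; the age groups and the `crosstab` helper are invented for illustration and are not part of the example data set.

```python
from collections import Counter

# Hypothetical respondents: (cluster, age group).
respondents = [
    (1, "Under 35"), (1, "Under 35"), (1, "35 and over"),
    (2, "35 and over"), (2, "35 and over"), (2, "Under 35"),
]

def crosstab(pairs):
    """Count respondents in every (row, column) combination."""
    counts = Counter(pairs)
    row_keys = sorted({r for r, _ in pairs})
    col_keys = sorted({c for _, c in pairs})
    return {r: {c: counts[(r, c)] for c in col_keys} for r in row_keys}

table = crosstab(respondents)
for cluster, row in table.items():
    print(cluster, row)
```

If the counts differ markedly between clusters, the other variable helps to describe who is in each cluster.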