The easy part of data analysis is creating lots of crosstabs. The hard part is trying to figure out what it all means. Our brains are greatly limited in the quantity of information that they can process. The more information that our brains have to process, the worse the job that they do. Consequently, a secret to interpreting data is to reduce the quantity of data. The academic term for this is data reduction. This page illustrates some of the more elementary aspects of data reduction by working through an example.
The crosstab below shows how preferences for different brands of cola differ by age, gender and income. It is quite typical of the type used by commercial consultants when analyzing surveys. It is a very large crosstab. Indeed, it is so large that it is impractical for anybody to read everything on this crosstab and thus key findings hidden in the crosstab may easily be missed.
We can greatly improve this crosstab by deleting things. This works because the more things there are on the page, the more our brains get distracted; the more we delete, the clearer the story in the data becomes. The key things that we can delete from this table are:
- All the statistics except for the column percentages. That is, we should delete the counts (labelled N) and the row percentages (% within row). Of course, in some situations these additional statistics may prove to be useful, but more often than not they are not useful and thus the default should be to remove them.
- The decimal places. Due to sampling error it is very rare that differences in decimal places are relevant and thus they should not be shown.
- All of the income and gender data. The column comparisons show no significant differences for gender and thus the differences between male and female preferences for different brands of cola are likely due to sampling error and are thus not interesting. The case for deleting income is a little more complex. The only significant difference relates to the preference for Pepsi Max being relatively high among the people with incomes of $120,001 to $150,000. This pattern seems counterintuitive: why should this small brand appeal to such a narrow income-based demographic? Consequently, it likely reflects one of those flukes that come up from time to time in surveys and thus, in the absence of any corroborating evidence, should be ignored.
The much simpler crosstab that results is shown below.
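The deletions described above can be sketched in a few lines of code. The following is a minimal illustration rather than the procedure used to produce the crosstab on this page: the cell values, the statistic names (`n`, `row_pct`, `col_pct`) and the function `reduce_crosstab` are all hypothetical. It keeps a single statistic per cell and rounds away the decimal places.

```python
# Hypothetical crosstab cells, keyed by (brand, age group).
# Each cell holds several statistics; the numbers are made up for illustration.
crosstab = {
    ("Coca-Cola", "18 to 29"): {"n": 52, "row_pct": 41.3, "col_pct": 48.6},
    ("Coca-Cola", "30 to 49"): {"n": 74, "row_pct": 58.7, "col_pct": 39.2},
    ("Pepsi", "18 to 29"): {"n": 55, "row_pct": 32.4, "col_pct": 51.4},
    ("Pepsi", "30 to 49"): {"n": 115, "row_pct": 67.6, "col_pct": 60.8},
}


def reduce_crosstab(cells, keep="col_pct"):
    """Keep a single statistic per cell and round it to a whole number."""
    return {key: round(stats[keep]) for key, stats in cells.items()}


reduced = reduce_crosstab(crosstab)
for (brand, age), pct in reduced.items():
    print(f"{brand:<10} {age:<9} {pct}%")
```

Deleting whole variables, such as income and gender, then amounts to leaving their columns out of the table before this step.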
The same basic idea can be applied to charts as well. The example below is from a PowerPoint presentation showing the popularity over time of some of Australia's Prime Ministers. The plot beneath makes the pattern in the data much clearer by:
- Deleting the background colors.
- Smoothing the lines (i.e., deleting the small random deviations).
- Deleting the chart junk (i.e., the callouts with the Prime Ministers' names).
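One simple way of smoothing a line is a centered moving average, sketched below. This is an assumption made for illustration; the page does not say which smoothing method was used for the plot. The approval figures and the function `moving_average` are hypothetical.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical approval ratings (percent) over time, made up for illustration.
approval = [60, 57, 61, 55, 58, 52, 54, 49]
smoothed = moving_average(approval, window=3)
print([round(v, 1) for v in smoothed])
```

A wider window deletes more of the small deviations, at the cost of also flattening genuine short-lived changes.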
The two tables below show the same data. The one on the left is ordered alphabetically. The one on the right has been sorted by values. For most purposes, the table on the right is more useful as it allows people to quickly work out the relative performance of the options (which is usually the main interest when interpreting any table).
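As a small sketch of the difference between the two orderings, the following sorts the same table alphabetically and by value. The brand shares are hypothetical figures made up for illustration.

```python
# Hypothetical brand preference shares (percent), made up for illustration.
preferences = {
    "Coca-Cola": 38,
    "Coke Zero": 14,
    "Diet Coke": 9,
    "Pepsi": 21,
    "Pepsi Max": 18,
}

# Alphabetical order: easy to look up a brand, hard to compare brands.
alphabetical = sorted(preferences.items())

# Sorted by value, largest first: the relative performance is visible at a glance.
by_value = sorted(preferences.items(), key=lambda item: item[1], reverse=True)

for brand, pct in by_value:
    print(f"{brand:<10} {pct}%")
```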
The idea of ordering information is more general than that of sorting. Consider the following two charts. The one on the left is substantially better in terms of communicating the pattern.
When dealing with large tables the most useful way of ordering the tables is to use diagonalization. The basic idea of diagonalization is that the rows and columns of tables are re-ordered such that a near-diagonal pattern appears in the table. This allows our brains to recognize a pattern, which in turn makes it much easier to process the data. Note how in the second of the plots below it is really easy to see which brands are similar to which other brands and why.
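There are several ways of finding such an ordering. The sketch below uses reciprocal averaging, which repeatedly scores each row by the weighted average of its column scores and each column by the weighted average of its row scores, and then sorts by the final scores. This is one possible technique chosen for illustration, not necessarily the one used to produce the plots on this page. The function `diagonalize` and the table values are hypothetical, and the sketch assumes a non-negative table with no all-zero rows or columns.

```python
def diagonalize(table, iterations=100):
    """Order rows and columns by reciprocal averaging.

    Returns (row_order, col_order): lists of original indices arranged so
    that large values tend to fall near the diagonal.
    """
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[r][c] for r in range(n_rows)) for c in range(n_cols)]
    grand_total = sum(row_totals)

    col_scores = [float(c) for c in range(n_cols)]  # arbitrary starting scores
    row_scores = [0.0] * n_rows
    for _ in range(iterations):
        # Each row's score is the weighted average of the column scores.
        row_scores = [
            sum(table[r][c] * col_scores[c] for c in range(n_cols)) / row_totals[r]
            for r in range(n_rows)
        ]
        # Each column's score is the weighted average of the row scores.
        col_scores = [
            sum(table[r][c] * row_scores[r] for r in range(n_rows)) / col_totals[c]
            for c in range(n_cols)
        ]
        # Centre and rescale so the scores do not collapse to a constant.
        mean = sum(w * s for w, s in zip(col_totals, col_scores)) / grand_total
        col_scores = [s - mean for s in col_scores]
        scale = max(abs(s) for s in col_scores) or 1.0
        col_scores = [s / scale for s in col_scores]

    row_order = sorted(range(n_rows), key=lambda r: row_scores[r])
    col_order = sorted(range(n_cols), key=lambda c: col_scores[c])
    return row_order, col_order


# Hypothetical brand-by-attribute counts, made up for illustration.
table = [
    [9, 0, 3, 3],
    [0, 9, 0, 3],
    [3, 0, 9, 0],
    [3, 3, 0, 9],
]
row_order, col_order = diagonalize(table)
reordered = [[table[r][c] for c in col_order] for r in row_order]
for row in reordered:
    print(row)
```

In the reordered table the large values sit on or next to the diagonal, so rows that are close together have similar profiles. On real data the pattern will usually be only approximately diagonal.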
Summarizing data involves replacing a larger set of data with a smaller set of data. Ordering involves arranging the data in a way that makes the story more obvious. Grouping involves combining bits of data together.
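As a minimal sketch of grouping, the following combines narrow income bands into broader ones by adding up their counts. The counts, the band labels and the mapping to the groups "Lower", "Middle" and "Higher" are all hypothetical and made up for illustration.

```python
from collections import defaultdict

# Hypothetical respondent counts by income band, made up for illustration.
counts = {
    "Under $30,000": 120,
    "$30,001 to $60,000": 210,
    "$60,001 to $90,000": 180,
    "$90,001 to $120,000": 95,
    "$120,001 to $150,000": 40,
    "Over $150,000": 25,
}

# Hypothetical mapping from each original band to a broader group.
groups = {
    "Under $30,000": "Lower",
    "$30,001 to $60,000": "Lower",
    "$60,001 to $90,000": "Middle",
    "$90,001 to $120,000": "Middle",
    "$120,001 to $150,000": "Higher",
    "Over $150,000": "Higher",
}

# Add the counts of the bands that fall into the same group.
grouped = defaultdict(int)
for band, n in counts.items():
    grouped[groups[band]] += n

total = sum(grouped.values())
for name, n in grouped.items():
    print(f"{name:<7} {n:>4} {round(100 * n / total)}%")
```

Six rows become three, and each of the resulting groups rests on a larger sample than the bands it replaces.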
The second stage in trying to work out what a survey means is data reduction. As the name implies, this involves reducing the amount of data so that conclusions become easier to find.
- Collins, M. (1992). "The data reduction approach to survey analysis." Journal of the Market Research Society 34(2): 149-162.
- Ehrenberg, A. S. C. (1975). Data reduction: analyzing and interpreting statistical data. New York, John Wiley.