Counted Values and Missing Values in Multiple Response Questions
- 1 Counted values
- 2 Multiple response data where the variables contain multiple values
- 3 Missing values
- 4 Notes
In a single response question it is usually obvious that the correct way to compute the proportions is to count the number of people who selected a category and divide this by the total number of people who selected at least one category. At an intuitive level it makes sense that percentages of multiple response data would be computed in the same way. However, the way that the data is stored prevents it from being quite so simple. Usually, multiple response data is stored so that there is one variable for each option (e.g., one variable for each brand). However, it is not always clear how each of these variables should be analyzed.
In some data files the code frame will be set up as:
0 Not selected
1 Selected
In such a situation it is usually pretty obvious that the correct way to compute the proportions is to work out the proportion of people with a 1 in their data (it gets more complicated if there is missing data; this is discussed in the next section). Or, phrasing it in a different way, the correct way of computing the proportions is to count the higher value (i.e., the 1).
Similarly, if there is no metadata and the variable only contains 0s and 1s it is still obvious that the 1s should be counted.
However, where it gets complicated is when there are values other than 0 and 1 in the data. For example, sometimes the code frame will be:
1 Yes
2 No
To a human being it is obvious that the Yes responses should be counted. However, in one sense, this is the opposite of the previous examples, as now we are counting the lowest of the observed values rather than the highest.
Due to the potential ambiguity, most programs either force the user to specify a value (e.g., SPSS requires the user to specify the Counted value) or give the user the ability to inspect and modify the setting. For example, in Q the user specifies whether the analysis should or should not Count this value, and in DataCracker the user has the option to Select Categories.
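The idea of a counted value can be illustrated with a minimal Python sketch. The function name `proportion_counted` and the sample data are hypothetical, and the sketch assumes that missing values (represented as `None`) are simply left out of the base; it is not how SPSS, Q or DataCracker implement the calculation.

```python
def proportion_counted(values, counted_value):
    """Share of non-missing responses that equal the counted value."""
    valid = [v for v in values if v is not None]
    if not valid:
        return float("nan")
    return sum(1 for v in valid if v == counted_value) / len(valid)


# The same five answers stored under two different code frames (made-up data).
selected_01 = [1, 0, 1, 1, 0]  # 0 = Not selected, 1 = Selected
yes_no_12 = [1, 2, 1, 1, 2]    # 1 = Yes, 2 = No

# Counting the value 1 gives the same proportion under both code frames.
print(proportion_counted(selected_01, counted_value=1))
print(proportion_counted(yes_no_12, counted_value=1))

# Blindly counting the highest value would instead report the share of No answers.
print(proportion_counted(yes_no_12, counted_value=max(yes_no_12)))
```

Making the counted value an explicit argument mirrors the point of this section: the code frame, not the size of the number, determines which value should be counted.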
Multiple response data where the variables contain multiple values
Sometimes the variables contain more than two values, so it is not at all obvious which of the values should be counted. There are two very different instances of this.
Case A: Max-Multi data
In SPSS, for example, the data needs to be selected as Categories when defining the multiple response sets, whereas in Q there is a special question type of Pick Any - Compact designed for this type of data.
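As a rough illustration, the sketch below converts compact data into one 0/1 variable per option, which can then be analyzed by counting the 1s. It assumes that Max-Multi data means each variable holds the code of one selected option (first mention, second mention, and so on), with unused slots left missing; the function name and the data are hypothetical.

```python
def max_multi_to_indicators(rows, codes):
    """Convert compact rows (lists of selected codes) into 0/1 indicators per code."""
    return [{code: int(code in row) for code in codes} for row in rows]


# Hypothetical data: up to three mentions per respondent, None = unused slot.
compact = [
    [1, 3, None],
    [2, None, None],
    [1, 2, 3],
    [1, None, None],
]
codes = [1, 2, 3]

indicators = max_multi_to_indicators(compact, codes)

# Proportion of respondents selecting each code, counting the 1s.
shares = {c: sum(r[c] for r in indicators) / len(indicators) for c in codes}
print(indicators[0])
print(shares)
```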
Case B: Recoding grid questions
Often it is useful to treat some types of grid questions as if they are multiple response questions. For example, with a question that asks people to rate agreement on a five-point scale (e.g., Strongly disagree; Somewhat disagree; Neither agree nor disagree; Somewhat Agree; Strongly Agree), it is common to compute a top 2 box score (i.e., the NET of Somewhat Agree and Strongly Agree). There are numerous ways of doing this. However, the simplest is to treat the data as being multiple response and count multiple values. For example, count the 4 and 5 values (assuming they correspond to Somewhat Agree and Strongly Agree). This can be done in most programs by recoding the existing variables so that they have only two values (e.g., 0 and 1) and then treating the data as if it came from a multiple response question. Q and DataCracker both permit the specification of multiple counted values (i.e., there is no need to recode the variables in these programs).
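The two routes to a top 2 box score (counting multiple values directly, or recoding to 0 and 1 first) can be sketched as follows. The statement names and ratings are made up, and the code assumes that 4 and 5 correspond to Somewhat Agree and Strongly Agree.

```python
def share_counted(values, counted_values):
    """Share of non-missing responses whose value is in counted_values."""
    valid = [v for v in values if v is not None]
    return sum(1 for v in valid if v in counted_values) / len(valid)


# Hypothetical 5-point agreement ratings for two grid statements.
ratings = {
    "Statement A": [5, 4, 2, 3, 4, 1, 5, 2],
    "Statement B": [1, 2, 4, 3, 3, 2, 5, 1],
}
TOP_2 = {4, 5}  # assumed codes for Somewhat Agree and Strongly Agree

# Route 1: count multiple values directly.
top2_direct = {name: share_counted(vals, TOP_2) for name, vals in ratings.items()}

# Route 2: recode to 0/1 first, then count the 1s as for multiple response data.
recoded = {name: [int(v in TOP_2) for v in vals] for name, vals in ratings.items()}
top2_recoded = {name: share_counted(vals, {1}) for name, vals in recoded.items()}

print(top2_direct)
print(top2_recoded)
```

Both routes give the same proportions; the recode is only needed where a program accepts a single counted value.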
Missing values
The following table shows the data for the first 10 of 498 respondents from the Mobiles Example. Note that, for example, the first respondent has only provided data for brands 1, 2 and 7 (i.e., has missing values for all of the others). This data is from a question where people were presented with a list of brands and asked which they had seen before (i.e., it is an Aided Awareness question). Where respondents have missing values (shown as a .), this is because they had indicated in an earlier Unaided Awareness question that they were aware of the brands.
The only way to compute a valid summary table of this multiple response question is if the data has been set up so that the analysis program knows that the correct interpretation is that a person is aware of a given brand if either they have said Yes, or, they have missing data. By default no analysis program will work this out. For example, the resulting summary tables in SPSS and Q are shown below. Both are incorrect in this instance.
Figure: SPSS multiple response summary table with missing data (left) and Q multiple response summary table with missing values (right; base n = 0 to 489).
To better understand the data and how to compute valid percentages it is helpful to look at the following table, which shows, for each option, the number of respondents with Yes, No or missing data. Looking at the SPSS table above, the 21.5% shown for Responses AAPT has been computed by dividing the 78 people that said Yes for AAPT by the total number of Yes records for all the brands. Less obviously, the 71.6% shown for Percent of cases has been computed by dividing 78 by 106, where 106 is the number of people to have a Yes response in the data for at least one of the brands (this number is not shown on the table and cannot be deduced from the table). In the presence of missing data neither of these statistics has a useful meaning (i.e., they are not estimates that relate to the population of phone users).
The table computed by Q (above and to the right) shows 17.6% for AAPT. This has been computed by dividing the 78 by the total number of people to have said either Yes or No for AAPT (i.e., 17.6% = 78/(376 + 78)). This percentage does have a real meaning which is useful in some contexts. The interpretation is that 17.6% of people asked whether they were aware of AAPT said they were aware. However, in this specific example, where we know that everybody with missing data was aware of AAPT, Q's default calculation is also unhelpful.
With this type of data the correct calculation is to count the people who either said Yes or have missing data and divide this by the total number of people in the study. In the case of AAPT, for example, the correct proportion is 24.5% (i.e., (44 + 78) / 498). The following table shows the correct proportions for all of the brands:
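The AAPT calculation can be reproduced with a short Python sketch. The function names are hypothetical; the counts (78 Yes, 376 No, 44 missing, 498 respondents) are those quoted in the text, and the respondent-level list is rebuilt from those counts purely for illustration.

```python
def aided_awareness(n_yes, n_missing, n_total):
    """Aided awareness where missing means 'already aware' (from the unaided question)."""
    return (n_yes + n_missing) / n_total


def aided_awareness_from_rows(values):
    """Same calculation from respondent-level data; None represents a missing value."""
    aware = sum(1 for v in values if v == "Yes" or v is None)
    return aware / len(values)


aapt = aided_awareness(n_yes=78, n_missing=44, n_total=498)
print(f"{aapt:.1%}")  # 24.5%

# Respondent-level column reconstructed from the counts above (illustrative only).
aapt_column = ["Yes"] * 78 + ["No"] * 376 + [None] * 44
print(f"{aided_awareness_from_rows(aapt_column):.1%}")  # 24.5%
```

The key design choice is the base: every respondent in the study is in the denominator, and missing values are treated as aware rather than being excluded.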
Using software to compute the proportions correctly
The standard way to fix the data is to:
- Recode the variables so that the missing values are recoded as having a value of 1.
- If it is not already in the data, create a none of these alternative (this is necessary because SPSS and some of the older analysis packages require that each respondent has at least one Yes response in order for the percentages to be correctly calculated).
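The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure of any particular package: the code frame is assumed to be 1 = Yes and 0 = No with `None` for missing, and the function name, brand names and the "None of these" label are hypothetical.

```python
def recode_missing_and_add_none(rows, brands, none_label="None of these"):
    """Recode missing values to 1 and add a none-of-these indicator per respondent."""
    fixed = []
    for row in rows:
        # Step 1: a missing value is treated as Yes (1); other values are kept.
        new_row = {b: 1 if row.get(b) is None else row[b] for b in brands}
        # Step 2: flag respondents who, after recoding, have no Yes at all.
        new_row[none_label] = int(all(new_row[b] != 1 for b in brands))
        fixed.append(new_row)
    return fixed


# Hypothetical respondents: 1 = Yes, 0 = No, None = missing.
brands = ["Brand A", "Brand B"]
rows = [
    {"Brand A": None, "Brand B": 0},
    {"Brand A": 0, "Brand B": 0},
    {"Brand A": 1, "Brand B": None},
]

for fixed_row in recode_missing_and_add_none(rows, brands):
    print(fixed_row)
```

After this recode every respondent has at least one value of 1, so counting the 1s and dividing by the number of respondents gives proportions based on the whole sample.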
The standard method can be done in Q and DataCracker as well, but both of these programs have an easier way of fixing this problem.
Computing the correct proportions in DataCracker
- Select a table containing the multiple response question.
- Data Manipulation > Data Values > Missing Data, select Include in Analyses for each of the categories and press OK.
- Data Manipulation > Data Values > Select Categories and ensure that Yes and Missing data are selected and press OK.
Computing the correct proportions in Q
- In the Tables tab, right-click on one of the categories of the question and select Values.
- Fill in the dialog box as shown below.