One of the trickiest aspects of cleaning and tidying a data file relates to missing values, which are also known as missing data. A value is said to be 'missing' where there is no valid data for a particular respondent on a variable. How missing values are treated when cleaning a data file can have a large impact on any conclusions drawn from the analysis.
How programs treat missing values (re-basing)
The tables below have been created by exporting to PowerPoint from Q. Looking first at the table on the left, note that in the bottom row it reports that the NET (which in this case is a total) is 100% and that this corresponds to 718 respondents (n). Each of the percentages on this table has been computed by dividing the number of people who selected each option (n) by the total sample size. For example, the 5% for Manager/administrator is computed as 33 / 718. The first row on this table shows that 54% (389) of the respondents have missing data, which in this case was because they were students, retired, unemployed or home-makers.
The table on the right, by contrast, excludes the respondents with missing data. Consequently, its total sample size, shown both in the bottom row of the table and as the base n in the footer, is 329 (i.e., 718 - 389). Note that in this table all the percentages are different. For example, the 33 Manager/administrators in the sample now correspond to 10% of the sample (33 / 329).
The table on the right has been re-based; that is, the percentages have been computed with the missing data excluded from the calculations. This is done automatically by most survey analysis programs. However, to do this automatically the programs need to know which categories to treat as being missing. In the case of the table on the left, although the first category has a label which says it is missing data, there is no metadata that explicitly tells the program it is missing data. Consequently, the category appears on the table and is analyzed in the same way as all the other categories. By contrast, with the example on the right the program has been told to treat the people with missing data as being actually missing, and the percentages are updated accordingly.
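The arithmetic behind re-basing can be sketched in a few lines of Python. This is only an illustration of the calculation, not how any particular survey program implements it. The function name `percentages` is hypothetical, and because the example above only reports the counts for the missing category and for Manager/administrator, the remaining 296 respondents (718 - 389 - 33) are lumped into an assumed "All other occupations" row.

```python
def percentages(counts, exclude=()):
    """Return {category: percent}, using only non-excluded categories as the base."""
    base = sum(n for label, n in counts.items() if label not in exclude)
    return {
        label: 100 * n / base
        for label, n in counts.items()
        if label not in exclude
    }


# Counts taken from the example tables. "All other occupations" is an
# assumed grouping of everything the text does not itemise: 718 - 389 - 33.
counts = {
    "Missing data": 389,
    "Manager/administrator": 33,
    "All other occupations": 296,
}

# Left-hand table: the missing category is treated like any other category.
full_base = percentages(counts)

# Right-hand table: the missing category is excluded, so the base is 329.
rebased = percentages(counts, exclude={"Missing data"})

print(round(full_base["Manager/administrator"]))  # 5
print(round(rebased["Manager/administrator"]))    # 10
```

The counts are identical in both calls; the only thing that changes is the denominator, which is why every percentage on the re-based table differs.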
It is important to appreciate that neither of the tables is necessarily wrong. Rather, they just have different interpretations. The table on the left shows that 5% of people are Managers/administrators, whereas the table on the right is interpreted as saying that 10% of employed people are Managers/administrators.
See How To Remove a Category and Re-Base a Table for information on how to re-base tables in different programs.
Common missing data situations
As the previous example illustrated, whether or not data should be treated as being missing is a decision that has to be made by the person analyzing the data, because the appropriate treatment depends upon context (e.g., in the case above, whether we wanted to compute proportions in the total sample or just among people who were employed).
Often the data will be automatically marked as being missing. In particular, with good quality data files any respondent who has not been asked a question will automatically be classified as having missing data on that question and any analyses will be automatically re-based with that respondent's data removed. Consequently, the main situations where the missing data metadata needs to be modified are:
- Where respondents have data that is marked as missing but who should not be treated as having missing data. Consider variables showing the number of children that people have in different age bands (e.g., the number of children aged under 2, from 2 to 5, etc.). Typically, where a person indicates they have no children they are not asked how many children they have in each age group, and thus end up with missing values in the data file for the number of children in each age group. However, in this instance we actually know that they have 0 children in each age category, so the missing values need to be recoded to 0 so that the data is no longer treated as missing (see Recoding Variables).
- Where respondents have data that is not marked as missing but should be marked as missing. For example, in a study of voting intentions, if 10% of the sample have said "Don't know" it often makes sense to treat these responses as missing data, as otherwise the polling results indicate that the preference shares for the different political parties add up to less than 100%.
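Both situations in the list above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the respondent records, the variable names (`has_children`, `children_under_2`), the party labels, and the helper functions `recode_known_zero` and `shares` are made up for illustration, with `None` standing in for a value marked as missing. Real survey programs handle this through their own metadata and recoding tools.

```python
# Situation 1: values marked as missing that are really known to be 0.
respondents = [
    {"has_children": False, "children_under_2": None},  # skipped the question
    {"has_children": True, "children_under_2": 1},
    {"has_children": None, "children_under_2": None},   # genuinely unknown
]


def recode_known_zero(rows, filter_var, target_var):
    """Return copies of rows with a missing target set to 0 when the filter answer is False."""
    recoded = []
    for row in rows:
        row = dict(row)  # copy, so the original data is left untouched
        if row[target_var] is None and row[filter_var] is False:
            row[target_var] = 0
        recoded.append(row)
    return recoded


recoded = recode_known_zero(respondents, "has_children", "children_under_2")


# Situation 2: a valid-looking answer that should be treated as missing.
def shares(answers, missing_labels=()):
    """Percentage for each answer, re-based to exclude None and any missing_labels."""
    valid = [a for a in answers if a is not None and a not in missing_labels]
    if not valid:
        return {}
    return {label: 100 * valid.count(label) / len(valid) for label in sorted(set(valid))}


# Ten hypothetical respondents, one of whom (10%) said "Don't know".
votes = ["Party A"] * 5 + ["Party B"] * 4 + ["Don't know"]

raw = shares(votes)
rebased = shares(votes, missing_labels={"Don't know"})

print(raw["Party A"] + raw["Party B"])          # 90.0
print(round(rebased["Party A"] + rebased["Party B"]))  # 100
```

In the first case only respondents known to have no children are recoded; a respondent whose filter answer is itself missing stays missing. In the second case the party shares sum to 100% only once "Don't know" is excluded from the base.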