Correcting Metadata

The metadata of a data file is often incorrect, and the first step in analyzing data is often to correct these errors. Where there is a substantial number of errors in the metadata, the best solution is often to try to get a better data file (see Getting a Data File). This page lists the most common problems and their remedies.

Combining variables into multiple response questions

A very common problem with data files is that a question that was asked as a multiple response question appears as multiple single response questions. This is generally caused by the data collection software exporting the data incorrectly (unfortunately, most data collection programs make this mistake). Some programs, such as Q and DataCracker, attempt to automatically correct this problem by looking for patterns in the data (e.g., if there are 10 variables, each of which has the same prefix of Awareness and the same code frame, then Q and DataCracker automatically group the variables together). However, in most programs the variables need to be grouped together manually as multiple response questions.
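The kind of pattern matching described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the algorithm that Q or DataCracker actually use; the variable names, the labels, the ":" separator and the `group_by_prefix_and_code_frame` function are all assumptions made for the example.

```python
from collections import defaultdict


def group_by_prefix_and_code_frame(variables, separator=":"):
    """Find variables that share a label prefix and an identical code frame.

    `variables` maps a variable name to a (label, code_frame) pair, where
    code_frame is a dict of value -> category label. This structure is
    hypothetical; real data files store metadata in their own formats.
    """
    groups = defaultdict(list)
    for name, (label, code_frame) in variables.items():
        prefix = label.split(separator)[0].strip()
        frame_key = tuple(sorted(code_frame.items()))
        groups[(prefix, frame_key)].append(name)
    # A prefix and code frame shared by two or more variables is a candidate
    # multiple response question (or grid).
    return [(prefix, names) for (prefix, _), names in groups.items() if len(names) > 1]


# Hypothetical metadata for illustration only.
variables = {
    "q1_1": ("Awareness: Brand A", {0: "No", 1: "Yes"}),
    "q1_2": ("Awareness: Brand B", {0: "No", 1: "Yes"}),
    "q1_3": ("Awareness: Brand C", {0: "No", 1: "Yes"}),
    "q2": ("Age", {1: "18 to 34", 2: "35 or more"}),
}
candidate_groups = group_by_prefix_and_code_frame(variables)
print(candidate_groups)
```

A heuristic of this kind will sometimes group variables that should stay separate, which is why the results need to be reviewed.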

There are two common variants of this problem.

Multiple binary variables

Each category of the question appears as a separate summary table, with two categories (hence the term 'binary'). For example, the following shows a part of a MarketSight Summary Report, which reveals that a question titled Unaided Awareness has been shown as multiple separate tables, rather than being grouped together. This problem has two common causes. One is that the data file often does not contain the relevant metadata. The other is that the information is sometimes in the file but the data analysis program does not interpret it properly (in this example, both Q and DataCracker do group these variables together as a multiple response question).

All the major analysis programs have in-built tools designed to combine multiple variables together and thereby solve this problem.


See How to Combine Variables into Multiple Response Questions and Grids.
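As a rough illustration of what combining binary variables achieves, the following Python sketch produces a single table from several binary variables. The variable names, the data and the choice of 1 as the 'selected' value are assumptions made for the example, and the sketch uses all respondents as the base, which is only one of several possible conventions.

```python
def multiple_response_table(respondents, variables, selected_value=1):
    """Percentage of respondents who selected each category.

    `variables` maps each binary variable name to its category label.
    The base is every respondent, including any with missing data.
    """
    n = len(respondents)
    table = {}
    for name, label in variables.items():
        count = sum(1 for r in respondents if r.get(name) == selected_value)
        table[label] = round(100 * count / n, 1)
    return table


# Hypothetical data: one dict per respondent, 1 = aware, 0 = not aware.
respondents = [
    {"aware_a": 1, "aware_b": 0, "aware_c": 1},
    {"aware_a": 1, "aware_b": 1, "aware_c": 0},
    {"aware_a": 0, "aware_b": 0, "aware_c": 0},
    {"aware_a": 1, "aware_b": 0, "aware_c": 0},
]
variables = {"aware_a": "Brand A", "aware_b": "Brand B", "aware_c": "Brand C"}
unaided_awareness = multiple_response_table(respondents, variables)
print(unaided_awareness)
```

Because respondents can select more than one category, the percentages in such a table do not need to add up to 100.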

Multiple categorical variables

There are multiple tables (or variables) showing all of the different categories. Typically, the first table shows the first response that a respondent selected, the second shows the second response and so on.

All the major analysis programs have in-built tools designed to combine multiple variables together and thereby solve this problem.

See How to Combine Variables into Multiple Response Questions and Grids.
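The same result can be sketched for this variant in Python. Here a respondent counts as having selected a category if it appears in any of the 'mention' variables. The variable names, code frame and data are hypothetical, and `None` is assumed to mean that no further response was given.

```python
def mentions_to_multiple_response(respondents, mention_variables, code_frame):
    """Percentage of respondents who gave each category in any mention."""
    n = len(respondents)
    table = {}
    for code, label in code_frame.items():
        count = sum(
            1
            for r in respondents
            if any(r.get(v) == code for v in mention_variables)
        )
        table[label] = round(100 * count / n, 1)
    return table


# Hypothetical data: first and second response of each respondent.
respondents = [
    {"mention_1": 1, "mention_2": 3},
    {"mention_1": 2, "mention_2": None},
    {"mention_1": 3, "mention_2": 1},
    {"mention_1": 1, "mention_2": None},
]
code_frame = {1: "Brand A", 2: "Brand B", 3: "Brand C"}
awareness = mentions_to_multiple_response(
    respondents, ["mention_1", "mention_2"], code_frame
)
print(awareness)
```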

Combining variables into grids

Just as multiple response questions are sometimes not correctly represented in a data file, requiring the variables to be combined (see the previous problems), grid questions often initially appear split apart as well. Furthermore, many data analysis programs (e.g., SPSS, MarketSight and R) do not provide support for many types of grids.

See How to Combine Variables into Multiple Response Questions and Grids.
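For a program without grid support, a grid can be approximated by tabulating each variable against a shared code frame, as in the following Python sketch. The variable names, the three-point scale and the data are assumptions made for the example; each row is percentaged on the respondents who answered that variable.

```python
def grid_table(respondents, variables, code_frame):
    """Build a grid: one row per variable, one column per category."""
    table = {}
    for name, row_label in variables.items():
        answers = [r[name] for r in respondents if r.get(name) is not None]
        base = len(answers)
        table[row_label] = {
            col_label: round(100 * answers.count(code) / base, 1) if base else 0.0
            for code, col_label in code_frame.items()
        }
    return table


# Hypothetical satisfaction ratings for two brands.
respondents = [
    {"sat_a": 1, "sat_b": 3},
    {"sat_a": 2, "sat_b": 3},
    {"sat_a": 2, "sat_b": 1},
    {"sat_a": 3, "sat_b": 2},
]
variables = {"sat_a": "Brand A", "sat_b": "Brand B"}
code_frame = {1: "Low", 2: "Medium", 3: "High"}
satisfaction = grid_table(respondents, variables, code_frame)
for row_label, row in satisfaction.items():
    print(row_label, row)
```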

Splitting questions

In some situations, variables are represented as grids or multiple response questions when it is more appropriate to represent them as separate variables. In particular, Q and DataCracker automatically group together variables when they import data, and from time to time they are overzealous and group together variables that should not be grouped together, which then need to be split apart.

See How to Split Questions Into Separate Variables.

Changing variable type

Most of the more modern survey analysis programs (e.g., R, Q, DataCracker and MarketSight) use metadata to automate how they produce summary tables. For example, in the following output from MarketSight, the average is shown for IID - Interviewer Identification whereas percentages are shown for the other variables. This is because the metadata indicates that IID - Interviewer Identification is a numeric variable while Does respondent have a mobile phone? is a categorical variable. Thus, when the metadata shows the wrong Variable Type, the output is inappropriate, and the remedy is to change the variable type.


See How to Change Variable Type and Question Type.
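The effect of the variable type on a summary table can be illustrated with a short Python sketch. The `summarize` function and the coded data are hypothetical and do not reflect how any particular program is implemented; they only show why the same values produce an average under one type and percentages under the other.

```python
from collections import Counter
from statistics import mean


def summarize(values, variable_type):
    """Summarize a variable according to the type recorded in its metadata."""
    if variable_type == "numeric":
        return {"Average": round(mean(values), 2)}
    if variable_type == "categorical":
        counts = Counter(values)
        return {
            category: round(100 * count / len(values), 1)
            for category, count in sorted(counts.items())
        }
    raise ValueError(f"Unknown variable type: {variable_type}")


# Hypothetical coding: 1 = has a mobile phone, 2 = does not.
has_mobile = [1, 1, 2, 1]

as_numeric = summarize(has_mobile, "numeric")          # an average of the codes
as_categorical = summarize(has_mobile, "categorical")  # percentage per code
print(as_numeric)
print(as_categorical)
```

The average of the codes is not meaningful for this variable, whereas the percentages are, which is why the type needs to be corrected to categorical.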

Changing question type

In addition to allowing the user to change from numeric to categorical and back, Q and DataCracker also allow the user to change between different question types. Consider the table from DataCracker below on the left. The metadata in this example has treated this table as a multiple response question, but the table on the right is automatically generated instead when the metadata is changed to indicate that the question is, in fact, a grid.


See How to Change Variable Type and Question Type.

Question wording

Commonly, data files will either refer to questions by their question number (for example, a question showing age data may be referred to as #q43), or show some of the wording in a messy or truncated form (e.g., Please rate your satisf). In most programs it is straightforward to modify these names.

See How to Change the Name of a Question or Variable.

Category wording

The following table, from SPSS, shows a quite common problem, where numbers appear instead of descriptions for some of the categories in the table. The key challenge that this usually presents is working out what the numbers really mean; there is usually no way of working this out directly from the data (i.e., you need to ask whoever created the data file). Once this is known it is usually quite straightforward to fix the problem.


See How to Change the Label of a Category of a Question.
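The following Python sketch shows one way to find which codes lack labels and then apply the labels once they are known. The codes and labels are invented for the example; in practice the missing labels have to come from whoever created the data file.

```python
def unlabeled_codes(values, value_labels):
    """Return the codes that appear in the data but have no label."""
    return sorted(set(values) - set(value_labels))


def apply_value_labels(values, value_labels):
    """Replace each code with its label, leaving unlabeled codes unchanged."""
    return [value_labels.get(value, value) for value in values]


# Hypothetical data and an incomplete set of labels.
values = [1, 2, 3, 2, 4]
value_labels = {1: "Agree", 2: "Neutral"}

missing = unlabeled_codes(values, value_labels)
print(missing)  # the codes to ask the data provider about

# Labels assumed to have been supplied by the creator of the data file.
value_labels.update({3: "Disagree", 4: "Don't know"})
labeled = apply_value_labels(values, value_labels)
print(labeled)
```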
