Getting a Data File
A data file contains the individual responses to a survey in a format that permits them to be analyzed by a program specifically designed for the analysis of survey data (e.g., SPSS, Q, DataCracker, Stata). Almost all programs that are used to conduct surveys are able to export data files.
In much the same way that some meals are great while others are inedible, some data files are great and some are unusable. The quality of the data file is perhaps the biggest cause of problems for people learning how to analyze surveys. A common mistake is to put insufficient effort into obtaining 'good' data, which makes the analysis much harder than it needs to be.
What survey data looks like
When we conduct market research we usually collect data about individual people, households or businesses.[note 1] The data provided by respondents is called the raw data. The table below shows raw data for ten households (from a larger data file). Each row in a table of raw data represents the data from an individual respondent, and there are no blank rows. In this case, each respondent was a household. Each column is referred to as a variable. Each variable is a measure of some characteristic of the respondents.
Data such as that above is not, on its own, readily interpretable. To interpret such data it is necessary to also have metadata, explaining what it all means. The metadata, which is sometimes referred to as a data dictionary, is shown below. So, returning to the table above, the data indicates that the fourth household was not a customer of AT&T, there is no data indicating the household’s income, the household moved twice in the last 10 years, the respondent who completed the survey was 65 or older, and so on.
Where a variable is categorical, the values stored in the raw data can only be interpreted by looking at the metadata. In particular, with the MOVES variable, a 1 indicates that a household has not moved, a 2 indicates it has moved once, etc. By contrast, with the USAGE variable, which is numeric, the value is the number itself: a 1 indicates one long-distance call in a typical month, a 2 indicates two calls, etc.
|Variable||Variable Label||Value Labels||Variable Type|
|ID||A unique identification number assigned to each respondent||1 = first respondent, 2 = second respondent,...||Categorical|
|CARRIER||Phone carrier of household||1 = AT&T, 2 = Other||Categorical|
|INCOME||Household income bracket (in thousands)||1 = <7.5, 2 = 7.5-15, 3 = 15-25, 4 = 25-35, 5 = 35-45, 6 = 45-75, 7 = >75||Categorical|
|MOVES||Number of times the household has moved in the preceding 10 years||1 = 0, 2 = 1, 3 = 2, ..., 8 = 7, 11 = >10||Categorical|
|AGE||Age of the respondent||1 = 18-24, 2 = 25-34, 3 = 35-44, 4 = 45-54, 5 = 55-64, 6 = 65+||Categorical|
|EDUCATION||The highest level of education achieved||1 = Did not finish school, 2 = High School, 3 = College, 4 = Postgraduate||Categorical|
|EMPLOYMENT||Employment status of the respondent||1 = Full-time, 2 = Part-time, 3 = Student, 4 = At home, 5 = Retired, 6 = Unemployed||Categorical|
|USAGE||The typical monthly number of long-distance telephone calls by the household||||Numeric|
|Q1a||Aware: AT&T||0 = No, 1 = Yes||Categorical|
|Q1b||Aware: Verizon||0 = No, 1 = Yes||Categorical|
|Q1c||Aware: CenturyLink||0 = No, 1 = Yes||Categorical|
This table shows the minimal metadata necessary to analyze a survey. However, better data files will contain more information. In particular:
- Question Type. For example, note that the last three variables, Q1a, Q1b and Q1c are related and form a part of a single question (which asked people which of the companies they had heard of); a good data file will contain metadata showing that these are linked together.
- Versioning. For example, changes to question wording that occurred during the data collection process and different translations.
A good data file will contain both the raw data and the metadata together in a single file. If you have two files, one containing the raw data and another containing the metadata, then you do not actually have a 'data file'. Instead, you have the material needed to create a data file, but you still have to create it. Many data analysis programs provide tools that allow you to import the raw data and then enter the metadata, but this generally needs to be done manually (i.e., by retyping it or cutting and pasting each field of information). This is a time-consuming and error-prone process which should be avoided where possible.
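To illustrate how raw data and metadata fit together, the following is a minimal Python sketch rather than a description of how any particular survey program works. The two rows are based on the fourth and fifth households discussed above, with only a few of the variables included; the `label_row` function and the dictionary layout are assumptions made for this example.

```python
# Raw data: one dict per respondent, with categorical answers stored as codes.
# Illustrative rows only, based on the fourth and fifth households.
raw_data = [
    {"ID": 4, "CARRIER": 2, "MOVES": 3, "AGE": 6, "USAGE": 7},
    {"ID": 5, "CARRIER": 2, "MOVES": 1, "AGE": 6, "USAGE": 0},
]

# Metadata: value labels for the categorical variables (MOVES is abbreviated).
# USAGE is numeric, so it has no value labels.
value_labels = {
    "CARRIER": {1: "AT&T", 2: "Other"},
    "MOVES": {1: "0", 2: "1", 3: "2"},
    "AGE": {1: "18-24", 2: "25-34", 3: "35-44", 4: "45-54", 5: "55-64", 6: "65+"},
}


def label_row(row, labels):
    """Return a copy of row with categorical codes replaced by their labels."""
    labelled = {}
    for variable, value in row.items():
        mapping = labels.get(variable)
        # Variables without value labels keep their raw value.
        labelled[variable] = mapping.get(value, value) if mapping else value
    return labelled


labelled_rows = [label_row(row, value_labels) for row in raw_data]
print(labelled_rows[0])
```

Without the `value_labels` dictionary, the code 3 for MOVES could easily be misread as three moves rather than two, which is why the metadata needs to travel with the raw data.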
Data file formats
Data collection programs export data files in a specific format. Most programs provide multiple formats for exporting, but these formats can differ markedly in terms of their usefulness.
The simplest data files are called 'text data files'. It is generally a very bad idea to obtain the data from a survey as a text file. This is because when data is obtained as a text file there will be one of two problems:
- It will contain no metadata, which makes it at best difficult to analyze and at worst impossible (e.g., if you do not know that a value of 2 represents an age of 25-34, then there is no way to interpret the data).
- It will contain text instead of numbers for all the data. Initially this may appear to be useful, but in practice it is a massive problem, as:
  - Most programs for the analysis of survey data do not permit you to do analysis with data in this format, and so you will read the data into the program and then discover that you either cannot do even the most basic analysis or need to spend a lot of time re-formatting the data to make it useful.
  - Many of the important features of the survey will not be evident in the data file. For example, if you have asked a question getting people to give ratings from 0 to 10, when you create a table from a text file the ratings will be ordered as: 0, 1, 10, 2, 3, .... Similarly, Grid and Multiple Response questions will generally need to be treated as if they were multiple Single Response questions.
|4||Other||NA||2||65+||Did not finish school||Retired||7||Yes||Yes||Yes|
|5||Other||NA||0||65+||High School||At Home||0||Yes||Yes||Yes|
|7||Other||25-35||0||45-54||Did not finish school||Full-time||3||Yes||Yes||Yes|
|9||Other||NA||0||55-64||Did not finish school||Full-time||0||Yes||Yes||Yes|
|10||Other||25-35||0||45-54||Did not finish school||At Home||2||Yes||Yes||No|
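The ordering problem described above for 0 to 10 ratings is easy to reproduce. The short Python sketch below is illustrative only; it sorts the same ratings first as text and then as numbers.

```python
# Ratings from 0 to 10, stored as text, as they would be in a text data file.
ratings_as_text = [str(n) for n in range(11)]

# Sorting text compares character by character, so "10" lands before "2".
text_order = sorted(ratings_as_text)

# Converting to numbers while sorting restores the intended order.
numeric_order = sorted(ratings_as_text, key=int)

print(text_order)
print(numeric_order)
```

A program that treats the ratings as text has no way of knowing that 10 is the top of the scale, which is why the ordering has to be repaired by hand.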
CSV Files and Excel files
CSV is generally the best of the text file formats (although this is very much a case of being the tallest dwarf). A CSV file uses a comma to separate each variable.
Tab delimited files
This is similar to a CSV file, except that a tab character is used instead of a comma. Generally, if data is in this format it is appropriate to open it in Excel and then save it as a CSV file.
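As an alternative to Excel, the conversion can be scripted. The following is a minimal sketch using Python's standard `csv` module; the function name `tab_to_csv` and the sample text are assumptions made for this example.

```python
import csv
import io


def tab_to_csv(tab_text):
    """Convert tab-delimited text to comma-separated text."""
    reader = csv.reader(io.StringIO(tab_text), delimiter="\t")
    output = io.StringIO()
    # The writer quotes any value that itself contains a comma.
    writer = csv.writer(output, lineterminator="\n")
    writer.writerows(reader)
    return output.getvalue()


# Hypothetical sample with a header row and one respondent.
sample = "ID\tCARRIER\tEDUCATION\n4\tOther\tDid not finish school\n"
print(tab_to_csv(sample))
```

Using a CSV library rather than a simple find-and-replace matters when an answer contains a comma, as the library adds the quotation marks needed to keep that answer in a single column.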
Fixed width (ASCII) files
A fixed width file is one where each column of characters has a specific meaning. For example, in the data below the first column may represent the first variable, the second and third columns together may represent the second variable, and so on. This format was invented because it took up little hard-disk space, which was an important consideration in the 1960s and 1970s. It is rarely used today and is the worst of all of the file formats, as it cannot readily be used with Open-Ended questions and most modern programs will not read it. Generally, if data is in this format it is appropriate to open it in Excel and then save it as a CSV file.
00001 01200 01203
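A fixed width file can only be read with a layout that says which character positions belong to which variable. The Python sketch below assumes a hypothetical layout (the variable names `V1`, `V2`, `V3` and their positions are invented for illustration) and assumes that each group of five digits above is one respondent's record; neither assumption comes from the example data itself.

```python
# Hypothetical layout: (variable name, start, end), 0-based, end exclusive.
LAYOUT = [
    ("V1", 0, 1),  # first column
    ("V2", 1, 3),  # second and third columns together
    ("V3", 3, 5),  # fourth and fifth columns together
]


def parse_fixed_width(record, layout):
    """Slice one fixed width record into named integer variables."""
    return {name: int(record[start:end]) for name, start, end in layout}


# Assumption: each space-separated group is one respondent's record.
records = [parse_fixed_width(r, LAYOUT) for r in "00001 01200 01203".split()]
for record in records:
    print(record)
```

Under a different layout the same digits would produce entirely different values, which illustrates how dependent this format is on separately supplied metadata.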
Good data files
The good formats
The gold standard data file is an IBM SPSS Data Collection Model data file (also known as a Dimensions, MDT or MDD data file). This file format contains all the different types of metadata. This data file is only created by the top-of-the-range IBM data collection programs and can only be read by IBM data products and a small number of other products (Q and DataCracker).
The next-best format is the Triple S format. It is a little more widely used than the IBM SPSS Data Collection Model format, but it is generally only available in the more expensive data collection programs.
The industry standard 'good' file format for data is an SPSS .SAV data file (usually called a 'dot sav' file). This is not quite as good as the other two formats, as it does not contain the versioning information and it only contains very limited Question Type information (it does not support the various types of Grid question). However, all good data collection programs can export in this format. Refer to SPSS Data File Specifications for details on how these files are best set up.
Occasionally data collection programs will export both a text file and an SPSS .sps file (also known as a syntax file). The syntax file is actually a program which contains instructions for turning the text file into an SPSS .SAV file. SPSS is the only program that can always read these files, but Q can read them in some circumstances.
Appropriate set up in the good formats
Obtaining a good data file is not just a case of specifying the desired format. In particular, in the case of SPSS .SAV data files, it is quite common to have them created with either incorrect values or incorrect metadata. The most common problems are:
- Incorrect values for options not selected in multiple response questions. That is, the files use the same value (commonly a 0 or a special missing value category, usually called SYSMIS, NA, or NaN) to indicate that somebody was not asked a question as they use to indicate that somebody did not select an option in a Multiple Response question.
- Labels that have been truncated (e.g., saying Please rate your satisfaction with the following ba), making it impossible to determine what the data means (except by reviewing the questionnaire).
Most survey analysis programs have some facilities for cleaning such poor data, but it is generally advisable to instead try to obtain a data file that does not contain such problems.
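Where a file has conflated "not asked" with "not selected", the distinction can sometimes be rebuilt if the questionnaire's skip logic is known. The sketch below is one possible approach in Python, not a feature of any particular program; the function name, the `was_asked` flag and the use of `None` for missing values are all assumptions made for this example.

```python
def recode_multiple_response(values, was_asked):
    """Separate 'not selected' (0) from 'not asked' (None).

    values: dict mapping each option to 1 if selected, or None if missing.
    was_asked: whether the respondent was shown the question, which must be
    worked out separately from the questionnaire's skip logic.
    """
    if not was_asked:
        # Nobody who skipped the question can have selected or rejected an option.
        return {option: None for option in values}
    # For respondents who were asked, a missing value means 'not selected'.
    return {option: 1 if value == 1 else 0 for option, value in values.items()}


# Hypothetical respondents: both have missing values in the exported file.
asked = recode_multiple_response({"Q1a": 1, "Q1b": None, "Q1c": None}, True)
skipped = recode_multiple_response({"Q1a": None, "Q1b": None, "Q1c": None}, False)
print(asked)
print(skipped)
```

This kind of repair relies on information that is not in the data file, which is why it is preferable to have the file exported correctly in the first place.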
Notes
- There are many other units of analysis that are, from time to time, required, such as occasions, products and transactions.