In the last step, we downloaded all of our data, deposited it into directories that store this source data, backed it up, and write-protected the files. Now that we have done all of that, it is time to start working with the data! There is only one problem: almost inevitably, the data do not arrive neat, tidy, and ready to use. Often, the data contain major problems and need to be constructed in order to be usable. In this installment, I will write about managing files for cleaning, constructing, and storing datasets.

There are several ways in which data can be less than completely usable. For many items in the data, the variables need to be cleaned. For instance, a woman's weight might be reported as 999 pounds. I sincerely doubt, even in the age of increasing obesity, that a woman weighing 999 pounds would be included in the dataset (not impossible, but highly improbable). If the woman appears to be relatively short and healthy on other measures, it is likely that there was a keying error at some point and 99 turned into 999. This is certainly not the only way that data must be cleaned -- data cleaning deserves its own complete treatment -- but it is a good example of the kinds of things that must be done to make the data usable.
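To make this concrete, here is a minimal Stata sketch of that kind of check (the variable names id, weight, and height are hypothetical):

    * List suspicious cases for inspection before changing anything
    list id weight height if weight > 500 & weight != .

    * If the value is clearly a keying error, set it to missing
    * (or correct it, if the intended value can be verified)
    replace weight = . if weight == 999

Whatever you decide, make the change in a script rather than by hand, so the decision is documented and reproducible.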

The other way that data must be turned into a usable form is through variable construction. Although the existing dataset will (hopefully) include the tools you need to conduct a desired analysis, often the data are not in a format readily available for use. For example, in the 2004 Detroit Area Study, the sister study to the Chicago Area Study I mentioned in my previous post, question D_4 asked those respondents who indicated that they might be moving why they would be doing so. They were presented with a list of 11 options (e.g., want a safer neighborhood, lease expired, etc.) and interviewers were asked to record the number corresponding to each reason the respondent gave for moving. From a data entry point of view, the best way to limit interviewer burden and ensure that the correct answers were recorded was to have interviewers enter each number as the respondent gave a reason. The items thus end up in the data as reason on first mention (variable d_4_01), reason on second mention (d_4_02), and so on. Unfortunately for the data analyst, one respondent could indicate "lease expiration" as the first mention while another could indicate it as the third mention. In order to look at respondents moving because of "lease expiration," we want to look across all mentions and create a variable that indicates whether a respondent gave "lease expiration" on any mention. This, in my mind, is dataset construction.1 Again, this example, though detailed, is only one of many ways that variables can be constructed in a dataset -- a topic that deserves its own treatment elsewhere. Another common form of construction is the creation of indexes, scales, or new variables out of multiple existing variables.
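A scale of that kind might be sketched in Stata like this (the item names att1-att3 and the scale names are hypothetical):

    * Average three 1-5 attitude items into a scale;
    * rowmean() uses whichever items are nonmissing for each respondent
    egen safety_scale = rowmean(att1 att2 att3)

    * Alternatively, require all three items to be nonmissing
    generate safety_scale2 = (att1 + att2 + att3)/3 ///
        if att1 != . & att2 != . & att3 != .

The choice between the two depends on how you want to handle respondents with partially missing items.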

As I described before, it is absolutely imperative that you be able to go back and check your work, potentially modify it, and -- most importantly -- be able to reproduce it. For example, if a reviewer asks how you created a variable, it is very important that you be able to describe how it was created and to be able to create it again if need be (or modify how you constructed it according to the reviewer's fancy). That means you should create some kind of script that allows you to easily and quickly reproduce your work in the future (i.e., a .sas file in SAS or a .do file in Stata; I will simply use the term "script"). I store these scripts inside of a directory called DatasetConstruction at the same level as SourceData. So, my directory containing the CAS data looks like:

/ROOT
    /Data
        /CAS
            /SourceData
            /DatasetConstruction

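If you like, this structure can be created from within Stata itself (run once from /ROOT; Stata's mkdir creates one directory at a time, so parents must come first):

    mkdir "Data"
    mkdir "Data/CAS"
    mkdir "Data/CAS/SourceData"
    mkdir "Data/CAS/DatasetConstruction"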
Inside of the DatasetConstruction folder, I have a file called makeCASRespondentData.do that contains all of the code to create the variables that I will use across analyses with the same dataset. The convention that I use to name scripts that create datasets, or portions of datasets, is the following: the word make followed by a reasonably clear description of the data being made by that file. In this case, the script is named makeCASRespondentData.do because the data are from the CAS and the level of analysis is the respondent (I also have datasets at the community level and individual item level -- each of those has its own script and data file).

I spent a lot of time getting to know the CAS, cleaning it and constructing variables, just as you will with your datasets. Therefore, I create all of the key variables to be used in analysis across different projects (e.g., variables measuring race, education, age, etc.) in this dataset. I generally avoid including variables created for the express purpose of a specific analysis in the dataset construction (I will discuss this more when I write about creating data for projects). The line can sometimes be blurry, and the process can be iterative, but generally this script should create all variables necessary for conducting analyses with the data.

At the end of the script, I save a dataset that contains all of my cleaned and constructed variables. When I name the file for these data, I always use the same name as the .do file minus the initial "make." So the CAS dataset is called CASRespondentData.dta if I make it in Stata. The really convenient feature of this system is that it is always very easy to find the file that created the data you are using in case you have questions. I save this dataset in a different directory called Dataset, at the same level as SourceData and DatasetConstruction, so that I can find it easily. At this point, the directory structure looks like:

/ROOT
    /Data
        /CAS
            /SourceData
            /DatasetConstruction
                /makeCASRespondentData.do
            /Dataset
                /CASRespondentData.dta
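Putting this together, the skeleton of such a script might look like the following (the source file name and the variables shown are illustrative, and the relative paths assume the script is run from inside DatasetConstruction):

    * makeCASRespondentData.do -- clean and construct the respondent-level data
    clear all

    * Load the raw, write-protected source data
    use "../SourceData/cas_raw.dta"

    * --- Cleaning ---
    replace weight = . if weight == 999  // keying error

    * --- Construction ---
    generate college = (educ >= 16) if educ != .

    * Save the analysis-ready dataset
    save "../Dataset/CASRespondentData.dta", replace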

Now, we have the original data, our scripts for reproducing the data in the future, and a cleaned dataset that we can load and start using right away. We're getting closer to the fun stuff of actually doing analysis!


  1. In fact, the DAS already did this (variable d_4_11YN). If you were to do this yourself in Stata, you could write something like:

      generate leaseexp = 0 if d_1>1 & d_1!=. //Only respondents who gave an answer greater than 1 on question d_1 were asked question d_4  
      forvalues i=1/12 { //Loop over the 12 mention variables, d_4_01 through d_4_12  
            local j : display %02.0f `i'
            replace leaseexp=1 if d_4_`j'==11 //11 is the code for "lease expired"
            }