After establishing where my root directory resides, it is time to actually get to work. As with any endeavor, success begins by laying a solid foundation, and with academic work that foundation is our data.
The most fundamental skill to academic success is asking good questions and acquiring data to answer those questions. Yet, in quantitative research, that skill is useless without the ability to manipulate data into useful formats that are capable of answering the good questions. Data cleaning, construction, and manipulation constitute well over half of my work on major quantitative projects.
It should go without saying, but there are many different types of data. I generally deal with three types: survey data, administrative data (e.g., Census data, business listings), and geographic data (e.g., data created using geographic information systems). While my descriptions here will accordingly reflect those types of data, the principles that I describe are transferable to other types of data.1
The same principles apply because I think that pretty much all data can be described at three different stages, each of which is required for different steps of analysis: 1) source data, the data in its rawest form that is delivered to you (or you collect); 2) cleaned "master" data, the basis for all of your analysis based on a particular dataset after it is cleaned and processed; and 3) project data, the data that you use for a specific project. My data directory contains the first two (source data and cleaned master data) and I will discuss them in more detail below. The final step, project data, sits in a project directory and I will describe that folder when I discuss the project directory.
To get started, let's create a directory called Data immediately below our root:
/ROOT
    /Data
Now that we have the Data directory created, we can move on to putting data in the right places.
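Creating the directory by hand works fine, but it can also be scripted so the same layout is easy to reproduce on another machine. The following is a minimal sketch in Python, not a required step; the function name `make_data_dir` is made up for illustration, and a temporary directory stands in for the real root.

```python
import tempfile
from pathlib import Path


def make_data_dir(root: Path) -> Path:
    """Create a Data directory immediately below root and return its path."""
    data_dir = root / "Data"
    # exist_ok=True makes the call safe to repeat if the directory already exists.
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


if __name__ == "__main__":
    # A temporary directory stands in for the real root in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        print(make_data_dir(Path(tmp)))
```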
Source Data: The Raw Materials
It sounds silly to say, but performing analysis requires data. Without data, you have no analysis, no project, no paper, presentation, dissertation, or book (or at least none of any quality). Because it is so important, there are two things that I do in order to not inadvertently lose this data. First, I back it up in multiple places -- virtually and physically (having your backup copy sitting next to your computer will help you if your computer melts down; however, it will do you no good if a pipe bursts and floods your office)2. Second, I never, ever, ever touch the original data; if you cannot analyze the data in the format in which it is delivered, then save a copy and work from that. Or, better yet, write a script (which we will cover when we talk about "Dataset Construction") that allows you to retrace all of your steps.
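One way to make the backup habit concrete is to copy each source file and confirm that the copy is byte-for-byte identical before trusting it. The sketch below is one possible approach, not a prescribed tool: the function names `sha256_of` and `back_up`, the file name `raw.csv`, and the `Backup` folder are all hypothetical. It also only covers a single local copy; the off-site copy still has to be made separately.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and verify the copy against the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target


if __name__ == "__main__":
    # Hypothetical file and folder names inside a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "raw.csv"
        original.write_text("id,age\n1,34\n")
        print(back_up(original, Path(tmp) / "Backup"))
```

The original file is only ever read here, never modified, which is the point of the second rule above.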
When you collect, acquire, or download your data you need to put it somewhere. Within the Data directory, you should create a new sub-directory based on the data collection effort. Each data collection effort (e.g., a survey, qualitative interviews, field notes, or web scrapes) should live inside of its own subdirectory; however, sometimes figuring out what constitutes a single data collection effort requires some judgement and a little bit of foresight. When naming sub-directories, I try to follow two distinct, but often conflicting, goals: 1) keep all data for a given study together so that I can find it easily, 2) keep reusable data (i.e., data not necessarily tied to a single study) in its own sub-directory.
An example perhaps provides the best explanation. For my dissertation, I studied Chicago and used a different dataset for each chapter: I used the 2004 Chicago Area Study (CAS), a national dataset of Census tracts, the Chicago Community Adult Health Study (CCAHS), and the Neighborhood Change Database (NCDB). It was easy to choose what to do with the NCDB: it got its own folder since it was an extract from a national-level dataset. Choosing a separate subdirectory for each of the two surveys (the CAS and the CCAHS) also proved to be an easy decision. However, I also extracted a number of Census variables for the Chicago Metropolitan Area (not already included in the NCDB) and had to decide what to do with that data. Although I would use Census data in the chapters using both the CCAHS and the CAS, it didn't make sense to save some variables in one subdirectory and other variables in another. Instead, it made sense to create a single Census folder that both chapters, the one using the CCAHS data and the one using the CAS data, could reference.
Using these datasets as an example, we can now delve into a piece of my directory structure. Above, we created a Data directory on our root. Let's start with the CAS data. Immediately below the Data directory, we will create a directory called CAS (be careful using acronyms because they can become confusing, so it might ultimately be better to call this directory ChicagoAreaStudy, but we'll leave it as it is for now):
/ROOT
    /Data
        /CAS
Now, all data pertaining to the CAS will go into that folder. This study has a single source of data, so we can create:
/ROOT
    /Data
        /CAS
            /SourceData
I will store the survey data that I downloaded in the SourceData directory.
Data are frequently split among several files for different reasons and each implies a different structure. Some are split such that dataset files contain different variables for the same units of observation (i.e., same number of rows with different columns). Others contain the same variables, but different units of observation in each file (i.e., same columns, different rows). I believe that a good rule of thumb is that the former (equal rows, different columns) should be saved in the same level of the directory, while each file in the latter (equal columns, different rows) should be saved in its own folder.3 Although this is a loose rather than hard-and-fast rule, I think that it makes sense since the files with the same variables contain data on different units of observation that might be helpful or useful to split off in the future.
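The difference between the two kinds of split is easiest to see in how the files get put back together: column-split files are merged on a shared identifier, while row-split files are simply stacked. The sketch below uses only Python's standard library; the helper names (`read_rows`, `merge_columns`, `append_rows`) and the tiny datasets are invented for illustration and do not come from any of the studies discussed here.

```python
import csv
import io


def read_rows(text: str) -> list:
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_columns(left: list, right: list, key: str) -> list:
    """Same rows, different columns: join two files on a shared identifier."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


def append_rows(first: list, second: list) -> list:
    """Same columns, different rows: stack one file below the other."""
    return first + second


if __name__ == "__main__":
    # Made-up example: two files describing the same two respondents.
    demographics = read_rows("id,age\n1,34\n2,51\n")
    attitudes = read_rows("id,trust\n1,high\n2,low\n")
    print(merge_columns(demographics, attitudes, "id"))

    # Made-up example: the same variables for tracts in two different cities.
    city_a = read_rows("tract,population\nA,100\n")
    city_b = read_rows("tract,population\nB,200\n")
    print(append_rows(city_a, city_b))
```

Files that will be merged always travel together, which is why keeping them at the same directory level is convenient; files that will only ever be appended can live apart without losing anything.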
The Census data that I use provides a useful, though not perfectly straightforward, example. Because the example is a little bit messy, it enables us to think through how we decide whether to create new directories for data. In this case, there are multiple products released by the Census at multiple geographies (e.g., Census tract, block group) in multiple years. The Census is released in different files; the data that I saved here come from the "Summary File 3" data in 2000. I show the directory structure and then explain my choices below.
/ROOT
    /Data
        /CAS
            /SourceData
        /Census
            /2000
                /SF3
                    /SourceData
                        /ChicagoCMSA
                        /PhiladelphiaCMSA
You will see that it isn't until we get to the /Data/Census/2000/SF3 directory that we reach the SourceData directory that actually contains the data. The 2000 Census is a unique data collection effort compared to the 1990 Census, and Summary File 3 is a unique data collection effort compared to the other Summary Files. Within the SourceData directory, the ChicagoCMSA directory holds the Chicago-area data (CMSA stands for Consolidated/Combined Metropolitan Statistical Area) that I will process for use in analysis.
I created a second directory containing the data from the 2000 SF3 file for Philadelphia because it constitutes an incompatible geography (i.e., it would not make sense to merge Philadelphia data with Chicago data, although I could later append the two sources if I wanted to do a cross-city comparison, but I'll hold off on that tangent). Therefore, I keep them as separate directories within the same SourceData directory. One advantage of this system is that I can also use the same code to process the SF3 data from the Philadelphia CMSA that I use on the SF3 data from the Chicago CMSA.
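To illustrate the point about reusing code, here is a rough sketch of applying one processing function to every city directory under a shared SourceData folder. It is an assumption about how such a loop might look, not the actual processing code: `count_files` is a deliberately trivial placeholder for a real cleaning step, and the file names in the demo are invented.

```python
import tempfile
from pathlib import Path


def count_files(city_dir: Path) -> int:
    """Placeholder processing step: count the files in one city's directory."""
    return sum(1 for item in city_dir.iterdir() if item.is_file())


def process_each_city(source_dir: Path, process=count_files) -> dict:
    """Apply the same processing function to every subdirectory of source_dir."""
    return {
        city_dir.name: process(city_dir)
        for city_dir in sorted(source_dir.iterdir())
        if city_dir.is_dir()
    }


if __name__ == "__main__":
    # Build a throwaway SourceData folder with hypothetical placeholder files.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "SourceData"
        for city in ("ChicagoCMSA", "PhiladelphiaCMSA"):
            (source / city).mkdir(parents=True)
            (source / city / "placeholder.txt").write_text("")
        print(process_each_city(source))
```

Because the two cities sit side by side with the same internal layout, adding a third city would only require a new directory, not new code.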
Finally, I will add a directory for the last dataset used in my dissertation, the CCAHS. In this case, there is a logical divide between the two hierarchical levels of the data, individuals and neighborhoods. The individual-level data is a survey of Chicago residents and the neighborhood-level data is from a systematic social observation of residential blocks in the city; each represents a different data collection effort. I now have:
/ROOT
    /Data
        /CAS
            /SourceData
        /Census
            /2000
                /SF3
                    /SourceData
                        /ChicagoCMSA
                        /PhiladelphiaCMSA
        /CCAHS
            /Survey
                /SourceData
            /SSO
                /SourceData
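The whole structure can be rebuilt from a short list of paths, which is useful when setting up a new machine or starting a similar project. This is only a sketch: the nesting in `SOURCE_DIRS` reflects my reading of the layout described above (Census and CCAHS as siblings of CAS under Data), `build_tree` is a made-up name, and a temporary directory stands in for the root.

```python
import tempfile
from pathlib import Path

# Leaf directories of the layout; parent folders are created along the way.
SOURCE_DIRS = [
    "Data/CAS/SourceData",
    "Data/Census/2000/SF3/SourceData/ChicagoCMSA",
    "Data/Census/2000/SF3/SourceData/PhiladelphiaCMSA",
    "Data/CCAHS/Survey/SourceData",
    "Data/CCAHS/SSO/SourceData",
]


def build_tree(root: Path, relative_dirs=SOURCE_DIRS) -> list:
    """Create each directory (and its parents) below root; return the paths."""
    created = []
    for relative in relative_dirs:
        path = root / relative
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # A temporary directory stands in for the real root in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        for path in build_tree(Path(tmp)):
            print(path.relative_to(tmp))
```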
With this, we have laid out the basic data structure for importing data for later use as if I were rewriting my dissertation (the horror!). Contained within each of the SourceData directories above should be at least one data file, and before we do anything else (we do not pass "GO", we do not collect $200), we write-protect our data. Now that we've write-protected our data, we make sure to save it somewhere else. Now we save it again and take that last copy to move it to another physical location. Now we're ready to start digging in...
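Write-protecting can be done through the operating system's file properties, but it can also be scripted so that no file is missed. The sketch below removes the write permission bits from every file under a directory; `write_protect` is a hypothetical helper name and `survey.csv` is an invented file. On Windows, Python's `chmod` only toggles the read-only flag, so the effect there is coarser than on Unix-like systems.

```python
import stat
import tempfile
from pathlib import Path

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def write_protect(source_dir: Path) -> list:
    """Remove write permission from every file below source_dir."""
    protected = []
    for path in sorted(source_dir.rglob("*")):
        if path.is_file():
            # Keep all existing permission bits except the write bits.
            path.chmod(path.stat().st_mode & ~WRITE_BITS)
            protected.append(path)
    return protected


if __name__ == "__main__":
    # Hypothetical source file inside a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "SourceData"
        source.mkdir()
        (source / "survey.csv").write_text("id\n1\n")
        for path in write_protect(source):
            print(path.name, oct(path.stat().st_mode & 0o777))
```

Only files are changed, not directories, so new files can still be added to a SourceData folder later.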
Most universities use some sort of network file storage (e.g., AFS) that usually employs off-site daily back-ups. This is a great option to use; however, if you do, be sure to move all of your files before graduation or job relocation, when your computer privileges at the university end, because otherwise you will lose your data. ↩
Hierarchically-arranged datasets constitute a special case that is a hybrid of these two types of organization. In this case, there might be different rows; for example, one file has the number of people while another might contain the number of neighborhoods. If there is a single identifier which links the two datasets (e.g., neighborhood identifier), then I would typically store them in the same SourceData directory. If there are many neighborhood-level datasets, I would create a separate SourceData directory that contains all of these files. Again, this is only a rule of thumb and it is possible that I violate it as often as I use it; as you use hierarchical datasets, you will determine the best way for you to think about your project. ↩