I have come across a problem several times that has been relatively frustrating to deal with. I have data that is downloaded from a site (specifically the Census (which is why this comes up consistently) in which the first two lines of the data contain the variable name and variable description respectively. This is incredibly useful for documenting data. Rather than attempting to figure out what variable pct001001 means, the description of the variable is right there.

The problem with data in this format is that Stata imports variables as string variables with the first observation being the variable description. I could pull the first two lines of the data out of the original dataset, transpose the rows and columns, save them in a separate text file, and then import the variable names and descriptions. However, managing two files means that it is more likely that one gets lost or I forget to send one of the files to a colleague working on the paper, or any number of other problems that could be experienced by separating these two files. Having one single file would be far superior and that is what the code below is designed to accommodate.

Data available from the U.S. Census comes in the following format (data is clipped):

1 geo_id sumlevel geo_name p003001
2 "Geography Identifier" "Geographic Summary Level" "Geography" "Total Population"
3 "14000US17031010100" 140 "Census Tract 101, Cook County, Illinois" 5391
4 "14000US17031010200" 140 "Census Tract 102, Cook County, Illinois" 10706
5 "14000US17031010300" 140 "Census Tract 103, Cook County, Illinois" 6649

If this file is a tab-delimited file and it is imported, Stata will automatically recognize Row 1 as variable names. But, it will then think that Row 2 is data and, as such, it will make all of the variables string variables when only two should be (geo_id and geo_name). If the above data is saved in a file dt_dec_2000_sf3_u_data001_1.txt (the default for a download from the Census SF-2 files), then we can use the following code to extract the data

insheet using dt_dec_2000_sf2_u_data001_1.txt, names clear 
foreach var of varlist * { 
    local varlab = `var'[1] 
    label var `var' "`varlab'" 
drop in 1

destring , replace

The following code uses the fact that Stata does this to its advantage. Instead of having to import variable labels from a separate file, Stata has done us a favor by importing them already. Essentially, "data" contained for each variable in the first observation is just it's variable label. Therefore, we can just pull that data out, use it to label the variable. After we do that, we can drop the first observation since all it contains is variable labels that are unhelpful. The only problem left is that our numeric data will still be saved in string fields, for which we use the destring, replace command. There is one note of caution here: any variable that contains all numeric characters will be converted; however, this is sometimes undesirable if you have a long numeric code to identify observations (e.g. a FIPS code in the Census). If that is the case, you can place a variable list in the destring command to only destring truly numeric variables.

This can help keep my files organized and legible while providing Stata-accessible data.


Pingbacks are open.


Comments are closed.