I have come across a problem several times that has been relatively frustrating
to deal with. I have data that is downloaded from a site (specifically the
Census (which is why this comes up consistently) in which the first two
lines of the data contain the variable name and variable description
respectively. This is incredibly useful for documenting data. Rather than
attempting to figure out what variable pct001001
means, the description of the
variable is right there.
The problem with data in this format is that Stata imports variables as string variables with the first observation being the variable description. I could pull the first two lines of the data out of the original dataset, transpose the rows and columns, save them in a separate text file, and then import the variable names and descriptions. However, managing two files means that it is more likely that one gets lost or I forget to send one of the files to a colleague working on the paper, or any number of other problems that could be experienced by separating these two files. Having one single file would be far superior and that is what the code below is designed to accommodate.
Data available from the U.S. Census comes in the following format (data is clipped):
1 | geo_id | sumlevel | geo_name | p003001 |
2 | "Geography Identifier" | "Geographic Summary Level" | "Geography" | "Total Population" |
3 | "14000US17031010100" | 140 | "Census Tract 101, Cook County, Illinois" | 5391 |
4 | "14000US17031010200" | 140 | "Census Tract 102, Cook County, Illinois" | 10706 |
5 | "14000US17031010300" | 140 | "Census Tract 103, Cook County, Illinois" | 6649 |
If this file is a tab-delimited file and it is imported, Stata will
automatically recognize Row 1 as variable names. But, it will then think that
Row 2 is data and, as such, it will make all of the variables string variables
when only two should be (geo_id
and geo_name
). If the above data is saved in
a file dt_dec_2000_sf3_u_data001_1.txt
(the default for a download from the
Census SF-2 files), then we can use the following code to extract the data
insheet using dt_dec_2000_sf2_u_data001_1.txt, names clear
foreach var of varlist * {
local varlab = `var'[1]
label var `var' "`varlab'"
}
drop in 1
destring , replace
The following code uses the fact that Stata does this to its advantage. Instead
of having to import variable labels from a separate file, Stata has done us a
favor by importing them already. Essentially, "data" contained for each variable
in the first observation is just it's variable label. Therefore, we can just
pull that data out, use it to label the variable. After we do that, we can drop
the first observation since all it contains is variable labels that are
unhelpful. The only problem left is that our numeric data will still be saved in
string fields, for which we use the destring, replace
command. There is one
note of caution here: any variable that contains all numeric characters will be
converted; however, this is sometimes undesirable if you have a long numeric
code to identify observations (e.g. a FIPS code in the Census). If that is the
case, you can place a variable list in the destring command to only destring
truly numeric variables.
This can help keep my files organized and legible while providing Stata-accessible data.
Comments
Comments are closed.