posts tagged “data-management”

Because I am a perfectionist (read: masochist), I spent the better part of my day making my R code more elegant. I figured out what to do with a simple loop, but wanted to write the code the right way. I always tell myself that the time I spend writing the right code (read: torturing myself) will help me down the line so that I know how to do it next time. I will inevitably forget and spend the same four hours doing the same thing again. As a gift to my future self, I decided to write down what I learned because it will likely come up again (you're welcome, future Mike!).

My basic problem comes from the desire to match two lists item by item. Python has a built-in function, zip(), that does exactly this. I want to figure out how to zip in R.
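The post is truncated here, but for a sense of the problem, one way to get zip-like behavior in base R (a sketch, not necessarily the approach the original post lands on) is Map():

a <- c(1, 2, 3)
b <- c("x", "y", "z")
zipped <- Map(list, a, b)   # list of (a[i], b[i]) pairs, with types preserved
zipped[[2]]                 # the second pair: 2 and "y"
mapply(paste, a, b)         # or apply a function across the paired elements directly

Map() is just mapply() with SIMPLIFY = FALSE, so it always returns a list, which is the closest analogue to Python's list of tuples.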

R presents more of a challenge than Stata on many fronts, one of which is basic data management.

I often find myself calculating the value of one observation based on the value of an adjacent observation. For example, to assess a lagged effect, I would take the value from the preceding interval. Stata makes this really easy; R, not so much.

Here's what we would do in Stata:

set obs 1000                    // create 1,000 empty observations
gen i = _n                      // _n is the current observation number
gen val = round(runiform()*10)  // random integers between 0 and 10
gen lag = val[_n-1]             // take val from the preceding observation

The last command throws the warning (1 missing value generated) because the first observation has nothing preceding it to lag. The first 10 observations ...
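For comparison, a minimal R sketch of the same lag using the shift-by-one trick (the truncated post may take a different route, and dplyr::lag() is another option):

n   <- 1000
val <- round(runif(n) * 10)          # random integers 0-10, like Stata's runiform()
lag <- c(NA, val[-n])                # shift values down one place; the first element has no lag
head(data.frame(i = 1:n, val, lag))  # inspect the first few rows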

I am attempting to learn R. This is either a great thing or a terrible, terrible mistake while on the tenure clock. But all the cool kids are doing it -- so even though they might also jump off a bridge, I'm going to jump into R.

The hardest part so far is doing things that now come as second nature to me in Stata. Although R's tools are much better in the long run, the learning curve is steep: you have to learn what types of objects different functions return, and so on.
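A generic illustration of the issue (not code from the post): similar-looking calls hand back very different kinds of objects, and class() and str() are the quickest way to see what you actually got.

class(1:3)                      # "integer"
class(data.frame(x = 1:3))      # "data.frame"
fit <- lm(y ~ x, data = data.frame(x = 1:10, y = rnorm(10)))
class(fit)                      # "lm"
str(fit, max.level = 1)         # peek inside whatever a function returned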

A while back (all posts are a while back now), I wrote a post describing how to import ...

Programming in Stata is relatively straightforward, partly because the syntax is both powerful and simple. There are, however, a few minor annoyances in Stata's language, including using the backtick and apostrophe to indicate local macros (i.e., `localname'). Among these shortcomings, I would argue that the lack of anything like a list in Stata's language is one of the largest.

In most languages, you can store a list of items and refer to any item in the list by some sort of index. This is particularly helpful for iterating over the same step multiple times. Lists generally come in two flavors: lists to which you ...
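For contrast, here is what that looks like in R, whose lists do everything described above (the model names are made up for illustration):

models <- list("ols", "logit", "probit")  # store several items in one object
models[[2]]                               # refer to an item by index: "logit"
for (m in models) {                       # iterate the same step over every item
  message("estimating ", m)
}
models <- c(models, "poisson")            # and grow the list as needed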

In the last step, we downloaded all of our data, deposited it into directories that store the source data, backed it up, and write-protected the files. Now that we have done all of that, it is time to start working with the data! There is only one problem: almost inevitably, the data do not arrive neat, tidy, and ready to use. Often, the data contain major problems and need to be constructed in order to be usable. In this installment, I will write about managing files for cleaning, constructing, and storing datasets.

Data can fall short of being completely usable in several ways. For many items in the ...

After establishing where my root directory resides, it is time to actually get to work. As with any endeavor, success begins by laying a solid foundation, and with academic work that foundation is our data.

The most fundamental skill for academic success is asking good questions and acquiring data to answer those questions. Yet, in quantitative research, that skill is useless without the ability to manipulate data into useful formats capable of answering those good questions. Data cleaning, construction, and manipulation constitute well over half of my work on major quantitative projects.

It should go without saying, but there are many different types of data. I ...

In my last post, I explained the value of a directory structure: consistent file management structures a disciplined workflow that increases productivity. Just how important this is only became clear to me after graduate school, as the result of starting a new job.

When I moved to start my new job, I needed to move my files to my new computer. In transferring them, I realized that the work that followed my well-defined workflow transferred easily, while the work that didn't follow the workflow did not.

The contrast between the ease with which I picked up the well-structured work and the difficulty of getting up to speed on the disorganized pieces ...

When I say that one of the most important things that I did in graduate school was set up a directory structure and workflow for my files, I am not kidding. Reading theory, learning statistical methods, and writing literature reviews were all important. However, just as important -- though not nearly as sexy -- is setting up a file structure and working directory.

Despite how trivial it sounds, maintaining a well-designed directory structure not only provides a framework for your files, it also structures productive work.

Given how important it was for me, I will attempt to explain the directory structure that I developed. Let me begin by saying that I am ...

I have come across a problem several times that has been relatively frustrating to deal with. I have data downloaded from a site (specifically the Census, which is why this comes up so consistently) in which the first two lines contain the variable names and variable descriptions, respectively. This is incredibly useful for documenting data: rather than attempting to figure out what a variable like pct001001 means, the description is right there.

The problem with data in this format is that Stata imports every variable as a string, with the first observation being the variable description. I could pull the first two lines of the data ...
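For what it's worth, the same layout is also manageable in R by reading the two header lines separately from the data (a sketch assuming a comma-separated file; "census.csv" is a hypothetical filename):

hdr <- read.csv("census.csv", header = FALSE, nrows = 2,
                colClasses = "character")                # line 1: names, line 2: descriptions
dat <- read.csv("census.csv", header = FALSE, skip = 2)  # the actual data start on line 3
names(dat) <- unlist(hdr[1, ], use.names = FALSE)        # attach the real variable names
desc <- setNames(unlist(hdr[2, ], use.names = FALSE),
                 names(dat))                             # keep the descriptions as documentation
desc["pct001001"]                                        # look up what a cryptic name means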