tl;dr: paste outcome and independent variables into R's as.formula() function to avoid typing out the same models repeatedly.

The "don't repeat yourself" (DRY) principle in programming tries to eliminate errors by eliminating repetitious code. When code repeats, changing it in one place requires making the same change everywhere it repeats. Relying on memory to find every one of those places increases the chance that you introduce bugs by forgetting at least one of them.

This principle applies to statistical coding. I am working on a paper that uses four sets of variables from the 2016 DCAS: race, other demographic variables, socioeconomic variables, and variables measuring respondents' neighborhood experiences. I want to run three models for each outcome in the paper:

  1. A model with just race

  2. A model with race, demographic, and socioeconomic variables

  3. A model with race, demographic, socioeconomic, and neighborhood experience variables

I will be using these same covariates in a series of regression models. I could copy and paste the string of variable names into the formula each time, but that would violate the DRY principle. That wouldn't be a problem if I built the perfect set of models, triple-checked that all of the models corresponded with one another, and knew that I would never need to change anything.

What happens, however, when a reviewer points out a variable that should be in the models? If I am fortunate enough that the missing variable didn't sink the paper in the editor's eyes, I will need to add that variable to each model. I will return to my code after three months (or more) and try to remember to include the new variable in every model. That's error-prone.

I will instead define the covariates for each of the three models early in the code. For my example of three nested models, I would define the independent variables (not the outcome) in three character objects, m1, m2, and m3:

m1 <- 'dem.race'
m2 <- paste0(m1, '+ age + forborn + man + kids + married + ',
             'educ1 + educ2 + educ3 + educ5 + inc1 + inc3 + inc4')
m3 <- paste0(m2, '+ nhdyrs + nhdsize1 + nhdsize3')

The three objects correspond to the three models in the list above. Notice that I even used the DRY principle as I constructed the objects. Rather than repeat dem.race in m2, I pasted the value of m1--which equals dem.race--onto the front of m2. I did the same thing for the third model by pasting the contents of m2 onto the front of m3 (this works because the variables in my models are nested within one another).
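
Before fitting anything, it can help to check that the pasted strings parse and nest the way I expect. The following is a minimal sketch of that check, not part of my original workflow: it repeats the definitions above so it runs on its own, and the helper rhs_vars() is a hypothetical name I made up for illustration. as.formula() and all.vars() only read the text, so no data set is needed.

```r
# Covariate strings, repeated from above so this block runs on its own.
m1 <- 'dem.race'
m2 <- paste0(m1, '+ age + forborn + man + kids + married + ',
             'educ1 + educ2 + educ3 + educ5 + inc1 + inc3 + inc4')
m3 <- paste0(m2, '+ nhdyrs + nhdsize1 + nhdsize3')

# Turn the largest block into a formula and list the variables in it.
# This only parses the text; it does not look for the variables in any data.
f3 <- as.formula(paste('satisfied ~', m3))
all.vars(f3)

# Hypothetical helper: the variable names on the right-hand side of a block.
rhs_vars <- function(rhs) all.vars(as.formula(paste('~', rhs)))

# Each block should contain every variable from the block before it.
nested_12 <- all(rhs_vars(m1) %in% rhs_vars(m2))
nested_23 <- all(rhs_vars(m2) %in% rhs_vars(m3))
nested_12
nested_23
```

If either check comes back FALSE, a variable was dropped somewhere while pasting the blocks together.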

To use these objects in the models of outcomes, I will need to include them in a formula. R provides a helper function, as.formula(), that tells R to interpret the string I enter as a formula in a model function. Notice that I did not define the dependent variable in the objects above. That's because I want to substitute in the dependent variable for each set of analyses. In the first set of models, my dependent variable is called satisfied. I would estimate my models using the following three short lines of code (they could fit in a tweet!):

m1_sat <- lm(as.formula(paste('satisfied ~', m1)), data=dcas)
m2_sat <- lm(as.formula(paste('satisfied ~', m2)), data=dcas)
m3_sat <- lm(as.formula(paste('satisfied ~', m3)), data=dcas)

I pasted the dependent variable and tilde (~) to the front of the string containing the independent variables. I included all of that in the as.formula() function, and my model ran.
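
To see the pieces in isolation, here is a small sketch that anyone can run without my data. It assumes R's built-in mtcars data set as a stand-in for dcas, and the objects x1 and x2 are hypothetical substitutes for my covariate blocks. It compares a model built from pasted text with the same model typed out by hand.

```r
# Stand-in covariate blocks using columns of the built-in mtcars data
# (hypothetical substitutes for the dcas variables).
x1 <- 'wt'
x2 <- paste0(x1, ' + hp + qsec')

# Build the formula from text, then hand it to lm().
f2 <- as.formula(paste('mpg ~', x2))
class(f2)
fit_pasted <- lm(f2, data = mtcars)

# The same model with the formula typed out by hand.
fit_typed <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Both routes should give the same coefficient estimates.
same_fit <- all.equal(coef(fit_pasted), coef(fit_typed))
same_fit
```

The pasted version costs a little extra typing for one model, but the covariate list now lives in exactly one place.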

Doing this once isn't that exciting, or probably even worth the effort. But I have a second outcome that I want to model using the same covariates. Normally, I would have to re-type all of those covariates into a formula. With our new trick, I can just paste the new dependent variable into the formula and estimate the same models with a new outcome (and still only three short lines!):

m1_imp <- lm(as.formula(paste('improved ~', m1)), data=dcas)
m2_imp <- lm(as.formula(paste('improved ~', m2)), data=dcas)
m3_imp <- lm(as.formula(paste('improved ~', m3)), data=dcas)
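
With more than two outcomes, even these three lines start to repeat. One possible extension, which goes beyond what I did above, is to loop over the outcomes and the covariate blocks with lapply(). This is only a sketch under stated assumptions: it again uses the built-in mtcars data in place of dcas, and the outcome names and covariate blocks are stand-ins chosen for illustration.

```r
# Stand-ins using mtcars: two outcomes and three nested covariate blocks.
outcomes <- c('mpg', 'disp')
rhs <- c(m1 = 'wt',
         m2 = 'wt + hp',
         m3 = 'wt + hp + qsec')

# For each outcome, fit one model per covariate block.
# setNames() labels the outer list by outcome; lapply() keeps the
# names of rhs (m1, m2, m3) on the inner list.
fits <- lapply(setNames(outcomes, outcomes), function(y) {
  lapply(rhs, function(x) lm(as.formula(paste(y, '~', x)), data = mtcars))
})

# fits is a list of outcomes, each holding a list of three models.
names(fits)
names(fits$mpg)
summary(fits$disp$m3)
```

Adding an outcome then means adding one name to outcomes, and adding a covariate means editing one string in rhs.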

Now I should have output for two sets of three models with the same covariates to analyze in my paper! I hope this helps. Speaking of Twitter, feel free to contact me there if you have questions (@mike_bader).

