Zipping Up R

Because I am a ~~masochist~~perfectionist, I spent the better part of my day making my R code more elegant. I had already figured out how to do what I needed with a simple loop, but I wanted to write the code the right way. I always tell myself that the time I spend ~~torturing myself~~writing the right code will help me down the line because I will know how to do it next time. Instead, I inevitably forget and spend the same four hours doing the same thing again. As a gift to my future self, I decided to write down what I learned because it will likely come up again (you're welcome, future Mike!).

My basic problem comes from the desire to match two lists item-by-item. Python contains a function, zip(), that does this. I want to figure out how to zip in R.

It turns out that you can do this in R, but it took quite a while to find out how. I had a list of data frames that each contained the same variables under the same names (the variable names are all recorded in a vector called racevars¹). I wanted to loop through the list and append a suffix containing the year to each variable name.²

> dtas <- list(trt10_re, trt11_re, trt12_re, trt13_re, trt14_re, trt15_re)
> racedtas <- lapply(dtas, select.racevars) %>%
      lapply(id.quads)

I want, for example, the variable named pnhw to be named pnhw10 in the first of the data frames, to be renamed pnhw11 in the second and so on. I first needed to create a list of character vectors containing all of the names for each data frame. I do that by using the lapply() function to paste the year to the appropriate variables.

> namelist <- lapply(10:15,
      function(x) c(nonracevars, paste0(c(racevars, 'quad'), x)))
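
To make the result of that call concrete, here is a minimal sketch using made-up, shortened stand-ins for nonracevars and racevars (the real vectors are longer, and only three years are used here):

```r
# Hypothetical, shortened stand-ins for the real vectors
nonracevars <- c("GISJOIN", "STATE")
racevars <- c("nhw", "pnhw")

# One character vector of column names per year
namelist <- lapply(10:12,
    function(x) c(nonracevars, paste0(c(racevars, "quad"), x)))

namelist[[1]]
# [1] "GISJOIN" "STATE"   "nhw10"   "pnhw10"  "quad10"
```

Each element of namelist lines up with one data frame in the list, which is what makes the item-by-item matching possible.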

Now comes the magic. It turns out the function I was looking for was mapply(). The key that I kept missing, however, was the SIMPLIFY argument, which R sets to TRUE by default. This means that R will, if it can, collapse the result into a vector, matrix, or array rather than return a list. (I include lots of output below to show what each command produces, even though it makes the post a little harder to read.) When I attempted the following, R returned a 16 × 6 matrix in which each cell held the 975 observations of one variable from one dataset, and the rows were labeled with the variable names carrying the 10 suffix:

> racedtas.named <- mapply(setNames, racedtas, namelist)

> class(racedtas.named)
[1] "matrix"

> dim(racedtas.named)
[1] 16  6

> racedtas.named
        [,1]          [,2]          [,3]          [,4]          [,5]         
GISJOIN Character,975 Character,975 Character,975 Character,975 Character,975
STATE   factor,975    factor,975    factor,975    factor,975    factor,975   
COUNTY  factor,975    factor,975    factor,975    factor,975    factor,975   
nhw10   Integer,975   Integer,975   Integer,975   Integer,975   Integer,975  
nhb10   Integer,975   Integer,975   Integer,975   Integer,975   Integer,975  
api10   Integer,975   Integer,975   Integer,975   Integer,975   Integer,975  
hsp10   Integer,975   Integer,975   Integer,975   Integer,975   Integer,975  
oth10   Integer,975   Integer,975   Integer,975   Integer,975   Integer,975  
two10   Integer,975   Integer,975   Integer,975   Integer,975   Integer,975  
pnhw10  Numeric,975   Numeric,975   Numeric,975   Numeric,975   Numeric,975  
pnhb10  Numeric,975   Numeric,975   Numeric,975   Numeric,975   Numeric,975  
papi10  Numeric,975   Numeric,975   Numeric,975   Numeric,975   Numeric,975  
phsp10  Numeric,975   Numeric,975   Numeric,975   Numeric,975   Numeric,975  
poth10  Numeric,975   Numeric,975   Numeric,975   Numeric,975   Numeric,975  
ptwo10  Numeric,975   Numeric,975   Numeric,975   Numeric,975   Numeric,975  
quad10  Logical,975   Logical,975   Logical,975   Logical,975   Logical,975  
        [,6]         
GISJOIN Character,975
STATE   factor,975   
COUNTY  factor,975   
nhw10   Integer,975  
nhb10   Integer,975  
api10   Integer,975  
hsp10   Integer,975  
oth10   Integer,975  
two10   Integer,975  
pnhw10  Numeric,975  
pnhb10  Numeric,975  
papi10  Numeric,975  
phsp10  Numeric,975  
poth10  Numeric,975  
ptwo10  Numeric,975  
quad10  Logical,975

Not what I was looking for. Since each data frame contained 16 columns, R recognized that it could "simplify" the structure into a matrix with 16 rows, one per variable, and six columns, one per data frame. It then labeled the rows with the 16 names from the first data frame only, which is why every row carries the 10 suffix.
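
The same behavior can be reproduced on a small scale. This is only a sketch with made-up data frames (dfs and newnames are illustrative names, not objects from my analysis):

```r
# Two toy data frames with the same three columns (made-up data)
dfs <- list(data.frame(id = 1:2, a = c(0.1, 0.2), b = c(TRUE, FALSE)),
            data.frame(id = 1:2, a = c(0.3, 0.4), b = c(FALSE, TRUE)))
newnames <- list(c("id", "a10", "b10"), c("id", "a11", "b11"))

# SIMPLIFY defaults to TRUE, so the data frames are collapsed
simplified <- mapply(setNames, dfs, newnames)

is.matrix(simplified)   # TRUE: no longer a list of data frames
dim(simplified)         # 3 rows (one per variable), 2 columns (one per data frame)
rownames(simplified)    # "id" "a10" "b10": names from the first data frame only
```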

I found out that when I did the same thing, but set the SIMPLIFY argument to FALSE, I got what I wanted:

> racedtas.named <- mapply(setNames, racedtas, namelist, SIMPLIFY = FALSE)

## Check the class to see if it returns a list
> class(racedtas.named)
[1] "list"

## Check to see if each item in the list is a dataframe
> lapply(racedtas.named, class)
[[1]]
[1] "data.frame"

[[2]]
[1] "data.frame"

[[3]]
[1] "data.frame"

[[4]]
[1] "data.frame"

[[5]]
[1] "data.frame"

[[6]]
[1] "data.frame"

## Check the names of each dataframe to see if they contain the year suffix
> lapply(racedtas.named, names)
[[1]]
 [1] "GISJOIN" "STATE"   "COUNTY"  "nhw10"   "nhb10"   "api10"   "hsp10"   "oth10"  
 [9] "two10"   "pnhw10"  "pnhb10"  "papi10"  "phsp10"  "poth10"  "ptwo10"  "quad10" 

<output omitted>

[[6]]
 [1] "GISJOIN" "STATE"   "COUNTY"  "nhw15"   "nhb15"   "api15"   "hsp15"   "oth15"  
 [9] "two15"   "pnhw15"  "pnhb15"  "papi15"  "phsp15"  "poth15"  "ptwo15"  "quad15"

I now had a list of six data frames, each with the year appended as a suffix to the names of the racial composition variables.
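
For future reference, here is a small sketch of the zip-style renaming on made-up data frames (dfs and newnames are illustrative names). It also shows Map(), which base R provides as a wrapper around mapply() that never simplifies, so it may be an easier way to avoid the SIMPLIFY trap:

```r
# Two toy data frames with the same three columns (made-up data)
dfs <- list(data.frame(id = 1:2, a = c(0.1, 0.2), b = c(TRUE, FALSE)),
            data.frame(id = 1:2, a = c(0.3, 0.4), b = c(FALSE, TRUE)))
newnames <- list(c("id", "a10", "b10"), c("id", "a11", "b11"))

# Pair each data frame with its own vector of names
zipped <- mapply(setNames, dfs, newnames, SIMPLIFY = FALSE)

# Map() applies the function the same way and always returns a list
zipped2 <- Map(setNames, dfs, newnames)

lapply(zipped, names)       # each data frame now carries its own year suffix
identical(zipped, zipped2)  # the two approaches should agree
```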

With this list, I could easily merge all six data frames into a single wide dataset containing all of the variables for each year using the reduce() function from the purrr package (which is loaded as part of the tidyverse) together with left_join() from dplyr.

> final.dta <- reduce(racedtas.named, left_join, by='GISJOIN')
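
The folding idea behind reduce() can be sketched without any packages. The example below swaps in base R's Reduce() and merge() for purrr's reduce() and dplyr's left_join(), and it uses made-up data (yearly and its columns are illustrative, not my real variables):

```r
# Three toy yearly data frames sharing a key column (made-up data)
yearly <- list(data.frame(id = c("t1", "t2"), a10 = c(1, 2)),
               data.frame(id = c("t1", "t2"), a11 = c(3, 4)),
               data.frame(id = c("t1", "t2"), a12 = c(5, 6)))

# Fold the list into one wide data frame: each step joins the next
# year onto the running result, keeping every row of the left side
wide <- Reduce(function(x, y) merge(x, y, by = "id", all.x = TRUE), yearly)

names(wide)
# [1] "id"  "a10" "a11" "a12"
```

One thing to watch for: columns that appear in several data frames but are not part of the join key get suffixes such as .x and .y in the result, so it may be worth joining on all shared identifier columns or dropping the repeats first.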

I am all set to map and analyze the data on racial composition from 2010 to 2015!

¹ In addition, I created a variable called quad that records whether a tract was a "quadrivial neighborhood."
² select.racevars and id.quads are custom functions that I wrote, respectively, to select only the race variables and to identify quadrivial neighborhoods.

R data-management research

mike bader
