Because I am a perfectionist, I spent the better part of my day making my R code more elegant. I had already figured out how to do what I needed with a simple loop, but I wanted to write the code the right way. I always tell myself that the time I spend writing the right code will help me down the line because I will know how to do it next time. In practice, I inevitably forget and spend the same four hours doing the same thing again. As a gift to my future self, I decided to write down what I learned because it will likely come up again (you're welcome, future Mike!).
My basic problem comes from the desire to match two lists item by item. Python has a function, zip(), that does this. I wanted to figure out how to zip in R.
It turns out that you can do this in R, but it took quite a while to find out how. I had a list of data frames that each contained the same variables with the same names (the variable names are all recorded in a vector called racevars). I wanted to loop through the list and append a suffix containing the year to each variable name.
> dtas <- list(trt10_re, trt11_re, trt12_re, trt13_re, trt14_re, trt15_re)
> racedtas <- lapply(dtas, select.racevars) %>%
lapply(id.quads)
I wanted, for example, the variable named pnhw to be renamed pnhw10 in the first of the data frames, pnhw11 in the second, and so on. I first needed to create a list of character vectors containing all of the names for each data frame. I did that by using the lapply() function to paste the year onto the appropriate variables.
> namelist <- lapply(10:15,
function(x) c(nonracevars, paste0(c(racevars, 'quad'), x)))
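To make the shape of namelist concrete, here is a minimal sketch of the same call on a smaller scale. The contents of nonracevars and racevars below are stand-ins I made up for illustration (the real vectors are longer), and I only use three years.

```r
# Hypothetical stand-ins for the post's nonracevars and racevars vectors
nonracevars <- c("GISJOIN", "STATE", "COUNTY")
racevars <- c("nhw", "pnhw")

# One character vector of names per year; paste0() appends the year
# to every race variable and to 'quad', and leaves the other names alone
namelist <- lapply(10:12,
                   function(x) c(nonracevars, paste0(c(racevars, "quad"), x)))

length(namelist)   # 3, one element per year
namelist[[1]]      # "GISJOIN" "STATE" "COUNTY" "nhw10" "pnhw10" "quad10"
```

Each element of the list lines up with one data frame, which is what makes the item-by-item matching in the next step possible.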
Now comes the magic. It turns out the function I was looking for was mapply(). The key that I kept missing, however, was the SIMPLIFY argument. R sets the default for SIMPLIFY to TRUE. This means that R will, if it can, collapse the result into a vector, matrix, or array instead of returning a list. That means that when I attempted the following, R returned a 16 × 6 matrix where each cell held the 975 observations of one variable for one dataset and the row names were the variable names with the 10 suffix (I included lots of output to show what each command produced for clarity of the code, even though it makes the post a little harder to read):
> racedtas.named <- mapply(setNames, racedtas, namelist)
> class(racedtas.named)
[1] "matrix"
> dim(racedtas.named)
[1] 16 6
> racedtas.named
[,1] [,2] [,3] [,4] [,5]
GISJOIN Character,975 Character,975 Character,975 Character,975 Character,975
STATE factor,975 factor,975 factor,975 factor,975 factor,975
COUNTY factor,975 factor,975 factor,975 factor,975 factor,975
nhw10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
nhb10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
api10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
hsp10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
oth10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
two10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
pnhw10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
pnhb10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
papi10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
phsp10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
poth10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
ptwo10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
quad10 Logical,975 Logical,975 Logical,975 Logical,975 Logical,975
[,6]
GISJOIN Character,975
STATE factor,975
COUNTY factor,975
nhw10 Integer,975
nhb10 Integer,975
api10 Integer,975
hsp10 Integer,975
oth10 Integer,975
two10 Integer,975
pnhw10 Numeric,975
pnhb10 Numeric,975
papi10 Numeric,975
phsp10 Numeric,975
poth10 Numeric,975
ptwo10 Numeric,975
quad10 Logical,975
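Not what I was looking for. Since each data frame contained 16 variables (and a data frame is, underneath, a list of its columns), every result had length 16, so R recognized that it could "simplify" the structure into a matrix with 16 rows. Although setNames had renamed every data frame, the matrix took its row names from the first result only, which is why every row carries the 10 suffix.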
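The same behavior shows up on a much smaller scale. This is a sketch with two made-up data frames of three columns each (dfs and newnames are hypothetical names, not objects from my analysis):

```r
# Two toy data frames with the same three columns (hypothetical data)
dfs <- list(
  data.frame(id = 1:2, a = c(0.1, 0.2), b = c(TRUE, FALSE)),
  data.frame(id = 1:2, a = c(0.3, 0.4), b = c(FALSE, TRUE))
)
newnames <- list(c("id", "a10", "b10"), c("id", "a11", "b11"))

# With the default SIMPLIFY = TRUE, every result has length 3
# (one element per column), so mapply() binds them into a 3 x 2 matrix
simplified <- mapply(setNames, dfs, newnames)

is.matrix(simplified)   # TRUE
dim(simplified)         # 3 2
rownames(simplified)    # "id" "a10" "b10", taken from the first result only
```

The names for the second data frame are still applied inside each call, but they are lost once the results are bound into the matrix.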
I found out that when I did the same thing but set the SIMPLIFY argument to FALSE, I got what I wanted:
> racedtas.named <- mapply(setNames, racedtas, namelist, SIMPLIFY = FALSE)
## Check the class to see if it returns a list
> class(racedtas.named)
[1] "list"
## Check to see if each item in the list is a dataframe
> lapply(racedtas.named, class)
[[1]]
[1] "data.frame"
[[2]]
[1] "data.frame"
[[3]]
[1] "data.frame"
[[4]]
[1] "data.frame"
[[5]]
[1] "data.frame"
[[6]]
[1] "data.frame"
## Check the names of each dataframe to see if they contain the year suffix
> lapply(racedtas.named, names)
[[1]]
[1] "GISJOIN" "STATE" "COUNTY" "nhw10" "nhb10" "api10" "hsp10" "oth10"
[9] "two10" "pnhw10" "pnhb10" "papi10" "phsp10" "poth10" "ptwo10" "quad10"
<output omitted>
[[6]]
[1] "GISJOIN" "STATE" "COUNTY" "nhw15" "nhb15" "api15" "hsp15" "oth15"
[9] "two15" "pnhw15" "pnhb15" "papi15" "phsp15" "poth15" "ptwo15" "quad15"
I had a list of six dataframes, each having the year appended as a suffix on the variable names of the racial composition variables.
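For future reference, base R also has Map(), which is a thin wrapper around mapply() with SIMPLIFY = FALSE, so it always returns a list. A minimal sketch on made-up data (dfs and newnames are hypothetical stand-ins for racedtas and namelist):

```r
# Toy stand-ins for the list of data frames and the list of name vectors
dfs <- list(
  data.frame(id = 1:2, a = c(0.1, 0.2)),
  data.frame(id = 1:2, a = c(0.3, 0.4))
)
newnames <- list(c("id", "a10"), c("id", "a11"))

# Both calls pair the i-th data frame with the i-th name vector
via_mapply <- mapply(setNames, dfs, newnames, SIMPLIFY = FALSE)
via_map    <- Map(setNames, dfs, newnames)

identical(via_mapply, via_map)   # TRUE
names(via_map[[2]])              # "id" "a11"
```

I have not rerun my own analysis with Map(), but it should be a drop-in replacement for the mapply() call above and removes the chance of forgetting the SIMPLIFY argument.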
With this list, I could easily merge all six data frames into a single wide dataset containing all of the variables for each year using the reduce() function from the purrr package (which is part of the tidyverse) together with left_join() from dplyr.
> final.dta <- reduce(racedtas.named, left_join, by='GISJOIN')
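To show what this fold does without needing any packages, here is a base R sketch of the same idea on made-up data. Note that it swaps in Reduce() and merge(..., all.x = TRUE) for purrr's reduce() and dplyr's left_join(), so it illustrates the pattern and is not the code I ran.

```r
# Three toy yearly data frames sharing the key column GISJOIN (hypothetical data)
yearly <- list(
  data.frame(GISJOIN = c("G1", "G2"), pnhw10 = c(0.5, 0.6)),
  data.frame(GISJOIN = c("G1", "G2"), pnhw11 = c(0.4, 0.7)),
  data.frame(GISJOIN = c("G1", "G2"), pnhw12 = c(0.3, 0.8))
)

# Fold the list from left to right: join 1 with 2, then that result with 3.
# all.x = TRUE keeps every row of the left table, like a left join.
wide <- Reduce(function(x, y) merge(x, y, by = "GISJOIN", all.x = TRUE), yearly)

names(wide)   # "GISJOIN" "pnhw10" "pnhw11" "pnhw12"
nrow(wide)    # 2
```

One thing to watch for: if the data frames share other columns that are not in by (for example STATE and COUNTY), both merge() and left_join() keep a copy from each table and add suffixes to tell them apart, so it may be worth adding those columns to by as well.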
I am all set to map and analyze the data on racial composition from 2010 to 2015!