mikebader.net - Entries for the tag Rhttps://mikebader.net/blog/tags/R/The last entries tagged with Ren-usZinniaSat, 01 Jun 2019 02:10:36 +0000Zipping Up R
https://mikebader.net/blog/entries/2019/05/31/zipping-r/<p>Because I am a <del>masochist</del>perfectionist, I spent the better part my day making my R code more elegant. I figured out what to do with a simple loop, but wanted to write the code the <em>right</em> way. I always tell myself that the time I spend <del>torturing myself</del>writing the right code will help me down the line so I know how to do it next time. I will inevitably forget and spend the same four hours doing the same thing again. As a gift to my future self, I decided that I would write down what I learned because it will likely come up again (you're welcome, future Mike!).</p>
<p>My basic problem comes from the desire to match two lists item-by-item. Python contains a function, <code>zip()</code>, that does this. I want to figure out how to zip in R.</p>
<p>Because I am a <del>masochist</del>perfectionist, I spent the better part my day making my R code more elegant. I figured out what to do with a simple loop, but wanted to write the code the <em>right</em> way. I always tell myself that the time I spend <del>torturing myself</del>writing the right code will help me down the line so I know how to do it next time. I will inevitably forget and spend the same four hours doing the same thing again. As a gift to my future self, I decided that I would write down what I learned because it will likely come up again (you're welcome, future Mike!).</p>
<p>My basic problem comes from the desire to match two lists item-by-item. Python contains a function, <code>zip()</code>, that does this. I want to figure out how to zip in R.</p>
<p>It turns out that you can do this in R, but it took quite a while to find out how. I had a list of data frames that each contained the same variables, each data frame containing the same names (the variable names are all recorded in a vector called <code>racevars</code><a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>). I wanted to loop through to append a suffix to each variable name that contained the year.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a></p>
<pre><code>> dtas <- list(trt10_re, trt11_re, trt12_re, trt13_re, trt14_re, trt15_re)
> racedtas <- lapply(dtas, select.racevars) %>%
lapply(id.quads)</code></pre>
<p>I want, for example, the variable named <code>pnhw</code> to be named <code>pnhw10</code> in the first of the data frames, to be renamed <code>pnhw11</code> in the second and so on. I first needed to create a <em>list</em> of character vectors containing <em>all</em> of the names for each data frame. I do that by using the <code>lapply()</code> function to paste the year to the appropriate variables.</p>
<pre><code>> namelist <- lapply(10:15,
function(x) c(nonracevars, paste0(c(racevars, 'quad'), x)))</code></pre>
<p>Now comes the magic. It turns out the function I was looking for was <code>mapply()</code>. The key that I continued to miss, however, was the argument <code>SIMPLIFY</code>. R sets the default for the <code>SIMPLIFY</code> argument to be <code>TRUE</code>. This means that R will, if it can, reduce the number of dimensions of the final object. That means that when I attempted the following, R returned a 16 <span class="math inline">×</span> 6 matrix where each cell was a list of the 975 observations for each dataset and the rows were the variables names with the <code>10</code> suffix (I included lots of output to show what each command produced for clarity of the code, even though it makes the post a little harder to read):</p>
<pre><code>> racedtas.named <- mapply(setNames, racedtas, namelist)
> class(racedtas.named)
[1] "matrix"
> dim(racedtas.named)
[1] 16 6
> racedtas.named
[,1] [,2] [,3] [,4] [,5]
GISJOIN Character,975 Character,975 Character,975 Character,975 Character,975
STATE factor,975 factor,975 factor,975 factor,975 factor,975
COUNTY factor,975 factor,975 factor,975 factor,975 factor,975
nhw10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
nhb10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
api10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
hsp10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
oth10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
two10 Integer,975 Integer,975 Integer,975 Integer,975 Integer,975
pnhw10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
pnhb10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
papi10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
phsp10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
poth10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
ptwo10 Numeric,975 Numeric,975 Numeric,975 Numeric,975 Numeric,975
quad10 Logical,975 Logical,975 Logical,975 Logical,975 Logical,975
[,6]
GISJOIN Character,975
STATE factor,975
COUNTY factor,975
nhw10 Integer,975
nhb10 Integer,975
api10 Integer,975
hsp10 Integer,975
oth10 Integer,975
two10 Integer,975
pnhw10 Numeric,975
pnhb10 Numeric,975
papi10 Numeric,975
phsp10 Numeric,975
poth10 Numeric,975
ptwo10 Numeric,975
quad10 Logical,975 </code></pre>
<p><em>Not</em> what I was looking for. Since each dataframe contained 16 rows, R recognized that it could "simplify" the structure into a matrix with 16 rows. Following our command to <code>setNames</code> it took the first 16 to set the names of the rows.</p>
<p>I found out that when I did the same thing, but set the <code>SIMPLIFY</code> argument to <code>FALSE</code>, I got what I wanted:</p>
<pre><code>> racedtas.named <- mapply(setNames, racedtas, namelist, SIMPLIFY = FALSE)
## Check the class to see if it returns a list
> class(racedtas.named)
[1] "list"
## Check to see if each item in the list is a dataframe
> lapply(racedtas.named, class)
[[1]]
[1] "data.frame"
[[2]]
[1] "data.frame"
[[3]]
[1] "data.frame"
[[4]]
[1] "data.frame"
[[5]]
[1] "data.frame"
[[6]]
[1] "data.frame"
## Check the names of each dataframe to see if they contain the year suffix
> lapply(racedtas.named, names)
[[1]]
[1] "GISJOIN" "STATE" "COUNTY" "nhw10" "nhb10" "api10" "hsp10" "oth10"
[9] "two10" "pnhw10" "pnhb10" "papi10" "phsp10" "poth10" "ptwo10" "quad10"
<output omitted>
[[6]]
[1] "GISJOIN" "STATE" "COUNTY" "nhw15" "nhb15" "api15" "hsp15" "oth15"
[9] "two15" "pnhw15" "pnhb15" "papi15" "phsp15" "poth15" "ptwo15" "quad15" </code></pre>
<p>I had a list of six dataframes, each having the year appended as a suffix on the variable names of the racial composition variables.</p>
<p>With this list, I could easily merge all six dataframes into a single wide dataset containing all of the variables for each year using the <code>reduce()</code> command from the <code>purrr</code> library (which comes packaged in the <code>tidyverse</code> library).</p>
<pre><code>> final.dta <- reduce(racedtas.named, left_join, by='GISJOIN')</code></pre>
<p>I am all set to go to map and to analyze the data on racial composition from 2010 to 2015!</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>In addition, I created a variable called <code>quad</code> that contains whether a tract was a "<a href="https://www.sociologicalscience.com/articles-v3-8-135/">quadrivial neighborhood</a>."<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p><code>select.racevars</code> and <code>id.quads</code> are custom functions that I wrote, respectively, to select only the race variables and to list quadrivial neighborhoods.<a href="#fnref2">↩</a></p></li>
</ol>
</div>
michaeldmbader@gmail.com (mikebader)Fri, 31 May 2019 22:03:49 +0000https://mikebader.net/blog/entries/2019/05/31/zipping-r/MethodsMultiple Models with Same Independent Variables in R
https://mikebader.net/blog/entries/2018/11/02/multiple-models-same-independent-variables-r/<p><strong>tl;dr</strong>: paste outcome and dependent variables into R's <code>as.formula()</code> function to avoid typing the same models out repetitively.</p>
<p>The "don't repeat yourself" principle in programming tries to eliminate errors by eliminating repetitious code. When code repeats changing the code in one place requires making the same change everywhere the code repeats. Relying on memory to find all of the places where code should be replicated increases the chances that you introduce bugs by forgetting to change at least one place where the code repeats. </p>
<p>This principle applies to statistical coding. I am working on a paper that uses three sets of variables from the <a href="/projects/dc-area-survey/">2016 DCAS</a>: race, other demographic variables, socioeconomic variables, and variables measuring neighborhood experiences of respondents. I want to run three models for each outcome in the paper. </p>
<ol>
<li>
<p>A model with just race</p>
</li>
<li>
<p>A model with race, demographic, and socioeconomic variables</p>
</li>
<li>
<p>A model with race, demographic, socioeconomic, and neighborhood experience variables</p>
</li>
</ol>
<p>I will be using these same covariates in a series of four regression models. I could copy-and-paste the string of variable names in the formula each time. But that would violate the DRY principle. That wouldn't be a problem if I made the perfect set of models, triple checked that all of the models corresponded with one another, that I will never need to change anything. </p>
<p>What, however, happens when a reviewer points out a variable that should be in the models? If I was fortunate that the lack of the variable didn't sink the paper in the editor's eyes, I will need to add that variable into each model. I will return to my code after three months (or more) and try to remember to include the new variable in every model. That's no good and prone to error.</p>
<p>I will instead define the covariates in each of the three models early in the code. For my example case of three sets of variables, I would define the <em>independent variables</em> (not the outcome) in three character objects, <code>m1</code>, <code>m2</code>, and <code>m3</code>:</p>
<div class="codehilite"><pre><span></span>m1 <span class="o"><-</span> <span class="s">'dem.race'</span>
m2 <span class="o"><-</span> <span class="kp">paste0</span><span class="p">(</span>m1<span class="p">,</span> <span class="s">'+ age + forborn + man + kids + married + '</span><span class="p">,</span>
<span class="s">'educ1 + educ2 + educ3 + educ5 + inc1 + inc3 + inc4'</span><span class="p">)</span>
m3 <span class="o"><-</span> <span class="kp">paste0</span><span class="p">(</span>m2<span class="p">,</span> <span class="s">'+ nhdyrs + nhdsize1 + nhdsize3'</span><span class="p">)</span>
</pre></div>
<p>The three sets of variables correspond to the variables in the list of models above. Notice that I even used the DRY principle as I constructed the objects. Rather than repeat <code>dem.race</code> in <code>m2</code>, I pasted the value of <code>m1</code>--which equals <code>dem.race</code>--to the front of <code>m2</code>. I repeated the same thing for the third model by pasting the contents of <code>m2</code> into the front of <code>m3</code> (this works because the variables in my models are nested within one another). </p>
<p>To use these objects in the models of outcomes, I will need to include them in a <em>formula</em>. R provides a helper function, <code>as.formula()</code> to tell R that the string I enter should be interpreted as a formula in a model function. Notice that I did not define the dependent variable in the objects above. That's because I want to substitute in the dependent variable for each set of analyses. In the first set of models, my dependent variable is called <code>satisfied</code>. I would estimate my models using the following 116 characters of code (it could fit in a tweet!):</p>
<div class="codehilite"><pre><span></span>m1_sat <span class="o"><-</span> lm<span class="p">(</span><span class="kp">paste</span><span class="p">(</span><span class="s">'satisfied ~'</span><span class="p">,</span> m1<span class="p">),</span> data<span class="o">=</span>dcas<span class="p">)</span>
m2_sat <span class="o"><-</span> lm<span class="p">(</span><span class="kp">paste</span><span class="p">(</span><span class="s">'satisfied ~'</span><span class="p">,</span> m2<span class="p">),</span> data<span class="o">=</span>dcas<span class="p">)</span>
m3_sat <span class="o"><-</span> lm<span class="p">(</span><span class="kp">paste</span><span class="p">(</span><span class="s">'satisfied ~'</span><span class="p">,</span> m3<span class="p">),</span> data<span class="o">=</span>dcas<span class="p">)</span>
</pre></div>
<p>I pasted the dependent variable and tilde (<code>~</code>) to the front of the string containing the independent variables. I included all of that in the <code>as.formula()</code> function, and my model ran. </p>
<p>Doing this once isn't that exciting or probably even worth the effort. But I have a second outcome that I want to model using the same covariates. Normally, I would have to re-type all of those covariates into a formula. With our new trick, I can just paste the new dependent variable into the formula and estimate the same models with a new outcome (and still only 116 characters!):</p>
<div class="codehilite"><pre><span></span>m1_imp <span class="o"><-</span> lm<span class="p">(</span><span class="kp">paste</span><span class="p">(</span><span class="s">'improved ~'</span><span class="p">,</span> m1<span class="p">),</span> data<span class="o">=</span>dcas<span class="p">)</span>
m2_imp <span class="o"><-</span> lm<span class="p">(</span><span class="kp">paste</span><span class="p">(</span><span class="s">'improved ~'</span><span class="p">,</span> m2<span class="p">),</span> data<span class="o">=</span>dcas<span class="p">)</span>
m3_imp <span class="o"><-</span> lm<span class="p">(</span><span class="kp">paste</span><span class="p">(</span><span class="s">'improved ~'</span><span class="p">,</span> m3<span class="p">),</span> data<span class="o">=</span>dcas<span class="p">)</span>
</pre></div>
<p>Now I should have output for two sets of three models with the same covariates to analyze in my paper! I hope this helps. Speaking of Twitter, feel free to contact me there if you have questions (<a href="https://twitter.com/mike_bader">@mike_bader</a>).</p>
michaeldmbader@gmail.com (mikebader)Fri, 02 Nov 2018 21:24:02 +0000https://mikebader.net/blog/entries/2018/11/02/multiple-models-same-independent-variables-r/Methodsshifting values in R
https://mikebader.net/blog/entries/2018/03/12/shifting-values-r/
<p>R presents more of a challenge to Stata <a href="http://theannexpodcast.com/rant/">on many fronts</a>, one of which is basic data management. </p>
<p>I often find myself calculating the value of one observation given the value of an adjacent value. For example, to assess a lagged effect, I would take the value of the preceding interval. Stata makes this really easy, R not so much.</p>
<p>Here's what we would do in Stata: </p>
<div class="codehilite"><pre><span></span><span class="k">set</span> obs <span class="m">1000</span><span class="k"></span>
<span class="k">gen</span> i = _n<span class="k"></span>
<span class="k">gen</span> val = <span class="nf">round(runiform(</span>)*<span class="m">10</span>)<span class="k"></span>
<span class="k">gen</span> lag = val[_n<span class="m">-1</span>]
</pre></div>
<p>The last command throws the warning, <code>(1 missing value generated)</code> because the first observation has no lagged observation. The first 10 observations look like this: </p>
<div class="codehilite"><pre><span></span>.<span class="k"> list in</span> <span class="m">1</span>/<span class="m">10</span>
+----------------+
| i val lag |
|----------------|
<span class="m">1</span>. | <span class="m">1</span> <span class="m">3</span> . |
<span class="m">2</span>. | <span class="m">2</span> <span class="m">0</span> <span class="m">3</span> |
<span class="m">3</span>. | <span class="m">3</span> <span class="m">6</span> <span class="m">0</span> |
<span class="m">4</span>. | <span class="m">4</span> <span class="m">4</span> <span class="m">6</span> |
<span class="m">5</span>. | <span class="m">5</span> <span class="m">0</span> <span class="m">4</span> |
|----------------|
<span class="m">6</span>. | <span class="m">6</span> <span class="m">9</span> <span class="m">0</span> |
<span class="m">7</span>. | <span class="m">7</span> <span class="m">6</span> <span class="m">9</span> |
<span class="m">8</span>. | <span class="m">8</span> <span class="m">5</span> <span class="m">6</span> |
<span class="m">9</span>. | <span class="m">9</span> <span class="m">8</span> <span class="m">5</span> |
<span class="m">10</span>. | <span class="m">10</span> <span class="m">5</span> <span class="m">8</span> |
+----------------+
</pre></div>
<p>After my head left an indentation on my desk, here's what we would do in R. But now I can't remember where. The solution is to use the <code>head()</code> command. Anyone who has completed an R tutorial will know this command as a way to view a subset of your data. It turns out that <code>head()</code> has a second optional parameter that allows you to indicate how many observations you want to view. We can use that to obtain the first <em>n-1</em> values and assign them to rows 2 through <em>N</em>. </p>
<div class="codehilite"><pre><span></span>dt <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">(</span>i<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">1000</span><span class="p">),</span>val<span class="o">=</span><span class="kp">round</span><span class="p">(</span>runif<span class="p">(</span><span class="m">1000</span><span class="p">)</span><span class="o">*</span><span class="m">10</span><span class="p">))</span>
dt<span class="o">$</span>lag <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="kp">head</span><span class="p">(</span>dt<span class="o">$</span>val<span class="p">,</span><span class="kp">nrow</span><span class="p">(</span>dt<span class="p">)</span><span class="m">-1</span><span class="p">))</span>
</pre></div>
<p>Remember how Stata told us that it created one missing value? We need to do that ourselves in R. In the second line, we combine that one <code>NA</code> value with the values of <code>dt$lag</code> for values 2 through <em>N</em>. </p>
<p>And we end up with what we want (using <code>head()</code> for its typical purpose):</p>
<div class="codehilite"><pre><span></span><span class="o">></span> <span class="kp">head</span><span class="p">(</span>dt<span class="p">)</span>
i val lag
<span class="m">1</span> <span class="m">1</span> <span class="m">7</span> <span class="kc">NA</span>
<span class="m">2</span> <span class="m">2</span> <span class="m">7</span> <span class="m">7</span>
<span class="m">3</span> <span class="m">3</span> <span class="m">3</span> <span class="m">7</span>
<span class="m">4</span> <span class="m">4</span> <span class="m">2</span> <span class="m">3</span>
<span class="m">5</span> <span class="m">5</span> <span class="m">3</span> <span class="m">2</span>
<span class="m">6</span> <span class="m">6</span> <span class="m">9</span> <span class="m">3</span>
</pre></div>
michaeldmbader@gmail.com (mikebader)Mon, 12 Mar 2018 17:00:16 +0000https://mikebader.net/blog/entries/2018/03/12/shifting-values-r/MethodsImporting Text Files with Variable Names to R
https://mikebader.net/blog/entries/2014/11/26/importing-text-files-variable-names-r/
<p>I am attempting to learn <a href="http://www.r-project.org/">R</a>. This is either a great thing or a terrible, terrible mistake while on the tenure clock. But, all the cool kids are doing -- so even though they might also jump off a bridge, I'm going to jump into R. </p>
<p>The hardest part so far is doing things that now come as second nature to me in Stata. Although R's tools are much better in the long-run, learning what types of objects different functions return and such ends up being a very high learning curve. </p>
<p>A while back (all posts are a while back now), I wrote <a href="http://mikebader.net/blog/2010/jul/23/importing-text-files-variable-names-stata/">a post</a> describing how to import data with variable labels in Stata. The idea was that I could keep the variable labels with the data in an text file so that I could always figure out what the variables were. It doesn't require an extra codebook or additional files that could be lost. </p>
<p>I have now replicated that script for R, and it is below: </p>
<pre><code>data.file <- file('R10401937_SL160_withLabels.txt',open='r')
lines <- readLines(data.file)
data.labels <- as.vector(strsplit(lines[1],'\t')[[1]])
data.names <- strsplit(lines[2],'\t')[[1]]
tmp <- tempfile()
writeLines(lines[-2:-1],tmp)
places <- read.delim(tmp,header=FALSE,col.names=data.names)
names(data.labels) <- data.names
</code></pre>
<p>Note the last line: this line turns the object <code>data.labels</code> into a look-up table. I can now type: </p>
<pre><code>data.labels['p003001']
</code></pre>
<p>And it will return: <code>"\"Total Population\""</code>. This is even more helpful than Stata's variable labeling convention.</p>
<p><strong>UPDATE MAR 2, 2016:</strong> After learning a little bit more R, I realized that I could simplify what I had. Below is the code that I managed to make based on the data structure from <a href="http://socialexplorer.com">Social Explorer</a>:</p>
<pre><code> t <- read.table('test/R11129600_SL140.txt')
names.labels <- data.frame(varnames = as.vector(t(t[2,])), labels = as.vector(t(t[1,])))
dt <- t[-1:-2,]
colnames(dt) <- names.labels$varnames
rm(t)
</code></pre>
<p>Columns in the data frame <code>dt</code> will be factor variables. For quantitative analyses, those should be converted to numeric types. To do that, define a vector of strings for column names or indices of column numbers and apply to each column: </p>
<pre><code>varnames <- c('var1','var2')
dt[,varnames] <- sapply(dt[,varnames],function(x){as.numeric(as.character(x))})
</code></pre>
michaeldmbader@gmail.com (mikebader)Wed, 26 Nov 2014 10:47:57 +0000https://mikebader.net/blog/entries/2014/11/26/importing-text-files-variable-names-r/Methods