mikebader.net - Entries for the tag statistics
https://mikebader.net/blog/tags/statistics/
The last entries tagged with statistics (en-us, Zinnia; Fri, 03 Sep 2021 17:32:59 +0000)

R Translation of Stata Description Calculating Marginal Effects and Standard Errors
https://mikebader.net/blog/entries/2019/09/04/r-translation-stata-description-calculating-marginal-effects-and-standard-errors/<p>This post translates into R the Stata code for calculating the standard errors of predictions contained in the document, <a href="https://www.stata.com/support/faqs/statistics/compute-standard-errors-with-margins/">How are average marginal effects and their standard errors computed by margins using the delta method?</a> by Jeff Pitblado of StataCorp. </p>
<p>First, read the data into R. If you have not already done so, you will need to install the R package <a href="https://www.rdocumentation.org/packages/readstata13/versions/0.9.2"><code>readstata13</code></a> by issuing the command <code>install.packages('readstata13')</code>. Recall that Stata (until Stata 16) could only hold a single dataset in memory at a time; therefore, commands issued in Stata to generate variables and such do not explicitly name the data being used. In R, we will load the data into an object, <code>dta</code>. </p>
<div class="codehilite"><pre><span></span><span class="kn">library</span><span class="p">(</span><span class="s">'readstata13'</span><span class="p">)</span>
dta <span class="o"><-</span> read.dta13<span class="p">(</span><span class="s">'http://www.stata-press.com/data/r14/margex.dta'</span><span class="p">)</span>
</pre></div>
<p><strong>Fit the logistic regression model</strong></p>
<p>We fit the same logistic regression model and check to see that it looks the same as Pitblado's:</p>
<div class="codehilite"><pre><span></span>fit <span class="o"><-</span> glm<span class="p">(</span>outcome <span class="o">~</span> treatment <span class="o">+</span> distance<span class="p">,</span> data<span class="o">=</span>dta<span class="p">,</span> family<span class="o">=</span>binomial<span class="p">)</span>
<span class="kp">summary</span><span class="p">(</span>fit<span class="p">)</span>
</pre></div>
<div class="codehilite"><pre><span></span>##
## Call:
## glm(formula = outcome ~ treatment + distance, family = binomial,
## data = dta)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8221 -0.7995 -0.4271 -0.4106 2.2555
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.337762 0.096241 -24.291 < 2e-16 ***
## treatment 1.427760 0.113082 12.626 < 2e-16 ***
## distance -0.004775 0.001105 -4.321 1.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2732.1 on 2999 degrees of freedom
## Residual deviance: 2485.6 on 2997 degrees of freedom
## AIC: 2491.6
##
## Number of Fisher Scoring iterations: 7
</pre></div>
<p>Looks good.</p>
<p><strong>Standard errors for predictive margins.</strong> </p>
<p>Now we will calculate the standard error of the predicted margin for the two conditions of the <code>treatment</code> variable (<code>0</code> and <code>1</code>). Stata stores scalars, macros, and matrices from an analysis that can be accessed via special commands such as <code>_b[<name>]</code>, which gives the beta coefficient associated with the variable <em>name</em> from a model. R, in contrast, stores results as attributes of the model fit object (in our case, the <code>fit</code> object above). I create a couple of objects here that correspond to what is available in the returned values of the <code>margins</code> command in Stata: </p>
<table>
<thead>
<tr>
<th>R variable</th>
<th>Stata return value</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>betas</code></td>
<td><code>_b</code></td>
</tr>
<tr>
<td><code>vcov(fit)</code></td>
<td><code>e(V)</code></td>
</tr>
</tbody>
</table>
<p>And I create two alternate datasets, <code>d0</code> and <code>d1</code>, where the value of <code>treatment</code> equals 0 and 1, respectively, and all other variables remain the same.</p>
<div class="codehilite"><pre><span></span>betas <span class="o"><-</span> coef<span class="p">(</span>fit<span class="p">)</span>
d0 <span class="o"><-</span> d1 <span class="o"><-</span> dta <span class="c1">## Create two copies of the data</span>
d0<span class="o">$</span>treatment <span class="o">=</span> <span class="m">0</span> <span class="c1">## Set values of treatment to equal 0 in all rows</span>
d1<span class="o">$</span>treatment <span class="o">=</span> <span class="m">1</span> <span class="c1">## Set values of treatment to equal 1 in all rows</span>
</pre></div>
<p>Now we obtain the predicted values of all observations where <code>treatment==0</code> and all observations where <code>treatment==1</code>. To do this on the probability scale, we use the <code>plogis()</code> function in R, which corresponds to the <code>invlogit()</code> function in Stata. We summarize the two sets of predictions to confirm that they match Pitblado's. </p>
<div class="codehilite"><pre><span></span>p0 <span class="o"><-</span> plogis<span class="p">(</span>betas<span class="p">[</span><span class="s">'treatment'</span><span class="p">]</span><span class="o">*</span>d0<span class="o">$</span>treatment <span class="o">+</span> betas<span class="p">[</span><span class="s">'distance'</span><span class="p">]</span><span class="o">*</span>d0<span class="o">$</span>distance <span class="o">+</span> betas<span class="p">[</span><span class="s">'(Intercept)'</span><span class="p">])</span>
p1 <span class="o"><-</span> plogis<span class="p">(</span>betas<span class="p">[</span><span class="s">'treatment'</span><span class="p">]</span><span class="o">*</span>d1<span class="o">$</span>treatment <span class="o">+</span> betas<span class="p">[</span><span class="s">'distance'</span><span class="p">]</span><span class="o">*</span>d1<span class="o">$</span>distance <span class="o">+</span> betas<span class="p">[</span><span class="s">'(Intercept)'</span><span class="p">])</span>
<span class="kp">lapply</span><span class="p">(</span><span class="kt">list</span><span class="p">(</span>p0<span class="p">,</span> p1<span class="p">),</span> <span class="kp">summary</span><span class="p">)</span>
</pre></div>
<div class="codehilite"><pre><span></span>## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002695 0.082913 0.085490 0.079115 0.086995 0.087948
##
## [[2]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01114 0.27376 0.28045 0.26002 0.28432 0.28676
</pre></div>
<p>Success!<sup id="fnref:predict"><a class="footnote-ref" href="#fn:predict" rel="footnote">1</a></sup></p>
<p>Now we calculate the matrices that make up the Jacobian matrix. The <code>vecaccum</code> command in Stata multiplies the first column listed by the remaining columns in the matrix. In Stata, that matrix consists of the data held in memory; in R, we create this matrix manually. We test our Jacobian against Pitblado's:</p>
<div class="codehilite"><pre><span></span>dp0dxb <span class="o"><-</span> p0<span class="o">*</span><span class="p">(</span><span class="m">1</span><span class="o">-</span>p0<span class="p">)</span>
dp1dxb <span class="o"><-</span> p1<span class="o">*</span><span class="p">(</span><span class="m">1</span><span class="o">-</span>p1<span class="p">)</span>
zero <span class="o"><-</span> <span class="kp">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="kp">nrow</span><span class="p">(</span>dta<span class="p">))</span>
one <span class="o"><-</span> <span class="kp">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="kp">nrow</span><span class="p">(</span>dta<span class="p">))</span>
J0 <span class="o"><-</span> dp0dxb <span class="o">%*%</span> <span class="kt">matrix</span><span class="p">(</span><span class="kt">c</span><span class="p">(</span>zero<span class="p">,</span> one<span class="p">,</span> zero<span class="p">,</span> dta<span class="o">$</span>distance<span class="p">),</span> ncol<span class="o">=</span><span class="m">4</span><span class="p">)</span>
J1 <span class="o"><-</span> dp1dxb <span class="o">%*%</span> <span class="kt">matrix</span><span class="p">(</span><span class="kt">c</span><span class="p">(</span>zero<span class="p">,</span> one<span class="p">,</span> one<span class="p">,</span> dta<span class="o">$</span>distance<span class="p">),</span> ncol<span class="o">=</span><span class="m">4</span><span class="p">)</span>
Jac <span class="o"><-</span> <span class="kt">matrix</span><span class="p">(</span><span class="kt">c</span><span class="p">(</span>J0<span class="o">/</span><span class="kp">nrow</span><span class="p">(</span>dta<span class="p">),</span> J1<span class="o">/</span><span class="kp">nrow</span><span class="p">(</span>dta<span class="p">)),</span> nrow<span class="o">=</span><span class="m">2</span><span class="p">,</span> byrow<span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span>
</pre></div>
<p>Notice that it doesn't look exactly the same. That is because Stata expects the intercept (what it calls <code>_cons</code>) to be in the last row/column of the variance-covariance matrix, whereas R expects the intercept to be in the first row/column. </p>
<p>With that adjustment, we can calculate the variances by using the last three columns of the Jacobian. Using only the last three matches the output from R's <code>vcov()</code>; Stata saves an extra row and column to represent the reference category of the <code>treatment</code> variable, and all of the cells in that extra row and column equal zero. Now we test the variance matrix against the one that Pitblado calculated:</p>
<div class="codehilite"><pre><span></span>V <span class="o"><-</span> Jac<span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="m">4</span><span class="p">]</span> <span class="o">%*%</span> vcov<span class="p">(</span>fit<span class="p">)</span> <span class="o">%*%</span> <span class="kp">t</span><span class="p">(</span>Jac<span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="m">4</span><span class="p">])</span>
V
</pre></div>
<div class="codehilite"><pre><span></span>## [,1] [,2]
## [1,] 4.824092e-05 -1.180452e-07
## [2,] -1.180452e-07 1.249294e-04
</pre></div>
<p>And we get the same values!</p>
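<p>As a cross-check on the logic of the calculation (not on the actual margex estimates), the same delta-method recipe can be sketched in pure Python. Every number below is made up for illustration: hypothetical coefficients, a toy diagonal covariance matrix standing in for <code>vcov(fit)</code>, and five arbitrary <code>distance</code> values.</p>

```python
import math

# A pure-Python sketch of the delta-method calculation above. The
# coefficients, covariance matrix, and distance values are all invented
# for illustration; they are NOT the margex estimates.
beta = {"(Intercept)": -2.3, "treatment": 1.4, "distance": -0.005}
V = [[0.0100, 0.0,    0.0],        # toy diagonal vcov, in the order:
     [0.0,    0.0130, 0.0],        # intercept, treatment, distance
     [0.0,    0.0,    0.0000012]]
distance = [10.0, 25.0, 40.0, 55.0, 70.0]
n = len(distance)

def plogis(x):
    """R's plogis(): the inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def margin_and_jacobian(treat):
    """Predictive margin at treatment==treat, plus its Jacobian row."""
    margin = 0.0
    J = [0.0, 0.0, 0.0]
    for d in distance:
        p = plogis(beta["(Intercept)"] + beta["treatment"] * treat
                   + beta["distance"] * d)
        margin += p / n
        w = p * (1.0 - p)       # dp/d(xb), as in dp0dxb and dp1dxb
        J[0] += w * 1.0 / n     # derivative w.r.t. the intercept
        J[1] += w * treat / n   # ... w.r.t. the treatment coefficient
        J[2] += w * d / n       # ... w.r.t. the distance coefficient
    return margin, J

def delta_var(J, V):
    """The quadratic form J V J' for one Jacobian row."""
    return sum(J[i] * V[i][j] * J[j] for i in range(3) for j in range(3))

for t in (0, 1):
    m, J = margin_and_jacobian(t)
    print(f"treatment={t}: margin={m:.4f}, SE={math.sqrt(delta_var(J, V)):.4f}")
```

<p>The standard errors are the square roots of the diagonal of <code>V</code> in the R code above; here each margin's variance is just one quadratic form because the toy example handles one Jacobian row at a time.</p>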
<div class="footnote">
<hr />
<ol>
<li id="fn:predict">
<p>As a side note, you can accomplish the same thing in R using the <code>predict()</code> function: typing <code>p0 <- predict(fit, newdata=d0, type='response')</code> yields the same result. <a class="footnote-backref" href="#fnref:predict" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
michaeldmbader@gmail.com (mikebader), Wed, 04 Sep 2019 17:43:02 +0000
https://mikebader.net/blog/entries/2019/09/04/r-translation-stata-description-calculating-marginal-effects-and-standard-errors/

beckieball, or selecting on skill
https://mikebader.net/blog/entries/2015/01/06/beckieball-or-selecting-skill/
<p>Over the weekend, <a href="http://scatter.wordpress.com/author/jeremyfreese/">jeremy</a> <a href="http://scatter.wordpress.com/2015/01/03/beckieball-and-the-study-of-not-quite-elite-selected-groups/">posted about beckieball</a>, a "new sport sweeping the country." The purpose was to show how selection on characteristics affects the correlation between those characteristics after selection. This, as commenter <a href="http://twitter.com/StuartBuck1">Stuart Buck</a> pointed out, is an example of <a href="http://en.wikipedia.org/wiki/Berkson%27s_paradox">Berkson's Paradox</a>; it also relates to <a href="http://scatter.wordpress.com/2014/12/30/selection/">jeremy's post about height and the nba</a>.</p>
<p>Although he left several other exercises to the reader, I thought I would do a simpler one: recreate the code that he used to make his example. I did this a) because it was a semi-useful way to shake the cobwebs from egg nog and yuletides, and b) because I think that it will come in handy teaching someday. </p>
<p>The logic that jeremy uses is as follows:</p>
<ol>
<li>Draw a sample with a known correlation structure </li>
<li>Develop a model to pick beckieball players based on measured characteristics</li>
<li>Calculate correlations between characteristics among different samples</li>
</ol>
<p>And here is the code:</p>
<pre><code>// 1. CREATE CORRELATION MATRIX AND DRAW MULTIVARIATE NORMAL SAMPLE OF 10,000
matrix C = (1,.57,1,.57,0,1,.57,0,0,1)
drawnorm perform height skills desire, n(10000) corr(C) cstorage(lower) clear
// 2. CREATE VARIABLE PREDICTING BEING PICKED FOR BECKIEBALL
gen pick = .5*height + .5* skills + rnormal()*.5
gsort -pick
// 3. CALCULATE CORRELATION MATRICES OF POPULATIONS
// Total population
pwcorr perform-desire
// Premiere League Beckieball
pwcorr perform-desire in 1/500
// Minor League Beckieball
pwcorr perform-desire in 501/1500
// Semipro Beckieball
pwcorr perform-desire in 1501/3000
// Overall correlation of characteristics and being picked
pwcorr
</code></pre>
<p>A couple of notes here. The matrix, <code>C</code>, is a lower-triangle matrix entered as a single row; the <code>drawnorm</code> command, which draws the multivariate sample, knows how to interpret it when the <code>cstorage(lower)</code> option is specified. Be careful entering this directly in Stata: <strong>the command will clear current memory</strong> (see the <code>clear</code> option in there). </p>
<p>One can see that being picked correlates with height and skills but is unrelated to desire (since scouts can't measure it, according to jeremy). But those correlations disappear when we look only at those within each of the leagues, as he notes in his post. </p>
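<p>For readers without Stata, the same demonstration can be re-created in pure Python (the seed and league cut-off here are arbitrary choices, not jeremy's): draw uncorrelated <code>height</code> and <code>skills</code>, pick the top 500 on the same <code>pick</code> score, and watch the selection induce a negative correlation among the picked.</p>

```python
import random

# A hypothetical Python re-creation of the beckieball simulation:
# height and skills are independent in the population, but selecting
# the top players on pick = .5*height + .5*skills + noise induces a
# negative correlation between them (Berkson's Paradox).
random.seed(1)
N = 10_000

def corr(xs, ys):
    """Pearson correlation, computed by hand to stay dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

height = [random.gauss(0, 1) for _ in range(N)]
skills = [random.gauss(0, 1) for _ in range(N)]
pick = [0.5 * h + 0.5 * s + 0.5 * random.gauss(0, 1)
        for h, s in zip(height, skills)]

# "Premiere League" = top 500 on pick (mirrors `gsort -pick` + `in 1/500`)
order = sorted(range(N), key=lambda i: -pick[i])
top = order[:500]

print("corr(height, skills), everyone:", round(corr(height, skills), 3))
print("corr(height, skills), top 500: ",
      round(corr([height[i] for i in top], [skills[i] for i in top]), 3))
```

<p>In the full sample the correlation sits near zero; among the top 500 it turns clearly negative, which is the whole point of the exercise.</p>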
michaeldmbader@gmail.com (mikebader), Tue, 06 Jan 2015 14:35:59 +0000
https://mikebader.net/blog/entries/2015/01/06/beckieball-or-selecting-skill/
Methods

Calculating Simple Power Analyses
https://mikebader.net/blog/entries/2010/10/18/calculating-simple-power-analyses/
<p>I am currently preparing a proposal for submission, and one piece of information that the agency suggests is the power required to distinguish effects. This is obviously a perfectly reasonable piece of information to request; however, power calculations fall into that class of things that I know I should know but don't. It is one of those topics that every statistics book will tell you is important, but either a) glosses over, or b) buries in such deep background that it is impossible to follow what the authors are talking about. Additionally, power calculations are complicated enormously by the fact that sample designs can become quite intricate. </p>
<p>In contrast to this traditional treatment, <a href="http://www.stat.columbia.edu/~gelman/">Andrew Gelman</a> and <a href="http://steinhardt.nyu.edu/faculty_bios/view/Jennifer_Hill">Jennifer Hill's</a> book, <a href="http://www.powells.com/biblio/72-9780521686891-0">Data Analysis Using Regression and Multilevel/Hierarchical Models</a>, provides a very clear description of simple power analyses, which -- thankfully -- is all that I really need for this project. To make sure that I don't forget, I record below how to find the required sample size, <em>n</em>, for varying levels of between-group effect differences, Δ, at 80% power. The formula is relatively easy (see pp. 437-447 for more info): (5.6σ/Δ)<sup>2</sup>. Therefore, if I measure change in units of standard deviations, <code>sd</code>, then I can estimate the sample size <code>n</code> for each unit of change.</p>
<pre><code>drop _all
range sd 0 1 41
gen n = (5.6/sd)^2
</code></pre>
<p>I can then make a graph of the expected sample size required for a standard unit change using the command <code>twoway line n sd</code>; or, alternatively, just print a table of numbers using <code>list</code>.</p>
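<p>The same table can be generated without Stata. Here is a quick Python rendering of the formula, with σ = 1 because change is measured in sd units (the grid of effect sizes below is my own choice):</p>

```python
# Sample size for 80% power from the Gelman & Hill rule of thumb,
# n = (5.6 * sigma / Delta)^2, with sigma = 1 (effects in sd units).
def required_n(delta_in_sd):
    """Sample size needed to detect an effect of delta_in_sd at 80% power."""
    return (5.6 / delta_in_sd) ** 2

for d in (0.1, 0.25, 0.5, 1.0):
    print(f"delta = {d:4} sd  ->  n = {required_n(d):7.1f}")
```

<p>Halving the detectable effect quadruples the required sample size, which the 1/Δ² in the formula makes plain.</p>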
michaeldmbader@gmail.com (mikebader), Mon, 18 Oct 2010 18:31:54 +0000
https://mikebader.net/blog/entries/2010/10/18/calculating-simple-power-analyses/
Methods