Blog
All entries categorized “Programming”
Basic Tips for Writing Statistical Scripts
Sunday, July 17th, 2011 6:14p.m.
Although writing scripts is one of the most important skills for reproducible quantitative sociology, most people pick it up informally from more experienced colleagues in graduate school or at the workplace. Below are a few tips that I have learned from others, picked up on my own, or otherwise accumulated in my arsenal of tricks. There are great resources out there, but I thought it would be helpful to pass along the tips that I consider most important.
Nesting Stata Macros, or Hacking a Hash Map
Monday, June 6th, 2011 6:37p.m.
Programming in Stata is relatively straightforward, partly because the syntax is both powerful and simple. There are, however, a few minor annoyances in Stata's language, including the use of the backtick and apostrophe to indicate local macros (i.e., `localname'). Among these shortcomings, I would argue that the lack of anything like a list in Stata's language is one of the largest.
In most languages, you can store a list of items and refer to an item in the list by some sort of index. This is particularly helpful for iterating over the same step multiple times. Lists generally come in two flavors: lists in which you refer to an item by its position, and lists in which you refer to an item by a keyword (called hash maps in computer science lingo). Stata's matrices can be used for the first, though doing so might become complicated if you want to do something besides storing basic numbers or strings.
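The two flavors of list can be approximated with local macros. The following is a minimal sketch rather than the approach from the full post: it assumes a space-separated macro for positional lookup and a shared name prefix (the hypothetical label_) for keyed lookup.

```stata
* Positional lookup: treat a macro as a space-separated list
local keys "age educ inc"
local second : word 2 of `keys'
display "`second'"

* Keyed lookup: one macro per key, named with a shared prefix (hypothetical names)
local label_age  "Age in years"
local label_educ "Years of education"
local label_inc  "Household income"

* The inner macro `k' expands first, so `label_`k'' resolves to the value stored under that key
foreach k of local keys {
    display "`k' -> `label_`k''"
}
```

Because local macros disappear when a do-file finishes, the block needs to be run in one pass rather than line by line.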
Structuring Work: Data Cleaning and Construction, Laying the Foundation
Saturday, April 16th, 2011 11:37a.m.
In the last step, we downloaded all of our data, deposited it into directories that store this source data, backed it up, and write-protected the files. Now that we have done all of that, it is time to start working with the data! There is only one problem: almost inevitably, the data do not come neat, tidy, and ready to use. Often, the data contain major problems and need to be constructed in order to be usable. In this installment, I will write about managing files for cleaning, constructing, and storing datasets.
Structuring Work: Data, The Foundation of Work
Monday, March 14th, 2011 3:50p.m.
After establishing where my root directory resides, it is time to actually get to work. As with any endeavor, success begins by laying a solid foundation, and with academic work that foundation is our data.
The most fundamental skill to academic success is asking good questions and acquiring data to answer those questions. Yet, in quantitative research, that skill is useless without the ability to manipulate data into useful formats that are capable of answering the good questions. Data cleaning, construction, and manipulation constitute well over half of my work on major quantitative projects.
Structuring Work: The Root, Where it all Begins
Friday, Feb. 11th, 2011 1:02p.m.
In my last post, I explained the value of a directory structure: consistent file management structures a disciplined workflow that increases productivity. The magnitude of its importance was a revelation that occurred largely after graduate school as the result of starting a new job.
When I moved to start my new job, I needed to move my files to my new computer. In transferring my files, I realized that the work that followed my well-defined workflow transferred easily, while the work that didn't follow the workflow did not.
The contrast between the ease with which I started the well-structured work and the difficulty of getting up to speed on the disorganized pieces threw into sharp relief the importance of maintaining a workflow structured by a consistent file management system. For those well-organized projects, the only difference on my new computer was that I began work from a different "root directory".
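In a Stata script, the idea of a movable root can be sketched with global macros. The directory names below are hypothetical and are not the layout described in this series; the point is only that one line changes when the files move to a new computer.

```stata
* Hypothetical root; this is the only line that changes from machine to machine
global root "C:/Users/me/research"

* Everything else is expressed relative to the root (hypothetical folder names)
global source      "$root/myproject/data/source"
global constructed "$root/myproject/data/constructed"

display "$source"
display "$constructed"
```

Forward slashes are used on purpose: Stata accepts them on Windows, and a backslash placed directly before a macro reference can suppress its expansion.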
Structuring Work
Friday, Feb. 4th, 2011 10:04a.m.
When I say that one of the most important things that I did in graduate school was set up a directory structure and workflow for my files, I am not kidding. Reading theory, learning statistical methods, and writing literature reviews were all important. However, just as important -- though not nearly as sexy -- is setting up a file structure and working directory.
Despite how trivial it sounds, maintaining a well-designed directory structure not only provides a framework for files, it structures productive work.
Given how important it was for me, I will attempt to explain the directory structure that I developed. Let me begin by saying that I am not an expert at developing directory structures. There are experts in these matters. Though I had an interest in becoming an expert at file management, I was too busy trying to become an expert in what I was actually studying to have the time. In an ongoing series of posts, I will lay out the basic intuition behind my system, what has seemed to work (and not) with it, and improvements I would like to make. I would, of course, be interested in feedback and/or comparisons to what others do.
Calculating Simple Power Analyses
Monday, Oct. 18th, 2010 6:31p.m.
I am currently preparing a proposal for submission and one piece of information that the agency suggests is the power required to distinguish effects. This is obviously a perfectly reasonable piece of information to request; however, power calculations fall into that class of things that I know that I should know but I don't. It is one of those topics that every statistics book will tell you is important, but either a) glosses over the topic, or b) provides such a deep background that it is impossible to follow what the authors are talking about. Additionally, power calculations are complicated enormously by the fact that sample designs can become very complicated.
In contrast to this traditional treatment, Andrew Gelman and Jennifer Hill's book, Data Analysis Using Regression and Multilevel/Hierarchical Models, provides a very clear description of simple power analyses, which -- thankfully -- is all that I really need for this project. To make sure that I don't forget, I record below how to find the required sample size, n, for varying levels of between-group effect differences, Δ, at 80% power. The formula is relatively easy (see pp. 437-447 for more info): $n = (5.6\sigma/\Delta)^2$. Therefore, if I measure change in units of standard deviations, sd (so that σ = 1 and Δ = sd), then I can estimate the sample size n for each size of change.
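* Start from an empty dataset
drop _all
* 41 evenly spaced effect sizes from 0 to 1 standard deviations (steps of 0.025)
range sd 0 1 41
* Required n at 80% power; n is missing where sd == 0 (division by zero)
gen n = (5.6/sd)^2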
I can then make a graph of the expected sample size required for a standard unit change using the command twoway line n sd; or, alternatively, just print a table of numbers using list.
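As a check on the constant, the 5.6 can be reproduced from normal quantiles. This is a sketch under my reading of the formula, assuming a two-sided test at the 5% level, 80% power, and two equal-sized groups (so the standard error of the difference is 2σ/√n); the scalar names are illustrative.

```stata
* Distance from zero, in standard errors, needed for 80% power at the 5% level
scalar z_total = invnormal(0.975) + invnormal(0.80)   // about 2.8

* With n/2 per group, se = 2*sigma/sqrt(n), which doubles the multiplier
scalar multiplier = 2 * z_total                        // about 5.6
display multiplier

* Worked example: total n needed to detect a difference of half a standard deviation
scalar n_half = (5.6 / 0.5)^2
display n_half                                         // 125.44, so 126 after rounding up
```

Smaller effects get expensive quickly: halving the detectable difference quadruples the required sample size.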
Importing Text Files with Variable Names to Stata
Friday, July 23rd, 2010 1:17p.m.
I have come across a problem several times that has been relatively frustrating to deal with. I have data that is downloaded from a site (specifically the Census, which is why this comes up consistently) in which the first two lines of the data contain the variable name and variable description, respectively. This is incredibly useful for documenting data. Rather than attempting to figure out what variable pct001001 means, the description of the variable is right there.
The problem with data in this format is that Stata imports variables as string variables with the first observation being the variable description. I could pull the first two lines of the data out of the original dataset, transpose the rows and columns, save them in a separate text file, and then import the variable names and descriptions. However, managing two files means that it is more likely that one gets lost or I forget to send one of the files to a colleague working on the paper, or any number of other problems that could be experienced by separating these two files. Having one single file would be far superior and that is what the code below is designed to accommodate.
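The full code from the post is not reproduced in this excerpt. One way the single-file idea could work is sketched below, assuming a comma-delimited file whose first row holds names and whose second row holds descriptions; the stand-in file and its values are made up for illustration and are not real Census data.

```stata
* Build a tiny stand-in file (made-up values, not real Census data)
tempfile raw
tempname fh
file open `fh' using "`raw'", write text replace
file write `fh' "geoid,pct001001" _n
file write `fh' "Geography identifier,Total population" _n
file write `fh' "101,5000" _n
file write `fh' "102,7500" _n
file close `fh'

* Row 1 supplies the variable names; row 2 arrives as the first observation,
* which forces every variable to be read as a string
insheet using "`raw'", comma names clear

* Move each description from the first observation into the variable label
foreach v of varlist _all {
    local desc = `v'[1]
    label variable `v' `"`desc'"'
}

* Drop the description row and convert the remaining values back to numbers
drop in 1
destring _all, replace
describe
```

Variable labels hold at most 80 characters, so longer descriptions would be truncated. In newer versions of Stata, import delimited replaces insheet.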
Data available from the U.S. Census comes in the following format (data is clipped):
Matching Substrings Entirely Within Stata
Sunday, May 2nd, 2010 7:59p.m.
At Orgtheory, Fabio asked how to identify substrings within text fields in Stata. Although this is a seemingly simple request, there is one big problem, as Gabriel Rossman points out: Stata string fields can only hold 244 characters of text. Since Fabio wants to use this field to analyze scientific abstracts, 244 characters is obviously insufficient.
Gabriel Rossman has posted a solution he has called grepmerge that uses the Unix program grep to search for strings in files. This is a great solution, but it comes with one large caveat: it cannot be used in a native Windows environment. This is because the grep command is native only to Unix-like systems (which include Linux and Apple's Mac OS X). Therefore, I set out to find a solution that was a) platform-independent and b) internal to Stata (if possible).
Below is the solution that I developed. The solution, it turns out, is not to rely on Stata's string variables or string functions (both can only handle 244 characters), but instead to rely on Stata's local macros ("macros" are what other programming languages call "variables"; because Stata also has variables in its datasets, it calls them "macros" to avoid confusion). The second key comes from the extended functions of Stata's macros, which supply much of Stata's programming functionality. No extended function searches a string directly the way regexm() or strpos() do; however, there is an extended function that substitutes within strings and can also report a count of the number of substitutions made. Since all we really care about is whether the string occurs at all, a substitution count greater than zero gives us the information that we need.
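The counting trick can be sketched with the subinstr extended macro function and its count() option. The abstract text and the macro names here are made up for illustration; how the text gets into the macro in the first place is left aside.

```stata
* Made-up abstract text held in a local macro rather than a string variable
local abstract "We study networks and how networks of scientists form over time"
local needle "networks"

* Replace every occurrence with nothing; only the count of replacements matters
local discard : subinstr local abstract "`needle'" "", all count(local hits)

display "`needle' appears `hits' time(s)"
if `hits' > 0 {
    display "match found"
}
```

The substituted text in discard is thrown away; the original macro abstract is left unchanged, so several search terms can be counted against the same text in turn.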
Miscellany
- The views presented here are solely and entirely my own; they do not represent those of my colleagues, employer, or any funding agencies which may support me.
- The writing on this blog is covered by a Creative Commons License (described here). Feel free to distribute or re-post with a link to the original content provided that it is freely available to others.
