Matching Substrings Entirely Within Stata

At Orgtheory, Fabio asked about how to identify substrings within text fields in Stata. Although this is a seemingly simple proposal, there is one big problem, as Gabriel Rossman points out: Stata string fields can only hold 244 characters of text. As Fabio desires to use this field to analyze scientific abstracts, then 244 characters is obviously insufficient.

Gabriel Rossman has posted a solution he has called grepmerge that uses the Linux-based program grep to search for strings in files. This is a great solution, but it comes with one large caveat: it cannot be used in a native Windows environment. This is because the grep command is only native to Linux-based systems (which include Apple computers). Therefore, I set out to find a solution that was a) platform-independent and b) internal to Stata (if possible).

Below is the solution that I developed. The solution, it turns out, is not to rely on Stata's string variables or string functions (both can only handle 244 characters), but instead to rely on Stata's local macros ("macros" are what other programming languages call "variables;" however, this would be confusing given that Stata also has variables, thus Stata calls them "macros"). The second key comes from the extended functions of Stata's macros. These are functions that build in much of the programming functions for Stata. There is no function defined to search for strings that are immediately like regex() or strpos(); however, there is an extended function to substitute within strings that will also provide a count of the number of substitutions made. Since all we really care about is the number of times a string would be substituted, then if we know that the count of substitutions is greater than we have the information that we need.

This program assumes that you have a) a Stata dataset with a variable that uniquely identifies each observation (with any other data that you desire) and b) a second dataset that contains two variables: the same ID variable and the strings to be searched (this corresponds to what Fabio mentioned was the format of his original dataset in Excel). Given a simple string to search and the filename of the second dataset, the program will create a variable indicating whether the string was found.

program drop lstrmatch
program lstrmatch
    syntax using , MATCHterm(string) /*
            */ [DELIMiter(string) /*
            */ MERGEby(name) /*
            */ MATCHVARname(name)]

    if "`delimiter'"!="" {
        if "`delimiter'"=="tab" {
            local delimiter=`"`=char(9)'"'
        }
        else if "`delimiter'"=="comma" {
            local delimiter=","
        }
        else if "`delimiter'"=="space" {
            local delimiter=" "
        }
    }
    else {
        local delimiter=" "
    }

    if "`matchvarname'"!="" confirm new variable `matchvarname'
    else local matchvarname matchvar

    tempname infile
    file open `infile' `using', r
    if "`mergeby'"=="" {
        file read `infile' fl
        tokenize "`fl'", parse(`"`delimiter'"')
        local mergeby `1'
        }
    confirm variable `mergeby'
    file read `infile' fl
    tempvar m
    gen `m' = 0

    while `r(eof)'==0 {
        tokenize `"`fl'"', parse(`"`delimiter'"')
        local txt : subinstr local 3 `"`matchterm'"' `"`matchterm'"', all count(local num)
        qui replace `m' = 1 if `num'>0 & `mergeby'==`1'
        file read `infile' fl
        }

    gen `matchvarname' = `m'
end
* Have a nice day!

The key line of the program is toward the bottom:

local txt : subinstr local 3 `"`matchterm'"' `"`matchterm'"', all count(local num)

The local 3 contains the string to be searched for the simple string (stored in the local macro matchterm), replaces that string with itself (which is actually unnecessary since we are not actually storing the string), and then the final option count(local num) stores a count of the number of times that the simple string was found. The next line replaces the value of the variable m to equal 1 if the count was greater than one. At the end of the program, the variable defined by the option matchvarname() will be a binary variable identifying the observations where the string was found. The remainder of the program loads and parses the file with the second dataset. The only other confusing part at the top is a shortcut that allows one to define the delimiter by using the words tab, comma, and space rather than having to type the actual characters (although this is still acceptable).

If one had a tab-delimited format file with the first row containing the variable names (the left column being the ID variable to be merged to the Stata dataset and the right column being the text to be searched) then the command below would store the presence of the string in the variable strpres:

lstrmatch using infile.txt, matchterm("experiment") delimiter("tab") matchvarname(strpres)

If the second dataset does not contain the variable names and you need to define the variable on which to merge to the original Stata dataset, then one could use the command:

lstrmatch using infile.txt, matchterm("experiment") mergeby(id) delimiter("tab") matchvarname(strpres)

Of course, if you have several terms that you want to search for, you can embed this command in a foreach loop to capture all, but I will leave that exercise for another day.

mike bader

matching substrings entirely within stata

Pingbacks

Comments