Jeromy Anglim's Blog: Psychology and Statistics


Wednesday, March 10, 2010

Using Regular Expressions in R: Case Study in Cleaning a BibTeX Database

I recently had to clean up a BibTeX database containing around 1,000 references. One of the clean-up tasks was to ensure that page ranges were separated by en-dashes rather than hyphens. This post sets out how I used regular expressions in R to complete the task and check the results. I also hope to highlight the general power of string manipulation in R.

Problem: There are four main kinds of dashes: hyphens, en-dashes, em-dashes, and minus signs. Wikipedia discusses the differences between them. LaTeX produces dashes using one, two, or three hyphens (- for hyphen; -- for en-dash; --- for em-dash) or $-$ for a minus sign. When expressing a range of numbers (e.g., pages 96-100), an en-dash should be used. However, all of my page ranges used only a single hyphen.
Thus, I wanted to replace "-" with "--", but only within page numbers.
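
A naive global replacement of every hyphen would also mangle hyphenated words elsewhere in the database (e.g., in titles or author names), so the substitution has to be restricted to the pages fields. A minimal sketch of the difference, using a couple of made-up lines:

  bib <- c("  title = {Self-paced learning and recall},",
           "  pages = {90-138},")
  gsub("-", "--", bib)
    # naive: changes "Self-paced" to "Self--paced" as well
  sub("-", "--", bib[grep("^  pages = ", bib)])
    # restricted: only the pages line is changed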

The initial text file looked a little like this but with 1000 more references:
@ARTICLE{Reder1987CP,
  author = {Reder, L. M.},
  title = {Strategy selection in question answering},
  journal = {Cognitive Psychology},
  year = {1987},
  volume = {19},
  pages = {90-138},
  endnotereftype = {Journal Article},
  shorttitle = {Strategy selection in question answering}
}

@ARTICLE{Reder1982PR,
  author = {Reder, L. M.},
  title = {Plausability judgments versus fact retrieval: Strategies for sentence
  verification},
  journal = {Psychological Review},
  year = {1982},
  volume = {89(3)},
  pages = {248-278},
  endnotereftype = {Journal Article},
  shorttitle = {Plausability judgments versus fact retrieval: Strategies for sentence
  verification}
}

And I wanted something like this (note the change in the pages fields):
@ARTICLE{Reder1987CP,
  author = {Reder, L. M.},
  title = {Strategy selection in question answering},
  journal = {Cognitive Psychology},
  year = {1987},
  volume = {19},
  pages = {90--138},
  endnotereftype = {Journal Article},
  shorttitle = {Strategy selection in question answering}
}

@ARTICLE{Reder1982PR,
  author = {Reder, L. M.},
  title = {Plausability judgments versus fact retrieval: Strategies for sentence
  verification},
  journal = {Psychological Review},
  year = {1982},
  volume = {89(3)},
  pages = {248--278},
  endnotereftype = {Journal Article},
  shorttitle = {Plausability judgments versus fact retrieval: Strategies for sentence
  verification}
}


Solution: The natural choice was to use regular expressions. Many programming languages (and some text editors) support them. Because I'm most familiar with R, I tend to use R to process regular expressions. It's probably not the most obvious choice, but working interactively in R gives immediate feedback on how patterns are matched and replaced, it lets me leverage my existing R skills, and it keeps me familiar with the string-manipulation tools I also need for data analysis.

Overview of regular expressions: For readers unfamiliar with them, regular expressions are an extremely powerful tool for finding and replacing text. Information about support for regular expressions in R can be found by typing ?regex. Additional information about the actual search-and-replace functions can be found by looking at the help for one of the string-manipulation functions, such as ?gsub. Data Manipulation with R has a chapter on string manipulation in R that I found helpful. Regular-Expressions.info also has a tutorial.
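
As a quick illustration of the two functions used below (grep to find matching lines, sub to perform a replacement), here is a toy example with made-up strings:

  fruit <- c("apple pie", "banana split", "apple crumble")
  grep("apple", fruit)        # indices of matching elements: 1 3
  grepl("apple", fruit)       # logical version: TRUE FALSE TRUE
  sub("apple", "pear", fruit) # replace the first match in each element
  gsub("p", "P", fruit)       # gsub replaces every match, not just the first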


Copy of the R Code

  x <- readLines("clipboard-128")
    # Read the BibTeX database from the Windows clipboard
    #   (the -128 suffix allows reading up to 128 KB; a file name would also work)
    # The result is a character vector where each line is an element
  
  # The initial filter reads:
  # "^"           start of line
  # "  pages = "  literal text
  # "[{]"         the opening brace is a special character, so it is
  #                 placed inside square brackets (a character class)
  #                 to be matched literally
  # "[[:digit:]]" any digit from 0 to 9
  # "+"           one or more of the preceding character
  #                 (i.e., one or more digits)
  # "-"           a literal hyphen
  # "[[:digit:]]" any digit from 0 to 9
  # "+"           one or more of the preceding character
  #                 (i.e., one or more digits)
  initialFilter <- "^  pages = [{][[:digit:]]+-[[:digit:]]+" 

  myPattern <- "-"
  myReplacement <- "--"
  xOutput <- x
  
  # Apply the initial filter to get the indices of matching lines
  xSubset <- grep(initialFilter, x) 
  
  # Replace the first hyphen in each matching line
  # (sub replaces only the first match, which is all that is needed here)
  xOutput[xSubset] <- sub(pattern = myPattern, 
      replacement = myReplacement, x = x[xSubset])

  # Basic check that the replacement worked:
  # show the original and replaced versions of every changed line side by side
  cbind(x[x != xOutput], xOutput[x != xOutput]) 
  
  xOutput  # inspect the full result

  # Write the replaced text to a file
  writeLines(xOutput, "xOutput.txt")

Copy of the R Output from the Check:
The following shows the first few lines of the check. The first column shows the original text and the second column shows the replaced text:

>   cbind(x[x != xOutput], xOutput[x != xOutput]) 
       [,1]                     [,2]                     
  [1,] "  pages = {598-614},"   "  pages = {598--614},"  
  [2,] "  pages = {883-901},"   "  pages = {883--901},"  
  [3,] "  pages = {360-364},"   "  pages = {360--364},"  
  [4,] "  pages = {288-318},"   "  pages = {288--318},"  
  [5,] "  pages = {3-27},"      "  pages = {3--27},"     
  [6,] "  pages = {567-589},"   "  pages = {567--589},"  
  [7,] "  pages = {259-290},"   "  pages = {259--290},"  
  [8,] "  pages = {270-304},"   "  pages = {270--304},"  

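As an additional sanity check (not shown in the output above), re-applying the initial filter to the replaced text should return no matches, because every page range that previously contained a single hyphen now contains two:

  grep(initialFilter, xOutput)
    # should return integer(0): no single-hyphen page ranges remain
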
Main points that I take away from this:

  • R has powerful string-manipulation tools; they're worth learning if you use R.
  • R has a habit of introducing users to powerful tools hidden from the typical Windows setup.
  • R, LaTeX, BibTeX, Sweave, and regular expressions are all text-driven systems, in contrast to largely menu-driven systems such as SPSS, MS Word, and EndNote. Their textual nature facilitates their mutual co-operation.
  • Running checks on regular-expression replacement operations is important.