Regular expressions in Scala

Versions: Scala 2.12.1

Even though regular expressions look similar in a lot of languages, each language brings some constructs of its own. Scala is no exception to this rule, as we'll see in this post.

This post is divided into 2 parts. The first one covers the features of RegEx in Scala. The second one focuses more on the internal implementation. The post won't cover all possible patterns - there are plenty of better resources listing them. Instead, it'll focus on Scala's API.

Features

Scala represents regular expressions (RegEx) with the scala.util.matching.Regex class. There are 2 principal ways to create one: explicitly, by calling the constructor, or by invoking the r method that the StringLike decorator adds to strings. The choice is up to you, but as you can already see, one is more visible than the other. Alvin Alexander discusses this point in his post How to find regex patterns in Scala strings. Here we'll only show both methods:

it("should be created with r") {
  val regex = "\\w+"

  regex shouldBe a [String]
  regex.r shouldBe a [Regex]
}
it("should be created with a constructor") {
  val regex = new Regex("\\w+")
  
  regex shouldBe a [Regex]
}
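
Both snippets above escape every backslash. As a small sketch of an alternative, Scala's triple-quoted strings keep backslashes as they are, which makes longer patterns easier to read. The object name RegexCreationSketch is only an illustration:

```scala
import scala.util.matching.Regex

// Hypothetical holder object, used only to keep the example self-contained.
object RegexCreationSketch {
  // Regular string literal: every backslash has to be escaped.
  val escaped: Regex = "\\w+".r

  // Triple-quoted literal: backslashes are taken literally.
  val raw: Regex = """\w+""".r

  // Both literals produce the same pattern text.
  val samePattern: Boolean = escaped.regex == raw.regex
}
```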

Once created, we can do a lot of things with our RegEx. Note, however, that in the examples above the expressions are created separately in each test. Since compiling a pattern is costly, it should be done as few times as possible. The most basic thing we can do is find the first match:

val NumericRegEx = "\\d+".r
val TextToMatch = "2 years ago I was 28"
it("should find first matching occurrence") {
  val matchedNumber = NumericRegEx.findFirstIn(TextToMatch)

  matchedNumber shouldBe defined
  matchedNumber.get shouldEqual "2"
}
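
To make the point about creation cost concrete, here is a minimal sketch of one way to compile a pattern only once: keep it as a field of an object and reuse it in every call. NumberFinder and firstNumber are hypothetical names, not part of Scala's API:

```scala
import scala.util.matching.Regex

// Hypothetical helper: the pattern is compiled once, when the object is initialized.
object NumberFinder {
  private val Numeric: Regex = "\\d+".r

  // Every call reuses the already compiled pattern.
  def firstNumber(text: String): Option[String] = Numeric.findFirstIn(text)
}
```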

And if we want to find all matches or extract one or more groups, we can do it in several ways:

it("should find all matching occurrences") {
  val matchedNumbers = NumericRegEx.findAllMatchIn(TextToMatch)

  val numbersFromMatch = matchedNumbers.map(regexMap => regexMap.matched).toSeq

  numbersFromMatch should have size 2
  numbersFromMatch should contain allOf("2", "28")
}

it("should extract group values") {
  val regexWithGroups = "(\\d+)-(\\d+)-(\\d+)".r("year", "month", "day")

  val matches = regexWithGroups.findFirstMatchIn("2018-01-20")

  matches shouldBe defined
  matches.get.group("year") shouldEqual "2018"
  matches.get.group("month") shouldEqual "01"
  matches.get.group("day") shouldEqual "20"
}

it("should extract groups from defined pattern") {
  val universalDateRegexWithGroups = "(\\d+)-(\\d+)-(\\d+)".r
  val inconsistentlyFormattedDate = "2018-2-02"

  val universalDateRegexWithGroups(year, month, day) = inconsistentlyFormattedDate

  year shouldEqual "2018"
  month shouldEqual "2"
  day shouldEqual "02"
}

The last option had a "wow" effect on me because of its simplicity. However, on closer inspection it is not suited to cases where several different patterns may match the input. An alternative is to use pattern matching:

it("should extract groups from different pattern matchers") {
  // above solution has a drawback - if the input has multiple different formats, we won't be able
  // to match it easily and extract groups
  // Instead we can use pattern matching
  def matchDate(date: String): Option[String] = {
    // Of course, it can be defined as a singleton, but for sake of readability, it's defined directly in the method
    val UniversalDateRegexWithGroups = "(\\d+)-(\\d+)-(\\d+)".r
    val PrefixedDateRegexWithGroups = "Date is (.+)".r
    date match {
      case UniversalDateRegexWithGroups(year, month, day) => Some(s"${year}-${month}-${day}")
      case PrefixedDateRegexWithGroups(date) => Some(date)
      case _ => None
    }
  }

  val numericDateMatch = matchDate("2018-08-10")
  numericDateMatch shouldBe defined
  numericDateMatch.get shouldEqual "2018-08-10"

  val textDateMatch = matchDate("Date is Tuesday, January 1, 2019")
  textDateMatch shouldBe defined
  textDateMatch.get shouldEqual "Tuesday, January 1, 2019"
}
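
There is one more reason to prefer a match expression over the direct val extraction shown earlier: when the input does not match, the val form throws a scala.MatchError, while a match expression lets us handle the failure. A small sketch, with ExtractionFailureSketch, unsafeYear and safeYear as hypothetical names:

```scala
// Hypothetical holder object illustrating both extraction styles.
object ExtractionFailureSketch {
  private val DateRegex = "(\\d+)-(\\d+)-(\\d+)".r

  // Direct extraction: throws scala.MatchError when the input does not match.
  def unsafeYear(input: String): String = {
    val DateRegex(year, _, _) = input
    year
  }

  // Match expression: a non-matching input becomes None instead of an exception.
  def safeYear(input: String): Option[String] = input match {
    case DateRegex(year, _, _) => Some(year)
    case _ => None
  }
}
```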

We can also keep things simple and only verify whether some pattern is found:

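it("should check whether the RegEx matches with pattern matching") {
  // A Regex extractor must match the whole input by default; unanchored lets it match a substring
  val UnanchoredNumericRegEx = NumericRegEx.unanchored
  val isMatching = TextToMatch match {
    case UnanchoredNumericRegEx(_*) => true
    case _ => false
  }

  isMatching shouldBe true
}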
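
A detail worth keeping in mind here: when a Regex is used as an extractor, it has to match the whole input, whereas the find* methods look for a matching substring. The unanchored method returns a Regex whose extractor also accepts a partial match. The sketch below contrasts both behaviors; AnchoringSketch and its method names are made up for the illustration:

```scala
// Hypothetical holder object contrasting anchored and unanchored extractors.
object AnchoringSketch {
  private val Numeric = "\\d+".r
  private val NumericAnywhere = Numeric.unanchored

  // Anchored (default): true only if the whole text consists of digits.
  def isOnlyDigits(text: String): Boolean = text match {
    case Numeric(_*) => true
    case _ => false
  }

  // Unanchored: true as soon as some part of the text matches.
  def containsDigits(text: String): Boolean = text match {
    case NumericAnywhere(_*) => true
    case _ => false
  }
}
```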

Matching is not the only thing we can do with RegExes. For instance, we can also split a string based on a pattern:

it("should split the text by numbers") {
  val textToSplit = "3a4b10c"
  val splitRegEx = "\\d+".r

  val letters = splitRegEx.split(textToSplit)

  letters should have size 4
  letters should contain allOf("", "a", "b", "c")
}

Or, going even further, we can replace the matched occurrences:

val NumericRegEx = "\\d+".r
val TextToMatch = "2 years ago I was 28"
it("should replace numbers with question marks") {
  val replacedText = NumericRegEx.replaceAllIn(TextToMatch, "?")

  replacedText shouldEqual "? years ago I was ?"
} 
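
Beyond a fixed replacement string, the replace* family has a few other variants. The sketch below, under hypothetical names (ReplaceSketch and its methods), shows replaceAllIn with a function computing the replacement for each match, replaceFirstIn, and Regex.quoteReplacement, which is useful because a dollar sign in a replacement string is otherwise interpreted as a group reference:

```scala
import scala.util.matching.Regex

// Hypothetical holder object for a few replace* variants.
object ReplaceSketch {
  private val Numeric = "\\d+".r

  // The function receives every match; here each number is doubled.
  def doubleNumbers(text: String): String =
    Numeric.replaceAllIn(text, (m: Regex.Match) => (m.matched.toInt * 2).toString)

  // Only the first occurrence is replaced.
  def maskFirstNumber(text: String): String =
    Numeric.replaceFirstIn(text, "?")

  // quoteReplacement makes '$' and '\' in the replacement literal.
  def replaceWithLiteral(text: String, literal: String): String =
    Numeric.replaceAllIn(text, Regex.quoteReplacement(literal))
}
```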

For the last example I tried to find a way to limit the number of matches inside the expression itself, but after some unsuccessful research I decided to use it to illustrate a way to iterate over the results:

it("should extract first 2 matched occurrences") {
  val numbersPattern = "(\\d+)".r
  val textToMatch = "3 40 10 50 5"

  val matchingIterator = numbersPattern.findAllMatchIn(textToMatch)

  import scala.collection.mutable
  val numbers = new mutable.ListBuffer[String]()
  while (matchingIterator.hasNext && numbers.size < 2) {
    numbers.append(matchingIterator.next.matched)
  }

  numbers should have size 2
  numbers should contain allOf("3", "40")
}
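
Since the find methods return an iterator, the same result can probably be obtained without a mutable buffer, by taking only the first elements of the iterator. A minimal sketch, with FirstMatchesSketch and firstMatches as hypothetical names:

```scala
// Hypothetical holder object: limiting the number of matches with Iterator.take.
object FirstMatchesSketch {
  private val Numbers = "\\d+".r

  // findAllIn returns an iterator of matched strings, consumed lazily by take.
  def firstMatches(text: String, limit: Int): List[String] =
    Numbers.findAllIn(text).take(limit).toList
}
```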

Scala is Java

All of the above is pretty Scala-istic, even though Scala internally uses Java regular expressions. We can see that delegation in a lot of places. For instance, the matcher created internally is in fact an instance of java.util.regex.Matcher:

protected[Regex] val matcher = regex.pattern.matcher(source)
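
The same delegation is visible from user code: the pattern field of a Regex exposes the underlying java.util.regex.Pattern, so the Java API can be used directly when needed. A short sketch, with JavaInteropSketch and firstNumberViaJava as hypothetical names:

```scala
import java.util.regex.Pattern

// Hypothetical holder object showing the Java objects behind a Scala Regex.
object JavaInteropSketch {
  private val ScalaRegex = "\\d+".r

  // The compiled Java pattern wrapped by the Scala Regex.
  val underlying: Pattern = ScalaRegex.pattern

  // Plain Java matching, wrapped in an Option on the Scala side.
  def firstNumberViaJava(text: String): Option[String] = {
    val matcher = underlying.matcher(text)
    if (matcher.find()) Some(matcher.group()) else None
  }
}
```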

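The use of RegEx in the context of pattern matching works, unsurprisingly, thanks to extractors. The Regex class itself defines unapplySeq, which is what a pattern built from a Regex calls, and its companion object adds the Match and Groups extractors for Match instances: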

object Match {
  def unapply(m: Match): Some[String] = Some(m.matched)
}
object Groups {
  def unapplySeq(m: Match): Option[Seq[String]] = if (m.groupCount > 0) Some(1 to m.groupCount map m.group) else None
}
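
These two extractors can be used on the result of the find*Match* methods. The sketch below is only an illustration of that usage; MatchExtractorsSketch, yearAndMonth and wholeMatch are hypothetical names:

```scala
import scala.util.matching.Regex

// Hypothetical holder object using the Match and Groups extractors.
object MatchExtractorsSketch {
  private val DateRegex = "(\\d+)-(\\d+)-(\\d+)".r

  // Groups exposes the captured groups of a Match as pattern variables.
  def yearAndMonth(text: String): Option[(String, String)] =
    DateRegex.findFirstMatchIn(text) match {
      case Some(Regex.Groups(year, month, _)) => Some((year, month))
      case _ => None
    }

  // Match exposes the whole matched text.
  def wholeMatch(text: String): Option[String] =
    DateRegex.findFirstMatchIn(text).map { case Regex.Match(matched) => matched }
}
```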

Scala's RegEx is mainly a wrapper on top of Java's RegEx, enriched with some functional features such as extractors and pattern matching. The features presented in this post fall into 2 categories: read-only (pure matching) and read-write (match + replace). The first category mainly contains pattern matching and the find* methods. The second one mainly contains the replace* methods.