Regular Expressions

A Task

We have these lines:

Cover.jpg
01 City Ruins.flac
02 Amusement Park.flac
03 A Beautiful Song.flac
04 Alien Manifestation.flac
05 The Tower.flac
06 Dependent Weakling.flac
07 Bipolar Nightmare.flac
08 Mourning.flac
09 The Sound of the End.flac
10 Weight of the World.flac

We want this:

cover.jpg
city-ruins.flac
amusement-park.flac
a-beautiful-song.flac
alien-manifestation.flac
the-tower.flac
dependent-weakling.flac
bipolar-nightmare.flac
mourning.flac
the-sound-of-the-end.flac
weight-of-the-world.flac

How do we accomplish this?

Our Options

  1. Edit each line and word by hand
  2. Use regular expressions

Very Brief Basics

A regular expression is a sequence of characters that represents a search pattern.

  • Pattern matches certain text sequences
  • Used for searching and replacing text
  • Also called regex or regexp

Let’s use regexr.com to try these out.

Search Characters

Sequences of letters are a search pattern for those letters. By default they’re case-sensitive.

Use [] to denote sets of things.

Use . to denote any character.


th searches for th:

The quick brown fox jumps over the lazy dog.

o[vwx] searches for ov, ow, and ox:

The quick brown fox jumps over the lazy dog.

o. matches o and then any single character, including whitespace (in most engines, everything except a newline):

The quick brown fox jumps over the lazy dog.
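The three patterns above can be tried outside regexr.com as well. The following is a minimal sketch using Python's re module, which is an assumption of this example rather than something the slides use; the variable names are illustrative.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog."

# Literal letters are case-sensitive: the capitalised "The" is skipped.
th_matches = re.findall(r"th", sentence)

# A set matches exactly one of the listed characters.
set_matches = re.findall(r"o[vwx]", sentence)

# A dot matches any single character, including a space.
dot_matches = re.findall(r"o.", "go on")

# Optional flag for case-insensitive matching.
th_any_case = re.findall(r"th", sentence, flags=re.IGNORECASE)

print(th_matches)   # ['th']
print(set_matches)  # ['ow', 'ox', 'ov']
print(dot_matches)  # ['o ', 'on']
print(th_any_case)  # ['Th', 'th']
```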

Ranges

o[a-z] matches o and then a character a, b, c, … z:

boa lobby location of OJ opening soyoz o0 o1 o2 o5 o9 oH

o[1-8] matches o and then a digit from 1 to 8:

boa lobby location of OJ opening soyoz o0 o1 o2 o5 o9 oH
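To see which parts of the sample line each range picks out, here is a small sketch in Python (the re module is an assumption of this example). Note that OJ, oH, o0 and o9 fall outside both ranges.

```python
import re

text = "boa lobby location of OJ opening soyoz o0 o1 o2 o5 o9 oH"

# o followed by one lowercase letter a through z.
letter_matches = re.findall(r"o[a-z]", text)

# o followed by one digit 1 through 8 (so o0 and o9 are skipped).
digit_matches = re.findall(r"o[1-8]", text)

print(letter_matches)  # ['oa', 'ob', 'oc', 'on', 'of', 'op', 'oy', 'oz']
print(digit_matches)   # ['o1', 'o2', 'o5']
```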

Multipliers

Multipliers act on the item to the left.

* matches an item zero or more times.

+ matches an item one or more times.

? matches an item zero or one times.


lo* matches l and then zero or more o:


  Are you looking at the lock or the silk?

lo+ matches l then one or more o:


  Are you looking at the lock or the silk?

lo? matches l and then an o or nothing:


  Are you looking at the lock or the silk?
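The difference between the three multipliers shows up clearly when the matches are listed side by side. A minimal sketch, assuming Python's re module:

```python
import re

text = "Are you looking at the lock or the silk?"

# l, then as many o as are available (possibly none).
star_matches = re.findall(r"lo*", text)

# l, then at least one o, so the l in "silk" no longer matches.
plus_matches = re.findall(r"lo+", text)

# l, then at most one o, so "looking" only contributes "lo".
optional_matches = re.findall(r"lo?", text)

print(star_matches)      # ['loo', 'lo', 'l']
print(plus_matches)      # ['loo', 'lo']
print(optional_matches)  # ['lo', 'lo', 'l']
```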

Escaping Metacharacters

Typically, \ is used to escape metacharacters such as ., *, or ].

\\ escapes a \.

See the difference between n. and n\. below:

n. matches n and then any character:


  An expression.

n\. matches n and then a period (.):


  An expression.
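The same comparison as a short sketch in Python (an assumption of this example). The last lines use re.escape, a standard-library helper that adds the backslashes for you when the search text comes from somewhere else.

```python
import re

text = "An expression."

# Unescaped dot: n followed by any character (a space, then a period).
any_char = re.findall(r"n.", text)

# Escaped dot: n followed by a literal period only.
literal_dot = re.findall(r"n\.", text)

# re.escape builds a pattern that matches the given text literally.
escaped = re.escape("n.")
via_escape = re.findall(escaped, text)

print(any_char)     # ['n ', 'n.']
print(literal_dot)  # ['n.']
print(via_escape)   # ['n.']
```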

Search and Replace with Regex

Most decent text editors and IDEs offer search and replace based on regular expressions.

Use () to “capture” part of the match, then refer to the captured text in the replacement as $1, $2, etc.


L(.*?)(\s.) matches "Look o", capturing "ook" as $1 and " o" (a space, then o) as $2:


  Look over there!

We tell our editor/command to replace the matched text with something such as ABC$1123$2$2 and get:


  ABCook123 o over there!
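The capture-and-replace step can be reproduced in code. This is a sketch assuming Python's re module, where group references in the replacement are written \1 or \g<1> instead of $1; the \g<1> form is used here so that the digits 123 after it are not read as part of the group number.

```python
import re

text = "Look over there!"
pattern = r"L(.*?)(\s.)"

# Inspect what the pattern matches and what each group captures.
match = re.search(pattern, text)
whole, first, second = match.group(0), match.group(1), match.group(2)

# Python spells $1 as \g<1> (or \1) in the replacement string.
replaced = re.sub(pattern, r"ABC\g<1>123\2\2", text)

print(repr(whole))   # 'Look o'
print(repr(first))   # 'ook'
print(repr(second))  # ' o'
print(replaced)      # ABCook123 o over there!
```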

Solving Our Task

  1. Use regular expressions to match certain text
  2. Replace the matched text

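We’ll write a search and replace as s/<search>/<replacement>/. A trailing g, as in s/<search>/<replacement>/g, replaces every match on a line instead of only the first.


A couple of regular expressions and replacement expressions:

  1. Remove the beginning numbers: s/^\d{2}\s//
  2. Replace spaces with dashes: s/\s/-/g
  3. Lowercase everything: regex alone can’t always do this; it depends on the implementation (some tools, such as GNU sed, accept \L in the replacement; otherwise use the editor’s or language’s lowercase function)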
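Put together, the three steps might look like the sketch below. It assumes Python's re module, and the function name to_slug is made up for this example. Python's re.sub replaces every match by default, and lowercasing is done with str.lower() rather than with a regex.

```python
import re

names = [
    "Cover.jpg",
    "01 City Ruins.flac",
    "09 The Sound of the End.flac",
    "10 Weight of the World.flac",
]

def to_slug(name: str) -> str:
    """Apply the three steps from the list above to one file name."""
    name = re.sub(r"^\d{2}\s", "", name)  # 1. drop the leading track number
    name = re.sub(r"\s", "-", name)       # 2. every space becomes a dash
    return name.lower()                   # 3. lowercase (plain string method)

slugs = [to_slug(name) for name in names]
for slug in slugs:
    print(slug)
# cover.jpg
# city-ruins.flac
# the-sound-of-the-end.flac
# weight-of-the-world.flac
```

To rename real files, each old/new pair could be passed to os.rename; that step is left out here so the sketch has no side effects.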

Other Useful Examples

Regular expressions are immediately useful in many situations:

  • Bulk renaming files (like above)
  • Renaming variables, functions, other symbols
  • Searching logs or large files in general
  • Validating string data format (like emails, passwords, datetime, etc.)
  • Web scraping

Integral to many systems:

  • Syntax highlighters
  • Compiler lexers

Example: Renaming Symbols

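Suppose we renamed the class User to Account. A regex search and replace can carry that change into other files that use the class, such as the code segment below. Both patterns leave longer names like UserService and createUserNotFoundException alone (the trailing g means replace every match):

s/User([^A-Za-z0-9])/Account$1/g or s/User(?![A-Za-z0-9])/Account/g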

@Service
@Transactional
public class UserService {

    @Autowired
    private UserRepository userRepository;

    @Autowired
    private BCryptPasswordEncoder encoder;

    ...

    public List<User> getAll() {
        return userRepository.findAll();
    }

    public User getFromId(UUID id) {
        Optional<User> found = userRepository.findById(id);
        if (!found.isPresent()) {
            throw createUserNotFoundException(id);
        }
        return found.get();
    }

    ...

}
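Both renaming patterns can be checked against a few lines of the class above. This is a sketch assuming Python's re module, with the Java lines held in a plain list of strings; Python writes the group reference as \1 instead of $1.

```python
import re

java_lines = [
    "public class UserService {",
    "    public List<User> getAll() {",
    "    public User getFromId(UUID id) {",
    "        Optional<User> found = userRepository.findById(id);",
    "            throw createUserNotFoundException(id);",
]

# Variant 1: capture the following character and put it back with \1.
with_capture = [re.sub(r"User([^A-Za-z0-9])", r"Account\1", line)
                for line in java_lines]

# Variant 2: a negative lookahead checks the next character without consuming it.
with_lookahead = [re.sub(r"User(?![A-Za-z0-9])", "Account", line)
                  for line in java_lines]

for line in with_lookahead:
    print(line)
# public class UserService {
#     public List<Account> getAll() {
#     public Account getFromId(UUID id) {
#         Optional<Account> found = userRepository.findById(id);
#             throw createUserNotFoundException(id);
```

The two variants agree on these lines. They differ when User is the very last thing in the text: the capturing version needs a following character to match, while the lookahead version does not.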

Example: Validating Timestamps

Try to write a regular expression that matches timestamps in the 24-hour format HH:mm:ss.

Valid examples:

  • 00:00:00
  • 05:09:28
  • 15:31:09
  • 23:59:59

Invalid examples:

  • 24:00:00
  • 00:60:00
  • 00:00:60

A correct response (the ^ and $ anchors make sure the whole string is a timestamp, and 2[0-3] stops the hours at 23):

^(?:[01][0-9]|2[0-3])(?::[0-5][0-9]){2}$
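A quick way to check a candidate answer is to run it over the valid and invalid examples. The sketch below assumes Python's re module and writes the hour part as [01][0-9] or 2[0-3] so that 24:00:00 is rejected; fullmatch is used so the entire string has to be a timestamp. The name is_timestamp is made up for this example.

```python
import re

# Hours 00-23, then two ":mm"-style groups limited to 00-59.
TIMESTAMP = re.compile(r"(?:[01][0-9]|2[0-3])(?::[0-5][0-9]){2}")

def is_timestamp(value: str) -> bool:
    """Return True only if the whole string is a valid HH:mm:ss timestamp."""
    return TIMESTAMP.fullmatch(value) is not None

valid = ["00:00:00", "05:09:28", "15:31:09", "23:59:59"]
invalid = ["24:00:00", "00:60:00", "00:00:60"]

print([is_timestamp(v) for v in valid])    # [True, True, True, True]
print([is_timestamp(v) for v in invalid])  # [False, False, False]
```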

Exercises

Regex Golf: alf.nu/RegexGolf

Regexone: regexone.com