Regular Expressions

A Task

We have these lines:

Cover.jpg
01 City Ruins.flac
02 Amusement Park.flac
03 A Beautiful Song.flac
04 Alien Manifestation.flac
05 The Tower.flac
06 Dependent Weakling.flac
07 Bipolar Nightmare.flac
08 Mourning.flac
09 The Sound of the End.flac
10 Weight of the World.flac

We want this:

cover.jpg
city-ruins.flac
amusement-park.flac
a-beautiful-song.flac
alien-manifestation.flac
the-tower.flac
dependent-weakling.flac
bipolar-nightmare.flac
mourning.flac
the-sound-of-the-end.flac
weight-of-the-world.flac

How do we accomplish this?

Our Options

Edit each line and word by hand
Use regular expressions

Very Brief Basics

A regular expression is a sequence of characters that represents a search pattern

Pattern matches certain text sequences
Used for searching and replacing text
Also called regex or regexp

Let’s use regexr.com to try these out.

Search Characters

Sequences of letters are a search pattern for those letters. By default they’re case-sensitive.

Use [] to denote sets of things.

Use . to denote any character.

th searches for th:

The quick brown fox jumps over the lazy dog.

o[vwx] searches for ov, ow, and ox:

The quick brown fox jumps over the lazy dog.

o. matches o and then any character (by default anything except a newline), including whitespace:

The quick brown fox jumps over the lazy dog.
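The three searches above can be reproduced outside regexr.com. The following is a minimal sketch using Python's re module, which is an assumption of this example rather than the tool used in these slides; the second sample string "go to" is made up to show a dot matching a space.

```python
import re

text = "The quick brown fox jumps over the lazy dog."

# Plain letters match themselves, case-sensitively, so "The" is skipped.
th_matches = re.findall(r"th", text)

# A set in [] matches exactly one of the listed characters.
set_matches = re.findall(r"o[vwx]", text)

# A dot matches any single character (except a newline by default).
dot_matches = re.findall(r"o.", text)

# Hypothetical extra sample: the dot also matches a space.
space_matches = re.findall(r"o.", "go to")

print(th_matches)     # ['th']
print(set_matches)    # ['ow', 'ox', 'ov']
print(dot_matches)    # ['ow', 'ox', 'ov', 'og']
print(space_matches)  # ['o ']
```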

Ranges

o[a-z] matches o and then one lowercase letter from a to z:

boa lobby location of OJ opening soyoz o0 o1 o2 o5 o9 oH

o[1-8] matches o and then a digit from 1 to 8:

boa lobby location of OJ opening soyoz o0 o1 o2 o5 o9 oH
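To check which parts of the sample line each range picks up, here is a small sketch in Python (an assumed language choice, not something the slides prescribe). Note that OJ and oH are skipped by o[a-z] because ranges are case-sensitive.

```python
import re

text = "boa lobby location of OJ opening soyoz o0 o1 o2 o5 o9 oH"

# o followed by one lowercase letter.
letter_matches = re.findall(r"o[a-z]", text)

# o followed by one digit from 1 to 8, so o0 and o9 are excluded.
digit_matches = re.findall(r"o[1-8]", text)

print(letter_matches)  # ['oa', 'ob', 'oc', 'on', 'of', 'op', 'oy', 'oz']
print(digit_matches)   # ['o1', 'o2', 'o5']
```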

Multipliers

Multipliers (more commonly called quantifiers) act on the item to their left.

* matches an item zero or more times.

+ matches an item one or more times.

? matches an item zero or one times.

lo* matches l and then zero or more o:


  Are you looking at the lock or the silk?

lo+ matches l then one or more o:


  Are you looking at the lock or the silk?

lo? matches l and then an o or nothing:


  Are you looking at the lock or the silk?
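The difference between the three multipliers is easiest to see side by side. This is a sketch in Python (an assumption of this example) run against the same sentence:

```python
import re

text = "Are you looking at the lock or the silk?"

# * : l followed by zero or more o, so the bare l in "silk" also matches.
zero_or_more = re.findall(r"lo*", text)

# + : l followed by at least one o, so "silk" no longer matches.
one_or_more = re.findall(r"lo+", text)

# ? : l followed by at most one o, so only "lo" is taken from "looking".
zero_or_one = re.findall(r"lo?", text)

print(zero_or_more)  # ['loo', 'lo', 'l']
print(one_or_more)   # ['loo', 'lo']
print(zero_or_one)   # ['lo', 'lo', 'l']
```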

Escaping Metacharacters

Typically, \ is used to escape metacharacters such as ., *, or ].

\\ escapes a \.

See the difference between n. and n\. below:

n. matches n and then any character:


  An expression.

n\. matches n and then a period (.):


  An expression.
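A short sketch of the same comparison in Python (an assumed language for illustration). Raw strings (the r prefix) keep Python from interpreting the backslashes itself; the Windows-style path is a made-up sample used only to show \\ matching a literal backslash.

```python
import re

text = "An expression."

# Unescaped dot: n followed by any character (here a space, then a period).
any_char = re.findall(r"n.", text)

# Escaped dot: n followed by a literal period only.
literal_dot = re.findall(r"n\.", text)

# \\ in the pattern matches one literal backslash (hypothetical sample path).
backslashes = re.findall(r"\\", r"C:\temp\logs")

print(any_char)          # ['n ', 'n.']
print(literal_dot)       # ['n.']
print(len(backslashes))  # 2
```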

Search and Replace with Regex

Most decent text editors and IDEs offer search and replace based on regular expressions.

Use () to “capture” text and use in the replacement with $1, $2, etc.

L(.*?)(\s.) matches “Look o”, capturing “ook” as $1 and “ o” as $2:


  Look over there!

We tell our editor/command to replace the matched text with something such as ABC$1123$2$2 and get:


  ABCook123 o over there!
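The capture-and-replace step can be sketched in Python (an assumption; editors differ in their replacement syntax). Python writes group references as \1 or \g<1> rather than $1, and \g<1> is used here because a bare \1 followed by the digits 123 would be ambiguous.

```python
import re

text = "Look over there!"
pattern = r"L(.*?)(\s.)"

# Inspect what each pair of parentheses captured.
match = re.search(pattern, text)
group_1 = match.group(1)  # lazy .*? stops as soon as \s. can match
group_2 = match.group(2)  # one whitespace character plus one more character

# Replace the whole match ("Look o") with literal text plus the captures.
result = re.sub(pattern, r"ABC\g<1>123\2\2", text)

print(repr(group_1))  # 'ook'
print(repr(group_2))  # ' o'
print(result)         # ABCook123 o over there!
```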

Solving Our Task

Use regular expressions to match certain text
Replace the matched text

We’ll write each search/replacement pair as s/<search>/<replacement>/

A couple of regular expressions and replacement expressions:

Remove the beginning numbers: s/^\d{2}\s//
Replace spaces with dashes: s/\s/-/g (the trailing g replaces every match on the line, not just the first)
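Lowercase everything: not always possible with regex alone, depending on the implementation (some offer case-conversion escapes such as \L in the replacement; others need a separate lowercasing step)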
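Put together, the three steps might look like the sketch below. It is written in Python as an assumption; the function name normalize is hypothetical, and the lowercasing is done with an ordinary string method rather than with regex. It only transforms the text lines and does not touch files on disk.

```python
import re

def normalize(name: str) -> str:
    """Apply the three steps from the slide to one line."""
    name = re.sub(r"^\d{2}\s", "", name)  # remove the leading two-digit number
    name = re.sub(r"\s", "-", name)       # re.sub replaces every match by default
    return name.lower()                   # lowercase with a plain string method

lines = [
    "Cover.jpg",
    "01 City Ruins.flac",
    "03 A Beautiful Song.flac",
    "09 The Sound of the End.flac",
    "10 Weight of the World.flac",
]

renamed = [normalize(line) for line in lines]
for new_name in renamed:
    print(new_name)
```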

Other Useful Examples

Regular expressions are immediately useful in many situations:

Bulk renaming files (like above)
Renaming variables, functions, other symbols
Searching logs or large files in general
Validating string data format (like emails, passwords, datetime, etc.)
Web scraping

Integral to many systems:

Syntax highlighters
Compiler lexers

Example: Renaming Symbols

Suppose we renamed the class User to Account. A regex search and replace can apply that rename in another file that uses the class, such as the code segment below, with either of these:

s/User([^A-Za-z0-9])/Account$1/ or s/User(?![A-Za-z0-9])/Account/

@Service
@Transactional
public class UserService {

    @Autowired
    private UserRepository userRepository;

    @Autowired
    private BCryptPasswordEncoder encoder;

    ...

    public List<User> getAll() {
        return userRepository.findAll();
    }

    public User getFromId(UUID id) {
        Optional<User> found = userRepository.findById(id);
        if (!found.isPresent()) {
            throw createUserNotFoundException(id);
        }
        return found.get();
    }

    ...

}
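To see which occurrences the two patterns touch, here is a sketch in Python (an assumption) applied to a few lines trimmed from the segment above. Both patterns leave UserService, UserRepository, userRepository, and createUserNotFoundException alone, because User is followed by a letter there or starts with a lowercase u. One difference worth knowing: the capture-group version needs a character after User, so it would miss a User at the very end of the input, while the lookahead version would not.

```python
import re

# A few lines trimmed from the Java segment, kept as plain text.
source = "\n".join([
    "public class UserService {",
    "    private UserRepository userRepository;",
    "    public List<User> getAll() {",
    "    public User getFromId(UUID id) {",
    "        Optional<User> found = userRepository.findById(id);",
    "            throw createUserNotFoundException(id);",
])

# Variant 1: capture the following character and put it back.
with_capture = re.sub(r"User([^A-Za-z0-9])", r"Account\1", source)

# Variant 2: negative lookahead, which checks the next character without consuming it.
with_lookahead = re.sub(r"User(?![A-Za-z0-9])", "Account", source)

print(with_lookahead)
```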

Example: Validating Timestamps

Try to write a regular expression that matches timestamps in the format HH:mm:ss.

Valid examples:

00:00:00
05:09:28
15:31:09
23:59:59

Invalid examples:

24:00:00
00:60:00
00:00:60

A correct response:

^(?:[01][0-9]|2[0-3])(?::[0-5][0-9]){2}$
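A quick way to check a candidate answer is to run it against the valid and invalid examples. The sketch below uses Python (an assumption) with an hour part of [01][0-9] or 2[0-3], so that 24:00:00 is rejected, and re.fullmatch so the whole string must match; the helper name is_valid_timestamp is hypothetical.

```python
import re

# Hours 00-23, then two groups of ":" plus 00-59.
TIMESTAMP = re.compile(r"(?:[01][0-9]|2[0-3])(?::[0-5][0-9]){2}")

def is_valid_timestamp(value: str) -> bool:
    # fullmatch requires the entire string to match, like wrapping in ^...$.
    return TIMESTAMP.fullmatch(value) is not None

valid = ["00:00:00", "05:09:28", "15:31:09", "23:59:59"]
invalid = ["24:00:00", "00:60:00", "00:00:60"]

for value in valid + invalid:
    print(value, is_valid_timestamp(value))
```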

Exercises

Regex Golf: alf.nu/RegexGolf

Regexone: regexone.com