
# Strings

Strings are, informally, "words in computers". More precisely, a string is an ordered sequence of characters, and strings are the building blocks that let us do everything from searching filesystems to decrypting ciphers.

# Regular Expressions

Consider a very poorly run hedge fund whose method for logging profits and losses is to enter them in a plaintext file in which its staff also keep various notes, jokes, and observations. For example, a section of the notebook might read:

    sandwiches were pretty bad today
    35648.22
    might want to buy puts on SDRL
    117.3
    2.34e9

As I said, this is a very poorly run fund.

The entries represent decimal numbers in several different ways, for example:

    1.23423
    -2342.134134
    343
    8.89e9
    1,000,000
    343.134
    4.321
    -34.123
    5.35

Suppose the fund goes belly up, and down with it goes the money of all its investors. You work for the SEC and are called in to do a forensic analysis of what went wrong at the firm. Your first task is to find the profits and losses over time by parsing the firm's note file, notes_and_stuff.txt.

Which of the following regular expressions will capture all of the profit and loss statements from the notepad?

A

    ^-?\d+(,\d+)*(.\d+(e\d+)?)?$

B

    ^-?\d+(,\d+)*(\.\d+(e\d+)?)?$

C

    -?\d,(,\d+)*(\.\d+(\d+)?)?\$

D

    [+-]\d+(,\d+)*(.\d+(e\d+)?)?%
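One way to reason about the options is to try each of them on the sample entries. The following is a minimal Python sketch, assuming each notebook entry sits on its own line (as the `^` and `$` anchors in options A and B suggest). The `lines` list and the helper `matched_lines` are illustrative names, not part of the original problem.

```python
import re

# Sample entries taken from the notebook excerpts above, one per line.
lines = [
    "sandwiches were pretty bad today",
    "35648.22",
    "might want to buy puts on SDRL",
    "117.3",
    "2.34e9",
    "-2342.134134",
    "343",
    "1,000,000",
]

# The four candidate patterns, copied verbatim from the options.
candidates = {
    "A": r"^-?\d+(,\d+)*(.\d+(e\d+)?)?$",
    "B": r"^-?\d+(,\d+)*(\.\d+(e\d+)?)?$",
    "C": r"-?\d,(,\d+)*(\.\d+(\d+)?)?\$",
    "D": r"[+-]\d+(,\d+)*(.\d+(e\d+)?)?%",
}

def matched_lines(pattern, lines):
    """Return the lines in which the pattern finds a match."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Print, for each option, the sample lines it picks out.
for label, pattern in candidates.items():
    print(label, matched_lines(pattern, lines))
```

On these samples, A and B select the same lines, so the samples alone do not separate them. They differ because the unescaped `.` in A matches any character: a hypothetical entry such as `12x34` would be accepted by A but not by B.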

Suppose you run a website that allows users to share math and science problems with a large community. Because you can't display everything about a problem in a preview (some problems are very long), you have to use a subset of the info to summarize the problems. One of the best pieces of information to show is the problem title. Suppose that your data team comes to you with a finding that problems which have repeated characters like !, ?, and @ tend to not get many clicks.

For example:

• "can u believe this?!?!?!?!?!?!?!?!"
• "l@@k at this number theory problem!!"
• "amazing fact of mathumbulus!!!!!!!!!!!!!!!!!!"
• "ki!!er physics fact5!"

You want every problem on the site to have the best possible chance of becoming popular, so you have your engineers write a regular expression to identify when an incoming title has repeated !, @, or ? characters.

Unfortunately, your titles can also have mathematical expressions written in $$\LaTeX$$, and you find that the regular expression is flagging titles like

Craft a regular expression that matches strings which have repeats of !, @, or ? outside of balanced pairs of enclosing $$\LaTeX$$ wrappers (in purple above), but leaves such repeats alone when they are part of mathematical expressions. Deploy your regex on this collection of ~20,000 titles.

How many of the titles would be flagged for removal?
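As a starting point, here is a rough Python sketch of one possible approach. It is not a single regular expression, as the problem asks for, but a two-step variant: first blank out balanced LaTeX spans, then look for repeats in what remains. Two things are assumptions here, since the text above does not pin them down: that math is wrapped in `\(` and `\)`, and that a "repeat" means a run of two or more characters drawn from `!`, `@`, and `?`. The names `LATEX_SPAN`, `REPEAT`, and `is_flagged` are made up for illustration.

```python
import re

# Assumption: math is wrapped in \( ... \); swap in other delimiters if needed.
LATEX_SPAN = re.compile(r"\\\(.*?\\\)")

# Assumption: a "repeat" is a run of two or more characters from !, @, ?.
REPEAT = re.compile(r"[!@?]{2,}")

def is_flagged(title):
    """Return True if the title has repeated !, @, or ? outside LaTeX spans."""
    # Replace each balanced span with a space so that the text on either
    # side of the span cannot fuse into an artificial repeat.
    outside = LATEX_SPAN.sub(" ", title)
    return REPEAT.search(outside) is not None

# Titles from the examples above, plus one hypothetical title with math.
print(is_flagged("l@@k at this number theory problem!!"))
print(is_flagged(r"what is \(5!!\)"))
```

An opening `\(` with no matching `\)` is left in place by this sketch, so repeats after it still count, which is one reading of "balanced pairs". Counting flagged titles would then be a matter of summing `is_flagged` over the collection.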

You have a list of addresses for your magazine's new subscribers, to whom you need to send next month's issue. However, the post office only accepts addresses that are fully written out, so you cannot provide them with any abbreviated street names. Suppose the following is a representative list of addresses:

• 7839 Billiard St.
• 200 Tabletop Ave.
• 1941 Barrethill St.
• 90210 Hollywood Boulevard
• 1783 Trafalgar Square
• 10010 Blackfriars Way

Which of the following regular expressions could be used to select the abbreviated addresses specifically from the full collection?

• A: \d\s\s+\w\s+\.
• B: \d+\s\w+\s\w+
• C: \d+\s\w+\s\w+\.
• D: \d\s\w+\s\w+\.
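To compare the options, you can run each one against the representative list. The sketch below is in Python and assumes that a pattern "selects" an address only when it matches the whole string, which is why it uses `re.fullmatch`. The helper name `select` is illustrative.

```python
import re

# The representative addresses listed above.
addresses = [
    "7839 Billiard St.",
    "200 Tabletop Ave.",
    "1941 Barrethill St.",
    "90210 Hollywood Boulevard",
    "1783 Trafalgar Square",
    "10010 Blackfriars Way",
]

# The four candidate patterns, copied verbatim from the options.
candidates = {
    "A": r"\d\s\s+\w\s+\.",
    "B": r"\d+\s\w+\s\w+",
    "C": r"\d+\s\w+\s\w+\.",
    "D": r"\d\s\w+\s\w+\.",
}

def select(pattern, addresses):
    """Return the addresses that the pattern matches in full."""
    regex = re.compile(pattern)
    return [a for a in addresses if regex.fullmatch(a)]

# Print, for each option, the addresses it selects.
for label, pattern in candidates.items():
    print(label, select(pattern, addresses))
```

The whole-string assumption matters. With `re.search` instead, a pattern may start matching partway through the string, so D could begin at the last digit of the house number and B would find a match inside every address.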