Damn, Regex isn’t hard, I just can’t remember it.
I have learned and forgotten regex about 30 times now. The following are my notes while I learn it again so I can refer back to them.
Basic Regex Matching
To match a string you can just type in the actual string you want to match, so to match cat you can just type in cat as the regex and it will match. Matching is case sensitive by default, so cat will not match Cat.
\d matches any digit from 0 to 9. The preceding backslash is the escape symbol in regex.
The . is a wildcard and matches any character including whitespace (but, in most engines, not a line break by default). So to match three chars and a 4th char which is a full stop, i.e. “htb.”, you can use ...\. escaping the last one. This matches any three characters and then a literal .
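A quick way to check these basics is to try them in a REPL. The sketch below assumes Python and its built-in re module (my choice for the examples, the patterns themselves are not Python specific), and the sample strings are made up.

```python
import re

# A literal pattern matches itself; matching is case sensitive by default.
literal = re.search(r"cat", "my cat sat")
wrong_case = re.search(r"cat", "Cat")

# \d matches a single digit, 0 through 9.
digits = re.findall(r"\d", "a0b9")

# Three wildcards followed by an escaped (literal) full stop.
dotted = re.search(r"...\.", "htb.")

print(literal.group(), wrong_case, digits, dotted.group())
```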
Inside square brackets you can match specific characters. So to match can but not fan or dan you could write [c]an, which means only c is an acceptable character in the first position. Similarly, if you want to match all of these but not pan you could write [cfd]an. The square brackets only ever match one character; they just define which ones are acceptable.
Adding the hat inside the square brackets means match any character except for these characters, so [^cfd]an would only match pan above.
Use ranges inside of square brackets: [0-6] matches the characters 0 to 6 and nothing else. [^b-x] will match anything other than the lowercase letters b to x.
\w is a special character class which is the equivalent of [A-Za-z0-9_]. It is often seen to match English language characters. It still matches one character only, but any English letter (upper or lower case), digit or underscore, excluding other special characters.
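Here is a small sketch of the bracket rules, again assuming Python's re module. re.fullmatch is used so the whole word has to fit the pattern. Note that \w also accepts an underscore, and in Python 3 it accepts non-English letters too unless you pass re.ASCII.

```python
import re

words = ["can", "fan", "dan", "pan"]

# Only c is allowed in the first position.
only_c = [w for w in words if re.fullmatch(r"[c]an", w)]

# c, f or d are allowed in the first position.
c_f_d = [w for w in words if re.fullmatch(r"[cfd]an", w)]

# Anything except c, f or d in the first position.
not_cfd = [w for w in words if re.fullmatch(r"[^cfd]an", w)]

# A range: only the characters 0 to 6.
range_hits = re.findall(r"[0-6]", "0789 456")

# \w: letters, digits and underscore, one character at a time.
word_chars = re.findall(r"\w", "a_B-9!")

print(only_c, c_f_d, not_cfd, range_hits, word_chars)
```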
Matching Multiple Characters
Use curly braces to match a repeated character, so a{3} will match a exactly 3 times.
Most regex engines also allow ranges in here, for example a{1,3} will match a between one and three times.
other examples
[wxy]{5} matches five characters which must be w x or y
.{2,6} matches between 2 and 6 of any character.
[https]{4,5} would be a crap way to match http or https but also matches hhhhh or htpsp
better way would be https? as that makes the last s optional so matches both
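To see why the bracket version is sloppy, here is a rough comparison in Python (re module assumed, test strings taken from the notes above).

```python
import re

candidates = ["http", "https", "hhhhh", "htpsp"]

# a{3} needs exactly three a's; a{1,3} accepts one to three.
exactly_three = re.fullmatch(r"a{3}", "aaa") is not None
one_to_three = [s for s in ["a", "aaa", "aaaa"] if re.fullmatch(r"a{1,3}", s)]

# Any 4 or 5 characters drawn from h, t, p, s: far too permissive.
sloppy = [s for s in candidates if re.fullmatch(r"[https]{4,5}", s)]

# Literal http with an optional trailing s.
strict = [s for s in candidates if re.fullmatch(r"https?", s)]

print(exactly_three, one_to_three, sloppy, strict)
```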
Optional Characters
Using a ? means the preceding character is optional.
ab?c matches abc and ac as the b is optional.
To match a literal question mark in a string you can escape it with \?
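A tiny check of the optional character and the escaped question mark, assuming Python's re module. The sentence used for the second test is made up.

```python
import re

# b may appear zero or one time, so abbc is rejected.
optional_b = [s for s in ["abc", "ac", "abbc"] if re.fullmatch(r"ab?c", s)]

# \? matches a literal question mark instead of acting as "optional".
literal_q = re.search(r"done\?", "are we done?")

print(optional_b, literal_q.group())
```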
Whitespace
\s matches whitespace of all types including tabs, spaces, new lines and carriage returns.
Example to match
- 1. abc
- 2.   abc
- 3.      abc
\d\.\s+abc matches any digit with \d, then a dot which needs to be escaped as \., then one or more whitespace characters with \s+, then the abc string.
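A sketch of that pattern in Python (re module assumed). The sample lines are my assumption of what the numbered examples look like: a digit, a dot, then varying amounts of whitespace. The last one has no whitespace at all so it should be skipped.

```python
import re

# Hypothetical sample lines: spaces, a tab, many spaces, and none.
lines = ["1. abc", "2.\tabc", "3.      abc", "4.abc"]

pattern = re.compile(r"\d\.\s+abc")

# \s+ needs at least one whitespace character, so "4.abc" fails.
matched = [line for line in lines if pattern.search(line)]

print(matched)
```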
Start and end of a line
Being very specific ensures no unwanted matches.
^ indicates the start of a line and $ indicates the end of a line.
Example: match the first line but not the second below
- mission successful
- mission unsuccessful
^\w+\ssuccessful matches a word, then specifically the \s for whitespace, and then successful straight afterwards.
The ^ anchors the match to the start of the line.
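To show why being specific matters, this sketch (Python re module assumed) compares a bare successful with the anchored pattern from the notes.

```python
import re

lines = ["mission successful", "mission unsuccessful"]

# The bare word also matches inside "unsuccessful".
loose = [line for line in lines if re.search(r"successful", line)]

# Anchored version: a word, one whitespace, then successful.
anchored = [line for line in lines if re.search(r"^\w+\ssuccessful", line)]

print(loose, anchored)
```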
Capturing Groups
Using () creates a capturing group which can be referenced afterwards.
It basically creates a variable from the match.
For example, to match anything that starts with IMG, then a filename made of digits, finishing with .png, you would use
^(IMG\d+\.png)$
This says the match must begin at the start of a line, then match IMG, then one or more digits, followed by an escaped . then png, and then the end of the line.
This would match IMG1.png or IMG12345675443232.png and it will extract the full filename to be used afterwards.
To capture a filename without its extension, for example
^(file\w+)\.pdf would match anything which starts with file, then one or more word characters, then .pdf, capturing the filename without the extension. Add a $ at the end if the line must also end with .pdf.
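In Python (re module assumed) the captured text comes back through group(1). The file names below are invented for the sketch.

```python
import re

# Group 1 is the whole file name because the parentheses wrap everything.
img = re.search(r"^(IMG\d+\.png)$", "IMG12345.png")

# \d+ needs at least one digit, so this one fails.
no_digits = re.search(r"^(IMG\d+\.png)$", "IMG.png")

# Here the extension sits outside the parentheses, so it is not captured.
pdf = re.search(r"^(file\w+)\.pdf", "file_record_transcript.pdf")

print(img.group(1), no_digits, pdf.group(1))
```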
Capturing a month and year example
Jan 1987
May 1969
Aug 2011
Match one or more word characters, then a whitespace, then one or more digits.
Use capturing on both groups so you get the full date and just the year.
(\w+\s(\d+))
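Nested groups are numbered by their opening bracket, so the outer group is 1 and the inner one is 2. A short Python sketch (re module assumed) using the three dates above:

```python
import re

dates = ["Jan 1987", "May 1969", "Aug 2011"]
pattern = re.compile(r"(\w+\s(\d+))")

results = []
for text in dates:
    m = pattern.search(text)
    # group(1) is the full date, group(2) is just the year.
    results.append((m.group(1), m.group(2)))

print(results)
```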
Non-capturing groups are indicated with a ?: at the start.
So (https?|ftp)://www\.(\w+)\.com would match and capture the protocol and the domain.
Adding a ?: to the start, (?:https?|ftp), still matches but is non-capturing as I don’t want the actual info. It’s only got parentheses because it contains an or statement.
[^/\r\n]
This matches any single character that is not a forward slash or a line break.
So in the case of a full URL, (https?|ftp)://([^/\r\n]+)(/[^\r\n]*)? captures the protocol, the host and an optional path.
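The difference shows up in what groups() returns. This is a sketch in Python (re module assumed) and the URL is a made-up example.

```python
import re

url = "https://www.example.com/path/page.html"  # illustrative URL only

# Both the protocol and the domain are captured.
capturing = re.search(r"(https?|ftp)://www\.(\w+)\.com", url)

# ?: keeps the grouping for the OR but drops the protocol from the results.
non_capturing = re.search(r"(?:https?|ftp)://www\.(\w+)\.com", url)

# Protocol, host (everything up to the first slash), and optional path.
full = re.search(r"(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?", url)

print(capturing.groups(), non_capturing.groups(), full.groups())
```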
Be Specific
To match one thing or another you can use a pipe for an OR statement enclosed in parentheses.
i love (cats|dogs)
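A quick check of the OR group in Python (re module assumed). The third sentence is made up to show a miss.

```python
import re

sentences = ["i love cats", "i love dogs", "i love logs"]

# The group captures whichever alternative matched.
matches = [re.fullmatch(r"i love (cats|dogs)", s) for s in sentences]
loved = [m.group(1) for m in matches if m]

print(loved)
```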
Using a * or a + to match multiple characters
This always follows a character or group, so never on its own.
\d* would match any number of digits (including none) and \d+ would match at least one digit.
a+ would match one or more a’s.
[abc]+ would match one or more of the a, b or c characters.
.* would match zero or more of any character. Not sure yet why this would be a thing. I guess it is for characters that could appear rather than those that definitely do appear.
To match aaaabcc, aabbbbc and aacc you could use a+b*c+ as there is at least 1 a so use a +, same with c, but in one case there are 0 b’s so you need to use a star to match 0 or more. [abc]+c works too, although it is looser and would also match strings like cc.
\d+ matches 1 or more digits.
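The star versus plus difference is easiest to see on an empty string. A sketch in Python (re module assumed); the cc and start/end strings are my own additions to show the edge cases.

```python
import re

targets = ["aaaabcc", "aabbbbc", "aacc"]

# One or more a, zero or more b, one or more c.
strict_ok = all(re.fullmatch(r"a+b*c+", t) for t in targets)

# The bracket version also passes, but it accepts things like "cc" too.
loose_ok = all(re.fullmatch(r"[abc]+c", t) for t in targets)
loose_cc = re.fullmatch(r"[abc]+c", "cc") is not None
strict_cc = re.fullmatch(r"a+b*c+", "cc") is not None

# * accepts nothing at all, + insists on at least one.
star_empty = re.fullmatch(r"\d*", "") is not None
plus_empty = re.fullmatch(r"\d+", "") is not None

# .* soaks up whatever sits between two known pieces, even nothing.
between = [bool(re.fullmatch(r"start.*end", s)) for s in ["startend", "start middle end"]]

print(strict_ok, loose_ok, loose_cc, strict_cc, star_empty, plus_empty, between)
```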
Characters which match multiple things
. matches any character including whitespace (but not a line break by default)
\w matches A-Z, a-z, 0-9 and underscore
\d matches any digit 0-9
Meta Characters
\d matches digits
\s matches whitespace
\w matches any English language letter or digit, upper and lower case, plus underscore
Upper case versions mean the opposite so
\D means anything except for digits
\W means any non alphanumeric character
\S means any non whitespace character
\b matches the boundary between a word and a non word character
\w+\b for example captures the rest of a word up to the next non word character, such as whitespace or punctuation
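A sketch of the upper case versions and \b in Python (re module assumed). The sample text is invented.

```python
import re

text = "Call 555-1234 now!"

# \D: strip everything that is not a digit.
only_digits = re.sub(r"\D", "", text)

# \W: every character that is not a letter, digit or underscore.
non_word = re.findall(r"\W", text)

# \S+: runs of non whitespace, i.e. split on whitespace.
chunks = re.findall(r"\S+", text)

# \b: cat as a whole word only, so the cat inside concat is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# \w+\b stops at the first non word character, here the hyphen.
first_word = re.search(r"\w+\b", "hello-world").group()

print(only_digits, non_word, chunks, whole_words, first_word)
```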
Specific Examples
US phone numbers
To grab the area code from the phone numbers, we can simply capture the first three digits, using the expression (\d{3}).
However, to match the full phone number as well, we can use the expression 1?[\s-]?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}. This breaks down into the country code ‘1?’, the captured area code ‘\(?(\d{3})\)?’, and the rest of the digits ‘\d{3}’ and ‘\d{4}’ respectively. We use ‘[\s-]?’ to catch the space or dashes between each component.
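Here is the full expression tried against a few formats, as a sketch in Python (re module assumed). The phone numbers are made up for the test.

```python
import re

phone = re.compile(r"1?[\s-]?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}")

# Hypothetical numbers covering dashes, brackets, spaces, no separators
# and a leading country code.
numbers = [
    "415-555-1234",
    "(416)555-3456",
    "202 555 4567",
    "4035555678",
    "1 416 555 9292",
]

# group(1) is the captured area code.
area_codes = [phone.search(n).group(1) for n in numbers]

print(area_codes)
```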
Matching HTML
- Don’t do it. Use a proper parsing library as HTML is not consistent enough.
- <(\w+) will capture the tag name of an opening tag
- >([\w\s]*)< matches the text content of tags. Useful for a hrefs?
- href='([\w://.]*)' finds the link target
- ='([\w://.]*)' finds any attribute value
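For quick and dirty jobs the patterns above behave like this. It is a sketch in Python (re module assumed) on a made-up snippet with single-quoted attributes; real HTML will break it quickly, which is why a parser is the better tool.

```python
import re

html = "<a href='https://example.com/page'>Click here</a>"  # illustrative snippet

# Tag name straight after the opening <.
tag = re.search(r"<(\w+)", html).group(1)

# Text between > and <, limited to word characters and whitespace.
content = re.search(r">([\w\s]*)<", html).group(1)

# Link target inside single quotes.
href = re.search(r"href='([\w://.]*)'", html).group(1)

# Every single-quoted attribute value.
values = re.findall(r"='([\w://.]*)'", html)

print(tag, content, href, values)
```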