Basic Regular Expressions
To begin, consider two symbols that are an integral part of regular expressions: '^' and '$'.
'^' marks the start of a string, whereas '$' marks the end of a string.
For instance,
- "^One" : Matches any string that starts with "One".
- "in million$" : Matches a string that ends in the substring "in million".
- "^abc$" : Matches only the string "abc" itself, because the pattern must both start and end the string.
- "eureka" : Matches a string that has the text "eureka" in it. Notice that if you use neither of the two symbols mentioned above, the pattern may occur anywhere inside the string you are searching.
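The anchor examples above can be tried out with a short sketch. It assumes Python's standard `re` module, and the helper name `matches` is made up for illustration.

```python
import re

def matches(pattern, text):
    # re.search scans the whole string, so an unanchored
    # pattern may match anywhere inside it.
    return re.search(pattern, text) is not None

print(matches(r"^One", "One in million"))         # True
print(matches(r"^One", "Not One"))                # False
print(matches(r"in million$", "One in million"))  # True
print(matches(r"^abc$", "abc"))                   # True
print(matches(r"^abc$", "abcabc"))                # False
print(matches(r"eureka", "I shouted eureka!"))    # True
```

One Python-specific detail: by default '$' also matches just before a trailing newline at the end of the string.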
There are a few more characters, "?", "*" and "+", which indicate the number of times the preceding character (or group of characters) may occur. They mean “zero or one”, “zero or more” and “one or more” respectively.
For instance,
- “xy*” : Matches a string that has an “x” followed by zero or more “y”s.
e.g. “x”, “xy”, “xyy” etc.
- “xy+” : Matches a string that has an “x” followed by at least one “y”.
e.g. “xy”, “xyy”, “xyyy” etc.
- “xy?” : Matches a string that has an “x” and there might be a “y” or not.
- “x?y+$” : Matches a possible “x” followed by one or more “y”s ending a string.
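To see the three quantifiers side by side, here is a small sketch in Python's `re` module. The helper `fits` is a hypothetical name; it uses `re.fullmatch` so that the whole candidate string has to fit the pattern.

```python
import re

def fits(pattern, text):
    # fullmatch succeeds only if the entire string matches the pattern
    return re.fullmatch(pattern, text) is not None

candidates = ["x", "xy", "xyy"]
for pattern in (r"xy*", r"xy+", r"xy?"):
    accepted = [s for s in candidates if fits(pattern, s)]
    print(pattern, accepted)
# xy* ['x', 'xy', 'xyy']
# xy+ ['xy', 'xyy']
# xy? ['x', 'xy']

# "x?y+$" is anchored only at the end, so search is used here
print(re.search(r"x?y+$", "abcxyy") is not None)  # True
print(re.search(r"x?y+$", "yyx") is not None)     # False
```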
Next come the bounds. Bounds are written inside braces and denote a range for the number of occurrences.
- “xy{2}” : Matches a string that has an “x” followed by exactly two “y”s.
e.g. “xyy”
- “xy{2,}” : Matches a string where there are at least two “y”s.
e.g. “xyy”, “xyyyy” etc.
- “xy{3,5}” : Matches a string where “y”s are three to five.
e.g. “xyyy”, “xyyyy” or “xyyyyy”
Note that in most regex flavors the first number in a range is mandatory, so a bound can be {0,4} or {4,5} or {7,}
but never {,5}. (Some flavors, such as Python's re module, do accept {,5} as shorthand for {0,5}.)
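The bounds can be checked with a quick sketch, again assuming Python's `re` module and a made-up helper `fits`. In Python the bounds are written without spaces inside the braces.

```python
import re

def fits(pattern, text):
    # fullmatch succeeds only if the entire string matches the pattern
    return re.fullmatch(pattern, text) is not None

# exactly two "y"s
print(fits(r"xy{2}", "xyy"), fits(r"xy{2}", "xyyy"))            # True False
# at least two "y"s
print(fits(r"xy{2,}", "xyyyy"), fits(r"xy{2,}", "xy"))          # True False
# three to five "y"s
print(fits(r"xy{3,5}", "xyyyyy"), fits(r"xy{3,5}", "xyyyyyy"))  # True False
```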
To quantify a sequence of characters, put the sequence inside parentheses.
- “x(yz)*” : Matches a string that has “x” followed by zero or more copies of the sequence “yz”.
- “x(yz){1,5}” : Matches a string that has “x” followed by one through five copies of “yz”.
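A short sketch of quantified groups, assuming Python's `re` module; `fits` is a hypothetical helper that requires the whole string to match.

```python
import re

def fits(pattern, text):
    # fullmatch succeeds only if the entire string matches the pattern
    return re.fullmatch(pattern, text) is not None

# the quantifier applies to the whole group "yz", not just to "z"
print(fits(r"x(yz)*", "x"))       # True
print(fits(r"x(yz)*", "xyzyz"))   # True
print(fits(r"x(yz)*", "xyzy"))    # False

print(fits(r"x(yz){1,5}", "x"))             # False
print(fits(r"x(yz){1,5}", "x" + "yz" * 5))  # True
print(fits(r"x(yz){1,5}", "x" + "yz" * 6))  # False
```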
There’s also a “|” symbol, which works as an OR operator.
- “hi|hello” : Matches a string that has either “hi” OR “hello” in it.
- “(b|cd)ef” : Matches a string that has either “bef” OR “cdef”.
- “(a|b)*c” : Matches a string that has any mix of “a”s and “b”s (possibly none), in any order, followed by a “c”.
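The OR operator can be explored with the sketch below, assuming Python's `re` module. The helper names `found` and `fits` are made up for illustration.

```python
import re

def found(pattern, text):
    # search: the pattern may occur anywhere in the string
    return re.search(pattern, text) is not None

def fits(pattern, text):
    # fullmatch: the whole string must match the pattern
    return re.fullmatch(pattern, text) is not None

print(found(r"hi|hello", "say hello"))  # True
print(found(r"(b|cd)ef", "cdef"))       # True
print(found(r"(b|cd)ef", "def"))        # False

# (a|b)* accepts any mix of "a" and "b", including none at all
print(fits(r"(a|b)*c", "aabbc"))        # True
print(fits(r"(a|b)*c", "c"))            # True
print(fits(r"(a|b)*c", "abd"))          # False
```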
A period “.” stands for any single character.
- “a.[0-9]” : Matches a string that has an “a” followed by any one character and then a digit.
- “^.{3}$” : Matches a string with exactly three characters.
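A brief sketch of the period, assuming Python's `re` module and a hypothetical helper `found`. Note that in Python the period does not match a newline unless the `re.DOTALL` flag is given.

```python
import re

def found(pattern, text):
    # search: the pattern may occur anywhere in the string
    return re.search(pattern, text) is not None

print(found(r"a.[0-9]", "ab7"))  # True: "b" fills the single-character slot
print(found(r"a.[0-9]", "a7"))   # False: nothing is left for the digit

print(found(r"^.{3}$", "abc"))   # True
print(found(r"^.{3}$", "abcd"))  # False
```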
Bracket expressions specify which characters are allowed in a single position of a string.
- “[ab]” : Matches a string that has either an “a” OR a “b” (same as “a|b”)
- “[a-d]” : Matches a string that has any one of the lowercase letters “a” through “d” (same as “a|b|c|d”)
- “^[a-zA-Z]” : Matches a string that starts with a letter, uppercase or lowercase.
- “[0-9]%” : Matches a string that has a digit immediately before a percent sign.
- “, [a-zA-Z0-9]$” : Matches a string that ends in a comma and a space followed by an alphanumeric character.
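The bracket expressions above can be sketched as follows, assuming Python's `re` module; `found` is a made-up helper name. The last pattern is used as written, with a space after the comma.

```python
import re

def found(pattern, text):
    # search: the pattern may occur anywhere in the string
    return re.search(pattern, text) is not None

print(found(r"[ab]", "cab"))              # True
print(found(r"[a-d]", "xyz"))             # False
print(found(r"^[a-zA-Z]", "Hello"))       # True
print(found(r"^[a-zA-Z]", "9lives"))      # False
print(found(r"[0-9]%", "up 5% today"))    # True
print(found(r"[0-9]%", "five%"))          # False
print(found(r", [a-zA-Z0-9]$", "a, b"))   # True
print(found(r", [a-zA-Z0-9]$", "a,b"))    # False: no space after the comma
```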
You can also list which characters you don’t want. Just use a “^” as the first symbol inside the brackets.
For instance, “%[^a-zA-Z]%” matches a string that has a single character that is not a letter between two percent signs.
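A small sketch of a negated bracket expression, assuming Python's `re` module and a hypothetical helper `found`:

```python
import re

def found(pattern, text):
    # search: the pattern may occur anywhere in the string
    return re.search(pattern, text) is not None

print(found(r"%[^a-zA-Z]%", "%1%"))   # True: "1" is not a letter
print(found(r"%[^a-zA-Z]%", "% %"))   # True: a space is not a letter either
print(found(r"%[^a-zA-Z]%", "%a%"))   # False: "a" is a letter
print(found(r"%[^a-zA-Z]%", "%12%"))  # False: two characters between the signs
```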
This concludes regular expressions in a nutshell. Stay tuned for further posts.
Ciao !!
Comments
Anonymous (not verified)
Wed, 01/24/2018 - 16:07
Great post for basic learners
Awesome ! Thanks Sandip for this descriptive explanation in nutshell. Keep it up !!