Basic Regular Expressions

First and foremost, let's look at two symbols that are an integral part of regular expressions: '^' and '$'.

'^' indicates the start of a string, whereas '$' indicates the end of a string.

For instance,

  1. "^One" : Matches any string that starts with "One".
  2. "in million$" : Matches a string that ends in the substring "in million".
  3. "^abc$" : Matches a string that starts and ends with "abc", i.e. the string is exactly "abc".
  4. "eureka" : Matches a string that has the text "eureka" in it.  Notice that if you use neither of the two symbols mentioned above, the pattern may occur anywhere inside the string you are searching.
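
To try these four patterns yourself, here is a small sketch. It assumes Python's re module (this post isn't tied to any one engine), and the helper name `matches` and the sample strings are made up for illustration.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches somewhere in text."""
    # re.search scans the whole string, so only the anchors
    # restrict where the match may sit.
    return re.search(pattern, text) is not None

starts_with_one = matches(r"^One", "One in a million")      # True
ends_in_million = matches(r"in million$", "one in million")  # True
exactly_abc = matches(r"^abc$", "abc")                       # True
abc_twice = matches(r"^abc$", "abcabc")                      # False
unanchored = matches(r"eureka", "I shouted eureka twice")    # True
```

Note how "^abc$" rejects "abcabc": with both anchors in place, nothing else may sit between the start and the end of the string.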

There are a few more characters, "?", "*" and "+", which indicate the number of times a character (or group of characters) may occur.  They mean “zero or one”, “zero or more” and “one or more” respectively.

For instance,

  1. “xy*”  :  Matches a string that has an “x” followed by zero or more “y”s.
    e.g. “x”, “xy”, “xyy” etc.
  2. “xy+”  :  Matches a string that has an “x” followed by at least one “y”.
    e.g. “xy”, “xyy”, “xyyy” etc.
  3. “xy?”  :  Matches a string that has an “x” optionally followed by a single “y”.
  4. “x?y+$” : Matches a possible “x” followed by one or more “y”s ending a string.
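
Here is a quick way to see the three quantifiers side by side. This is only a sketch, assuming Python's re module; the sample strings are invented. It uses re.fullmatch for the first three patterns so that the whole string has to fit, which makes the difference between "*", "+" and "?" easy to spot.

```python
import re

samples = ["x", "xy", "xyy"]

# fullmatch requires the entire string to fit the pattern.
star = [s for s in samples if re.fullmatch(r"xy*", s)]      # ['x', 'xy', 'xyy']
plus = [s for s in samples if re.fullmatch(r"xy+", s)]      # ['xy', 'xyy']
optional = [s for s in samples if re.fullmatch(r"xy?", s)]  # ['x', 'xy']

# With search and "$", the run of y's (and the optional x before it)
# only has to sit at the end of the string.
tail = [s for s in ["xyy", "abyy", "xyz"] if re.search(r"x?y+$", s)]  # ['xyy', 'abyy']
```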

Now, about bounds. Bounds go inside braces and denote a range for the number of occurrences.

  1. “xy{2}” : Matches a string that has an “x” followed by exactly two “y”s.
    e.g. “xyy”
  2. “xy{2,}”  :  Matches a string that has an “x” followed by at least two “y”s.
    e.g. “xyy”, “xyyyy” etc.
  3. “xy{3,5}”  :  Matches a string that has an “x” followed by three to five “y”s.
    e.g. “xyyy”, “xyyyy” or “xyyyyy”

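Make a note: the first number in a range is mandatory, so a bound can be {0,4} or {4,5} or {7,} but never {,5}. (Some engines, such as Python's re module, do accept {,5} as shorthand for {0,5}, but that is not portable.) Write bounds without spaces, since many engines do not accept whitespace inside the braces.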
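
The bounds above can be checked with a few lines of code. This is a minimal sketch assuming Python's re module; the list `candidates` is just illustrative test data.

```python
import re

# One 'x' followed by 1, 2, 3, 5 and 6 'y's respectively.
candidates = ["xy", "xyy", "xyyy", "xyyyyy", "xyyyyyy"]

exactly_two = [s for s in candidates if re.fullmatch(r"xy{2}", s)]
# ['xyy']
at_least_two = [s for s in candidates if re.fullmatch(r"xy{2,}", s)]
# ['xyy', 'xyyy', 'xyyyyy', 'xyyyyyy']
three_to_five = [s for s in candidates if re.fullmatch(r"xy{3,5}", s)]
# ['xyyy', 'xyyyyy']
```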

To quantify a sequence of characters, put it inside parentheses.

  1. “x(yz)*” :  Matches a string that has “x” followed by zero or more copies of the sequence “yz”.
  2. “x(yz){1,5}” :  Matches one through five copies of “yz”.
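
To see what the parentheses change, here is a short sketch, again assuming Python's re module and made-up sample strings. The last line shows the same pattern without the group, where "*" applies to the "z" alone.

```python
import re

zero_or_more = [s for s in ["x", "xyz", "xyzyz", "xyy"]
                if re.fullmatch(r"x(yz)*", s)]
# ['x', 'xyz', 'xyzyz']

one_to_five = [s for s in ["x", "xyz", "x" + "yz" * 5, "x" + "yz" * 6]
               if re.fullmatch(r"x(yz){1,5}", s)]
# ['xyz', 'xyzyzyzyzyz']

# Without parentheses the quantifier binds only to the last character,
# so "xyz*" means "xy" followed by zero or more "z"s.
ungrouped_rejects_repeat = re.fullmatch(r"xyz*", "xyzyz") is None  # True
```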

There’s also a “|” symbol, which works as an OR operator.

  1. “hi|hello”  :  Matches a string that has either “hi” OR “hello” in it.
  2. “(b|cd)ef”  :  Matches a string that has either “bef” OR “cdef”.
  3. “(a|b)*c”  :  Matches a string that has any mix of “a”s and “b”s (possibly none at all) followed by a “c”.
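
The three alternation patterns can be tried out like this. It's a rough sketch assuming Python's re module, with sample strings invented for the purpose.

```python
import re

# search: "hi" or "hello" may appear anywhere in the string.
greeting = [s for s in ["hi there", "oh hello", "hey"]
            if re.search(r"hi|hello", s)]
# ['hi there', 'oh hello']

# The group limits the alternation to "b" or "cd"; "ef" is always required.
grouped = [s for s in ["bef", "cdef", "bcdef", "ef"]
           if re.fullmatch(r"(b|cd)ef", s)]
# ['bef', 'cdef']

# Any mix of a's and b's, including none at all, then a "c".
mixed = [s for s in ["c", "abbac", "aaac", "abd"]
         if re.fullmatch(r"(a|b)*c", s)]
# ['c', 'abbac', 'aaac']
```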

A period “.” stands for any single character.

  1. “a.[0-9]”  :  Matches a string that has an “a” followed by any one character and a digit.
  2. “^.{3}$”  :  Matches a string with exactly three characters.
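
A small sketch of the period in action, assuming Python's re module (where "." by default matches any character except a newline). The sample strings are only for illustration.

```python
import re

# "a", then any one character, then a digit.
a_any_digit = [s for s in ["ab7", "a-0", "a7", "abc"]
               if re.fullmatch(r"a.[0-9]", s)]
# ['ab7', 'a-0']

# Exactly three characters between the start and the end of the string.
three_chars = [s for s in ["abc", "a b", "ab", "abcd"]
               if re.search(r"^.{3}$", s)]
# ['abc', 'a b']
```

"a7" fails the first pattern because the period uses up the "7", leaving nothing for “[0-9]” to match.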

Bracket expressions specify which characters are allowed in a single position of a string.

  1. “[ab]”  :  Matches a string that has either an “a” OR a “b” (same as “a|b”)
  2. “[a-d]” :  Matches a string that has any of the lowercase letters “a” through “d” (same as “a|b|c|d”)
  3. “^[a-zA-Z]”  :  Matches a string that starts with a letter, upper- or lowercase.
  4. “[0-9]%”  :  Matches a string that has a single digit before a percent sign.
  5. “, [a-zA-Z0-9]$”  :  Matches a string that ends in a comma followed by an alphanumeric character.
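
Here is how a few of these bracket expressions behave in practice. This is just a sketch assuming Python's re module, and the sample strings are invented.

```python
import re

# At least one "a" or "b" somewhere in the string.
either = [s for s in ["cat", "bed", "dog"] if re.search(r"[ab]", s)]
# ['cat', 'bed']

# First character must be a letter of either case.
starts_with_letter = [s for s in ["Hello", "hello", "9lives"]
                      if re.search(r"^[a-zA-Z]", s)]
# ['Hello', 'hello']

# A digit immediately before a percent sign.
digit_percent = [s for s in ["50% off", "% off", "five%"]
                 if re.search(r"[0-9]%", s)]
# ['50% off']

# Comma, space, then one alphanumeric character at the very end.
comma_end = [s for s in ["a, b", "a, ", "a,b"]
             if re.search(r", [a-zA-Z0-9]$", s)]
# ['a, b']
```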

You can also list the characters you don’t want.  Just use a “^” as the first symbol inside the brackets.
For instance, “%[^a-zA-Z]%” matches a string with a character that is not a letter between two percent signs.
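
The negated bracket can be checked with a short sketch like the one below, assuming Python's re module. The names `pattern` and `results` and the sample strings are illustrative only.

```python
import re

pattern = re.compile(r"%[^a-zA-Z]%")

# Map each sample to whether the pattern is found in it.
results = {s: bool(pattern.search(s)) for s in ["%5%", "% %", "%a%", "%%"]}
# {'%5%': True, '% %': True, '%a%': False, '%%': False}
```

Note that "%%" does not match: a negated bracket still has to consume exactly one character, so there must be something between the two percent signs.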

This concludes regular expressions in a nutshell.  Stay tuned for further posts.

Ciao !!

Comments

Awesome ! Thanks Sandip for this descriptive explanation in nutshell. Keep it up !!
