Everything about regular expressions

Preet Sharma
8 min read · Nov 4, 2020

Whether for a search function, string manipulation, or feature extraction, the regular expression is a very powerful language with a short vocabulary that is simple to understand. Almost every programming language provides regular expression support, either in its standard library or built into the language itself. It plays a major role in text analytics.

What is a regular expression? A simple definition: it is a sequence of characters that defines a pattern used for string search operations. A pattern is a fixed structure of characters, such as the # in Twitter hashtags or the @ in email addresses. In other words, a regular expression can work on any text that contains some kind of pattern. This article refers to Python's regular expression package, which is available as ‘re’.

To use the ‘re’ package in Python, import it first; after that it can be used for any string search or manipulation operation. ‘re’ provides several functions, some of which are covered in this article.

The re.search() function detects whether the given regular expression pattern is present in the given input string. It returns a match object (re.Match) for the first occurrence if the pattern is found in the string; otherwise it returns None.
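A minimal sketch of re.search() follows. The sample sentence and the variable names are assumptions made for illustration; they are not taken from the original example.

```python
import re

# Illustrative input; this sentence is an assumption made for the sketch.
text = "Regular expressions play a major role in text analytics."

# re.search() scans the string and returns a match object for the first
# occurrence of the pattern, or None when the pattern is absent.
result = re.search("major", text)

if result is not None:
    # start() is the index of the first matched character;
    # end() is the index just past the last matched character.
    print(result.group(), result.start(), result.end())

# A pattern that does not occur in the text yields None.
missing = re.search("minor", text)
print(missing)
```

Because end() points just past the match, the slice text[result.start():result.end()] gives back the matched text.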

The start() and end() methods of the returned match object give the start index of the match and the index just past its end. We use quantifiers in a regular expression to control how many times a character may repeat in the search pattern.

Let's say one of our documents contains several spellings of the word good (formal and slang), such as goooooooood and goooooddddddd, or maybe more. All of them represent the word good. Whether we want to treat all of these words as good or exclude them, we can use the different types of quantifiers in a regex, listed below:

· The ‘?’ operator

· The ‘*’ operator

· The ‘+’ operator

· The {m,n} operator

The ‘?’ operator is used in a pattern where the preceding character is optional (present zero times or one time). For example, to match both ‘car’ and ‘cars’, we use re.search('cars?', <input text>).
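A short sketch of the ‘?’ operator. The word list is an assumption chosen to show one miss alongside the two hits.

```python
import re

# 's?' makes the trailing 's' optional: zero or one occurrence.
pattern = "cars?"

for word in ["car", "cars", "cat"]:
    match = re.search(pattern, word)
    # Print the matched text, or None when the pattern is absent.
    print(word, "->", match.group() if match else None)
```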

The ‘*’ operator is similar to ‘?’, except that it looks for the preceding character to be present zero or more times. The pattern ‘ne*’ matches with zero occurrences of ‘e’ (the ‘n’ in ‘none’), with one occurrence (‘ne’ in ‘new’), and with more than one occurrence (‘nee’ in ‘need’). In simple words, ‘*’ matches the preceding character any number of times.
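The three cases can be checked with a small sketch. The dictionary name is an assumption; the words are the ones used in the text.

```python
import re

# 'n' followed by zero or more 'e' characters.
pattern = "ne*"

# re.search() returns the leftmost match, so in 'none' it stops at the
# first 'n', which is followed by zero 'e' characters.
matches = {word: re.search(pattern, word).group()
           for word in ["none", "new", "need"]}
print(matches)
```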

Both of the quantifiers above treat the preceding character as OPTIONAL.

The ‘+’ operator is considered mandatory: the preceding character has to be present at least once in the text for the pattern to match. For example, the mandatory pattern ‘ne+’ fails for the string ‘name’, whereas the optional pattern ‘ne*’ matches the same string (it matches the ‘n’ alone).
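A minimal comparison of the optional and mandatory forms on the same strings. The variable names are illustrative.

```python
import re

# 'ne*' accepts zero 'e' characters, so the lone 'n' in 'name' matches.
star = re.search("ne*", "name")

# 'ne+' requires at least one 'e' right after 'n', so 'name' does not match.
plus = re.search("ne+", "name")

# With one or more 'e' characters present, 'ne+' matches as expected.
plus_need = re.search("ne+", "need")

print(star.group(), plus, plus_need.group())
```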

· The ‘?’ quantifier is OPTIONAL: zero or one occurrence of the preceding character

· The ‘*’ quantifier is OPTIONAL: zero or more occurrences of the preceding character

· The ‘+’ quantifier is MANDATORY: one or more occurrences of the preceding character

These quantifiers work well for simple search operations, but they fall short in more complex operations where the count of a specific character matters. In our previous example the pattern ‘ne+’ matches both ‘new’ and ‘need’, and it also matches misspelled words such as ‘neeeed’. When we know how many times a character should occur, we add a fixed count or a range to the pattern with the {m,n} operator: ‘{m}’ matches exactly m occurrences, ‘{m,n}’ matches from m to n occurrences, ‘{m,}’ matches m or more, and ‘{,n}’ matches at most n.
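The counted forms can be sketched as follows. The ‘good’ variants echo the earlier example; the specific counts and the use of re.fullmatch() to test whole words are assumptions chosen for illustration.

```python
import re

words = ["god", "good", "goood", "goooood"]

# Each pattern constrains how many 'o' characters may appear.
patterns = {
    "go{2}d": "exactly two 'o'",
    "go{2,3}d": "two to three 'o'",
    "go{3,}d": "three or more 'o'",
    "go{,2}d": "at most two 'o'",
}

# re.fullmatch() succeeds only when the whole word fits the pattern.
results = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}

for pattern, meaning in patterns.items():
    print(pattern, "|", meaning, "|", results[pattern])
```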

These are basic expressions for simple use cases. For complex and NLP-based applications we use more comprehensive regular expressions that operate on whitespace, alphabetic, numeric, and alphanumeric characters, and on combinations of them, along with some logical rules.

Text analytics is not as simple as it seems, so single-character patterns often do not meet our requirements. We need to create patterns with multiple characters, and this is where grouping comes in. Grouping means enclosing multiple characters in parentheses so that a quantifier applies to them as a unit. The regex engine then considers the whole group for the pattern search rather than looking for occurrences of a single character.
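A small sketch of how parentheses change what a quantifier applies to. The string ‘hahaha’ is an assumption picked to make the difference visible.

```python
import re

# Without parentheses, '+' applies only to the last character, 'a'.
ungrouped = re.search("ha+", "hahaha")

# With parentheses, '+' applies to the whole group 'ha'.
grouped = re.search("(ha)+", "hahaha")

print(ungrouped.group())  # ha
print(grouped.group())    # hahaha
```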

By combining multiple quantifiers we can create a pattern that can search fairly complex text. Sometimes we need logical conditions that involve multiple patterns. We don't need to write a separate regex for each one; the pipe operator (‘|’) handles this. The pipe operator acts as an OR condition, like the one we generally use in a logical if condition.
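A minimal sketch of the pipe operator, using re.findall() to collect every alternative that occurs. The sentence and the word choices are assumptions.

```python
import re

text = "The cat chased the bird, not the dog or the fox."

# Match any one of the three alternatives.
pattern = "cat|dog|bird"

# re.findall() returns every non-overlapping match, in order of appearance.
found = re.findall(pattern, text)
print(found)  # ['cat', 'bird', 'dog']
```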

We can use this like a list of alternatives to filter text from a string using a single pattern. Another simple and useful example is validating a two-digit country code, which is easy to do with a regex: the pattern ‘(0|1|2|3|4|5|6|7|8|9){2}’ looks for two consecutive digits, each between 0 and 9. When a special character itself has to be part of the pattern, we need an escape sequence, as in most programming syntax: a backslash placed before the special character makes it a literal character.

For example, suppose we need to search some documents for questions. We can't use ‘?’ on its own in the pattern, because it denotes zero or one occurrence of the preceding character. If we use a bare ‘?’ as the pattern, Python raises an error, whereas the escaped form ‘\?’ works well.
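The difference between the bare and the escaped question mark can be sketched like this. The sample text is an assumption.

```python
import re

text = "Is this a question? Yes, it is."

# A bare '?' has no preceding character to apply to, so the pattern
# fails to compile and re.error is raised.
try:
    re.search("?", text)
    bare_error = None
except re.error as exc:
    bare_error = exc

# Escaping with a backslash turns '?' into a literal character.
literal = re.search(r"\?", text)

print(type(bare_error).__name__, literal.start())
```

Writing the pattern as a raw string (the r prefix) keeps Python from interpreting the backslash before the regex engine sees it.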

As another example, suppose we want to extract the current score from “Sachin, a great Indian batsman who scored so many centuries, is a bit nervous to play the next ball as the score is 99(99)”. A simple regex to extract the score is ‘(0|1|2|3|4|5|6|7|8|9)+\((0|1|2|3|4|5|6|7|8|9)+\)’. Regex syntax is flexible enough that the same problem can be solved with several expressions, some simpler and some more complex, so the score can also be extracted with a shorter form.
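Both forms can be compared in a short sketch. The long pattern is the one from the text; the shorter ‘\d+\(\d+\)’ form is an assumption about what a simpler expression could look like, not necessarily the one the original example used.

```python
import re

text = ("Sachin, a great Indian batsman who scored so many centuries, is a bit "
        "nervous to play the next ball as the score is 99(99)")

# Long form from the text: one or more digits, a literal '(', digits, a literal ')'.
long_form = r"(0|1|2|3|4|5|6|7|8|9)+\((0|1|2|3|4|5|6|7|8|9)+\)"

# Assumed shorter form: '\d' is shorthand for a digit.
short_form = r"\d+\(\d+\)"

long_match = re.search(long_form, text)
short_match = re.search(short_form, text)

print(long_match.group(), short_match.group())
```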

A regex can also take indicators that modify how the expression is applied. These indicators are known as flags, and we use them to tell the regex engine, for example, whether the search is case sensitive or how to treat a multi-line document.
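Two commonly used flags from the ‘re’ module, re.IGNORECASE and re.MULTILINE, are sketched below. The two-line sample text is an assumption.

```python
import re

text = "Regex is powerful.\nregex is concise."

# By default the search is case sensitive: only the lowercase word matches.
case_sensitive = re.findall("regex", text)

# re.IGNORECASE matches regardless of letter case.
ignore_case = re.findall("regex", text, flags=re.IGNORECASE)

# By default '^' matches only at the very start of the string.
single_line = re.findall("^regex", text)

# re.MULTILINE lets '^' match at the start of every line.
multi_line = re.findall("^regex", text, flags=re.MULTILINE)

print(case_sensitive, ignore_case, single_line, multi_line)
```

Flags can be combined with the ‘|’ operator, for example re.IGNORECASE | re.MULTILINE.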

Anchors and wildcards:

Anchors are used in a pattern to match at the start or the end of a given string. They do not match anything in the middle of a long sentence. ‘^’ specifies the start of the string and ‘$’ specifies the end of the string. For example, re.search('^I have', 'I have to go now.') returns a match because the string starts with the pattern; however, re.search('^I have', 'too late for today. I have to go now.') returns None.
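A minimal sketch of both anchors. The ‘$’ example is an assumption added to complement the ‘^’ calls from the text.

```python
import re

# '^' anchors the pattern to the start of the string.
starts = re.search("^I have", "I have to go now.")

# The same words in the middle of the string do not satisfy '^'.
not_at_start = re.search("^I have", "too late for today. I have to go now.")

# '$' anchors the pattern to the end of the string; the dot is escaped
# so that it matches a literal full stop.
ends = re.search(r"now\.$", "I have to go now.")

print(starts.group(), not_at_start, ends.group())
```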

For a wildcard, we use the ‘.’ (dot) character in the regular expression. It is a placeholder for any single character (by default, any character except a newline). For example, if we need to search for a pattern that starts with any three characters, followed by two 1s and three 0s, followed by any two characters, the simple form of the expression using the wildcard is ‘.{3}1{2}0{3}.{2}’. As another example, for “any name with a length between 5 and 13 characters”, the simple wildcard expression is ‘.{5,13}’; to check the whole name rather than a part of it, the pattern can be anchored as ‘^.{5,13}$’.
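The two wildcard patterns can be tried out as follows. The sample strings are assumptions, and re.fullmatch() is used here so that the whole string has to fit the pattern.

```python
import re

# Any 3 characters, two 1s, three 0s, then any 2 characters.
code_pattern = ".{3}1{2}0{3}.{2}"
code_ok = re.fullmatch(code_pattern, "abc11000xy")
code_bad = re.fullmatch(code_pattern, "abc1100xy")  # only two 0s

# Any name between 5 and 13 characters long.
name_pattern = ".{5,13}"
name_ok = re.fullmatch(name_pattern, "Sachin")
name_short = re.fullmatch(name_pattern, "Raj")
name_long = re.fullmatch(name_pattern, "x" * 14)

print(bool(code_ok), bool(code_bad), bool(name_ok), bool(name_short), bool(name_long))
```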

So far we have used exact letters or the wildcard character in our expressions. This is not practical for complex searches across many documents. Text can contain anything: letters, numbers, alphanumeric strings, symbols, and whitespace. To handle this, regular expressions provide character sets. A set is a collection of characters written inside square brackets. It can be an explicit list of characters like ‘[abcdefg]’ or a range of characters like ‘[a-g]’. A character set can be used with or without quantifiers. Without a quantifier, the set matches only a single character; with a quantifier, it matches as many consecutive characters from the set as the quantifier allows.
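A short sketch of a character set with and without a quantifier. The sample phrase is an assumption.

```python
import re

text = "a bag of cabbage"

# Without a quantifier, each match is a single character from the set.
single = re.findall("[a-g]", text)

# With '+', each match is a run of consecutive characters from the set.
runs = re.findall("[a-g]+", text)

# The explicit list and the range describe the same set.
explicit_runs = re.findall("[abcdefg]+", text)

print(single)
print(runs)  # ['a', 'bag', 'f', 'cabbage']
```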

Just adding a quantifier changes the behavior of the search pattern for a character set. Sometimes we need to search for anything EXCEPT a specific character or group of characters. In such cases the previously used ‘^’ character is placed at the start of the character set, where it negates the set instead of acting as an anchor; ‘[^0-9]’, for example, matches any character that is not a digit. Since most text work uses the same few character sets again and again, there is a shorthand for them known as meta sequences, such as ‘\d’ for any digit, ‘\w’ for any word character (a letter, digit, or underscore), and ‘\s’ for any whitespace character.
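Negated sets and a few meta sequences are sketched below. The sample sentence is an assumption.

```python
import re

text = "Order 66 shipped on day 7!"

# '^' at the start of a set negates it: runs of characters that are not digits.
non_digits = re.findall("[^0-9]+", text)

# Meta sequences are shorthand for common character sets.
digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters
spaces = re.findall(r"\s", text)    # individual whitespace characters

print(non_digits)
print(digits, words, len(spaces))
```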

A regular expression is greedy by default. That means it tries to match as much of the string as the pattern allows. For example, the pattern ‘No{2,5}’ matches ‘Nooooo’ in the string ‘Nooooo’. To change the behavior from the default greedy to non-greedy, we add ‘?’ right after the quantifier. For example, the pattern ‘No{2,5}?’ matches only ‘Noo’ in the string ‘Nooooo’.
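Greedy and non-greedy matching on the same string, as a minimal sketch:

```python
import re

text = "Nooooo"

# Greedy: take as many 'o' characters as the quantifier allows (up to 5).
greedy = re.search("No{2,5}", text).group()

# Non-greedy: '?' after the quantifier takes as few as allowed (2).
lazy = re.search("No{2,5}?", text).group()

print(greedy)  # Nooooo
print(lazy)    # Noo
```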

Method/Operation of re function:

When we need to find and extract a sub-pattern out of a larger pattern, we group the sub-pattern using parentheses. Let's say our text string is: “I still remember 31/12/1999, the whole world was having fear of Y2K”. To extract the date from this string we define the pattern “\d{1,2}\/\d{1,2}\/\d{4}”. To pull out the day, month, and year separately, each part can be wrapped in parentheses, as in “(\d{1,2})\/(\d{1,2})\/(\d{4})”.
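A sketch of the date extraction, first as a whole and then with one group per part. The grouped pattern and the variable names are assumptions built on the pattern from the text.

```python
import re

text = "I still remember 31/12/1999, the whole world was having fear of Y2K"

# The pattern from the text: the backslash before '/' is harmless in Python
# but not required.
date = re.search(r"\d{1,2}\/\d{1,2}\/\d{4}", text)
print(date.group())  # 31/12/1999

# Parentheses capture each sub-pattern so it can be read back separately.
parts = re.search(r"(\d{1,2})\/(\d{1,2})\/(\d{4})", text)
day, month, year = parts.groups()
print(day, month, year)
```

group(0) always holds the whole match, while group(1), group(2), and so on hold the captured sub-patterns in order.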

So regular expressions can be used for many use cases, among them searching text, searching files, extracting dates, and extracting emails. Regular expressions play a major role in NLP.
