You can check out my projects, such as a chat analysis where I used regex, on my GitHub: jay6445/Data-Analytics-Projects

  1. Regular expressions are useful in many projects, including but not limited to web scraping and chat or email analysis, where we need to extract useful information from a large text file.
  2. The regex library in Python makes our lives simple: certain symbols stand for kinds of characters, and we combine them into a pattern to excavate the information in the text.
  3. import re is the library we need.
  4. re is part of the Python standard library, so it comes with every Python or Anaconda installation. There is nothing to install with pip.

Symbols

  1. \d matches a decimal digit. \D matches any character that is not a decimal digit.
  2. \s matches a whitespace character (a space, tab or newline).
  3. () forms a group of characters to capture. I will explain this in a minute.
  4. \w matches a word character (a letter, a digit or an underscore). \W matches a non-word character.
  5. * matches the preceding pattern zero or more times. For example, \w* matches a run of word characters [A-Za-z0-9_] and stops at the first non-word character, such as a space \s.
  6. [a-z] matches any one character from a to z.
  7. [^a-z] matches any one character not in the set; inside the square brackets, a leading ^ is the not symbol.
  8. + matches the preceding pattern if it appears 1 or more times.
  9. {n} matches the preceding pattern exactly n times. For example, \d{4} matches 4 digits. An alternative is \d\d\d\d
  10. {n,m} matches the preceding pattern if it appears between n and m times.
  11. Either/or: (?:String1|String2) can be used when you want to match one of several choices of strings. For example, when we are extracting time, it can be am or pm, so we can write (?:am|pm|AM|PM). The ?: makes the group non-capturing, so it is not returned as a separate group.
  12. The flag re.IGNORECASE (short form re.I, or (?i) at the start of the pattern) makes the match case-insensitive.
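
To see the symbols above in action, here is a small sketch. The sample sentence and the variable names are made up for illustration; only the re functions come from the library.

```python
import re

# Made-up sample sentence, just for illustration
text = "Room 2024 opens at 9:05 AM"

# \d{4} matches exactly four digits in a row
four_digits = re.findall(r"\d{4}", text)

# \d+ matches one or more digits, so every run of digits is returned
digit_runs = re.findall(r"\d+", text)

# [a-z]+ matches runs of lowercase letters only (the capital R is skipped)
lowercase_runs = re.findall(r"[a-z]+", text)

# \w+ matches runs of word characters; spaces and the colon break the runs
words = re.findall(r"\w+", text)

# \D is "not a digit", so replacing every \D with nothing keeps only digits
only_digits = re.sub(r"\D", "", text)

# (?:am|pm) is the either/or choice; re.IGNORECASE lets it also match AM/PM
time_pattern = r"\d{1,2}:\d{2}\s(?:am|pm)"
times_any_case = re.findall(time_pattern, text, re.IGNORECASE)
times_exact_case = re.findall(time_pattern, text)

print(four_digits)       # ['2024']
print(digit_runs)        # ['2024', '9', '05']
print(lowercase_runs)    # ['oom', 'opens', 'at']
print(words)             # ['Room', '2024', 'opens', 'at', '9', '05', 'AM']
print(only_digits)       # 2024905
print(times_any_case)    # ['9:05 AM']
print(times_exact_case)  # []
```

Note the r in front of each pattern: a raw string keeps Python from interpreting the backslashes before the regex engine sees them.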

Functions

  1. The functions used to find the pattern formed by the above symbols in a text are as follows.
  2. re.match looks for a match only at the start of the text.
  3. re.findall returns all the matching strings as a list.
  4. re.search finds the first match anywhere in the text.
  5. re.sub replaces the matching strings with a replacement string.
  6. re.split splits the text around the matching strings.
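
A quick sketch of how these five functions differ, run on the same sentence used in the next section. The variable names and the "<time>" replacement string are my own choices for illustration.

```python
import re

text = "12:34pm today is Sunday"
time_pattern = r"\d{2}:\d{2}(?:am|pm)"

# re.match only looks at the start of the text
match_at_start = re.match(time_pattern, text)   # a match object
match_sunday = re.match(r"Sunday", text)        # None, Sunday is not at the start

# re.search scans the whole text and stops at the first match
search_sunday = re.search(r"Sunday", text)

# re.findall returns every match as a list of strings
all_words = re.findall(r"\w+", text)

# re.sub replaces each match with the replacement string
masked = re.sub(time_pattern, "<time>", text)

# re.split cuts the text wherever the pattern matches
parts = re.split(r"\s", text)

print(match_at_start.group())   # 12:34pm
print(match_sunday)             # None
print(search_sunday.start())    # 17
print(all_words)                # ['12', '34pm', 'today', 'is', 'Sunday']
print(masked)                   # <time> today is Sunday
print(parts)                    # ['12:34pm', 'today', 'is', 'Sunday']
```

re.match and re.search return None when nothing matches, so it is safer to check the result before calling .group() on it.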

Consider a text

  1. 12:34pm today is Sunday
  2. We need to extract the time of the day and day of the week from this sentence.
  3. We will form the first group we need, which is the time, using (). The time is two digits, a colon and two more digits, therefore \d{2}:\d{2} goes inside the group.
  4. Now comes the am or pm. As it's not a 24-hour time, we have to include it in our group, therefore (?:am|pm) is added inside the same group; this will match either am or pm right after the digits.
  5. \s insert this wherever you see a space, like after pm in our case.
  6. 'today is ' is the part we do not need. We still have to match it to reach the day, so we write \w*\s\w*\s but do not wrap it in ().
  7. We will form our next group for the day of the week using (). As it is all word characters, we write (\w*): \w matches a word character and * repeats it until the first non-word character, therefore it will take the entire word 'Sunday'.
  8. When we combine the symbols we get this pattern (\d{2}:\d{2}(?:am|pm))\s\w*\s\w*\s(\w*)

The first group (12:34pm) and the second group (Sunday) are indicated by the green and the red portions respectively. A very useful website where you can test your patterns is https://regex101.com . The () discussed earlier form the groups we want to extract.

Usage with Python
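
Here is a minimal sketch of how the pattern built above could be used in Python. The function name extract_time_and_day and the two-line chat_log string are hypothetical, added only to show the idea.

```python
import re

# The pattern from the worked example above
PATTERN = r"(\d{2}:\d{2}(?:am|pm))\s\w*\s\w*\s(\w*)"


def extract_time_and_day(sentence):
    """Return (time, day) from a sentence, or None if the pattern is absent.

    Hypothetical helper, written only for this illustration.
    """
    match = re.search(PATTERN, sentence)
    if match is None:
        return None
    # group(1) is the time, group(2) is the day of the week
    return match.group(1), match.group(2)


single = extract_time_and_day("12:34pm today is Sunday")
print(single)  # ('12:34pm', 'Sunday')

# With two groups in the pattern, re.findall returns one tuple per match
chat_log = "12:34pm today is Sunday\n09:05am tomorrow is Monday"
pairs = re.findall(PATTERN, chat_log)
print(pairs)  # [('12:34pm', 'Sunday'), ('09:05am', 'Monday')]

# The pattern is case-sensitive, so an uppercase PM does not match
upper = extract_time_and_day("12:34PM today is Sunday")
print(upper)  # None
```

To accept AM and PM as well, one option is to pass re.IGNORECASE as a third argument to re.search and re.findall.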

Feel free to contact me for an insightful discussion.

Hi good to see y'all, I am an aspiring data analyst and will be posting stuff about Statistics, Python and R and also some interesting projects I do. B-)
