Using Regex in Python

  1. Regular expressions (regex) are useful in many projects, including but not limited to web scraping and chat or email analysis, where you need to extract information from a large body of text.
  2. Python's regex library makes this simple: special symbols stand for classes of characters, and you combine them into a pattern that pulls the information out of the text.
  3. import re is the library we need.
  4. re is part of Python's standard library, so there is nothing to install; pip install re is not needed, on Anaconda or anywhere else.

Symbols

  1. \d matches a decimal digit. \D matches any character that is not a decimal digit.
  2. \s matches a whitespace character (space, tab or newline).
  3. () groups part of a pattern and captures what it matches. The worked example further down shows how it is used.
  4. \w matches a single word character (a letter, a digit or an underscore). \W matches a non-word character.
  5. * matches the preceding pattern zero or more times. For example, \w* matches a run of word characters and stops at the first non-word character, such as a space \s.
  6. [a-z] matches one character from a to z.
  7. [^a-z] matches one character that is not in the set; inside square brackets, a leading ^ is the not symbol.
  8. + matches the preceding pattern if it appears 1 or more times.
  9. {n} matches the preceding pattern exactly n times; it is called a quantifier. For example, \d{4} matches 4 digits. The alternative is \d\d\d\d
  10. {n,m} matches the preceding pattern if it appears between n and m times.
  11. Alternation (?:String1|String2) can be used when you want to match one of several choices of strings. For example, when we are extracting a time that can end in am or pm, we can write (?:am|pm|AM|PM). The (?: ) form groups without capturing, and spaces inside it are matched literally, so do not pad the choices with spaces.
  12. The re.I (re.IGNORECASE) flag makes matching case-insensitive, so (?:am|pm) with re.I also matches AM and PM.
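
A quick way to check the symbols above is to try each one with re.findall on a short string. The sample strings below are made up for illustration, and the result shown in each comment is what the call returns for that input.

```python
import re

# Sample strings are invented for this sketch.
digits = re.findall(r"\d", "a1b22")            # ['1', '2', '2']
non_digits = re.findall(r"\D", "a1b22")        # ['a', 'b']
spaces = re.findall(r"\s", "a b\tc")           # [' ', '\t']

# \w* takes word characters and stops at the first space.
first_word = re.match(r"\w*", "Sunday morning").group()   # 'Sunday'

lower_runs = re.findall(r"[a-z]+", "abcXYZdef")    # ['abc', 'def']
not_lower = re.findall(r"[^a-z]+", "abcXYZdef")    # ['XYZ']

years = re.findall(r"\d{4}", "in 2023 and 1999")   # ['2023', '1999']
two_or_three = re.findall(r"\d{2,3}", "7 42 123")  # ['42', '123']

# Without a flag the pattern is case-sensitive; re.I ignores case.
strict = re.findall(r"(?:am|pm)", "9am, 10PM, 11pm")         # ['am', 'pm']
relaxed = re.findall(r"(?:am|pm)", "9am, 10PM, 11pm", re.I)  # ['am', 'PM', 'pm']

print(digits, non_digits, spaces, first_word)
print(lower_runs, not_lower, years, two_or_three)
print(strict, relaxed)
```

Raw strings (the r prefix) are used so that Python does not treat the backslashes as string escapes before the pattern reaches re.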

Functions

  1. The following functions look for a pattern, built from the symbols above, in a text.
  2. re.match looks for a match only at the start of the text.
  3. re.findall returns all the non-overlapping matching strings as a list.
  4. re.search finds the first match anywhere in the text.
  5. re.sub replaces the matching strings with a replacement string.
  6. re.split splits the text around the matching strings.
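
As a minimal sketch of how these functions differ, the snippet below runs each one on the same sample sentence. The small patterns are chosen only for illustration.

```python
import re

text = "12:34pm today is Sunday"

# re.match only succeeds when the pattern matches at the start of the text.
at_start = re.match(r"\d{2}:\d{2}", text)    # match object for '12:34'
not_at_start = re.match(r"Sunday", text)     # None, 'Sunday' is not at the start

# re.search scans the text and returns the first match anywhere.
first = re.search(r"Sunday", text)           # match object for 'Sunday'

# re.findall returns every non-overlapping match as a list of strings.
letter_runs = re.findall(r"[A-Za-z]+", text) # ['pm', 'today', 'is', 'Sunday']

# re.sub replaces every match with the replacement string.
masked = re.sub(r"\d", "#", text)            # '##:##pm today is Sunday'

# re.split cuts the text wherever the pattern matches.
parts = re.split(r"\s", text)                # ['12:34pm', 'today', 'is', 'Sunday']

print(at_start.group(), not_at_start, first.group())
print(letter_runs, masked, parts)
```

re.match and re.search return a match object or None, so check the result before calling .group() on text you do not control.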

Consider a text

  1. 12:34pm today is Sunday
  2. We need to extract the time of day and the day of the week from this sentence.
  3. We form the first group we need, the time, using (). The digits are two pairs separated by a colon, so the group starts as (\d{2}:\d{2}
  4. Now comes the am or pm. As it's not 24-hour time we have to include it in our group, so we add (?:am|pm), which matches either am or pm at that position, and close the group: (\d{2}:\d{2}(?:am|pm))
  5. \s goes wherever you see a space, like after pm in our case.
  6. 'today is ' is the part we do not need, so we match each of its words with \w*\s without putting them in a group.
  7. We form our next group for the day of the week: (\w*). \w matches one word character and * repeats it until the word ends, so it takes the entire word 'Sunday'.
  8. When we combine the symbols we get this pattern (\d{2}:\d{2}(?:am|pm))\s\w*\s\w*\s(\w*)

Usage with Python
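
The original code for this section did not survive, so what follows is a minimal sketch rather than the author's exact code. It applies the pattern built above to the sample sentence with re.search and reads the two groups; the variable names are chosen here for illustration.

```python
import re

text = "12:34pm today is Sunday"

# Group 1 captures the time with its am/pm suffix, group 2 the day of the week.
# The two \w*\s in the middle skip over 'today is ' without capturing it.
pattern = r"(\d{2}:\d{2}(?:am|pm))\s\w*\s\w*\s(\w*)"

match = re.search(pattern, text)
if match:
    time_of_day, day_of_week = match.groups()
    print(time_of_day)   # 12:34pm
    print(day_of_week)   # Sunday
else:
    print("no match")

# re.findall returns one tuple per match when the pattern has several groups.
all_matches = re.findall(pattern, text)
print(all_matches)       # [('12:34pm', 'Sunday')]
```

Because (?:am|pm) is non-capturing, it does not add a third entry to match.groups(); only the two parenthesised groups are returned.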


Hi good to see y'all, I am an aspiring data analyst and will be posting stuff about Statistics, Python and R and also some interesting projects I do. B-)

Jayesh Rao
