
Regular expressions in Python

Regular expressions are available in almost every programming language. They are a very powerful tool for checking whether the contents of a variable have the expected shape. For example, if we receive a phone number, we expect the variable to contain digits and spaces (or hyphens) and nothing else. Regular expressions not only let you detect an unwanted character, they also let you remove or modify it.

The basics

We use symbols that have a special meaning:

. ^ $ * + ? {} [] \ | ()

.      Matches any character (except a newline by default).
^      Marks the beginning of the string; as the first character inside square brackets, it means "anything but" (negation).
$      Marks the end of the string.
[xy]   A set of possible characters. Example: [abc] matches a, b or c.
(x|y)  A choice between alternatives. Example: (ps|ump) matches "ps" OR "ump".
\d     A digit, equivalent to [0-9].
\D     Anything but a digit, equivalent to [^0-9].
\s     A whitespace character, equivalent to [ \t\n\r\f\v].
\S     Anything but a whitespace character, equivalent to [^ \t\n\r\f\v].
\w     An alphanumeric character or underscore, equivalent to [a-zA-Z0-9_].
\W     Anything but an alphanumeric character or underscore, equivalent to [^a-zA-Z0-9_].
\      The escape character.

In Python 3, these equivalences hold exactly for ASCII text; on Unicode strings, \d, \s and \w also match non-ASCII digits, whitespace and letters unless the re.ASCII flag is used.
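As a quick illustration of the symbols above, here is a minimal sketch written for Python 3. The helper name matches and the sample strings are made up for this example.

```python
import re

def matches(pattern, text):
    """Return True if `pattern` matches at the start of `text`."""
    return re.match(pattern, text) is not None

# . matches any single character
print(matches(r"a.c", "abc"))        # True
# ^ and $ anchor the pattern to the start and the end of the string
print(matches(r"^abc$", "abcd"))     # False
# [abc] matches one character among a, b or c; [^abc] negates the set
print(matches(r"[abc]", "b"))        # True
print(matches(r"[^abc]", "b"))       # False
# (x|y) is a choice between two sub-expressions
print(matches(r"(ps|ump)", "ump"))   # True
# \d is a digit, \s a whitespace character, \w a word character
print(matches(r"\d\s\w", "1 a"))     # True
# A backslash escapes a special character so that it is matched literally
print(matches(r"1\+1", "1+1"))       # True
```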

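It is possible to specify the number of occurrences with the following syntax. A quantifier applies to the character (or parenthesized group) immediately before it:

A{2}: the letter A (in upper case) is expected exactly 2 consecutive times.
BA{1,9}: B is expected, followed by A repeated 1 to 9 consecutive times. To repeat the whole BA segment, write (BA){1,9}.
BRA{,10}: BR is expected, followed by A either absent or repeated up to 10 consecutive times. To repeat the whole BRA segment, write (BRA){,10}.
VO{1,}: V is expected, followed by O at least once. To repeat the whole VO segment, write (VO){1,}.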
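The sketch below tries these quantifiers out in Python 3. Keep in mind that a quantifier applies to the single character just before it, unless parentheses group a longer segment. The helper name full is hypothetical; it relies on re.fullmatch so that the whole string has to match.

```python
import re

def full(pattern, text):
    """Return True if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

# {2}: exactly two consecutive "A"
print(full(r"A{2}", "AA"))          # True
print(full(r"A{2}", "AAA"))         # False

# BA{1,9} is "B" followed by one to nine "A"
print(full(r"BA{1,9}", "BAAA"))     # True
print(full(r"BA{1,9}", "BABA"))     # False

# Parentheses make the quantifier apply to the whole segment
print(full(r"(BA){1,9}", "BABA"))   # True

# {,10}: zero to ten repetitions; {1,}: at least one repetition
print(full(r"(BRA){,10}", ""))      # True
print(full(r"(VO){1,}", "VOVOVO"))  # True
```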

Symbol   Number of expected characters   Example    Possible cases
?        0 or 1                          GR(.)?S    GRS, GROS, GRIS, GRAS
+        1 or more                       GR(.)+S    GROS, GRIS, GRAS
*        0, 1 or more                    GR(.)*S    GRS, GROOS, GRIIIS, GROlivierS
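The three symbols ?, + and * can be checked with a few lines of Python 3. This is only a sketch: the variable names cases and results are made up here, and re.fullmatch is used so that each word has to match the expression entirely.

```python
import re

# Each expression is paired with words it is expected to accept
cases = {
    r"GR(.)?S": ["GRS", "GROS", "GRIS", "GRAS"],
    r"GR(.)+S": ["GROS", "GRIS", "GRAS"],
    r"GR(.)*S": ["GRS", "GRIIIS", "GROlivierS"],
}

results = []
for pattern, words in cases.items():
    for word in words:
        ok = re.fullmatch(pattern, word) is not None
        results.append((pattern, word, ok))
        print(pattern, word, ok)

# With + at least one character is required between GR and S
print(re.fullmatch(r"GR(.)+S", "GRS") is not None)  # False
```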

We will keep the theory as short as possible: programming is fun when it is concrete.

Let's treat this tutorial as a game: the goal is to predict whether an expression evaluates to TRUE (it matches) or FALSE (it does not).

The re library

Launch your Python interpreter and import the re library.

>>> import re

Then let's test an expression:

>>> print(re.match(r"GR(.)?S", "GRIS"))
<re.Match object; span=(0, 4), match='GRIS'>

If the result is not None, the string matches the expression.
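Here is a minimal Python 3 sketch of the None test described above. The function name describe and the sample words are assumptions made for this illustration.

```python
import re

def describe(word, pattern=r"GR(.)?S"):
    """Say whether `word` matches `pattern`, and which text was matched."""
    result = re.match(pattern, word)
    if result is None:
        return "%s does not match" % word
    # group(0) is the part of the string that was matched
    return "%s matches (matched text: %s)" % (word, result.group(0))

for word in ["GRIS", "GRS", "GRISE", "VERT"]:
    print(describe(word))
```

Note that re.match only checks the beginning of the string, which is why GRISE is accepted here; re.fullmatch requires the whole string to match.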

Regular expressions exercise

Let's be honest: to master regular expressions, you have to practice them.

First of all, let's work on the fundamentals:

Small exercise

Here is a small exercise in which you have to guess whether each string matches the expression or not.

EXPRESSION                      STRING
GR(.)+S                         GRIS
GR(.)?S                         GRS
GRA(.)?S                        GRAS
GAS(.)?                         GRAS
GR(A)?S                         GRAS
GR(A)?S                         GRS
M(.)+N                          MAISON
M(.)+(O)+N                      MAISON
M(.)+([a-z])+N                  MAISON
M(.)+([A-Z])+N                  MAISON
^!                              !MAISON!
!MAISON                         !MAISON!
^!MAISO!$                       !MAISON!
^!MAISON!$                      !MAISON!
^!M(.)+!$                       !MAISON!
([0-9])                         03 88 00 00 00
^0[0-9]([ ./-]?[0-9]{2}){4}     03 88 00 00 00
^0[0-9]([ ./-]?[0-9]{2}){4}     03/88/00/00/00
^0[0-9]([ ./-]?[0-9]{2}){4}     03_88_00_00_00
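Once you have made your guesses, you can verify them with a small script. The sketch below assumes the exercise is meant to be checked with re.match, as in the earlier example; the function name check is hypothetical, and only a few sample rows are listed.

```python
import re

def check(pattern, string):
    """Return "TRUE" if `pattern` matches at the start of `string`, else "FALSE"."""
    return "TRUE" if re.match(pattern, string) else "FALSE"

# A few sample rows; add the other rows of the exercise here
rows = [
    (r"GR(.)?S", "GRS"),
    (r"GR(A)?S", "GRS"),
    (r"([0-9])", "03 88 00 00 00"),
]

for pattern, string in rows:
    print(pattern, string, check(pattern, string))
```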

Search for an expression

match is very useful for validating the format of a variable, but it is also possible to search for specific expressions within a string.

>>> import re
>>> re.findall(r"([0-9]+)", "Hello 111 Goodbye 222")
['111', '222']
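When the position of each occurrence matters, re.finditer is a possible alternative to re.findall: it yields match objects instead of plain strings. A short Python 3 sketch, with variable names chosen for this example:

```python
import re

text = "Hello 111 Goodbye 222"

# findall returns the matched strings
numbers = re.findall(r"[0-9]+", text)
print(numbers)    # ['111', '222']

# finditer yields match objects, which also give the position of each match
positions = [(m.group(0), m.start(), m.end())
             for m in re.finditer(r"[0-9]+", text)]
print(positions)  # [('111', 6, 9), ('222', 18, 21)]
```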

It is also possible to search by named group:

>>> import re
>>> m = re.search(r"Welcome to (?P<chezqui>\w+)! You are (?P<age>\d+) years old", "Welcome to olivier! You are 32 years old")
>>> if m is not None:
...     print(m.group('chezqui'))
...     print(m.group('age'))
...
olivier
32
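Named groups can also be retrieved all at once with groupdict(), which returns a dictionary. The Python 3 sketch below uses an English sentence chosen for this illustration and keeps the group names chezqui and age; captured values are always strings, so the age has to be converted before doing arithmetic.

```python
import re

m = re.search(r"Welcome to (?P<chezqui>\w+)! You are (?P<age>\d+) years old",
              "Welcome to olivier! You are 32 years old")

if m is not None:
    # All named groups as a dictionary
    info = m.groupdict()
    print(info["chezqui"], info["age"])
    # Groups are also accessible by position, starting at 1
    print(m.group(1), m.group(2))
    # Captured values are strings: convert before computing
    next_age = int(info["age"]) + 1
    print(next_age)  # 33
```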

Replace an expression

To replace an expression, we use the sub() function.

>>> print(re.sub(r"Welcome to (?P<chezqui>\w+)! You are (?P<age>\d+) years old", r"\g<chezqui> is \g<age> years old", "Welcome to olivier! You are 32 years old"))
olivier is 32 years old

The replacement is applied to every match found:

>>> data = """
... olivier;engel;30years;
... bruce;wayne;45years;
... """
>>> print(re.sub(r"(?P<firstname>\w+);(?P<name>\w+);(?P<age>\w+);", r"\g<firstname>,\g<name>,\g<age>", data))

olivier,engel,30years
bruce,wayne,45years
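A few variations on sub() may be useful here: limiting the number of replacements with count, counting them with subn(), and passing a function as the replacement. This is a Python 3 sketch; the data, the pattern and the function name format_person are made up for the example.

```python
import re

data = "olivier;engel;30\nbruce;wayne;45\n"
pattern = r"(?P<firstname>\w+);(?P<name>\w+);(?P<age>\d+)"
template = r"\g<firstname>,\g<name>,\g<age>"

# count=1 limits the replacement to the first match only
first_only = re.sub(pattern, template, data, count=1)
print(first_only)

# subn also returns the number of replacements made
result, n = re.subn(pattern, template, data)
print(n)  # 2

# The replacement may be a function that receives each match object
def format_person(match):
    return "%s %s (%s)" % (match.group("firstname").capitalize(),
                           match.group("name").upper(),
                           match.group("age"))

formatted = re.sub(pattern, format_person, data)
print(formatted)
```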

Compile an expression

If you have to use the same expression several times (for example in a loop), you can compile it to improve performance.

>>> mails = ["olivier@mailbidon.com", "olivier@mailbidon.ca", "8@mailbidon.com", "@mailbidon.com", "olivier@mailbidon"]
>>> regex = re.compile(r"^[a-z0-9._-]+@[a-z0-9._-]+\.(com|fr)$")
>>> for mail in mails:
...     if regex.match(mail) is not None:
...         print("This mail: %s is valid" % mail)
...     else:
...         print("Error this mail: %s is not valid" % mail)
...
This mail: olivier@mailbidon.com is valid
Error this mail: olivier@mailbidon.ca is not valid
This mail: 8@mailbidon.com is valid
Error this mail: @mailbidon.com is not valid
Error this mail: olivier@mailbidon is not valid
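A compiled expression can also carry flags and be wrapped in a small helper. The Python 3 sketch below is deliberately simple and is not a complete email validator; the names EMAIL and is_valid, the re.IGNORECASE flag and the mixed-case address are assumptions added for illustration.

```python
import re

# Compiled once, reused for every address; IGNORECASE accepts upper case too
EMAIL = re.compile(r"^[a-z0-9._-]+@[a-z0-9._-]+\.(com|fr)$", re.IGNORECASE)

def is_valid(mail):
    """Return True if `mail` has the simplified shape name@domain.com or .fr."""
    return EMAIL.match(mail) is not None

mails = ["olivier@mailbidon.com", "Olivier@MailBidon.FR",
         "olivier@mailbidon.ca", "@mailbidon.com"]
valid = [mail for mail in mails if is_valid(mail)]
print(valid)  # ['olivier@mailbidon.com', 'Olivier@MailBidon.FR']
```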

Summary exercise: create an expression that recognizes an email

Many tutorials use the email address as an example, since it is both commonly needed by developers and fairly complex and complete.

When you start writing a regular expression, don't be too ambitious: always start small and build it brick by brick.

Let's set the scene:

# coding: utf-8

import re

string = "TEST"
regexp = r"(TEST)"

if re.match(regexp, string) is not None:
    print("TRUE")
else:
    print("FALSE")

print(re.search(regexp, string).groups())

If you run this script, TRUE and ('TEST',) will be displayed. This gives us a working base so that we don't start from scratch. The idea is to follow the evolution of our regular expression step by step.

Roughly speaking, an email address looks like this: XXXXXXX@XXXXX.COM

Let's start at the beginning and look for XXXXXXX@, which can be translated to ^[a-z0-9._-]+@

# coding: utf-8

import re

string = "olivier@mailbidon.com"
regexp = r"(^[a-z0-9._-]+@)"

if re.match(regexp, string) is not None:
    print("TRUE")
else:
    print("FALSE")

print(re.search(regexp, string).groups())

If you run this script, TRUE and ('olivier@',) will be displayed. We are on the right track! Let's continue with [a-z0-9._-]+\.(com|fr)$ and test again. Note that the extension is written as the alternation (com|fr): inside square brackets, [(com|fr)] would only be a set of single characters.

# coding: utf-8

import re

string = "olivier@mailbidon.com"
regexp = r"(^[a-z0-9._-]+@[a-z0-9._-]+\.(com|fr)$)"

if re.match(regexp, string) is not None:
    print("TRUE")
else:
    print("FALSE")

print(re.search(regexp, string).groups())

And there you have it: this displays TRUE and ('olivier@mailbidon.com', 'com'). You can now remove the outer parentheses, which only serve to capture the matched text.
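To see the difference between a set of characters in square brackets and an alternation in parentheses, here is a small Python 3 sketch. The names char_class, alternation and comparison, as well as the address ending in .mrc, are made up for this test.

```python
import re

# [(com|fr)]+ is a set: any run of the characters ( c o m | f r )
char_class = re.compile(r"^[a-z0-9._-]+@[a-z0-9._-]+\.[(com|fr)]+$")
# (com|fr) is an alternation: exactly "com" or exactly "fr"
alternation = re.compile(r"^[a-z0-9._-]+@[a-z0-9._-]+\.(com|fr)$")

mails = ["olivier@mailbidon.com", "olivier@mailbidon.mrc", "olivier@mailbidon.ca"]
comparison = [(mail,
               char_class.match(mail) is not None,
               alternation.match(mail) is not None)
              for mail in mails]

for mail, with_set, with_alternation in comparison:
    print(mail, with_set, with_alternation)
```

Only the alternation rejects the .mrc address, since m, r and c all belong to the character set.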
