Regular expressions are available in almost every programming language. They are a very powerful tool for checking whether the contents of a variable have the expected shape. For example, if we receive a phone number, we expect the variable to contain digits and spaces (or hyphens) and nothing else. Regular expressions can not only detect an unwanted character but also remove or modify it.
The basics
We use symbols that have a special meaning:
. ^ $ * + ? {} [] \ | ()
. The period matches any character (except a newline by default).
^ Marks the beginning of the string; as the first character inside square brackets it means "anything except".
$ Marks the end of the string.
[xy] A set of possible characters. Example: [abc] matches a, b or c.
(x|y) An alternative. Example: (ps|ump) matches "ps" or "ump".
\d A digit, equivalent to [0-9].
\D Anything that is not a digit, equivalent to [^0-9].
\s A whitespace character, equivalent to [ \t\n\r\f\v].
\S Anything that is not whitespace, equivalent to [^ \t\n\r\f\v].
\w An alphanumeric character or underscore, equivalent to [a-zA-Z0-9_].
\W Anything that is not alphanumeric or underscore, equivalent to [^a-zA-Z0-9_].
\ The escape character.
These equivalences hold exactly for ASCII text; with Python 3 Unicode strings, \d, \s and \w also match non-ASCII digits, spaces and letters.
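As a quick illustration of the symbols listed above, here is a minimal sketch using Python's re module. The helper `matches` and the sample strings are assumptions made for this example; `re.fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

results = {
    "digit": matches(r"\d", "7"),            # \d accepts a digit
    "non_digit": matches(r"\D", "x"),        # \D accepts anything but a digit
    "whitespace": matches(r"\s", "\t"),      # \s accepts a tab, a space, ...
    "word": matches(r"\w", "_"),             # \w accepts letters, digits and _
    "set_hit": matches(r"[abc]", "b"),       # one of the listed characters
    "set_miss": matches(r"[abc]", "d"),      # "d" is not in the set
    "alternative": matches(r"(ps|ump)", "ump"),
    "negated_set": matches(r"[^0-9]", "5"),  # ^ inside brackets negates the set
}

for name, value in results.items():
    print(name, value)
```

Only `set_miss` and `negated_set` come out False; every other check is True.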
It is possible to impose the number of occurrences with the following syntax:
A{2}: the letter A (in upper case) must appear exactly 2 consecutive times.
(BA){1,9}: the segment BA must repeat 1 to 9 consecutive times.
(BRA){,10}: the segment BRA may be absent or repeat up to 10 consecutive times.
(VO){1,}: the segment VO must be present at least once.
A quantifier applies only to the element just before it, so the parentheses are needed to repeat a whole segment: without them, BA{1,9} means a B followed by 1 to 9 A's.
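The brace syntax can be tried out with a short sketch such as the one below. It assumes Python's re module; the helper `matches` and the test strings are made up for the illustration. Note that a quantifier binds to the single element before it, so a whole segment has to be wrapped in parentheses to be repeated.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# A{2}: exactly two consecutive "A"
two_a = matches(r"A{2}", "AA")
three_a = matches(r"A{2}", "AAA")

# Without parentheses the quantifier only applies to the last character
ba_last_char = matches(r"BA{1,9}", "BAAA")   # "B" followed by three "A"
ba_segment = matches(r"BA{1,9}", "BABA")     # not a repetition of "BA"

# With parentheses the whole segment is repeated
group_segment = matches(r"(BA){1,9}", "BABA")

# {,10} allows zero occurrences, {1,} requires at least one
bra_absent = matches(r"(BRA){,10}", "")
vo_absent = matches(r"(VO){1,}", "")

print(two_a, three_a, ba_last_char, ba_segment, group_segment, bra_absent, vo_absent)
```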
Symbol | Number of occurrences | Example | Possible cases |
---|---|---|---|
? | 0 or 1 | GR(.)?S | GRS, GROS, GRIS, GRAS |
+ | 1 or more | GR(.)+S | GROS, GRIS, GRAS |
* | 0, 1 or more | GR(.)*S | GRS, GROOS, GRIIIS, GROlivierS |
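The difference between the three symbols shows up clearly when the same words are tested against each pattern. The following is only a sketch: the word list and the helper `accepted` are assumptions, and `re.fullmatch` is used so that the whole word must fit.

```python
import re

pattern_optional = r"GR(.)?S"   # 0 or 1 character between GR and S
pattern_plus = r"GR(.)+S"       # 1 or more characters
pattern_star = r"GR(.)*S"       # 0, 1 or more characters

words = ["GRS", "GRIS", "GRIIIS"]

def accepted(pattern):
    """Hypothetical helper: the sample words that fit the pattern entirely."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(accepted(pattern_optional))  # ['GRS', 'GRIS']
print(accepted(pattern_plus))      # ['GRIS', 'GRIIIS']
print(accepted(pattern_star))      # ['GRS', 'GRIS', 'GRIIIS']
```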
We will keep the theory as short as possible: programming is more fun when it is concrete.
Let's treat this tutorial as a game: the goal is simply to predict whether an expression is TRUE or FALSE.
The re library
Launch your Python interpreter and import the re library.
>>> import re
Then let's test an expression:
>>> print(re.match(r"GR(.)?S", "GRIS"))
<re.Match object; span=(0, 4), match='GRIS'>
If the result is not None, the pattern matches.
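To play the TRUE / FALSE game in a script, the test against None can be wrapped in a small function. This is a sketch under assumptions: the name `check` and the sample strings are not from the tutorial. It also shows that `re.match` only looks at the beginning of the string, whereas `re.search` scans the whole string.

```python
import re

def check(pattern, text):
    """Hypothetical helper: "TRUE" if the pattern matches at the start of text."""
    return "TRUE" if re.match(pattern, text) is not None else "FALSE"

print(check(r"GR(.)?S", "GRIS"))    # TRUE
print(check(r"GR(.)?S", "GREEN"))   # FALSE

# re.match is anchored at the beginning, re.search is not
print(check(r"RIS", "GRIS"))                   # FALSE
found_anywhere = re.search(r"RIS", "GRIS") is not None
print(found_anywhere)                          # True
```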
Regular expressions exercise
Let's be honest: to master regular expressions, you have to practice them.
First of all, let's get the fundamentals in place:
Small exercise
Here is a little exercise where you have to guess whether the pattern matches the string or not.
EXPRESSION | STRING | TRUE | FALSE | SOLUTION |
---|---|---|---|---|
GR(.)+S | GRIS | | | |
GR(.)?S | GRS | | | |
GRA(.)?S | GRAS | | | |
GAS(.)? | GRAS | | | |
GR(A)?S | GRAS | | | |
GR(A)?S | GRS | | | |
M(.)+N | MAISON | | | |
M(.)+(O)+N | MAISON | | | |
M(.)+([a-z])+N | MAISON | | | |
M(.)+([A-Z])+N | MAISON | | | |
^! | !MAISON! | | | |
!MAISON | !MAISON! | | | |
^!MAISO!$ | !MAISON! | | | |
^!MAISON!$ | !MAISON! | | | |
^!M(.)+!$ | !MAISON! | | | |
([0-9]) | 03 88 00 00 00 | | | |
^0[0-9]([ ./-]?[0-9]{2}){4} | 03 88 00 00 00 | | | |
^0[0-9]([ ./-]?[0-9]{2}){4} | 03/88/00/00/00 | | | |
^0[0-9]([ ./-]?[0-9]{2}){4} | 03_88_00_00_00 | | | |
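Once you have written down your guesses, a small script can confirm them. The sketch below is one possible way to do it, with a handful of (pattern, string) pairs modelled on the exercise; the selection and the variable names are assumptions. In the phone number pattern, the hyphen is placed last inside the brackets so that it is read as a literal character rather than as a range.

```python
import re

# (pattern, string) pairs modelled on the exercise; the selection is arbitrary
cases = [
    (r"GR(.)+S", "GRIS"),
    (r"GAS(.)?", "GRAS"),
    (r"M(.)+([a-z])+N", "MAISON"),
    (r"^!MAISO!$", "!MAISON!"),
    (r"^0[0-9]([ ./-]?[0-9]{2}){4}", "03 88 00 00 00"),
    (r"^0[0-9]([ ./-]?[0-9]{2}){4}", "03_88_00_00_00"),
]

# Same convention as the tutorial: not None means TRUE
answers = ["TRUE" if re.match(p, s) is not None else "FALSE" for p, s in cases]

for (p, s), answer in zip(cases, answers):
    print(f"{p!r:40} {s!r:20} {answer}")
```

Adding the remaining rows to `cases` is enough to check the whole table.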
Search for an expression
match is very useful for validating the contents of a variable, but it is also possible to look for specific expressions inside a string.
>>> import re
>>> re.findall("([0-9]+)", "Hello 111 Goodbye 222")
['111', '222']
It is also possible to search with named groups:
>>> import re
>>> m = re.search(r"Welcome to (?P<chezqui>\w+)! You are (?P<age>\d+) years old", "Welcome to olivier! You are 32 years old")
>>> if m is not None:
...     print(m.group('chezqui'))
...     print(m.group('age'))
...
olivier
32
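A named group can also be read by position, and all named groups can be collected at once with `groupdict()`. The following sketch assumes an English sentence and pattern chosen for the illustration; only the group names `chezqui` and `age` come from the tutorial.

```python
import re

pattern = r"Welcome to (?P<chezqui>\w+)! You are (?P<age>\d+) years old"
m = re.search(pattern, "Welcome to olivier! You are 32 years old")

if m is not None:
    by_name = m.group("chezqui")   # access by group name
    by_index = m.group(1)          # same group, accessed by position
    everything = m.groupdict()     # all named groups as a dictionary
    print(by_name, by_index, everything)
```

Captured values are always strings, so `everything["age"]` would need `int(...)` before being used as a number.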
Replace an expression
To replace an expression we use the sub() function.
>>> print(re.sub(r"Welcome to (?P<chezqui>\w+)! You are (?P<age>\d+) years old", r"\g<chezqui> is \g<age> years old", "Welcome to olivier! You are 32 years old"))
olivier is 32 years old
Expression replacement is done on all possible matches:
>>> data = """olivier;engel;30years;
... bruce;wayne;45years;"""
>>> print(re.sub(r"(?P<firstname>\w+);(?P<name>\w+);(?P<age>\w+);", r"\g<firstname>,\g<name>,\g<age>", data))
olivier,engel,30years
bruce,wayne,45years
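If replacing every match is not what you want, `re.sub` accepts a `count` argument, and `re.subn` additionally reports how many replacements were made. A minimal sketch, assuming semicolon-separated sample data written without spaces:

```python
import re

data = "olivier;engel;30years;\nbruce;wayne;45years;"
pattern = r"(?P<firstname>\w+);(?P<name>\w+);(?P<age>\w+);"
repl = r"\g<firstname>,\g<name>,\g<age>"

all_replaced = re.sub(pattern, repl, data)          # every match is replaced
first_only = re.sub(pattern, repl, data, count=1)   # only the first match
new_text, how_many = re.subn(pattern, repl, data)   # text plus number of replacements

print(all_replaced)
print(first_only)
print(how_many)
```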
Compile an expression
If you have to use the same expression several times (for example in a loop), you can compile it once to improve performance.
>>> mails = ["olivier@mailbidon.com", "olivier@mailbidon.ca", "8@mailbidon.com", "@mailbidon.com", "olivier@mailbidon"]
>>> regex = re.compile(r"^[a-z0-9._-]+@[a-z0-9._-]+\.(com|fr)$")
>>> for mail in mails:
...     if regex.match(mail) is not None:
...         print("This mail: %s is valid" % mail)
...     else:
...         print("Error, this mail: %s is not valid" % mail)
...
This mail: olivier@mailbidon.com is valid
Error, this mail: olivier@mailbidon.ca is not valid
This mail: 8@mailbidon.com is valid
Error, this mail: @mailbidon.com is not valid
Error, this mail: olivier@mailbidon is not valid
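One detail is worth checking when writing the extension part of such a pattern: square brackets define a set of characters, so [(com|fr)] means "one of the characters ( c o m | f r )", whereas (com|fr) means "com or fr". The sketch below compiles both variants once and compares what they accept; the names `GROUP_RE` and `CLASS_RE` are assumptions for the illustration.

```python
import re

# Compiled once, reused for every address
GROUP_RE = re.compile(r"^[a-z0-9._-]+@[a-z0-9._-]+\.(com|fr)$")
CLASS_RE = re.compile(r"^[a-z0-9._-]+@[a-z0-9._-]+\.[(com|fr)]+")

mails = ["olivier@mailbidon.com", "olivier@mailbidon.ca", "olivier@mailbidon"]

accepted_by_group = [mail for mail in mails if GROUP_RE.match(mail)]
accepted_by_class = [mail for mail in mails if CLASS_RE.match(mail)]

print(accepted_by_group)   # only the .com address
print(accepted_by_class)   # the .ca address slips through, since "c" is in the set
```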
Summary exercise: create an expression that recognizes an email
Many tutorials use the email address as an example, since it is both a common need for developers and a fairly complex, complete case.
When you start writing a regular expression, don't be too ambitious: always start small and build it brick by brick.
Let's set the scene:
# coding: utf-8
import re
string = "TEST"
regexp = r"(TEST)"
if re.match(regexp, string) is not None:
    print("TRUE")
else:
    print("FALSE")
print(re.search(regexp, string).groups())
If you run this script, TRUE and ('TEST',) will be displayed. This gives us a starting point instead of a blank page. The idea is to follow the evolution of our regular expression step by step.
Roughly speaking, an email address looks like this: XXXXXXX@XXXXX.COM
Let's start at the beginning and look for XXXXXXXX@, which can be translated as ^[a-z0-9._-]+@
# coding: utf-8
import re
string = "olivier@mailbidon.com"
regexp = r"(^[a-z0-9._-]+@)"
if re.match(regexp, string) is not None:
    print("TRUE")
else:
    print("FALSE")
print(re.search(regexp, string).groups())
If you run this script, TRUE and ('olivier@',) will be displayed. We are on the right track! Let's continue with [a-z0-9._-]+\.(com|fr)$ where (com|fr) means "com" or "fr", then test.
# coding: utf-8
import re
string = "olivier@mailbidon.com"
regexp = r"(^[a-z0-9._-]+@[a-z0-9._-]+\.(com|fr)$)"
if re.match(regexp, string) is not None:
    print("TRUE")
else:
    print("FALSE")
print(re.search(regexp, string).groups())

And voilà, the result should be correct. You can now remove the outer parentheses, which only serve to capture the matched text for display.
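The brick-by-brick approach can also be followed in a single script by testing each intermediate pattern in turn. This is only a sketch: the list `steps` and the loop are assumptions, and the final pattern is deliberately simplified (lowercase only, .com or .fr), so it is not a complete email validator.

```python
import re

string = "olivier@mailbidon.com"

# Each step adds one "brick" to the previous pattern
steps = [
    r"^[a-z0-9._-]+@",
    r"^[a-z0-9._-]+@[a-z0-9._-]+",
    r"^[a-z0-9._-]+@[a-z0-9._-]+\.(com|fr)$",
]

matched = []
for step in steps:
    m = re.match(step, string)
    # group(0) is the whole text consumed by the pattern so far
    matched.append(m.group(0) if m is not None else None)
    print(step, "->", matched[-1])
```

Watching how much of the string each step consumes makes it easier to spot the brick that breaks the expression.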