We saw previously how to parse XML; it is also possible to parse HTML, and the tool that does the job best, in my opinion, is the BeautifulSoup library.
Install the BeautifulSoup library
As with any Python library, pip does the job:
pip install beautifulsoup4
Retrieve the content of a specified tag
BeautifulSoup lets you, for example, retrieve all the p tags of an HTML page:
# coding: utf-8
from bs4 import BeautifulSoup

html_doc = """
<html>
 <head>
  <title>Your site title</title>
 </head>
 <body>
  <p>Text to read 1</p>
  <p>Text to read 2</p>
 </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
for p in soup.find_all('p'):
    print(p)
Which will return:
<p>Text to read 1</p>
<p>Text to read 2</p>
Change the content of tags
Finding the elements that interest us is one thing, but being able to modify them is even better!
# coding: utf-8
from bs4 import BeautifulSoup

html_doc = """
<html>
 <head>
  <title>Your site title</title>
 </head>
 <body>
  <p>Text to read 1</p>
  <p>Text to read 2</p>
 </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
for p in soup.find_all('p'):
    p.string = "New text"
print(soup)
Result:
<html>
 <head>
  <title>Your site title</title>
 </head>
 <body>
  <p>New text</p>
  <p>New text</p>
 </body>
</html>
Replace tags
You can replace tags with the replace_with method:
# coding: utf-8
from bs4 import BeautifulSoup

html_doc = """
<html>
 <head>
  <title>Your site title</title>
 </head>
 <body>
  <p>Text to read 1</p>
  <p>Text to read 2</p>
 </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
for p in soup.find_all('p'):
    n = BeautifulSoup('<pre>%s</pre>' % p.string, "html.parser")
    p.replace_with(n.pre)
print(soup)
Script response:
<html>
 <head>
  <title>Your site title</title>
 </head>
 <body>
  <pre>Text to read 1</pre>
  <pre>Text to read 2</pre>
 </body>
</html>
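Instead of parsing a small throwaway document for each replacement, you can also build the replacement tag directly with soup.new_tag. A minimal sketch of the same transformation:

```python
# coding: utf-8
from bs4 import BeautifulSoup

html_doc = "<body><p>Text to read 1</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

for p in soup.find_all('p'):
    pre = soup.new_tag('pre')   # create an empty <pre> tag
    pre.string = p.string       # copy the text of the <p> into it
    p.replace_with(pre)         # swap the tags in the tree

print(soup)
```

This avoids instantiating a second BeautifulSoup object per replacement.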
Read attributes
It is possible to read the attributes of elements with the get method:
# coding: utf-8
from bs4 import BeautifulSoup

html_doc = """
<html>
 <head>
  <title>Your site title</title>
 </head>
 <body>
  <p class="c1 c2">Text to read 1</p>
  <p class="c3">Text to read 2</p>
 </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
for p in soup.find_all('p'):
    print(p.get("class"))
Result:
['c1', 'c2']
['c3']
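Attributes can also be used to filter the search itself: find_all accepts a class_ keyword (with a trailing underscore, since class is reserved in Python). A short sketch:

```python
# coding: utf-8
from bs4 import BeautifulSoup

html_doc = """
<body>
 <p class="c1 c2">Text to read 1</p>
 <p class="c3">Text to read 2</p>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# keep only the <p> tags carrying the class "c3"
for p in soup.find_all('p', class_="c3"):
    print(p.get_text())
```
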
The methods of the BeautifulSoup class
clear ( decompose=False )
Removes all children of this tag
decode_contents ( indent_level=None, eventual_encoding='utf-8', formatter='minimal' )
Renders the contents of the tag as a Unicode string
decompose ( )
Recursively destroys the tag and its contents
encode ( encoding='utf-8', indent_level=None, formatter='minimal', errors='xmlcharrefreplace' )
Renders the tag as a bytestring in the given encoding
encode_contents ( indent_level=None, encoding='utf-8', formatter='minimal' )
Renders the contents of the tag as a bytestring
find ( name=None, attrs={}, recursive=True, text=None, **kwargs )
Returns only the first child tag matching the given criteria
find_all ( name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs )
Returns a list of Tag objects matching the request
findChildren ( name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs )
Returns a list of Tag objects matching the request (legacy alias of find_all)
get ( key, default=None )
Returns the value of the attribute "key" of the tag, or default if it is absent
get_text ( self, separator='', strip=False, types=(NavigableString, CData) )
Returns all the strings of the children, concatenated using the given separator
has_attr ( key )
True if the requested attribute is present
has_key ( key )
Checks the presence of the key (deprecated; use has_attr instead)
index ( element )
Returns the index of the given element within the contents of this tag
prettify ( self, encoding=None, formatter='minimal' )
Returns a nicely indented rendering of the tag, for readability
recursiveChildGenerator ( )
Legacy generator that iterates over all descendants of the tag
append ( self, tag )
Appends the given tag to the contents of the current object
extract ( )
Removes the element from the tree and returns it
find_next_siblings ( self, name=None, attrs={}, text=None, limit=None, **kwargs )
Returns the siblings of the current object that appear after it
find_parents ( self, name=None, attrs={}, limit=None, **kwargs )
Returns the parents of the current object
find_all_previous ( self, name=None, attrs={}, text=None, limit=None, **kwargs )
Returns all items that match the given criterion before the current object
find_previous_siblings ( self, name=None, attrs={}, text=None, limit=None, **kwargs )
Returns the sibling objects of the current object which are before this one
find_all_next ( self, name=None, attrs={}, text=None, limit=None, **kwargs )
Returns the objects which match the search but which are located after the current object
find_next ( self, name=None, attrs={}, text=None, **kwargs )
Returns the first object matching the criteria that appears after the current object
find_next_sibling ( self, name=None, attrs={}, text=None, **kwargs )
Returns the closest sibling that appears after the current object
find_parent ( self, name=None, attrs={}, **kwargs )
Returns the closest parent matching the criteria
find_previous ( self, name=None, attrs={}, text=None, **kwargs )
Returns the first object matching the criteria that appears before the current object
find_previous_sibling ( self, name=None, attrs={}, text=None, **kwargs )
Returns the closest sibling item preceding the current object
find_previous_siblings ( self, name=None, attrs={}, text=None, limit=None, **kwargs )
Returns the sibling items preceding the current object
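To make a few of these navigation methods concrete, here is a small sketch on a toy document (the ids "first" and "second" are just sample data):

```python
# coding: utf-8
from bs4 import BeautifulSoup

html_doc = """
<body>
 <div>
  <p id="first">one</p>
  <p id="second">two</p>
 </div>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find('p')                                # first <p> of the document
print(first.find_next_sibling('p').get_text())        # the <p> after it
print(first.find_parent('div').name)                  # its enclosing <div>
second = soup.find('p', id="second")
print(second.find_previous_sibling('p').get_text())   # the <p> before it
```
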
Example of use
I needed an HTML parser to format and colorize the code presented on this site; here is the little script I use:
# coding: utf-8
import sys
import glob

from bs4 import BeautifulSoup
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

def pygments_file(pathname):
    # Open the file
    with open(pathname, "r") as f:
        html_doc = f.read()
    soup = BeautifulSoup(html_doc, "html.parser")
    # Loop over the <pre> tags found
    for pre in soup.find_all('pre'):
        try:
            if "code" in pre.get("class"):
                texte = highlight(pre.get_text(), PythonLexer(),
                                  HtmlFormatter(nowrap=True))
                # nowrap=True strips Pygments' wrapper, so rebuild the <pre>
                n = BeautifulSoup('<pre class="code">%s</pre>' % texte,
                                  "html.parser")
                pre.replace_with(n.pre)
        except Exception:
            print("Error with {}".format(pre))
    if soup.body:
        with open(pathname, "w") as f:
            f.write(soup.body.decode_contents())

p = "/home/olivier/*.html"
if len(sys.argv) > 1:
    p = str(sys.argv[1])
pathnames = glob.glob(p)
for pathname in pathnames:
    pygments_file(pathname)
The pattern can just as well cover a whole folder as a single file.