
BeautifulSoup / parse your XML and HTML

In a previous article we saw how to parse XML. HTML can be parsed as well, and in my opinion the tool that does the job best is the BeautifulSoup library.






Install the BeautifulSoup library

As with most Python libraries, installation goes through pip:

pip install beautifulsoup4

Retrieve the content of a specified tag

For example, BeautifulSoup lets you retrieve all the p tags of an HTML page:

# coding: utf-8

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <p>Text to read 1</p>
        <p>Text to read 2</p>
    </body>
</html>
"""
# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag, in document order
for p in soup.find_all('p'):
    print(p)

Which will print:

<p>Text to read 1</p>
<p>Text to read 2</p>
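
find_all also accepts filters. Here is a minimal sketch, assuming a document close to the one above; the "intro" class and the variable names are made up for the illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical document: the "intro" class is an assumption made for this sketch
html_doc = """
<html>
    <body>
        <p class="intro">Text to read 1</p>
        <p>Text to read 2</p>
        <p>Text to read 3</p>
    </body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Text of every p tag, without the surrounding markup
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)

# Only the p tags carrying the "intro" class (class_ avoids the Python keyword)
intro = soup.find_all("p", class_="intro")
print(intro)

# Stop after the first two matches
first_two = soup.find_all("p", limit=2)
print(len(first_two))
```

Filtering in the call itself is usually simpler than retrieving every tag and testing each one by hand.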

Change the content of tags

Finding the elements that interest us is one thing, but being able to modify them is even better!

# coding: utf-8

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <p>Text to read 1</p>
        <p>Text to read 2</p>
    </body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Assigning to .string replaces the content of each tag
for p in soup.find_all('p'):
    p.string = "New text"

print(soup)

Result:

<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <p>New text</p>
        <p>New text</p>
    </body>
</html>
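
The text is not the only thing that can be changed. The sketch below, built on a one-line document invented for the illustration, shows that assigning to .string also discards nested tags, and that attributes and the tag name can be modified in the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for this sketch
soup = BeautifulSoup('<p class="c1">Text <b>to</b> read</p>', "html.parser")
p = soup.p

# Assigning .string replaces every child of the tag, including the nested <b>
p.string = "New text"

# Attributes are read and written like dictionary entries
p["class"] = "c2"
p["id"] = "first"

# The tag itself can be renamed
p.name = "div"

result = str(soup)
print(result)
```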

Replace tags

You can replace tags with the replace_with method:

# coding: utf-8

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <p>Text to read 1</p>
        <p>Text to read 2</p>
    </body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

for p in soup.find_all('p'):
    # Build the replacement <pre> tag from a small parsed fragment
    n = BeautifulSoup('<pre>%s</pre>' % p.string, "html.parser")
    # html.parser adds no <html>/<body> wrapper, so take the tag directly
    p.replace_with(n.pre)

print(soup)

Script output:

<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <pre>Text to read 1</pre>
        <pre>Text to read 2</pre>
    </body>
</html>
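
Instead of parsing a small HTML fragment for each replacement, the new tag can also be created with new_tag. This is only a sketch of that alternative, on a shortened document invented for the illustration; it has the side benefit that the text is escaped on output rather than interpreted as markup.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical document for this sketch
html_doc = "<body><p>Text to read 1</p><p>Text to read 2</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

for p in soup.find_all("p"):
    pre = soup.new_tag("pre")    # empty <pre> tag created by this soup
    pre.string = p.get_text()    # plain text, escaped when the tree is rendered
    p.replace_with(pre)          # swap the <p> for the new <pre>

result = str(soup)
print(result)
```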

Read attributes

The attributes of an element can be read with the get method:

# coding: utf-8

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <p class="c1 c2">Text to read 1</p>
        <p class="c3">Text to read 2</p>
    </body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class can hold several values, so get() returns a list
for p in soup.find_all('p'):
    print(p.get("class"))

Result:

['c1', 'c2']
['c3']
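
A missing attribute deserves some care: get returns None (or the default you pass), whereas the dictionary syntax p["class"] raises a KeyError. A short sketch, on a two-paragraph document invented for the illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical document: the second <p> has no attribute at all
html_doc = '<p class="c1 c2" id="intro">Text 1</p><p>Text 2</p>'
soup = BeautifulSoup(html_doc, "html.parser")
first, second = soup.find_all("p")

print(first.get("class"))        # multi-valued attribute: a list
print(first.get("id"))           # single-valued attribute: a string
print(second.get("class"))       # missing attribute: None
print(second.get("class", []))   # missing attribute, with a default value
print(first.has_attr("id"), second.has_attr("id"))
print(first.attrs)               # all the attributes, as a dictionary
```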

The methods of the BeautifulSoup class

clear(decompose=False)

Removes all the children of the tag; with decompose=True they are destroyed as well

decode_contents(indent_level=None, eventual_encoding='utf-8', formatter='minimal')

Renders the contents of the tag as a Unicode string

decompose()

Recursively destroys the tag and its contents

encode(encoding='utf-8', indent_level=None, formatter='minimal', errors='xmlcharrefreplace')

Renders the tag and its contents as a bytestring

encode_contents(indent_level=None, encoding='utf-8', formatter='minimal')

Renders the contents of the tag as a bytestring
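
To make the difference between these rendering and removal methods concrete, here is a small sketch on a made-up fragment; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a non-ASCII character to show the str/bytes difference
soup = BeautifulSoup("<div><p>caf\u00e9</p><span>x</span></div>", "html.parser")
div = soup.div

as_text = div.decode_contents()    # str: inner markup only
as_bytes = div.encode_contents()   # bytes: inner markup, UTF-8 by default
whole = div.encode("utf-8")        # bytes: the <div> tag itself plus its contents

div.span.decompose()               # destroy the <span> and everything in it
after_decompose = str(soup)

div.clear()                        # remove the remaining children, keep the <div>
after_clear = str(soup)

print(as_text, as_bytes, whole, after_decompose, after_clear, sep="\n")
```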

find(name=None, attrs={}, recursive=True, text=None, **kwargs)

Returns only the first descendant matching the given criteria, or None if nothing matches

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

Returns a list of the Tag objects matching the request

findChildren(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

Older alias of find_all; returns the same list of matching Tag objects
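
The sketch below compares find and find_all on a small document invented for the illustration, and shows the effect of recursive=False.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one <p> is a direct child, the other is nested deeper
html_doc = """
<div id="main">
    <p>direct child</p>
    <section><p>nested</p></section>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
main = soup.find("div", attrs={"id": "main"})

first_p = main.find("p")                         # first match only
all_p = main.find_all("p")                       # every match, at any depth
direct_p = main.find_all("p", recursive=False)   # direct children only
missing = main.find("table")                     # no match: None, not an exception

print(first_p.get_text(), len(all_p), len(direct_p), missing)
```

Because find returns None when nothing matches, it is worth testing the result before calling a method on it.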

get(key, default=None)

Returns the value of the "key" attribute of the tag, or default if the attribute is absent

get_text(separator=u'', strip=False, types=(NavigableString, CData))

Returns all the strings of the descendants, concatenated using the specified separator

has_attr(key)

True if the requested attribute is present

has_key(key)

Checks for the presence of the attribute; deprecated in favor of has_attr

index(element)

Returns the index of the element among the children of the tag

prettify(encoding=None, formatter='minimal')

Returns an indented, easier-to-read rendering of the markup

recursiveChildGenerator()

Iterates over all the descendants of the tag (older name for the descendants generator)
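
A quick sketch of get_text, index and prettify on a small list invented for the illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the spaces around the words are intentional
soup = BeautifulSoup("<ul><li> one </li><li> two </li></ul>", "html.parser")
ul = soup.ul

raw = ul.get_text()                               # strings concatenated as they are
clean = ul.get_text(separator=", ", strip=True)   # each string stripped, then joined
position = ul.index(ul.find_all("li")[1])         # rank of the second <li> among the children
pretty = soup.prettify()                          # indented rendering, one tag per line

print(repr(raw))
print(clean)
print(position)
print(pretty)
```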

append(tag)

Adds the given tag as the last child of the current object

extract()

Removes the item from the tree and returns it

find_next(name=None, attrs={}, text=None, **kwargs)

Returns the first matching item located after the current object in the document

find_all_next(name=None, attrs={}, text=None, limit=None, **kwargs)

Returns all the matching items located after the current object

find_next_sibling(name=None, attrs={}, text=None, **kwargs)

Returns the nearest matching sibling located after the current object

find_next_siblings(name=None, attrs={}, text=None, limit=None, **kwargs)

Returns the matching siblings located after the current object

find_previous(name=None, attrs={}, text=None, **kwargs)

Returns the first matching item located before the current object

find_all_previous(name=None, attrs={}, text=None, limit=None, **kwargs)

Returns all the matching items located before the current object

find_previous_sibling(name=None, attrs={}, text=None, **kwargs)

Returns the closest matching sibling preceding the current object

find_previous_siblings(name=None, attrs={}, text=None, limit=None, **kwargs)

Returns the matching siblings preceding the current object

find_parent(name=None, attrs={}, **kwargs)

Returns the nearest matching parent

find_parents(name=None, attrs={}, limit=None, **kwargs)

Returns all the matching parents
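
These navigation methods are easier to tell apart on an example. Here is a minimal sketch, starting from the middle paragraph of a small document invented for the illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical document: a title followed by three sibling paragraphs
html_doc = "<div><h1>Title</h1><p>a</p><p>b</p><p>c</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
middle = soup.find_all("p")[1]   # the <p>b</p> tag

next_sibling = middle.find_next_sibling("p")          # nearest following sibling
previous_siblings = middle.find_previous_siblings()   # preceding siblings, nearest first
parent = middle.find_parent("div")                    # nearest enclosing <div>
previous_h1 = middle.find_previous("h1")              # first <h1> met when walking backwards
following_p = middle.find_all_next("p")               # every <p> located after this one

print(next_sibling, previous_siblings, parent.name, previous_h1, following_p)
```

The sibling methods stay at the same level of the tree, while find_next, find_previous and their find_all variants walk through the whole document, whatever the depth.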

Example of use

I needed an HTML parser to format and color the code I present on this site; I share this little script:

# coding: utf-8

import sys
import glob

from bs4 import BeautifulSoup
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

def pygments_file(pathname):
    # Open the file
    with open(pathname, "r", encoding="utf-8") as f:
        html_doc = f.read()

    soup = BeautifulSoup(html_doc, "html.parser")
    # Loop over the pre tags found
    for pre in soup.find_all('pre'):
        try:
            # get() returns None when the class attribute is absent
            if "code" in (pre.get("class") or []):
                texte = highlight(pre.get_text(), PythonLexer(),
                                  HtmlFormatter(nowrap=True))
                n = BeautifulSoup(texte, "html.parser")
                # Swap the raw text of the pre for the highlighted markup
                pre.clear()
                for node in list(n.contents):
                    pre.append(node)
        except Exception:
            print("Error with {}".format(pre))

    # Only the contents of body are written back, as a str
    if soup.body:
        with open(pathname, "w", encoding="utf-8") as f:
            f.write(soup.body.decode_contents())

p = "/home/olivier/*.html"

# Use the pattern given on the command line, if any
if len(sys.argv) > 1:
    p = sys.argv[1]

pathnames = glob.glob(p)
for pathname in pathnames:
    pygments_file(pathname)

The argument can be a glob pattern covering a whole folder as well as the path of a single file.
