BeautifulSoup / parse your XML and HTML

We saw previously how to parse XML. It is also possible to parse HTML, and in my opinion the tool that does this best is the BeautifulSoup library.

Install the BeautifulSoup library

Like most Python libraries, it is installed with pip:

pip install beautifulsoup4
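
To confirm that the installation worked, a quick check such as the following can help. It only assumes that the package is importable under the name bs4, as in the examples in this article.

```python
# Minimal check that the package installed by pip is importable.
import bs4

# __version__ holds the installed release as a string.
print(bs4.__version__)
```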

Retrieve the content of a specified tag

For example, BeautifulSoup lets you retrieve all the p tags of an HTML page:

# coding: utf-8

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <p>Text to read 1</p>
        <p>Text to read 2</p>
    </body>
</html>
"""
# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every <p> tag in the document
for p in soup.find_all('p'):
    print(p)

Which will print:

<p>Text to read 1</p>
<p>Text to read 2</p>
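
As a small complement, the sketch below contrasts find_all with find and get_text, which are all listed further down in this article. The sample markup and the variable names are invented for the illustration, and html.parser is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for this illustration.
html_doc = """
<html>
    <body>
        <p>First paragraph</p>
        <p>Second paragraph</p>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every match; limit caps the number of results.
all_paragraphs = soup.find_all("p")
first_only = soup.find_all("p", limit=1)

# find returns the first match directly, or None when nothing matches.
first = soup.find("p")
missing = soup.find("table")

# get_text drops the markup and keeps only the text.
texts = [p.get_text() for p in all_paragraphs]

print(len(all_paragraphs), len(first_only))
print(first)
print(missing)
print(texts)
```

Testing the result of find against None before using it avoids an AttributeError when the tag is absent from the page.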

Change the content of tags

Finding the elements that interest us is one thing, but being able to modify them is even better!

# coding: utf-8

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <p>Text to read 1</p>
        <p>Text to read 2</p>
    </body>
</html>
"""
# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(html_doc, "html.parser")

# Assigning to .string replaces the content of each <p> tag
for p in soup.find_all('p'):
    p.string = "New text"

print(soup)

Result:

<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <p>New text</p>
        <p>New text</p>
    </body>
</html>
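
One point worth knowing when assigning to .string: the assignment replaces everything inside the tag, nested tags included. The sketch below uses made-up markup to show this, together with a variant that skips paragraphs carrying a given class. The class name keep is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first paragraph contains a nested <b> tag.
html_doc = '<p>Read <b>this</b> first</p><p class="keep">Leave me</p>'
soup = BeautifulSoup(html_doc, "html.parser")

for p in soup.find_all("p"):
    # Skip paragraphs marked with the (made-up) class "keep".
    if "keep" in p.get("class", []):
        continue
    # Assigning to .string discards all children, including <b>.
    p.string = "New text"

print(soup)
```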

Replace tags

You can replace tags with the replace_with method:

# coding: utf-8

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <p>Text to read 1</p>
        <p>Text to read 2</p>
    </body>
</html>
"""
# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(html_doc, "html.parser")

for p in soup.find_all('p'):
    # Parse a small fragment holding the replacement <pre> tag
    n = BeautifulSoup('<pre>%s</pre>' % p.string, "html.parser")
    # html.parser adds no <html>/<body> wrapper, so take the tag directly
    p.replace_with(n.pre)

print(soup)

Output of the script:

<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <pre>Text to read 1</pre>
        <pre>Text to read 2</pre>
    </body>
</html>
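
An alternative that avoids parsing a second document is to build the replacement with soup.new_tag. This is a sketch of that variant, not the approach used above, and the sample markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html_doc = "<body><p>Text to read 1</p><p>Text to read 2</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

for p in soup.find_all("p"):
    # Create an empty <pre> tag that belongs to the same soup.
    pre = soup.new_tag("pre")
    # Copy the text; special characters are escaped on output.
    pre.string = p.get_text()
    # replace_with swaps the tags and returns the one that was removed.
    removed = p.replace_with(pre)

print(soup)
```

Because the text is assigned rather than interpolated into markup, characters such as < in the paragraph cannot be mistaken for tags.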

Read attributes

You can read the attributes of an element with the get method:

# coding: utf-8

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
    <title>Title of your site</title>
    </head>
    <body>
        <p class="c1 c2">Text to read 1</p>
        <p class="c3">Text to read 2</p>
    </body>
</html>
"""
# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(html_doc, "html.parser")

# class can hold several values, so get returns a list
for p in soup.find_all('p'):
    print(p.get("class"))

Result:

['c1', 'c2']
['c3']
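
To make the behaviour of get a little more concrete, here is a short sketch covering a missing attribute, a default value, has_attr and the attrs dictionary. The markup and the id value intro are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html_doc = '<p class="c1 c2" id="intro">Text</p><p>No attributes</p>'
soup = BeautifulSoup(html_doc, "html.parser")
first, second = soup.find_all("p")

# class is multi-valued, so it comes back as a list; id is a plain string.
print(first.get("class"))
print(first.get("id"))

# get returns None (or the given default) when the attribute is absent.
print(second.get("class"))
print(second.get("class", []))

# has_attr tests for presence, and attrs exposes the whole dictionary.
print(first.has_attr("id"))
print(second.attrs)
```

Passing an empty list as the default keeps a test such as "c1" in tag.get("class", []) from failing on tags that have no class at all.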

The methods of the BeautifulSoup class

clear ( decompose=False )

Removes all children of the tag; with decompose=True they are destroyed rather than just detached

decode_contents ( indent_level=None, eventual_encoding='utf-8', formatter='minimal' )

Renders the contents of the tag (without the tag itself) as a Unicode string

decompose ( )

Removes the tag from the tree, then recursively destroys it and its contents

encode ( encoding='utf-8', indent_level=None, formatter='minimal', errors='xmlcharrefreplace' )

Renders the tag and its contents as a bytestring in the given encoding

encode_contents ( indent_level=None, encoding='utf-8', formatter='minimal' )

Renders the contents of the tag (without the tag itself) as a bytestring

find ( name=None, attrs={}, recursive=True, text=None, **kwargs )

Returns the first descendant that matches the given criteria, or None if nothing matches

find_all ( name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs )

Returns a list of all the descendants that match the given criteria

findChildren ( name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs )

Legacy alias of find_all, kept for compatibility with older versions

get ( key, default=None )

Returns the value of the attribute "key", or default if the tag has no such attribute

get_text ( separator='', strip=False, types=(NavigableString, CData) )

Returns the text of all descendant strings, concatenated with the given separator

has_attr ( key )

Returns True if the tag has the requested attribute

has_key ( key )

Deprecated way of checking that an attribute is present; use has_attr instead

index ( element )

Returns the position of the given element among the direct children of the tag

prettify ( encoding=None, formatter='minimal' )

Returns the tree as a nicely indented string; if an encoding is given, returns a bytestring instead

recursiveChildGenerator ( )

Legacy name of the descendants generator, which iterates over all children recursively

append ( tag )

Appends the given tag or string to the end of the contents of the current tag

extract ( )

Removes the element from the tree and returns it

find_next_siblings ( name=None, attrs={}, text=None, limit=None, **kwargs )

Returns all the siblings located after the current object that match the criteria

find_parents ( name=None, attrs={}, limit=None, **kwargs )

Returns all the ancestors of the current object that match the criteria

find_previous_siblings ( name=None, attrs={}, text=None, limit=None, **kwargs )

Returns all the siblings located before the current object that match the criteria

find_all_next ( name=None, attrs={}, text=None, limit=None, **kwargs )

Returns all the objects that match the criteria and are located after the current object in the document

find_next ( name=None, attrs={}, text=None, **kwargs )

Returns the first object that matches the criteria after the current object in the document (not necessarily a sibling)

find_next_sibling ( name=None, attrs={}, text=None, **kwargs )

Returns the nearest sibling after the current object that matches the criteria

find_parent ( name=None, attrs={}, **kwargs )

Returns the nearest ancestor that matches the criteria

find_previous ( name=None, attrs={}, text=None, **kwargs )

Returns the first object that matches the criteria before the current object in the document

find_previous_sibling ( name=None, attrs={}, text=None, **kwargs )

Returns the nearest sibling before the current object that matches the criteria

find_all_previous ( name=None, attrs={}, text=None, limit=None, **kwargs )

Returns all the objects that match the criteria and are located before the current object in the document
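
To see how a few of the methods listed above behave together, here is a minimal sketch on an invented fragment. The markup and the id value box are assumptions made for the example, and only a handful of the methods are exercised.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for this illustration.
html_doc = (
    '<div id="box">'
    "<h1>Title</h1>"
    "<p>One</p>"
    "<p>Two</p>"
    "</div>"
)
soup = BeautifulSoup(html_doc, "html.parser")
first_p = soup.find("p")

# Siblings share the same parent.
next_p = first_p.find_next_sibling("p")
previous_h1 = first_p.find_previous_sibling("h1")

# Ancestors are searched from the nearest one upward.
parent_div = first_p.find_parent("div")

# find_all_next follows document order, not only siblings.
following = first_p.find_all_next("p")

# get_text joins the strings found below a tag.
joined = parent_div.get_text(separator=" | ")

# extract detaches an element from the tree and returns it.
removed = soup.find("h1").extract()

print(next_p, previous_h1, parent_div["id"], following, joined, removed)
print(soup)
```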

Example of use

I needed an HTML parser to format and color the code I present on this site; I share this little script:

# coding: utf-8

import sys
import glob

from bs4 import BeautifulSoup
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

def pygments_file(pathname):
    # Open the file
    with open(pathname, "r") as f:
        html_doc = f.read()

    soup = BeautifulSoup(html_doc, "html.parser")
    # Loop over the pre tags found
    for pre in soup.find_all('pre'):
        try:
            # The default avoids an error on pre tags without a class
            if "code" in pre.get("class", []):
                text = highlight(pre.get_text(), PythonLexer(),
                                 HtmlFormatter(nowrap=True))
                # nowrap=True yields bare spans, so wrap them in a pre
                n = BeautifulSoup('<pre class="code">%s</pre>' % text,
                                  "html.parser")
                pre.replace_with(n.pre)
        except Exception:
            print("Error with {}".format(pre))

    if soup.body:
        with open(pathname, "w") as f:
            # decode_contents returns a str, as a text-mode file expects
            f.write(soup.body.decode_contents())

p = "/home/olivier/*.html"

# Use the first command-line argument as the pattern, if there is one
if len(sys.argv) > 1:
    p = sys.argv[1]

pathnames = glob.glob(p)
for pathname in pathnames:
    pygments_file(pathname)

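You can pass either a single file or a glob pattern such as /home/olivier/*.html as the argument. Quote the pattern so that the shell does not expand it before the script sees it.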
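
The core of the script can also be tried in memory, without files or Pygments. The sketch below is a simplified variant and not the script itself: transform_code_blocks and fake_highlight are hypothetical names, and fake_highlight merely stands in for a real highlighter.

```python
import html

from bs4 import BeautifulSoup


def transform_code_blocks(html_doc, transform):
    """Rewrite each <pre class="code"> using the HTML returned by transform.

    transform is a hypothetical callable that takes the text of the block
    and returns an HTML fragment (for instance a Pygments-based highlighter).
    """
    soup = BeautifulSoup(html_doc, "html.parser")
    for pre in soup.find_all("pre"):
        # get with a default avoids a TypeError on <pre> tags without class.
        if "code" not in pre.get("class", []):
            continue
        wrapped = '<pre class="code">{}</pre>'.format(transform(pre.get_text()))
        fragment = BeautifulSoup(wrapped, "html.parser")
        # html.parser adds no <body> wrapper, so the tag is taken directly.
        pre.replace_with(fragment.pre)
    return str(soup)


def fake_highlight(text):
    # Stand-in for a real highlighter; html.escape keeps the text safe.
    return "<b>{}</b>".format(html.escape(text))


page = '<pre class="code">print(1 &lt; 2)</pre><pre>plain</pre>'
result = transform_code_blocks(page, fake_highlight)
print(result)
```

Keeping the transformation separate from the file handling makes it possible to check the result on a short string before rewriting real pages in place.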
