burst.parser package

Submodules

burst.parser.HTMLParser module

A parser for HTML and XHTML.

exception burst.parser.HTMLParser.HTMLParseError(msg, position=(None, None))[source]

Bases: exceptions.Exception

Exception raised for all parse errors.

class burst.parser.HTMLParser.HTMLParser[source]

Bases: burst.parser.markupbase.ParserBase

Find tags and other markup and call handler functions.

Usage:
    p = HTMLParser()
    p.feed(data)
    ...
    p.close()

Start tags are handled by calling self.handle_starttag() or self.handle_startendtag(); end tags by self.handle_endtag(). The data between tags is passed from the parser to the derived class by calling self.handle_data() with the data as argument (the data may be split up in arbitrary chunks). Entity references are passed by calling self.handle_entityref() with the entity reference as the argument. Numeric character references are passed to self.handle_charref() with the string containing the reference as the argument.
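
As a rough illustration of the callback flow described above, the sketch below subclasses the Python 3 standard-library html.parser.HTMLParser, which exposes handlers with the same names. That burst.parser.HTMLParser behaves identically is an assumption; EventRecorder and record_events are hypothetical names introduced here. convert_charrefs=False is a Python 3 option that keeps handle_entityref() and handle_charref() active.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical subclass that records every handler call in order."""

    def __init__(self):
        # Python 3 only: keep entity and character references as separate events.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def record_events(markup):
    """Feed markup to a fresh recorder and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events
```

Calling record_events('<p class="x">a&amp;b</p>') would record a start-tag event, then data and entity-reference events, then an end-tag event, in document order.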

CDATA_CONTENT_ELEMENTS = ('script', 'style')
reset()[source]

Reset this instance. Loses all unprocessed data.

feed(data)[source]

Feed data to the parser.

Call this as often as you want, with as little or as much text as you want (may include '\n').
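
A minimal sketch of incremental feeding, again using the Python 3 standard-library html.parser.HTMLParser as a stand-in (an assumption; the burst class may differ in detail). TextCollector and parse_in_pieces are hypothetical names. Because data may arrive in arbitrary chunks, the sketch joins the chunks before using them.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass that collects start-tag names and text chunks."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # A single text run may be delivered through several calls.
        self.chunks.append(data)


def parse_in_pieces(pieces):
    """Feed each piece separately, then flush whatever is still buffered."""
    parser = TextCollector()
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return parser.tags, "".join(parser.chunks)
```

Under these assumptions, parse_in_pieces(['<p>hel', 'lo</', 'p>']) and parse_in_pieces(['<p>hello</p>']) should give the same result, even though a tag is split across two feed() calls.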

close()[source]

Handle any buffered data.

error(message)[source]
get_starttag_text()[source]

Return full source of start tag: ‘<...>’.

set_cdata_mode(elem)[source]
clear_cdata_mode()[source]
goahead(end)[source]
parse_html_declaration(i)[source]
parse_bogus_comment(i, report=1)[source]
parse_pi(i)[source]
parse_starttag(i)[source]
check_for_whole_start_tag(i)[source]
parse_endtag(i)[source]
handle_startendtag(tag, attrs)[source]
handle_starttag(tag, attrs)[source]
handle_endtag(tag)[source]
handle_charref(name)[source]
handle_entityref(name)[source]
handle_data(data)[source]
handle_comment(data)[source]
handle_decl(decl)[source]
handle_pi(data)[source]
unknown_decl(data)[source]
entitydefs = None
unescape(s)[source]

burst.parser.ehp module

All credit for this code goes to Iury de oliveira gomes figueiredo. Easy Html Parser is an AST generator for HTML/XML documents. You can easily delete, insert, or extract tags in HTML/XML documents, as well as look for patterns. https://github.com/iogf/ehp

class burst.parser.ehp.Attribute[source]

Bases: dict

This class holds a tag's attributes. The idea is to provide an efficient and flexible way of manipulating tag attributes inside the DOM.

Example:

    dom = Html().feed('<p style="color:green"> foo </p>')

    for ind in dom.sail():
        if ind.name == 'p':
            ind.attr['style'] = "color:blue"

This would change the color to blue.

class burst.parser.ehp.Root(name=None, attr=None)[source]

Bases: list

A Root instance is the outermost node of an XML/HTML document. All XML/HTML entities inherit from this class.

    html = Html()
    dom = html.feed('<html> ... </body>')

    dom.name == ''
    True
    type(dom) == Root
    True

sail()[source]

This is used to navigate through the XML/HTML document. Every XML/HTML object is represented by a Python class instance that inherits from Root.

The method sail returns an iterator over these objects.

Example:

    data = '<a> <b> </b> </a>'

    html = Html()
    dom = html.feed(data)

    for ind in dom.sail():
        print type(ind), ',', ind.name

It would output:

    <class 'ehp.Root'> , a
    <class 'ehp.Root'> , b
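
The outputs shown for index and walk in this section list children before their parents, which suggests a post-order traversal. The sketch below illustrates that idea under that assumption, using a hypothetical Node class (a list of children with a name) rather than the ehp classes themselves. It is written for Python 3.

```python
class Node(list):
    """Hypothetical stand-in for a tree node: a list of children with a name."""

    def __init__(self, name, children=()):
        super().__init__(children)
        self.name = name

    def sail(self):
        # Iterate over a copy so callers may remove children while looping.
        for child in self[:]:
            yield from child.sail()  # descendants first
            yield child              # then the child itself


# The outermost node has an empty name and wraps the document.
dom = Node('', [Node('a', [Node('b'), Node('c')])])
names = [node.name for node in dom.sail()]
```

Note that the outermost node itself is not yielded in this sketch; only its descendants are.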

index(item, **kwargs)[source]

This is similar to list.index but uses id() to check for equality.

Example:

    data = '<a><b></b><b></b></a>'
    html = Html()
    dom = html.feed(data)

    for root, ind in dom.sail_with_root():
        print root.name, ind.name, root.index(ind)

It would print:

    a b 0
    a b 1
     a 0

The line ' a 0' corresponds to the outermost object. The outermost object is an instance of Root that contains all the other objects.

remove(item)[source]

This works like list.remove but compares by id().

    data = '<a><b></b><b></b></a>'
    html = Html()
    dom = html.feed(data)

    for root, ind in dom.sail_with_root():
        if ind.name == 'b':
            root.remove(ind)

    print dom

It should print:

    <a ></a>
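
To see why identity matters for index and remove, consider that two empty tags compare equal as lists, so a value-based lookup would always find the first one. The sketch below shows the idea with a hypothetical IdentityList class; it is not the library's implementation, only an illustration written for Python 3.

```python
class IdentityList(list):
    """Hypothetical list whose index/remove compare by identity, not value."""

    def index(self, item):
        for position, child in enumerate(self):
            if child is item:
                return position
        raise ValueError("item is not in list")

    def remove(self, item):
        del self[self.index(item)]


first, second = [], []                    # equal by value, distinct objects
parent = IdentityList([first, second])

by_value = list.index(parent, second)     # plain list lookup stops at `first`
by_identity = parent.index(second)        # identity lookup finds `second`
parent.remove(second)                     # removes `second`, keeps `first`
```

With value-based comparison, removing the second of two identical children would silently delete the first one instead.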

find(name='', every=1, start=1, *args)[source]

It finds all objects whose name matches name.

Example 1:

    data = '<a><b></b><b></b></a>'
    html = Html()
    dom = html.feed(data)

    for ind in dom.find('b'):
        print ind

It should print:

    <b ></b>
    <b ></b>

Example 2:

    data = '<body> <p> alpha. </p> <p style="color:green"> beta.</p> </body>'
    html = Html()
    dom = html.feed(data)

    for ind in dom.find('p', ('style', 'color:green')):
        print ind

Or:

    for ind in dom.find('p', ('style', ['color:green', 'color:red'])):
        print ind

Output:

    <p style="color:green" > beta.</p>
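
The attribute filter in Example 2 can be pictured roughly as follows. This is a sketch under two assumptions that the text above does not confirm: every (key, value) pair must hold, and a list value means "any of these alternatives". The functions attr_matches and find, and the flat list of (name, attr) tuples, are hypothetical and only stand in for the real tree.

```python
def attr_matches(attr, pairs):
    """Return True if attr satisfies every (key, wanted) pair (assumed semantics)."""
    for key, wanted in pairs:
        # A list or tuple is treated as a set of acceptable values.
        allowed = wanted if isinstance(wanted, (list, tuple)) else [wanted]
        if attr.get(key) not in allowed:
            return False
    return True


def find(nodes, name, *pairs):
    """Yield the (name, attr) entries whose name and attributes match."""
    for node_name, attr in nodes:
        if node_name == name and attr_matches(attr, pairs):
            yield node_name, attr


# Hypothetical flattened document: (tag name, attribute dict) tuples.
nodes = [
    ('p', {}),
    ('p', {'style': 'color:green'}),
    ('em', {'style': 'color:green'}),
]
green = list(find(nodes, 'p', ('style', 'color:green')))
either = list(find(nodes, 'p', ('style', ['color:green', 'color:red'])))
```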

find_once(tag=None, select=None, order=1)[source]

It returns the nth (order) occurrence of the tag whose attributes match select.

find_all(tag=None, select=None, every=1, start=1)[source]

It returns all occurrences of the tag whose attributes match select.

find_with_root(name, *args)[source]

Like Root.find but also returns each match's parent tag.

    from ehp import *

    html = Html()
    dom = html.feed('''<body> <p> alpha </p> <p> beta </p> </body>''')

    for root, ind in dom.find_with_root('p'):
        root.remove(ind)

    print dom

It would output:

    <body > </body>

by_id(id_value)[source]

It is a shortcut for finding an object whose attribute 'id' matches id_value.

Example:

    data = '<a><b id="foo"></b></a>'
    html = Html()
    dom = html.feed(data)

    print dom.by_id('foo')
    print dom.by_id('bar')

It should print:

    <b id="foo" ></b>
    None

take(*args)[source]

It returns the first object one of whose attributes matches (key0, value0), (key1, value1), ... .

Example:

    data = '<a><b id="foo" size="1"></b></a>'
    html = Html()
    dom = html.feed(data)

    print dom.take(('id', 'foo'))
    print dom.take(('id', 'foo'), ('size', '2'))

take_with_root(*args)[source]

Like Root.take but returns the tag's parent as well.

match(*args)[source]

It returns a sequence of objects whose attributes match (key0, value0), (key1, value1), ... .

Example:

    data = '<a size="1"><b size="1"></b></a>'
    html = Html()
    dom = html.feed(data)

    for ind in dom.match(('size', '1')):
        print ind

It would print:

    <b size="1" ></b>
    <a size="1" ><b size="1" ></b></a>

match_with_root(*args)[source]

Like Root.match but also returns each match's parent tag.

Example:

    from ehp import *

    html = Html()
    dom = html.feed('''<body> <p style="color:black"> xxx </p> <p style = "color:black"> mmm </p></body>''')

    for root, ind in dom.match_with_root(('style', 'color:black')):
        del ind.attr['style']

    item = dom.fst('body')
    item.attr['style'] = 'color:black'

    print dom

Output:

    <body style="color:black" > <p > xxx </p> <p > mmm </p></body>

join(delim, *args)[source]

It joins all the objects whose name appears in args.

Example 1:

    html = Html()
    data = '<a><b> This is cool. </b><b> That is. </b></a>'
    dom = html.feed(data)

    print dom.join('', 'b')
    print type(dom.join('b'))

It would print:

    <b > This is cool. </b><b > That is. </b>
    <type 'str'>

Example 2:

    html = Html()
    data = '<a><b> alpha</b><c>beta</c> <b>gamma</a>'
    dom = html.feed(data)

    print dom.join('', 'b', 'c')

It would print:

    <b > alpha</b><c >beta</c><b >gamma</b>

Example 3:

    html = Html()
    data = '<a><b>alpha</b><c>beta</c><b>gamma</a>'
    dom = html.feed(data)

    print dom.join('\n', DATA)

It would print:

    alpha
    beta
    gamma

fst(name, *args)[source]

It returns the first object whose name matches.

Example 1:

    html = Html()
    data = '<body> <em> Cool. </em></body>'
    dom = html.feed(data)

    print dom.fst('em')

It outputs:

    <em > Cool. </em>

Example 2:

    data = '<body> <p> alpha. </p> <p style="color:green"> beta.</p> </body>'
    html = Html()
    dom = html.feed(data)

    for ind in dom.find('p', ('style', 'color:green')):
        print ind

    print dom.fst('p', ('style', 'color:green'))
    print dom.fst_with_root('p', ('style', 'color:green'))

Output:

    <p style="color:green" > beta.</p>
    <p style="color:green" > beta.</p>
    (<ehp.Tag object at 0xb7216c0c>, <ehp.Tag object at 0xb7216d24>)

fst_with_root(name, *args)[source]

Like fst but also returns the item's parent.

Example:

    html = Html()
    data = '<body> <em> Cool. </em></body>'
    dom = html.feed(data)

    root, item = dom.fst_with_root('em')
    root.insert_after(item, Tag('p'))
    print root

It outputs:

    <body > <em > Cool. </em><p ></p></body>

For another similar example, see help(Root.fst).

text()[source]

It returns the content of all objects whose name matches DATA. In effect, it returns a string made of all the ASCII characters inside an XML/HTML tag.

Example:

    html = Html()
    data = '<body><em>This is all the text.</em></body>'
    dom = html.feed(data)

    print dom.fst('em').text()

It outputs:

    This is all the text.

Notice that if you call text() on an item with children, it returns all the printable characters for that node.
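
One way to picture text() on a node with children is a recursive concatenation of every text leaf below it, in document order. The sketch below uses hypothetical Data and Tag stand-ins written for Python 3; it illustrates the idea only and is not the ehp implementation.

```python
class Data(str):
    """Hypothetical text leaf: its text is simply itself."""

    def text(self):
        return str(self)


class Tag(list):
    """Hypothetical element: a named list of child Tag/Data objects."""

    def __init__(self, name, children=()):
        super().__init__(children)
        self.name = name

    def text(self):
        # Concatenate the text of every child, depth first, in order.
        return ''.join(child.text() for child in self)


body = Tag('body', [
    Tag('em', [Data('This is '), Tag('b', [Data('all')]), Data(' the text.')]),
])
content = body.text()
```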

write(filename)[source]

It saves the structure to a file.

sail_with_root()[source]

This works like sail(), but it yields each tag's parent along with the child tag.

For an example, see help(Root.remove).

walk()[source]

Like sail but carries name and attr.

Example:

    html = Html()
    data = '<body> <em> This is all the text.</em></body>'
    dom = html.feed(data)

    for ind, name, attr in dom.walk():
        print 'TAG:', ind
        print 'NAME:', name
        print 'ATTR:', attr

It should print:

    TAG:
    NAME: 1
    ATTR:
    TAG: This is all the text.
    NAME: 1
    ATTR:
    TAG: <em > This is all the text.</em>
    NAME: em
    ATTR:
    TAG: <body > <em > This is all the text.</em></body>
    NAME: body
    ATTR:

walk_with_root()[source]

Like walk but carries root.

Example:

    html = Html()
    data = '<body><em>alpha</em></body>'
    dom = html.feed(data)

    for (root, name, attr), (ind, name, attr) in dom.walk_with_root():
        print root, name, ind, name

Output:

    <em >alpha</em> 1 alpha 1
    <body ><em >alpha</em></body> em <em >alpha</em> em
    <body ><em >alpha</em></body> body <body ><em >alpha</em></body> body

insert_after(y, k)[source]

Insert after a given tag.

For an example, see help(Root.fst_with_root).

insert_before(y, k)[source]

Insert before a given tag.

For a similar example, see help(Root.fst_with_root).

parent(dom)[source]

Find the parent tag.

list_(text='')[source]
select(text='')[source]
get_attributes(text)[source]
class burst.parser.ehp.Tag(name, attr=None)[source]

Bases: burst.parser.ehp.Root

Instances of this class represent XML/HTML tags of the form <name key="value" ...> ... </name>.

It holds useful methods for parsing xml/html documents.

class burst.parser.ehp.Data(data)[source]

Bases: burst.parser.ehp.Root

The Pythonic representation of the data inside XML/HTML documents.

All data that is not an XML/HTML token is represented by this class in the structure of the document.

Example:

    html = Html()
    data = '<body><em>alpha</em></body>'
    dom = html.feed(data)

    x = dom.fst('em')

    # x[0] holds a Data instance.

    print type(x[0])
    print x[0]

Output:

    <class 'ehp.Data'>
    alpha

Data instances appear throughout the document: when the tokenizer finds text between XML/HTML tags, it adds it to the structure in the same position it has in the document.

text()[source]
class burst.parser.ehp.XTag(name, attr=None)[source]

Bases: burst.parser.ehp.Root

This class represents HTML tags written in XHTML style, such as <img src="t.gif" />. Such tags have no children.

class burst.parser.ehp.Meta(data)[source]

Bases: burst.parser.ehp.Root

class burst.parser.ehp.Code(data)[source]

Bases: burst.parser.ehp.Root

class burst.parser.ehp.Amp(data)[source]

Bases: burst.parser.ehp.Root

class burst.parser.ehp.Pi(data)[source]

Bases: burst.parser.ehp.Root

class burst.parser.ehp.Comment(data)[source]

Bases: burst.parser.ehp.Root

class burst.parser.ehp.Tree[source]

Bases: object

The engine class.

clear()[source]

Clear the outermost node and the stack for a new parse.

last()[source]

Return the last pointer, which points to the current tag scope.

nest(name, attr)[source]

Nest a given tag at the bottom of the tree using the stack's last pointer.

dnest(data)[source]

Nest the actual data onto the tree.

xnest(name, attr)[source]

Nest an XTag onto the tree.

ynest(data)[source]
mnest(data)[source]
cnest(data)[source]
rnest(data)[source]
inest(data)[source]
enclose(name)[source]

When a closing tag is found, pop the pointer's scope from the stack so that the pointer refers to the enclosing tag's scope.

class burst.parser.ehp.Html[source]

Bases: burst.parser.HTMLParser.HTMLParser

The tokenizer class.

fromfile(filename)[source]

It builds a structure from a file.

feed(data)[source]
handle_starttag(name, attr)[source]

When an opening tag is found, nest it onto the tree.

handle_startendtag(name, attr)[source]

When an XHTML-style tag is found, nest it onto the tree.

handle_endtag(name)[source]

When a closing tag is found, make the pointer refer to the right scope.

handle_data(data)[source]

Nest data onto the tree.

handle_decl(decl)[source]
unknown_decl(decl)[source]
handle_charref(data)[source]
handle_entityref(data)[source]
handle_pi(data)[source]
handle_comment(data)[source]
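
The interplay between the tokenizer callbacks and the stack-based engine described above can be sketched as follows. This is an illustration only: it uses the Python 3 standard-library html.parser.HTMLParser, folds the engine into the parser class for brevity, and the names Node, TreeBuilder, and build are hypothetical. The real Tree and Html classes may differ.

```python
from html.parser import HTMLParser


class Node(list):
    """Hypothetical tree node: a list of children with a name and attributes."""

    def __init__(self, name, attr=None):
        super().__init__()
        self.name = name
        self.attr = dict(attr or {})


class TreeBuilder(HTMLParser):
    """Hypothetical tokenizer plus engine: a stack tracks the open scope."""

    def __init__(self):
        super().__init__()
        self.outmost = Node('')
        self.stack = [self.outmost]

    def last(self):
        # The node currently being filled.
        return self.stack[-1]

    def handle_starttag(self, name, attr):
        node = Node(name, attr)
        self.last().append(node)   # nest under the current scope
        self.stack.append(node)    # and make it the new scope

    def handle_startendtag(self, name, attr):
        # XHTML-style tags have no children, so the scope does not change.
        self.last().append(Node(name, attr))

    def handle_endtag(self, name):
        # Pop back to the scope that opened `name`; ignore stray closers.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].name == name:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        self.last().append(data)


def build(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.outmost
```

Keeping the outermost node permanently at the bottom of the stack means an unmatched closing tag can never empty the stack.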

burst.parser.markupbase module

Shared support for scanning document type declarations in HTML and XHTML.

This module is used as a foundation for the HTMLParser and sgmllib modules (indirectly, for htmllib as well). It has no documented public API and should not be used directly.

class burst.parser.markupbase.ParserBase[source]

Parser base class which provides some common support methods used by the SGML/HTML and XHTML parsers.

error(message)[source]
reset()[source]
getpos()[source]

Return current line number and offset.
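
A short sketch of how getpos() can be used from inside a handler, based on the Python 3 standard-library html.parser.HTMLParser, which offers a method of the same name. That the burst class reports positions in the same way is an assumption; PositionLogger and start_tag_positions are hypothetical names.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Hypothetical subclass that notes where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) pair for the current token.
        self.positions.append((tag, self.getpos()))


def start_tag_positions(markup):
    parser = PositionLogger()
    parser.feed(markup)
    parser.close()
    return parser.positions
```

In the standard-library parser, line numbers start at 1 and offsets start at 0, which is handy when reporting parse errors to users.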

updatepos(i, j)[source]
parse_declaration(i)[source]
parse_marked_section(i, report=1)[source]
parse_comment(i, report=1)[source]
unknown_decl(data)[source]