burst.parser package
Submodules
burst.parser.HTMLParser module
A parser for HTML and XHTML.
-
exception burst.parser.HTMLParser.HTMLParseError(msg, position=(None, None))
Bases: exceptions.Exception
Exception raised for all parse errors.
-
class burst.parser.HTMLParser.HTMLParser
Bases: burst.parser.markupbase.ParserBase
Find tags and other markup and call handler functions.
Usage:
    p = HTMLParser()
    p.feed(data)
    ...
    p.close()
Start tags are handled by calling self.handle_starttag() or self.handle_startendtag(); end tags by self.handle_endtag(). The data between tags is passed from the parser to the derived class by calling self.handle_data() with the data as argument (the data may be split up in arbitrary chunks). Entity references are passed by calling self.handle_entityref() with the entity reference as the argument. Numeric character references are passed to self.handle_charref() with the string containing the reference as the argument.
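The callback flow described above can be sketched with Python 3's standard-library html.parser.HTMLParser, which is assumed here to expose the same handler names as burst.parser.HTMLParser (the burst module itself is not imported). EventLogger is a hypothetical subclass name, and convert_charrefs=False is a Python 3 setting used so that entity and character references reach their own handlers instead of being decoded into data.

```python
from html.parser import HTMLParser  # stdlib stand-in; assumed to mirror burst.parser.HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that records each handler call in order."""

    def __init__(self):
        # convert_charrefs=False keeps '&amp;' and '&#65;' as separate events.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


p = EventLogger()
p.feed('<p class="x">a &amp; b&#65;</p>')
p.close()
for event in p.events:
    print(event)
```

Each handler receives only its own piece of the input, so a derived class decides per event what to keep.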
-
CDATA_CONTENT_ELEMENTS = ('script', 'style')
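A minimal sketch of what treating script and style as CDATA content means in practice, using Python 3's standard-library html.parser as a stand-in (assumed, not confirmed, to behave like the burst copy). ScriptProbe is a hypothetical name. Markup-like text inside the script element is delivered as data rather than parsed as tags.

```python
from html.parser import HTMLParser  # stdlib stand-in, not burst.parser.HTMLParser


class ScriptProbe(HTMLParser):
    """Hypothetical subclass that records start tags and raw data."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


probe = ScriptProbe()
probe.feed('<script>if (a < b) { s = "<b>"; }</script>')
probe.close()
# Data may arrive in several pieces, so join before inspecting it.
script_body = "".join(probe.text)
print(probe.tags)
print(script_body)
```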
-
feed(data)
Feed data to the parser.
Call this as often as you want, with as little or as much text as you want (may include '\n').
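As a sketch of incremental feeding, the input below is split at arbitrary points, including in the middle of an end tag. It uses Python 3's standard-library html.parser as a stand-in, on the assumption that feed() in burst.parser.HTMLParser buffers incomplete input the same way. TextCollector is a hypothetical name.

```python
from html.parser import HTMLParser  # stdlib stand-in; feed() is assumed to behave alike


class TextCollector(HTMLParser):
    """Hypothetical subclass that records tag boundaries and data pieces."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)

    def handle_data(self, data):
        self.pieces.append(data)


collector = TextCollector()
for chunk in ["<b>hel", "lo\nwor", "ld</", "b>"]:
    collector.feed(chunk)
collector.close()
# handle_data may be called several times, so the pieces are joined afterwards.
text = "".join(collector.pieces)
print(repr(text), collector.tags)
```

Because the data can arrive in several handle_data calls, code that needs the whole text should accumulate the pieces and join them at the end.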
-
entitydefs = None
burst.parser.ehp module
All credit for this code goes to Iury de Oliveira Gomes Figueiredo. Easy Html Parser is an AST generator for HTML/XML documents. You can easily delete, insert, and extract tags in HTML/XML documents, as well as look for patterns. https://github.com/iogf/ehp
-
class burst.parser.ehp.Attribute
Bases: dict
This class holds a tag's attributes. The idea is to provide an efficient and flexible way of manipulating tag attributes inside the DOM.
Example:
    dom = Html().feed('<p style="color:green"> foo </p>')
    for ind in dom.sail():
        if ind.name == 'p':
            ind.attr['style'] = "color:blue"
This changes the color to blue.
-
class burst.parser.ehp.Root(name=None, attr=None)
Bases: list
A Root instance is the outermost node of an XML/HTML document. All XML/HTML entities inherit from this class.
    html = Html()
    dom = html.feed('<html> ... </body>')
    dom.name == ''
    True
    type(dom) == Root
    True
-
sail()
This is used to navigate through the XML/HTML document. Every XML/HTML object is represented by a Python class instance that inherits from Root.
The sail method returns an iterator over these objects.
Example:
    data = '<a> <b> </b> </a>'
    html = Html()
    dom = html.feed(data)
    for ind in dom.sail():
        print type(ind), ',', ind.name
It would output:
    <class 'ehp.Root'> , a
    <class 'ehp.Root'> , b
-
index(item, **kwargs)
This is similar to list.index but uses id to check for equality.
Example:
    data = '<a><b></b><b></b></a>'
    html = Html()
    dom = html.feed(data)
    for root, ind in dom.sail_with_root():
        print root.name, ind.name, root.index(ind)
It would print:
    a b 0
    a b 1
     a 0
The line ' a 0' corresponds to the outermost object. The outermost object is an instance of Root that contains all the other objects.
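To illustrate why an identity check matters here, the sketch below uses a hypothetical Node class (a list subclass with a name, not the ehp implementation). Two empty sibling nodes compare equal as lists, so an equality-based lookup cannot tell them apart, while an identity-based lookup can.

```python
class Node(list):
    """Hypothetical stand-in for an ehp node: a list of children with a name."""

    def __init__(self, name, children=()):
        super().__init__(children)
        self.name = name

    def index_by_id(self, item):
        # Compare with 'is' so equal-looking siblings stay distinguishable.
        for position, child in enumerate(self):
            if child is item:
                return position
        raise ValueError("item is not a child of this node")


first_b = Node("b")
second_b = Node("b")
parent = Node("a", [first_b, second_b])

equality_position = list.index(parent, second_b)   # stops at the first equal child
identity_position = parent.index_by_id(second_b)   # finds the exact object
print(equality_position, identity_position)
```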
-
remove(item)
This works like list.remove but compares by id.
    data = '<a><b></b><b></b></a>'
    html = Html()
    dom = html.feed(data)
    for root, ind in dom.sail_with_root():
        if ind.name == 'b':
            root.remove(ind)
    print dom
It should print:
    <a ></a>
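A small sketch of the difference between removing by equality and removing by identity, using plain nested lists rather than ehp objects. remove_by_id is a hypothetical helper written for this illustration.

```python
def remove_by_id(parent, item):
    """Hypothetical helper: delete the child that *is* item, not one that equals it."""
    for position, child in enumerate(parent):
        if child is item:
            del parent[position]
            return
    raise ValueError("item is not a child of parent")


keep, drop = [], []            # two empty children compare equal

by_equality = [keep, drop]
by_equality.remove(drop)       # list.remove deletes the first equal child, which is keep

by_identity = [keep, drop]
remove_by_id(by_identity, drop)

print(by_equality[0] is drop, by_identity[0] is keep)
```

With equality-based removal the wrong sibling disappears, which is the situation an id-based remove avoids.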
-
find(name='', every=1, start=1, *args)
It is used to find all objects that match name.
Example 1:
    data = '<a><b></b><b></b></a>'
    html = Html()
    dom = html.feed(data)
    for ind in dom.find('b'):
        print ind
It should print:
    <b ></b>
    <b ></b>
Example 2:
    data = '<body> <p> alpha. </p> <p style="color:green"> beta.</p> </body>'
    html = Html()
    dom = html.feed(data)
    for ind in dom.find('p', ('style', 'color:green')):
        print ind
Or:
    for ind in dom.find('p', ('style', ['color:green', 'color:red'])):
        print ind
Output:
    <p style="color:green" > beta.</p>
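The attribute filter in Example 2 can be approximated with the standard library's xml.etree.ElementTree, shown here only as a stand-in for readers without ehp at hand. find_by_attr is a hypothetical helper, and treating a list of values as "any of these" is an assumption drawn from the two call forms above.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; this is not the ehp API


def find_by_attr(root, name, key, accepted):
    """Hypothetical helper: yield elements called name whose key attribute is accepted.

    accepted may be one string or a list of strings, mirroring the two call
    forms shown in Example 2.
    """
    if isinstance(accepted, str):
        accepted = [accepted]
    for element in root.iter(name):
        if element.get(key) in accepted:
            yield element


doc = ET.fromstring(
    '<body> <p> alpha. </p> <p style="color:green"> beta.</p> </body>'
)
single = [el.text for el in find_by_attr(doc, "p", "style", "color:green")]
either = [el.text for el in find_by_attr(doc, "p", "style", ["color:green", "color:red"])]
print(single, either)
```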
-
find_once(tag=None, select=None, order=1)
It returns the nth (order) occurrence of the tag matching the attributes in select.
-
find_all(tag=None, select=None, every=1, start=1)
It returns all occurrences of the tag matching the attributes in select.
-
find_with_root(name, *args)
Like Root.find but returns its parent tag.
    from ehp import *
    html = Html()
    dom = html.feed('''<body> <p> alpha </p> <p> beta </p> </body>''')
    for root, ind in dom.find_with_root('p'):
        root.remove(ind)
    print dom
It would output:
    <body > </body>
-
by_id(id_value)
It is a shortcut for finding an object whose attribute 'id' matches id_value.
Example:
    data = '<a><b id="foo"></b></a>'
    html = Html()
    dom = html.feed(data)
    print dom.by_id('foo')
    print dom.by_id('bar')
It should print:
    <b id="foo" ></b>
    None
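The lookup-or-None behaviour shown above can be sketched with the standard library's xml.etree.ElementTree as a stand-in. The by_id function below is a hypothetical free function written for this illustration, not the Root method.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; this is not the ehp API


def by_id(root, id_value):
    """Hypothetical helper: first element whose 'id' attribute equals id_value, else None."""
    for element in root.iter():
        if element.get("id") == id_value:
            return element
    return None


doc = ET.fromstring('<a><b id="foo"></b></a>')
found = by_id(doc, "foo")
missing = by_id(doc, "bar")
print(found.tag if found is not None else None, missing)
```

Callers should test the result against None before using it, since a missing id is reported by return value rather than by an exception.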
-
take(*args)
It returns the first object that has an attribute matching one of (key0, value0), (key1, value1), ... .
Example:
    data = '<a><b id="foo" size="1"></b></a>'
    html = Html()
    dom = html.feed(data)
    print dom.take(('id', 'foo'))
    print dom.take(('id', 'foo'), ('size', '2'))
-
match(*args)
It returns a sequence of objects whose attributes match (key0, value0), (key1, value1), ... .
Example:
    data = '<a size="1"><b size="1"></b></a>'
    html = Html()
    dom = html.feed(data)
    for ind in dom.match(('size', '1')):
        print ind
It would print:
    <b size="1" ></b>
    <a size="1" ><b size="1" ></b></a>
-
match_with_root(*args)
Like Root.match but with its parent tag.
Example:
    from ehp import *
    html = Html()
    dom = html.feed('''<body> <p style="color:black"> xxx </p> <p style = "color:black"> mmm </p></body>''')
    for root, ind in dom.match_with_root(('style', 'color:black')):
        del ind.attr['style']
    item = dom.fst('body')
    item.attr['style'] = 'color:black'
    print dom
Output:
    <body style="color:black" > <p > xxx </p> <p > mmm </p></body>
-
join(delim, *args)
It joins all the objects whose name appears in args, separated by delim.
Example 1:
    html = Html()
    data = '<a><b> This is cool. </b><b> That is. </b></a>'
    dom = html.feed(data)
    print dom.join('', 'b')
    print type(dom.join('', 'b'))
It would print:
    <b > This is cool. </b><b > That is. </b>
    <type 'str'>
Example 2:
    html = Html()
    data = '<a><b> alpha</b><c>beta</c> <b>gamma</a>'
    dom = html.feed(data)
    print dom.join('', 'b', 'c')
It would print:
    <b > alpha</b><c >beta</c><b >gamma</b>
Example 3:
    html = Html()
    data = '<a><b>alpha</b><c>beta</c><b>gamma</a>'
    dom = html.feed(data)
    print dom.join('\n', DATA)
It would print:
    alpha
    beta
    gamma
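Joining the text nodes with a newline delimiter, as in Example 3, can be approximated with the standard library's xml.etree.ElementTree. This is a stand-in sketch rather than ehp code, and the input is a well-formed variant of the Example 3 string because ElementTree rejects the unclosed tag.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; this is not the ehp API

# Well-formed variant of the Example 3 input (the second <b> is closed here).
doc = ET.fromstring("<a><b>alpha</b><c>beta</c><b>gamma</b></a>")

# itertext() yields each text node in document order, playing the role of DATA.
joined = "\n".join(doc.itertext())
print(joined)
```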
-
fst(name, *args)
It returns the first object whose name matches.
Example 1:
    html = Html()
    data = '<body> <em> Cool. </em></body>'
    dom = html.feed(data)
    print dom.fst('em')
It outputs:
    <em > Cool. </em>
Example 2:
    data = '<body> <p> alpha. </p> <p style="color:green"> beta.</p> </body>'
    html = Html()
    dom = html.feed(data)
    for ind in dom.find('p', ('style', 'color:green')):
        print ind
    print dom.fst('p', ('style', 'color:green'))
    print dom.fst_with_root('p', ('style', 'color:green'))
Output:
    <p style="color:green" > beta.</p>
    <p style="color:green" > beta.</p>
    (<ehp.Tag object at 0xb7216c0c>, <ehp.Tag object at 0xb7216d24>)
-
fst_with_root(name, *args)
Like fst, but also returns the item's parent.
Example:
    html = Html()
    data = '<body> <em> Cool. </em></body>'
    dom = html.feed(data)
    root, item = dom.fst_with_root('em')
    root.insert_after(item, Tag('p'))
    print root
It outputs:
    <body > <em > Cool. </em><p ></p></body>
For another similar example, see help(Root.fst).
-
text()
It collects all objects whose name matches DATA and returns a string with all the ASCII characters inside an XML/HTML tag.
Example:
    html = Html()
    data = '<body><em>This is all the text.</em></body>'
    dom = html.feed(data)
    print dom.fst('em').text()
It outputs:
    This is all the text.
Notice that if you call text() on an item with children, it returns all the printable characters for that node.
-
sail_with_root()
This works like sail(), but it yields each tag's parent along with the child tag.
For an example, see help(Root.remove).
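A minimal sketch of a traversal that yields (parent, child) pairs, written against a hypothetical Node class rather than the ehp source. The visiting order (a child's subtree first, then the child) is an assumption chosen to match the output shown for Root.index: 'a b 0', 'a b 1', ' a 0'.

```python
class Node(list):
    """Hypothetical stand-in for an ehp node: a list of children with a name."""

    def __init__(self, name, children=()):
        super().__init__(children)
        self.name = name


def sail_with_root(node):
    """Yield (parent, child) pairs, visiting a child's subtree before the child itself."""
    for child in list(node):  # iterate over a snapshot so callers may remove children
        yield from sail_with_root(child)
        yield node, child


dom = Node("", [Node("a", [Node("b"), Node("b")])])
pairs = [(parent.name, child.name) for parent, child in sail_with_root(dom)]
print(pairs)

for parent, child in sail_with_root(dom):
    if child.name == "b":
        # Delete by position found via identity, since the two 'b' nodes compare equal.
        del parent[next(i for i, c in enumerate(parent) if c is child)]
remaining = len(dom[0])
print(remaining)
```

Having the parent in hand is what makes in-place edits such as removal possible during the walk.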
-
walk()
Like sail but carries name and attr.
Example:
    html = Html()
    data = '<body> <em> This is all the text.</em></body>'
    dom = html.feed(data)
    for ind, name, attr in dom.walk():
        print 'TAG:', ind
        print 'NAME:', name
        print 'ATTR:', attr
It should print:
    TAG:
    NAME: 1
    ATTR:
    TAG: This is all the text.
    NAME: 1
    ATTR:
    TAG: <em > This is all the text.</em>
    NAME: em
    ATTR:
    TAG: <body > <em > This is all the text.</em></body>
    NAME: body
    ATTR:
-
walk_with_root()
Like walk but carries root.
Example:
    html = Html()
    data = '<body><em>alpha</em></body>'
    dom = html.feed(data)
    for (root, name, attr), (ind, name, attr) in dom.walk_with_root():
        print root, name, ind, name
Output:
    <em >alpha</em> 1 alpha 1
    <body ><em >alpha</em></body> em <em >alpha</em> em
    <body ><em >alpha</em></body> body <body ><em >alpha</em></body> body
-
-
class burst.parser.ehp.Tag(name, attr=None)
Bases: burst.parser.ehp.Root
Instances of this class represent XML/HTML tags of the form <name key="value" ...> ... </name>.
It holds useful methods for parsing XML/HTML documents.
-
class burst.parser.ehp.Data(data)
Bases: burst.parser.ehp.Root
The Pythonic representation of data inside XML/HTML documents.
All data that is not an XML/HTML token is represented by this class in the structure of the document.
Example:
    html = Html()
    data = '<body><em>alpha</em></body>'
    dom = html.feed(data)
    x = dom.fst('em')
    # x[0] holds a Data instance.
    type(x[0])
    print x[0]
Output:
    <class 'ehp.Data'>
    alpha
Data instances appear throughout the document: when the tokenizer finds them between XML/HTML tags, it builds the structure to mirror the document.
-
class burst.parser.ehp.XTag(name, attr=None)
Bases: burst.parser.ehp.Root
This class represents HTML tags written in XHTML style, such as <img src="t.gif" />. These are tags that do not have children.
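At the parser level, XHTML-style empty tags arrive through handle_startendtag rather than handle_starttag. The sketch below shows that callback with Python 3's standard-library html.parser as a stand-in. How the Html tokenizer turns the event into an XTag is not shown in this reference, so that step is left out, and EmptyTagLogger is a hypothetical name.

```python
from html.parser import HTMLParser  # stdlib stand-in; assumed to mirror burst.parser.HTMLParser


class EmptyTagLogger(HTMLParser):
    """Hypothetical subclass that separates XHTML-style empty tags from paired ones."""

    def __init__(self):
        super().__init__()
        self.empty = []
        self.opened = []

    def handle_startendtag(self, tag, attrs):
        # Called for tags written as <name ... />.
        self.empty.append((tag, dict(attrs)))

    def handle_starttag(self, tag, attrs):
        # Called for ordinary opening tags such as <p>.
        self.opened.append(tag)


logger = EmptyTagLogger()
logger.feed('<p><img src="t.gif" /></p>')
logger.close()
print(logger.empty, logger.opened)
```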
-
class burst.parser.ehp.Meta(data)
Bases: burst.parser.ehp.Root
-
class burst.parser.ehp.Code(data)
Bases: burst.parser.ehp.Root
-
class burst.parser.ehp.Amp(data)
Bases: burst.parser.ehp.Root
-
class burst.parser.ehp.Pi(data)
Bases: burst.parser.ehp.Root
-
class burst.parser.ehp.Comment(data)
Bases: burst.parser.ehp.Root
-
class burst.parser.ehp.Tree
Bases: object
The engine class.
-
class burst.parser.ehp.Html
Bases: burst.parser.HTMLParser.HTMLParser
The tokenizer class.
burst.parser.markupbase module
Shared support for scanning document type declarations in HTML and XHTML.
This module is used as a foundation for the HTMLParser and sgmllib modules (indirectly, for htmllib as well). It has no documented public API and should not be used directly.