lxml.html.html5parser module
An interface to html5lib that mimics the lxml.html interface.
- class lxml.html.html5parser.HTMLParser(strict=False, **kwargs)
- Bases: HTMLParser
- An html5lib HTML parser with lxml as tree.
- parse(stream, *args, **kwargs)
- Parse an HTML document into a well-formed tree.
- Parameters:
- stream – a file-like object or string containing the HTML to be parsed. The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).
- scripting – treat noscript elements as if JavaScript was turned on 
 
- Returns:
- parsed tree 
 - Example:
   >>> from html5lib.html5parser import HTMLParser
   >>> parser = HTMLParser()
   >>> parser.parse('<html><body><p>This is a doc</p></body></html>')
   <Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>
 - parseFragment(stream, *args, **kwargs)
- Parse an HTML fragment into a well-formed tree fragment.
- Parameters:
- container – name of the element whose innerHTML property is being set; if set to None, defaults to ‘div’
- stream – a file-like object or string containing the HTML to be parsed. The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).
- scripting – treat noscript elements as if JavaScript was turned on 
 
- Returns:
- parsed tree 
 - Example:
   >>> from html5lib.html5parser import HTMLParser
   >>> parser = HTMLParser()
   >>> parser.parseFragment('<b>this is a fragment</b>')
   <Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>
 - property documentEncoding
- Name of the character encoding that was used to decode the input stream, or None if that is not determined yet.
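As a rough illustration of the class described above, the sketch below creates a parser, parses a byte string, and reads documentEncoding before and after the call. The helper name encoding_before_and_after is hypothetical, the import is guarded because html5lib is an optional dependency of lxml, and the encoding name reported may vary with the installed html5lib version.

```python
try:
    from lxml.html import html5parser  # needs the optional html5lib package
except ImportError:
    html5parser = None


def encoding_before_and_after(markup):
    """Hypothetical helper: read documentEncoding around a parse() call."""
    parser = html5parser.HTMLParser()
    before = parser.documentEncoding   # no input stream has been decoded yet
    tree = parser.parse(markup)        # a tree object; getroot() gives the element
    return before, parser.documentEncoding, tree.getroot()


if html5parser is not None:
    before, after, root = encoding_before_and_after(
        b'<meta charset="utf-8"><p>caf\xc3\xa9</p>')
    print(before, after, root.tag)
```

Creating a fresh parser per document, as here, keeps documentEncoding tied to a single input instead of the module-wide default parser.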
 
- lxml.html.html5parser.document_fromstring(html, guess_charset=None, parser=None)
- Parse a whole document from a string.
- If guess_charset is true, or if the input is not Unicode but a byte string, the chardet library will perform charset guessing on the string.
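A minimal sketch of document_fromstring with both text and byte input follows. The helper document_text is a hypothetical name, and tag names are compared by local name on the assumption that html5lib may place elements in the XHTML namespace.

```python
try:
    from lxml import etree
    from lxml.html import html5parser  # needs the optional html5lib package
except ImportError:
    html5parser = None


def document_text(markup):
    """Hypothetical helper: return (root local name, all text) of a document."""
    root = html5parser.document_fromstring(markup)
    # Compare local names only, since tags may carry a namespace prefix.
    return etree.QName(root).localname, "".join(root.itertext())


if html5parser is not None:
    print(document_text("<p>Hello</p>"))    # str input
    print(document_text(b"<p>Hello</p>"))   # bytes input, charset guessing applies
```

Even though only a paragraph is passed in, the result is a full document root, because the parser supplies the missing html, head and body elements.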
- lxml.html.html5parser.fragment_fromstring(html, create_parent=False, guess_charset=None, parser=None)
- Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.
- If ‘create_parent’ is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element. In this case, leading or trailing text is allowed.
- If guess_charset is true, the chardet library will perform charset guessing on the string.
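The two modes described above can be sketched as follows. The helpers single_element and wrapped are hypothetical names, and the sketch assumes the error is reported as lxml.etree.ParserError.

```python
try:
    from lxml import etree
    from lxml.html import html5parser  # needs the optional html5lib package
except ImportError:
    html5parser = None


def single_element(markup):
    """Hypothetical helper: the lone element in markup, or None if rejected."""
    try:
        return html5parser.fragment_fromstring(markup)
    except etree.ParserError:
        # Raised for several elements or for stray text around the element.
        return None


def wrapped(markup, parent="div"):
    """Hypothetical helper: wrap any fragment in a newly created parent."""
    return html5parser.fragment_fromstring(markup, create_parent=parent)


if html5parser is not None:
    print(single_element("<b>bold</b>") is not None)
    print(single_element("<b>a</b><i>b</i>") is None)
    box = wrapped("lead <b>bold</b> tail", parent="span")
    print(box.tag, "".join(box.itertext()))
```

Passing a tag name as create_parent is the safer choice for untrusted snippets, since it accepts leading and trailing text instead of raising.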
- lxml.html.html5parser.fragments_fromstring(html, no_leading_text=False, guess_charset=None, parser=None)
- Parses several HTML elements, returning a list of elements.
- The first item in the list may be a string. If no_leading_text is true, then it will be an error if there is leading text, and it will always be a list of only elements.
- If guess_charset is true, the chardet library will perform charset guessing on the string.
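Because the first list item may be a string, callers usually need to separate it from the elements. A minimal sketch follows, with split_leading_text as a hypothetical helper name and local names used for comparison in case tags are namespaced.

```python
try:
    from lxml import etree
    from lxml.html import html5parser  # needs the optional html5lib package
except ImportError:
    html5parser = None


def split_leading_text(markup):
    """Hypothetical helper: return (leading text, list of elements)."""
    parts = html5parser.fragments_fromstring(markup)
    lead = ""
    if parts and isinstance(parts[0], str):
        lead = parts.pop(0)  # the optional leading text comes first
    return lead, parts


if html5parser is not None:
    lead, elements = split_leading_text("intro <b>a</b><i>b</i>")
    print(repr(lead), [etree.QName(el).localname for el in elements])
    try:
        html5parser.fragments_fromstring("intro <b>a</b>", no_leading_text=True)
    except etree.ParserError as exc:
        print("rejected:", exc)
```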
- lxml.html.html5parser.fromstring(html, guess_charset=None, parser=None)
- Parse the html, returning a single element/document.
- This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document.
- ‘base_url’ will set the document’s base_url attribute (and the tree’s docinfo.URL)
- If guess_charset is true, or if the input is not Unicode but a byte string, the chardet library will perform charset guessing on the string.
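To make "minimally parse" a little more concrete, the sketch below reports which element fromstring returns for three kinds of input. The helper root_name is hypothetical, and the container chosen for multi-element input is an observation about the current implementation, not something this page guarantees.

```python
try:
    from lxml import etree
    from lxml.html import html5parser  # needs the optional html5lib package
except ImportError:
    html5parser = None


def root_name(markup):
    """Hypothetical helper: local tag name of whatever fromstring returns."""
    return etree.QName(html5parser.fromstring(markup)).localname


if html5parser is not None:
    # A lone element is expected to come back as that element.
    print(root_name("<p>one</p>"))
    # Several siblings are expected to come back inside a container element.
    print(root_name("<p>one</p><p>two</p>"))
    # Input that starts with a doctype or <html> is treated as a document.
    print(root_name("<!DOCTYPE html><html><body><p>x</p></body></html>"))
```

When the caller already knows whether the input is a fragment or a document, fragment_fromstring or document_fromstring gives a more predictable result type.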
- lxml.html.html5parser.parse(filename_url_or_file, guess_charset=None, parser=None)
- Parse a filename, URL, or file-like object into an HTML document tree.
- Note: this returns a tree, not an element. Use parse(...).getroot() to get the document root.
- If guess_charset is true, the useChardet option is passed into html5lib to enable character detection. This option is on by default when parsing from URLs, off by default when parsing from file(-like) objects (which tend to return Unicode more often than not), and on by default when parsing from a file path (which is read in binary mode).
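The tree-versus-element distinction can be sketched with an in-memory text stream standing in for a file. The helper parse_stream is a hypothetical name, and guess_charset is left at its default, which is off for file-like objects according to the description above.

```python
import io

try:
    from lxml import etree
    from lxml.html import html5parser  # needs the optional html5lib package
except ImportError:
    html5parser = None


def parse_stream(stream):
    """Hypothetical helper: return both the tree and its root element."""
    tree = html5parser.parse(stream)   # a tree object, not an element
    return tree, tree.getroot()        # getroot() yields the document root


if html5parser is not None:
    tree, root = parse_stream(io.StringIO("<title>Demo</title><p>Hi</p>"))
    print(type(tree).__name__, etree.QName(root).localname)
    print("".join(root.itertext()))
```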