API

Extractor public API

class Extractor
keep(xpath)

Adds an XPath expression that selects the elements to keep.

Parameters: xpath (str) – The XPath expression to add
Returns: The instance itself (self), so calls can be chained
Return type: Extractor

discard(xpath)

Adds an XPath expression that selects the elements to discard.

Parameters: xpath (str) – The XPath expression to add
Returns: The instance itself (self), so calls can be chained
Return type: Extractor

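Because keep() and discard() both return the instance, rules can be declared in one chained expression. The sketch below illustrates only that calling pattern. ExtractorSketch is a hypothetical stand-in written for this example, not the library's Extractor, and the XPath strings are made-up sample rules. With the real class, the same chain should apply, assuming it is constructed without required arguments (not confirmed by this page).

```python
class ExtractorSketch:
    """Hypothetical stand-in that mirrors the documented keep()/discard() signatures.

    It only records the XPath expressions; it does not parse any HTML.
    """

    def __init__(self):
        self.kept = []
        self.discarded = []

    def keep(self, xpath):
        # Record the expression and return self, as the documentation describes.
        self.kept.append(xpath)
        return self

    def discard(self, xpath):
        # Same contract as keep(): record, then return self.
        self.discarded.append(xpath)
        return self


# Chained calls work because every call returns the same instance.
extractor = (
    ExtractorSketch()
    .keep("//div[@id='content']")
    .discard("//div[@id='content']//script")
)
```

Each call hands back the same object, so the order of keep() and discard() in the chain is simply the order in which the rules are registered.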
extract(html_contents, css_contents=None, base_url=None)

Extracts the cleaned HTML tree as a string, together with only those CSS rules that match the cleaned HTML tree.

Parameters:
  • html_contents (str) – The HTML contents to parse
  • css_contents (str) – The CSS contents to parse
  • base_url (str) – The base page URL used to convert relative links to absolute links
Returns: The cleaned HTML contents, or a tuple of (cleaned HTML contents, cleaned CSS contents)
Return type: str or tuple
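
Since extract() is documented as returning either a str or a tuple, calling code may want to normalize the result before using it. The helper below is a minimal sketch of one way to do that. unpack_extract_result is a hypothetical name introduced here, not part of the library, and the sample strings stand in for real extract() output. This page does not state when each form is returned; presumably the tuple form appears when css_contents is supplied.

```python
def unpack_extract_result(result):
    """Normalize a 'str or tuple' result into an (html, css) pair.

    Hypothetical helper: css is None when only HTML was returned.
    """
    if isinstance(result, tuple):
        html, css = result
        return html, css
    return result, None


# Sample values standing in for what extract() might return.
html_only = unpack_extract_result("<p>Hello</p>")
html_and_css = unpack_extract_result(("<p>Hello</p>", "p { color: red; }"))
```

Callers can then always unpack two values and check whether the CSS part is None, instead of branching on the return type at every call site.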