API

Extractor public API

class Extractor
keep(xpath)

Adds an XPath expression whose matching elements are kept.

Parameters:

xpath (str) – The XPath expression to add

Returns:

The instance itself (self), which allows calls to be chained

Return type:

Extractor

discard(xpath)

Adds an XPath expression whose matching elements are discarded.

Parameters:

xpath (str) – The XPath expression to add

Returns:

The instance itself (self), which allows calls to be chained

Return type:

Extractor
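
Since keep() and discard() both return the instance, calls can be chained. The sketch below illustrates only that pattern. It uses a stand-in class: `SketchExtractor` and its `kept` and `discarded` attributes are hypothetical and merely record the expressions; they are not the library's implementation. The XPath expressions are illustrative as well.

```python
class SketchExtractor:
    """Hypothetical stand-in that only records XPath expressions."""

    def __init__(self):
        # Assumed storage for illustration, not the real internals.
        self.kept = []
        self.discarded = []

    def keep(self, xpath):
        self.kept.append(xpath)
        return self  # returning self is what makes chaining possible

    def discard(self, xpath):
        self.discarded.append(xpath)
        return self


extractor = SketchExtractor()
chained = (
    extractor
    .keep("//div[@id='content']")
    .discard("//script")
    .discard("//div[@class='ads']")
)
```

Each call hands back the same object, so `chained` and `extractor` refer to one instance and the order of the calls is preserved in the recorded lists.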

extract(html_contents, css_contents=None, base_url=None)

Extracts the cleaned HTML tree as a string, together with only the CSS rules that match the cleaned HTML tree.

Parameters:
  • html_contents (str) – The HTML contents to parse

  • css_contents (str) – The CSS contents to parse (optional, defaults to None)

  • base_url (str) – The base page URL used to convert relative links to absolute ones (optional, defaults to None)

Returns:

The cleaned HTML contents, or a tuple of (cleaned HTML contents, cleaned CSS contents)

Return type:

str or tuple
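
Because extract() may return either a str or a tuple, calling code may want to normalize the result before using it. The helper below is one possible sketch. `split_result` is a hypothetical name, not part of the library, and the sample values are made up. The tuple form is presumably returned when css_contents is supplied, but that is an assumption based on the signature.

```python
def split_result(result):
    """Return (html, css) whether result is a str or an (html, css) tuple."""
    if isinstance(result, tuple):
        html, css = result
        return html, css
    # Only HTML was returned, so there is no CSS part.
    return result, None


# Sample values standing in for what extract() might return.
html_only, no_css = split_result("<p>Hello</p>")
html_part, css_part = split_result(("<p>Hello</p>", "p { color: red; }"))
```

Normalizing in one place keeps the rest of the calling code free of repeated isinstance checks.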