Usage

Create the Extractor instance

First, import the Extractor class:

from chopper.extractor import Extractor

Then create an Extractor instance, either by instantiating one explicitly or by calling the Extractor.keep() and Extractor.discard() class methods directly:

from chopper.extractor import Extractor

# Instantiate style
extractor = Extractor().keep('//div').discard('//a')

# Class method style
extractor = Extractor.keep('//div').discard('//a')

Add XPath expressions

An Extractor instance lets you chain multiple Extractor.keep() and Extractor.discard() calls, each taking one XPath expression:

from chopper.extractor import Extractor

e = Extractor.keep('//div[p]').discard('//span').discard('//a').keep('strong')
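
Chaining works because each call hands back an object that accepts the next call. The toy class below is only a sketch of that pattern: RuleChain and its rules attribute are hypothetical names invented for illustration, and nothing here is taken from Chopper's actual implementation.

```python
class RuleChain:
    """Hypothetical stand-in for a chainable rule builder (not Chopper's code)."""

    def __init__(self):
        # Ordered list of (action, xpath) pairs, in the order they were added.
        self.rules = []

    def keep(self, xpath):
        self.rules.append(("keep", xpath))
        return self  # returning self is what allows the next call to be chained

    def discard(self, xpath):
        self.rules.append(("discard", xpath))
        return self


chain = RuleChain().keep("//div[p]").discard("//span").discard("//a")
print(chain.rules)
```

Whether Chopper records its expressions in a single ordered list like this is an assumption; the sketch only shows why the dotted call style is possible.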

Extract contents

Once your Extractor instance is created, call its Extractor.extract() method. It takes at least one argument: the HTML to parse.

If you also want to parse CSS, pass it as the second argument. In that case, Extractor.extract() returns both the extracted HTML and the extracted CSS.

from chopper.extractor import Extractor

HTML = """
<html>
  <head>
    <title>Hello world !</title>
  </head>
  <body>
    <header>This is the header</header>
    <div>
      <p><span>Main </span>content</p>
      <a href="/">See more</a>
    </div>
    <footer>This is the footer</footer>
  </body>
</html>
"""

CSS = """
a { color: blue; }
p { color: red; }
span { border: 1px solid red; }
body { background-color: green; }
"""

# Create the Extractor
e = Extractor.keep('//div[p]').discard('//span').discard('//a')

# Parse HTML only
html = e.extract(HTML)

>>> html
"""
<html>
  <body>
    <div>
      <p>content</p>
    </div>
  </body>
</html>
"""

# Parse HTML & CSS
html, css = e.extract(HTML, CSS)

>>> html
"""
<html>
  <body>
    <div>
      <p>content</p>
    </div>
  </body>
</html>
"""

>>> css
"""
p{color:red;}
body{background-color:green;}
"""
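
Before building an Extractor, it can help to check which elements each XPath expression selects. The sketch below does that with Python's standard-library xml.etree.ElementTree rather than Chopper itself. It assumes the markup is well-formed XML, and because ElementTree supports only a subset of XPath and requires relative paths, the helper (a hypothetical name, matched_tags) prefixes each expression with a dot.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample document; it must be well-formed XML for ElementTree.
SAMPLE = """
<html>
  <body>
    <header>This is the header</header>
    <div>
      <p><span>Main </span>content</p>
      <a href="/">See more</a>
    </div>
    <footer>This is the footer</footer>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())


def matched_tags(expression):
    """Return the tag names matched by an expression such as '//div[p]'."""
    # ElementTree only accepts relative paths, so '//div' becomes './/div'.
    return [element.tag for element in root.findall("." + expression)]


keep_matches = matched_tags("//div[p]")
discard_matches = matched_tags("//span") + matched_tags("//a")

print("keep:", keep_matches)
print("discard:", discard_matches)
```

Here '//div[p]' selects only the div that has a p child, while '//span' and '//a' select the elements that the example above removes from the output.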