#21 ✓resolved
rb2k

pluggable html parser

Reported by rb2k | May 4th, 2010 @ 03:16 PM

It would be nice to have the option of using Hpricot instead of Nokogiri.
Their APIs should be compatible.

In my tests, Hpricot was always slightly faster.

Comments and changes to this ticket

  • rb2k May 5th, 2010 @ 12:33 PM

    Here are some performance benchmarks.
    Switching from CSS selectors to XPath also gives a speed boost:

    http://gist.github.com/391134

    tl;dr:

    hpricot (xpath) took: 3.599083
    hpricot (xpath, no href) took: 3.283622
    hpricot (css) took: 4.996853
    nokogiri (xpath) took: 4.169071
    nokogiri (css_nocontent) took: 4.372877
    nokogiri (css) took: 4.494918
    nokogiri (xpath no href) took: 3.861592
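For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a CSS vs. XPath timing run. It is not the code from the gist: `SAMPLE_HTML`, `time_strategies`, and `nokogiri_strategies` are hypothetical names, the sample page is made up, and it assumes each run re-parses the document before selecting links. It only covers Nokogiri and quietly does nothing if the gem is not installed.

```ruby
require 'benchmark'

begin
  require 'nokogiri'
rescue LoadError
  # Nokogiri is a third-party gem; without it there is nothing to measure.
end

# Hypothetical sample page (assumption: not the input used in the gist).
SAMPLE_HTML = '<html><body>' \
  + ('<p><a href="/page">link</a> <a name="anchor">no href</a></p>' * 500) \
  + '</body></html>'

# Runs each callable `runs` times and returns { label => elapsed seconds }.
def time_strategies(strategies, runs: 20)
  strategies.each_with_object({}) do |(label, strategy), results|
    results[label] = Benchmark.realtime { runs.times { strategy.call } }
  end
end

# Link-selection strategies keyed by label; empty when Nokogiri is missing.
def nokogiri_strategies(html)
  return {} unless defined?(Nokogiri)

  {
    'nokogiri (css)'   => -> { Nokogiri::HTML(html).css('a[href]').size },
    'nokogiri (xpath)' => -> { Nokogiri::HTML(html).xpath('//a[@href]').size }
  }
end

time_strategies(nokogiri_strategies(SAMPLE_HTML)).each do |label, seconds|
  puts format('%s took: %.6f', label, seconds)
end
```

Absolute numbers will depend on the machine, the Ruby and libxml2 versions, and the input page, so only the relative ordering within one run is meaningful.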

  • chris (at chriskite) May 25th, 2010 @ 09:12 PM

    • State changed from “new” to “resolved”
    • Assigned user set to “chris (at chriskite)”

    Thanks for your patch. I've incorporated your change to parse links with XPath instead of CSS.

    I don't plan to include pluggable parser support for a couple of reasons: it increases the complexity of the codebase without giving much in return, and it potentially ties the project to a restricted API (i.e. only what is common to Nokogiri and Hpricot).
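To make the trade-off concrete, here is a rough sketch of what a pluggable link extractor could look like. This is purely illustrative and not part of Anemone: `HtmlBackends`, `register`, and `links` are hypothetical names, and the Hpricot backend assumes `search` accepts the same `//a[@href]` expression as Nokogiri's `xpath`. Each backend loads its gem lazily, so only the selected parser needs to be installed.

```ruby
# Hypothetical adapter registry; none of these names come from Anemone.
module HtmlBackends
  @backends = {}

  class << self
    # Registers a block that takes an HTML string and returns href values.
    def register(name, &extractor)
      @backends[name] = extractor
    end

    # Extracts link targets from `html` using the named backend.
    def links(html, backend:)
      extractor = @backends.fetch(backend) do
        raise ArgumentError, "unknown backend: #{backend}"
      end
      extractor.call(html)
    end
  end
end

# The gem is required only when the backend is actually used.
HtmlBackends.register(:nokogiri) do |html|
  require 'nokogiri'
  Nokogiri::HTML(html).xpath('//a[@href]').map { |a| a['href'] }
end

# Assumption: Hpricot handles this XPath-style expression the same way.
HtmlBackends.register(:hpricot) do |html|
  require 'hpricot'
  Hpricot(html).search('//a[@href]').map { |a| a['href'] }
end
```

A caller would pick a parser with something like `HtmlBackends.links(page_body, backend: :nokogiri)`. Note that the interface has to shrink to "HTML in, list of hrefs out" so that every backend can satisfy it, which is the restricted-API concern described above.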

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.
