Pluggable HTML parser
Reported by rb2k | May 4th, 2010 @ 03:16 PM
It would be nice to have the option of using hpricot instead of nokogiri.
Their APIs should be compatible.
In my tests, hpricot was always slightly faster.
Comments and changes to this ticket
rb2k May 5th, 2010 @ 12:33 PM
Here are some performance benchmarks.
Switching from CSS to XPath also results in speed boosts. tl;dr:
- hpricot (xpath) took: 3.599083
- hpricot (xpath, no href) took: 3.283622
- hpricot (css) took: 4.996853
- nokogiri (xpath) took: 4.169071
- nokogiri (css_nocontent) took: 4.372877
- nokogiri (css) took: 4.494918
- nokogiri (xpath no href) took: 3.861592
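A comparison of this kind could be set up roughly as follows. This is only a sketch, not the script that produced the numbers in this ticket: `SAMPLE_HTML`, `nokogiri_extractors` and `time_extractors` are names made up for illustration, the input document is synthetic, and hpricot is left out. Nokogiri is required lazily, so the timing helper itself has no gem dependency.

```ruby
require 'benchmark'

# Synthetic input (an assumption; the ticket does not say which pages were parsed).
SAMPLE_HTML = '<html><body>' + ('<a href="/page">link</a><p>text</p>' * 500) + '</body></html>'

# Hypothetical helper: two link extractors that differ only in selector style.
# Each lambda takes an HTML string and returns an array of href values.
def nokogiri_extractors
  require 'nokogiri' # loaded only when this method is called
  {
    'nokogiri (css)'   => ->(html) { Nokogiri::HTML(html).css('a[href]').map { |a| a['href'] } },
    'nokogiri (xpath)' => ->(html) { Nokogiri::HTML(html).xpath('//a[@href]').map { |a| a['href'] } }
  }
end

# Runs every extractor `runs` times and returns { label => elapsed seconds }.
def time_extractors(extractors, html, runs: 100)
  extractors.each_with_object({}) do |(label, extractor), timings|
    timings[label] = Benchmark.realtime { runs.times { extractor.call(html) } }
  end
end

# Example run (needs the nokogiri gem installed):
#   time_extractors(nokogiri_extractors, SAMPLE_HTML).each do |label, secs|
#     puts format('%s took: %.6f', label, secs)
#   end
```

Results will depend on the documents parsed and on the library versions, so any timings from this sketch are not comparable to the figures reported here.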
chris (at chriskite) May 25th, 2010 @ 09:12 PM
- State changed from new to resolved
- Assigned user set to chris (at chriskite)
Thanks for your patch. I've incorporated your change to parse links with XPath instead of CSS.
I don't plan to include pluggable parser support, for a couple of reasons: it increases the complexity of the codebase without giving much in return, and it potentially ties the project to a restricted API (i.e. only what is common to Nokogiri and hpricot).
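To make the trade-off concrete, here is a minimal sketch of what a pluggable parser layer might look like. None of this is Anemone code: `LinkParser`, `RegexAdapter` and `NokogiriAdapter` are hypothetical names, and the regex backend is only a stand-in so the example runs without gems. The point is that every backend has to fit one shared interface, so only operations common to all backends can be exposed.

```ruby
# Hypothetical adapter registry. Every backend must implement .links(html),
# so the shared interface can only offer what all backends support.
module LinkParser
  ADAPTERS = {}

  def self.register(name, adapter)
    raise ArgumentError, "#{name} must respond to #links" unless adapter.respond_to?(:links)
    ADAPTERS[name] = adapter
  end

  # Raises KeyError if no adapter was registered under `parser`.
  def self.links(html, parser: :regex)
    ADAPTERS.fetch(parser).links(html)
  end
end

# Stand-in backend with no dependencies. A regex is not a real HTML parser;
# it is here only to keep the sketch self-contained.
module RegexAdapter
  def self.links(html)
    html.scan(/<a\s[^>]*href="([^"]*)"/i).flatten
  end
end

# What a Nokogiri-backed adapter could look like; the gem is loaded lazily.
module NokogiriAdapter
  def self.links(html)
    require 'nokogiri'
    Nokogiri::HTML(html).xpath('//a[@href]').map { |a| a['href'] }
  end
end

LinkParser.register(:regex, RegexAdapter)
LinkParser.register(:nokogiri, NokogiriAdapter)
```

Calling `LinkParser.links(html, parser: :nokogiri)` would then switch backends. Any feature that only one library provides would either have to stay out of the interface or be reimplemented for the others, which is the extra complexity described above.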
Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.