100% cpu + memory filling at certain sites
Reported by rb2k | June 21st, 2010 @ 11:33 AM
I don't know what causes it yet, but some sites seem to break the crawling process.
Two examples:
http://www.skiandorre.com/
http://www.hiver2018.com/
This happens on Ruby 1.8 and 1.9.
There is no detectable network load while it happens.
Here's a short example. Let it run for about 15 seconds; the output should stop and the process should go to 100% CPU:
require 'anemone'

Anemone.crawl("http://www.hiver2018.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end
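To keep a reproduction run from pinning the CPU indefinitely, the crawl can be wrapped in a deadline. This is only a sketch: `run_with_deadline` is a hypothetical helper, not part of Anemone, and Ruby's `Timeout` may not interrupt code that is stuck inside a native extension, so treat the limit as best effort.

```ruby
require 'timeout'

# Hypothetical helper (not part of Anemone): runs the block and gives up
# after `seconds`. Returns :finished or :timed_out.
def run_with_deadline(seconds)
  Timeout.timeout(seconds) { yield }
  :finished
rescue Timeout::Error
  :timed_out
end

# Usage, assuming the anemone gem is installed:
#
#   require 'anemone'
#   result = run_with_deadline(60) do
#     Anemone.crawl("http://www.hiver2018.com/") do |anemone|
#       anemone.on_every_page { |page| puts page.url }
#     end
#   end
#   puts "crawl #{result}"
```

The 60-second limit is an arbitrary choice; anything comfortably longer than the 15 seconds mentioned above should do.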
Comments and changes to this ticket
-
rb2k June 21st, 2010 @ 11:36 AM
Also: the sites are all the same and look like some sort of spam...
They all lead to p*.vixns.eu
My guess would be an XML parser going haywire.
-
rb2k June 21st, 2010 @ 11:47 AM
OK, I just replaced Nokogiri with Hpricot; still the same problem.
This page is weird, however: http://www.hiver2018.com/partenaires.php
It's 34 MB and basically looks like this 99% of the time, repeated over and over:
<p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" target="_blank"></a></p><br />
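To get a feel for the scale, here is a rough sketch that estimates how many of those empty links a page of that size would contain, and shows one way such links could be filtered out. It assumes the page is nothing but the repeated markup quoted above. `estimated_repetitions` and `usable_link?` are hypothetical helpers for illustration; they are not part of Anemone, Nokogiri, or Hpricot, and this is not a confirmed diagnosis of the hang.

```ruby
require 'uri'

# Rough number of repetitions in a page of `page_bytes` bytes, assuming
# the page consists only of `unit` (one repetition of the quoted markup).
def estimated_repetitions(page_bytes,
                          unit = '<p><a href="http://" target="_blank"></a></p><br />')
  page_bytes / unit.bytesize
end

# Hypothetical filter: true only for hrefs that actually name a host.
# Depending on the Ruby version, URI.parse("http://") either raises or
# returns a URI with a nil/empty host; all three cases count as unusable.
def usable_link?(href)
  host = URI.parse(href).host
  !host.nil? && !host.empty?
rescue URI::Error
  false
end

puts estimated_repetitions(34 * 1024 * 1024)
puts usable_link?('http://')
puts usable_link?('http://www.hiver2018.com/')
```

Under that assumption the page holds several hundred thousand anchors, none of which points anywhere, so whatever extracts and normalizes links has a lot of useless work to do on this one page.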
Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.