#28 new
rb2k

100% CPU + memory filling at certain sites

Reported by rb2k | June 21st, 2010 @ 11:33 AM

I don't know what causes it yet, but some sites seem to break the crawling process. Two examples:
http://www.skiandorre.com/
http://www.hiver2018.com/

This happens on both 1.8 and 1.9, with no network load detectable during that time.

Here's a short example; let it run for about 15 seconds and the crawl should stall at 100% CPU:

require 'anemone'

Anemone.crawl("http://www.hiver2018.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

Comments and changes to this ticket

  • rb2k

    rb2k June 21st, 2010 @ 11:36 AM

Also: the sites are identical and appear to be some sort of spam;
they all lead to p*.vixns.eu.

My guess would be the XML parser going haywire.

  • rb2k

    rb2k June 21st, 2010 @ 11:47 AM

OK, I just replaced Nokogiri with Hpricot; still the same problem.

This page, however, is weird: http://www.hiver2018.com/partenaires.php
It's 34 MB and basically looks like this for 99% of its length:

    ><p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" target="_blank"></a></p><br /><p><a href="http://" t
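One possible workaround while the root cause is unknown: cap how many bytes of a response body are read before handing it to the parser, so a pathological 34 MB page can't be parsed in full. This is a generic sketch, not part of Anemone's API; `read_capped`, `MAX_BODY_BYTES`, and the StringIO stand-in for the response stream are hypothetical names for illustration.

```ruby
require 'stringio'

MAX_BODY_BYTES = 1 * 1024 * 1024  # 1 MB cap; anything larger gets truncated

# Read at most `limit` bytes from an IO-like object in fixed-size chunks,
# then stop -- the rest of the body is never pulled off the wire.
def read_capped(io, limit = MAX_BODY_BYTES, chunk_size = 16 * 1024)
  body = +""
  while body.bytesize < limit &&
        (chunk = io.read([chunk_size, limit - body.bytesize].min))
    body << chunk
  end
  body
end

# Simulate the pathological page with StringIO (no network needed):
# ~5 MB of the same repeated anchor markup seen on partenaires.php.
huge_page = StringIO.new('<p><a href="http://" target="_blank"></a></p><br />' * 100_000)
body = read_capped(huge_page)
puts body.bytesize  # => 1048576
```

Wiring something like this into the crawler would mean truncating bodies at the HTTP layer, before link extraction ever sees them.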
    


Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.
