#6 ✓resolved

crawling slowly

Reported by hayato | August 3rd, 2009 @ 05:05 PM

Anemone crawls any web server too aggressively.

Example:

require 'anemone'

Anemone.crawl("http://www.yahoo.co.jp/") do |anemone|
  anemone.on_every_page do |page|
    puts "#{Time.now} : #{page.url}"
  end
end

Output:

Tue Aug 04 06:50:53 +0900 2009 : http://www.yahoo.co.jp/
Tue Aug 04 06:50:53 +0900 2009 : http://www.yahoo.co.jp/s/41484
Tue Aug 04 06:50:53 +0900 2009 : http://www.yahoo.co.jp/r/mht
Tue Aug 04 06:50:53 +0900 2009 : http://www.yahoo.co.jp/s/41483
Tue Aug 04 06:50:53 +0900 2009 : http://www.yahoo.co.jp/s/41482
Tue Aug 04 06:50:54 +0900 2009 : http://www.yahoo.co.jp/r/c2
Tue Aug 04 06:50:54 +0900 2009 : http://www.yahoo.co.jp/r/c5
Tue Aug 04 06:50:54 +0900 2009 : http://www.yahoo.co.jp/r/c12
Tue Aug 04 06:50:54 +0900 2009 : http://www.yahoo.co.jp/r/c1
....

This script makes about 5 requests per second, which could amount to a denial-of-service attack.

Please add a feature to crawl slowly, at a regular interval per FQDN.

Comments and changes to this ticket

  • hayato

    hayato August 3rd, 2009 @ 05:11 PM

    Patch attached.

    It adds an on_pre_fetch handler.

    If this block returns false, the link is not fetched and is re-enqueued onto link_queue; see the sketch below.
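
    A hypothetical illustration of that contract (not the patch itself): a handler that allows at most one fetch per second and returns false, so the link goes back onto link_queue, when it is too early.

    last_fetch = Time.at(0)

    anemone.on_pre_fetch do |link|
      # Permit the fetch only if at least 1 second has passed since the
      # previous one; a false return re-enqueues the link on link_queue.
      ok = Time.now - last_fetch >= 1
      last_fetch = Time.now if ok
      ok
    end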


  • hayato

    hayato August 4th, 2009 @ 04:50 PM

    The term "reserved" in the patch causes confusion; "delay" is a better fit.
    I have attached a new patch.

    With this patch applied, the code behaves properly:

    require 'anemone'
    require 'uri'
    require 'anemone/delayed'

    Anemone.crawl("http://www.yahoo.co.jp/") do |anemone|
      anemone.on_every_page do |page|
        puts "#{Time.now} : #{page.url}"
      end

      delayed_crawl = DelayedCrawl.new

      # Fetch a link only when the per-host interval has elapsed;
      # otherwise it is re-enqueued and retried later.
      anemone.on_pre_fetch do |link|
        delayed_crawl.can_crawl?(link.host)
      end
    end

    Output:

    Wed Aug 05 06:36:24 +0900 2009 : http://www.yahoo.co.jp/
    Wed Aug 05 06:36:27 +0900 2009 : http://www.yahoo.co.jp/s/38667
    Wed Aug 05 06:36:30 +0900 2009 : http://www.yahoo.co.jp/s/41546
    Wed Aug 05 06:36:33 +0900 2009 : http://www.yahoo.co.jp/r/c37
    Wed Aug 05 06:36:36 +0900 2009 : http://www.yahoo.co.jp/s/gallery/list9.html
    

    This implements a 3-second interval per host; see the sketch below.
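
    The DelayedCrawl class itself lives in the attached patch and is not shown in this thread. A minimal sketch of a class with this interface, assuming it tracks the last fetch time per host with a 3-second default interval:

    # Sketch only; the real implementation is in the attached patch.
    class DelayedCrawl
      def initialize(interval = 3)
        @interval = interval
        @last_fetch = {} # host => Time of the last permitted fetch
      end

      # True if enough time has passed since the last fetch from this
      # host; a false return makes on_pre_fetch re-enqueue the link.
      def can_crawl?(host)
        now = Time.now
        last = @last_fetch[host]
        return false if last && now - last < @interval
        @last_fetch[host] = now
        true
      end
    end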

  • chris (at chriskite)

    chris (at chriskite) August 10th, 2009 @ 09:06 PM

    • State changed from “new” to “resolved”

    Hi,

    I've implemented time delay functionality in the latest release of Anemone (0.1.2). You can simply specify a :delay option when starting the crawl, like so:

    require 'anemone'
    Anemone.crawl("http://www.example.com/", :delay => 3) do |anemone|
      anemone.on_every_page do |page|
        puts "#{Time.now} : #{page.url}"
      end
    end
    
  • hayato

    hayato August 24th, 2009 @ 04:25 AM

    Your fix looks good.

    But the following tests fail:

    .F.FF.
    
    1)
    'Anemone::Page should store the response headers when fetching a page' FAILED
    expected nil? to return false, got true
    ./spec/page_spec.rb:16:
    
    2)
    'Anemone::Page should have a Nokogori::HTML::Document attribute for the page body' FAILED
    expected nil? to return false, got true
    ./spec/page_spec.rb:29:
    
    3)
    'Anemone::Page should indicate whether it was fetched after an HTTP redirect' FAILED
    expected: true,
         got: false (using ==)
    ./spec/page_spec.rb:38:
    
    Finished in 0.007746 seconds
    
    6 examples, 3 failures
    

    This is because the user agent is left unset.
    Please apply the attached patch; a sketch of the idea follows.
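
    A hedged sketch of the idea, assuming a Net::HTTP-based fetch (the actual fix is in the attached patch): always send a User-Agent header with each request.

    require 'net/http'
    require 'uri'

    uri = URI.parse("http://www.example.com/")
    http = Net::HTTP.new(uri.host, uri.port)
    # Supplying the header explicitly avoids requests with no user agent.
    response = http.get(uri.request_uri, 'User-Agent' => 'Anemone/0.1.2')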
