crawling slowly
Reported by hayato | August 3rd, 2009 @ 05:05 PM
Anemone is too aggressive when crawling any web server.
Example:
require 'anemone'

Anemone.crawl("http://www.yahoo.co.jp/") do |anemone|
  anemone.on_every_page do |page|
    puts "#{Time.now} : #{page.url}"
  end
end
Result:
Tue Aug 04 06:50:53 +0900 2009 : http://www.yahoo.co.jp/
Tue Aug 04 06:50:53 +0900 2009 : http://www.yahoo.co.jp/s/41484
Tue Aug 04 06:50:53 +0900 2009 : http://www.yahoo.co.jp/r/mht
Tue Aug 04 06:50:53 +0900 2009 : http://www.yahoo.co.jp/s/41483
Tue Aug 04 06:50:53 +0900 2009 : http://www.yahoo.co.jp/s/41482
Tue Aug 04 06:50:54 +0900 2009 : http://www.yahoo.co.jp/r/c2
Tue Aug 04 06:50:54 +0900 2009 : http://www.yahoo.co.jp/r/c5
Tue Aug 04 06:50:54 +0900 2009 : http://www.yahoo.co.jp/r/c12
Tue Aug 04 06:50:54 +0900 2009 : http://www.yahoo.co.jp/r/c1
....
This script issues about 5 requests per second, which could amount to a denial-of-service attack.
Please add a feature to crawl slowly, at a regular interval per FQDN.
Comments and changes to this ticket
-
hayato August 3rd, 2009 @ 05:11 PM
Patch attached.
It adds an on_pre_fetch handler. If the block returns false, the link is not fetched and is re-enqueued to link_queue.
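For context, the hook could be wired into the crawler's fetch loop roughly like this. This is a sketch based on the description above, not the attached patch; the @on_pre_fetch variable and the fetch_page/process names are assumptions:

# Sketch of a fetch loop honoring the proposed hook (hypothetical names).
loop do
  link = link_queue.deq
  break if link == :END                          # sentinel to stop the loop
  if @on_pre_fetch && @on_pre_fetch.call(link) == false
    link_queue.enq(link)                         # declined: re-enqueue for later
    next
  end
  process(fetch_page(link))                      # normal fetch path
end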
-
hayato August 4th, 2009 @ 04:41 PM
Term "reserved" in patch causes confusion.
"Delay" is felicity term.I attached the new patch.
if apply this patch,this code behavior improve.
require 'anemone'
require 'uri'
require 'anemone/delayed'

Anemone.crawl("http://www.yahoo.co.jp/") do |anemone|
  anemone.on_every_page do |page|
    puts "#{Time.now} : #{page.url}"
  end

  delayed_crawl = DelayedCrawl.new
  anemone.on_pre_fetch do |link|
    delayed_crawl.can_crawl?(link.host)
  end
end
Output:
Wed Aug 05 06:36:24 +0900 2009 : http://www.yahoo.co.jp/
Wed Aug 05 06:36:27 +0900 2009 : http://www.yahoo.co.jp/s/38667
Wed Aug 05 06:36:30 +0900 2009 : http://www.yahoo.co.jp/s/41546
Wed Aug 05 06:36:33 +0900 2009 : http://www.yahoo.co.jp/r/c37
Wed Aug 05 06:36:36 +0900 2009 : http://www.yahoo.co.jp/s/gallery/list9.html
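The patch itself is only attached, not inlined. For reference, here is a minimal sketch of what a per-host DelayedCrawl could look like; only the can_crawl?(host) interface comes from the example above, while the 3-second default and the internals are assumptions:

# Minimal per-host rate limiter (a sketch; the attached patch may differ).
class DelayedCrawl
  def initialize(interval = 3)
    @interval   = interval   # minimum seconds between fetches of the same host
    @last_fetch = {}         # host => Time of the last permitted fetch
  end

  # True if enough time has passed since this host was last fetched;
  # records the fetch time as a side effect when it permits a crawl.
  def can_crawl?(host)
    now = Time.now
    if @last_fetch[host].nil? || now - @last_fetch[host] >= @interval
      @last_fetch[host] = now
      true
    else
      false
    end
  end
end

Combined with the re-enqueue behavior of on_pre_fetch described above, a false return simply pushes the link to the back of the queue until the interval has elapsed.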
-
hayato August 4th, 2009 @ 04:50 PM
Sorry for my garbled comment above; the code and output are the same as posted there. The patch enforces a 3-second interval per host.
-
chris (at chriskite) August 10th, 2009 @ 09:06 PM
- State changed from new to resolved
Hi,
I've implemented time delay functionality in the latest release of Anemone (0.1.2). You can simply specify a :delay option when starting the crawl, like so:
require 'anemone'

Anemone.crawl("http://www.example.com/", :delay => 3) do |anemone|
  anemone.on_every_page do |page|
    puts "#{Time.now} : #{page.url}"
  end
end
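Conceptually, such an option just inserts a pause between successive fetches. A rough sketch of the idea, not Anemone's actual source:

# Core idea behind a :delay option (illustrative only; fetch is a
# stand-in for the real HTTP request).
def crawl_with_delay(urls, delay)
  urls.each do |url|
    page = fetch(url)
    yield page if block_given?
    sleep(delay) if delay && delay > 0   # pause before the next request
  end
end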
-
hayato August 24th, 2009 @ 04:25 AM
Your fix looks good.
However, the following tests fail:
.F.FF.

1) 'Anemone::Page should store the response headers when fetching a page' FAILED
   expected nil? to return false, got true
   ./spec/page_spec.rb:16:

2) 'Anemone::Page should have a Nokogori::HTML::Document attribute for the page body' FAILED
   expected nil? to return false, got true
   ./spec/page_spec.rb:29:

3) 'Anemone::Page should indicate whether it was fetched after an HTTP redirect' FAILED
   expected: true, got: false (using ==)
   ./spec/page_spec.rb:38:

Finished in 0.007746 seconds
6 examples, 3 failures
This is because the user agent is unset.
Please apply the attached patch.
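For illustration: the specs expect requests to carry a User-Agent header. With Ruby's Net::HTTP that header is set like this; the UA string here is just a placeholder:

require 'net/http'
require 'uri'

uri  = URI.parse('http://www.example.com/')
http = Net::HTTP.new(uri.host, uri.port)
get  = Net::HTTP::Get.new(uri.request_uri)
get['User-Agent'] = 'Anemone/0.1.2'   # placeholder UA string
response = http.request(get)
puts response['Content-Type']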