No response code or referer after running for hours
Reported by michael.harrington | July 26th, 2010 @ 01:09 PM
OS X Snow Leopard
Ruby 1.8.7 (System Ruby)
require 'rubygems'
require 'bundler'
Bundler.setup
require 'anemone'

files = {}

Anemone.crawl 'http://local.acton.org' do |anemone|
  anemone.on_every_page do |page|
    puts "#{page.code}: #{page.url} (#{page.referer})"
    files[page.code] ||= File.open "internal_#{page.code || 'unknown'}s.txt", 'w'
    files[page.code] << "#{page.url} (#{page.referer})\n"
    files[page.code].flush
  end
  anemone.skip_links_like /login\?/
end

files.each do |code, file|
  file.close
end
For the better part of an hour, everything moves along smoothly, but at a certain point it looks like one -- and then all -- of the tentacles starts giving me nil response codes and referers.
I see this kind of console output:
----
302: https://local.acton.org/it/user/login?destination=global%252Farticles-it (http://local.acton.org/it/global/articles-it)
302: http://local.acton.org/it/user/login?destination=global%252Farticles-it (http://local.acton.org/it/global/articles-it)
200: http://local.acton.org/it/support/donating-appreciated-assets (http://local.acton.org/it/index/support)
200: http://local.acton.org/it/global/articles-it?page=1 (http://local.acton.org/it/global/articles-it)
200: http://local.acton.org/it/global/articles-it?page=2 (http://local.acton.org/it/global/articles-it)
: http://local.acton.org/it/global/article/lettera-dal-direttore-maggio-2010-it ()
200: http://local.acton.org/it/global/articles-it?page=3 (http://local.acton.org/it/global/articles-it)
: http://local.acton.org/it/global/article/lettera-dal-direttore-aprile-2010-it ()
: http://local.acton.org/it/global/article/il-profeta-jim-wallis-e-la-chiesa-dell%25E2%2580%2599ignoranza-e-it ()
: http://local.acton.org/it/global/article/la-scienza-della-custodia-peccato-sostenibilit%25C3%25A0-e-it ()
: http://local.acton.org/it/global/article/lettera-dal-direttore-it-0 ()
: http://local.acton.org/it/global/article/due-evviva-ai-vescovi-inglesi-e-gallesi-it ()
----
And the empty codes/referers continue for hundreds of pages.
Any idea what's going on or how to fix it?
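One way to narrow this down might be to log whatever exception was recorded for the pages that come back with a nil code. The sketch below assumes the installed Anemone version exposes `page.error` (the exception raised while fetching, if any); `describe_page` is a hypothetical helper for illustration, not part of the library.

```ruby
# Hypothetical helper: builds one log line for a crawled page.
# Assumes the page object responds to #code, #url and #referer, and
# optionally to #error (the exception captured during the fetch, if any).
def describe_page(page)
  line = "#{page.code || 'unknown'}: #{page.url} (#{page.referer})"
  error = page.respond_to?(:error) ? page.error : nil
  # Append the exception class and message only when a fetch error was recorded.
  line += " error=#{error.class}: #{error.message}" if error
  line
end
```

Inside the `on_every_page` block, `puts describe_page(page)` would replace the existing `puts` line, so that failed fetches show their cause instead of an empty code.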
Comments and changes to this ticket
-
michael.harrington July 29th, 2010 @ 10:05 AM
I changed my crawl to use 2 threads and discard page bodies, which appears to avoid this issue.
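For anyone reading along, the workaround described above could look roughly like this. It is a sketch, assuming the `:threads` and `:discard_page_bodies` options are passed as the second argument to `Anemone.crawl`; the constant `CRAWL_OPTIONS` and the wrapper method `crawl_with_reduced_memory` are hypothetical names used only for this example.

```ruby
# Options matching the workaround: fewer threads, and page bodies are
# thrown away once they have been scanned for links.
CRAWL_OPTIONS = {
  :threads => 2,
  :discard_page_bodies => true
}

# Hypothetical wrapper around the crawl from the original report.
def crawl_with_reduced_memory(url, options = CRAWL_OPTIONS)
  # Loaded lazily so the options hash can be inspected without the gem installed.
  require 'anemone'

  Anemone.crawl(url, options) do |anemone|
    anemone.skip_links_like(/login\?/)
    anemone.on_every_page do |page|
      puts "#{page.code}: #{page.url} (#{page.referer})"
    end
  end
end
```

Calling `crawl_with_reduced_memory('http://local.acton.org')` would then run the same crawl with the reduced settings. Discarding bodies means `page.body` and `page.doc` are no longer available after the crawl, which is fine here because only the code, URL, and referer are logged.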
-
chris (at chriskite) July 30th, 2010 @ 07:33 PM
- State changed from new to open
-
chris (at chriskite) July 30th, 2010 @ 07:34 PM
Any idea what the memory usage was like on your system towards the end of the crawl? Which storage engine are you using, the default in-memory hash or TokyoCabinet?
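For context, the script in the report does not set a storage engine, so it would be using the default in-memory hash. Switching to TokyoCabinet might look like the sketch below. It assumes the `tokyocabinet` gem is installed and that the Anemone version in use provides `Anemone::Storage.TokyoCabinet` and accepts a `:storage` option; the method name `crawl_with_disk_storage` and the file name `crawl.tch` are made up for this example.

```ruby
# Hypothetical wrapper: runs the crawl with pages kept in an on-disk
# TokyoCabinet database instead of the default in-memory hash.
def crawl_with_disk_storage(url, db_file = 'crawl.tch')
  # Loaded lazily; requires the anemone gem plus the tokyocabinet gem.
  require 'anemone'

  # Assumption: Storage.TokyoCabinet takes the path of a .tch database file.
  storage = Anemone::Storage.TokyoCabinet(db_file)

  Anemone.crawl(url, :storage => storage) do |anemone|
    anemone.skip_links_like(/login\?/)
    anemone.on_every_page do |page|
      puts "#{page.code}: #{page.url} (#{page.referer})"
    end
  end
end
```

With an on-disk store, memory use should depend less on how many pages have been crawled, which is presumably why the distinction matters for a crawl that degrades after running for an hour.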
Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.