#23 new
rb2k

Allow root domain redirects

Reported by rb2k | May 6th, 2010 @ 02:33 PM

Anemone currently cannot crawl a site whose root domain redirects to another (sub)domain.

An example is http://heise.de, which redirects to http://www.heise.de:

$ curl -I heise.de
HTTP/1.1 301 Moved Permanently
Location: http://www.heise.de/

It would be nice to allow Anemone to follow these initial redirects, maybe by setting a special parameter in the options hash.

A modification in the allowed?() method of /lib/anemone/http.rb would probably work.
This is the current one:

#
# Allowed to connect to the requested url?
#
def allowed?(to_url, from_url)
  to_url.host.nil? || (to_url.host == from_url.host)
end

Whether this option makes sense is a design decision, though.
What is your take on this?
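To make the proposal concrete, here is a minimal sketch of what a relaxed check could look like. It is not Anemone's actual code: the `normalize_host` helper and the `allow_www_redirect` flag are names assumed for illustration, and the sketch only treats a leading "www." as equivalent to the bare domain.

```ruby
require 'uri'

# Hypothetical helper, not part of Anemone: strips a leading "www."
# so that heise.de and www.heise.de compare as the same site.
def normalize_host(host)
  host.downcase.sub(/\Awww\./, '')
end

# Sketch of a relaxed allowed? check. The allow_www_redirect flag is an
# assumed option name; with it left off, behaviour matches the current method.
def allowed?(to_url, from_url, allow_www_redirect = false)
  return true if to_url.host.nil? || to_url.host == from_url.host
  return false unless allow_www_redirect

  normalize_host(to_url.host) == normalize_host(from_url.host)
end

from = URI('http://heise.de')
to   = URI('http://www.heise.de/')
puts allowed?(to, from)        # strict check, as in the current method
puts allowed?(to, from, true)  # relaxed check
```

Keeping the flag off by default would leave existing crawls unchanged, which seems the safer choice if the option is added at all.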

Comments and changes to this ticket

  • Alex Johnson

    Alex Johnson November 18th, 2011 @ 01:51 AM


    In case you or someone else is still facing this problem, the following solution can be applied.

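    Anemone.crawl("http://heise.de") do |anemone|
      anemone.on_every_page do |page|
        # A 301 here means the redirect was not followed by the crawler
        if page.code == 301
          # Start a second crawl at the redirect target
          Anemone.crawl(page.redirect_to)
        end
      end
    end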

    This code is intended to give the logic behind the idea. on_every_page yields a page object that has the methods code and redirect_to. One can use them to crawl the redirected page accordingly.
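    An alternative is to resolve the redirect before the crawl starts, so that only one crawl is needed. The sketch below assumes a hypothetical helper, `resolve_start_url`, built on Ruby's standard Net::HTTP; it is not part of Anemone, and the `fetch` parameter exists only so the logic can be tried without network access.

```ruby
require 'uri'
require 'net/http'

# Hypothetical helper, not part of Anemone: follows up to `limit` redirects
# from the given URL and returns the final URI. `fetch` is any callable that
# takes a URI and returns a Net::HTTPResponse; it defaults to a real request.
def resolve_start_url(url, limit = 5, fetch = ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI(url)
  limit.times do
    response = fetch.call(uri)
    # Stop as soon as the response is not a redirect with a Location header
    return uri unless response.is_a?(Net::HTTPRedirection) && response['location']

    # merge resolves relative Location values against the current URI
    uri = uri.merge(response['location'])
  end
  uri
end

# Intended use (requires the anemone gem and network access):
#   Anemone.crawl(resolve_start_url("http://heise.de")) do |anemone|
#     ...
#   end
```

    Because the crawl then starts on the redirected host, the existing same-host check in allowed? no longer gets in the way.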


Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.
