#23 Allow root domain redirects - Anemone

#23 new

Allow root domain redirects

Reported by rb2k | May 6th, 2010 @ 02:33 PM

Anemone isn't currently able to crawl a site if the root domain redirects to another (sub)domain.

An example of this would be http://heise.de, which redirects to http://www.heise.de:

$ curl -I heise.de
HTTP/1.1 301 Moved Permanently
Location: http://www.heise.de/

It would be nice to allow Anemone to follow these initial redirects, maybe by setting a special parameter in the options hash.

A modification in the allowed?() method of /lib/anemone/http.rb would probably work.
This is the current one:

#
# Allowed to connect to the requested url?
#
def allowed?(to_url, from_url)
  to_url.host.nil? || (to_url.host == from_url.host)
end

Whether this option makes sense is, however, a design decision.
What would be your take on this?
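
To make the proposal concrete, here is a minimal sketch of what a relaxed host check could look like. It is a standalone function, not Anemone's code: the name redirect_allowed? and the allow_subdomain_redirects flag are assumptions for illustration, and the real method lives inside Anemone's HTTP class and would read the flag from the options hash.

```ruby
require 'uri'

# Hypothetical variant of the allowed? check quoted above.
# allow_subdomain_redirects is an assumed option name, not an existing Anemone option.
def redirect_allowed?(to_url, from_url, allow_subdomain_redirects: false)
  # Same behaviour as the current check: relative URLs and identical hosts pass.
  return true if to_url.host.nil? || to_url.host == from_url.host
  return false unless allow_subdomain_redirects

  # Additionally accept hosts where one is a subdomain of the other,
  # e.g. heise.de -> www.heise.de (or the reverse).
  to_url.host.end_with?(".#{from_url.host}") ||
    from_url.host.end_with?(".#{to_url.host}")
end

from = URI("http://heise.de/")
to   = URI("http://www.heise.de/")
puts redirect_allowed?(to, from)
puts redirect_allowed?(to, from, allow_subdomain_redirects: true)
```

With the flag left at its default the behaviour is unchanged, so existing crawls would not start wandering onto other hosts unless the user opts in.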

Comments and changes to this ticket

Alex Johnson November 18th, 2011 @ 01:51 AM
In case you or someone else is still facing this problem, the following solution can be applied.

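```
require 'anemone'

Anemone.crawl("http://heise.de") do |anemone|
  anemone.on_every_page do |page|
    # A 301 means the page moved: start a new crawl at the redirect target
    if page.code == 301
      puts Anemone.crawl(page.redirect_to)
    end
  end
end
```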
This code is intended to give the logic behind the idea. on_every_page yields a page object that has the methods code and redirect_to. One can use them to crawl the redirected page accordingly.
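
Another way to apply the same idea is to resolve the redirect before the crawl starts and hand Anemone the final URL. The following is only a sketch using Ruby's standard Net::HTTP; the helper name resolve_start_url and its limit and fetch parameters are made up for this example and are not part of Anemone.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: follow redirects from url and return the final URI.
# fetch is injectable so the logic can be tried without network access;
# by default it performs a real GET via Net::HTTP.
def resolve_start_url(url, limit: 5, fetch: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI(url)
  limit.times do
    response = fetch.call(uri)
    location = response['location']
    # Stop as soon as the response is not a redirect with a usable target.
    return uri unless response.is_a?(Net::HTTPRedirection) && location

    # URI#merge also handles relative Location headers.
    uri = uri.merge(location)
  end
  raise "too many redirects for #{url}"
end

# Intended use (needs the anemone gem and network access):
#   Anemone.crawl(resolve_start_url("http://heise.de").to_s) do |anemone|
#     anemone.on_every_page { |page| puts page.url }
#   end
```

Since the crawl then starts on the redirected host, the same-host check in allowed? no longer gets in the way, and no nested crawl is needed.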

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.
