#15 open
Nilesh

Canonicalize URLs

Reported by Nilesh | December 16th, 2009 @ 11:14 PM | in 0.5.0

It would be great if Anemone could handle canonical URLs.

For example, if we enable canonicalization,

  http://www.example.com/index.html
  http://www.example.com/index.asp
  http://www.example.com/index.php
  http://www.example.com/index.cgi?a=2&b=3
  http://www.example.com/index.cgi?foo=bar

should all resolve to:

  http://www.example.com/

and Anemone should not download the duplicate pages.
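
To make the request concrete, here is a minimal sketch of the kind of rule I have in mind. It uses only Ruby's standard `uri` and `set` libraries. `canonicalize_index` and `unique_by_canonical` are hypothetical helper names, not part of Anemone, and the assumption that every `index.*` file is the same page as its directory holds only for sites configured that way, so this would have to be opt-in.

```ruby
require 'uri'
require 'set'

# Hypothetical helper (not an Anemone API): map any "index.*" page,
# with or without a query string, onto its directory URL.
# URLs that do not end in "index.<ext>" are returned unchanged.
def canonicalize_index(url)
  uri = URI.parse(url)
  if uri.path =~ %r{/index\.[^/]+\z}
    uri.path  = uri.path.sub(%r{/index\.[^/]+\z}, '/')
    uri.query = nil # assumption: the query does not change the page
  end
  uri.to_s
end

# Hypothetical helper: keep only the first URL seen for each canonical
# form, so that duplicates are never fetched.
def unique_by_canonical(urls)
  seen = Set.new
  urls.select { |u| seen.add?(canonicalize_index(u)) }
end

urls = %w[
  http://www.example.com/index.html
  http://www.example.com/index.asp
  http://www.example.com/index.php
  http://www.example.com/index.cgi?a=2&b=3
  http://www.example.com/index.cgi?foo=bar
]

puts unique_by_canonical(urls)
```

With the five URLs above, only the first one survives the filter, since all five share the same canonical form.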

Another example:

  http://www.example.com/search.php?q=how+to+spider
  http://www.example.com/search.php?q=how+to+not+spider

should reduce to:

  http://www.example.com/search.php
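
A rough sketch of this second rule, again using only the standard `uri` library. `strip_query` is a hypothetical name, not something Anemone provides. Dropping the query string treats every search result as the same page, so this only makes sense when the user explicitly enables it.

```ruby
require 'uri'

# Hypothetical helper (not an Anemone API): drop the query string and
# fragment so that all variants of a script URL compare equal.
def strip_query(url)
  uri = URI.parse(url)
  uri.query    = nil
  uri.fragment = nil
  uri.to_s
end

puts strip_query('http://www.example.com/search.php?q=how+to+spider')
puts strip_query('http://www.example.com/search.php?q=how+to+not+spider')
```

Both calls print the same URL, so a crawler comparing canonical forms would fetch `search.php` only once.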


Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.
