Canonicalize URLs
Reported by Nilesh | December 16th, 2009 @ 11:14 PM | in 0.5.0
It would be great if anemone could handle canonical URLs.
For example, if we enable canonicalization,
http://www.example.com/index.html
http://www.example.com/index.asp
http://www.example.com/index.php
http://www.example.com/index.cgi?a=2&b=3
http://www.example.com/index.cgi?foo=bar
should all resolve to:
http://www.example.com/
and anemone should not download the duplicate pages.
Another example:
http://www.example.com/search.php?q=how+to+spider
http://www.example.com/search.php?q=how+to+not+spider
should reduce to:
http://www.example.com/search.php
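To make the requested behaviour concrete, here is a rough sketch of the rule the two examples above seem to imply: drop the query string and collapse a trailing index.* file name to its directory. The helper name naive_canonicalize is hypothetical and is not part of anemone; it only uses Ruby's standard uri library.

```ruby
require 'uri'

# Hypothetical helper, not part of anemone.
# Assumption: "canonical" here means "no query string, no fragment,
# and /index.<ext> treated as the directory root".
def naive_canonicalize(url)
  uri = URI.parse(url)
  uri.query = nil     # discard ?a=2&b=3 and similar
  uri.fragment = nil  # discard #anchors
  # Collapse a trailing index.html, index.php, index.cgi, ... to "/"
  uri.path = uri.path.sub(%r{/index\.[A-Za-z0-9]+\z}, '/')
  uri.path = '/' if uri.path.empty?
  uri.to_s
end

puts naive_canonicalize('http://www.example.com/index.cgi?a=2&b=3')
puts naive_canonicalize('http://www.example.com/search.php?q=how+to+spider')
```

Note that a rule like this treats every query string as insignificant, so distinct search result pages would be merged into one.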
Comments and changes to this ticket
-
chris (at chriskite) January 22nd, 2010 @ 08:12 PM
- State changed from new to open
- Assigned user set to chris (at chriskite)
-
chris (at chriskite) May 25th, 2010 @ 09:16 PM
- Milestone set to 0.5.0
I think this makes sense if we utilize the rel=canonical tag on a page. The same script with a different query string should be considered as a different page unless the webpage itself tells us otherwise.
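As an illustration of the rel=canonical idea, the following is a minimal sketch of how a crawler might read the canonical URL a page declares and fall back to the page's own URL otherwise. The helper name canonical_url is hypothetical and is not anemone's API. To stay self-contained it scans for link tags with a regular expression, which is a simplification; a real implementation would more likely use an HTML parser.

```ruby
require 'uri'

# Hypothetical helper, not part of anemone.
# Returns the absolute URL from the first <link rel="canonical" href="...">
# tag in the HTML, or page_url unchanged if the page declares none.
# Assumption: rel holds the single token "canonical" and href is quoted.
def canonical_url(page_url, html)
  html.scan(/<link\b[^>]*>/i).each do |tag|
    next unless tag =~ /\brel\s*=\s*["']?canonical["']?/i
    href = tag[/\bhref\s*=\s*["']([^"']+)["']/i, 1]
    # Resolve relative hrefs against the URL the page was fetched from
    return URI.join(page_url, href).to_s if href
  end
  page_url
end

html = '<html><head><link rel="canonical" href="/"></head></html>'
puts canonical_url('http://www.example.com/index.cgi?foo=bar', html)
puts canonical_url('http://www.example.com/search.php?q=how+to+spider', '<html></html>')
```

With this approach, two URLs are treated as the same page only when the site says so, and a page without the tag keeps its full URL, query string included.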
-
Lee Hambley September 7th, 2010 @ 09:53 AM
- Milestone order changed from 0 to 0
Agreed completely, if there's a canonical URI/L specified in the source, it should be used (or at least made available).
Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.