Double encoding links
Reported by Ben VandenBos | June 25th, 2010 @ 01:59 PM | in 0.4.1
It seems that links containing %20 get double-encoded when Anemone strips the anchor off.
For example, if a page contains the link:
/Company%20Info/103070.aspx
It will change it to:
/Company%2520Info/103070.aspx
... which is bogus.
In page.rb, line 138:
    def to_absolute(link)
      return nil if link.nil?
      # remove anchor
      link = URI.encode(link.to_s.gsub(/#[a-zA-Z0-9_-]*$/,'')) # <== this encode mucks up the url
      relative = URI(link)
      absolute = @url.merge(relative)
      absolute.path = '/' if absolute.path.empty?
      return absolute
    end
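The double encoding can be reproduced outside Anemone with a minimal sketch. URI.encode was removed in Ruby 3.0, so this sketch calls URI::RFC2396_Parser#escape instead, on the assumption that it applies the same escaping rules URI.encode used. The key point is that "%" is not in the parser's safe character set, so an already-encoded "%20" becomes "%2520".

```ruby
require 'uri'

# Assumption: RFC2396_Parser#escape stands in for the removed URI.encode.
parser = URI::RFC2396_Parser.new

# The link from the example above, with an anchor added for illustration.
link = '/Company%20Info/103070.aspx#top'

# Same anchor-stripping regex as in page.rb.
stripped = link.gsub(/#[a-zA-Z0-9_-]*$/, '')

# Escaping an already-escaped path encodes the "%" itself.
encoded = parser.escape(stripped)

puts stripped
puts encoded
```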
I think what you want it to be is:
    def to_absolute(link)
      return nil if link.nil?
      relative = URI(link.to_s.gsub(/#[a-zA-Z0-9_-]*$/,''))
      absolute = @url.merge(relative)
      absolute.path = '/' if absolute.path.empty?
      return absolute
    end
That gets rid of the extra encode call.
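A quick way to check the proposed version is to run it as a standalone method. This is only a sketch: the base URL is passed in as a parameter (hypothetical, standing in for the page's @url), and www.example.com is a placeholder host.

```ruby
require 'uri'

# Standalone variant of the proposed to_absolute.
# Assumption: `base` replaces the @url instance variable from page.rb.
def to_absolute(base, link)
  return nil if link.nil?
  # Strip the anchor, but do not re-encode the link.
  relative = URI(link.to_s.gsub(/#[a-zA-Z0-9_-]*$/, ''))
  absolute = base.merge(relative)
  absolute.path = '/' if absolute.path.empty?
  absolute
end

base = URI('http://www.example.com/index.html')
result = to_absolute(base, '/Company%20Info/103070.aspx#top')

puts result
```

With the encode call gone, the existing %20 passes through untouched. One trade-off worth keeping in mind: a link containing a raw, unencoded space would make URI() raise URI::InvalidURIError, so such links may need separate handling.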
Comments and changes to this ticket
chris (at chriskite) July 30th, 2010 @ 07:36 PM
- Assigned user set to chris (at chriskite)
- State changed from new to open
- Milestone set to 0.4.1
- Milestone order changed from 0 to 0
Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.