#29 open
Ben VandenBos

Double encoding links

Reported by Ben VandenBos | June 25th, 2010 @ 01:59 PM | in 0.4.1

It seems that links containing %20 get double-encoded when the anchor is stripped off.

For example, if a page contains the link:

/Company%20Info/103070.aspx

It will change it to:

/Company%2520Info/103070.aspx

... which is bogus.
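A minimal reproduction sketch, assuming only Ruby's standard uri library. URI.encode no longer exists in Ruby 3.0 and later, so this calls URI::RFC2396_Parser#escape instead. As far as I can tell that is the routine URI.encode used under the hood, but treat the equivalence as an assumption on your Ruby version. The variable names are just for illustration.

```ruby
require 'uri'

# The link as it appears in the page: the space is already percent-encoded.
link = '/Company%20Info/103070.aspx'

# Escaping an already-escaped string treats the literal "%" as unsafe
# and rewrites it as "%25", which is the double encoding described above.
encoded = URI::RFC2396_Parser.new.escape(link)

puts encoded
```

The "%" in "%20" is what gets re-escaped, so any link that is already percent-encoded is affected, not only ones with spaces.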

In page.rb, line 138:

    def to_absolute(link)
      return nil if link.nil?

      # remove anchor
      link = URI.encode(link.to_s.gsub(/#[a-zA-Z0-9_-]*$/,'')) # <== this encode mucks up the url

      relative = URI(link)
      absolute = @url.merge(relative)

      absolute.path = '/' if absolute.path.empty?

      return absolute
    end

I think what you want it to be is:

    def to_absolute(link)
      return nil if link.nil?

      relative = URI(link.to_s.gsub(/#[a-zA-Z0-9_-]*$/,''))
      absolute = @url.merge(relative)

      absolute.path = '/' if absolute.path.empty?

      return absolute
    end

That gets rid of the extra encode call.
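A quick way to check the proposed change outside of Anemone is a standalone sketch like the one below. It is not the library's actual code: the `base` parameter is a hypothetical stand-in for the page's @url, and www.example.com is a placeholder host.

```ruby
require 'uri'

# Standalone version of the proposed to_absolute; `base` replaces @url
# (an assumption made so the method can run outside the Page class).
def to_absolute(base, link)
  return nil if link.nil?

  # Strip a trailing anchor, but do not re-encode the rest of the link.
  relative = URI(link.to_s.gsub(/#[a-zA-Z0-9_-]*$/, ''))
  absolute = base.merge(relative)

  absolute.path = '/' if absolute.path.empty?

  absolute
end

base = URI('http://www.example.com/')
result = to_absolute(base, '/Company%20Info/103070.aspx#top')

puts result
```

With the encode call gone, the %20 in the path should come through untouched and only the anchor is dropped.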

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.