Doesn't handle redirection properly.
Reported by urbanadventurer | October 5th, 2009 @ 09:45 AM
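Anemone doesn't handle redirection properly. Crawling http://treshna.com/, which answers with a 302 redirect to http://www.treshna.com, yields a single page with no status code, no headers, and no links: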
irb(main):171:0> Anemone.crawl("http://treshna.com/") do |a|
irb(main):172:1> a.on_every_page do |x|
irb(main):173:2* pp x
irb(main):174:2> end
irb(main):175:1> end
#<Anemone::Page:0xb758bd8c
 @aliases=[], @code=nil, @data=#, @depth=0, @headers=nil, @links=[], @referer=nil, @url=#<URI::HTTP:0xb758d7cc URL:http://treshna.com/>>
=> #<Anemone::Core:0xb758d998 @urls=[#<URI::HTTP:0xb758d7cc URL:http://treshna.com/>], @skip_link_patterns=[], @pages={"http://treshna.com/"=>#<Anemone::Page:0xb758bd8c @links=[], @referer=nil, @url=#<URI::HTTP:0xb758d7cc URL:http://treshna.com/>, @data=#, @aliases=[], @headers=nil, @code=nil, @depth=0>}, @on_pages_like_blocks={}, @tentacles=[#<Thread:0xb758d5b0 dead>, #<Thread:0xb758d4fc dead>, #<Thread:0xb758d448 dead>, #<Thread:0xb758d394 dead>], @on_every_page_blocks=[#<Proc:0xb758e550@(irb):172>], @after_crawl_blocks=[]>
curl -vv treshna.com
* About to connect() to treshna.com port 80 (#0)
*   Trying 210.48.71.196... connected
* Connected to treshna.com (210.48.71.196) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.18.2 (i486-pc-linux-gnu) libcurl/7.18.2 OpenSSL/0.9.8g zlib/1.2.3.3 libidn/1.10
> Host: treshna.com
> Accept: */*
>
< HTTP/1.1 302 Found
< Date: Mon, 05 Oct 2009 14:44:02 GMT
< Server: Apache/2.2.9 (Debian) PHP/5.2.6-1+lenny3 with Suhosin-Patch proxy_html/3.0.0 mod_ssl/2.2.9 OpenSSL/0.9.8g
< Location: http://www.treshna.com
< Content-Length: 366
< Content-Type: text/html; charset=iso-8859-1
<
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://www.treshna.com">here</a>.</p>
<hr>
<address>Apache/2.2.9 (Debian) PHP/5.2.6-1+lenny3 with Suhosin-Patch proxy_html/3.0.0 mod_ssl/2.2.9 OpenSSL/0.9.8g Server at treshna.com Port 80</address>
</body></html>
* Connection #0 to host treshna.com left intact
* Closing connection #0
Comments and changes to this ticket
chris (at chriskite) November 5th, 2009 @ 10:27 AM
- State changed from new to resolved
- Assigned user set to chris (at chriskite)
Since Anemone limits the crawl to a single domain, it won't switch over to your www subdomain after the redirect. You'll need to start on the domain you intend to crawl.
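As a rough illustration of the point above, the sketch below checks whether a redirect target is on a different host than the starting URL, which is the situation in which the crawl would not continue. It uses only Ruby's standard `uri` library; `redirect_leaves_host?` is a hypothetical helper written for this example, not part of Anemone, and the `Location` value is copied from the curl output in the report.

```ruby
require "uri"

# Hypothetical helper (not part of Anemone): true when the redirect target
# is on a different host than the URL the crawl started from.
def redirect_leaves_host?(start_url, location)
  start = URI.parse(start_url)
  # Resolve the Location value against the start URL so that relative
  # redirects such as "/about" are handled as well.
  target = start.merge(location)
  start.host.downcase != target.host.downcase
end

start_url = "http://treshna.com/"
location  = "http://www.treshna.com" # Location header from the 302 response

if redirect_leaves_host?(start_url, location)
  puts "Redirect leaves #{URI.parse(start_url).host}; start the crawl at #{location} instead"
end
```

For this ticket the check is true, so the crawl from the report would be started with `Anemone.crawl("http://www.treshna.com/")` and the same `on_every_page` block.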
Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.