#24 ✓resolved
rb2k

Memleak with pages?

Reported by rb2k | May 15th, 2010 @ 04:06 PM

Is it possible that, when running many .crawl() operations, the @pages hash keeps growing, with no method that lets the user empty it?
(I'm using a simple Hash as the storage backend in my case.)
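
A minimal sketch of the behavior in question (the URL is a placeholder; this assumes Anemone's default in-memory Hash storage):

    require 'anemone'

    # Every page visited during a crawl is recorded in the core's page
    # store. With the default Hash backend, that data lives in memory
    # for as long as the core (or the hash) is referenced.
    core = Anemone.crawl("http://example.com")
    puts core.pages.size   # grows with the number of pages crawled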

Comments and changes to this ticket

  • chris (@chriskite)

    chris (@chriskite) May 25th, 2010 @ 09:13 PM

    • State changed from “new” to “resolved”

    Yes, although it's not really a "leak" because it is intentionally storing data about all the pages you crawl. If you crawl a lot of pages, that data has to go somewhere, and if you're using a Hash then it's in memory. Using the TokyoCabinet storage engine is a good solution, as the data is persisted on disk and doesn't use nearly as much memory.
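
    A minimal sketch of that setup (the URL and filename are placeholders; it assumes the tokyocabinet gem is installed alongside anemone):

        require 'anemone'

        Anemone.crawl("http://example.com") do |anemone|
          # Persist page data to disk instead of an in-memory Hash.
          anemone.storage = Anemone::Storage.TokyoCabinet("crawl.tch")
        end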

  • rb2k

    rb2k May 25th, 2010 @ 09:16 PM

    I crawl different domains, though.
    Is there a way to "reset" the page cache between crawls?
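
    One workaround, as an untested sketch (the domains are placeholders, and this is not a confirmed Anemone API for clearing the store): give each domain its own crawl, so nothing holds a reference to the previous run's page store:

        require 'anemone'

        %w[http://example.com http://example.org].each do |domain|
          Anemone.crawl(domain) do |anemone|
            anemone.on_every_page { |page| puts page.url }
          end
          # No reference to the returned core is kept, so its page store
          # becomes eligible for garbage collection here.
        end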
