Using MongoDB to replace the sessions table?

tracy — Sun, 02/21/2010 - 14:48

Maintaining session information has long been the bane of the web developer. Many web platforms put that information into a MySQL table, an approach that works decently well for small to medium sites. The problem is that as a site's popularity grows, so do the number of database entries and the number of hits. Sites that aren't diligent about purging stale entries from the sessions table will see increasingly slow lookups at exactly the time when traffic is rising.

One common solution is to turn off session logging for anonymous users; for Drupal users, the No Anonymous Sessions module does just that. This can reduce server load and database contention, particularly on busy sites, but it also has some less desirable effects. If you are a stats junkie, it becomes much harder to work out the traffic patterns of individual anonymous visitors. Some sites also offer user functionality to anonymous users, such as the ability to vote or bookmark favorites, as a way to let visitors "try before they buy" and judge how useful the site is before being asked to register. Likewise, if you run your own targeted advertising, the lack of anonymous sessions makes it harder to tell whether you've already shown a particular ad to a given visitor.

Given the drawbacks of turning off anonymous sessions entirely, some sites instead put them into some other sort of cache, commonly APC for single-server setups or memcached for multi-server setups. Since the information resides in memory, lookups can be much faster than against MySQL. These caches are also memory-constrained, with older data periodically evicted, so we don't have to write cron scripts to clean out the database, and the data set can only get so large. Granted, this also means that if memcached has a problem, or if we have to restart the web server, all current session data is lost; and during particularly busy periods, active user sessions might be evicted if too much session data is being saved at once.
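To make the trade-off concrete, the in-memory approach these caches take can be pictured as a dictionary of sessions with expiry times, where stale entries are evicted lazily on lookup. This is only a toy sketch (in Python, for illustration, since the test script below is Perl), not memcached's actual behavior; the class and field names are my own:

```python
import time

class SessionCache:
    """Toy in-memory session store with a TTL, loosely mimicking what a
    cache like memcached gives you for session data. Hypothetical sketch."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # sid -> (expires_at, session_data)

    def set(self, sid, data):
        # Every write refreshes the expiry time.
        self.store[sid] = (time.time() + self.ttl, data)

    def get(self, sid):
        entry = self.store.get(sid)
        if entry is None:
            return None
        expires_at, data = entry
        if time.time() > expires_at:
            # Stale entry: evict it lazily, no cron job required.
            del self.store[sid]
            return None
        return data

cache = SessionCache(ttl_seconds=1800)
cache.set("abc123", {"uid": 0, "ip": "10.0.0.1"})
print(cache.get("abc123"))  # → {'uid': 0, 'ip': '10.0.0.1'}
```

Note the downside the paragraph above describes: the whole `store` dictionary lives in process memory, so a restart wipes every active session at once.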

All of this makes me wonder if MongoDB might be a good replacement for these methods of saving session data, particularly using the capped collections option. There are a couple of caveats that are important to note before going down this path:

  • You cannot explicitly delete an object from a capped collection.
  • You can modify an object in a capped collection, but you cannot increase its size. You can, however, pad the object on insert so that it already occupies the maximum space it will ever need.
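The second caveat matters as soon as you want to store real data in the session or cache fields: since a document in a capped collection can be rewritten but never grown, you would pad each document to its maximum size at insert time. A hypothetical sketch of that padding step (in Python for illustration; the byte budget and function name are my own):

```python
MAX_SESSION_CHARS = 512  # hypothetical fixed budget per session payload

def padded_session(payload: str) -> str:
    """Right-pad the serialized session data so the stored field always
    occupies its maximum size. Later updates that stay within the budget
    then rewrite the document without ever growing it."""
    if len(payload) > MAX_SESSION_CHARS:
        raise ValueError("session payload exceeds the fixed budget")
    return payload.ljust(MAX_SESSION_CHARS)

doc = {"sid": "abc123", "session": padded_session("cart=1,2,3")}
print(len(doc["session"]))  # → 512
```

The cost of this trick is obvious: every session permanently occupies its worst-case size, so the capped collection holds fewer sessions than it otherwise could.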

I'm creating a simplified script to test how these different options compare to one another. The sessions table is based on the one used in Drupal 6, since I would most likely use this to help speed up my personal websites. The code prefills the sessions table with some sessions so we can emulate returning visitors. For now I'm not putting real information into the cache or session fields, so I don't have to solve the problem of changing object sizes, and I'm likewise ignoring explicit logouts and session expiration.

#!/usr/bin/perl
use strict;
use warnings;

use boolean;
use Data::Dumper;
use English;
use FileHandle;
use MongoDB;

# Configuration
our $NUM_OF_HITS     = 100000;      # The number of hits we want to simulate
our $NEW_VISIT_PROB  = 60;          # The probability that the visitor is a new
                                    # visitor (no session id set)
our $DEFAULT_USER_ID = 0;           # The default user id, for anonymous user
our $DEFAULT_CACHE   = 0;           # The default value for the cache column.
our $DEFAULT_SESSION = '';          # The default value for the session column.
our $MONGODB_HOST    = 'localhost'; # The hostname for the mongodb instance.
our $MONGODB_PORT    = 27017;       # The port for the mongodb instance.
our $TIME_MODULO     = 10;          # The number of hits after which we should
                                    # change the timestamp.
our $DATABASE        = 'website';   # The database name
our $COLLECTION      = 'sessions';  # The collection name
our $DB_SIZE         = 1000000;     # The size of the capped collection,
                                    # in bytes.
our $CONCURRENT_HITS = 10;          # The number of concurrent child
                                    # processes simulating hits.

# Initialize some variables
my @sessions            = ();      # An array of sessions already in play.
my $user_count          = 1;       # A count variable that allows us to create
                                   # new users with unique ids
my $time                = time();  # The timestamp to set for the hit.
my $saved_session_count = 0;       # A count of the number of times we found a 
                                   # saved session.

initialize_database();

my @children = ();
if ($CONCURRENT_HITS > 1) {
    for(1..$CONCURRENT_HITS) {
        my $pid = fork();
        if ($pid) {
            # parent
            push(@children, $pid);
        } elsif (defined $pid) {
            # child; fork() returns 0 here and undef on failure
            srand($PID);
            simulate_hits();
            exit(0);
        } else {
            die "couldn't fork: $!\n";
        }
    }

    foreach (@children) {
        waitpid($_, 0);
    }
} else {
    simulate_hits();
}

tear_down_database();

sub initialize_database {
    # Connect to our mongodb instance
    my $connection = MongoDB::Connection->new(
        host => $MONGODB_HOST,
        port => $MONGODB_PORT,
        );

    # Connect to the database
    my $database   = $connection->get_database($DATABASE);

    # Set up our capped table
    my $instruction
        = "db.createCollection('$COLLECTION', {capped:true, size:$DB_SIZE})";

    my $result
        = $database->eval($instruction);

    my $collection = $database->get_collection($COLLECTION);
    #$collection->ensure_index({sid => 1}, { unique => true });
    #$collection->ensure_index({uid => 1});
    #$collection->ensure_index({timestamp => 1});

    # Precreate the existing sessions so that we don't artificially
    # increase the number of saved sessions
    my $time_precreate_start = time();
    my $sessions_to_precreate = ($NUM_OF_HITS * ( $NEW_VISIT_PROB / 100));
    for my $index (1..$sessions_to_precreate) {
        my $session = {
            sid       => generate_random_string(32),
            ip        => generate_random_ip(),
            timestamp => $time,
            cache     => $DEFAULT_CACHE,
            session   => $DEFAULT_SESSION,
            uid       => $DEFAULT_USER_ID,
        };

        my $new_user_rand = int rand 10;
        if ($new_user_rand < 2) {
            $session->{'uid'} = $user_count++;
        }

        push @sessions, $session;
        $collection->update(
            { sid   => $session->{'sid'} },
            $session,
            { 'upsert' => 1 }
            );
    }
    my $time_precreate_end = time();
    print 'We spent ' . ($time_precreate_end - $time_precreate_start)
        . ' seconds on pre-caching ' . $sessions_to_precreate . ' sessions.';
    print "\n";
}

sub simulate_hits {

    # Simulate our hits.
    my $time_simulate_start = time();
    my $hits_to_simulate    = $NUM_OF_HITS/$CONCURRENT_HITS;

    for my $index (1..$hits_to_simulate) {
        my $new_visit_rand = int rand 100;
        my $session = {};

        if ($new_visit_rand < $NEW_VISIT_PROB) {
            $session = {
                sid       => generate_random_string(32),
                ip        => generate_random_ip(),
                timestamp => $time,
                cache     => $DEFAULT_CACHE,
                session   => $DEFAULT_SESSION,
                uid       => $DEFAULT_USER_ID,
            };

            my $new_user_rand = int rand 10;
            if ($new_user_rand < 2) {
                $session->{'uid'} = $user_count++;
            }

        } else {
            my $index = int rand scalar @sessions;
            $session = $sessions[$index];
            $session->{'timestamp'} = $time;
        }

        # Each simulated hit opens its own connection, as a separate
        # page request would.
        my $connection = MongoDB::Connection->new(
            host => $MONGODB_HOST,
            port => $MONGODB_PORT,
            );

        my $database   = $connection->get_database($DATABASE);
        my $collection = $database->get_collection($COLLECTION);

        my $saved_session
            = $collection->find_one({ sid => $session->{'sid'}});

        if ($saved_session) {
            $saved_session_count++;
            $collection->update(
                { sid   => $session->{'sid'} },
                $session,
                );
        } else {
            $collection->insert($session);
        }

        if ($index % $TIME_MODULO == 0) {
            $time--;
        }
    }
    my $time_simulate_end = time();
    print 'We spent ' . ($time_simulate_end - $time_simulate_start)
        . ' seconds simulating ' . $hits_to_simulate . ' hits.';
    print "\n";

    print 'We had a cached session hit percentage of '
        . (($saved_session_count/$hits_to_simulate) * 100)  . '%.';
    print "\n";
}

sub tear_down_database {
    # Drop the database so the test will be clean next time.
    my $connection = MongoDB::Connection->new(
        host => $MONGODB_HOST,
        port => $MONGODB_PORT,
        );
    my $database   = $connection->get_database($DATABASE);
    $database->drop();
}

sub generate_random_string {
    my $string_size = shift;
    $string_size--;

    my @alphanumeric = ('a'..'z', 'A'..'Z', 0..9);
    my $rand_string = join '',
        map { $alphanumeric[rand @alphanumeric] } 0..$string_size;

    return $rand_string;
}

sub generate_random_ip {
    # Note: each octet is drawn from 0-99, which is enough variety for
    # test data even though it doesn't cover the full 0-255 range.
    return join '.', map { int rand 100 } 0..3;
}
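The property the script above leans on is that once a capped collection reaches its size cap, the oldest entries are silently dropped to make room for new ones, so no purge cron job is ever needed. That behavior can be pictured as a fixed-length ring buffer; a rough Python analogy (the real cap is in bytes, not in number of documents):

```python
from collections import deque

# A deque with maxlen behaves like a capped collection capped at
# three entries: inserting a fourth silently evicts the oldest.
capped = deque(maxlen=3)

for sid in ["s1", "s2", "s3", "s4"]:
    capped.append(sid)

# "s1" was evicted automatically when "s4" arrived.
print(list(capped))  # → ['s2', 's3', 's4']
```

For session data this eviction order is exactly what we want: the sessions pushed out first are the ones that have gone longest without a write.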

I ran this code on an Amazon EC2 instance, m1.large, running Ubuntu 8.04 (hardy). That instance type is a 64-bit platform with 7.5 GB of memory, 4 EC2 Compute Units, and 850 GB of local instance storage. My test code had the following results:

We spent 675 seconds on pre-caching 60000 sessions.
We spent 3060 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.36%.
We spent 3062 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.31%.
We spent 3064 seconds simulating 10000 hits.
We had a cached session hit percentage of 39.94%.
We spent 3065 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.35%.
We spent 3065 seconds simulating 10000 hits.
We had a cached session hit percentage of 39.78%.
We spent 3066 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.37%.
We spent 3067 seconds simulating 10000 hits.
We had a cached session hit percentage of 39.75%.
We spent 3067 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.14%.
We spent 3068 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.08%.
We spent 3069 seconds simulating 10000 hits.
We had a cached session hit percentage of 39.9%.

In future articles, I'll run similar code to populate a MySQL database and a memcached instance so we can draw comparisons.

  • drupal
  • mongodb

Jason Fisher — Sat, 10/16/2010 - 15:20

Can you publish the comparison tests?

© Copyright 2009: Tracy Hurley