Using MongoDB to replace the sessions table?
tracy — Sun, 02/21/2010 - 14:48
Maintaining session information has long been the bane of the web developer. Many web platforms put that information into a MySQL table, which can work decently well for small to medium sites. The problem is that as a site's popularity grows, so do the number of database entries and hits. Sites that aren't diligent about purging stale entries from the sessions table will see increasingly slow lookup times at precisely the moment the number of hits is increasing.
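For reference, the usual cleanup is a cron-driven purge along these lines. This is a minimal sketch assuming DBI, a Drupal 6-style sessions table (uid and timestamp columns), and placeholder credentials and cutoff values of my own choosing:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder connection details; adjust for your own site.
my $dbh = DBI->connect(
    'DBI:mysql:database=website;host=localhost',
    'user', 'password',
    { RaiseError => 1 },
);

# Remove anonymous sessions older than a day. The one-day cutoff is an
# arbitrary choice for illustration.
my $cutoff  = time() - 86_400;
my $deleted = $dbh->do(
    'DELETE FROM sessions WHERE uid = 0 AND timestamp < ?',
    undef, $cutoff,
);
print "Purged $deleted stale sessions.\n";

$dbh->disconnect();

As the table grows, even this DELETE gets expensive, which is part of what pushes people toward the alternatives below.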
One common solution is to turn off session logging for anonymous users. A Drupal module, No Anonymous Sessions, does exactly that. This can help reduce server load and database contention, particularly on busy sites. However, it can also have some less-than-desirable effects. For instance, if you are a stats junkie, it becomes much harder to figure out the traffic patterns of individual anonymous users. In addition, some sites provide user functionality to anonymous visitors, such as the ability to vote, bookmark favorites, and so on. These sites often do this as a way to let users "try before they buy," determining how useful they find the site before being required to register. Likewise, if you run your own targeted advertising, not having anonymous sessions makes it harder to determine whether you've already shown a particular ad to a visitor.
Given the drawbacks of completely turning off anonymous sessions, some sites put the session data into some other sort of cache instead, commonly APC for single-server setups or memcached for multi-server setups. Since the information resides in memory, lookups can be much faster than against MySQL. In addition, these caches are memory-constrained, and older data is periodically evicted. That means we don't have to write cron scripts to clean out the database, and the data set can only grow so large. Granted, it also means that if memcached has a problem, or if we have to restart the web server, all current session data is lost, and during particularly busy times active user sessions might be evicted if too much session data is being saved at once.
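For the memcached variant, the session handler boils down to something like the following sketch, assuming the Cache::Memcached module; the key prefix and one-hour expiry are my own illustrative choices:

#!/usr/bin/perl
use strict;
use warnings;
use Cache::Memcached;

my $memd = Cache::Memcached->new({
    servers => ['127.0.0.1:11211'],   # assumed memcached location
});

# Store the session under its sid with a one-hour expiry; memcached
# evicts it for us, so no cron cleanup is needed.
my $sid = 'abc123';
$memd->set("session:$sid", { uid => 0, timestamp => time() }, 3600);

# On the next hit, look the session back up; a miss means the entry
# expired, was evicted, or the daemon restarted.
my $session = $memd->get("session:$sid");
print defined $session ? "session found\n" : "session lost\n";

The expiry time replaces the cron cleanup entirely, but as noted above, a restart or heavy eviction pressure simply throws sessions away.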
All of this makes me wonder if MongoDB might be a good replacement for these methods of saving session data, particularly using the capped collections option. There are a couple of caveats that are important to note before going down this path (see the sketch after the list):
- You cannot explicitly delete an object from a capped collection.
- You can modify an object in a capped collection, but you cannot increase its size. You can, however, pad the object up front so that it already occupies the maximum amount of space it will need.
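To make those constraints concrete, here is a minimal sketch using the same driver calls as the test script below; the collection settings mirror that script's configuration, and the _pad field is a hypothetical placeholder of my own:

#!/usr/bin/perl
use strict;
use warnings;
use MongoDB;

my $connection = MongoDB::Connection->new(host => 'localhost', port => 27017);
my $database   = $connection->get_database('website');

# Capped collections must be created explicitly, with a fixed size in bytes.
$database->eval("db.createCollection('sessions', {capped:true, size:1000000})");
my $collection = $database->get_collection('sessions');

# Insert a session padded to its maximum size up front. The '_pad' field
# is a hypothetical placeholder reserving room for future session data;
# later updates may reuse that space but must never grow the document.
$collection->insert({
    sid  => 'abc123',
    uid  => 0,
    _pad => ' ' x 512,
});

# An in-place update that does not grow the document is allowed.
$collection->update({ sid => 'abc123' }, { '$set' => { uid => 42 } });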
I've written a simplified script to test how these different options compare to one another. The sessions table is based on the one used in Drupal 6, since I would most likely use this to help improve the speed of my personal websites. The code prefills the sessions table with some sessions so we can emulate returning visitors. For now, I'm not going to worry about putting information into the cache key, so I don't have to solve the issue of changing the object size. Likewise, I'm not going to worry about users explicitly logging out or about session expiration.
#!/usr/bin/perl
use strict;
use warnings;
use boolean;
use Data::Dumper;
use English;
use FileHandle;
use MongoDB;

# Configuration
our $NUM_OF_HITS     = 100000;      # The number of hits we want to simulate
our $NEW_VISIT_PROB  = 60;          # The probability that the visitor is a new
                                    # visitor (no session id set)
our $DEFAULT_USER_ID = 0;           # The default user id, for the anonymous user
our $DEFAULT_CACHE   = 0;           # The default value for the cache column
our $DEFAULT_SESSION = '';          # The default value for the session column
our $MONGODB_HOST    = 'localhost'; # The hostname for the mongodb instance
our $MONGODB_PORT    = 27017;       # The port for the mongodb instance
our $TIME_MODULO     = 10;          # The number of hits after which we should
                                    # change the timestamp
our $DATABASE        = 'website';   # The database name
our $COLLECTION      = 'sessions';  # The collection name
our $DB_SIZE         = 1000000;     # The size (in bytes) for the capped collection
our $CONCURRENT_HITS = 10;          # The number of concurrent child processes

# Initialize some variables
my @sessions = ();            # An array of sessions already in play
my $user_count = 1;           # A counter that lets us create new users
                              # with unique ids
my $time = time();            # The timestamp to set for the hit
my $saved_session_count = 0;  # A count of the number of times we found a
                              # saved session

initialize_database();

my @children = ();
if ($CONCURRENT_HITS > 1) {
    for (1..$CONCURRENT_HITS) {
        my $pid = fork();
        if ($pid) {
            # parent
            push(@children, $pid);
        }
        elsif ($pid == 0) {
            # child
            srand($PID);
            simulate_hits();
            exit(0);
        }
        else {
            die "couldn't fork: $!\n";
        }
    }
    foreach (@children) {
        waitpid($_, 0);
    }
}
else {
    simulate_hits();
}

tear_down_database();

sub initialize_database {
    # Connect to our mongodb instance
    my $connection = MongoDB::Connection->new(
        host => $MONGODB_HOST,
        port => $MONGODB_PORT,
    );

    # Connect to the database
    my $database = $connection->get_database($DATABASE);

    # Set up our capped collection
    my $instruction =
        "db.createCollection('$COLLECTION', {capped:true, size:$DB_SIZE})";
    my $result = $database->eval($instruction);
    my $collection = $database->get_collection($COLLECTION);
    #$collection->ensure_index({sid => 1}, { unique => true });
    #$collection->ensure_index({uid => 1});
    #$collection->ensure_index({timestamp => 1});

    # Precreate the existing sessions so that we don't artificially
    # increase the number of saved sessions
    my $time_precreate_start = time();
    my $sessions_to_precreate = ($NUM_OF_HITS * ($NEW_VISIT_PROB / 100));
    for my $index (1..$sessions_to_precreate) {
        my $session = {
            sid       => generate_random_string(32),
            ip        => generate_random_ip(),
            timestamp => $time,
            cache     => $DEFAULT_CACHE,
            session   => $DEFAULT_SESSION,
            uid       => $DEFAULT_USER_ID,
        };
        my $new_user_rand = int rand 10;
        if ($new_user_rand < 2) {
            $session->{'uid'} = $user_count++;
        }
        push @sessions, $session;
        $collection->update(
            { sid => $session->{'sid'} },
            $session,
            { 'upsert' => 1 },
        );
    }
    my $time_precreate_end = time();
    print 'We spent ' . ($time_precreate_end - $time_precreate_start)
        . ' seconds on pre-caching ' . $sessions_to_precreate . ' sessions.';
    print "\n";
}

sub simulate_hits {
    # Simulate our hits.
    my $time_simulate_start = time();
    my $hits_to_simulate = $NUM_OF_HITS / $CONCURRENT_HITS;
    for my $index (1..$hits_to_simulate) {
        my $new_visit_rand = int rand 100;
        my $session = {};
        if ($new_visit_rand < $NEW_VISIT_PROB) {
            $session = {
                sid       => generate_random_string(32),
                ip        => generate_random_ip(),
                timestamp => $time,
                cache     => $DEFAULT_CACHE,
                session   => $DEFAULT_SESSION,
                uid       => $DEFAULT_USER_ID,
            };
            my $new_user_rand = int rand 10;
            if ($new_user_rand < 2) {
                $session->{'uid'} = $user_count++;
            }
        }
        else {
            my $index = int rand scalar @sessions;
            $session = $sessions[$index];
            $session->{'timestamp'} = $time;
        }
        # Connect anew for each hit, mimicking a fresh page request
        my $connection = MongoDB::Connection->new(
            host => $MONGODB_HOST,
            port => $MONGODB_PORT,
        );
        my $database = $connection->get_database($DATABASE);
        my $collection = $database->get_collection($COLLECTION);
        my $saved_session = $collection->find_one({ sid => $session->{'sid'} });
        if ($saved_session) {
            $saved_session_count++;
            $collection->update(
                { sid => $session->{'sid'} },
                $session,
            );
        }
        else {
            $collection->insert($session);
        }
        if ($index % $TIME_MODULO == 0) {
            $time--;
        }
    }
    my $time_simulate_end = time();
    print 'We spent ' . ($time_simulate_end - $time_simulate_start)
        . ' seconds simulating ' . $hits_to_simulate . ' hits.';
    print "\n";
    print 'We had a cached session hit percentage of '
        . (($saved_session_count / $hits_to_simulate) * 100) . '%.';
    print "\n";
}

sub tear_down_database {
    # Drop the database so the test will be clean next time.
    my $connection = MongoDB::Connection->new(
        host => $MONGODB_HOST,
        port => $MONGODB_PORT,
    );
    my $database = $connection->get_database($DATABASE);
    $database->drop();
}

sub generate_random_string {
    my $string_size = shift;
    $string_size--;
    my @alphanumeric = ('a'..'z', 'A'..'Z', 0..9);
    my $rand_string =
        join '', map $alphanumeric[rand @alphanumeric], 0..$string_size;
    return $rand_string;
}

sub generate_random_ip {
    my $ip_string = '';
    $ip_string .= join '.', map int rand 100, 0..3;
    return $ip_string;
}
I ran this code on an Amazon EC2 instance, m1.large, running Ubuntu 8.04 (hardy). That instance type is a 64-bit platform with 7.5 GB of memory, 4 EC2 Compute Units, and 850 GB of local instance storage. My test code had the following results:
We spent 675 seconds on pre-caching 60000 sessions.
We spent 3060 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.36%.
We spent 3062 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.31%.
We spent 3064 seconds simulating 10000 hits.
We had a cached session hit percentage of 39.94%.
We spent 3065 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.35%.
We spent 3065 seconds simulating 10000 hits.
We had a cached session hit percentage of 39.78%.
We spent 3066 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.37%.
We spent 3067 seconds simulating 10000 hits.
We had a cached session hit percentage of 39.75%.
We spent 3067 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.14%.
We spent 3068 seconds simulating 10000 hits.
We had a cached session hit percentage of 40.08%.
We spent 3069 seconds simulating 10000 hits.
We had a cached session hit percentage of 39.9%.
In future articles, I'll run similar code to populate a MySQL database and a memcached instance so we can draw comparisons.
Can you publish the
Jason Fisher — Sat, 10/16/2010 - 15:20
Can you publish the comparison tests?