Introduction

With MongoDB we can become more Googley, i.e. grow our data set horizontally on a cluster of commodity hardware and do distributed (read: parallel execution of) queries/updates/inserts/deletes. We can also ensure that our data is held redundantly, i.e. each bit of data is stored not just once but several times, on different physical nodes in the cluster, even across datacenter boundaries if we wish to do so. And guess what, failover/recovery and sharding happen automatically. Also, MongoDB has dynamic schemas. You are going to love it!

MongoDB

This section is about one particular NoSQL DBMS (Database Management System) known as MongoDB. For a summary, have a look at the abstract from above.

Installation and Configuration

Quickstart

Files

WRITEME

Interactive Shell
In addition, MongoDB defines some of its own classes and globals inside the interactive shell (e.g. the db handle for the current database and types such as ObjectId).
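For instance, a quick hypothetical session with two of these built-ins, the db handle and the ObjectId type (collection name and output made up):

> db                                        // the current database
test
> var oid = ObjectId();                     // one of the shell's own types
> db.things.insert({ _id : oid, name : "katze" });
> db.things.findOne({ _id : oid });
{ "_id" : ObjectId("4f0f6dd1e4b0c3a8d3b4a111"), "name" : "katze" }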
WRITEME

Help on the CLI

When in the CLI (Command Line Interface), it is important to know how to get information. In addition to the general commands and methods

sa@wks:~$ mongo --quiet
type "help" for help
> help()
HELP
     show dbs                     show database names
     show collections             show collections in current database
     show users                   show users in current database
     show profile                 show most recent system.profile entries with time >= 1ms
     use <db name>                set current database to <db name>
     db.help()                    help on DB methods
     db.foo.help()                help on collection methods
     db.foo.find()                list objects in collection foo
     db.foo.find( { a : 1 } )     list objects in foo where a == 1
     it                           result of the last line evaluated; use to further iterate

we can have a look at database specific commands and methods

> db.help();
DB methods:
     db.addUser(username, password[, readOnly=false])
     db.auth(username, password)
     db.cloneDatabase(fromhost)
     db.commandHelp(name) returns the help for the command
     db.copyDatabase(fromdb, todb, fromhost)
     db.createCollection(name, { size : ..., capped : ..., max : ... } )
     db.currentOp() displays the current operation in the db
     db.dropDatabase()
     db.eval(func, args) run code server-side
     db.getCollection(cname) same as db['cname'] or db.cname
     db.getCollectionNames()
     db.getLastError() - just returns the err msg string
     db.getLastErrorObj() - return full status object
     db.getMongo() get the server connection object
     db.getMongo().setSlaveOk() allow this connection to read from the nonmaster member of a replica pair
     db.getName()
     db.getPrevError()
     db.getProfilingLevel()
     db.getReplicationInfo()
     db.getSisterDB(name) get the db at the same server as this one
     db.killOp(opid) kills the current operation in the db
     db.listCommands() lists all the db commands
     db.printCollectionStats()
     db.printReplicationInfo()
     db.printSlaveReplicationInfo()
     db.printShardingStatus()
     db.removeUser(username)
     db.repairDatabase()
     db.resetError()
     db.runCommand(cmdObj) run a database command. if cmdObj is a string, turns it into { cmdObj : 1 }
     db.serverStatus()
     db.setProfilingLevel(level,<slowms>) 0=off 1=slow 2=all
     db.shutdownServer()
     db.stats()
     db.version() current version of the server
     db.getMongo().setSlaveOk() allow queries on a replication slave server

as well as collection specific ones

> db.test.help();
DBCollection help
     db.test.count()
     db.test.dataSize()
     db.test.distinct( key ) - eg. db.test.distinct( 'x' )
     db.test.drop() drop the collection
     db.test.dropIndex(name)
     db.test.dropIndexes()
     db.test.ensureIndex(keypattern,options) - options should be an object with these possible fields: name, unique, dropDups
     db.test.reIndex()
     db.test.find( [query] , [fields]) - first parameter is an optional query filter. second parameter is optional set of fields to return.
                                         e.g. db.test.find( { x : 77 } , { name : 1 , x : 1 } )
     db.test.find(...).count()
     db.test.find(...).limit(n)
     db.test.find(...).skip(n)
     db.test.find(...).sort(...)
     db.test.findOne([query])
     db.test.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
     db.test.getDB() get DB object associated with collection
     db.test.getIndexes()
     db.test.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
     db.test.mapReduce( mapFunction , reduceFunction , <optional params> )
     db.test.remove(query)
     db.test.renameCollection( newName , <dropTarget> ) renames the collection
     db.test.runCommand( name , <options> ) runs a db command with the given name where the 1st param is the collection name
     db.test.save(obj)
     db.test.stats()
     db.test.storageSize() - includes free space allocated to this collection
     db.test.totalIndexSize() - size in bytes of all the indexes
     db.test.totalSize() - storage allocated for all data and indexes
     db.test.update(query, object[, upsert_bool, multi_bool])
     db.test.validate() - SLOW
     db.test.getShardVersion() - only for use with sharding

If we are curious about what a particular method is doing, we can type it without the trailing parentheses

> db.test.find;
function (query, fields, limit, skip) {
    return new DBQuery(this._mongo, this._db, this, this._fullName, this._massageObject(query), fields, limit, skip);
}

One last thing we could do is list all available database commands and maybe drill deeper and look at their description

> db.listCommands();
eval lock: node adminOnly: false slaveOk: false
  Evaluate javascript at the server.
  http://www.mongodb.org/display/DOCS/Server-side+Code+Execution
assertInfo lock: write adminOnly: false slaveOk: true
  check if any asserts have occurred on the server
authenticate lock: write adminOnly: false slaveOk: true
  internal
buildInfo lock: node adminOnly: true slaveOk: true
  get version #, etc.
  { buildinfo:1 }
[skipping a lot of lines...]
validate lock: read adminOnly: false slaveOk: true
  Validate contents of a namespace by scanning its data structures for correctness. Slow.
whatsmyuri lock: node adminOnly: false slaveOk: true
  {whatsmyuri:1}
writebacklisten lock: node adminOnly: true slaveOk: false
  internal

> db.commandHelp('datasize');
help for: dataSize determine data size for a set of data in a certain range
  example: { datasize:"blog.posts", keyPattern:{x:1}, min:{x:10}, max:{x:55} }
  keyPattern, min, and max parameters are optional.
  note: This command may take a while to run
>

FAQs

This section gathers FAQs about MongoDB, all split into subsections of particular subjects.

Basics

This subsection gathers basic FAQs about MongoDB.

Reasons not to use MongoDB

The days where one DBMS fits all our needs are over! Technological advances in the area of NoSQL (e.g. CouchDB, MongoDB, etc.) are the perfect fit in certain areas, but still, established technologies like for example RDBMSs (Relational Database Management Systems) are very good at what they do best, which is storing relational data. What we want to do is use whatever tool is best for the job... Below is a quick summary which makes it easy to see whether or not a switch to MongoDB would make sense, or, more realistically, when and what parts of our infrastructure can be moved over to MongoDB and which parts are to be left on RDBMSs like for example Oracle, MySQL, PostgreSQL and friends. We maybe should not use MongoDB when:
Roadmap

Yes, go here.

Wiki? PDF Manual?

There is a PDF version of the Wiki.

Programming Languages

Go here and here for the available MongoDB drivers/APIs. In short: right now (October 2010) we can use MongoDB from at least C, C++, C# & .NET, ColdFusion, Erlang, Factor, Java, JavaScript, PHP, Python, Ruby, Perl. Of course, there might be more languages available in the future.

Python 3 Support

The current thought is to use Django as more or less a signal for when adding full support for Python 3 makes sense — this in turn will mainly depend on two things:
MongoDB can probably support Python 3 a bit earlier than Django does, but that is certainly not something the MongoDB community wants to rush and then have to support two totally different codebases. It can be tracked here. Personally I think that having an official Python 3 branch of PyMongo sooner rather than later would be a good thing.

Building Blocks

We have RDBMSs (Relational Database Management Systems) like for example MySQL, Oracle, PostgreSQL and then there are NoSQL DBMSs like for example MongoDB. Below is a breakout of how MongoDB relates to the aforementioned, i.e. how the main building blocks of each party correspond to one another:

MySQL, PostgreSQL...
----------------------
Server:Port - Database(s) - Table(s) - Row(s) - Column(s) - Index - Join

MongoDB
--------
Server:Port - Database(s) - Collection(s) - Document(s) - Field(s) - Index - embedding and linking

The concepts of server, database and index are very similar but the concepts of table/collection, row/document as well as column/field are quite different. In an RDBMS a table is a rectangle made of columns and rows. Each row has a fixed number of columns; if we add a new column, we add that column to each and every row. In MongoDB a collection is more like a really big box and each document is like a little bag of stuff in that box. Each bag contains whatever it needs in a totally flexible manner (read: schema-less). Note that schema-less does not equal type-less, i.e. it is just that with MongoDB any document has its own schema, which it may or may not share with other documents. In practice it is normal to have the same schema for all the documents in a collection. The concept of a column in RDBMSs is closest to what we call a field (key/value pair) in MongoDB — note however what we said above: we can create/read/update/delete individual fields for a particular document in a collection. This is different from creating/reading/updating/deleting a column in an RDBMS, which happens for every row in the entire table. Indexes are more or less the same for RDBMSs and MongoDB. Joins however do not exist in MongoDB; instead we can embed and/or link documents into/with other documents.

Administration

This subsection gathers FAQs with regards to administering MongoDB.

Large Pages

sa@wks:~$ cat /proc/meminfo | grep Hugepagesize
Hugepagesize:       2048 kB
sa@wks:~$

The use of huge pages is generally discouraged for MongoDB because it makes the granularity for managing memory very large, which is usually a bad thing.

MongoDB does not start anymore

Usually what happened is an unclean shutdown (e.g.
SIGKILL rather than SIGTERM has been sent, a power outage, etc.) of mongod. In that case we need to find and remove the stale lock file

sa@wks:~$ locate mongod.lock
/var/lib/mongodb/mongod.lock
sa@wks:~$

and start in repair mode. There are several ways for recovery, all depending on our particular setup. Note that both,

Filesystem

As of now (December 2010) most people prefer ext4 or XFS — in the future BTRFS might become people's first choice. Ext4 fully implements preallocation (something MongoDB makes heavy use of for its data files).
On older filesystems the semantics are such that literally the whole file is zeroed out, i.e. zeros are written to disk — obviously this takes a very long time. Without going into details, ext4 and other modern filesystems do not actually do that but achieve the same result by other (way faster) means.

Ctrl-C

Ctrl-C cleanly shuts down the MongoDB server (read: a running mongod instance).

MongoDB is asking me to do a repair

This could happen if there is an unclean shut down e.g. a power outage. It is important to note that it does not necessarily mean that the data is actually corrupted — MongoDB is just very paranoid and wants to make sure everything is all right and therefore asks to run a repair.
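As a rough sketch (user name, paths and init script are the Debian defaults that appear elsewhere in this section — adjust to the setup at hand), such a manual recovery usually boils down to removing the stale lock file and starting mongod once with --repair, or issuing repairDatabase() from the shell:

sa@wks:~$ sudo rm /var/lib/mongodb/mongod.lock     # only once we are sure no mongod is running
sa@wks:~$ sudo -u mongodb mongod --dbpath /var/lib/mongodb --repair
sa@wks:~$ sudo /etc/init.d/mongodb start           # start normally again

or, on an instance that is still running:

> db.repairDatabase();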
Bottom Line: if MongoDB suggests doing a repair, it does not automatically imply that we actually have corrupted data. It might be the case but it might also not be the case...

Compact Database

There are two ways to achieve this...

repairDatabase
One strategy is to do a rolling repair on a replica set, i.e. if we have n machines in a replica set (one primary, n-1 secondaries), we bring each one down, repair it, let it resync, and move on to the next machine/node — one machine at a time until all machines in the replica set have been compacted. That way we always have n-1 machines up and no downtime with regards to the entire cluster, i.e. a rolling repair is transparent to the logic tier. While at it, this is also a very good time to take a backup since the machine is down anyway and after the compaction/repair we end up with a smaller data set before it starts growing again — a perfect time to run a backup. From an administrative point of view it is recommended that we use the

For example, if we have the replica set members

compact Command

The

WebGUI? REST Interface/API?

A generally good resource on GUIs can be found here. Also, assuming a
In order to have a REST (Representational State Transfer) interface to MongoDB, same as CouchDB has it, we have to start mongod with the --rest switch. Note, if we wanted real-time updates from the CLI, then we could also use

Rename a Database

Yes, we can rename a database but it is not as easy as renaming a collection. As of now (October 2010), the recommended way to rename a database is to clone it and thereby rename it. This will require enough additional free diskspace to fit the current/old database at least twice.

Physically migrate a Database

That is a pretty common need: we have our database on one machine and want to migrate/copy it onto another machine. There is even a clone command for that. Note however that neither
Ok, so now we know MongoDB has tools to migrate a database... good to know but, well, it might just not be good enough for every use case out there, something we will look into a bit more below. I find that most use cases are a bit more complex and thus might require a slightly different approach. For example, in order to make it a bit more complex and realistic, we might need to migrate from one datacenter to another, through an insecure network i.e. the Internet. Of course, we would then maybe also be faced with low bandwidth and high latency, which means our connection might get canceled at any time and we would then have to start all over again. With a bandwidth of e.g. only 10 Mbit/s and 1 TiB of data to transfer, that quickly becomes a problem... or not? Actually this is a lot easier than it sounds because, at first, we figure the

Next we shut down

Minimum Downtime

While the above is fine, it might cause significant downtime — what we did in a nutshell was:

Below is what we could do in order to have as little downtime as possible:
Of course, we might also use OpenVZ and its live-migration feature to move a VE (Virtual Environment) containing our MongoDB database from the old to the new machine with no downtime whatsoever ;-]

Update to a new MongoDB version

If it is a drop-in replacement we just need to shut down the older version and start the new one with the appropriate options. Otherwise, i.e. if it is not a drop-in replacement, we would use mongodump and mongorestore, i.e. export with the old version and re-import with the new one.
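A minimal sketch of that non-drop-in route, using the stock mongodump/mongorestore tools (the backup path and the init script name are assumptions, not part of the original text):

sa@wks:~$ mongodump --out /backup/pre-upgrade      # dump with the old binaries still in place
sa@wks:~$ sudo /etc/init.d/mongodb stop            # stop the old version, install the new one
sa@wks:~$ sudo /etc/init.d/mongodb start
sa@wks:~$ mongorestore /backup/pre-upgrade         # re-import with the new binaries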
Default listening Port and IP

We can use

wks:/home/sa# netstat -tulpena | grep mongo
tcp   0   0 0.0.0.0:27017   0.0.0.0:*   LISTEN   124   1474236   8822/mongod
tcp   0   0 0.0.0.0:28017   0.0.0.0:*   LISTEN   124   1474237   8822/mongod
wks:/home/sa#

The default listening port for mongod is 27017 (plus 28017 for the built-in web console, i.e. port + 1000). Both, listening port and IP address, can be changed either by using the CLI (Command Line Interface) switches (--port and --bind_ip) or the corresponding settings in the configuration file.

Automatic Backups

Yes, possible. The manual way is to use mongodump. One might use cron to automate this. Go here for more information on backups and especially recovery.

getSisterDB()

We can use it to get ourselves references to databases, which not just saves a lot of typing but is, once we got used to using it, a lot more intuitive:

 1 sa@wks:~/mm/new$ mongo
 2 MongoDB shell version: 1.5.2-pre-
 3 url: test
 4 connecting to: test
 5 type "help" for help
 6 > db.getCollectionNames();
 7 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
 8 > reference_to_test_db = db.getSisterDB('test');
 9 test
10 > reference_to_test_db.getCollectionNames();
11 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
12 > use admin
13 switched to db admin
14 > reference_to_test_db.getCollectionNames();
15 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
16 > bye
17 sa@wks:~/mm/new$

Note how we get a reference to our test database (line 8) and how that reference keeps working even after we switch to the admin database (lines 12 to 14).

Automatically start/restart on Server boot/reboot

One way would be to use the
Use Case for --fork

The

The fastest way to do so would simply be to append

Make a mongod Instance a Master without restarting it

No, we have to restart mongod. For example, with PyMongo we would catch the

Resource Usage

We only have so much memory, CPU cycles, storage/network I/O... we want to use these resources as efficiently as possible.

Make MongoDB faster

We need to give MongoDB enough RAM, i.e. to do so we need to learn/know our use case — every use case has its individual metrics with regards to how it is best indexed and, if sharding is used, what the right sharding keys are. Things like mounting the filesystem with the noatime option and using SSDs (Solid State Drives) are good too, but we need to keep in mind that by far the most can be achieved if the working set size and indexes fit in RAM — as said, every use case has its individual metrics.

Database growing fast

The first file for a database is

So, if we have say, database files up to

Note that deleting data and/or dropping a collection or index will not release already allocated diskspace since it is allocated per database. Diskspace will only be released if a database is repaired or the database is dropped altogether. Go here for more information. Also note that while MongoDB generally does a lot of pre-allocation,
we can remedy this by starting mongod with the --noprealloc switch.

Why does MongoDB use so much RAM?

Well, it does not actually, it is just that most folks do not really understand memory management — there is more to it than just "is in RAM" or "is not in RAM". The current default storage engine for MongoDB is called

As of now (January 2012), an alternative storage engine (

Generally, the memory-mapped store (

Caching Algorithm

Actually, this is done by the OS using the LRU (Least Recently Used) caching pattern. However, sooner or later most DBMSs implement their own CRA (cache replacement algorithm), simply because every DBMS is different and therefore a DBMS specific CRA can get the most out of every DBMS. Sounds good, but it is also bad news at the very same time. While having its own cache does usually make caching more efficient/faster, it causes problems with cache reheating:

Cache Reheating

Using the OS's cache, the nice attribute of the default MongoDB storage engine is its use of OS memory-mapped files — the cache is the operating system's file system cache. Therefore restarting mongod does not empty the cache, i.e. no reheating is needed. If MongoDB would use its own caching system including a bespoke CRA
then that would cause a reheat scenario even when we only restart mongod (rather than the whole machine).

Working Set Size

Working set size can roughly be thought of as how much data we will need MongoDB (or any other DBMS (Database Management System), relational or non-relational) to access in a period of time. For example, YouTube has ridiculous amounts of data, but only 1% may be accessed at any given time. If, however, we are in the rare case where all the data we store is accessed at the same rate at all times, then our working set size can be defined as our entire data set stored in MongoDB.

How much RAM does MongoDB need?

We now know MongoDB's caching pattern, we also know what a working set size is. Therefore we can have the following rule of thumb on how much RAM (Random Access Memory) a machine ideally needs in order to work as fast as possible: it is the working set size plus MongoDB's indexes which should ideally reside in RAM at all times, i.e. the amount of available RAM should ideally be at least the working set size plus the size of indexes plus what the rest of the OS (Operating System) and other software running on the same machine needs. If the available RAM is less than that, LRUing is what happens and we might therefore get significant slowdown. One thing to keep in mind is that in an index btree buckets are cached, not individual index keys, i.e. if we had a uniform distribution of keys in an index including for historical data, we might need more of the index in RAM compared to when we have a compound index on time plus something else. With the latter, keys in the same btree bucket are usually from the same time era, so this caveat does not happen. Also, we should keep in mind that our field names in BSON are stored in the records (but not the index), so if we are under memory pressure they should be kept short. Those who are interested in MongoDB's current virtual memory usage
(which of course also is about RAM), can have a look at the status of the mem field in the db.serverStatus() output (see below).

Speed Impact of not having enough RAM

Generally, when databases are too big to fit into RAM entirely, and if we are doing random access, we are in trouble as HDDs (Hard Disk Drives) are slow at that (roughly 100 operations per second per drive). One solution is to have lots of HDDs (10, 100...). Another one is to use SSDs (Solid State Drives) or, even better, add more RAM. Now that being said, the key factor here is random access. If we do sequential access to data bigger than RAM, then that is fine. So, it is ok if the database is huge (more than RAM size), but if we do a lot of random access to data, it is best if the working set fits in RAM entirely. However, there are some nuances around having indexes bigger than RAM with MongoDB. For example, we can speed up inserts if the index keys have certain properties — if inserts are an issue, then that would help.

Limit RAM Usage

MongoDB is not designed to do that, it is designed for speed and scalability. If we wanted to run MongoDB on the same physical machine alongside some web server and for example some application server like Django, then we could ensure memory limits on each one by simply using virtualization and putting each one in its own VE (Virtual Environment). In the end we would thus have a web application made of MongoDB, Django and for example Nginx, all running on the same physical machine but being limited to whatever limits we set on each VE they run in.

Out Of Memory Errors

If we are getting something like this

As we already know, MongoDB takes all RAM it can get, i.e. RAM (Random
Access Memory), or more precisely RSS (Resident Set Size), itself part of virtual memory, will appear to be very large for the mongod process.

The important point here is how it is handled by the OS (Operating System). If the OS just blocks any attempt to get more virtual memory or, even worse, kills the process (e.g.

 1 sa@wks:~$ ulimit -a | egrep virtual\|open
 2 open files                      (-n) 1024
 3 virtual memory          (kbytes, -v) unlimited
 4 sa@wks:~$ lsb_release -irc
 5 Distributor ID: Debian
 6 Release:        unstable
 7 Codename:       sid
 8 sa@wks:~$ uname -a
 9 Linux wks 2.6.32-trunk-amd64 #1 SMP Sun Jan 10 22:40:40 UTC 2010 x86_64 GNU/Linux
10 sa@wks:~$

As we can see from lines 5 to 9, I am on Debian sid (still in development) running the 2.6.32 Linux kernel. The settings we are interested in are on lines 2 and 3. Virtual memory is unlimited by default so that is fine already — this is actually what causes the most problems, so we need to make sure it stays that way (i.e. unlimited).

Note that we need to run these commands (e.g.

OpenVZ

If we are running MongoDB with OpenVZ then there are some more
settings we might want to tune in order to keep the OOM (Out of Memory) killer from kicking in, or from simply hitting the virtual memory ceiling if it is not set to unlimited.

Does MongoDB use more than one CPU Core?

As for map/reduce (and any other JavaScript related operations), write operations are single-threaded, i.e. they make use of only one CPU core for any running mongod. For read operations however, which are the majority of operations, a mongod will happily make use of several CPU cores since reads are handled concurrently.
In short: we will notice a speedup going from a single-core CPU to
multi-core since the speed increase is roughly linear to the available
CPU cores when doing client-side read operations from any of MongoDB's
drivers. However we will not see linear speedup with write operations
and server-side read operations on a single mongod.

The latter, server-side JavaScript execution, is mostly due to poor support for concurrency in JavaScript engines. There are plans however to also make JavaScript execute in parallel. As of now (October 2010) however, JavaScript executes single-threaded.

Global Lock

Right now (January 2012) we still have a global lock which applies across all databases on a mongod instance.

There are a lot of tricks in the codebase to mitigate possible contention issues, but the lock does apply against all databases on a
single mongod.

The plan is to have a per-collection lock some time in the future which will ease the pain of having a global lock a lot. This feature will probably be introduced with v2.2, i.e. probably before mid 2012. The tickets to watch are SERVER-1240 and probably also SERVER-1830.

>1 Threads inserting at the same Time

If we have 2 threads inserting under one

How much Data can be stored with MongoDB?

16 MiB is the limit on individual documents. GridFS however uses many documents (split into small chunks, usually 256 kB in size) so practically speaking there is no limit. However, some implementations (read: drivers) have a limit of 2^31 chunks — if we do the maths here that means that we end up with around 549 TB of storage for GridFS — it is expected that this limit will go away at some point, plus the 256 kB default chunk size can be changed to allow for bigger GridFS collections. Note that chunks with regards to GridFS (usually 256 kB in size) are not the same as chunks with regards to sharding. With normal sharded collections, assuming the same limit of 2^31 chunks and a default chunk size of 64 MiB, we would theoretically be able to store 137 PB. The above is true for x86-64 systems; it is not entirely true for x86 (32 bit) systems — there is a limit because of how memory mapped files work, which is a limit of 2 GiB per database.

Do embedded Documents count toward the 16 MiB BSON Document Size Limit?

Yes, the entire BSON (Binary JSON) document (including all embedded documents, etc.) cannot be more than 16 MiB in size.

Does Document Size impact read/write Performance?

Yes, but this is mostly due to network limitations e.g. one will max out a GigE link with inserts before document size starts to slow down MongoDB itself.

Size of a Document

We can use Object.bsonsize()

sa@wks:~$ mongo
MongoDB shell version: 1.5.1-pre-
url: test
connecting to: test
type "help" for help
> db.test.save({ name : "katze" });
> Object.bsonsize(db.test.findOne({ name : "katze"}))
38
> bye
sa@wks:~$

Size of a Collection and its Indexes

sa@wks:~$ mongo --quiet
type "help" for help
> db.getCollectionNames();
[ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
> db.test.dataSize();
160
> db.test.storageSize();
2304
> db.test.totalIndexSize();
8192
> db.test.totalSize();
10496

We are using the
Current versions of MongoDB also show the so-called

If we need/want a more detailed view we could also have a look at the output of validate()

> db.test.validate();
{
    "ns" : "test.test",
    "result" : "
validate
  firstExtent:2:2b00 ns:test.test
  lastExtent:2:2b00 ns:test.test
  # extents:1
  datasize?:160 nrecords?:4 lastExtentSize:2304
  padding:1
  first extent:
    loc:2:2b00 xnext:null xprev:null
    nsdiag:test.test
    size:2304 firstRecord:2:2be8 lastRecord:2:2c58
  4 objects found, nobj:4
  224 bytes data w/headers
  160 bytes data wout/headers
  deletedList: 0000001000000000000
  deleted: n: 1 size: 1904
  nIndexes:1
    test.test.$_id_ keys:4
  ",
    "ok" : 1,
    "valid" : true,
    "lastExtentSize" : 2304
}
> bye
sa@wks:~$

Note that while MongoDB generally does a lot of pre-allocation, we can remedy this by starting mongod with the --noprealloc switch.

Reusing Free Space

Any time we delete a document using

The way this works internally is that MongoDB maintains a so-called free list for all of its collections. This is not entirely true however — capped collections have quite different semantics in their behavior and use, which is why they do not have to maintain a free list, a fact that makes them a lot faster on I/O compared to ordinary collections.

Collections, Namespaces

As we already know, collections are to MongoDB what tables are to MySQL and friends. In MongoDB collections are so-called namespaces.

Capped Collection
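As a minimal sketch (names and sizes made up), a capped collection is created explicitly and then behaves like a fixed-size buffer that preserves insertion order:

> db.createCollection("logs", { capped : true, size : 100000, max : 5000 });
{ "ok" : 1 }
> db.logs.insert({ ts : new Date(), msg : "hello" });
> db.logs.find().sort({ $natural : -1 }).limit(1);    // newest entry first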
WRITEME

Sharding a Capped Collection

It is not possible to shard a capped collection.

Capped Array

Rename a Collection

Using the renameCollection() method shown in the collection help above.

Virtual Collection

It refers to the ability to reference embedded documents as if they were a first-class collection of top level documents, querying on them and returning them as stand-alone entities, etc.

Number of Collections/Namespaces

There is a limit to how many collections/namespaces we can have within a single MongoDB database — this number is essentially the number of collections plus the number of indexes. The limit is ~24000 namespaces
per database, with each namespace being 628 bytes in size and the namespace (.ns) file being 16 MiB by default.

Limit of Indexes per Collection

Yes, currently, with v1.6, we have a limit of 64 indexes per collection.

Cloning a Collection

Yes, possible. Have a look at the cloneCollection command.

Merge two or more Collections into One

Yes, we read from all collections we want to merge and use

List of Collections per Database

We can use db.getCollectionNames(), query system.namespaces, or use show collections:

 1 sa@wks:~$ mongo
 2 MongoDB shell version: 1.2.4
 3 url: test
 4 connecting to: test
 5 type "help" for help
 6 > db
 7 test
 8 > db.getCollectionNames();
 9 [ "fs.chunks", "fs.files", "mycollection", "system.indexes", "things" ]
10 > db.system.namespaces.find();
11 { "name" : "test.system.indexes" }
12 { "name" : "test.fs.files" }
13 { "name" : "test.fs.files.$_id_" }
14 { "name" : "test.fs.files.$filename_1" }
15 { "name" : "test.fs.chunks" }
16 { "name" : "test.fs.chunks.$_id_" }
17 { "name" : "test.fs.chunks.$files_id_1_n_1" }
18 { "name" : "test.things" }
19 { "name" : "test.things.$_id_" }
20 { "name" : "test.mycollection" }
21 { "name" : "test.mycollection.$_id_" }
23 > show collections
24 fs.chunks
25 fs.files
26 mycollection
27 system.indexes
28 things
29 > bye
30 sa@wks:~$

Delete a Collection
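A minimal sketch (collection name made up); drop() removes the collection together with its documents and indexes, whereas remove({}) would only delete the documents:

> db.mycollection.drop();
true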
Delete a Database
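And the database-level equivalent — dropDatabase() acts on the current database and releases the diskspace allocated to it (database name made up):

> use mydb
switched to db mydb
> db.dropDatabase();
{ "dropped" : "mydb", "ok" : 1 }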
Namespace

Collections can be organized in namespaces. These are named groups of collections defined using a dot notation. For example, we could define collections

Namespaces can then be used to access these collections using the dot notation e.g.

Namespaces simply provide an organizational mechanism for the user, i.e. the collection namespace is flat from the database point of view
which means that List of Namespaces in the DatabaseOne way to list all namespaces for a particular database would be to enter MongoDB's interactive shell: sa@wks:~$ mongo MongoDB shell version: 1.2.4 url: test connecting to: test type "help" for help > db.system.namespaces.find(); { "name" : "test.system.indexes" } { "name" : "test.fs.files" } { "name" : "test.fs.files.$_id_" } { "name" : "test.fs.files.$filename_1" } { "name" : "test.fs.chunks" } { "name" : "test.fs.chunks.$_id_" } { "name" : "test.fs.chunks.$files_id_1_n_1" } { "name" : "test.things" } { "name" : "test.things.$_id_" } { "name" : "test.mycollection" } { "name" : "test.mycollection.$_id_" } > db.system.namespaces.count(); 11 > bye sa@wks:~$ The Statistics, Monitoring, LoggingOnce MongoDB is installed, configured and tested properly we switch the setup into production mode. Heck, we might even have a dedicated testbed! Anyhow, once we leave the garage for good, we need to know about the setup's health at all times. In order to optimize the system we want to have statistics about it and, last but not least, we need history about all kinds of events which occurred... we log! Server StatusOften folks do not know exactly what the units and abbreviations are with regards to the server status output: sa@wks:~$ mongo --quiet type "help" for help > version(); version: 1.5.0-pre- > db.serverStatus(); { "uptime" : 6695, "localTime" : "Sun Apr 11 2010 11:22:19 GMT+0200 (CEST)", "globalLock" : { "totalTime" : 6694193239, "lockTime" : 45048, "ratio" : 0.000006729414343397326 }, "mem" : { "resident" : 3, "virtual" : 138, "supported" : true, "mapped" : 0 }, "connections" : { "current" : 2, "available" : 19998 }, "extra_info" : { "note" : "fields vary by platform", "heap_usage_bytes" : 146048, "page_faults" : 57 }, "indexCounters" : { "btree" : { "accesses" : 0, "hits" : 0, "misses" : 0, "resets" : 0, "missRatio" : 0 } }, "backgroundFlushing" : { "flushes" : 111, "total_ms" : 2, "average_ms" : 0.018018018018018018, "last_ms" : 0, "last_finished" : "Sun Apr 11 2010 11:21:45 GMT+0200 (CEST)" }, "opcounters" : { "insert" : 0, "query" : 1, "update" : 0, "delete" : 0, "getmore" : 0, "command" : 12 }, "asserts" : { "regular" : 0, "warning" : 0, "msg" : 0, "user" : 0, "rollovers" : 0 }, "ok" : 1 } > bye sa@wks:~$ Most of it is obvious like for example One may ask what is the point of having both,
Within the The "opcounters" : { "insert" : 16513, "query" : 1482263, "update" : 141594, "delete" : 38, "getmore" : 246889, "command" : 1247316 },
And yes, taking those numbers and dividing them by time (delta or
total) will give us operations/time e.g. operations per second or
operations since the server was started.

Collection Statistics

Note that these statistics are about one particular collection, i.e. they are neither about the entire database (that would be db.stats()) nor about the whole server (db.serverStatus()).

sa@wks:~$ mongo
MongoDB shell version: 1.8.0-rc2
connecting to: test
> db.test.stats(); # hint: db.test.stats(1024 * 1024); would give MiB instead of Bytes
{
    "ns" : "test.test",
    "count" : 2,
    "size" : 80,
    "storageSize" : 2304,
    "numExtents" : 1,
    "nindexes" : 1,
    "lastExtentSize" : 2304,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 8192,
    "indexSizes" : {
        "_id_" : 8192
    },
    "ok" : 1
}
> bye
sa@wks:~$
So, just to be clear: the padding is only calculated and updated when updates are applied to existing objects and if they cause a move/re-alloc. If the update fits in the existing space then no change to the paddingFactor is made.

That means that on inserts data is just pushed to the end or into an existing hole that is big enough for the document/object to be inserted. It also means that inserts include some extra space when

Monitoring and Diagnostics

Yes, all possible.

Default logging Behavior

By default
Note: we have queries and then we have ordinary database operations. ProfilingHow fast is a certain query? How fast a certain database operation? MongoDB has a profiler which can answer such questions with ease. Log its QueriesYes, go here. Basically, by using the database profiler it is possible to
Note that the time slot we are talking about here is based on when
Change the logging Verbosity

Yes, use

Logrotate

Yes, we can run

Another possibility of making

This is somewhat also related to stopping MongoDB and the problems that arise if it is not done properly.

/var/lib/mongo/dialog.2f65d3d4

Files like

Schema Design

Data modeling does not only concern us within the logic tier but also within (and across) the data tier, i.e. we also model our database layout in order to achieve high maintainability, overall speed (entire stack) and low diskspace usage, amongst other things... query time minimization etc. We are speed devils after all... Well, first off, there is no such thing as a fixed schema with MongoDB, i.e. it is schema-less and thus we can alter/create a schema as we go. Therefore, there is no such thing as "the best schema". There are a few best practices however, like for example using embedded documents.
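For example, a sketch of the embedding idea (all names and values made up) — a blog post carrying its comments inline rather than in a separate collection, which lets us fetch and query everything with one round trip:

> db.posts.insert({
...     title : "India has three seasons",
...     tags : ["mumbai", "seasons", "monsoon"],
...     comments : [ { author : "sa", text : "nice read" },
...                  { author : "joe", text : "+1" } ]
... });
> db.posts.find({ "comments.author" : "sa" });    // queries reach into embedded documents via dot notation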
WRITEME

Query across Collections/Databases

There is currently (January 2012) no way to query multiple collections and/or databases (possibly on different servers) with a single query.

For now the best approach is to create a schema where first class objects (also known as domain objects) are in the same collection. In other words, documents which are all of the same general type belong into the same collection. This way we only need to query one collection and do not need to do cross-collection/database queries.

Example

In this example, both blogpost and newsitem are of the general type content, i.e. we will put them both into the content collection:

{
  type: "blogpost",
  title: "India has three seasons",
  slug: "mumbai-monsoon",
  tags: ["mumbai", "seasons", "monsoon"]
},
{
  type: "blogpost",
  title: "A summer in Paris",
  slug: "paris-summer",
  tags: ["paris", "summer"]
},
{
  type: "newsitem",
  headline: "2023, first human clone",
  slug: "first-human-clone",
  tags: ["clone"]
}

This approach takes advantage of the schema-less nature of MongoDB, i.e. although both document types may have different properties/keys (e.g. title for a blogpost versus headline for a newsitem), they can live side by side in the same collection.

This allows us to query all of our content, or only some type(s) of content, depending on our requirements.

Referencing vs Embedding

WRITEME
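As a rough sketch of the two options (hypothetical data): embedding keeps related data inside one document, referencing stores the _id of a document that lives in another collection and therefore needs a second query to resolve:

// embedding
> db.people.insert({ name : "sa", address : { city : "Berlin", zip : "10115" } });

// referencing
> var addr_id = ObjectId();
> db.addresses.insert({ _id : addr_id, city : "Berlin", zip : "10115" });
> db.people.insert({ name : "sa", address_id : addr_id });
> db.addresses.findOne({ _id : db.people.findOne({ name : "sa" }).address_id });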
DBRef
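A minimal sketch of the DBRef convention using the shell's DBRef helper (collection and field names made up); dereferencing is simply a second query against the referenced collection:

> var addr_id = ObjectId();
> db.addresses.insert({ _id : addr_id, city : "Berlin" });
> db.people.insert({ name : "sa", address : new DBRef("addresses", addr_id) });
> db.people.findOne({ name : "sa" }).address;
DBRef("addresses", ObjectId("4f0f6dd1e4b0c3a8d3b4a111"))
> db.addresses.findOne({ _id : addr_id });
{ "_id" : ObjectId("4f0f6dd1e4b0c3a8d3b4a111"), "city" : "Berlin" }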
Graphs

Trees
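A hedged sketch of one common way to model a tree in MongoDB, namely parent references plus an ancestor array (all names made up); the ancestor array makes whole-subtree queries a single indexed lookup:

> db.categories.insert({ _id : "databases", parent : null,        ancestors : [] });
> db.categories.insert({ _id : "nosql",     parent : "databases", ancestors : ["databases"] });
> db.categories.insert({ _id : "mongodb",   parent : "nosql",     ancestors : ["databases", "nosql"] });
> db.categories.ensureIndex({ ancestors : 1 });
> db.categories.find({ ancestors : "databases" });    // everything below "databases"
> db.categories.find({ parent : "nosql" });           // direct children only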
ConfigurationEven though a basic MongoDB setup is done in no time, when things get big, when we go head-to-head with YouTube, we need to configure/tune a little more ;-] Runtime ConfigurationIf we wanted to see the run-time configuration our current sa@wks:~$ mongo MongoDB shell version: 1.7.2-pre- connecting to: test > use admin switched to db admin > db.runCommand("getCmdLineOpts"); { "argv" : [ "/usr/bin/mongod", "--dbpath", "/var/lib/mongodb", "--logpath", "/var/log/mongodb/mongodb.log", "run", "--config", "/etc/mongodb.conf" ], "ok" : 1 } > bye sa@wks:~$ lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux unstable (sid) Release: unstable Codename: sid sa@wks:~$ Indexes, Search, MetadataWe want context relevant information and we want it yesterday... Create Index while/after DataimportAssuming we are using Storage Location of IndexesIndexes are part of normal database files (stuff in More than one Index per QueryAs of now (October 2010) we can only use one index when doing a query. However, this one index can of course be a compound index. Data Structure used for IndexesMongoDB uses B-trees for indexes — thus Besides the search key, each node in the B-tree also stores the disklocation of the data we actually want to retrieve — the database file in which it is contained and the offset. With sharding, all chunks on the same shard share the same index structure. CostLet us have a look at the different costs for searching through different data sets with either 50, 500 or 50000 objects: >>> import math >>> math.log(50) 3.912023005428146 >>> math.log(500) 6.2146080984221914 >>> math.log(50000) 10.819778284410283 >>> As can be seen, the cost goes up rather mildly compared to the input set size e.g. we only have to look at roughly 11 objects even though our data set contains 50000 objects in the end — that is, the cost only roughly tripled even though the input set size grew by a factor of 1000! Compound vs Single IndexesWhat is the difference between a compound index on
The answer is that single indexes will be slightly better on all of those points — if we are just querying on

ensureIndex is a NOP

There are several operations (

Also, depending on the driver used there is a difference between using
What BSON Object type is best used for the Document ID?

Every document in MongoDB has a unique key (_id). We can use any BSON type (string, long, embedded document, timestamp, etc.) as long as it is unique within the collection.

The actual reason why, by default, IDs are of the BSON ObjectID type has two main reasons: firstly, it is only 12 bytes in size rather than, for example, strings which are 29 bytes. Secondly, the BSON ObjectID type has some really useful information encoded with it automatically e.g. a timestamp, a machine ID (e.g. a hashed hostname), a process ID and a counter.

Also, the ObjectID values themselves are not actually used for anything internally e.g. some people think it is used for sharding which is not the case. All those values are just a reasonable semantic way to compose the ObjectID, that is all. Another reason why people prefer using an ObjectID (or some other
random string as value for the _id key)

Natural Order

Natural order is defined as the database's native ordering of objects in a collection. With capped collections natural order is guaranteed to be insertion order which, in case default ObjectIDs are used, is basically the same as sorting documents by their creation time and then inserting them one by one. I said basically because it is not 100% true that natural/insertion order guarantees to have documents ordered by their creation time. The reason is that ObjectIDs are constructed on the client side (e.g. by PyMongo) and then sent to the server, i.e. there is latency between when the ObjectID is created, the document sent over the wire and finally committed to disk on the server. So if we had several clients creating and sending documents, the insertion/natural order might sometimes be off a little bit in that a slightly younger document might have been inserted before an older one. However, in general, if dealing with a capped collection, the natural
order feature is a very efficient way to store and retrieve data in insertion order (much faster than, say, indexing on a timestamp field). We can use this fact to retrieve results sorted using $natural. Note however that

Example Usage with Sorting

In addition to forward natural order, items may be retrieved in reverse natural order. For example, to return the 50 most recently inserted items (ordered most recent to less recent) from a capped collection, we would use find().sort({ $natural : -1 }).limit(50) on that collection.

Full-text Search

MongoDB itself does not have built-in full-text search but there are other means, like for example using Lucene, so that one can have full text search with MongoDB — Elastic Search, itself based on Lucene, seems quite a promising solution. What is built-in with MongoDB however is quite good and might be enough for most use cases where actually not all features provided by some full-blown full text search engine might be needed.

Multikey

Most commonly each key has one value such as

A multikey is when a key has not just one value but an array of values
such as

There are two main use cases when this is of great use. One is with full-text search and the other when we have a varying number of indexes to build that cannot be determined beforehand, i.e. when keys and values, both, might vary in quantity and quality (type).

Constraints on Field Attributes

For example, what if we want/need the email attribute of the user document to be unique, i.e. no two or more users can have the same email address. Is this possible? The answer is yes, and it is usually done using unique indexes.

Metadata

Go here and here for more information.

Case-insensitive Queries

How can I make it so that

The solution is with regular expressions. In the particular case,

One way around this would be to literally store both versions (or normalize right away),

Map/Reduce

Map/Reduce in MongoDB is intended as an aggregation and/or ETL (Extract Transform Load) mechanism, not as a general purpose query mechanism. What that means is that for simpler queries we would just use a filter; those will leverage an index and be more real-time. O(1) query characteristics at all times... no matter if we have 1 GiB of data on a single node or 1 PiB of data spread across several thousand shared-nothing nodes.

Map/Reduce for real-time Queries

No, it is a little more complicated than that. It is true that
map/reduce allows for parallel execution of queries on a given data set, thus guaranteeing for

However,

So, the robot's brain is said to be real-time when, from the time a signal arrives, to the time a response/command leaves the brain, no more than a certain amount of time passes. Note that this does not include the time spent before receiving a signal (recognising the obstacle via its sensors) and, based on this signal, the robot body reacting to its brain's response/command (based on the formerly received input signal) sent to its body. Therefore real-time is a time constraint for a (computer) system in which it has to respond to a given input... or bad things may happen. What people really mean when they use the (misleading) term real-time with MongoDB (or any other DBMS for that matter) is that a response to a given query is darn fast (whatever that means?!) ;-] Therefore the actual question should be

Is Map/Reduce our first choice for darn-fast Queries?

Sorry, no again! Map/Reduce is the right thing for ETL type operations... facepalm, I know... we are close though, hang in! ;-]
So how the f... can we get our darn-fast queries? Easy! We tell map/reduce to create a permanent collection which we then index and query. This makes for darn-fast queries while we can update and index the collection periodically with map/reduce in the background. PyMongo for Map/ReduceHave a look at the reason is that map/reduce runs on the server ( Continuous/Incremental Map/ReduceIt is not possible to do incremental or continuous map/reduce but it is possible to have ordinary indexes on the collection created by map/reduce. Does Map/Reduce use Indexes?Yes, all normal query semantics apply. However, map/reduce itself does not use indexes. In other words: we can use a query to filter the docs that map/reduce iterates over i.e. reduce the number of documents map/reduce has to iterate over. What are the semantics of sort() and limit() with regards to sharding?
How often does reduce run?

When map/reduce is used on a sharded setup, each shard will run reduce and the whole process is initiated and controlled by one mongos.

The general MongoDB map/reduce system does iterative reduce so intermediary stages are kept as small as possible. Right now (January 2012) all final results end up on one shard only (the final output
shard). Since specifying Is it possible to call Map/Reduce on only portions of a Collection?Yes, as mentioned above, we can use the Does the Map/Reduce calculate every time there is a Query?Yes. Map/Reduce recalculates each time there is a map/reduce query. The results are then saved in a new collection. While map/reduce runs, does MongoDB wait for the Map/Reduce Query to finish?Yes and no. We will have to wait until it finished, but we can also run it in the background. Is it necessary for the Map step to emit anything? If so how often/much?Usually we have 1 emit per map step but it is designed to be very flexible i.e. we can have anywhere between 0 and n emits per map. Speedup Map/ReduceAs with writes, map/reduce runs single-threaded on a machine i.e. if we want more speed out of map/reduce, we want more machines — by more machines we mean shards, to really split up the work. Example: a setup with 2 shards running on 2 dual core machines would be faster than a setup with 1 quad core machine without sharding. Can Map/Reduce speed be increased by adding more Machines to a Shard?No. Map/Reduce writes new collections to the server. Because of this, for now (October 2010) it may only be used on the primary i.e. if for example used with a replica set, it only runs/works on the primary of this replica set. This will be enhanced at some point so it will be possible to use up to the maximum number of machines per shard participating in map/reduce — thus being able to deliver somewhat less but close to linear speedup for map/reduce relative to the number of machines per shard. GISBecause location matters and because people like to work with and use GISs (Geographic Information Systems)... WRITEME GridFSMommy! Yes? Let us store the world!... and better even, let us be able to retrieve it once we are done... mommyisglancingaroundnervouslyatthispoint What is GridFS?Please go here. Why use GridFS over ordinary Filesystem Storage?One major benefit is that using GridFS does simplify our entire software stack and maintenance procedures — instead of using a distributed filesystem like for example MogileFS in addition to MongoDB, we can now use GridFS for all the nifty things we would have used a distributed filesystem before, thus get rid of it altogether. Also, with GridFS we get disk locality because of the way MongoDB allocates data files i.e. if we store a big binary (e.g. a video), chances are good it all fits into one pre-allocated data file and is therefore stored in GridFS as one continuous big chunk (data file) on disk. This way we can use any Linux/Unix utilities to do CRUD (Create Read Update Delete) operations e.g. rescue it by copying it to another node, etc. Another big plus is that if we use the ordinary filesystem, we would have to handle backup/replication/scaling ourselves. We would also have to come up with some sort of hashing scheme ourselves plus we would need to take care about cleanup/sorting/moving because filesystems do not love lots of small files. With GridFS, we can use MongoDB's built-in replication/backup/scaling e.g. scale reads by adding more read-only slaves and writes by using sharding. We also get out of the box hashing (read uuid (universally unique identifier)) for stored content plus we do not suffer from filesystem performance degradation because of a myriad of small files. Also, we can easily access information from random sections of large files, another thing traditional tools working with data right off the filesystem are not good at. 
Last but not least, we can keep information associated with the file (who has edited it, download count, description, etc.) right with the file itself. Web ApplicationsAnother difference worth mentioning is that serving blobs (e.g. images) directly from the database is perfectly fine with GridFS and thus MongoDB as it was designed for that. With MySQL and friends we would rather use the filesystem to do that since storing blobs within a rdbms (relational database management system) is usually performing poorly, both in terms of diskspace and speed. Store non-binary Data that references binary DataA common use case is when we need to store information about some entity and, along with it, some binary data. For example, let us consider a person which might have information like address, hobbies, phone number and so on — all non-binary data i.e. regular textual data like strings, booleans, floats, arrays of strings, etc. in addition to that non-binary data, we would like to store binary data for that person like, for example, his/her profile picture for some facebook-alike website. The best approach is to store the non-binary data in a usual collection and the binary data in a GridFS collection i.e. a separate collection for both types of data, binary and non-binary. We would then need to access the GridFS collection to get the binary data belonging to a particular user e.g. his profile picture. In terms of performance impact, the best thing we could do now is to
have a unique path to the profile picture that includes the

We can think about how serving a list of images works in HTML. When we serve a page it includes

The logic needed to create the unique path containing the

So, what is the benefit of all of this? Why not just query both collections for each request? The answer is speed and scalability. If we only have to touch both collections once and can even circumvent the logic tier for requests altogether, then that makes for a huge performance boost! Secondly, with a unique path (also known as a URI (Uniform Resource Identifier)) to a picture (or any other kind of binary data), we can use a so-called CDN (Content Delivery Network) and have content cached close to the user and thus have low latency and minimized global traffic, along with a ton of other goodies.

Can I change the Contents of a File in GridFS?

No, simply because this could lead to data corruption. Instead, the recommended way is to put a new version of the file into GridFS by using the same filename. When done we can use (dramatic pause...)

Oh yes, absolutely, GridFS can be thought of as a versioned filestore... gosh!

Delete Files in GridFS

If we wanted to delete all files stored within GridFS then we should
just drop the fs.files and fs.chunks collections.

Delete / Read Concurrency

If we are reading from a file stored in GridFS while it is being deleted, then we will likely see an invalid/corrupt file. Therefore care should be taken to avoid concurrent reads from a file while it is being deleted. Note that this is not driver specific but based on the fact that we cannot do multiple, isolated operations, which would be needed to do a delete in GridFS that is concurrency safe. However, one way we could work around this would be by using a singleton in our logic tier (e.g. Django) and proxying delete/read operations through it.

Development

This subsection is about using MongoDB as data tier for our logic tier and the various things that need to be known in order to do so.

Is there a scons clean Command for MongoDB?

When scons is used,

Insert multiple Documents at once

Most drivers have a so-called bulk insert method. PyMongo for example has

How are large resultsets handled by MongoDB?

Most people fear (and rightfully so) that if a query is sent to some

That is not the case. What happens is that MongoDB uses cursors to send results back in batches, i.e. it does not assemble all of the results in RAM before returning them to the client. Last but not least, the batch size is configurable to provide the most possible flexibility for any possible use case out there.

Can I run MongoDB as Master and Slave at the same Time?

Yes, there are certain use cases where this might make sense.

When using PyMongo, what can be done for improving single-server Durability?

Since MongoDB does not wait for a response by default when writing to a mongod,

These commands can be invoked automatically with many of the drivers (e.g. PyMongo) when saving or updating in safe mode — behind the scenes however, what really happens is that a special command called getLastError is being invoked. To answer the initial question: with
Above we already mentioned replica sets. Replica pairs and/or sets have the ability to ensure writes to a certain number of nodes (slave for replica pairs respectively secondaries in context of replica sets) before returning. Commands are safeThe mentioned semantics for single-server durability are automatically true for commands i.e. they are all safe as all commands always have a response from the server — no fire and forget semantics... Birds-eye View on Data SetDuring development, when we populate/flush our data set every five minutes or so to try out whether or not the code we have just written does what it is supposed to do, it comes in handy if can get an overview on our data set instantly. Using PyMongo this task is easily accomplished: sa@wks:/tmp$ python Python 2.6.6 (r266:84292, Apr 20 2012, 09:34:38) [GCC 4.5.2] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import pymongo >>> >>> con = pymongo.Connection() >>> >>> def print_collections(): ... ... for dbname in con.database_names(): ... print '\nDatabase: %s' % (dbname,) ... db = con[dbname] ... ... for name in db.collection_names(): ... print ' Collection: %s' % (name,) ... coll = db[name] ... ... for doc in coll.find(): ... print ' Document: %s' % (doc,) ... ... ... ... ... >>> print_collections() Database: test Collection: test Document: {u'_id': ObjectId('4dda5230b9b052f4b881309b'), u'name': u'katze'} Collection: system.indexes Document: {u'ns': u'test.test', u'name': u'_id_', u'key': {u'_id': 1}, u'v': 0} Database: aa Collection: polls_poll Document: {u'question': u"What's up?", u'_id': ObjectId('4d63beba4ed6db0e36000000')} Collection: system.indexes Document: {u'ns': u'aa.polls_poll', u'name': u'_id_', u'key': {u'_id': 1}, u'v': 0} Database: admin Database: local >>> sa@wks:/tmp$ Of course, that is just a very simplistic go at the problem and we would adapt/alter where necessary based on our use case at hand e.g. to nicely traverse and print nested documents. Scalability, Fault Tolerance, Load BalancingScalability, fault tolerance and load balancing. Big words! Sounds scary, sounds hard, sounds expensive. All true, except when armed with MongoDB. Hidden Replica Set MembersWe can hide a secondary if we use This also means that hidden secondaries will not be used if a driver automatically distributes reads to secondaries. A hidden replica set member must have a priority of 0 which means we cannot have a hidden primary (only hidden secondaries). To add a hidden member to a replica set we would do > rs.add({"_id" : <some_id>, "host" : <a_hostname>, "priority" : 0, "hidden" : true}) Last but not least, let it be known that what we just explained is not in any way affected by the use of replica set priorities — we can use both or either one at the same time. Corrupted Primary affecting SecondaryA secondary queries the primary's oplog the same way any client does i.e. document-by-document. This means that if a document's headers are intact and the document's payload is the expected size, yet corrupt, then that corruption could be transferred, in theory. However, this is a very unlikely scenario, but not 100% impossible. The important thing to understand here is that a problem/corruption on the primary might (very unlikely but theoretically possible; see above) cause a few corrupted documents on the secondary but it will never cause a corrupted secondary — worst case we will have to write off a few documents but never an entire secondary. 
Replicating/Migrating Indexes

Data gets migrated (sharded setup) and/or replicated (replica set) but never indexes. They will be built and maintained entirely by any node itself (e.g. a secondary within a replica set) and never transferred in any way between nodes.

bounce

What does bouncing

local

The local database contains data that is local to a given server, i.e. it will not be replicated anywhere. This is one reason why it holds all of the replication metainformation, like for example the oplog.
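For example, a quick read-only peek at that replication metadata — here the newest oplog entry on a replica set member (the collection is oplog.rs for replica sets; the older master/slave setup uses oplog.$main instead, and the exact collection list will vary):

> use local
switched to db local
> db.getCollectionNames();
[ "oplog.rs", "replset.minvalid", "system.indexes", "system.replset" ]
> db.oplog.rs.find().sort({ $natural : -1 }).limit(1);    // most recent replicated operation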
Time to FailoverHow long does it take until automatic failover happens? With
replica sets this is configurable (combination of The heartbeat signal is not blocked by running transactions e.g. if we
run an insert/update that takes 5 seconds, it will not block the
default Semantics for writes to a Replica Set whose Primary just died?From an application point of view, when trying to write to a replica set whose primary just died/vanished, the first write request will fail i.e. the application has to handle a retry. Subsequent requests should go to the right place i.e. the new primary. Can I go from a single Machine/Replica Pair to a Replica Set?Is there Downtime for adding a Secondary to a Replica Set?No, whether it is the first secondary within a replica set or any further secondary after the first one, there is no mandatory downtime involved in adding secondaries. Initial synchronization SourceIn order to synchronize a new secondary in a replica set we can either use its primary or another secondary as source. Using a secondary is recommended because it does not draw resources from the primary. For example, if we wanted to add a new node and force it to synchronize from a secondary, we could do > rs.add({"_id" : <some_id>, "host" : <a_hostname>, "initialSync" : {"state" : 2}}) We can choose a source by its state (primary or secondary). By default, a new member of a replica set will attempt to synchronize from a random secondary and, if it cannot find one, synchronize from the primary. If it chooses a secondary, it will only use the secondary for its initial synchronization. Once the initial synchronization is done and it is ready to synchronize new data, it will switch over and use the primary for normal synchronization. SemanticsDuring the initial synchronization, the entire data set is copied to a
secondary (except for data we put into the local database). The way
this works is sort of snapshotting i.e. it clones the master's oplog
in SlaveDelayWith a replica set and/or the older master/slave setup it might be
beneficial to have a secondary that is purposefully many hours behind
the primary in order to prevent human error. In MongoDB 1.3.3+, we can
specify this with > rs.add({"_id" : <some_id>, "host" : <a_hostname>, "slaveDelay" : 3600}) Is there a Limit on the Number of physical Machines for a Replica Set?Yes, there is. As of now (February 2012) it is 12 physical machines per replica set — a shard usually consists of a replica set so... This number includes the primary i.e. 1 primary and 11 secondaries are the maximum number of physical machines for a replica set as of now. Scale from 1 Shard to >=2 ShardsYes, and doing so does not even require downtime. Is there Downtime for adding a Shard to an existing sharded Setup?No, a shard can be added to an existing sharded setup without downtime
for the existing setup. The way this works is rather simple — we set
up a new replica set and then use the In which Order do I upgrade a sharded Setup?Assuming we have sharding set up and each shard is made redundant using a replica set, the recommended order for upgrading the whole setup is to start with the config servers, next the shards (replica sets), and finally the mongos used to route client requests to our shards. It is also recommended to use the same version of mongodb for all instances and, if possible, use pre-built packages as provided by our OS e.g. Do I need an Arbiter?An arbiter is a Therefore, an arbiter is only needed for a replica set that might end
up with 2 secondaries (or any other even number of secondaries) which
carry the same weight in terms of votes ( In the case of an odd number of secondaries remaining after the primary vanished, they can elect a new primary amongst themselves without the need for an arbiter — that is, even if all of them carry the same weight in terms of votes. Usually what we want to start out with is an odd number of This is merely a slight insurance assuming that every instance carries the same weight in terms of votes — there is also the factor of which of the remaining secondaries has the most up-to-date copy of the data set i.e. votes are not the only factor that determines the election of a new primary. There are also replica set priorities which influence the election of a new primary. An arbiter would ideally run on a different physical machine (i.e. not one that participates in the replica set as primary or secondary). Note however that there is nothing stopping us from having an arbiter
deployed for a replica set which, even after its primary vanishes
(physical machine dies,
Saving DiskspaceSome setups start out with only 2 members (1 primary and 1 secondary). The benefit of doing so is saving diskspace — 2 copies of our data set rather than 3. The downside with this setup is that there are only two copies of the data set at all times which increases the probability of total data loss. With 1 arbiter and 1 secondary the replica set is able to break ties in the voting process and elect the secondary to primary automatically (without manual/human intervention) — without the arbiter the replica set would not be able to recover automatically and would remain in read-only mode until manually fixed. Replica Set PrioritiesMongoDB 1.9 allows us to assign priorities in the range of 0 to 100 to each member of the replica set.
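Since every member persists the configuration of the entire replica set in its local database, we can inspect priorities and arbiter flags from PyMongo; a hedged sketch, assuming a member runs on localhost:

import pymongo

con = pymongo.Connection()
config = con.local.system.replset.find_one()    # the persisted replica set configuration
if config is not None:
    for member in config['members']:
        print(member['host'],
              member.get('priority', 1),        # defaults to 1 when not set explicitly
              member.get('arbiterOnly', False))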
WRITEME Arbiters with --oplogsize 1By default the oplog size is roughly 5% of diskspace on a machine
running a Anyway, arbiters do not need to hold any data or have an oplog; therefore 5% of diskspace would be a huge waste of space. How can I make sure a write Operation takes place?Go here and here for more information. Sharding works on the collection-level but not on the database-level?Exactly. When sharding is enabled we shard collections but not the entire database. This makes sense since, as our data set grows, certain collections will grow much larger than others. Replication works on the database-level but not on the collection-level?Collection-level replication is not supported for replica sets i.e. we can either replicate the entire database or nothing at all. I want MongoDB to start sharding with less than X MiB?MongoDB's sharding is range-based. So all the objects/documents in a collection get put into a chunk. Sharding only starts after there is more than 1 chunk. Right now (February 2012), the chunk size is 64 MiB, so we need at least 64 MiB for sharding to occur. If we do not want to wait until we have 64 MiB, then we can configure
Also: note that chunks with regards to GridFS (usually 256 KB in size) are something entirely different from chunks with regards to sharding, both in size and the semantics that go with them; it is just that in both cases the term chunk is used. What ties exist between Map/Reduce and ordinary Queries and indexing?Indexing and standard queries in mongodb are separate from map/reduce i.e. for somebody who might have used couchdb in the past this is a big difference — MongoDB is more like mysql for basic querying and indexing i.e. those who just want to index and query can do so without the need to ever touch the whole map/reduce subject. Does a Shard Key have to be a unique Index?A shard key can either be a single or a compound index. We might also
choose to make it unique. If so, then we should know that a sharded
collection can only have one unique index, which must exist on the
shard key i.e. no other unique indexes can exist on a sharded
collection except for the unique index which exists on the shard key
itself. Note that Either way, some variation is good since otherwise it will be hard for MongoDB to split up our data set if the shard key lookup always returns either true or false. Choose a Shard Key?First and foremost, note that this very much depends on the use case at hand i.e. a shard key is always specific to a particular setup and therefore has to be chosen wisely — the choice of a shard key will affect overall cluster performance as well as the time spent to extend and maintain it. What is generally true for any use case is that we want our data set
evenly distributed across the entire cluster i.e.
Second, we want data to only live on one shard at all times i.e. we do not want the same data to live on two or more shards at the same time. In order to adhere to these two general rules, a shard key should fit our use case so that:
Last but not least, note that what is true for how data gets stored in a sharded setup is true for anything else as well e.g. the same semantics apply for queries to our data set. The above explanation focused only on storing data for the sake of simplicity, as most people new to sharding already have a hard time wrapping their minds around it. Time vs Non-time baseAs a rule of thumb, we can decide to go with one of the following use cases:
Compound Shard KeysThis is when the shard key pattern has more than one field —
The way semantics change (for splitting up our data set and/or for queries etc.) is like this: If we shard by a compound shard key pattern of Should I use a Compound Shard Key?As usual, it depends on the data set at hand and how it is queried (see above). Please also go here for more information. Change Shard KeySo far (February 2012), there is no way to change the sharding key on a collection. In order to have a different shard key, we would have to recreate the collection with the new sharding key and write a script to move data from the old collection to the new one. Unique IndexThe same that is true for changing the shard key is true for having a unique index on sharded collections — we need to recreate and move data from the old to the new collection which would have the unique index in place. Also, note that a sharded collection can have only one unique index, which must exist on the shard key itself. Migrating ChunksWith a sharded MongoDB setup, data (chunks) is moved around on all
post-cleanup or removedDuringChunks are a logical unit — right now (January 2012), The Specifically,
Swap Master And SlaveWe can swap master and slave. Actually this is not much different from cloning a snapshot. What we need to do is:
Can I upgrade the Master without the Slave(s) needing to resync?Yes. Assuming we want to upgrade the master (e.g. give it
faster/bigger hardware) to cope with growth, we certainly do not want
to resync the/all slave(s) which were pointed to this master using
The way this works is by copying the oplog from the old master to the new one and then pointing the/all slave(s) to the new master. The necessary steps are outlined below:
How does distributing Reads amongst Replica Sets/Pairs work?This does not happen automatically (i.e. By default all I/O (input/output) operations go to the master (in replica pair terminology) or the primary (replica set terminology). The problem here is that there is no guarantee all secondaries within the replica set are totally synced with their primary at any given time — there are cases where this is not a problem but there might be cases when it actually matters. The way distributing reads works is that an application using MongoDB gets to decide when/how to distribute reads by means provided through its native MongoDB language driver e.g. PyMongo. This provides maximum flexibility and still makes total sense since replica sets are thought to solve more problems than just distributing reads (load balancing). For example, with replica sets we get:
PyMongo for example might have some more flexible options in the future, but to distribute reads to a replica set we have to maintain multiple connection instances and distribute them in our application code i.e. as of now (February 2012) PyMongo does not load-balance reads across secondaries (e.g. in round-robin fashion) automatically. As said however, we might see this feature soon... Distributing reads with mongosIf we have a sharded setup where each shard is made up of a replica set, we also have one or more mongos which route queries to and aggregate results from the replica sets. We already know that with Replica Set Configuration is persistentThat is, if I shut down a whole cluster, do I have to re-initialize it when I bring it back up? No, you do not have to re-initialize because, as for sharding, the configuration is persistent. Each node within the replica set contains the configuration for the entire replica set. Shard Configuration is persistentYes, the config server handles persistence of the shard setup. Overhead involved with ShardingIn an auto-sharding setup, the Compared to {My,Postgre}SQL, what is the Performance Penalty with Replication?When we use replication with mysql/<some_other_rdbms>, it requires us to enable binlog and that results in a significant performance drop, usually 50% or more. This is different with MongoDB as it does not require writing to an additional replication log and therefore does not have anywhere near that 50% performance drop. With MongoDB, replication works by writing to an oplog which is generally pretty fast (depending on use case of course). What if a Primary's Oplog is too far ahead for a Secondary to catch up again?Since the oplog is a capped collection, this could very well happen. The question is what a secondary would do that already existed but may have been without a connection to its primary for some time — maybe long enough for the primary's oplog to roll over! Most folks would guess one of three cases:
Well, of course 1 is what happens: whether we had the primary roll over its oplog already or not, the secondary always does a full resync in order to have the exact same state/data as its primary. In case we want to speed things up, we could clone a snapshot from the primary and then use fastsync to create a new and up-to-date secondary from an existing primary which could then be integrated within the replica pair/set again. Does using --only mean not the entire Oplog is written on the Master?No, even if we use DNSReplica sets as well as shards can use domain names instead of IPs. MiscellaneousA colorful mix of various non-related bits and pieces... tell me, does this also make you think of the food you had back then when you were still a student? ;-] C ExtensionsBy default PyMongo gets installed/built using optional C extensions (used for increasing performance). If any extension fails to build then PyMongo will be installed anyway but a warning will be printed. This is how we can check whether or not PyMongo got compiled with the C extensions: >>> import pymongo >>> pymongo.has_c() True >>> ConnectionsFor each TCP connection from a client to a This means that a single connection always has a consistent view of the data set — it reads its own writes e.g. if we do a query after we did an insert/update, then the query will return the most current status (we will see the changes done by insert/update). However, two or more connections do not necessarily have a consistent view of the data set as one connection might do a write/update which might not be seen immediately by another connection. For example, each mongo has its own TCP connection to a getLastErrorThis is useful for all CRUD (Create Read Update Delete) operations
(e.g. What all those CRUD operations have in common is that by default none
of them waits for a return code from For example, if we are writing data to a From within the shell the syntax to query directly on the If we use options such as Connection Pooling
The good news here is that most drivers support connection pooling
i.e. the drivers open multiple TCP connections (a pool) to PyMongo for example does automatic connection pooling in that it uses one socket per thread. As of now (January 2012) the number of pooled sockets is fixed at 10, but that does not include sockets that are checked out by any threads. Number of Clients connectedHave a look at the Number of parallel Client ConnectionsHave a look at the EndiannessMongoDB assumes that we are using a little-endian system e.g. Linux. In a nutshell: little-endian is when the LSB (least significant byte) is at the lowest address in memory. The other bytes follow in increasing order of significance. Even though MongoDB assumes little-endian, all of the drivers work on big-endian systems, so we can run the database remotely and still do development on our big-endian (old-school) machine. 32 bit limitationThere is a size limit because of how memory mapped files work. This
limit is 2 GiB per database (read per Type CastingIn MySQL we have Forbidden Characters/KeywordsYes, there are some. It is recommended not to use certain characters in key names. Store Logic on the ServerYou, like me, we are both sane technicians i.e. we follow the three tier model when building web applications and thus keep presentation, application logic and data separated. However, with MongoDB, it sometimes makes sense to put query logic onto/into the data tier. In a way that is pretty much the same as stored procedures with RDBMSs. So, the answer is yes. With MongoDB we can use JavaScript to write
functions (read logic) which we can then store and execute directly on
the server side i.e. where We should use this with caution however since there is only one thread of execution for each JavaScript call on the server (e.g. 10 Map/Reduce jobs would run 10 JavaScript threads in serial i.e. one after the other). This makes it a potential bottleneck and might cause problems on a heavily used server (e.g block other operations). Go here for more information. Cron-like SchedulerAs of now (February 2012) there is no built-in scheduler with MongoDB i.e. it is not possible to schedule execution for some server-side stored JavaScript. The simplest solution to do this now would be to put the JavaScript
code into a file, put this file into the OplogThe so-called oplog or operations log also known as transaction log is
in fact a capped collection (all The way this (replication) works is fabulous — we do not move the actual data but instead we move operations (read actions/computations) from the primary to the secondary and then run them again on the secondary, thereby creating the exact same state/data as found on the primary. Since the oplog is a capped collection, it is allocated to a fixed
size (5% of overall diskspace by default; If the secondary can not keep up with this process, then replication will be halted. The solution is to increase the size of the primary's oplog. There are a couple of ways to do this, depending on how big our oplog will be and how much downtime we can stand. But first we need to figure out how big an oplog we need. The Oplog is IdempotentIf working with the oplog (e.g. by replaying it on some database or by using it to synchronize two members of a replica set etc.) it is important to know that the oplog is idempotent i.e. we can apply it to our data set as many times as we want and our data set will always end up the same. How do I determine and create the right Oplog Size?
WRITEME What else can I use the Oplog for?Well, there are many good ideas floating around. Some of them centered around events. For example, by watching the oplog we might build an event framework/loop and do all kinds of fancy things. That is pretty much exactly the same as using CouchDB's CursorWell, generally speaking i.e. for DBMSs (Database Management Systems) in general, a cursor comprises a control structure for the successive traversal and potential processing of records in a result set. By using cursors, the client to the DBMS can get, put, and delete database records. Of course, this is also true for MongoDB i.e. a query for example returns a cursor which can then be iterated over to retrieve results. The exact way to query will vary with language driver. For example,
using MongoDB's interactive shell, database queries are performed with
the Conceptually speaking, a cursor always has a current position. If we delete the item at the current position, the cursor automatically skips its current position forward to the next item. As of now (February 2012), a cursor in MongoDB does not provide true point-in-time snapshot functionality when being accessed — we call this cursors being latent i.e. depending on whether or not an intervening CRUD (Create Read Update Delete) operation to the database happens after we did an initial access/query to the cursor, the exact same access/query to the cursor might not give us the same result after the CRUD operation happened. There is however a Tailable CursorIf we issue Tailable in this context means the cursor is not closed once all data is retrieved from an initial query. Rather, the cursor marks the last known object's position and we can resume using the cursor later, from where that object was located, provided more data is available. Like all latent cursors, the cursor may become invalid at any point e.g. if the last object returned is deleted. Thus, we should be prepared to requery if the cursor is dead. We can determine if a cursor is dead by checking its ID. An ID of zero indicates a dead cursor. In addition, the cursor may be dead upon creation if the initial query returns no matches. In this case a requery is required to create a persistent tailable cursor. Last but not least, tailable cursors are only allowed on capped collections and can only return objects in natural order. Transactions and DurabilityPlease go here for more information. How can I make the ObjectID to be a Timestamp?Firstly, that is what happens per default anyway i.e. the ObjectID
contains a timestamp. If, for whatever reason, we want the ObjectID to be
a timestamp and nothing but a timestamp, then we could do
However the downside of doing all this is that the ObjectID will not be guaranteed to be unique anymore which is generally considered a bad thing. One should have a very good reason for going down that road i.e. replacing the default ObjectID with a timestamp. It is probably better to store a timestamp in addition to the default ObjectID or, even better, use the portion of the default ObjectID that contains the timestamp and work based on that. If I insert the same Document twice, it does not raise an Error?Using PyMongo for example, why does inserting the same document (read
same try: doc = {'_id': '123123'} db.foo.insert(doc) db.foo.insert(doc) except pymongo.errors.DuplicateKeyError, error: print("Same _id inserted twice:", error) The answer is that Is incrementing and retrieving a new Value in one Step possible?Yes, it is even an atomic operation. The way it works is it returns the old value, not the new one i.e. the document returned will not include the modifications made on the update. What are MongoDB's Components/Building-blocks?
with sharding we get shards where each shard can be composed of either a single mongodb or a so-called replica set; a MongoDB auto-sharding+replication cluster has a master server at a given point in time for each shard. Config ServerConfig servers manage/keep all information with regards to sharding i.e. the config servers store the cluster's metadata, which includes basic information on each shard (which in most cases consists of a replica set) and the chunks contained therein. Chunk information is the main data stored by the config servers. Each config server has a complete copy of all chunk information. A two-phase commit is used to ensure the consistency of the configuration data among the config servers. If any of the config servers is down, then the cluster's metadata becomes read-only. However, even in such a failure state, the MongoDB cluster can still be read from and written to. As of now (February 2012) there can be either one (for development) or three (use when in production) config servers — not two, and not more than three. However, future plans include more than three config servers with more fine-grained control of what happens when one or more config servers go down. WRITEME: mongodb, mongos, arbiter Scalability, Fault Tolerance, Load BalancingAll of this we can do using sharding in combination with replica sets. Before we start however, I want the reader to know that there is a FAQs section with regards to scalability, fault tolerance and load balancing. ShardingWRITEME... just notes so far
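As a concrete starting point for these sharding notes, here is a hedged PyMongo sketch of how one might enable sharding for a database and one of its collections by talking to a mongos; the database/collection names and the shard key are made up:

import pymongo

mongos = pymongo.Connection('localhost', 27017)          # assumes this is a mongos, not a mongod
admin = mongos.admin
admin.command('enablesharding', 'shop')                  # allow sharding for database 'shop'
admin.command('shardcollection', 'shop.orders',          # shard one of its collections
              key={'customer_id': 1})                    # on a made-up shard key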
SizeThere is no hard-coded limit to the size of a sharded collection. We can start a sharded collection and add data for as long as we also add the corresponding number of shards required by the workload we experience. There are, however, two things we need to be smart about:
Global vs TargetedMongoDB sharding supports two styles of operations — targeted and global. On giant systems, global operations will be less applicable. ReplicationWRITEME... just notes so far
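To go with the replication notes, a small sketch that asks any member for the overall replica set status (member names and states); it assumes a replica set member running on localhost:

import pymongo

con = pymongo.Connection()
status = con.admin.command('replSetGetStatus')
for member in status['members']:
    print(member['name'], member['stateStr'])            # e.g. PRIMARY, SECONDARY, ARBITER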
Replica PairReplica Set
FastsyncWRITEME Ensure Writes to n NodesWRITEME
Load BalancingWRITEME... just notes so far
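Along the lines of what was said above about distributing reads in application code, here is a hedged sketch that cycles reads over per-secondary connections; the hostnames and the database/collection names are made up, and slave_okay is needed so the secondaries accept queries:

import itertools
import pymongo

secondaries = ['db1.example.com', 'db2.example.com']      # made-up hostnames
read_cons = [pymongo.Connection(host, slave_okay=True) for host in secondaries]
round_robin = itertools.cycle(read_cons)

def find_one(query):
    # each read goes to the next secondary in turn; writes still go to the primary
    return next(round_robin).test.demo.find_one(query)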
MiscellaneousThis section provides miscellaneous information with regards to MongoDB. PythonThis section is about using MongoDB as data-tier for a logic-tier written in Python. Best of both worlds, we love it because it is totally
Crazy? Exciting? Fabulous? Hm... not sure ;-]
PyMongo
Python, because JavaScript is slow, C++ is hard and Java feels like ORMMapReduce with PyMongowith Python Python drivers/stack
Django Stack
with mongokit with mongoengine
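For the Django stack with mongoengine, a minimal sketch (the database name and document class are made up) of what a document definition might look like:

import mongoengine

mongoengine.connect('polls')                              # made-up database name

class Poll(mongoengine.Document):
    question = mongoengine.StringField(required=True)

Poll(question="What's up?").save()
print(Poll.objects(question__icontains='up').count())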
Utilitiessa@wks:/tmp/nightly/mongodb-linux-x86_64-v1.6-2010-10-21$ ll bin/ total 63M -rwxr-xr-x 1 sa sa 6.7M Oct 21 08:45 bsondump -rwxr-xr-x 1 sa sa 3.1M Oct 21 08:45 mongo -rwxr-xr-x 1 sa sa 6.8M Oct 21 08:45 mongod -rwxr-xr-x 1 sa sa 6.8M Oct 21 08:45 mongodump -rwxr-xr-x 1 sa sa 6.7M Oct 21 08:45 mongoexport -rwxr-xr-x 1 sa sa 6.8M Oct 21 08:45 mongofiles -rwxr-xr-x 1 sa sa 6.8M Oct 21 08:45 mongoimport -rwxr-xr-x 1 sa sa 6.7M Oct 21 08:45 mongorestore -rwxr-xr-x 1 sa sa 4.6M Oct 21 08:45 mongos -rwxr-xr-x 1 sa sa 1.1M Oct 21 08:45 mongosniff -rwxr-xr-x 1 sa sa 6.8M Oct 21 08:45 mongostat sa@wks:/tmp/nightly/mongodb-linux-x86_64-v1.6-2010-10-21$ Some of them ( TopicsWRITEME Tips and TricksBackup, RecoveryBelow we are going to look at different scenarios when we would need to recover (as opposed to how to create/keep a backup and/or run a replica set) e.g. because we have a hardware failure or power outage etc. Hardware Failure
That is the easiest one. We need to go and fix the server before we can move on to either case from below. Power Outage, No Writes
If we did not do any writes during the session that shut down
uncleanly, our data set is fine. All we need to do is remove the
mongod.lock file after rebooting the machine and start Power Outage, Writes, No Replica Set, Backup
We had a power outage during a session which had writes. We run
without a replica set but luckily we have a backup. All we need to do
is stop Power Outage, Writes, No Replica Set, No Backup
This is the same as above but unfortunately we do not have a backup
that we can drop into If we have a single Since we only have this one copy of our data set, we will have to
repair whatever is left i.e. we remove the What we should not do at all is just remove the
Note that MongoDB keeps a sort of table of contents at the beginning of each collection. If it was updating its table of contents when the unclean shutdown happened, it might not be able to find a lot of healthy data. This is how the I lost all my documents! horror stories happen. Run with replication, people! Power Outage, Physical Damage, Writes, Replica Set
This is the procedure to follow when, within a replica set, the primary vanishes e.g. because of power outage, physical damage, hardware failure, etc. Since we are running a replica set (with or without sharding), we do not need to do anything; the promotion of a new primary will happen automatically — assuming we do not need an arbiter. Also we do not need to change anything in our application code (e.g. Django) since failover will happen automatically as well. Instead, really, if we are running a replica set, all we need to do is start the node that failed with the same arguments we used before i.e. all the stuff that lives in the configuration file. Next, remove the entire data directory ( Last but not least, two additional but unrelated notes:
DelegationQuerying / Sorting / Indexes
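A small PyMongo sketch for the querying/sorting/indexing topics; the collection and field names are made up:

import pymongo

con = pymongo.Connection()
coll = con.test.people

coll.ensure_index([('age', pymongo.ASCENDING)])          # index used by the query below
for doc in coll.find({'age': {'$gt': 21}}).sort('age', pymongo.ASCENDING).limit(5):
    print(doc)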
AggregationMap/Reduce
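For Map/Reduce, a hedged word-count style sketch with PyMongo; the input collection ('articles', assumed to carry a 'tags' array per document) and the output collection name are made up:

import pymongo
from bson.code import Code

con = pymongo.Connection()
coll = con.test.articles

map_f = Code("function() {"
             "  this.tags.forEach(function(t) { emit(t, 1); });"
             "}")
reduce_f = Code("function(key, values) {"
                "  var total = 0;"
                "  values.forEach(function(v) { total += v; });"
                "  return total;"
                "}")

result = coll.map_reduce(map_f, reduce_f, 'tag_counts')   # output collection name
for doc in result.find():
    print(doc['_id'], doc['value'])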
ACID, ConcurrencyThe basic thing to know about MongoDB with regards to ACID (Atomicity, Consistency, Isolation, Durability) is that MongoDB
Basically what that means is that if we have several clients competing to make writes to a single document, we can be sure that this will be atomic — it either happens from start to end or not at all i.e. there will be no two clients interfering with each other's writes and thus possibly creating inconsistency. WRITEME some notes from http://permalink.gmane.org/gmane.comp.db.mongodb.user/2590
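To illustrate the single-document atomicity just described, a small sketch that bumps a counter with the atomic $inc modifier; the collection and field names are made up:

import pymongo

con = pymongo.Connection()
coll = con.test.counters

coll.insert({'_id': 'pageviews', 'n': 0})
# many clients can run the next line concurrently; each $inc is applied atomically
coll.update({'_id': 'pageviews'}, {'$inc': {'n': 1}})
print(coll.find_one({'_id': 'pageviews'}))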
Single Server DurabilityNew since 1.8. One discussed solution is to have append-only transaction logs (write-ahead log). Security
Server Side Code Execution
db.eval(), $whereWhen we pass JavaScript to a db.eval() in sharded environmentStored Functions/Procedures
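A hedged sketch of storing a JavaScript function on the server and calling it from PyMongo; the function name and body are made up, and the caveats above about the single JavaScript execution thread still apply:

import pymongo

con = pymongo.Connection()
db = con.test

# saved into the special system.js collection on the server
db.system_js.echo_twice = "function(x) { return x + x; }"
print(db.system_js.echo_twice('ab'))             # runs server-side via eval, returns 'abab'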
TriggersBSONBSON is a language-independent data interchange format, not just exclusive to MongoDB. BSON was designed to have the following three characteristics:
So, what is the point of BSON when it is no smaller than JSON in many cases? Well, BSON is designed to be efficient in space, but in many cases is not much more efficient than JSON. In some cases BSON uses even more space than JSON. The reason for this is another of the BSON design goals: traversability. BSON is also designed to be fast to encode and decode. This uses more space than JSON for small integers, but is much faster to parse. SONOS Tweaks
GIS
GridFS
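A minimal GridFS sketch with PyMongo; the filename and content are made up:

import pymongo
import gridfs

con = pymongo.Connection()
fs = gridfs.GridFS(con.test)

file_id = fs.put('hello gridfs', filename='greeting.txt')   # stored as chunks plus metadata
print(fs.get(file_id).read())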
Serving GridFS data without going through the Logic TierNginxread from secondaries Misc
Import/ExportMigrating from RDBMS XY to MongoDB
StreamingE-CommerceUse CasesFinancial Applications
handling numbers
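One common approach to handling money amounts (an assumption about the use case, not something prescribed here) is to store integer cents instead of floats, avoiding rounding surprises; the collection and field names are made up:

import pymongo

con = pymongo.Connection()
coll = con.test.invoices                                  # made-up collection

coll.insert({'invoice': 42, 'total_cents': 1999})         # 19.99 stored as 1999 cents
doc = coll.find_one({'invoice': 42})
print('total: %.2f' % (doc['total_cents'] / 100.0))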
VoIP
StorageHostingApplications using MongoDB