Introduction

With MongoDB we can become more Googley, i.e. grow our data set horizontally on a cluster of commodity hardware and do distributed (read: parallel execution of) queries/updates/inserts/deletes. We can also ensure that our data is held redundantly, i.e. each bit of data is stored not just once but several times, on different physical nodes in the cluster, even across datacenter boundaries if we wish to do so. And guess what, failover/recovery and sharding happen automatically. Also, MongoDB has dynamic schemas. You are going to love it!

MongoDB

This section is about one particular NoSQL DBMS (Database Management System) known as MongoDB. For a summary, have a look at the abstract from above.

Installation and Configuration

Quickstart

Files

WRITEME

Interactive Shell
In addition, MongoDB defines some of its own classes and globals inside the interactive shell (e.g. the db handle for the current database and types such as ObjectId).
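For instance, a quick hypothetical session with two of these built-ins, the db handle and the ObjectId type (collection name and output made up):

> db                                        // the current database
test
> var oid = ObjectId();                     // one of the shell's own types
> db.things.insert({ _id : oid, name : "katze" });
> db.things.findOne({ _id : oid });
{ "_id" : ObjectId("4f0f6dd1e4b0c3a8d3b4a111"), "name" : "katze" }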
WRITEME

Help on the CLI

When in the CLI (Command Line Interface), it is important to know how to get information. In addition to the general commands and methods

sa@wks:~$ mongo --quiet
type "help" for help
> help()
HELP
     show dbs                     show database names
     show collections             show collections in current database
     show users                   show users in current database
     show profile                 show most recent system.profile entries with time >= 1ms
     use <db name>                set current database to <db name>
     db.help()                    help on DB methods
     db.foo.help()                help on collection methods
     db.foo.find()                list objects in collection foo
     db.foo.find( { a : 1 } )     list objects in foo where a == 1
     it                           result of the last line evaluated; use to further iterate

we can have a look at database specific commands and methods

> db.help();
DB methods:
     db.addUser(username, password[, readOnly=false])
     db.auth(username, password)
     db.cloneDatabase(fromhost)
     db.commandHelp(name) returns the help for the command
     db.copyDatabase(fromdb, todb, fromhost)
     db.createCollection(name, { size : ..., capped : ..., max : ... } )
     db.currentOp() displays the current operation in the db
     db.dropDatabase()
     db.eval(func, args) run code server-side
     db.getCollection(cname) same as db['cname'] or db.cname
     db.getCollectionNames()
     db.getLastError() - just returns the err msg string
     db.getLastErrorObj() - return full status object
     db.getMongo() get the server connection object
     db.getMongo().setSlaveOk() allow this connection to read from the nonmaster member of a replica pair
     db.getName()
     db.getPrevError()
     db.getProfilingLevel()
     db.getReplicationInfo()
     db.getSisterDB(name) get the db at the same server as this one
     db.killOp(opid) kills the current operation in the db
     db.listCommands() lists all the db commands
     db.printCollectionStats()
     db.printReplicationInfo()
     db.printSlaveReplicationInfo()
     db.printShardingStatus()
     db.removeUser(username)
     db.repairDatabase()
     db.resetError()
     db.runCommand(cmdObj) run a database command. if cmdObj is a string, turns it into { cmdObj : 1 }
     db.serverStatus()
     db.setProfilingLevel(level,<slowms>) 0=off 1=slow 2=all
     db.shutdownServer()
     db.stats()
     db.version() current version of the server
     db.getMongo().setSlaveOk() allow queries on a replication slave server

as well as collection specific ones

> db.test.help();
DBCollection help
     db.test.count()
     db.test.dataSize()
     db.test.distinct( key ) - eg. db.test.distinct( 'x' )
     db.test.drop() drop the collection
     db.test.dropIndex(name)
     db.test.dropIndexes()
     db.test.ensureIndex(keypattern,options) - options should be an object with these possible fields: name, unique, dropDups
     db.test.reIndex()
     db.test.find( [query] , [fields]) - first parameter is an optional query filter. second parameter is optional set of fields to return.
                                         e.g. db.test.find( { x : 77 } , { name : 1 , x : 1 } )
     db.test.find(...).count()
     db.test.find(...).limit(n)
     db.test.find(...).skip(n)
     db.test.find(...).sort(...)
     db.test.findOne([query])
     db.test.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
     db.test.getDB() get DB object associated with collection
     db.test.getIndexes()
     db.test.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
     db.test.mapReduce( mapFunction , reduceFunction , <optional params> )
     db.test.remove(query)
     db.test.renameCollection( newName , <dropTarget> ) renames the collection
     db.test.runCommand( name , <options> ) runs a db command with the given name where the 1st param is the collection name
     db.test.save(obj)
     db.test.stats()
     db.test.storageSize() - includes free space allocated to this collection
     db.test.totalIndexSize() - size in bytes of all the indexes
     db.test.totalSize() - storage allocated for all data and indexes
     db.test.update(query, object[, upsert_bool, multi_bool])
     db.test.validate() - SLOW
     db.test.getShardVersion() - only for use with sharding

If we are curious about what a particular method is doing, we can type it without the trailing parentheses

> db.test.find;
function (query, fields, limit, skip) {
    return new DBQuery(this._mongo, this._db, this, this._fullName, this._massageObject(query), fields, limit, skip);
}

One last thing we could do is list all available database commands and maybe drill deeper and look at their description

> db.listCommands();
eval lock: node adminOnly: false slaveOk: false
  Evaluate javascript at the server.
  http://www.mongodb.org/display/DOCS/Server-side+Code+Execution
assertInfo lock: write adminOnly: false slaveOk: true
  check if any asserts have occurred on the server
authenticate lock: write adminOnly: false slaveOk: true
  internal
buildInfo lock: node adminOnly: true slaveOk: true
  get version #, etc.
  { buildinfo:1 }
[skipping a lot of lines...]
validate lock: read adminOnly: false slaveOk: true
  Validate contents of a namespace by scanning its data structures for correctness. Slow.
whatsmyuri lock: node adminOnly: false slaveOk: true
  {whatsmyuri:1}
writebacklisten lock: node adminOnly: true slaveOk: false
  internal

> db.commandHelp('datasize');
help for: dataSize determine data size for a set of data in a certain range
  example: { datasize:"blog.posts", keyPattern:{x:1}, min:{x:10}, max:{x:55} }
  keyPattern, min, and max parameters are optional.
  note: This command may take a while to run
>

FAQs

This section gathers FAQs about MongoDB, all split into subsections of particular subjects.

Basics

This subsection gathers basic FAQs about MongoDB.

Reasons not to use MongoDB

The days where one DBMS fits all our needs are over! Technological advances in the area of NoSQL (e.g. CouchDB, MongoDB, etc.) are the perfect fit in certain areas, but still, established technologies like for example RDBMSs (Relational Database Management Systems) are very good at what they do best, which is storing relational data. What we want to do is use whatever tool is best for the job... Below is a quick summary which makes it easy to see whether or not a switch to MongoDB would make sense, or, more realistically, when and what parts of our infrastructure can be moved over to MongoDB and which parts are to be left on RDBMSs like for example Oracle, MySQL, PostgreSQL and friends. We maybe should not use MongoDB when:
Roadmap

Yes, go here.

Wiki? PDF Manual?

There is a PDF version of the Wiki.

Programming Languages

Go here and here for the available MongoDB drivers/APIs. In short: right now (October 2010) we can use MongoDB from at least C, C++, C# & .NET, ColdFusion, Erlang, Factor, Java, JavaScript, PHP, Python, Ruby, Perl. Of course, there might be more languages available in the future.

Python 3 Support

The current thought is to use Django as more or less a signal for when adding full support for Python 3 makes sense — this in turn will mainly depend on two things:
MongoDB can probably support Python 3 a bit earlier than Django does, but that is certainly not something the MongoDB community wants to rush and then have to support two totally different codebases. It can be tracked here. Personally I think that having an official Python 3 branch of PyMongo sooner rather than later would be a good thing.

Building Blocks

We have RDBMSs (Relational Database Management Systems) like for example MySQL, Oracle, PostgreSQL and then there are NoSQL DBMSs like for example MongoDB. Below is a breakout of how MongoDB relates to the aforementioned, i.e. how the main building blocks of each party correspond to one another:

MySQL, PostgreSQL...
----------------------
Server:Port - Database(s) - Table(s) - Row(s) - Column(s) - Index - Join

MongoDB
--------
Server:Port - Database(s) - Collection(s) - Document(s) - Field(s) - Index - embedding and linking

The concepts of server, database and index are very similar but the concepts of table/collection, row/document as well as column/field are quite different. In an RDBMS a table is a rectangle made of columns and rows. Each row has a fixed number of columns; if we add a new column, we add that column to each and every row. In MongoDB a collection is more like a really big box and each document is like a little bag of stuff in that box. Each bag contains whatever it needs in a totally flexible manner (read: schema-less). Note that schema-less does not equal type-less, i.e. it is just that with MongoDB any document has its own schema, which it may or may not share with other documents. In practice it is normal to have the same schema for all the documents in a collection. The concept of a column in RDBMSs is closest to what we call a field (key/value pair) in MongoDB — note however what we said above: we can create/read/update/delete individual fields for a particular document in a collection. This is different from creating/reading/updating/deleting a column in an RDBMS, which happens for every row in the entire table. Indexes are more or less the same for RDBMSs and MongoDB. Joins however do not exist in MongoDB; instead we can embed and/or link documents into/with other documents.

Administration

This subsection gathers FAQs with regards to administering MongoDB.

Large Pages

sa@wks:~$ cat /proc/meminfo | grep Hugepagesize
Hugepagesize:       2048 kB
sa@wks:~$

The use of huge pages is generally discouraged for MongoDB because it makes the granularity for managing memory very large, which is usually a bad thing.

MongoDB does not start anymore

Usually what happened is an unclean shutdown (e.g.
SIGKILL rather than SIGTERM has been sent, a power outage, etc.) of mongod. In that case we need to find and remove the stale lock file

sa@wks:~$ locate mongod.lock
/var/lib/mongodb/mongod.lock
sa@wks:~$

and start in repair mode. There are several ways for recovery, all depending on our particular setup. Note that both,

Filesystem

As of now (December 2010) most people prefer ext4 or XFS — in the future BTRFS might become people's first choice. Ext4 fully implements preallocation (something MongoDB makes heavy use of for its data files).
On older filesystems the semantics are such that literally the whole file is zeroed out, i.e. zeros are written to disk — obviously this takes a very long time. Without going into details, ext4 and other modern filesystems do not actually do that but achieve the same result by other (way faster) means.

Ctrl-C

Ctrl-C cleanly shuts down the MongoDB server (read: a running mongod instance).

MongoDB is asking me to do a repair

This could happen if there is an unclean shut down e.g. a power outage. It is important to note that it does not necessarily mean that the data is actually corrupted — MongoDB is just very paranoid and wants to make sure everything is all right and therefore asks to run a repair.
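As a rough sketch (user name, paths and init script are the Debian defaults that appear elsewhere in this section — adjust to the setup at hand), such a manual recovery usually boils down to removing the stale lock file and starting mongod once with --repair, or issuing repairDatabase() from the shell:

sa@wks:~$ sudo rm /var/lib/mongodb/mongod.lock     # only once we are sure no mongod is running
sa@wks:~$ sudo -u mongodb mongod --dbpath /var/lib/mongodb --repair
sa@wks:~$ sudo /etc/init.d/mongodb start           # start normally again

or, on an instance that is still running:

> db.repairDatabase();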
Bottom Line: if MongoDB suggests doing a repair, it does not automatically imply that we actually have corrupted data. It might be the case but it might also not be the case...

Compact Database

There are two ways to achieve this...

repairDatabase
One strategy is to do a rolling repair on a replica set, i.e. if we have n machines in a replica set (one primary, n-1 secondaries), we bring each one down, repair it, let it resync, and move on to the next machine/node — one machine at a time until all machines in the replica set have been compacted. That way we always have n-1 machines up and no downtime with regards to the entire cluster, i.e. a rolling repair is transparent to the logic tier. While at it, this is also a very good time to take a backup since the machine is down anyway and after the compaction/repair we end up with a smaller data set before it starts growing again — a perfect time to run a backup. From an administrative point of view it is recommended that we use the

For example, if we have the replica set members

compact Command

The

WebGUI? REST Interface/API?

A generally good resource on GUIs can be found here. Also, assuming a
In order to have a REST (Representational State Transfer) interface to MongoDB, same as CouchDB has it, we have to start mongod with the --rest switch. Note, if we wanted real-time updates from the CLI, then we could also use

Rename a Database

Yes, we can rename a database but it is not as easy as renaming a collection. As of now (October 2010), the recommended way to rename a database is to clone it and thereby rename it. This will require enough additional free diskspace to fit the current/old database at least twice.

Physically migrate a Database

That is a pretty common need: we have our database on one machine and want to migrate/copy it onto another machine. There is even a clone command for that. Note however that neither
Ok, so now we know MongoDB has tools to migrate a database... good to know but, well, it might just not be good enough for every use case out there, something we will look into a bit more below. I find that most use cases are a bit more complex and thus might require a slightly different approach. For example, in order to make it a bit more complex and realistic, we might need to migrate from one datacenter to another, through an insecure network i.e. the Internet. Of course, we would then maybe also be faced with low bandwidth and high latency, which means our connection might get canceled at any time and we would then have to start all over again. With a bandwidth of e.g. only 10 Mbit/s and 1 TiB of data to transfer, that quickly becomes a problem... or not? Actually this is a lot easier than it sounds because, at first, we figure the

Next we shut down

Minimum Downtime

While the above is fine, it might cause significant downtime — what we did in a nutshell was:

Below is what we could do in order to have as little downtime as possible:
Of course, we might also use OpenVZ and its live-migration feature to move a VE (Virtual Environment) containing our MongoDB database from the old to the new machine with no downtime whatsoever ;-]

Update to a new MongoDB version

If it is a drop-in replacement we just need to shut down the older version and start the new one with the appropriate options. Otherwise, i.e. if it is not a drop-in replacement, we would use mongodump and mongorestore, i.e. export with the old version and re-import with the new one.
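A minimal sketch of that non-drop-in route, using the stock mongodump/mongorestore tools (the backup path and the init script name are assumptions, not part of the original text):

sa@wks:~$ mongodump --out /backup/pre-upgrade      # dump with the old binaries still in place
sa@wks:~$ sudo /etc/init.d/mongodb stop            # stop the old version, install the new one
sa@wks:~$ sudo /etc/init.d/mongodb start
sa@wks:~$ mongorestore /backup/pre-upgrade         # re-import with the new binaries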
Default listening Port and IP

We can use

wks:/home/sa# netstat -tulpena | grep mongo
tcp   0   0 0.0.0.0:27017   0.0.0.0:*   LISTEN   124   1474236   8822/mongod
tcp   0   0 0.0.0.0:28017   0.0.0.0:*   LISTEN   124   1474237   8822/mongod
wks:/home/sa#

The default listening port for mongod is 27017 (plus 28017 for the built-in web console, i.e. port + 1000). Both, listening port and IP address, can be changed either by using the CLI (Command Line Interface) switches (--port and --bind_ip) or the corresponding settings in the configuration file.

Automatic Backups

Yes, possible. The manual way is to use mongodump. One might use cron to automate this. Go here for more information on backups and especially recovery.

getSisterDB()

We can use it to get ourselves references to databases, which not just saves a lot of typing but is, once we got used to using it, a lot more intuitive:

 1 sa@wks:~/mm/new$ mongo
 2 MongoDB shell version: 1.5.2-pre-
 3 url: test
 4 connecting to: test
 5 type "help" for help
 6 > db.getCollectionNames();
 7 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
 8 > reference_to_test_db = db.getSisterDB('test');
 9 test
10 > reference_to_test_db.getCollectionNames();
11 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
12 > use admin
13 switched to db admin
14 > reference_to_test_db.getCollectionNames();
15 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
16 > bye
17 sa@wks:~/mm/new$

Note how we get a reference to our test database (line 8) and how that reference keeps working even after we switch to the admin database (lines 12 to 14).

Automatically start/restart on Server boot/reboot

One way would be to use the
Use Case for --fork

The

The fastest way to do so would simply be to append

Make a mongod Instance a Master without restarting it

No, we have to restart mongod. For example, with PyMongo we would catch the

Resource Usage

We only have so much memory, CPU cycles, storage/network I/O... we want to use these resources as efficiently as possible.

Make MongoDB faster

We need to give MongoDB enough RAM, i.e. to do so we need to learn/know our use case — every use case has its individual metrics with regards to how it is best indexed and, if sharding is used, what the right sharding keys are. Things like mounting the filesystem with the noatime option and using SSDs (Solid State Drives) are good too, but we need to keep in mind that by far the most can be achieved if the working set size and indexes fit in RAM — as said, every use case has its individual metrics.

Database growing fast

The first file for a database is

So, if we have say, database files up to

Note that deleting data and/or dropping a collection or index will not release already allocated diskspace since it is allocated per database. Diskspace will only be released if a database is repaired or the database is dropped altogether. Go here for more information. Also note that while MongoDB generally does a lot of pre-allocation,
we can remedy this by starting mongod with the --noprealloc switch.

Why does MongoDB use so much RAM?

Well, it does not actually, it is just that most folks do not really understand memory management — there is more to it than just "is in RAM" or "is not in RAM". The current default storage engine for MongoDB is called

As of now (January 2012), an alternative storage engine (

Generally, the memory-mapped store (

Caching Algorithm

Actually, this is done by the OS using the LRU (Least Recently Used) caching pattern. However, sooner or later most DBMSs implement their own CRA (cache replacement algorithm), simply because every DBMS is different and therefore a DBMS specific CRA can get the most out of every DBMS. Sounds good, but it is also bad news at the very same time. While having its own cache does usually make caching more efficient/faster, it causes problems with cache reheating:

Cache Reheating

Using the OS's cache, the nice attribute of the default MongoDB storage engine is its use of OS memory-mapped files — the cache is the operating system's file system cache. Therefore restarting mongod does not empty the cache, i.e. no reheating is needed. If MongoDB would use its own caching system including a bespoke CRA
then that would cause a reheat scenario even when we only restart mongod (rather than the whole machine).

Working Set Size

Working set size can roughly be thought of as how much data we will need MongoDB (or any other DBMS (Database Management System), relational or non-relational) to access in a period of time. For example, YouTube has ridiculous amounts of data, but only 1% may be accessed at any given time. If, however, we are in the rare case where all the data we store is accessed at the same rate at all times, then our working set size can be defined as our entire data set stored in MongoDB.

How much RAM does MongoDB need?

We now know MongoDB's caching pattern, we also know what a working set size is. Therefore we can have the following rule of thumb on how much RAM (Random Access Memory) a machine ideally needs in order to work as fast as possible: it is the working set size plus MongoDB's indexes which should ideally reside in RAM at all times, i.e. the amount of available RAM should ideally be at least the working set size plus the size of indexes plus what the rest of the OS (Operating System) and other software running on the same machine needs. If the available RAM is less than that, LRUing is what happens and we might therefore get significant slowdown. One thing to keep in mind is that in an index btree buckets are cached, not individual index keys, i.e. if we had a uniform distribution of keys in an index including for historical data, we might need more of the index in RAM compared to when we have a compound index on time plus something else. With the latter, keys in the same btree bucket are usually from the same time era, so this caveat does not happen. Also, we should keep in mind that our field names in BSON are stored in the records (but not the index), so if we are under memory pressure they should be kept short. Those who are interested in MongoDB's current virtual memory usage
(which of course also is about RAM), can have a look at the status of the mem field in the db.serverStatus() output (see below).

Speed Impact of not having enough RAM

Generally, when databases are too big to fit into RAM entirely, and if we are doing random access, we are in trouble as HDDs (Hard Disk Drives) are slow at that (roughly 100 operations per second per drive). One solution is to have lots of HDDs (10, 100...). Another one is to use SSDs (Solid State Drives) or, even better, add more RAM. Now that being said, the key factor here is random access. If we do sequential access to data bigger than RAM, then that is fine. So, it is ok if the database is huge (more than RAM size), but if we do a lot of random access to data, it is best if the working set fits in RAM entirely. However, there are some nuances around having indexes bigger than RAM with MongoDB. For example, we can speed up inserts if the index keys have certain properties — if inserts are an issue, then that would help.

Limit RAM Usage

MongoDB is not designed to do that, it is designed for speed and scalability. If we wanted to run MongoDB on the same physical machine alongside some web server and for example some application server like Django, then we could ensure memory limits on each one by simply using virtualization and putting each one in its own VE (Virtual Environment). In the end we would thus have a web application made of MongoDB, Django and for example Nginx, all running on the same physical machine but being limited to whatever limits we set on each VE they run in.

Out Of Memory Errors

If we are getting something like this

As we already know, MongoDB takes all RAM it can get, i.e. RAM (Random
Access Memory), or more precisely RSS (Resident Set Size), itself part of virtual memory, will appear to be very large for the mongod process.

The important point here is how it is handled by the OS (Operating System). If the OS just blocks any attempt to get more virtual memory or, even worse, kills the process (e.g.

 1 sa@wks:~$ ulimit -a | egrep virtual\|open
 2 open files                      (-n) 1024
 3 virtual memory          (kbytes, -v) unlimited
 4 sa@wks:~$ lsb_release -irc
 5 Distributor ID: Debian
 6 Release:        unstable
 7 Codename:       sid
 8 sa@wks:~$ uname -a
 9 Linux wks 2.6.32-trunk-amd64 #1 SMP Sun Jan 10 22:40:40 UTC 2010 x86_64 GNU/Linux
10 sa@wks:~$

As we can see from lines 5 to 9, I am on Debian sid (still in development) running the 2.6.32 Linux kernel. The settings we are interested in are on lines 2 and 3. Virtual memory is unlimited by default so that is fine already — this is actually what causes the most problems, so we need to make sure it stays that way (i.e. unlimited).

Note that we need to run these commands (e.g.

OpenVZ

If we are running MongoDB with OpenVZ then there are some more
settings we might want to tune in order to keep the OOM (Out of Memory) killer from kicking in, or from simply hitting the virtual memory ceiling if it is not set to unlimited.

Does MongoDB use more than one CPU Core?

As for map/reduce (and any other JavaScript related operations), write operations are single-threaded, i.e. they make use of only one CPU core for any running mongod. For read operations however, which are the majority of operations, a mongod will happily make use of several CPU cores since reads are handled concurrently.
In short: we will notice a speedup going from a single-core CPU to
multi-core since the speed increase is roughly linear to the available
CPU cores when doing client-side read operations from any of MongoDB's
drivers. However we will not see linear speedup with write operations
and server-side read operations on a single mongod.

The latter, server-side JavaScript execution, is mostly due to poor support for concurrency in JavaScript engines. There are plans however to also make JavaScript execute in parallel. As of now (October 2010) however, JavaScript executes single-threaded.

Global Lock

Right now (January 2012) we still have a global lock which applies across all databases on a mongod instance.

There are a lot of tricks in the codebase to mitigate possible contention issues, but the lock does apply against all databases on a
single mongod.

The plan is to have a per-collection lock some time in the future which will ease the pain of having a global lock a lot. This feature will probably be introduced with v2.2, i.e. probably before mid 2012. The tickets to watch are SERVER-1240 and probably also SERVER-1830.

>1 Threads inserting at the same Time

If we have 2 threads inserting under one

How much Data can be stored with MongoDB?

16 MiB is the limit on individual documents. GridFS however uses many documents (split into small chunks, usually 256 kB in size) so practically speaking there is no limit. However, some implementations (read: drivers) have a limit of 2^31 chunks — if we do the maths here that means that we end up with around 549 TB of storage for GridFS — it is expected that this limit will go away at some point, plus the 256 kB default chunk size can be changed to allow for bigger GridFS collections. Note that chunks with regards to GridFS (usually 256 kB in size) are not the same as chunks with regards to sharding. With normal sharded collections, assuming the same limit of 2^31 chunks and a default chunk size of 64 MiB, we would theoretically be able to store 137 PB. The above is true for x86-64 systems; it is not entirely true for x86 (32 bit) systems — there is a limit because of how memory mapped files work, which is a limit of 2 GiB per database.

Do embedded Documents count toward the 16 MiB BSON Document Size Limit?

Yes, the entire BSON (Binary JSON) document (including all embedded documents, etc.) cannot be more than 16 MiB in size.

Does Document Size impact read/write Performance?

Yes, but this is mostly due to network limitations e.g. one will max out a GigE link with inserts before document size starts to slow down MongoDB itself.

Size of a Document

We can use Object.bsonsize()

sa@wks:~$ mongo
MongoDB shell version: 1.5.1-pre-
url: test
connecting to: test
type "help" for help
> db.test.save({ name : "katze" });
> Object.bsonsize(db.test.findOne({ name : "katze"}))
38
> bye
sa@wks:~$

Size of a Collection and its Indexes

sa@wks:~$ mongo --quiet
type "help" for help
> db.getCollectionNames();
[ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
> db.test.dataSize();
160
> db.test.storageSize();
2304
> db.test.totalIndexSize();
8192
> db.test.totalSize();
10496

We are using the
Current versions of MongoDB also show the so-called

If we need/want a more detailed view we could also have a look at the output of validate()

> db.test.validate();
{
    "ns" : "test.test",
    "result" : "
validate
  firstExtent:2:2b00 ns:test.test
  lastExtent:2:2b00 ns:test.test
  # extents:1
  datasize?:160 nrecords?:4 lastExtentSize:2304
  padding:1
  first extent:
    loc:2:2b00 xnext:null xprev:null
    nsdiag:test.test
    size:2304 firstRecord:2:2be8 lastRecord:2:2c58
  4 objects found, nobj:4
  224 bytes data w/headers
  160 bytes data wout/headers
  deletedList: 0000001000000000000
  deleted: n: 1 size: 1904
  nIndexes:1
    test.test.$_id_ keys:4
  ",
    "ok" : 1,
    "valid" : true,
    "lastExtentSize" : 2304
}
> bye
sa@wks:~$

Note that while MongoDB generally does a lot of pre-allocation, we can remedy this by starting mongod with the --noprealloc switch.

Reusing Free Space

Any time we delete a document using

The way this works internally is that MongoDB maintains a so-called free list for all of its collections. This is not entirely true however — capped collections have quite different semantics in their behavior and use, which is why they do not have to maintain a free list, a fact that makes them a lot faster on I/O compared to ordinary collections.

Collections, Namespaces

As we already know, collections are to MongoDB what tables are to MySQL and friends. In MongoDB collections are so-called namespaces.

Capped Collection
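As a minimal sketch (names and sizes made up), a capped collection is created explicitly and then behaves like a fixed-size buffer that preserves insertion order:

> db.createCollection("logs", { capped : true, size : 100000, max : 5000 });
{ "ok" : 1 }
> db.logs.insert({ ts : new Date(), msg : "hello" });
> db.logs.find().sort({ $natural : -1 }).limit(1);    // newest entry first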
WRITEME

Sharding a Capped Collection

It is not possible to shard a capped collection.

Capped Array

Rename a Collection

Using the renameCollection() method shown in the collection help above.

Virtual Collection

It refers to the ability to reference embedded documents as if they were a first-class collection of top level documents, querying on them and returning them as stand-alone entities, etc.

Number of Collections/Namespaces

There is a limit to how many collections/namespaces we can have within a single MongoDB database — this number is essentially the number of collections plus the number of indexes. The limit is ~24000 namespaces
per database, with each namespace being 628 bytes in size and the namespace (.ns) file being 16 MiB by default.

Limit of Indexes per Collection

Yes, currently, with v1.6, we have a limit of 64 indexes per collection.

Cloning a Collection

Yes, possible. Have a look at the cloneCollection command.

Merge two or more Collections into One

Yes, we read from all collections we want to merge and use

List of Collections per Database

We can use db.getCollectionNames(), query system.namespaces, or use show collections:

 1 sa@wks:~$ mongo
 2 MongoDB shell version: 1.2.4
 3 url: test
 4 connecting to: test
 5 type "help" for help
 6 > db
 7 test
 8 > db.getCollectionNames();
 9 [ "fs.chunks", "fs.files", "mycollection", "system.indexes", "things" ]
10 > db.system.namespaces.find();
11 { "name" : "test.system.indexes" }
12 { "name" : "test.fs.files" }
13 { "name" : "test.fs.files.$_id_" }
14 { "name" : "test.fs.files.$filename_1" }
15 { "name" : "test.fs.chunks" }
16 { "name" : "test.fs.chunks.$_id_" }
17 { "name" : "test.fs.chunks.$files_id_1_n_1" }
18 { "name" : "test.things" }
19 { "name" : "test.things.$_id_" }
20 { "name" : "test.mycollection" }
21 { "name" : "test.mycollection.$_id_" }
23 > show collections
24 fs.chunks
25 fs.files
26 mycollection
27 system.indexes
28 things
29 > bye
30 sa@wks:~$

Delete a Collection
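A minimal sketch (collection name made up); drop() removes the collection together with its documents and indexes, whereas remove({}) would only delete the documents:

> db.mycollection.drop();
true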
Delete a Database
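And the database-level equivalent — dropDatabase() acts on the current database and releases the diskspace allocated to it (database name made up):

> use mydb
switched to db mydb
> db.dropDatabase();
{ "dropped" : "mydb", "ok" : 1 }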
Namespace

Collections can be organized in namespaces. These are named groups of collections defined using a dot notation. For example, we could define collections

Namespaces can then be used to access these collections using the dot notation e.g.

Namespaces simply provide an organizational mechanism for the user, i.e. the collection namespace is flat from the database point of view
which means that List of Namespaces in the DatabaseOne way to list all namespaces for a particular database would be to enter MongoDB's interactive shell: sa@wks:~$ mongo MongoDB shell version: 1.2.4 url: test connecting to: test type "help" for help > db.system.namespaces.find(); { "name" : "test.system.indexes" } { "name" : "test.fs.files" } { "name" : "test.fs.files.$_id_" } { "name" : "test.fs.files.$filename_1" } { "name" : "test.fs.chunks" } { "name" : "test.fs.chunks.$_id_" } { "name" : "test.fs.chunks.$files_id_1_n_1" } { "name" : "test.things" } { "name" : "test.things.$_id_" } { "name" : "test.mycollection" } { "name" : "test.mycollection.$_id_" } > db.system.namespaces.count(); 11 > bye sa@wks:~$ The Statistics, Monitoring, LoggingOnce MongoDB is installed, configured and tested properly we switch the setup into production mode. Heck, we might even have a dedicated testbed! Anyhow, once we leave the garage for good, we need to know about the setup's health at all times. In order to optimize the system we want to have statistics about it and, last but not least, we need history about all kinds of events which occurred... we log! Server StatusOften folks do not know exactly what the units and abbreviations are with regards to the server status output: sa@wks:~$ mongo --quiet type "help" for help > version(); version: 1.5.0-pre- > db.serverStatus(); { "uptime" : 6695, "localTime" : "Sun Apr 11 2010 11:22:19 GMT+0200 (CEST)", "globalLock" : { "totalTime" : 6694193239, "lockTime" : 45048, "ratio" : 0.000006729414343397326 }, "mem" : { "resident" : 3, "virtual" : 138, "supported" : true, "mapped" : 0 }, "connections" : { "current" : 2, "available" : 19998 }, "extra_info" : { "note" : "fields vary by platform", "heap_usage_bytes" : 146048, "page_faults" : 57 }, "indexCounters" : { "btree" : { "accesses" : 0, "hits" : 0, "misses" : 0, "resets" : 0, "missRatio" : 0 } }, "backgroundFlushing" : { "flushes" : 111, "total_ms" : 2, "average_ms" : 0.018018018018018018, "last_ms" : 0, "last_finished" : "Sun Apr 11 2010 11:21:45 GMT+0200 (CEST)" }, "opcounters" : { "insert" : 0, "query" : 1, "update" : 0, "delete" : 0, "getmore" : 0, "command" : 12 }, "asserts" : { "regular" : 0, "warning" : 0, "msg" : 0, "user" : 0, "rollovers" : 0 }, "ok" : 1 } > bye sa@wks:~$ Most of it is obvious like for example One may ask what is the point of having both,
Within the The "opcounters" : { "insert" : 16513, "query" : 1482263, "update" : 141594, "delete" : 38, "getmore" : 246889, "command" : 1247316 },
And yes, taking those numbers and dividing them by time (delta or
total) will give us operations/time e.g. operations per second or
operations since the server was started.

Collection Statistics

Note that these statistics are about one particular collection, i.e. they are neither about the entire database (that would be db.stats()) nor about the whole server (db.serverStatus()).

sa@wks:~$ mongo
MongoDB shell version: 1.8.0-rc2
connecting to: test
> db.test.stats(); # hint: db.test.stats(1024 * 1024); would give MiB instead of Bytes
{
    "ns" : "test.test",
    "count" : 2,
    "size" : 80,
    "storageSize" : 2304,
    "numExtents" : 1,
    "nindexes" : 1,
    "lastExtentSize" : 2304,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 8192,
    "indexSizes" : {
        "_id_" : 8192
    },
    "ok" : 1
}
> bye
sa@wks:~$
So, just to be clear: the padding is only calculated and updated when updates are applied to existing objects and if they cause a move/re-alloc. If the update fits in the existing space then no change to the paddingFactor is made.

That means that on inserts data is just pushed to the end or into an existing hole that is big enough for the document/object to be inserted. It also means that inserts include some extra space when

Monitoring and Diagnostics

Yes, all possible.

Default logging Behavior

By default
Note: we have queries and then we have ordinary database operations. ProfilingHow fast is a certain query? How fast a certain database operation? MongoDB has a profiler which can answer such questions with ease. Log its QueriesYes, go here. Basically, by using the database profiler it is possible to
Note that the time slot we are talking about here is based on when
Change the logging Verbosity

Yes, use

Logrotate

Yes, we can run

Another possibility of making

This is somewhat also related to stopping MongoDB and the problems that arise if it is not done properly.

/var/lib/mongo/dialog.2f65d3d4

Files like

Schema Design

Data modeling does not only concern us within the logic tier but also within (and across) the data tier, i.e. we also model our database layout in order to achieve high maintainability, overall speed (entire stack) and low diskspace usage, amongst other things... query time minimization etc. We are speed devils after all... Well, first off, there is no such thing as a fixed schema with MongoDB, i.e. it is schema-less and thus we can alter/create a schema as we go. Therefore, there is no such thing as "the best schema". There are a few best practices however, like for example using embedded documents.
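For example, a sketch of the embedding idea (all names and values made up) — a blog post carrying its comments inline rather than in a separate collection, which lets us fetch and query everything with one round trip:

> db.posts.insert({
...     title : "India has three seasons",
...     tags : ["mumbai", "seasons", "monsoon"],
...     comments : [ { author : "sa", text : "nice read" },
...                  { author : "joe", text : "+1" } ]
... });
> db.posts.find({ "comments.author" : "sa" });    // queries reach into embedded documents via dot notation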
WRITEME

Query across Collections/Databases

There is currently (January 2012) no way to query multiple collections and/or databases (possibly on different servers) with a single query.

For now the best approach is to create a schema where first class objects (also known as domain objects) are in the same collection. In other words, documents which are all of the same general type belong into the same collection. This way we only need to query one collection and do not need to do cross-collection/database queries.

Example

In this example, both blogpost and newsitem are of the general type content, i.e. we will put them both into the content collection:

{
  type: "blogpost",
  title: "India has three seasons",
  slug: "mumbai-monsoon",
  tags: ["mumbai", "seasons", "monsoon"]
},
{
  type: "blogpost",
  title: "A summer in Paris",
  slug: "paris-summer",
  tags: ["paris", "summer"]
},
{
  type: "newsitem",
  headline: "2023, first human clone",
  slug: "first-human-clone",
  tags: ["clone"]
}

This approach takes advantage of the schema-less nature of MongoDB, i.e. although both document types may have different properties/keys (e.g. title for a blogpost versus headline for a newsitem), they can live side by side in the same collection.

This allows us to query all of our content, or only some type(s) of content, depending on our requirements.

Referencing vs Embedding

WRITEME
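As a rough sketch of the two options (hypothetical data): embedding keeps related data inside one document, referencing stores the _id of a document that lives in another collection and therefore needs a second query to resolve:

// embedding
> db.people.insert({ name : "sa", address : { city : "Berlin", zip : "10115" } });

// referencing
> var addr_id = ObjectId();
> db.addresses.insert({ _id : addr_id, city : "Berlin", zip : "10115" });
> db.people.insert({ name : "sa", address_id : addr_id });
> db.addresses.findOne({ _id : db.people.findOne({ name : "sa" }).address_id });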
DBRef
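A minimal sketch of the DBRef convention using the shell's DBRef helper (collection and field names made up); dereferencing is simply a second query against the referenced collection:

> var addr_id = ObjectId();
> db.addresses.insert({ _id : addr_id, city : "Berlin" });
> db.people.insert({ name : "sa", address : new DBRef("addresses", addr_id) });
> db.people.findOne({ name : "sa" }).address;
DBRef("addresses", ObjectId("4f0f6dd1e4b0c3a8d3b4a111"))
> db.addresses.findOne({ _id : addr_id });
{ "_id" : ObjectId("4f0f6dd1e4b0c3a8d3b4a111"), "city" : "Berlin" }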
Graphs

Trees
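A hedged sketch of one common way to model a tree in MongoDB, namely parent references plus an ancestor array (all names made up); the ancestor array makes whole-subtree queries a single indexed lookup:

> db.categories.insert({ _id : "databases", parent : null,        ancestors : [] });
> db.categories.insert({ _id : "nosql",     parent : "databases", ancestors : ["databases"] });
> db.categories.insert({ _id : "mongodb",   parent : "nosql",     ancestors : ["databases", "nosql"] });
> db.categories.ensureIndex({ ancestors : 1 });
> db.categories.find({ ancestors : "databases" });    // everything below "databases"
> db.categories.find({ parent : "nosql" });           // direct children only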
ConfigurationEven though a basic MongoDB setup is done in no time, when things get big, when we go head-to-head with YouTube, we need to configure/tune a little more ;-] Runtime ConfigurationIf we wanted to see the run-time configuration our current sa@wks:~$ mongo MongoDB shell version: 1.7.2-pre- connecting to: test > use admin switched to db admin > db.runCommand("getCmdLineOpts"); { "argv" : [ "/usr/bin/mongod", "--dbpath", "/var/lib/mongodb", "--logpath", "/var/log/mongodb/mongodb.log", "run", "--config", "/etc/mongodb.conf" ], "ok" : 1 } > bye sa@wks:~$ lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux unstable (sid) Release: unstable Codename: sid sa@wks:~$ Indexes, Search, MetadataWe want context relevant information and we want it yesterday... Create Index while/after DataimportAssuming we are using Storage Location of IndexesIndexes are part of normal database files (stuff in More than one Index per QueryAs of now (October 2010) we can only use one index when doing a query. However, this one index can of course be a compound index. Data Structure used for IndexesMongoDB uses B-trees for indexes — thus Besides the search key, each node in the B-tree also stores the disklocation of the data we actually want to retrieve — the database file in which it is contained and the offset. With sharding, all chunks on the same shard share the same index structure. CostLet us have a look at the different costs for searching through different data sets with either 50, 500 or 50000 objects: >>> import math >>> math.log(50) 3.912023005428146 >>> math.log(500) 6.2146080984221914 >>> math.log(50000) 10.819778284410283 >>> As can be seen, the cost goes up rather mildly compared to the input set size e.g. we only have to look at roughly 11 objects even though our data set contains 50000 objects in the end — that is, the cost only roughly tripled even though the input set size grew by a factor of 1000! Compound vs Single IndexesWhat is the difference between a compound index on
The answer is that single indexes will be slightly better on all of those points — if we are just querying on

ensureIndex is a NOP

There are several operations (

Also, depending on the driver used there is a difference between using
What BSON Object type is best used for the Document ID?

Every document in MongoDB has a unique key (_id). We can use any BSON type (string, long, embedded document, timestamp, etc.) as long as it is unique within the collection.

The actual reason why, by default, IDs are of the BSON ObjectID type has two main reasons: firstly, it is only 12 bytes in size rather than, for example, strings which are 29 bytes. Secondly, the BSON ObjectID type has some really useful information encoded with it automatically e.g. a timestamp, a machine ID (e.g. a hashed hostname), a process ID and a counter.

Also, the ObjectID values themselves are not actually used for anything internally e.g. some people think it is used for sharding which is not the case. All those values are just a reasonable semantic way to compose the ObjectID, that is all. Another reason why people prefer using an ObjectID (or some other
random string as value for the _id key)

Natural Order

Natural order is defined as the database's native ordering of objects in a collection. With capped collections natural order is guaranteed to be insertion order which, in case default ObjectIDs are used, is basically the same as sorting documents by their creation time and then inserting them one by one. I said basically because it is not 100% true that natural/insertion order guarantees to have documents ordered by their creation time. The reason is that ObjectIDs are constructed on the client side (e.g. by PyMongo) and then sent to the server, i.e. there is latency between when the ObjectID is created, the document sent over the wire and finally committed to disk on the server. So if we had several clients creating and sending documents, the insertion/natural order might sometimes be off a little bit in that a slightly younger document might have been inserted before an older one. However, in general, if dealing with a capped collection, the natural
order feature is a very efficient way to store and retrieve data in insertion order (much faster than, say, indexing on a timestamp field). We can use this fact to retrieve results sorted using $natural. Note however that

Example Usage with Sorting

In addition to forward natural order, items may be retrieved in reverse natural order. For example, to return the 50 most recently inserted items (ordered most recent to less recent) from a capped collection, we would use find().sort({ $natural : -1 }).limit(50) on that collection.

Full-text Search

MongoDB itself does not have built-in full-text search but there are other means, like for example using Lucene, so that one can have full text search with MongoDB — Elastic Search, itself based on Lucene, seems quite a promising solution. What is built-in with MongoDB however is quite good and might be enough for most use cases where actually not all features provided by some full-blown full text search engine might be needed.

Multikey

Most commonly each key has one value such as

A multikey is when a key has not just one value but an array of values
such as

There are two main use cases when this is of great use. One is with full-text search and the other when we have a varying number of indexes to build that cannot be determined beforehand, i.e. when keys and values, both, might vary in quantity and quality (type).

Constraints on Field Attributes

For example, what if we want/need the email attribute of the user document to be unique, i.e. no two or more users can have the same email address. Is this possible? The answer is yes, and it is usually done using unique indexes.

Metadata

Go here and here for more information.

Case-insensitive Queries

How can I make it so that

The solution is with regular expressions. In the particular case,

One way around this would be to literally store both versions (or normalize right away),

Map/Reduce

Map/Reduce in MongoDB is intended as an aggregation and/or ETL (Extract Transform Load) mechanism, not as a general purpose query mechanism. What that means is that for simpler queries we would just use a filter; those will leverage an index and be more real-time. O(1) query characteristics at all times... no matter if we have 1 GiB of data on a single node or 1 PiB of data spread across several thousand shared-nothing nodes.

Map/Reduce for real-time Queries

No, it is a little more complicated than that. It is true that
map/reduce allows for parallel execution of queries on a given data set, thus guaranteeing for

However,

So, the robot's brain is said to be real-time when, from the time a signal arrives, to the time a response/command leaves the brain, no more than a certain amount of time passes. Note that this does not include the time spent before receiving a signal (recognising the obstacle via its sensors) and, based on this signal, the robot body reacting to its brain's response/command (based on the formerly received input signal) sent to its body. Therefore real-time is a time constraint for a (computer) system in which it has to respond to a given input... or bad things may happen. What people really mean when they use the (misleading) term real-time with MongoDB (or any other DBMS for that matter) is that a response to a given query is darn fast (whatever that means?!) ;-] Therefore the actual question should be

Is Map/Reduce our first choice for darn-fast Queries?

Sorry, no again! Map/Reduce is the right thing for ETL type operations... facepalm, I know... we are close though, hang in! ;-]
So how the f... can we get our darn-fast queries? Easy! We tell map/reduce to create a permanent collection which we then index and query. This makes for darn-fast queries while we can update and index the collection periodically with map/reduce in the background. PyMongo for Map/ReduceHave a look at the reason is that map/reduce runs on the server ( Continuous/Incremental Map/ReduceIt is not possible to do incremental or continuous map/reduce but it is possible to have ordinary indexes on the collection created by map/reduce. Does Map/Reduce use Indexes?Yes, all normal query semantics apply. However, map/reduce itself does not use indexes. In other words: we can use a query to filter the docs that map/reduce iterates over i.e. reduce the number of documents map/reduce has to iterate over. What are the semantics of sort() and limit() with regards to sharding?
How often does reduce run?

When map/reduce is used on a sharded setup, each shard will run reduce and the whole process is initiated and controlled by one mongos.

The general MongoDB map/reduce system does iterative reduce so intermediary stages are kept as small as possible. Right now (January 2012) all final results end up on one shard only (the final output
shard). Since specifying Is it possible to call Map/Reduce on only portions of a Collection?Yes, as mentioned above, we can use the Does the Map/Reduce calculate every time there is a Query?Yes. Map/Reduce recalculates each time there is a map/reduce query. The results are then saved in a new collection. While map/reduce runs, does MongoDB wait for the Map/Reduce Query to finish?Yes and no. We will have to wait until it finished, but we can also run it in the background. Is it necessary for the Map step to emit anything? If so how often/much?Usually we have 1 emit per map step but it is designed to be very flexible i.e. we can have anywhere between 0 and n emits per map. Speedup Map/ReduceAs with writes, map/reduce runs single-threaded on a machine i.e. if we want more speed out of map/reduce, we want more machines — by more machines we mean shards, to really split up the work. Example: a setup with 2 shards running on 2 dual core machines would be faster than a setup with 1 quad core machine without sharding. Can Map/Reduce speed be increased by adding more Machines to a Shard?No. Map/Reduce writes new collections to the server. Because of this, for now (October 2010) it may only be used on the primary i.e. if for example used with a replica set, it only runs/works on the primary of this replica set. This will be enhanced at some point so it will be possible to use up to the maximum number of machines per shard participating in map/reduce — thus being able to deliver somewhat less but close to linear speedup for map/reduce relative to the number of machines per shard. GISBecause location matters and because people like to work with and use GISs (Geographic Information Systems)... WRITEME GridFSMommy! Yes? Let us store the world!... and better even, let us be able to retrieve it once we are done... mommyisglancingaroundnervouslyatthispoint What is GridFS?Please go here. Why use GridFS over ordinary Filesystem Storage?One major benefit is that using GridFS does simplify our entire software stack and maintenance procedures — instead of using a distributed filesystem like for example MogileFS in addition to MongoDB, we can now use GridFS for all the nifty things we would have used a distributed filesystem before, thus get rid of it altogether. Also, with GridFS we get disk locality because of the way MongoDB allocates data files i.e. if we store a big binary (e.g. a video), chances are good it all fits into one pre-allocated data file and is therefore stored in GridFS as one continuous big chunk (data file) on disk. This way we can use any Linux/Unix utilities to do CRUD (Create Read Update Delete) operations e.g. rescue it by copying it to another node, etc. Another big plus is that if we use the ordinary filesystem, we would have to handle backup/replication/scaling ourselves. We would also have to come up with some sort of hashing scheme ourselves plus we would need to take care about cleanup/sorting/moving because filesystems do not love lots of small files. With GridFS, we can use MongoDB's built-in replication/backup/scaling e.g. scale reads by adding more read-only slaves and writes by using sharding. We also get out of the box hashing (read uuid (universally unique identifier)) for stored content plus we do not suffer from filesystem performance degradation because of a myriad of small files. Also, we can easily access information from random sections of large files, another thing traditional tools working with data right off the filesystem are not good at. 
Last but not least, we can keep information associated with the file (who has edited it, download count, description, etc.) right with the file itself. Web ApplicationsAnother difference worth mentioning is that serving blobs (e.g. images) directly from the database is perfectly fine with GridFS and thus MongoDB as it was designed for that. With MySQL and friends we would rather use the filesystem to do that since storing blobs within a rdbms (relational database management system) is usually performing poorly, both in terms of diskspace and speed. Store non-binary Data that references binary DataA common use case is when we need to store information about some entity and, along with it, some binary data. For example, let us consider a person which might have information like address, hobbies, phone number and so on — all non-binary data i.e. regular textual data like strings, booleans, floats, arrays of strings, etc. in addition to that non-binary data, we would like to store binary data for that person like, for example, his/her profile picture for some facebook-alike website. The best approach is to store the non-binary data in a usual collection and the binary data in a GridFS collection i.e. a separate collection for both types of data, binary and non-binary. We would then need to access the GridFS collection to get the binary data belonging to a particular user e.g. his profile picture. In terms of performance impact, the best thing we could do now is to
have a unique path to the profile picture that includes the

We can think about how serving a list of images works in HTML. When we serve a page it includes

The logic needed to create the unique path containing the

So, what is the benefit of all of this? Why not just query both collections for each request? The answer is speed and scalability. If we only have to touch both collections once and can even circumvent the logic tier for requests altogether, then that makes for a huge performance boost! Secondly, with a unique path (also known as a URI (Uniform Resource Identifier)) to a picture (or any other kind of binary data), we can use a so-called CDN (Content Delivery Network) and have content cached close to the user and thus have low latency and minimized global traffic, along with a ton of other goodies.

Can I change the Contents of a File in GridFS?

No, simply because this could lead to data corruption. Instead, the recommended way is to put a new version of the file into GridFS by using the same filename. When done we can use (dramatic pause...)

Oh yes, absolutely, GridFS can be thought of as a versioned filestore... gosh!

Delete Files in GridFS

If we wanted to delete all files stored within GridFS then we should
just drop the fs.files and fs.chunks collections.

Delete / Read Concurrency

If we are reading from a file stored in GridFS while it is being deleted, then we will likely see an invalid/corrupt file. Therefore care should be taken to avoid concurrent reads from a file while it is being deleted. Note that this is not driver specific but based on the fact that we cannot do multiple, isolated operations, which would be needed to do a delete in GridFS that is concurrency safe. However, one way we could work around this would be by using a singleton in our logic tier (e.g. Django) and proxying delete/read operations through it.

Development

This subsection is about using MongoDB as data tier for our logic tier and the various things that need to be known in order to do so.

Is there a scons clean Command for MongoDB?

When scons is used,

Insert multiple Documents at once

Most drivers have a so-called bulk insert method. PyMongo for example has

How are large resultsets handled by MongoDB?

Most people fear (and rightfully so) that if a query is sent to some

That is not the case. What happens is that MongoDB uses cursors to send results back in batches, i.e. it does not assemble all of the results in RAM before returning them to the client. Last but not least, the batch size is configurable to provide the most possible flexibility for any possible use case out there.

Can I run MongoDB as Master and Slave at the same Time?

Yes, there are certain use cases where this might make sense.

When using PyMongo, what can be done for improving single-server Durability?

Since MongoDB does not wait for a response by default when writing to a mongod,

These commands can be invoked automatically with many of the drivers (e.g. PyMongo) when saving or updating in safe mode — behind the scenes however, what really happens is that a special command called getLastError is being invoked. To answer the initial question: with
Above we already mentioned replica sets. Replica pairs and/or sets have the ability to ensure writes to a certain number of nodes (slave for replica pairs respectively secondaries in context of replica sets) before returning. Commands are safeThe mentioned semantics for single-server durability are automatically true for commands i.e. they are all safe as all commands always have a response from the server — no fire and forget semantics... Birds-eye View on Data SetDuring development, when we populate/flush our data set every five minutes or so to try out whether or not the code we have just written does what it is supposed to do, it comes in handy if can get an overview on our data set instantly. Using PyMongo this task is easily accomplished: sa@wks:/tmp$ python Python 2.6.6 (r266:84292, Apr 20 2012, 09:34:38) [GCC 4.5.2] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import pymongo >>> >>> con = pymongo.Connection() >>> >>> def print_collections(): ... ... for dbname in con.database_names(): ... print '\nDatabase: %s' % (dbname,) ... db = con[dbname] ... ... for name in db.collection_names(): ... print ' Collection: %s' % (name,) ... coll = db[name] ... ... for doc in coll.find(): ... print ' Document: %s' % (doc,) ... ... ... ... ... >>> print_collections() Database: test Collection: test Document: {u'_id': ObjectId('4dda5230b9b052f4b881309b'), u'name': u'katze'} Collection: system.indexes Document: {u'ns': u'test.test', u'name': u'_id_', u'key': {u'_id': 1}, u'v': 0} Database: aa Collection: polls_poll Document: {u'question': u"What's up?", u'_id': ObjectId('4d63beba4ed6db0e36000000')} Collection: system.indexes Document: {u'ns': u'aa.polls_poll', u'name': u'_id_', u'key': {u'_id': 1}, u'v': 0} Database: admin Database: local >>> sa@wks:/tmp$ Of course, that is just a very simplistic go at the problem and we would adapt/alter where necessary based on our use case at hand e.g. to nicely traverse and print nested documents. Scalability, Fault Tolerance, Load BalancingScalability, fault tolerance and load balancing. Big words! Sounds scary, sounds hard, sounds expensive. All true, except when armed with MongoDB. Hidden Replica Set MembersWe can hide a secondary if we use This also means that hidden secondaries will not be used if a driver automatically distributes reads to secondaries. A hidden replica set member must have a priority of 0 which means we cannot have a hidden primary (only hidden secondaries). To add a hidden member to a replica set we would do > rs.add({"_id" : <some_id>, "host" : <a_hostname>, "priority" : 0, "hidden" : true}) Last but not least, let it be known that what we just explained is not in any way affected by the use of replica set priorities — we can use both or either one at the same time. Corrupted Primary affecting SecondaryA secondary queries the primary's oplog the same way any client does i.e. document-by-document. This means that if a document's headers are intact and the document's payload is the expected size, yet corrupt, then that corruption could be transferred, in theory. However, this is a very unlikely scenario, but not 100% impossible. The important thing to understand here is that a problem/corruption on the primary might (very unlikely but theoretically possible; see above) cause a few corrupted documents on the secondary but it will never cause a corrupted secondary — worst case we will have to write off a few documents but never an entire secondary. 
Replicating/Migrating Indexes

Data gets migrated (sharded setup) and/or replicated (replica set) but never indexes. They will be built and maintained entirely by any node itself (e.g. a secondary within a replica set) and never transferred in any way between nodes.

bounce

What does bouncing

local

The local database contains data that is local to a given server, i.e. it will not be replicated anywhere. This is one reason why it holds all of the replication metainformation, like for example the oplog.
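For example, a quick read-only peek at that replication metadata — here the newest oplog entry on a replica set member (the collection is oplog.rs for replica sets; the older master/slave setup uses oplog.$main instead, and the exact collection list will vary):

> use local
switched to db local
> db.getCollectionNames();
[ "oplog.rs", "replset.minvalid", "system.indexes", "system.replset" ]
> db.oplog.rs.find().sort({ $natural : -1 }).limit(1);    // most recent replicated operation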
Time to FailoverHow long does it take until automatic failover happens? With
replica sets this is configurable (combination of The heartbeat signal is not blocked by running transactions e.g. if we
run an insert/update that takes 5 seconds, it will not block the
default Semantics for writes to a Replica Set whose Primary just died?From an application point of view, when trying to write to a replica set whose primary just died/vanished, the first write request will fail i.e. the application has to handle a retry. Subsequent requests should go to the right place i.e. the new primary. Can I go from a single Machine/Replica Pair to a Replica Set?Is there Downtime for adding a Secondary to a Replica Set?No, whether it is the first secondary within a replica set or any further secondary after the first one, there is no mandatory downtime involved in adding secondaries. Initial synchronization SourceIn order to synchronize a new secondary in a replica set we can either use its primary or another secondary as source. Using a secondary is recommended because it does not draw resources from the primary. For example, if we wanted to add a new node and force it to synchronize from a secondary, we could do > rs.add({"_id" : <some_id>, "host" : <a_hostname>, "initialSync" : {"state" : 2}}) We can choose a source by its state (primary or secondary). By default, a new member of a replica set will attempt to synchronize from a random secondary and, if it cannot find one, synchronize from the primary. If it chooses a secondary, it will only use the secondary for its initial synchronization. Once the initial synchronization is done and it is ready to synchronize new data, it will switch over and use the primary for normal synchronization. SemanticsDuring the initial synchronization, the entire data set is copied to a
secondary (except for data we put into the local database). The way
this works is sort of snapshotting i.e. it clones the master's oplog
in SlaveDelayWith a replica set and/or the older master/slave setup it might be
beneficial to have a secondary that is purposefully many hours behind
the primary in order to prevent human error. In MongoDB 1.3.3+, we can
specify this with > rs.add({"_id" : <some_id>, "host" : <a_hostname>, "slaveDelay" : 3600}) Is there a Limit on the Number of physical Machines for a Replica Set?Yes, there is. As of now (February 2012) it is 12 physical machines per replica set — a shard usually consists of a replica set so... This number includes the primary i.e. 1 primary and 11 secondaries are the maximum number of physical machines for a replica set as of now. Scale from 1 Shard to >=2 ShardsYes, and doing so does not even require downtime. Is there Downtime for adding a Shard to an existing sharded Setup?No, a shard can be added to an existing sharded setup without downtime
for the existing setup. The way this works is rather simple — we set
up a new replica set and then use the In which Order do I upgrade a sharded Setup?Assuming we have sharding set up and each shard is made redundant using a replica set, the recommended order for upgrading the whole setup is to start with the config servers, next the shards (replica sets), and finally the mongos used to route client requests to our shards. It is also recommended to use the same version of mongodb for all instances and, if possible, use pre-built packages as provided by our OS e.g. Do I need an Arbiter?An arbiter is a Therefore, an arbiter is only needed for a replica set that might end
up with 2 secondaries (or any other even number of secondaries) which
carry the same weight in terms of votes ( In the case of an odd number of secondaries remaining after the primary vanished, they can elect a new primary amongst themselves without the need for an arbiter — that is, even if all of them carry the same weight in terms of votes. Usually what we want to start out with is an odd number of This is merely a slight insurance assuming that every instance carries the same weight in terms of votes — there is also the factor of which of the remaining secondaries has the most up-to-date copy of the data set i.e. votes are not the only factor that determines the election of a new primary. There are also replica set priorities which influence the election of a new primary. An arbiter would ideally run on a different physical machine (i.e. not one that participates in the replica set as primary or secondary). Note however that there is nothing stopping us from having an arbiter
deployed for a replica set which, even after its primary vanishes
(physical machine dies,
Saving DiskspaceSome setups start out with only 2 members (1 primary and 1 secondary). The benefit of doing so is saving diskspace — 2 copies of our data set rather than 3. The downside with this setup is that there are only two copies of the data set at all times which increases the probability of total data loss. With 1 arbiter and 1 secondary the replica set is able to break ties in the voting process and elect the secondary to primary automatically (without manual/human intervention) — without the arbiter the replica set would not be able to recover automatically and would remain in read-only mode until manually fixed. Replica Set PrioritiesMongoDB 1.9 allows us to assign priorities in the range of 0 to 100 to each member of the replica set.
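Since every member persists the configuration of the entire replica set in its local database, we can inspect priorities and arbiter flags from PyMongo; a hedged sketch, assuming a member runs on localhost:

import pymongo

con = pymongo.Connection()
config = con.local.system.replset.find_one()    # the persisted replica set configuration
if config is not None:
    for member in config['members']:
        print(member['host'],
              member.get('priority', 1),        # defaults to 1 when not set explicitly
              member.get('arbiterOnly', False))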
WRITEME Arbiters with --oplogsize 1By default the oplog size is roughly 5% of diskspace on a machine
running a Anyway, arbiters do not need to hold any data or have an oplog; therefore 5% of diskspace would be a huge waste of space. How can I make sure a write Operation takes place?Go here and here for more information. Sharding works on the collection-level but not on the database-level?Exactly. When sharding is enabled we shard collections but not the entire database. This makes sense since, as our data set grows, certain collections will grow much larger than others. Replication works on the database-level but not on the collection-level?Collection-level replication is not supported for replica sets i.e. we can either replicate the entire database or nothing at all. I want MongoDB to start sharding with less than X MiB?MongoDB's sharding is range-based. So all the objects/documents in a collection get put into a chunk. Sharding only starts after there is more than 1 chunk. Right now (February 2012), the chunk size is 64 MiB, so we need at least 64 MiB for sharding to occur. If we do not want to wait until we have 64 MiB, then we can configure
Also: note that chunks with regards to GridFS (usually 256 KB in size) are something entirely different from chunks with regards to sharding, both in size and the semantics that go with them; it is just that in both cases the term chunk is used. What ties exist between Map/Reduce and ordinary Queries and indexing?Indexing and standard queries in mongodb are separate from map/reduce i.e. for somebody who might have used couchdb in the past this is a big difference — MongoDB is more like mysql for basic querying and indexing i.e. those who just want to index and query can do so without the need to ever touch the whole map/reduce subject. Does a Shard Key have to be a unique Index?A shard key can either be a single or a compound index. We might also
choose to make it unique. If so, then we should know that a sharded
collection can only have one unique index, which must exist on the
shard key i.e. no other unique indexes can exist on a sharded
collection except for the unique index which exists on the shard key
itself. Note that Either way, some variation is good since otherwise it will be hard for MongoDB to split up our data set if the shard key lookup always returns either true or false. Choose a Shard Key?First and foremost, note that this very much depends on the use case at hand i.e. a shard key is always specific to a particular setup and therefore has to be chosen wisely — the choice of a shard key will affect overall cluster performance as well as the time spent to extend and maintain it. What is generally true for any use case is that we want our data set
evenly distributed across the entire cluster i.e.
Second, we want data to only live on one shard at all times i.e. we do not want the same data to live on two or more shards at the same time. In order to adhere to these two general rules, a shard key should fit our use case so that:
Last but not least, note that what is true for how data gets stored in a sharded setup is true for anything else as well e.g. the same semantics apply for queries to our data set. The above explanation focused only on storing data for the sake of simplicity, as most people new to sharding already have a hard time wrapping their minds around it. Time vs Non-time baseAs a rule of thumb, we can decide to go with one of the following use cases:
Compound Shard KeysThis is when the shard key pattern has more than one field —
The way semantics change (for splitting up our data set and/or for queries etc.) is like this: If we shard by a compound shard key pattern of Should I use a Compound Shard Key?As usual, it depends on the data set at hand and how it is queried (see above). Please also go here for more information. Change Shard KeySo far (February 2012), there is no way to change the sharding key on a collection. In order to have a different shard key, we would have to recreate the collection with the new sharding key and write a script to move data from the old collection to the new one. Unique IndexThe same that is true for changing the shard key is true for having a unique index on sharded collections — we need to recreate and move data from the old to the new collection which would have the unique index in place. Also, note that a sharded collection can have only one unique index, which must exist on the shard key itself. Migrating ChunksWith a sharded MongoDB setup, data (chunks) is moved around on all
post-cleanup or removedDuringChunks are a logical unit — right now (January 2012), The Specifically,
Swap Master And SlaveWe can swap master and slave. Actually this is not much different from cloning a snapshot. What we need to do is:
Can I upgrade the Master without the Slave(s) needing to resync?Yes. Assuming we want to upgrade the master (e.g. give it
faster/bigger hardware) to cope with growth, we certainly do not want
to resync the/all slave(s) which were pointed to this master using
The way this works is by copying the oplog from the old master to the new one and then pointing the/all slave(s) to the new master. The necessary steps are outlined below:
How does distributing Reads amongst Replica Sets/Pairs work?This does not happen automatically (i.e. By default all I/O (input/output) operations go to the master (in replica pair terminology) or the primary (replica set terminology). The problem here is that there is no guarantee all secondaries within the replica set are totally synced with their primary at any given time — there are cases where this is not a problem but there might be cases when it actually matters. The way distributing reads works is that an application using MongoDB gets to decide when/how to distribute reads by means provided through its native MongoDB language driver e.g. PyMongo. This provides maximum flexibility and still makes total sense since replica sets are thought to solve more problems than just distributing reads (load balancing). For example, with replica sets we get:
PyMongo for example might have some more flexible options in the future, but to distribute reads to a replica set we have to maintain multiple connection instances and distribute them in our application code i.e. as of now (February 2012) PyMongo does not load-balance reads across secondaries (e.g. in round-robin fashion) automatically. As said however, we might see this feature soon... Distributing reads with mongosIf we have a sharded setup where each shard is made up of a replica set, we also have one or more mongos which route queries to and aggregate results from the replica sets. We already know that with Replica Set Configuration is persistentThat is, if I shut down a whole cluster, do I have to re-initialize it when I bring it back up? No, you do not have to re-initialize because, as for sharding, the configuration is persistent. Each node within the replica set contains the configuration for the entire replica set. Shard Configuration is persistentYes, the config server handles persistence of the shard setup. Overhead involved with ShardingIn an auto-sharding setup, the Compared to {My,Postgre}SQL, what is the Performance Penalty with Replication?When we use replication with mysql/<some_other_rdbms>, it requires us to enable binlog and that results in a significant performance drop, usually 50% or more. This is different with MongoDB as it does not require writing to an additional replication log and therefore does not have anywhere near that 50% performance drop. With MongoDB, replication works by writing to an oplog which is generally pretty fast (depending on use case of course). What if a Primary's Oplog is too far ahead for a Secondary to catch up again?Since the oplog is a capped collection, this could very well happen. The question is what a secondary would do that already existed but may have been without a connection to its primary for some time — maybe long enough for the primary's oplog to roll over! Most folks would guess one of three cases:
Well, of course 1 is what happens: whether we had the primary roll over its oplog already or not, the secondary always does a full resync in order to have the exact same state/data as its primary. In case we want to speed things up, we could clone a snapshot from the primary and then use fastsync to create a new and up-to-date secondary from an existing primary which could then be integrated within the replica pair/set again. Does using --only mean not the entire Oplog is written on the Master?No, even if we use DNSReplica sets as well as shards can use domain names instead of IPs. MiscellaneousA colorful mix of various non-related bits and pieces... tell me, does this also make you think of the food you had back then when you were still a student? ;-] C ExtensionsBy default PyMongo gets installed/built using optional C extensions (used for increasing performance). If any extension fails to build then PyMongo will be installed anyway but a warning will be printed. This is how we can check whether or not PyMongo got compiled with the C extensions: >>> import pymongo >>> pymongo.has_c() True >>> ConnectionsFor each TCP connection from a client to a This means that a single connection always has a consistent view of the data set — it reads its own writes e.g. if we do a query after we did an insert/update, then the query will return the most current status (we will see the changes done by insert/update). However, two or more connections do not necessarily have a consistent view of the data set as one connection might do a write/update which might not be seen immediately by another connection. For example, each mongo has its own TCP connection to a getLastErrorThis is useful for all CRUD (Create Read Update Delete) operations
(e.g. What all those CRUD operations have in common is that by default none
of them waits for a return code from For example, if we are writing data to a From within the shell the syntax to query directly on the If we use options such as Connection Pooling
The good news here is that most drivers support connection pooling
i.e. the drivers open multiple TCP connections (a pool) to PyMongo for example does automatic connection pooling in that it uses one socket per thread. As of now (January 2012) the number of pooled sockets is fixed at 10, but that does not include sockets that are checked out by any threads. Number of Clients connectedHave a look at the Number of parallel Client ConnectionsHave a look at the EndiannessMongoDB assumes that we are using a little-endian system e.g. Linux. In a nutshell: little-endian is when the LSB (least significant byte) is at the lowest address in memory. The other bytes follow in increasing order of significance. Even though MongoDB assumes little-endian, all of the drivers work on big-endian systems, so we can run the database remotely and still do development on our big-endian (old-school) machine. 32 bit limitationThere is a size limit because of how memory mapped files work. This
limit is 2 GiB per database (read per Type CastingIn MySQL we have Forbidden Characters/KeywordsYes, there are some. It is recommended not to use certain characters in key names. Store Logic on the ServerYou, like me, we are both sane technicians i.e. we follow the three tier model when building web applications and thus keep presentation, application logic and data separated. However, with MongoDB, it sometimes makes sense to put query logic onto/into the data tier. In a way that is pretty much the same as stored procedures with RDBMSs. So, the answer is yes. With MongoDB we can use JavaScript to write
functions (read logic) which we can then store and execute directly on
the server side i.e. where We should use this with caution however since there is only one thread of execution for each JavaScript call on the server (e.g. 10 Map/Reduce jobs would run 10 JavaScript threads in serial i.e. one after the other). This makes it a potential bottleneck and might cause problems on a heavily used server (e.g block other operations). Go here for more information. Cron-like SchedulerAs of now (February 2012) there is no built-in scheduler with MongoDB i.e. it is not possible to schedule execution for some server-side stored JavaScript. The simplest solution to do this now would be to put the JavaScript
code into a file, put this file into the OplogThe so-called oplog or operations log also known as transaction log is
in fact a capped collection (all The way this (replication) works is fabulous — we do not move the actual data but instead we move operations (read actions/computations) from the primary to the secondary and then run them again on the secondary, thereby creating the exact same state/data as found on the primary. Since the oplog is a capped collection, it is allocated to a fixed
size (5% of overall diskspace by default; If the secondary can not keep up with this process, then replication will be halted. The solution is to increase the size of the primary's oplog. There are a couple of ways to do this, depending on how big our oplog will be and how much downtime we can stand. But first we need to figure out how big an oplog we need. The Oplog is IdempotentIf working with the oplog (e.g. by replaying it on some database or by using it to synchronize two members of a replica set etc.) it is important to know that the oplog is idempotent i.e. we can apply it to our data set as many times as we want and our data set will always end up the same. How do I determine and create the right Oplog Size?
WRITEME What else can I use the Oplog for?Well, there are many good ideas floating around. Some of them centered around events. For example, by watching the oplog we might build an event framework/loop and do all kinds of fancy things. That is pretty much exactly the same as using CouchDB's CursorWell, generally speaking i.e. for DBMSs (Database Management Systems) in general, a cursor comprises a control structure for the successive traversal and potential processing of records in a result set. By using cursors, the client to the DBMS can get, put, and delete database records. Of course, this is also true for MongoDB i.e. a query for example returns a cursor which can then be iterated over to retrieve results. The exact way to query will vary with language driver. For example,
using MongoDB's interactive shell, database queries are performed with
the Conceptually speaking, a cursor always has a current position. If we delete the item at the current position, the cursor automatically skips its current position forward to the next item. As of now (February 2012), a cursor in MongoDB does not provide true point-in-time snapshot functionality when being accessed — we call this cursors being latent i.e. depending on whether or not an intervening CRUD (Create Read Update Delete) operation to the database happens after we did an initial access/query to the cursor, the exact same access/query to the cursor might not give us the same result after the CRUD operation happened. There is however a Tailable CursorIf we issue Tailable in this context means the cursor is not closed once all data is retrieved from an initial query. Rather, the cursor marks the last known object's position and we can resume using the cursor later, from where that object was located, provided more data is available. Like all latent cursors, the cursor may become invalid at any point e.g. if the last object returned is deleted. Thus, we should be prepared to requery if the cursor is dead. We can determine if a cursor is dead by checking its ID. An ID of zero indicates a dead cursor. In addition, the cursor may be dead upon creation if the initial query returns no matches. In this case a requery is required to create a persistent tailable cursor. Last but not least, tailable cursors are only allowed on capped collections and can only return objects in natural order. Transactions and DurabilityPlease go here for more information. How can I make the ObjectID to be a Timestamp?Firstly, that is what happens per default anyway i.e. the ObjectID
contains a timestamp. If, for whatever reason, we want the ObjectID to be
a timestamp and nothing but a timestamp, then we could do
However the downside of doing all this is that the ObjectID will not be guaranteed to be unique anymore which is generally considered a bad thing. One should have a very good reason for going down that road i.e. replacing the default ObjectID with a timestamp. It is probably better to store a timestamp in addition to the default ObjectID or, even better, use the portion of the default ObjectID that contains the timestamp and work based on that. If I insert the same Document twice, it does not raise an Error?Using PyMongo for example, why does inserting the same document (read
same try: doc = {'_id': '123123'} db.foo.insert(doc) db.foo.insert(doc) except pymongo.errors.DuplicateKeyError, error: print("Same _id inserted twice:", error) The answer is that Is incrementing and retrieving a new Value in one Step possible?Yes, it is even an atomic operation. The way it works is it returns the old value, not the new one i.e. the document returned will not include the modifications made on the update. What are MongoDB's Components/Building-blocks?
with sharding we get shards where each shard can be composed of either a single mongodb or a so-called replica set; a MongoDB auto-sharding+replication cluster has a master server at a given point in time for each shard. Config ServerConfig servers manage/keep all information with regards to sharding i.e. the config servers store the cluster's metadata, which includes basic information on each shard (which in most cases consists of a replica set) and the chunks contained therein. Chunk information is the main data stored by the config servers. Each config server has a complete copy of all chunk information. A two-phase commit is used to ensure the consistency of the configuration data among the config servers. If any of the config servers is down, then the cluster's metadata becomes read-only. However, even in such a failure state, the MongoDB cluster can still be read from and written to. As of now (February 2012) there can be either one (for development) or three (use when in production) config servers — not two, and not more than three. However, future plans include more than three config servers with more fine-grained control of what happens when one or more config servers go down. WRITEME: mongodb, mongos, arbiter Scalability, Fault Tolerance, Load BalancingAll of this we can do using sharding in combination with replica sets. Before we start however, I want the reader to know that there is a FAQs section with regards to scalability, fault tolerance and load balancing. ShardingWRITEME... just notes so far
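As a concrete starting point for these sharding notes, here is a hedged PyMongo sketch of how one might enable sharding for a database and one of its collections by talking to a mongos; the database/collection names and the shard key are made up:

import pymongo

mongos = pymongo.Connection('localhost', 27017)          # assumes this is a mongos, not a mongod
admin = mongos.admin
admin.command('enablesharding', 'shop')                  # allow sharding for database 'shop'
admin.command('shardcollection', 'shop.orders',          # shard one of its collections
              key={'customer_id': 1})                    # on a made-up shard key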
SizeThere is no hard-coded limit to the size of a sharded collection. We can start a sharded collection and add data for as long as we also add the corresponding number of shards required by the workload we experience. There are, however, two things we need to be smart about:
Global vs TargetedMongoDB sharding supports two styles of operations — targeted and global. On giant systems, global operations will be less applicable. ReplicationWRITEME... just notes so far
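To go with the replication notes, a small sketch that asks any member for the overall replica set status (member names and states); it assumes a replica set member running on localhost:

import pymongo

con = pymongo.Connection()
status = con.admin.command('replSetGetStatus')
for member in status['members']:
    print(member['name'], member['stateStr'])            # e.g. PRIMARY, SECONDARY, ARBITER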
Replica PairReplica Set
FastsyncWRITEME Ensure Writes to n NodesWRITEME
Load BalancingWRITEME... just notes so far
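Along the lines of what was said above about distributing reads in application code, here is a hedged sketch that cycles reads over per-secondary connections; the hostnames and the database/collection names are made up, and slave_okay is needed so the secondaries accept queries:

import itertools
import pymongo

secondaries = ['db1.example.com', 'db2.example.com']      # made-up hostnames
read_cons = [pymongo.Connection(host, slave_okay=True) for host in secondaries]
round_robin = itertools.cycle(read_cons)

def find_one(query):
    # each read goes to the next secondary in turn; writes still go to the primary
    return next(round_robin).test.demo.find_one(query)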
MiscellaneousThis section provides miscellaneous information with regards to MongoDB. PythonThis section is about using MongoDB as data-tier for a logic-tier written in Python. Best of both worlds, we love it because it is totally
Crazy? Exciting? Fabulous? Hm... not sure ;-]
PyMongo
Python, because JavaScript is slow, C++ is hard and Java feels like ORMMapReduce with PyMongowith Python Python drivers/stack
Django Stack
with mongokit with mongoengine
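For the Django stack with mongoengine, a minimal sketch (the database name and document class are made up) of what a document definition might look like:

import mongoengine

mongoengine.connect('polls')                              # made-up database name

class Poll(mongoengine.Document):
    question = mongoengine.StringField(required=True)

Poll(question="What's up?").save()
print(Poll.objects(question__icontains='up').count())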
Utilitiessa@wks:/tmp/nightly/mongodb-linux-x86_64-v1.6-2010-10-21$ ll bin/ total 63M -rwxr-xr-x 1 sa sa 6.7M Oct 21 08:45 bsondump -rwxr-xr-x 1 sa sa 3.1M Oct 21 08:45 mongo -rwxr-xr-x 1 sa sa 6.8M Oct 21 08:45 mongod -rwxr-xr-x 1 sa sa 6.8M Oct 21 08:45 mongodump -rwxr-xr-x 1 sa sa 6.7M Oct 21 08:45 mongoexport -rwxr-xr-x 1 sa sa 6.8M Oct 21 08:45 mongofiles -rwxr-xr-x 1 sa sa 6.8M Oct 21 08:45 mongoimport -rwxr-xr-x 1 sa sa 6.7M Oct 21 08:45 mongorestore -rwxr-xr-x 1 sa sa 4.6M Oct 21 08:45 mongos -rwxr-xr-x 1 sa sa 1.1M Oct 21 08:45 mongosniff -rwxr-xr-x 1 sa sa 6.8M Oct 21 08:45 mongostat sa@wks:/tmp/nightly/mongodb-linux-x86_64-v1.6-2010-10-21$ Some of them ( TopicsWRITEME Tips and TricksBackup, RecoveryBelow we are going to look at different scenarios when we would need to recover (as opposed to how to create/keep a backup and/or run a replica set) e.g. because we have a hardware failure or power outage etc. Hardware Failure
That is the easiest one. We need to go and fix the server before we can move on to either case from below. Power Outage, No Writes
If we did not do any writes during the session that shut down
uncleanly, our data set is fine. All we need to do is remove the
mongod.lock file after rebooting the machine and start Power Outage, Writes, No Replica Set, Backup
We had a power outage during a session which had writes. We run
without a replica set but luckily we have a backup. All we need to do
is stop Power Outage, Writes, No Replica Set, No Backup
This is the same as above but unfortunately we do not have a backup
that we can drop into If we have a single Since we only have this one copy of our data set, we will have to
repair whatever is left i.e. we remove the What we should not do at all is just remove the
Note that MongoDB keeps a sort of table of contents at the beginning of each collection. If it was updating its table of contents when the unclean shutdown happened, it might not be able to find a lot of healthy data. This is how the I lost all my documents! horror stories happen. Run with replication, people! Power Outage, Physical Damage, Writes, Replica Set
This is the procedure to follow when, within a replica set, the primary vanishes e.g. because of power outage, physical damage, hardware failure, etc. Since we are running a replica set (with or without sharding), we do not need to do anything; the promotion of a new primary will happen automatically — assuming we do not need an arbiter. Also we do not need to change anything in our application code (e.g. Django) since failover will happen automatically as well. Instead, really, if we are running a replica set, all we need to do is start the node that failed with the same arguments we used before i.e. all the stuff that lives in the configuration file. Next, remove the entire data directory ( Last but not least, two additional but unrelated notes:
DelegationQuerying / Sorting / Indexes
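A small PyMongo sketch for the querying/sorting/indexing topics; the collection and field names are made up:

import pymongo

con = pymongo.Connection()
coll = con.test.people

coll.ensure_index([('age', pymongo.ASCENDING)])          # index used by the query below
for doc in coll.find({'age': {'$gt': 21}}).sort('age', pymongo.ASCENDING).limit(5):
    print(doc)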
AggregationMap/Reduce
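For Map/Reduce, a hedged word-count style sketch with PyMongo; the input collection ('articles', assumed to carry a 'tags' array per document) and the output collection name are made up:

import pymongo
from bson.code import Code

con = pymongo.Connection()
coll = con.test.articles

map_f = Code("function() {"
             "  this.tags.forEach(function(t) { emit(t, 1); });"
             "}")
reduce_f = Code("function(key, values) {"
                "  var total = 0;"
                "  values.forEach(function(v) { total += v; });"
                "  return total;"
                "}")

result = coll.map_reduce(map_f, reduce_f, 'tag_counts')   # output collection name
for doc in result.find():
    print(doc['_id'], doc['value'])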
ACID, ConcurrencyThe basic thing to know about MongoDB with regards to ACID (Atomicity, Consistency, Isolation, Durability) is that MongoDB
Basically what that means is that if we have several clients competing to make writes to a single document, we can be sure that this will be atomic — it either happens from start to end or not at all i.e. there will be no two clients interfering with each other's writes and thus possibly creating inconsistency. WRITEME some notes from http://permalink.gmane.org/gmane.comp.db.mongodb.user/2590
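To illustrate the single-document atomicity just described, a small sketch that bumps a counter with the atomic $inc modifier; the collection and field names are made up:

import pymongo

con = pymongo.Connection()
coll = con.test.counters

coll.insert({'_id': 'pageviews', 'n': 0})
# many clients can run the next line concurrently; each $inc is applied atomically
coll.update({'_id': 'pageviews'}, {'$inc': {'n': 1}})
print(coll.find_one({'_id': 'pageviews'}))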
Single Server DurabilityNew since 1.8. One discussed solution is to have append-only transaction logs (write-ahead log). Security
Server Side Code Execution
db.eval(), $whereWhen we pass JavaScript to a db.eval() in sharded environmentStored Functions/Procedures
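A hedged sketch of storing a JavaScript function on the server and calling it from PyMongo; the function name and body are made up, and the caveats above about the single JavaScript execution thread still apply:

import pymongo

con = pymongo.Connection()
db = con.test

# saved into the special system.js collection on the server
db.system_js.echo_twice = "function(x) { return x + x; }"
print(db.system_js.echo_twice('ab'))             # runs server-side via eval, returns 'abab'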
TriggersBSONBSON is a language-independent data interchange format, not just exclusive to MongoDB. BSON was designed to have the following three characteristics:
So, what is the point of BSON when it is no smaller than JSON in many cases? Well, BSON is designed to be efficient in space, but in many cases is not much more efficient than JSON. In some cases BSON uses even more space than JSON. The reason for this is another of the BSON design goals: traversability. BSON is also designed to be fast to encode and decode. This uses more space than JSON for small integers, but is much faster to parse. SONOS Tweaks
GIS
GridFS
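A minimal GridFS sketch with PyMongo; the filename and content are made up:

import pymongo
import gridfs

con = pymongo.Connection()
fs = gridfs.GridFS(con.test)

file_id = fs.put('hello gridfs', filename='greeting.txt')   # stored as chunks plus metadata
print(fs.get(file_id).read())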
Serving GridFS data without going through the Logic TierNginxread from secondaries Misc
Import/ExportMigrating from RDBMS XY to MongoDB
StreamingE-CommerceUse CasesFinancial Applications
handling numbers
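One common approach to handling money amounts (an assumption about the use case, not something prescribed here) is to store integer cents instead of floats, avoiding rounding surprises; the collection and field names are made up:

import pymongo

con = pymongo.Connection()
coll = con.test.invoices                                  # made-up collection

coll.insert({'invoice': 42, 'total_cents': 1999})         # 19.99 stored as 1999 cents
doc = coll.find_one({'invoice': 42})
print('total: %.2f' % (doc['total_cents'] / 100.0))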
VoIP
StorageHostingApplications using MongoDB