Porting From MongoDB to Cloudant: Differences in Design


Who’s this for?

This is for MongoDB developers who want to explore another NoSQL offering. Maybe your original decision to use MongoDB was a mistake, maybe you’re just curious about other options, or maybe you’ve taken to the cloud via IBM Bluemix and want to use an IBM-owned database offering. Whatever the case, the central question remains: how difficult will it be to port my MongoDB project over to Cloudant? I’m here to help address that. Read on.

Understanding Cloudant

This isn’t meant to cover all of Cloudant’s functionality. For a full understanding, see the docs.

Cloudant is a document-based NoSQL database just like MongoDB. It’s a wrapper around CouchDB with added functionality. Like CouchDB, it’s a RESTful service with a web-based GUI that’s capable of all your basic CRUD operations, and it was designed around map/reduce functions.

Documents aren’t organized in collections [1] like in Mongo. Instead, every database simply stores all of its documents together, so if you want to query only a certain subset, you need to create a field within each document that distinguishes it from other types of documents. You can define mapReduce views (think pre-defined queries) to filter your data by said fields. Cloudant is CouchDB distributed across multiple machines, and it also has several features that set it apart: full text search indexes based on Apache Lucene [Edit: IBM has just open-sourced it], Cloudant Geospatial for dedicated deployments, and (the most topical to us) Cloudant Query.

Cloudant Query is an additional way in which Cloudant allows you to get at your data. It’s Cloudant’s attempt at a declarative query language based on MongoDB’s .find() ability. It’s more intuitive for those from a SQL background (and obviously more so for those from a MongoDB background), and its indexes are slightly faster to build than traditional mapReduce views, so it’s the recommended path provided it fits your use case. Note that although this is unique to Cloudant at the moment, IBM plans to contribute it back to the Apache CouchDB project in the near future.

[1] MongoDB also has capped collections and “time to live” collections while Cloudant does not. However, you could certainly replicate the behavior of these collections without too much effort in your code if you so desired.

Good Reasons to Use Cloudant

  • You need a master-master replication system
  • You need AP (Cloudant) over CP (MongoDB) with regard to the CAP theorem
  • You need fast map/reduce functions
  • You need to send HTML or JSON data directly to the client
  • You need a database that scales well horizontally
  • You want great replication support
  • You want a managed, easily scalable solution

In addition, existing MongoDB projects that would convert easily to Cloudant Query are ones in which complex data fields are not often updated, multikey indexing is not used extensively, and MongoDB-specific types are not heavily depended on.


Cloudant Query vs. MongoDB Declarative Queries

Obviously, MongoDB has a larger query language. That’s what it’s made for. Keep in mind that I’m mainly comparing MongoDB to Cloudant Query. If you think Cloudant Query is lacking in features, remember it is just one part of the Cloudant offering. I think in a fair number of use cases, MongoDB can be replaced by Cloudant Query, and in many use cases MongoDB can be replaced by the entirety of the Cloudant offering.

Basic Syntax Differences

You know what MongoDB’s syntax looks like:
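A minimal illustration, written as Python dictionaries in the shape that pymongo's find() accepts. The collection name `users` and the field names are hypothetical, and no database connection is made:

```python
# Hypothetical filter and projection, shaped like the two arguments
# of collection.find(filter, projection) in MongoDB drivers.
mongo_filter = {"age": {"$gt": 25}, "status": "active"}
mongo_projection = {"name": 1, "age": 1}

# With a live connection this would be passed as:
#   db.users.find(mongo_filter, mongo_projection)
print(mongo_filter)
print(mongo_projection)
```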

The equivalent in Cloudant is a POST request to the _find endpoint of your database with the following body:
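As a sketch, here is such a body built in Python for a query on hypothetical `age` and `status` fields, returning only `name` and `age`. The field names and values are assumptions for illustration; only the "selector" and "fields" keys come from Cloudant Query itself:

```python
import json

# Body for: POST /<database>/_find  (Content-Type: application/json)
find_body = {
    "selector": {"age": {"$gt": 25}, "status": "active"},  # the query
    "fields": ["name", "age"],                             # the projection
}

payload = json.dumps(find_body)
print(payload)
```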

Pretty similar. “selector” is the analog of the query parameter of db.collection.find(), and “fields” is the analog of its projection parameter. If you want all the document fields, omit the “fields” value or leave it empty. Cloudant Query doesn’t support exclusion in the projection, and there are no projection operators.

Indexes

In MongoDB, you don’t need to define an index in order to query against your data. Of course, it’s recommended for performance reasons, but you can query straight out of the box. In Cloudant, however, an index must exist for at least one field that you query against. The primary index, the one that indexes the _id field, can be used, but that results in no performance gains unless you use some custom-defined _id that allows you to filter certain _ids.

Also unlike MongoDB, Cloudant Query has two types of indexes at your disposal: JSON and text indexes. JSON indexes are the ones you’ll want to use in production as they’re faster than text indexes. They are the same indexes used for views but with a declarative querying wrapper. You can specify multiple fields to index and, just like in MongoDB’s compound indexes, the order is important. Note that MongoDB supports index intersection while Cloudant does not.
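To make the JSON index concrete, the sketch below builds the body for a POST to the _index endpoint. The helper `json_index_body`, the index name, and the field names are hypothetical; the body layout follows the Cloudant Query index syntax as I understand it:

```python
import json

def json_index_body(fields, name):
    """Build the body for POST /<database>/_index for a JSON index."""
    # Field order is preserved because it matters, as with MongoDB compound indexes.
    return {"index": {"fields": list(fields)}, "name": name, "type": "json"}

body = json_index_body(["status", "age"], "status-age-index")
print(json.dumps(body))
```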

Text indexes in Cloudant Query are not very similar to MongoDB’s text index, as they do not support wildcards, phrases, omissions, etc. For that kind of functionality see Cloudant’s full text search (yes, it’s a bit confusing that Cloudant Query text indexes provide a different functionality than full text search indexes). What Cloudant Query’s text index does do is automatically index all possible fields (you can still index a subset of fields if you’d like), which means you can immediately use whatever queries you’d like without having to worry whether there’s an appropriate index. Any query that can be run on a JSON index can be run on a text index. Additionally, text indexes have the special $text operator, which lets you search for any string within a document that matches your $text string (note: MongoDB’s $text operator is different).

One of the major differences between Cloudant Query’s and Mongo’s indexing is the way they index different data types and the resulting queries that can be performed. Only String, Number, and Boolean data types can be indexed for both JSON and text indexes. This means that arrays and objects can’t be indexed; however, nested fields like someObject.someName can be indexed, provided “someName” is also of the correct type. When I say arrays can’t be indexed, it’s a bit of a half-truth. In fact, we need a subsection for this:

Array Indexes

You can create an index for an array field, but this is not the kind of index that you probably want. Say you have documents with the “fooArray” field that’s three elements long. You then POST this body to the _index endpoint:
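A body along these lines would do it. This is a sketch that only builds the JSON; the index definition simply names the “fooArray” field from the text:

```python
import json

# Body for: POST /<database>/_index
array_index_body = {"index": {"fields": ["fooArray"]}, "type": "json"}

print(json.dumps(array_index_body))
```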

It’ll create successfully and you may think you have an array index. Not so fast. This will only work if you query for the entire array like so:
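For instance, assuming the three hypothetical element values below, a whole-array query body for the _find endpoint could look like this:

```python
import json

# The selector has to match the complete array, element for element.
whole_array_query = {"selector": {"fooArray": ["a", "b", "c"]}}

print(json.dumps(whole_array_query))
```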

That’s not very useful. Instead, another slightly more useful (still not too useful) way to index array elements is by their position:
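A sketch of a positional index body, assuming the dotted notation `fooArray.<position>` is accepted in the index field list. The helper name is hypothetical:

```python
import json

def positional_index_body(array_field, length):
    """Index each position of an array field as its own dotted field name."""
    fields = [f"{array_field}.{i}" for i in range(length)]
    return {"index": {"fields": fields}, "type": "json"}

position_index_body = positional_index_body("fooArray", 3)
print(json.dumps(position_index_body))
```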

That way you can now query like:
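For example, a selector on the first element only, again assuming the dotted-position notation and a hypothetical element value:

```python
import json

# Matches documents whose first fooArray element equals "a".
position_query = {"selector": {"fooArray.0": "a"}}

print(json.dumps(position_query))
```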

This isn’t explicitly covered in the docs but it is supported. Unfortunately, this kind of array indexing does not work at all like MongoDB’s multikey indexes, but you can get a similar functionality by using traditional views that loop through arrays and objects to index how you’d like (example).

Operators

Cloudant Query’s operators are usually sufficient for most use cases–obviously there aren’t as many as MongoDB’s operators. Cloudant Query splits them into two categories, combination and condition operators. These should look familiar:

  • Combination: $and, $or, $not, $nor, $all, $elemMatch
  • Condition: $lt, $lte, $eq, $ne, $gte, $gt, $exists, $type, $in, $nin, $size, $mod, $regex

I think for most people, the query projection operators and update operators will be the ones they miss the most. However, since the projection operators only limit the size of the returned results, they can be worked around. As for update operators, they too can be worked around, but Cloudant’s not really meant for data that updates a lot.

Additionally, the only evaluation query operator missing from Cloudant Query is $where, which can’t make use of indexes anyway. And while Cloudant Query technically lacks most of the aggregation pipeline operators, all of the query modifiers, and all of the geospatial operators, Cloudant as a whole can substitute most of those features.

For instance:

  • The MongoDB docs state that the aggregation pipeline is an alternative to map/reduce functions, which Cloudant is designed for
  • The geospatial operators can be substituted with Cloudant Geospatial in dedicated deployments (see the geospatial section for more info.)
  • Cloudant doesn’t have query modifiers in the form of operators, but it does have an _explain endpoint to see what index was used, and you can force a specific index, limit the number of documents, and sort the returned documents in various orders
  • Lastly, certain operators don’t make as much sense in Cloudant. Take the $showDiskLoc operator for example. Cloudant is a hosted solution so why would I want a reference to an on-disk location?
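As a sketch of those query-modifier substitutes, the body below adds "sort", "limit", and "use_index" next to the selector. The field names and the design document name are hypothetical, and sorting assumes a suitable index exists on the sorted field:

```python
import json

modified_query = {
    "selector": {"age": {"$gt": 25}},
    "sort": [{"age": "asc"}],               # order of the returned documents
    "limit": 10,                            # cap on the number of documents
    "use_index": "_design/age-index-ddoc",  # force a specific index (hypothetical name)
}

# POSTing the same body to /<database>/_explain instead of /<database>/_find
# reports which index would be used.
print(json.dumps(modified_query))
```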

Also be aware of the following quirk in Cloudant (taken straight from the docs):

You cannot use combination or array logical operators such as $regex as the basis of a query when using indexes of type json. Only equality operators such as $eq, $gt, $gte, $lt, and $lte (but not $ne) can be used as the basis of a query for json indexes.

I’ll explain this with an example using the $elemMatch operator. Your selector cannot look like this:
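A hypothetical selector of that kind, where $elemMatch on “fooArray” is the only condition:

```python
import json

# No equality-style condition on an indexed field, so a JSON index cannot serve it.
array_only_query = {"selector": {"fooArray": {"$elemMatch": {"$eq": "a"}}}}

print(json.dumps(array_only_query))
```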

Instead it must have something else be a part of the query like in this example:
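A sketch of such a selector, together with a rough check of the quoted rule. The helper `has_json_index_basis` is hypothetical and only mirrors the sentence from the docs; it is not how Cloudant plans queries:

```python
BASIS_OPERATORS = {"$eq", "$gt", "$gte", "$lt", "$lte"}

def has_json_index_basis(selector):
    """Rough check: does any top-level field use a literal or a basis operator?"""
    for field, condition in selector.items():
        if field.startswith("$"):
            continue  # combination operators cannot be the basis
        if not isinstance(condition, dict):
            return True  # a bare literal is an implicit $eq
        if any(op in BASIS_OPERATORS for op in condition):
            return True
    return False

array_only = {"fooArray": {"$elemMatch": {"$eq": "a"}}}
with_type = {"type": "someType", "fooArray": {"$elemMatch": {"$eq": "a"}}}

print(has_json_index_basis(array_only), has_json_index_basis(with_type))
```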

Above assumes the field “type” has a corresponding index.

In other words, if you want to use an operator, like $elemMatch or $in, that’s not one of the equality operators quoted above and you want to use a JSON index, then another selector field is needed within the query. Remember that’s because Cloudant Query can’t really index arrays, so it needs some field that can be indexed in order to process the query.

This can get confusing when you’ve defined a text index on all fields via a POST to the _index endpoint:
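As far as I know, leaving the index definition empty is what tells Cloudant Query to index every field, so a sketch of the body would be:

```python
import json

# Body for: POST /<database>/_index
# An empty "index" object with type "text" indexes all fields.
text_index_body = {"index": {}, "type": "text"}

print(json.dumps(text_index_body))
```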

It’s confusing because you may mistakenly want to use a JSON index with this query:
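For example, a hypothetical query whose only condition is $in on a `tags` field, which gives a JSON index nothing to start from:

```python
import json

# Only a non-equality operator is present; field name and values are made up.
in_only_query = {"selector": {"tags": {"$in": ["a", "b"]}}}

print(json.dumps(in_only_query))
```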

And the query will work. However, it only worked because a text index was used (Cloudant Query first tries to use a JSON index before trying a text index).

Lastly, recognize that there are a few specific quirks that differ between the operators in Cloudant and MongoDB. For instance, MongoDB’s $in operator supports regular expressions while Cloudant’s does not. From my testing, $elemMatch in Cloudant can compare String, Number, and Boolean types while MongoDB’s $elemMatch can also be used for objects within arrays (again, similar functionality can be found with traditional MapReduce views).

Joins

As you probably know, MongoDB does not support joins (not surprising–NoSQL databases are usually denormalized). However, it can make use of manual references, in which you save the _id field of one document in a field of another so that you can make a second query for the manually referenced document. It also supports the DBRefs data type, which is a more formalized approach to joining documents such that your language’s driver can automatically perform the second query for you.

In Cloudant Query, you can also manually reference documents by their _id field and query a second time for that if you wish. There is no data type to do that for you like with Mongo’s DBRefs. However, Cloudant does support joins in a different way. Here’s an article detailing joins using Cloudant’s typical view functionality. What’s interesting here is that Cloudant will complete the joins for you (provided you ask), so, unlike with MongoDB, two queries are not needed.

Updating Documents

Cloudant is based on the MVCC architecture. Unlike in MongoDB, there’s no locking. Everyone gets a snapshot of the data upon the time of the request, and everyone can write to a document when they’d like to, though it may result in a conflict that Cloudant deterministically resolves.

Updating a document is the same as replacing it. You can’t update one field at a time like in MongoDB. Instead, you have to get the entire document, update the required field, and resend the document with the updated field (the _id and _rev fields need to be identical to the current fields within the database in order for the update to succeed).
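The get-modify-resend cycle can be sketched with a toy in-memory stand-in. `ToyDatabase` and its `N-toy` revision strings are purely illustrative and are not the Cloudant API; the point is only that a write must carry the current _rev or it is rejected:

```python
class ToyDatabase:
    """In-memory stand-in for a Cloudant database (not the real API)."""

    def __init__(self):
        self.docs = {}

    def get(self, doc_id):
        return dict(self.docs[doc_id])

    def put(self, doc):
        current = self.docs.get(doc["_id"])
        # An update must present the revision currently stored.
        if current is not None and current["_rev"] != doc.get("_rev"):
            raise ValueError("conflict: stale _rev")
        generation = int(current["_rev"].split("-")[0]) + 1 if current else 1
        stored = dict(doc, _rev=f"{generation}-toy")
        self.docs[doc["_id"]] = stored
        return stored["_rev"]

db = ToyDatabase()
db.put({"_id": "user1", "name": "Ann", "age": 30})

doc = db.get("user1")   # 1. fetch the whole document
doc["age"] = 31         # 2. change the one field
new_rev = db.put(doc)   # 3. resend everything, including _id and _rev
print(new_rev)
```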

If you don’t want to query for the entire document, you can use an update handler. However, you can’t use an update handler for multiple (bulk) updates.

Note: a common misconception is that MVCC acts as a versioning system. This is false. Replication doesn’t transfer old document versions, and previous versions can only be accessed so long as you don’t compact the database. You should not rely on previous versions existing.

Geospatial

If you don’t want to pay for a dedicated deployment, then you won’t have access to Cloudant Geospatial. If you still need this functionality then you’ll have to stick with MongoDB.

Assuming you do have dedicated Cloudant, both MongoDB and Cloudant support geoJSON and spatial-based queries, and both default to the WGS84 datum. MongoDB supports legacy 2d indexes, but both support the more modern geospatial indexes, which MongoDB calls 2dsphere indexes. Both of them also support many different coordinate reference systems (CRS).

The only major downside I’ve been able to uncover between the two is that Cloudant doesn’t support nearest neighbor searches like MongoDB’s $near and $nearSphere. You can search for objects within a given radius (or sphere), but I don’t think the results are sorted by proximity. Cloudant does have good relational geospatial support between two distinct objects–the relevant docs are here.

Mongo Type Considerations

One of the major technical differences between MongoDB and Cloudant is the BSON data format MongoDB uses behind the scenes. MongoDB has additional types (Timestamp, DB Reference, OID, etc.) that need to be converted to a valid JSON type in order to be stored in Cloudant. Luckily, MongoDB does this conversion automatically for you, but its conversion may not be what you want/expect. This page from the MongoDB docs specifies what the JSON equivalent is for each BSON type.

Each BSON-specific data type converts to an object with a couple of fields. For instance, the DB Reference converts to:
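In MongoDB's strict-mode JSON a DB Reference becomes an object with "$ref" and "$id" keys. The sketch below shows a hypothetical exported document; the field name `owner`, the collection name, and the id value are made up, and it assumes the referenced id was an ObjectId:

```python
# Hypothetical exported document whose "owner" field was a DBRef.
exported = {
    "title": "post",
    "owner": {
        "$ref": "users",                              # referenced collection
        "$id": {"$oid": "507f1f77bcf86cd799439011"},  # referenced _id
    },
}

ref = exported["owner"]
print(ref["$ref"], ref["$id"]["$oid"])
```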

This means that if you’re sorting on a certain BSON type, that same sorting probably won’t work right out of the box in Cloudant. Additionally, this causes problems for porting existing data with embedded documents and a few other types that are addressed below.


The Actual Porting

If you have gigabytes of data in your current MongoDB system, I wouldn’t recommend porting if you can help it, and you shouldn’t use the shell commands below–you’re on your own there.

Hopefully, you understand a bit more about the changes that may be needed within your code/documents. As for the actual porting, give this a read, though it’s unfortunately a bit out of date. Follow the steps up until the curl request (note that in order to see your Cloudant credentials, you need to bind it to a Bluemix app).

The problems you’ll almost certainly face will have to do with the __rev and _id fields. You shouldn’t be uploading any __rev field to Cloudant as Cloudant has its own revision system. Thus, run the following command to rid yourself of that field:
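A Python sketch of the same clean-up, assuming the export holds one JSON document per line. The helper name `strip_field` and the sample document are hypothetical:

```python
import json

def strip_field(lines, field="__rev"):
    """Parse one JSON document per line and drop `field` from each."""
    docs = []
    for line in lines:
        if line.strip():
            doc = json.loads(line)
            doc.pop(field, None)  # ignore documents that lack the field
            docs.append(doc)
    return docs

sample = ['{"_id": {"$oid": "507f1f77bcf86cd799439011"}, "__rev": 3, "name": "Ann"}']
cleaned = strip_field(sample)
print(cleaned)
```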

As for the _id field, unless you were using a custom one, it was converted to an object like:
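In strict-mode JSON an ObjectId is wrapped in an object with a single "$oid" key. A hypothetical exported document (the id value is made up):

```python
exported_doc = {"_id": {"$oid": "507f1f77bcf86cd799439011"}, "name": "Ann"}

# The _id is a nested object here, not a string.
print(type(exported_doc["_id"]).__name__)
```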

But Cloudant _id’s need to be strings. What to do next depends on your use case. If you don’t use manual references or DBRefs, then you can just delete the field like so:
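In Python the deletion could be sketched as below; the helper name `drop_id` is hypothetical. With no _id present, Cloudant assigns a new one when the document is created:

```python
def drop_id(docs):
    """Return copies of the documents without their _id field."""
    return [{k: v for k, v in doc.items() if k != "_id"} for doc in docs]

docs = [{"_id": {"$oid": "507f1f77bcf86cd799439011"}, "name": "Ann"}]
without_ids = drop_id(docs)
print(without_ids)
```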

If you are using manual references or DBRefs, you will need the _ids to just be the string value and not the entire object. This can be accomplished with this command:
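One way to do this in Python is a small recursive pass that replaces every {"$oid": ...} wrapper with its string, so that _id fields and manual references stay consistent. The function name and sample data are hypothetical:

```python
def flatten_oids(value):
    """Replace every {"$oid": "<hex>"} object with the bare hex string."""
    if isinstance(value, dict):
        if set(value) == {"$oid"}:
            return value["$oid"]
        return {key: flatten_oids(item) for key, item in value.items()}
    if isinstance(value, list):
        return [flatten_oids(item) for item in value]
    return value

doc = {
    "_id": {"$oid": "507f1f77bcf86cd799439011"},
    "author_id": {"$oid": "507f191e810c19729de860ea"},  # a manual reference
    "tags": ["a", "b"],
}
flat = flatten_oids(doc)
print(flat)
```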

With the _id field in an acceptable format, you can now push the documents to the _bulk_docs endpoint:
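A sketch of that request using only the Python standard library. The account, database, and credential values are placeholders, the `https://<account>.cloudant.com` URL form is an assumption about your deployment, and the request is built but not sent:

```python
import base64
import json
import urllib.request

def bulk_docs_request(account, database, username, password, docs):
    """Build (but do not send) a POST to the _bulk_docs endpoint."""
    url = f"https://{account}.cloudant.com/{database}/_bulk_docs"
    body = json.dumps({"docs": docs}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = bulk_docs_request("myaccount", "mydb", "user", "pass", [{"_id": "a", "name": "Ann"}])
# urllib.request.urlopen(req) would actually send it.
print(req.get_method(), req.full_url)
```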

Note that your Cloudant username and password can be seen after binding the Cloudant service to an app. The app doesn’t need to do anything; you just need to see the “Environment Variables” tab on the left once you’ve bound your service to it.

The steps above are for one MongoDB collection. I think you should process collections separately so that you can add a field (using regex or whatever language you want) to all your documents that indicates what collection it came from. Remember Cloudant stores all the documents together so if you add a field, say “collection”, you can query your documents by this collection field to maintain your previous organization.
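Tagging could be sketched like this, where the field name “collection” follows the text and the helper name and sample data are hypothetical:

```python
def tag_collection(docs, collection_name, field="collection"):
    """Return copies of the documents with their source collection recorded."""
    return [dict(doc, **{field: collection_name}) for doc in docs]

users = tag_collection([{"_id": "u1", "name": "Ann"}], "users")

# A Cloudant Query selector that then recovers the old "users" collection:
users_selector = {"selector": {"collection": "users"}}
print(users, users_selector)
```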

If You Have Stored Images (Not Using GridFS)

As you can see, documents with images will be converted to this:
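In strict-mode JSON, binary data becomes an object with "$binary" and "$type" keys. A hypothetical exported document follows; the field name `image`, the bytes, and the subtype value "00" are illustrative:

```python
import base64

raw_bytes = b"\x89PNG"  # stand-in for real image bytes

exported_image_doc = {
    "_id": "img1",
    "image": {
        "$binary": base64.b64encode(raw_bytes).decode("ascii"),  # <bindata>
        "$type": "00",                                           # <t>
    },
}
print(exported_image_doc)
```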

Where “<bindata>” is the base64 encoding of the binary data, and “<t>” is what kind of binary (doesn’t really help us). In order to transfer this data to Cloudant, the binary data needs to be within the reserved _attachments field:
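A sketch of that conversion, moving a "$binary" field into an inline attachment. The helper name, attachment file name, and MIME type are assumptions you would adapt to your data:

```python
import base64

def to_attachment(doc, field, filename, content_type):
    """Move a {"$binary": ...} field into the reserved _attachments field."""
    converted = {key: value for key, value in doc.items() if key != field}
    converted["_attachments"] = {
        filename: {
            "content_type": content_type,    # MIME type of the file
            "data": doc[field]["$binary"],   # already base64 encoded
        }
    }
    return converted

source = {
    "_id": "img1",
    "image": {"$binary": base64.b64encode(b"\x89PNG").decode("ascii"), "$type": "00"},
}
with_attachment = to_attachment(source, "image", "image.png", "image/png")
print(with_attachment)
```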

Where “content_type” is the MIME type and “data” is the base64 encoded data. Note that while base64 encoded data is sent to Cloudant, the corresponding binary data will be sent upon request (see this article talking about that).

If You Have Stored Images (GridFS)

First you should know that while the max document size in MongoDB is 16 MB, Cloudant has a max document size of 64 MB. As far as I know, it’s not possible to split binary data across multiple documents so if your files are larger than 64 MB, I wouldn’t use Cloudant.

See this page for locally grabbing the images out of MongoDB, as you can’t just use mongoexport like before. You can then send the data up via standalone attachments. It’s probably best that you do this within whatever language you were using GridFS with in the first place. Just loop through the images and send them up.


When to Use What

Here’s a great graphic that isn’t in the docs (hopefully it will be) that outlines the different components of Cloudant that may help you know when to use what. Note that “Cloudant Query (MapReduce)” refers to Cloudant Query when using a JSON index and “Cloudant Query (Text)” refers to Cloudant Query when using a text index.

[Figure: Cloudant_Check_Chart]

Don’t think that just because search index has the most checks that it’s the best tool in general. Remember that performance isn’t displayed here (no one thing could possibly have the best performance in every category and search indexes don’t hold up well with large data sets), and the syntax/intuitiveness is different for each one (Cloudant Query being the most intuitive).
