Database and Datascience: septembre 2021

MongoDB High Availability Cluster Part 1

By Donatien MBADI OUM, Database Expert

1. Introduction to MongoDB

MongoDB is a document-oriented NoSQL database used for high volume data storage. Instead of using tables and rows as in the traditional relational databases, MongoDB makes use of collections and documents. Documents consist of key-value pairs which are the basic unit of data in MongoDB. Collections contain sets of documents and function which is the equivalent of relational database tables.

This first part consists of presenting an Overview of MongoDB High Availability. The second will be more practice by implementing MongoDB Replication and Sharding

1.1. MongoDB features

Each MongoDB database contains collections which in turn contains documents. Each document can be different with a varying number of fields. The size and content of each document can be different from each other.

The document structure is more in line with how developers construct their classes and objects in their respective programming languages. Developers will often say that their classes are not rows and columns but have a clear structure with key-value pairs.

The rows (or documents as called in MongoDB) doesn’t need to have a schema defined beforehand. Instead, the fields can be created on the fly.

The data model available within MongoDB allows you to represent hierarchical relationships, to store arrays, and other more complex structures more easily.

Scalability – The MongoDB environments are very scalable. Companies across the world have defined clusters with some of them running 100+ nodes with around millions of documents within the database

1.2. Key Components of MongoDB

Below are a few of the common terms used in MongoDB

_id - This is a field required in every MongoDB document. The _id field represents a unique value in the MongoDB document. The _id field is like the document’s primary key. If you create a new document without an _id field, MongoDB will automatically create the field. So, for example, if we see the example of the above customer table, Mongo DB will add a 24-digit unique identifier to each document in the collection.

_Id	EmployeeID	EmployeeName	DepartmentID
563479cc8a8a4246bd27d784	120	Mbadi Donatien	50
563479cc7a8a4246bd47d784	124	Mandela Georges	60
563479cc9a8a4246bd57d784	70	Azangia Uche	50

Collection - This is a grouping of MongoDB documents. A collection is the equivalent of a table which is created in any other RDMS such as Oracle or MS SQL. A collection exists within a single database. As seen from the introduction collections don’t enforce any sort of structure.
Cursor - This is a pointer to the result set of a query. Clients can iterate through a cursor to retrieve results.
Database - This is a container for collections like in RDMS wherein it is a container for tables. Each database gets its own set of files on the file system. A MongoDB server can store multiple databases.
Document - A record in a MongoDB collection is basically called a document. The document, in turn, will consist of field name and values.
Field - A name-value pair in a document. A document has zero or more fields. Fields are analogous to columns in relational databases.

JSON – This is known as JavaScript Object Notation. This is a human-readable, plain text format for expressing structured data. JSON is currently supported in many programming languages.

Just a quick note on the key difference between the _id field and a normal collection field. The _id field is used to uniquely identify the documents in a collection and is automatically added by MongoDB when the collection is created.

2. Why Use MongoDB?

Below are the few of the reasons as to why one should start using MongoDB

Document-oriented – Since MongoDB is a NoSQL type database, instead of having data in a relational type format, it stores the data in documents. This makes MongoDB very flexible and adaptable to real business world situation and requirements.
Ad hoc queries – MongoDB supports searching by field, range queries, and regular expression searches. Queries can be made to return specific fields within documents.
Indexing – Indexes can be created to improve the performance of searches within MongoDB. Any field in a MongoDB document can be indexed.
Replication – MongoDB can provide high availability with replica sets. A replica set consists of two or more mongo DB instances. Each replica set member may act in the role of the primary or secondary replica at any time. The primary replica is the main server which interacts with the client and performs all the read/write operations. The Secondary replicas maintain a copy of the data of the primary using built-in replication. When a primary replica fails, the replica set automatically switches over to the secondary and then it becomes the primary server.
Load balancing – MongoDB uses the concept of sharding to scale horizontally by splitting data across multiple MongoDB instances. MongoDB can run over multiple servers, balancing the load and/or duplicating data to keep the system up and running in case of hardware failure.

3. Difference between MongoDB and Traditional RDBMS

Below are some of the key term differences between MongoDB and RDBMS

RDBMS	MongoDB	Difference
Table	Collection	In RDBMS, the table contains the columns and rows which are used to store the data whereas, in MongoDB, this same structure is known as a collection. The collection contains documents which in turn contains Fields, which in turn are key-value pairs.
Row	Document	In RDBMS, the row represents a single, implicitly structured data item in a table. In MongoDB, the data is stored in documents.
Column	Field	In RDBMS, the column denotes a set of data values. These in MongoDB are known as Fields.
Joins	Embedded documents	In RDBMS, data is sometimes spread across various tables and in order to show a complete view of all data, a join is sometimes formed across tables to get the data. In MongoDB, the data is normally stored in a single collection, but separated by using Embedded documents. So, there is no concept of joins in MongoDB.

Apart from the terms differences, a few other differences are shown below

Relational databases are known for enforcing data integrity. This is not an explicit requirement in MongoDB.
RDBMS requires that data be normalized first so that it can prevent orphan records and duplicates Normalizing data then has the requirement of more tables, which will then result in more table joins, thus requiring more keys and indexes.

As databases start to grow, performance can start becoming an issue. Again, this is not an explicit requirement in MongoDB. MongoDB is flexible and does not need the data to be normalized first.

4. What is MongoDB Replication

4.1. Overview of MongoDB Replication

Replication exists primarily to offer data redundancy and high availability. We maintain the durability of data by keeping multiple copies or replicas of that data on physically isolated servers. That’s replication: the process of creating redundant data to streamline and safeguard data availability and durability.

Replication allows you to increase data availability by creating multiple copies of your data across servers. This is especially useful if a server crashes or if you experience service interruptions or hardware failure.

In simple terms, MongoDB replication is the process of creating a copy of the same data set in more than one MongoDB server. This can be achieved by using a Replica Set. A replica set is a group of MongoDB instances that maintain the same data set and pertain to any mongod process.

Replication enables database administrators to provide:

- Data redundancy

- High availability of data

Maintaining multiple MongoDB servers with the same data provides distributed access to the data while increasing the fault tolerance of the database by providing backups.

Additionally, replication can also be used as a part of load balancing, where read and write operations can be distributed across all the instances depending on the use case.

If your data only resides in a single database, any of these events would make accessing the data impossible. But thanks to replication, your applications can stay online in case of database server failure, while also providing disaster recovery and backup options.

With MongoDB, replication is achieved through a replica set. Writer operations are sent to the primary server (node), which applies the operations across secondary servers, replicating the data.

Key features of Replication:

Scalability: As the data volume increases the complexity of accessing data and working with data also increases. With replication in place, multiple data copies are available, allowing users to not only increase their data reserves but also recover any previous version in case of any errors or failures.
Performance: When data is available across multiple machines and servers, it not only makes accessing data easier but also makes recovering from unexpected and sudden failures much easier. replication ensures data availability and security at all times.
Availability: With replication in place, there’s no need to worry about data failures. In situations where your primary source of data fails, you can easily access the same up-to-date data from a secondary reserve. This highly promotes data availability.

4.2. How Replication works

The primary node receives all write operations. A replica set can have only one primary capable of confirming writes with { w: "majority" } write concern; although in some circumstances, another mongod instance may transiently believe itself to also be primary. The primary records all changes to its data sets in its operation log, i.e. oplog.

The secondaries replicate the primary's oplog and apply the operations to their data sets such that the secondaries' data sets reflect the primary's data set. If the primary is unavailable, an eligible secondary will hold an election to elect itself the new primary

If the primary server fails (through a crash or system failure), one of the secondary servers takes over and becomes the new primary node via election. If that server comes back online, it becomes a secondary once it fully recovers, aiding the new primary node.

In some circumstances (such as you have a primary and a secondary but cost constraints prohibit adding another secondary), you may choose to add a mongod instance to a replica set as an arbiter to vote in elections.

Arbiters are mongod instances that are part of a replica set but do not hold data (i.e. do not provide data redundancy). They can, however, participate in elections.

4.3. Automatic Failover

When a primary does not communicate with the other members of the set for more than the configured period (10 seconds by default), an eligible secondary call for an election to nominate itself as the new primary. The cluster attempts to complete the election of a new primary and resume normal operations.

Note: Arbiters have minimal resource requirements and do not require dedicated hardware. You can deploy an arbiter on an application server or a monitoring host.

The replica set cannot process write operations until the election completes successfully. The replica set can continue to serve read queries if such queries are configured to run on secondaries while the primary is offline.

4.4. Read Preference

By default, clients read from the primary; however, clients can specify a read preference to send read operations to secondaries.

5. What is MongoDB Sharding

5.1. Overview of MongoDB Sharding

Database systems with large data sets or high throughput applications can challenge the capacity of a single server. For example, high query rates can exhaust the CPU capacity of the server. Working set sizes larger than the system's RAM stress the I/O capacity of disk drives.

There are two methods for addressing system growth: vertical and horizontal scaling.

Vertical scaling

Vertical scaling is the traditional way of increasing the hardware capabilities of a single server. The process involves upgrading the CPU, RAM, and storage capacity. However, upgrading a single server is often challenged by technological limitations and cost constraints.

Horizontal scaling

This method divides the dataset into multiple servers and distributes the database load among each server instance. Distributing the load reduces the strain on the required hardware resources and provides redundancy in case of a failure.

However, horizontal scaling increases the complexity of underlying architecture. MongoDB supports horizontal scaling through sharding—one of its major benefits, as we’ll see below.

MongoDB scales immensely using a technique known as Sharding to handle enormous volumes of data. MongoDB handles the data storage requirements using the concept of Sharding, which includes distributing data and storing it across various machines. Sharding allows MongoDB to scale horizontally and handle the read-write load easily.

Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.

5.2. MongoDB Sharding Basics

MongoDB sharding works by creating a cluster of MongoDB instances consisting of at least three servers. That means sharded clusters consist of three main components:

- The shard

- Mongos

- Config servers

Shard

A shard is a single MongoDB instance that holds a subset of the sharded data. Shards can be deployed as replica sets to increase availability and provide redundancy. The combination of multiple shards creates a complete data set. For example, a 2 TB data set can be broken down into four shards, each containing 500 GB of data from the original data set.

Mongos

Mongos act as the query router providing a stable interface between the application and the sharded cluster. This MongoDB instance is responsible for routing the client requests to the correct shard.

Config Servers

Configuration servers store the metadata and the configuration settings for the whole cluster.

5.3. Difference between Sharding and Replication in MongoDB

Replication refers to the practice of copying data from the primary server node to secondary server nodes. It increases data availability and promotes backup; in case your primary server fails. It copies data on every server.

Sharding refers to the process of handling horizontal scaling across various servers using a shared key. You will copy data holistically by sharding copies of pieces of data across various replica sets. All these replica sets work together to utilize all the data.

When sharding and replication work together, they are referred to as a shared cluster. Each shard is replicated to preserve the same data availability.

5.4. Advantages of Sharding

Reads / Writes

MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process a subset of cluster operations. Both read and write workloads can be scaled horizontally across the cluster by adding more shards.

For queries that include the shard key or the prefix of a compound shard key, mongos can target the query at a specific shard or set of shards. These targeted operations are generally more efficient than broadcasting to every shard in the cluster.

Storage Capacity

Sharding distributes data across the shards in the cluster, allowing each shard to contain a subset of the total cluster data. As the data set grows, additional shards increase the storage capacity of the cluster.

High Availability

The deployment of config servers and shards as replica sets provide increased availability.

Even if one or more shard replica sets become completely unavailable, the sharded cluster can continue to perform partial reads and writes. That is, while data on the unavailable shard(s) cannot be accessed, reads or writes directed at the available shards can still succeed.

Considerations Before Sharding

Sharded cluster infrastructure requirements and complexity require careful planning, execution, and maintenance.

Once a collection has been sharded, MongoDB provides no method to unshard a sharded collection.

5.5. Sharded and Non-Sharded Collections

A database can have a mixture of sharded and unsharded collections. Sharded collections are partitioned and distributed across the shards in the cluster. Unsharded collections are stored on a primary shard. Each database has its own primary shard.

5.6. Connecting to a Sharded Cluster

You must connect to a mongos router to interact with any collection in the sharded cluster. This includes sharded and unsharded collections. Clients should never connect to a single shard in order to perform read or write operations.

Database and Datascience

dimanche 12 septembre 2021

High Availability Clustering Using MongoDB