MongoDB High Availability Cluster Part 1
By
Donatien MBADI OUM, Database Expert
1. Introduction to MongoDB
MongoDB is a document-oriented NoSQL database used for high volume data
storage. Instead of using tables and rows as in the traditional relational
databases, MongoDB makes use of collections and documents. Documents consist of
key-value pairs which are the basic unit of data in MongoDB. Collections
contain sets of documents and function which is the equivalent of relational
database tables.
This first part consists of presenting an Overview of MongoDB High
Availability. The second will be more practice by implementing MongoDB
Replication and Sharding
1.1.
MongoDB features
Each MongoDB database contains
collections which in turn contains documents. Each document can be different
with a varying number of fields. The size and content of each document can be
different from each other.
The document structure is more in line
with how developers construct their classes and objects in their respective
programming languages. Developers will often say that their classes are not
rows and columns but have a clear structure with key-value pairs.
The rows (or documents as called in
MongoDB) doesn’t need to have a schema defined beforehand. Instead, the fields
can be created on the fly.
The data model available within MongoDB
allows you to represent hierarchical relationships, to store arrays, and other
more complex structures more easily.
Scalability – The MongoDB environments
are very scalable. Companies across the world have defined clusters with some
of them running 100+ nodes with around millions of documents within the
database
1.2.
Key Components of MongoDB
Below are a few of the common terms used
in MongoDB
- _id - This is a field required in every MongoDB document. The _id field
represents a unique value in the MongoDB document. The _id field is like
the document’s primary key. If you create a new document without an _id
field, MongoDB will automatically create the field. So, for example, if we
see the example of the above customer table, Mongo DB will add a 24-digit
unique identifier to each document in the collection.
_Id |
EmployeeID |
EmployeeName |
DepartmentID |
563479cc8a8a4246bd27d784 |
120 |
Mbadi Donatien |
50 |
563479cc7a8a4246bd47d784 |
124 |
Mandela Georges |
60 |
563479cc9a8a4246bd57d784 |
70 |
Azangia Uche |
50 |
- Collection - This is a grouping of MongoDB documents.
A collection is the equivalent of a table which is created in any other
RDMS such as Oracle or MS SQL. A collection exists within a single
database. As seen from the introduction collections don’t enforce any sort
of structure.
- Cursor - This is a pointer to the result set of a
query. Clients can iterate through a cursor to retrieve results.
- Database - This is a container for collections like
in RDMS wherein it is a container for tables. Each database gets its own
set of files on the file system. A MongoDB server can store multiple
databases.
- Document - A record in a MongoDB collection is
basically called a document. The document, in turn, will consist of field
name and values.
- Field - A name-value pair in a document. A
document has zero or more fields. Fields are analogous to columns in
relational databases.
- JSON – This is known as JavaScript Object Notation. This is a human-readable, plain
text format for expressing structured data. JSON is currently supported in
many programming languages.
Just a quick note on the key difference between the _id field and a
normal collection field. The _id field is used to uniquely identify the
documents in a collection and is automatically added by MongoDB when the
collection is created.
2. Why Use MongoDB?
Below are the few of the reasons as to why one should start using
MongoDB
- Document-oriented – Since MongoDB is a NoSQL type database, instead of
having data in a relational type format, it stores the data in documents.
This makes MongoDB very flexible and adaptable to real business world
situation and requirements.
- Ad hoc queries – MongoDB supports searching by field, range queries,
and regular expression searches. Queries can be made to return specific
fields within documents.
- Indexing – Indexes can be created to improve the performance of
searches within MongoDB. Any field in a MongoDB document can be indexed.
- Replication – MongoDB can provide high availability with replica sets.
A replica set consists of two or more mongo DB instances. Each replica set
member may act in the role of the primary or secondary replica at any
time. The primary replica is the main server which interacts with the
client and performs all the read/write operations. The Secondary replicas
maintain a copy of the data of the primary using built-in replication.
When a primary replica fails, the replica set automatically switches over
to the secondary and then it becomes the primary server.
- Load balancing – MongoDB uses the concept of sharding to scale
horizontally by splitting data across multiple MongoDB instances. MongoDB
can run over multiple servers, balancing the load and/or duplicating data
to keep the system up and running in case of hardware failure.
3. Difference between
MongoDB and Traditional RDBMS
Below are some of the key term differences between MongoDB and RDBMS
RDBMS |
MongoDB |
Difference |
Table |
Collection |
In RDBMS, the table contains the
columns and rows which are used to store the data whereas, in MongoDB, this
same structure is known as a collection. The collection contains documents
which in turn contains Fields, which in turn are key-value pairs. |
Row |
Document |
In RDBMS, the row represents a single,
implicitly structured data item in a table. In MongoDB, the data is stored in
documents. |
Column |
Field |
In RDBMS, the column denotes a set of
data values. These in MongoDB are known as Fields. |
Joins |
Embedded documents |
In RDBMS, data is sometimes spread
across various tables and in order to show a complete view of all data, a
join is sometimes formed across tables to get the data. In MongoDB, the data
is normally stored in a single collection, but separated by using Embedded
documents. So, there is no concept of joins in MongoDB. |
Apart from the terms differences, a few other differences are shown
below
- Relational databases are known for enforcing data integrity. This is
not an explicit requirement in MongoDB.
- RDBMS requires that data be normalized first so that it can prevent orphan records and duplicates
Normalizing data then has the requirement of more tables, which will then
result in more table joins, thus requiring more keys and indexes.
As databases start to grow, performance can start becoming an issue. Again,
this is not an explicit requirement in MongoDB. MongoDB is flexible and does
not need the data to be normalized first.
4. What is MongoDB
Replication
4.1.
Overview of MongoDB Replication
Replication exists primarily to offer
data redundancy and high availability. We maintain the durability of data by
keeping multiple copies or replicas of that data on physically isolated
servers. That’s replication: the process of creating redundant data to
streamline and safeguard data availability and durability.
Replication allows you to increase data availability
by creating multiple copies of your data across servers. This is especially
useful if a server crashes or if you experience service interruptions or
hardware failure.
In simple terms, MongoDB replication is
the process of creating a copy of the same data set in more than one MongoDB
server. This can be achieved by using a Replica Set. A replica set is a group
of MongoDB instances that maintain the same data set and pertain to any mongod
process.
Replication enables database
administrators to provide:
-
Data redundancy
-
High availability of data
Maintaining multiple MongoDB servers
with the same data provides distributed access to the data while increasing the
fault tolerance of the database by providing backups.
Additionally, replication can also be used as a part
of load balancing, where read and write operations
can be distributed across all the instances depending on the use case.
If your data only resides in a single
database, any of these events would make accessing the data impossible. But
thanks to replication, your applications can stay online in case of database
server failure, while also providing disaster recovery and backup options.
With MongoDB, replication is achieved through a
replica set. Writer operations are sent to the primary server (node), which
applies the operations across secondary servers, replicating the data.
Key features of Replication:
- Scalability: As the data volume increases the
complexity of accessing data and working with data also increases. With
replication in place, multiple data copies are available, allowing users
to not only increase their data reserves but also recover any previous
version in case of any errors or failures.
- Performance: When data is available across multiple
machines and servers, it not only makes accessing data easier but also
makes recovering from unexpected and sudden failures much easier.
replication ensures data availability and security at all times.
- Availability: With replication in place, there’s no need
to worry about data failures. In situations where your primary source of
data fails, you can easily access the same up-to-date data from a
secondary reserve. This highly promotes data availability.
4.2.
How Replication works
The primary node receives all write
operations. A replica set can have only one primary capable of confirming
writes with { w:
"majority" } write concern; although in some
circumstances, another mongod instance may transiently believe itself to also
be primary. The primary records all changes to its data sets in its operation
log, i.e. oplog.
The secondaries replicate the primary's oplog and apply the
operations to their data sets such that the secondaries' data sets reflect the
primary's data set. If the primary is unavailable, an eligible secondary will
hold an election to elect itself the new primary
If the primary server fails (through a
crash or system failure), one of the secondary servers takes over and becomes
the new primary node via election. If that server comes back online, it becomes
a secondary once it fully recovers, aiding the new primary node.
In some circumstances (such as you have
a primary and a secondary but cost constraints prohibit adding another
secondary), you may choose to add a mongod instance to a replica set as an arbiter to vote in elections.
Arbiters are mongod instances that are part of a replica set but do not hold data (i.e. do not provide data
redundancy). They can, however, participate in elections.
4.3.
Automatic Failover
When a primary does not communicate with
the other members of the set for more than the configured period (10
seconds by default), an eligible secondary call for an election to nominate
itself as the new primary. The cluster attempts to complete the election of a
new primary and resume normal operations.
Note: Arbiters have minimal resource requirements and do
not require dedicated hardware. You can deploy an arbiter on an application
server or a monitoring host.
The replica set cannot process write
operations until the election completes successfully. The replica set can
continue to serve read queries if such queries are configured to run on secondaries while the primary is offline.
4.4.
Read Preference
By default, clients read from the primary; however, clients can specify a read preference to send read operations to secondaries.
5. What is MongoDB Sharding
5.1.
Overview of MongoDB Sharding
Database systems with large data sets or high throughput applications can
challenge the capacity of a single server. For example, high query rates can
exhaust the CPU capacity of the server. Working set sizes larger than the
system's RAM stress the I/O capacity of disk drives.
There are two methods for addressing system growth: vertical and horizontal
scaling.
Vertical scaling
Vertical scaling is the traditional way of increasing the hardware
capabilities of a single server. The process involves upgrading the CPU, RAM,
and storage capacity. However, upgrading a single server is often challenged by
technological limitations and cost constraints.
Horizontal scaling
This method
divides the dataset into multiple servers and distributes the database load
among each server instance. Distributing the load reduces the strain on the
required hardware resources and provides redundancy in case of a failure.
However,
horizontal scaling increases the complexity of underlying architecture. MongoDB
supports horizontal scaling through sharding—one of its major benefits, as
we’ll see below.
MongoDB scales immensely using a technique known as Sharding to handle
enormous volumes of data. MongoDB handles the data storage requirements using
the concept of Sharding, which includes distributing data and storing it across
various machines. Sharding allows MongoDB to scale horizontally and handle the
read-write load easily.
Sharding is a method for distributing data across multiple
machines. MongoDB uses sharding to support deployments with very large data
sets and high throughput operations.
5.2.
MongoDB Sharding Basics
MongoDB sharding works by creating a cluster of MongoDB instances
consisting of at least three servers. That means sharded clusters consist of
three main components:
-
The shard
-
Mongos
-
Config servers
Shard
A shard is a
single MongoDB instance that holds a subset of the sharded data. Shards can be
deployed as replica sets to increase availability and provide
redundancy. The combination of multiple shards creates a complete
data set. For example, a 2 TB data set can be broken down into four shards,
each containing 500 GB of data from the original data set.
Mongos
Mongos act as the
query router providing a stable interface between the application and the
sharded cluster. This MongoDB instance is responsible for routing the client
requests to the correct shard.
Config Servers
Configuration
servers store the metadata and the configuration settings for the whole
cluster.
5.3.
Difference between Sharding and
Replication in MongoDB
Replication refers to the practice of copying data
from the primary server node to secondary server nodes. It increases data
availability and promotes backup; in case your primary server fails. It copies
data on every server.
Sharding refers to the process of handling horizontal scaling across various
servers using a shared key. You will copy data holistically by sharding copies
of pieces of data across various replica sets. All these replica sets work
together to utilize all the data.
When sharding and replication work
together, they are referred to as a shared cluster. Each shard is replicated to
preserve the same data availability.
5.4.
Advantages of Sharding
Reads
/ Writes
MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process
a subset of cluster operations. Both read and write workloads can be scaled
horizontally across the cluster by adding more shards.
For queries that include the shard key or the prefix of a compound shard key, mongos can
target the query at a specific shard or set of shards. These targeted operations are generally more
efficient than broadcasting to every shard in the
cluster.
Storage Capacity
Sharding distributes data across the shards in the cluster, allowing each shard
to contain a subset of the total cluster data. As the data set grows,
additional shards increase the storage capacity of the cluster.
High
Availability
The deployment of config servers and shards as replica sets provide
increased availability.
Even if one or more shard replica sets become completely unavailable,
the sharded cluster can continue to perform partial reads and writes. That is,
while data on the unavailable shard(s) cannot be accessed, reads or writes
directed at the available shards can still succeed.
Considerations
Before Sharding
Sharded cluster infrastructure requirements and complexity require
careful planning, execution, and maintenance.
Once a collection has been sharded, MongoDB provides no method to
unshard a sharded collection.
5.5.
Sharded and Non-Sharded Collections
A database can have a mixture of sharded and unsharded
collections. Sharded collections are partitioned and distributed across the shards in the cluster. Unsharded
collections are stored on a primary shard. Each database has its own
primary shard.
5.6.
Connecting to a Sharded Cluster
You must connect to a mongos router to interact with any
collection in the sharded cluster. This includes sharded and unsharded
collections. Clients should never connect to a single shard in
order to perform read or write operations.
Aucun commentaire:
Enregistrer un commentaire