.net simplified |
Posted: 22 Jan 2022 10:46 PM PST
Hi Friends, it has been a long time since I last blogged, so I thought I would come up with some design topics. In this post, I present my take on SQL vs NoSQL. We will be discussing the agenda below in this blog post.
Background:- In 2006, Google published a paper on Bigtable. This was one of the foundations for NoSQL databases. Then, in 2007, Amazon published its paper on Dynamo. Since then, this space has taken wings and different types of NoSQL databases have kept coming.

Sharding:-
Replication:- In simple terms, suppose one instance is not enough to handle a large number of reads. In that case, multiple copies of the data are kept on different instances, and any of them can serve reads. Consider a post written by a celebrity on social media: it is written once, but it is going to be read millions of times. If the data is replicated to multiple replicas, the post can be read from any available copy. BTW, caching also comes in handy in these situations, but we will see that later. For this post, let's stick to the replication factor.

Replication also minimizes the risk of data loss. If you have only one copy of a dataset and that particular instance goes down, the data is lost. With replication, you are maintaining multiple copies of the data, which also means you are avoiding a Single Point of Failure.

With that said, let's understand the Replication Factor, aka RF. Consider a cluster (a group of servers).
Let's say we have 1000 servers in one cluster holding the user data. Not every node (server) will hold my information. If we have 1000 nodes and RF = 5, my information is replicated to 5 different nodes to avoid a single point of failure. In other words, each row is stored on RF nodes.

Now consider a read request for user id 1, say Read(User_Id_1). At this instant, 5 servers hold this value. Which one responds depends on the routing used by the load balancer: round robin, least connections, weighted least connections, or something else. The idea is that we don't know in advance which server will respond to this request.

The next question is how a request knows which 5 servers hold the data. One way is Consistent Hashing, which we will see in another post. Let's say the hash returns server 1 (S1) as the primary node for the data. The replication logic then picks the primary node plus the next RF - 1 nodes: S1+1 => S2, S1+2 => S3, and so on up to S5. Real algorithms for both consistent hashing and replication are more careful than this and handle the edge cases as well.

Now let's say we have a write request for the same user id, Write(User_Id_1) => "data". Any one of the 5 servers can take the request, update its own copy, and then send the update to the other servers in parallel. Suppose a read request for the same user_id arrives while S2, S3 .. S5 are still being updated. The read may then be served by S3 and return old_data, which does not match the updated data that S1 holds. This scenario is known as an inconsistent state. The system will eventually become consistent, because the parallel (async) background job is already propagating the update.
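To make the placement and the stale read concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: plain modulo hashing stands in for consistent hashing, the first replica always accepts the write, and `replica_nodes`, `Cluster`, and `replicate` are hypothetical names, not part of any real database.

```python
import hashlib


def replica_nodes(key, num_nodes, rf):
    """Return the indices of the rf nodes that store `key`."""
    # Assumption: plain modulo hashing stands in for consistent hashing.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    primary = int.from_bytes(digest[:8], "big") % num_nodes
    # Primary plus the next rf - 1 nodes, wrapping around the end of the cluster.
    return [(primary + i) % num_nodes for i in range(rf)]


class Cluster:
    """Toy in-memory cluster with asynchronous replication (hypothetical)."""

    def __init__(self, num_nodes=1000, rf=5):
        self.num_nodes = num_nodes
        self.rf = rf
        self.stores = [{} for _ in range(num_nodes)]
        self.pending = []  # updates not yet applied to the other replicas

    def write(self, key, value):
        # Simplification: the first replica always accepts the write.
        first, *others = replica_nodes(key, self.num_nodes, self.rf)
        self.stores[first][key] = value
        self.pending.extend((node, key, value) for node in others)

    def replicate(self):
        # Stands in for the async background job that updates the other replicas.
        for node, key, value in self.pending:
            self.stores[node][key] = value
        self.pending.clear()

    def read(self, key, replica):
        # `replica` (0 .. rf - 1) plays the role of the load balancer's choice.
        node = replica_nodes(key, self.num_nodes, self.rf)[replica]
        return self.stores[node].get(key)


cluster = Cluster(num_nodes=1000, rf=5)
cluster.write("user_id_1", "old_data")
cluster.replicate()                             # all 5 replicas hold old_data

cluster.write("user_id_1", "data")              # only the first replica is updated
fresh = cluster.read("user_id_1", replica=0)    # "data"
stale = cluster.read("user_id_1", replica=2)    # "old_data": inconsistent state
cluster.replicate()
settled = cluster.read("user_id_1", replica=2)  # "data": eventually consistent
```

The read that lands on the third replica before `replicate()` runs is exactly the inconsistent state described above. Once the pending updates are applied, every replica agrees again.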
But this kind of system won't be acceptable in many places, say banking systems. Therefore, to make the system consistent, we need to acquire a lock on that particular key, say user_id, until all the servers are updated with the latest value. This also means that if a read request comes for that user_id during the update, it will be rejected with a message such as "Try again after some time". This is the case where we favor consistency over availability (consistency > availability). This kind of replication happens synchronously.

CAP Theorem:- The CAP theorem says that when a network partition occurs, a distributed system can preserve either consistency or availability, but not both. It applies whenever a distributed system comes into the picture. We will also delve into this topic in coming posts, as it is a very big topic in itself. Now, let's look at this formula
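Going by the condition stated at the end of this section, the formula in question is presumably the quorum inequality. Here R is the number of nodes a read goes to, W is the number of nodes a write goes to, and RF is the replication factor:

```latex
R + W > RF
```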
Here, let's say I have updated the data on one server, and the update is stamped with a timestamp to indicate which data is the latest. The other nodes are being updated in parallel. At read time, I read the data from all the servers and sort it by timestamp. This always gives me the latest data, which is what I pick. So this state is also strongly consistent. However, if the one node holding the latest data goes down before the others are updated, it becomes a data loss scenario.

We achieve this strong consistency at the cost of latency. Latency is higher because every read goes to all nodes, compares the timestamps, and only then returns. This kind of system is well suited to cases where reads are far less frequent than writes, say a log aggregator system. For a log aggregator system, you write heavily but read rarely, e.g. Splunk, New Relic, etc. For this kind of system, the values look like W = 1, R = 5, RF = 5. This means I am writing to one node but reading from all of them. This is a write-heavy system. For the read-heavy case, the numbers can look like W = 5, R = 1, RF = 5. This is also a strongly consistent system.

When W = 1, people often think there is a chance of data loss if the node goes down. In practice, the async update to the other nodes typically happens within milliseconds. If you want to avoid even that faint chance of data loss, say that 10 ms window, you can use a higher W. Therefore, Strong Consistency => R + W > RF, say 2 + 5 > 5. The condition R + W > RF is known as a Quorum.
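The timestamp-based read and the R + W > RF rule can be tried out with a small Python sketch. This is a toy model under assumptions of my own: `QuorumStore` is a hypothetical class, a logical counter replaces real timestamps, writes always land on the first W replicas and reads always query the last R replicas (the worst case for overlap), and the async update of the remaining nodes is not modelled.

```python
import itertools


class QuorumStore:
    """Toy replicated store with tunable W and R (hypothetical, in-memory)."""

    def __init__(self, rf, w, r):
        if not (1 <= w <= rf and 1 <= r <= rf):
            raise ValueError("w and r must be between 1 and rf")
        self.rf, self.w, self.r = rf, w, r
        self.replicas = [{} for _ in range(rf)]
        # Assumption: a logical counter stands in for real timestamps.
        self._clock = itertools.count(1)

    def is_quorum(self):
        # Strong consistency condition from the post: R + W > RF.
        return self.r + self.w > self.rf

    def write(self, key, value):
        ts = next(self._clock)
        # Worst case for overlap: writes land on the first w replicas.
        for node in range(self.w):
            self.replicas[node][key] = (ts, value)

    def read(self, key):
        # Worst case for overlap: reads query the last r replicas.
        versions = [
            self.replicas[node][key]
            for node in range(self.rf - self.r, self.rf)
            if key in self.replicas[node]
        ]
        if not versions:
            return None  # none of the replicas we asked has seen the key
        # The version with the latest timestamp wins.
        return max(versions, key=lambda version: version[0])[1]


write_heavy = QuorumStore(rf=5, w=1, r=5)   # W=1, R=5, RF=5
write_heavy.write("user_id_1", "old_data")
write_heavy.write("user_id_1", "data")
latest_wh = write_heavy.read("user_id_1")   # "data"

read_heavy = QuorumStore(rf=5, w=5, r=1)    # W=5, R=1, RF=5
read_heavy.write("user_id_1", "old_data")
read_heavy.write("user_id_1", "data")
latest_rh = read_heavy.read("user_id_1")    # "data"

weak = QuorumStore(rf=5, w=1, r=1)          # R + W <= RF, no quorum
weak.write("user_id_1", "data")
missed = weak.read("user_id_1")             # None: the read missed the write
```

With R + W > RF, the set of nodes written and the set of nodes read must share at least one node, so the read always sees the newest timestamp. In the last configuration the two sets do not overlap, and the read misses the write entirely.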
Now, these are the points we need to take care of before choosing any database.

Criteria for choosing Database:-
With these details, I would like to wrap up this discussion here. We will meet again in some time. Till then, stay tuned and Happy Learning.