Using Apache Ignite to Prevent Duplicate REST Requests


I faced a decidedly non-unique problem: my API kept receiving duplicate requests in rapid succession from time to time. It was one of those "it works on my machine" issues that I only saw in a production environment, and only because of some misbehaving clients. I know APIs are supposed to be stateless and idempotent and all those buzzwords... I had all that and more, yet the duplicates kept coming in, creating the need for cleanup scripts and causing subtle bugs downstream in a large enterprise system.

In his famous book Effective Java, Joshua Bloch has a great quote: "You must program defensively, with the assumption that clients of your class will do their best to destroy its invariants". The advice is geared towards making defensive copies when writing methods in Java, but I believe it applies perfectly to my REST API duplicates problem.

A little background is in order; I can already feel some readers' blood boiling. This API has a simple-ish architecture. On the backend there is a database serving as the persistence layer, a middleware layer with business rules, validations, etc., and a distributed network of services consuming the data. The main SQL database is the source of truth, and the other systems maintain their own business rules and persistence. Data flows from consumer apps into the API, where it's massaged, enhanced, and analyzed, and then it's routed to the appropriate, let's call them "partners", which use this data for some purpose that brings in some kind of revenue. The diagram below shows the high-level data flow of this system.

[Figure: high-level data flow diagram]

Let me add a few constraints I had to work with:

  1. This is a transactional system, so each API write must be done in a transaction. However, the partner APIs don't support ACID transactions: no commits, no rollbacks. The database is the last step in the transaction and obviously does support commits and rollbacks.
  2. The database must remain the final step in the transaction because it requires data returned from the partner APIs, for example confirmation numbers, IDs, etc.
  3. The system is over 20 years old and cannot realistically be refactored to any significant degree.
  4. The system is not event driven.

The API is load-balanced across many instances, and each instance is stateless, so no sessions of any kind. Essentially, each instance knows nothing of the other instances and processes each request in isolation. The state is kept in the database (not to mention the partner APIs). Maybe you're starting to see why duplicates are a problem in such an architecture. As each request is processed, it flows through to the partner APIs, and the distributed transaction is only completed once all the data is available for persistence. If a duplicate request (we'll discuss what makes a duplicate request in a bit) is persisted, it only becomes evident that it's a duplicate once the database uniqueness constraints are violated. However, by this time we have already communicated with the partner APIs. Remember the constraints: transactions aren't natively supported by the partner APIs, so no rollback is possible even if the database transaction is rolled back.

Of course, we could implement a pseudo-transaction and attempt a pseudo-rollback against the partner APIs by issuing some kind of DELETE calls. Assume for a minute that this is prone to failure, because I don't have control over the partner APIs and some of them can be flaky. A good rule of thumb is to only trust systems you have control over, and to design a system to gracefully deal with untrusted components.

The architecture could have been designed much better: move the database ahead of the partner APIs, write a record in a pending state, update the record once the partner API call succeeds. The whole thing could be event driven and reactive (I love buzzwords). The point is, we could redesign it to be more modern and we wouldn't have this duplicates problem. However, a redesign is not an option for a multitude of reasons: budget, risk, lack of resources... I could keep going. Prior to the solution described below, duplicate requests were dealt with through a set of cleanup scripts, echo-backs, and cancellation queues. Each partner system echoes the created resource back to another service, which checks for the existence of the record in the database and takes the appropriate action to deal with inconsistencies. This, however, is not the best solution; it introduces latency, inconsistent state between systems, and many opportunities for things to go wrong and require manual intervention.

Solution: we must detect and prevent duplicates before they happen.

One approach comes to mind: keep some kind of short-lived shared state in the API which can be used to detect a duplicate request before it gets through to the lower layers. This would solve my problem within the stated constraints. However, the solution should introduce as little overhead as possible to keep latency low. It must be safe, so legitimate requests aren't discarded as duplicates. It must not introduce too much complexity, as this is already a fragile system. Finally, I don't have a lot of time to solve this issue; it's quite important because it's affecting customer experience and taking up precious company resources.

Did I paint the picture of a real-world problem well? IT, please fix the problem: do it fast, do it well, and don't spend any money doing it. Aside from the details of this specific problem, software engineers have to deal with such constraints every day, and we do it with a smile. OK, maybe not always with a smile, but that's what we're paid for.

Thoughts?

We can take many approaches to implement this shared state, but not all of them fit all the constraints. Thinking out loud: I could use a SQL table stored in memory to keep a cache of recent requests and take advantage of the SQL engine to de-dupe each request. I could also spin up a Redis cluster, instead of a SQL table, to similar effect. Maybe a different in-memory key/value store cluster with low-latency reads and writes. Possibly take advantage of tools like ZooKeeper, Etcd or Consul. Those are all viable options and would definitely work, but let's break them down against some of the constraints I mentioned above.

So complexity, what is it?

I define complexity like this: I don't want to add any infrastructure. New infrastructure must be managed, maintained, updated, monitored, and logged (I'm already getting a headache). So Redis, ZooKeeper, Etcd and Consul are out of the picture, and similarly all key/value stores that require their own clusters. I simply don't have the resources to maintain the extra infrastructure.

What about the in-memory SQL table? Well, I already have an MSSQL database, and it supports in-memory tables (as most database engines do). Great, so let's do that, right? But wait, I said the state needs to be short-lived, meaning I want the request data to "expire" fast. Well, then just add an indexed timestamp to each record and only query the records where current time minus timestamp is less than some threshold. That works, but I need a cleanup script to remove stale records so the table doesn't grow too large. Additionally, this would put CPU load on the database, which is at the heart of the business, handling hundreds of thousands of queries. The DBAs won't like that. Someone with great SQL database expertise will say: a well-designed and well-implemented database should be able to handle the load. True, but should it? I mean, is that really its job? A SQL database's purpose is to manage OLTP transactions and persist business data, not to handle hundreds of thousands of miscellaneous computations (don't get me started on all the business logic our SQL database manages in the form of stored procs). Fine, just spin up an isolated cluster... err, we're back to the complexity constraint.

OK, what else? Couldn't a load balancer do some magic for me and solve the problem upstream? Maybe, but there is logic required to detect a duplicate request based on a set of fields from the request JSON body. Also, it's a load balancer; it's not meant to hold application logic.

There is, however, another solution that meets all the constraints and solves the original problem: an in-memory data grid!

IMDGs to the rescue!

What's an IMDG, you ask? It's not a new concept; they've been around for a while and have quite a few use cases. I admit, solving duplicate requests probably wasn't the original purpose of in-memory data grids. As a matter of fact, I can confidently say that solving duplicate requests is NOT the purpose of an IMDG, but that's the beauty of their versatility. Generally speaking, an IMDG is a cluster of connected nodes sharing a pool of memory and compute resources. The data is usually kept only in memory, and the grid provides mechanisms for very fast atomic writes and near-real-time reads. All the distributed computing concepts apply here: consistency, network partitioning, latency (I told you I love buzzwords). Imagine being able to spin up a cluster of nodes, each with a bit of available memory, write data to any node, and let the cluster manage replication so every other node has access to the same data nearly in real time. This way, your stateless API instances can communicate with each other through the cluster, sharing data and updating it atomically to control access.

Some of the use cases include distributed session management, caching, sharing key/value pairs, executing code closest to the data (aka affinity colocation), and so on. There are many widely used IMDG implementations, which I won't list in full, but a few of the better-known ones deserve a mention: Memcached, Hazelcast and Apache Ignite are very popular and widely used throughout the industry. Hazelcast is one that I'm particularly fond of because it works well for Java applications and uses the standard Java Collections APIs. I've used Hazelcast as a caching layer for JPA and for distributed session management through Spring Integration.

Most IMDG implementations come with all kinds of configuration options that control how data is written and replicated. For example, you may want to write data to just one node and return as fast as possible, or you may want to define a quorum which must first agree on the data state. On the other hand, you may want full consistency (with slower performance) if your application requires it. If you opt for returning immediately after one node writes the data, there are decisions to make about how many nodes should keep a copy of the data and how the replication should be achieved. For example, you may choose to store 3 copies of the data (on 3 different nodes), or you may choose to keep a copy on every node in the cluster. Replication can also be done asynchronously or synchronously. There are many options, and you really need to pick the right ones for your specific application requirements. However, the flexibility is nice and allows IMDGs to be a great solution in all kinds of systems.
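To make this concrete, here is a rough sketch of what two of those trade-offs look like in Ignite.NET's cache configuration; the cache names and backup counts below are made up for illustration, not taken from my production setup.

```csharp
using Apache.Ignite.Core.Cache.Configuration;

// One primary copy plus two backups; the write returns once the primary
// has the value, and the backups catch up in the background.
var fastCache = new CacheConfiguration("requests-fast")
{
    CacheMode = CacheMode.Partitioned,
    Backups = 2,
    WriteSynchronizationMode = CacheWriteSynchronizationMode.PrimarySync
};

// A copy on every node, and the write waits for all of them to acknowledge
// -- fully consistent, but slower.
var consistentCache = new CacheConfiguration("requests-consistent")
{
    CacheMode = CacheMode.Replicated,
    WriteSynchronizationMode = CacheWriteSynchronizationMode.FullSync
};
```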

As I mentioned, many IMDG implementations also provide mechanisms for executing code on the cluster, and come with configuration options for this as well. This is an interesting domain that I have not explored fully, but I can see benefits for data analytics and algorithms such as map/reduce (and the like). For the problem at hand I'm not interested in this functionality, so I'll leave it for future research.

Getting back to duplicate requests. One of the capabilities provided by IMDGs is atomic writes, since this is important for consistency in the cluster. Atomicity is achieved through some kind of distributed lock; as we all know, you need to lock a shared resource. In the case of IMDGs you're locking the specific key of the data being updated, so that no two nodes of the cluster can update the data simultaneously. The distributed locks can be implemented in many different ways, but the existing framework for remote code execution can definitely be leveraged to achieve this. This is where it gets interesting: if you think about my duplicate request problem, it's really a race condition on a shared resource, so being able to lock a resource across the entire application cluster for a short period of time could definitely be useful.

I give you Apache Ignite

OK, I'm not giving it to you, it's not mine to give, I'm just talking about it... so let's talk about it. To make a long story short, I selected the Apache Ignite IMDG for my solution. There are many reasons why; just to name a few:

[Figure: Apache Ignite clustering diagram]
  1. It's a mature project 
  2. It has a large community
  3. It provides both memory and compute clustering
  4. It can be embedded into an existing application
  5. It works with both Java and .NET
  6. It exposes distributed locks as a top-level API
  7. It's configurable 
  8. It's easily extendable

I had originally stated that I did not want to maintain new infrastructure, and then I started talking about IMDG clusters. Most of you probably immediately asked: how does that meet the requirement? Apache Ignite (and other IMDGs) lets you embed the data grid node into your existing application. Since I already have an application cluster which I'm managing, there is no additional cost for maintaining separate infrastructure; the same application cluster becomes the data grid cluster. Of course there are additional monitoring and logging costs, but they are much lower than with a standalone cluster. The flexibility of Ignite also allows for future expansion into a standalone cluster: if I decide to use the grid for much more than the duplicate request problem and there is a need to expand the available resources beyond the app cluster, the switch is painless.
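A minimal sketch of what that embedding looks like with Ignite.NET (default configuration, just to show that the node lives inside the existing process):

```csharp
using Apache.Ignite.Core;

// Start an Ignite node inside the current .NET process. No separate
// service or cluster to deploy -- the application instance itself becomes
// a member of the data grid.
IIgnite ignite = Ignition.Start(new IgniteConfiguration());

// ... the application keeps serving requests as usual ...

// Leave the cluster gracefully when the application shuts down.
Ignition.StopAll(cancel: false);
```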

One of the main issues with selecting an appropriate IMDG implementation was interoperability with .NET. For example, Hazelcast does provide .NET integration, but only as a client: you can connect to the Hazelcast cluster and read and write data, but the .NET host cannot contribute memory or compute resources to the grid. That's not good enough. There are other native .NET IMDGs; however, they are proprietary, closed source, require expensive licenses, and are not very well documented. Apache Ignite was the only solution which provided native .NET integration with full functionality. Now, if you know Apache Ignite, then you know it's written in Java and runs in the JVM. So what is all this .NET crazy talk about? Enter Ignite.NET: a wrapper around the Java Apache Ignite that integrates into the .NET ecosystem through JNI. What actually happens under the hood when a .NET app creates a new Ignite node is that a JVM instance is started with the full Java Ignite node. A JNI bridge is created between the .NET app and the JVM, and voilà, you have a working .NET Apache Ignite node. All communication with the grid is done via JNI, and a native .NET interface is exposed to your application. The application never even knows it's doing fancy stuff behind the scenes. Of course, as the engineer you need to be aware of this architecture to keep it running without issues, monitor JVM memory and CPU resources, and all that jazz. I'll just say that Ignite.NET is an impressively well-written piece of code, so it makes things easy. Skipping ahead: I've been running this setup in a production environment for months without any issues.
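Since the embedded JVM has its own memory and GC behavior, it pays to size it explicitly from the .NET side. A hedged sketch; the numbers below are placeholders, not recommendations:

```csharp
using Apache.Ignite.Core;

var cfg = new IgniteConfiguration
{
    // Initial and maximum heap for the embedded JVM (-Xms / -Xmx).
    JvmInitialMemoryMb = 512,
    JvmMaxMemoryMb = 1024,

    // Any additional JVM flags you want the Ignite JVM to run with.
    JvmOptions = new[] { "-XX:+UseG1GC" }
};

IIgnite ignite = Ignition.Start(cfg);
```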

The architecture

The architecture of my solution is simple. Each of my application instances hosts/starts a new Apache Ignite node, which in turn discovers the cluster through the existing service discovery mechanism. Any distributed, service-oriented architecture must employ some sort of service discovery. I'm working with an in-house implementation of such a system, powered by a central service registry based on polling and heartbeat announcements. It's not as fully featured as some open source alternatives such as Netflix's Eureka, Lyft's Envoy, or Kubernetes' built-in service bindings, but it works quite well. Maybe one day it will become an open source project of its own. I'm digressing here; back to Ignite. Each application instance registers with the discovery service and can be queried for by any other application instance. Once an embedded Apache Ignite node starts up, it finds at least one running application instance and joins it, forming a cluster; if it finds none, it starts a new cluster. There is no master/slave architecture in Apache Ignite, so any node can join a cluster by connecting to any other node, and the lead node (not master) is responsible for updating the cluster topology by propagating the changes to all other nodes. There are different strategies for deciding which node is the lead node, the simplest one being the first node to create the cluster. Once a lead node leaves the network, the next oldest node assumes its responsibilities, and Ignite uses internal consensus algorithms to resolve this for you.
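Here is roughly how the embedded node's discovery gets wired up in Ignite.NET. In the real system the endpoint list comes from the in-house service registry; it's hard-coded below to keep the sketch self-contained:

```csharp
using Apache.Ignite.Core;
using Apache.Ignite.Core.Discovery.Tcp;
using Apache.Ignite.Core.Discovery.Tcp.Static;

// IPs of the other application instances, as returned by a (hypothetical)
// service registry lookup like the one described above.
var knownHosts = new[] { "10.0.0.11", "10.0.0.12", "10.0.0.13" };

var cfg = new IgniteConfiguration
{
    DiscoverySpi = new TcpDiscoverySpi
    {
        IpFinder = new TcpDiscoveryStaticIpFinder
        {
            // The node tries these endpoints until it finds a running
            // cluster member; if none respond, it forms a new cluster.
            Endpoints = knownHosts
        }
    }
};

IIgnite ignite = Ignition.Start(cfg);
```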

So that's it, now I have a working Ignite cluster. My application instances are both serving my application and hosting Ignite. It's worth noting that Ignite runs side by side with the application code in its own memory space. The JVM implementation is not limited to the JVM heap and can manage memory off-heap; many IMDG implementations written in Java do this using a little-known feature of Java, the sun.misc.Unsafe package. See the Ignite documentation for more on its memory architecture. This allows JVM apps to use more than the memory allocated via -Xmx and -Xms; it's tricky to pull off correctly, but enough work has been done in this area to trust that it does the right thing.

So how do I solve the duplicate request problem with this architecture? When a request comes into any of the application instances, the request parameters are hashed into a unique key. This part is very application specific: I pick only the fields of the request which uniquely identify the resource. Using the unique key, I create a distributed lock in the Ignite instance. Remember the distributed locks I mentioned earlier? The locks exist to allow synchronized updates to data stored in the Ignite cluster; however, locks are also exposed as a top-level API in Ignite (unlike in many other IMDGs). Taking advantage of this, I am able to lock the request's unique key for the duration of the request. My application does everything it needs to do: communicates with the database, talks to all the partners, takes its time, all while being guaranteed that no other instance can be working on an identical request. The key is to take the lock as early in the controller action as possible. If an identical request comes into the same application instance or another instance, the controller attempts to take the lock and fails, because the unique key is locked across the entire cluster. I opted not to use blocking locks and instead simply fail the request if the lock is already taken; this choice could of course differ per application. Lastly, I had to write a mechanism that guarantees the lock is released regardless of what happens to the request, so that no keys are left locked, which could have terrible consequences. This was simple enough to do via Web API filters that execute after each controller action. Additionally, Ignite is designed to release locks taken by a node if the node exits the cluster, so an application+Ignite host can be rebooted at any time with the guarantee that its locks are released.
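To illustrate the idea, here is a hedged sketch (not the production code; the DuplicateRequestGuard name and "request-locks" cache are made up). The controller takes the lock via something like this helper, and a Web API filter releases it after the action completes. Note that ICache.Lock requires the cache to use transactional atomicity:

```csharp
using Apache.Ignite.Core;
using Apache.Ignite.Core.Cache;
using Apache.Ignite.Core.Cache.Configuration;

public class DuplicateRequestGuard
{
    private readonly ICache<string, bool> _locks;

    public DuplicateRequestGuard(IIgnite ignite)
    {
        // Distributed locks are only available on transactional caches.
        _locks = ignite.GetOrCreateCache<string, bool>(new CacheConfiguration("request-locks")
        {
            AtomicityMode = CacheAtomicityMode.Transactional
        });
    }

    // Try to claim the request's unique key cluster-wide. Returns null if an
    // identical request is already in flight on any instance.
    public ICacheLock TryClaim(string requestKey)
    {
        ICacheLock cacheLock = _locks.Lock(requestKey);

        if (cacheLock.TryEnter())   // non-blocking: fail fast on duplicates
            return cacheLock;

        cacheLock.Dispose();
        return null;
    }

    // Called from a filter that runs after every controller action, so the
    // key is always released no matter how the request ends.
    public void Release(ICacheLock cacheLock)
    {
        cacheLock.Exit();
        cacheLock.Dispose();
    }
}
```

The controller action calls TryClaim first and rejects the request outright when it returns null; everything else about the request proceeds exactly as before.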

Diving a bit deeper into Apache Ignite, let's talk about how keys and locks are managed in the cluster. Ignite has many different, tunable configurations and modes. One of the most important choices is the cache mode: partitioned, replicated or local. I'm not going to go into detail about each; this is all very well documented. However, at the 10,000-foot view, replicated cache mode means data is written to one primary node (per key) and replicated to all other nodes. In partitioned mode, the data set is split into partitions, and data (per key) is written to one primary node within the calculated partition and replicated to a configurable number of backup nodes. Behind the scenes, when a new key is created, whether for a distributed lock or to store data in the cluster, the key is mapped, via an affinity function, to a single cluster node called the primary node. There is also configuration to select how replication is performed: async vs. sync. The distributed locks take advantage of this architecture naturally. Once a lock is created for a key, only the primary node owns/manages that lock. Any other attempt to lock the same key is routed to the same primary node, via the affinity function mapping. The lock's primary node has no other nodes to synchronize with and manages the locking and unlocking.

As a side note, the affinity function is essentially a hash function applied to the cache/lock key to determine the primary cluster node for that key. Conceptually, it hashes any key into an integer index in the range 0 through N - 1, where N is the size of the Ignite cluster. As long as the cluster size remains the same, a key will always map to the same index. If the cluster topology changes, the affinity mapping changes and the cluster has to be rebalanced. That topic is out of scope for this post, but very well researched and documented, so I'll just leave it there and get back to the problem at hand.
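A deliberately simplified illustration of the idea (Ignite's real affinity function maps keys to partitions and partitions to nodes, but the principle is the same):

```csharp
static class Affinity
{
    // Map a key to a stable node index in the range 0..clusterSize-1.
    // Same key + same cluster size => same primary node every time.
    public static int PrimaryNodeIndex(string key, int clusterSize)
    {
        unchecked
        {
            int hash = 17;
            foreach (char c in key)
                hash = hash * 31 + c;

            return (hash & 0x7FFFFFFF) % clusterSize;
        }
    }
}
```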

So does this solve my duplicates problem?

Requirements:

  1. Stop duplicate requests
  2. No new infrastructure
  3. Low complexity
  4. High performance
  5. Support .NET
  6. Configurable
  7. Extendable

Let's first recall the requirements defined at the beginning of this post in order to build an acceptable solution; I defined a set of seven top-level requirements, listed above. Starting at the top, the new solution must stop duplicate requests. Using a distributed lock across the entire cluster guarantees that no two threads can take a lock on the same key. Hashing the API request parameters into a unique key (for the lock) guarantees, by extension, that no two identical requests can be executed by any two threads simultaneously. The lock is held for the duration of the controller action, so race conditions on the database or partner APIs are eliminated. This satisfies the first requirement.
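As a concrete (and entirely hypothetical) example of how such a key can be built, assuming the uniquely identifying fields were a customer ID and an order number:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class RequestKey
{
    // Concatenate only the fields that uniquely identify the resource and
    // hash them into a compact, stable lock key.
    public static string Build(string customerId, string orderNumber)
    {
        string canonical = customerId + "|" + orderNumber;

        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(canonical));
            return Convert.ToBase64String(digest);
        }
    }
}
```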

Moving on, the second requirement states that the proposed solution shall not require any new infrastructure. The Apache Ignite cluster is embedded into the application hosts, so there is no additional infrastructure to manage. There is a slight overhead in monitoring the Ignite cluster's health, metrics and logs to ensure it's working properly; however, this overhead is much lower than if I had to manage a standalone Redis cluster or something similar.

Next is complexity. Reiterating my earlier point, the solution to the duplicate request problem is needed for an existing, complex enterprise application. The solution must not be complex and must not take a long time to develop. Apache Ignite is complex, but it does a great job of hiding that complexity behind a well-documented and highly optimized interface. I like to think about software complexity in terms of the first law of thermodynamics. The law states that energy is conserved: it can never be created or destroyed, only transformed into another state, which (in my opinion) is exactly paralleled in software complexity. You can hide it, you can move it around, you can transform it (into someone else's problem), but you can never eliminate it. In this case, we transfer the complexity of the solution into Apache Ignite's domain. We let Ignite's cluster handle all the necessary distributed computing, leaving us with an elegant and simple implementation.

The fourth requirement, performance, is a bit more involved than the previous ones. There are theoretical and measured aspects. Theoretically, Apache Ignite uses an efficient, low-level socket messaging layer to facilitate communication between cluster nodes. According to a GridGain article titled "GridGain Confirms: Apache Ignite Performance 2x Faster than Hazelcast", the atomic PUT benchmarks show a latency of 0.56 milliseconds. Even though we aren't actually using the PUT operation in this solution, the PUT operation uses locking to achieve atomicity, so the 0.56 millisecond figure is a reasonable ceiling estimate of the latency we should expect. According to the same article, we can expect a throughput of 115,000 atomic PUT operations per second, which is orders of magnitude more than the traffic my application receives. Additionally, Apache Ignite ships with benchmarks built on the Yardstick framework, which allows benchmarks to be executed on deployed clusters; I would be surprised if the GridGain article didn't use this framework to generate the reported numbers. I have not had the time or the need to benchmark my deployed cluster yet, though I envision that the next post in this series will be centered around performance and benchmarking; stay tuned for that.

The extendability requirement is met in several ways. First, if the need for a standalone Ignite cluster arises in the future, I can spin one up (with the dreaded new infrastructure) and offload the computation and memory footprint from my application instances, all while remaining completely transparent to the application. I would still need to run the JVM and an Ignite instance on each application node; however, these instances would be light, serving only as communication ports into the cluster, doing no computation and hosting no data. Second, I can use the Ignite cluster for more than just distributed locks if the need arises: caching database queries, caching API responses, or even scheduling compute jobs outside API actions; the possibilities are endless.

Without getting into the specifics of the code, I did write a set of interfaces for .NET applications which hide the implementation details of working with an in-memory data grid. As far as I could determine, no such standard interfaces exist today, but an abstraction like this is essential to preventing the dreaded vendor lock-in. Ignite.NET becomes one in-memory data grid provider, and the application can swap in a new provider at any time. This is just good coding practice, nothing revolutionary, but it solidifies the extendability requirement. According to the benchmarks above, Ignite is currently the most performant IMDG, but if new implementations come to market or Hazelcast's performance improves, I'll be able to switch with minimal effort.
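The shape of that abstraction looks roughly like this (the interface names below are illustrative, not the actual ones from the project):

```csharp
using System;

// What the application codes against.
public interface IDistributedLock : IDisposable
{
    bool TryAcquire();   // non-blocking; false means the key is already held
    void Release();
}

public interface IDataGridProvider
{
    IDistributedLock GetLock(string key);
}

// Ignite.NET supplies one implementation of IDataGridProvider (wrapping
// ICache<TK, TV>.Lock()); swapping IMDGs later means writing another
// provider class, with no changes to the application code.
```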

For all intents and purposes this solution works, and in my opinion works well. Since implementing it I have seen essentially zero duplicate requests, have not had to run any cleanup scripts, and was actually able to identify misbehaving clients. By monitoring the logs I could see when duplicate requests were being rejected by the new locking mechanism, and I then asked the clients creating those requests to fix their systems. However, I have also seen some interesting consequences of such a robust duplicate check mechanism. Keep reading this ridiculously long post to find out!

Lessons learned

Many lessons were learned in the process of implementing the Apache Ignite based solution. Mainly, I had to learn all about IMDGs, brush up on my distributed computing concepts, and dive deep into Apache Ignite, JNI and more. That's the boring stuff though; let's talk about how I took down my production application cluster for a few minutes (╯°□°)╯︵ ┻━┻.

IMDGs are very powerful, and as with anything powerful they can be dangerous if not carefully implemented. The same can be said about any distributed computing platform, but in my specific case a slight oversight resulted in an application outage. In my attempt to be clever and over-engineer the solution, I added a bit of code that would attempt to self-heal the Ignite cluster in case of unexpected network partitions. What I was trying to do was shut down the Ignite node and exit the cluster if a network partition was detected. Ignite has built-in mechanisms to detect such network events by providing event listeners. Taking advantage of this framework, I added logic to shut down the node to prevent a split-brain cluster. Shutting down the Ignite node would (in the worst case) result in falling back to the original duplicate check logic; not desired, though an acceptable degraded state. A split-brain cluster would be much worse, because the application would continue working but would lose the ability to fully prevent duplicate requests: nodes in the split clusters would behave as if they were successfully taking locks on unique keys, while the same key could be locked in the unreachable cluster. However, a bug in my partition handling logic was sending a shutdown signal to all nodes instead of shutting down only the partitioned node. This resulted in a cascading effect of every Ignite node attempting to shut down at the same time. Coupled with some built-in restart logic, which was meant to restart a node after an unexpected shutdown, this left the Ignite nodes stuck in a strange state, locking down the entire cluster. Cascading further into chaos, CPU spiked on the host machines and caused request queues to build up in my IIS application pools. Retry logic in my API clients caused even more requests to queue, bringing the whole application cluster to a halt. Was this a problem with Apache Ignite? No, this was a problem with my implementation, but it highlighted a few problems with my design. For one, it highlighted the risk of co-locating the Ignite nodes with the application nodes; had Ignite been running in a standalone cluster, the application would have remained largely unaffected. It also highlighted the fact that when introducing a relatively new technology into a critical system, especially one that relies on distributed computing, care must be taken to ensure proper failure scenarios are addressed. I have since introduced circuit breakers and more sophisticated logic around handling network partitions and failure fallbacks that prevent such scenarios from playing out.
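For reference, here is a hedged sketch of the corrected intent: listen for the local segmentation event and stop only the local node, rather than signalling the whole cluster. This is illustrative code, not the production handler.

```csharp
using Apache.Ignite.Core;
using Apache.Ignite.Core.Events;

// When THIS node detects it has been segmented away from the cluster,
// stop the local embedded node only; the API instance then falls back to
// the old duplicate-check logic instead of risking a split-brain grid.
public class SegmentationListener : IEventListener<DiscoveryEvent>
{
    public bool Invoke(DiscoveryEvent evt)
    {
        if (evt.Type == EventType.NodeSegmented)
        {
            Ignition.Stop(null, cancel: false);  // stop only the local node
            return false;                        // stop listening; node is gone
        }

        return true;  // keep listening for other discovery events
    }
}

// Registration (segmentation events must be enabled in the configuration):
//
//   var cfg = new IgniteConfiguration
//   {
//       IncludedEventTypes = new[] { EventType.NodeSegmented }
//   };
//   var ignite = Ignition.Start(cfg);
//   ignite.GetEvents().LocalListen(new SegmentationListener(), EventType.NodeSegmented);
```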

There was another interesting consequence of the distributed lock based duplicate prevention mechanism. Since the lock is held for the duration of the request, if the request takes a long time to complete, interesting things start to occur. Requests can take a long time for any number of reasons, e.g. timeouts at the database, request queues building up due to a spike in load, or outages in downstream dependencies. I'm sure I'm not the only one who has seen API requests get "stuck" in the absence of proper timeout configurations and circuit breakers; generally speaking, these types of problems occur periodically in most systems. However, when long-running requests hold locks and you have clients that retry on failure, you start rejecting the retries, because they are... well, duplicates, by the nature of the design. Depending on the underlying cause of the timeouts, this effect can act as a type of circuit breaker and can potentially help alleviate the problem... or make it worse. Usually, whenever a dependency outage causes timeouts, retrying clients (which do not employ circuit breakers) make the problem worse by sending large numbers of requests (doomed to fail) in a vicious feedback cycle. The duplicate check, which rejects the retries, begins to act as an implicit circuit breaker: no retried requests are added to the request queue, and no extra load is added to the underlying system which is choking and causing the outage in the first place. I'm not going to argue whether this observed effect is desirable or not, but I will state that I haven't seen it be a problem yet. I keep mentioning circuit breakers, timeouts and request queues... those are all interesting topics in their own right and deserve a full blog post, so I'll just leave them be for now.

Problems with Ignite.NET

Apache Ignite is a great in-memory data grid framework/platform: fully featured, mature, and backed by a large, thriving community. I was working with Apache Ignite.NET, a JNI-based .NET wrapper for Apache Ignite. In several months of testing and production deployment I encountered a small number of problems worth discussing.

I encountered an issue where a new Ignite node would get stuck on startup when attempting to join a cluster through a host machine where no Ignite node was running. Basically, when a node tries to join a cluster, it tries to establish a connection with at least one node in an existing cluster. To figure out which machines to try, the service discovery registry is used to get a list of IPs of the application cluster's host machines. If the application cluster is up, but no Ignite node has been started on any of the host machines, a new Ignite.NET host gets stuck attempting to connect to each of the IPs in the list. On the surface, this appears to be a bug in Ignite.NET, because I tried a similar scenario with pure Ignite and didn't see the problem. The workaround involved coordinating Ignite cluster startup to ensure that each new node has at least one running Ignite node to connect to. The very first node must know that it is first, so it doesn't try to connect to any other node and instead establishes a new cluster.

As for issues in the production cluster, there was only one, seemingly caused by a network partition. If you remember, about 100 pages back, I mentioned how I introduced a fatal bug while writing the network partition failure handling. The main motivation there was to prevent a split-brain cluster from forming. This turned out to be a real use case, with the exception of the whole split-brain thing. After running the Apache Ignite cluster in production for some time, I realized that network partitions, though rare, do happen, and 99.9% of the time Ignite handles them well: the partitioned node leaves the cluster and rejoins later once the network is fixed. A split-brain cluster has never formed; however, there was one instance where a partitioned node got into a strange state in which every attempt to take a distributed lock was treated as if the key had already been locked. Strangely enough, Ignite's network partition event did not fire on the affected node. The rest of the cluster did detect that the affected node had left and automatically took the appropriate measures to rebalance the affinity mapping. However, the partitioned node itself continued running as if it were part of the larger cluster, which resulted in all API requests on that node being rejected. The strange part is that a split-brain cluster did not form: the partitioned node used the same affinity function and attempted to send lock requests to the mapped cluster nodes, but was unable to communicate with them. I tried to make sense of what went wrong and looked through all the JVM logs from Ignite, but nothing really stood out. I was not able to reproduce it, nor have I seen it in my production cluster since.

I do have to make one disclaimer: my original solution was based on Ignite.NET 2.1.0, and there have since been two dot releases. As of this writing, the latest available version is 2.3.0. According to the release notes, most of the updates are bug fixes, performance improvements and added support for .NET Core. Unfortunately, I have not tested the newest version to see whether the issues I encountered have been fixed.

Final thoughts

I'll keep this short since the article is already long enough. The Apache Ignite in-memory data grid is a powerful, robust, highly optimized distributed computing platform. It can be used to build data processing pipelines and caching solutions, run in-memory data analytics on large data sets, and even implement atomic REST API actions. This was definitely a fun project; I'll be looking to expand my usage of Apache Ignite and start taking full advantage of this incredible platform in the future.