In February 2014, a big bank in the Netherlands suffered from an internet banking malfunction which led many costumers to accidentally perform duplicate bank transfers. (In fact, this was just part of the problem as the bank poorly handled the situation and initially laid the responsibility of getting the money back to the costumers.) But what exactly happened?
Imagine you want to pay back a friend who you owe from the last time you went out for dinner. You open your bank’s mobile app and transfer the amount to your friend’s bank account. All seems well until you receive an error message indicating that the transfer failed, and a suggestion to try again. Trustful as you are, you try again a couple of times, only for the error message to appear time and time again. This is not right, you think, and immediately check your account balance, only to find out that the money has in fact left your account multiple times. The problem: the bank’s server received your transactions just fine, but your app never got a confirmation message – and the bank’s server didn’t realize it. In other words, there was no consensus between your app and the bank’s server that a transfer was made. The solution? Well…
Enter the “Two Generals" problem
Imagine there is a city in a valley – namely, Byzantium or Constantinople. The city is surrounded by two armies, each commanded by a general. The generals want to capture the city, but the only chance they have at winning is if they attack together – that is, if they reach consensus. If both generals attack at the same time, they win, but if they attack at different times, they lose. Now this is where the problem lies. The generals can only communicate by sending messengers through the valley, and the valley is dangerously filled with enemies. A messenger might get captured and his/her message undelivered, or a deceiving message could be sent from the valley to the generals. With such an insecure channel of communication, how can the generals decide when to attack? At this point you might be wondering what this has to do with our bank screw-up story from earlier. But notice how the two generals can be interpreted as the bank’s server and your mobile app, respectively, and an attack as performing a transaction.
You would think that this simple problem has a solution, maybe a complicated and impractical one, but a solution, nonetheless. However, even in such a simple scenario, it is impossible to guarantee consensus! In other words, there is no algorithm – or protocol – that the generals can use to be certain that the attacks will be coordinated.
General A and general B need to reach consensus for attacking. They communicate via an unreliable communication channel as messengers may be captured in the valley.
Computer Scientists have proved this impossibility formally, but we can intuitively see why it is true. Suppose there are two generals – A and B – seeking to attack the valley. General A takes the initiative and sends a messenger through the valley with a message that reads: “let’s attack tomorrow at noon”. Now, it may happen that the messenger is captured, and the message not delivered. It then makes sense for general A to expect an acknowledgement of receipt from general B before attacking. However, we now have the same problem from the viewpoint of general B. Say that general B received A’s message and sends the corresponding acknowledgement with a messenger. But, again, this message could also be lost in the valley, so general B would require a confirmation of receipt from A, and so on. As you can see, the back-and-forth messaging can go on forever. The culprit: an unreliable communication channel.
Originally, this problem arose when trying to devise a simple protocol for nodes to communicate with each other in a computer network. Think of the internet, where a connection between a client and a server must be established (i.e., consensus must be reached on how to transmit/interpret exchanged data) before actual data can be sent. In the field of distributed systems – which studies collections of computing agents that communicate by passing messages to one another – this problem is suitably referred to as the “two generals problem”. It is quite famous as well since it was the first computer communication problem to be proved unsolvable. As a result, any generalization of the problem is also unsolvable in the face of arbitrary communication failures. This sets the base of what can be realistically expected for any distributed consensus protocol.
Of course, this doesn’t mean that you should just accept bank screw-ups and move on with your life. There are approaches – like those hopefully implemented by your bank – to this problem that work – or should work – well enough in practice. But the theoretical impossibility result tells us that there is no way that two (or more) decentralized parties can agree with 100% certainty that their messages have been received and acknowledged by both (all) parties. In our story, the banks’ server did not acknowledge that its confirmation messages to the costumers got lost somewhere and it decided to perform the transactions – or “attacks” – anyway.
What can be done, then?
Perhaps the most obvious action the bank could have taken was to detect duplicated transactions happening within a few minutes of each other and, consequently, ask the costumers for confirmation or not process the duplicated transactions at all. This is something we agree is common sense. Another approach is to actually “solve” the two generals problem with something called an “idempotency token”. This is a unique value generated at the source of a message to be sent alongside the first and subsequent identical messages. Think of it as an ID of the transfer you want to make to your friend’s account. The receptor of the message – the bank’s server – upon receiving the message, acknowledges its content – performs the transfer – but also stores the unique message’s ID. If further messages are received bearing the same ID, the server can know two things: (1) its reply didn’t get through, and (2) the content of the message has already been processed. Then, instead of performing a repeated action, the server can simply issue a copy of the first acknowledgement message again. Of course, this approach does not really solve the two generals problem, as it can still fail if no messages ever get through – which is to be expected as the underlying problem is unsolvable. But, in reality, humans would quickly identify (and correct) such failure – be it by engineers monitoring the system of by complaining costumers on social media – which means it should work well-enough in practice.
The bank screw-up story is but one example highlighting the challenges of distributed consensus. But communication is not the only thing that can go wrong in a distributed system. Some components in the system could themselves be unreliable or even malicious. Can the remaining – well behaving –components still reach consensus? Keep on reading and find out!
Enter the “Byzantine Generals” problem
Posed by computer scientists Leslie Lamport, Robert Shostak and Marshall Pease in 1982, the “Byzantine Generals” problem is a generalized version of the two generals problem. Instead of two, consider multiple generals which have surrounded the city of Byzantium and, again, need to decide on a common plan to attack. This time, however, some generals are traitors – they are Byzantium infiltrators – and may lie about whether they support a particular plan and about what other generals have told them. Exchanging only messages, the objective is for all loyal generals to reach consensus. (Here we are assuming that every message that is sent is delivered correctly because, otherwise, we already know that consensus is impossible.) Obviously, if all generals are traitors, then no consensus can be reached, and if all generals are loyal, consensus is a given (since we assume perfect communication). But what about something in the middle? It turns out that consensus is possible! But only if more than two thirds of the generals are loyal; otherwise, the problem is unsolvable. (You can see more details about this result and the corresponding consensus algorithm here and here.) The figure below illustrates the impossibility of consensus with 3 generals where the consensus algorithm is simple majority rule.
A single traitor (red) can prevent the loyal generals (blue) to reach consensus when they have different attack strategies by sending them a different strategy from their own. Here, general A proposes to attack at sunrise while general B proposes to attack at sunset. Given the proposals of traitor general C, general A decides to attack at sunset while general B decides to attack at noon since that is what they see as the majority vote, respectively.
Distributed Consensus in the wild
Aside from the obvious application to military communications, the Byzantine generals problem must be solved by any distributed system that needs to achieve reliable communications. A “traitor” component – or Byzantine fault – in a distributed system could be a hardware malfunction, a software defect, or even a malicious attack, just to name a few examples. The ability of a system to defend against these conditions is referred to as Byzantine fault tolerance. Some applications where such a notion is required include nuclear power plants, airplane engine systems, and other distributed systems that depend on the actions of many sensors – some of which could fail, yet the system should be kept running. So, the next time you board a plane, rest assured that as long as more than two-thirds of the sensors are functional, you should reach your destination just fine.
Perhaps the best example of a Byzantine fault tolerant system is a so-called blockchain. In a few words, a blockchain is an ever-growing digital ledger – a list of data records – that is shared by all members in a decentralized network. There is no trusted leader to manage its operation, but the members must interact with each other to reach consensus on the status of the blockchain. Coming back to money, modern-day cryptocurrencies are examples of blockchain implementations – of which Bitcoin was the first to appear. The idea behind cryptocurrencies is the creation of a trustless economic system, where financial transactions can be executed and verified without the need of an intermediary or institution (a particularly appealing idea for those affected by internet banking mistakes). Some consensus algorithms for blockchains famously include Proof of Work (PoW) and Proof of Stake (PoS). Such algorithms are by no means simple and could be the topic for an entire article. You can see more details about them here and here.
Back to our bank malfunction story
In view of the theoretical insolvability of consensus in some settings, you might still wonder whose fault was it that the transfers didn’t work as they should. Simply put, it was the banks. On the one hand, consensus is difficult in any distributed network – it is understandable that things can go wrong. But on the other hand, the bank should know this, and should have prepared for eventual failures by implementing some well-known solutions. So, the next time an internet service malfunction occurs – be it with a bank, a food delivery app, or online shopping – you end up with duplicated charges, and they try to blame you, ask them what they know about distributed consensus. You’ll most likely receive a full refund, no questions asked.