My first encounter with distributed systems, three or four years ago, came when I incorporated Kafka into my projects. But it wasn't until I hit production issues and had to dig into Kafka's architecture that I became curious about how it functions across multiple worker machines.
That led me down a rabbit hole of reading, researching, and experimenting with distributed systems through open-source contributions and blogs in a quest to understand this technology.
At their core, distributed systems consist of multiple computing entities (such as computers, servers, or devices) that work together to achieve a common goal. They are designed to handle tasks more efficiently than a single computer could, enabling high levels of performance, reliability, and scalability.
Given the unimaginably vast amounts of data required to run pretty much anything today, distributed systems have become a cornerstone of modern computing, underpinning the functionality and scalability of the internet, cloud computing, and many services we use daily - think Google and Netflix, for instance.
In this blog, I aim to take you through:
- Temporal's role as a prime distributed system in production
- Why Temporal was our preferred choice over other systems
- Challenges faced during Temporal implementation
- Balancing scalability benefits with system complexities
- The importance of a collaborative distributed systems ecosystem
- Staying updated on distributed systems advancements, and
- Final reflections on the future of distributed systems
This is in no way a “What Is…” blog. I think there’s enough of that out there. Rather, I’d like to share my journey and learnings from spending countless hours with distributed systems to guide you in the right direction if you’re someone interested in learning more about them.
Temporal is the Perfect Example of a Distributed System Utilized in Production
My interest in distributed systems waned as other projects took precedence — until I had to work with Temporal, a durable execution framework for workflow orchestration.
Temporal's unique selling point lies in its ability to scale out and run workers virtually without limit, which makes it exceptionally powerful.
To simplify this, consider a workflow like an order management system. Temporal offers a durable execution framework where, if coded appropriately, it can support an infinite number of workers. This promise intrigued me, and I wanted to figure out how Temporal would fulfill this ambitious claim.
Exploring Temporal's backend architecture, I found that it operates on the concept of workers and a centralized system. Tasks are delegated among these workers via internal queues to ensure smooth operation. If a worker experiences a failure, it promptly notifies the central host, allowing for swift response and task reassignment.
Think of Temporal as a centralized server connected to multiple worker nodes or machines. When a task is assigned to a worker, any failure prompts the worker to inform the Temporal server. Temporal marks the transaction as unfulfilled and initiates a new worker request to address the issue, ensuring continuity.
This approach starkly contrasts with traditional single-server setups, where a failure would cause server-wide downtime and require a complete restart of operations.
Distributed systems like Temporal offer granular fault tolerance, allowing individual tasks to be restarted and processed independently without disrupting the entire system. Each worker node operates independently but remains in constant communication with the central host, similar to Kafka's functioning.
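To make that concrete, here is a minimal sketch with the Temporal Python SDK. The workflow, activity, and parameter names are my own illustrations, not anything from our production code; the point is the retry policy on the activity, which is what lets a failed attempt be retried on another worker without restarting the whole workflow.

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def submit_order(order_id: str) -> str:
    # Stand-in for a flaky downstream call; in production this is where
    # a worker could crash or time out mid-task.
    return f"submitted {order_id}"


@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        # If the activity fails or its worker dies, the server re-queues the
        # task and another worker retries it; the workflow itself keeps going.
        return await workflow.execute_activity(
            submit_order,
            order_id,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_attempts=5,
            ),
        )
```

If the process running `submit_order` dies mid-attempt, the server times the task out and hands it to the next worker that polls the queue, which is exactly the granular fault tolerance described above.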
In critical software development, distributed systems like Temporal are indispensable. For our projects, Temporal serves as our workflow orchestrator, acting as the core component of our microservices architecture.
Temporal’s Edge Over Other Distributed Systems
You've got quite a few choices when it comes to distributed systems these days.
Take Kafka, for instance. It's a popular one that uses a centralized server along with multiple nodes (brokers) to communicate. It's known for its scalability and it's been widely adopted, especially by big players like LinkedIn.
Then there are some newer ones popping up, like Resonate, which I've come across myself. Other prevalent frameworks include the Hadoop Distributed File System (HDFS) and the various compute clusters used for job execution, all built on distributed systems architecture.
Temporal stands out for its robust core logic and advanced functionality, especially in handling tasks and defining architecture. Its approach to durable workflow execution makes it a standout in the market, even compared to services like AWS Step Functions and Azure's offerings.
Our focus for our projects is on automating workflows in the healthcare sector. In real-world scenarios, such as within electronic health record (EHR) portals, healthcare professionals like doctors and nurses spend their days doing repetitive tasks — clicking buttons, filling out forms, and sending data to other systems. It's time-consuming and limits how much they can get done.
Clients come to us looking for solutions to automate these tasks, moving away from manual processes to streamlined workflows. And that's where Temporal comes in — it's the key player in orchestrating these workflows efficiently.
We faced several challenges, but Temporal emerged as the ideal solution for each one.
- Firstly, we needed to manage numerous workflows for multiple tenants simultaneously. This called for a centralized orchestration mechanism that could execute workflows, handle errors, and implement retry mechanisms seamlessly.
- Ensuring fault tolerance was another hurdle. We had to develop custom logic to handle step failures, define retry strategies, and report errors. However, doing this independently would be resource-intensive and might not anticipate future challenges.
- Scalability was also a concern. Traditional approaches involved provisioning infrastructure for each workflow, leading to resource overhead.
Temporal offered a more efficient solution. Its server acts as a neutral arbiter that knows nothing about the business logic inside tasks, while workers register with the server before a workflow starts. Tasks are then delegated to a queue and picked up by available workers, giving flexibility and resource optimization through a decentralized approach.
Temporal's server seamlessly handles orchestration, focusing solely on step completion and task assignment, while managing retry attempts and fault tolerance transparently. Developing a comparable in-house orchestrator would require significant effort, as open-source solutions with equivalent capabilities are currently unavailable.
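Here is roughly what that registration step looks like with the Python SDK; the `orders` task queue and the `order_workflow` module are assumptions carried over from the earlier sketch.

```python
import asyncio

from temporalio.client import Client
from temporalio.worker import Worker

# Hypothetical module holding the workflow and activity from the earlier sketch.
from order_workflow import OrderWorkflow, submit_order


async def main() -> None:
    # The server never runs business logic; this process registers as a worker
    # and polls the "orders" task queue for workflow and activity tasks.
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="orders",
        workflows=[OrderWorkflow],
        activities=[submit_order],
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```

Scaling out then comes down to running more copies of this same process on as many machines as you like; the server keeps handing tasks from the queue to whichever worker polls next, and a separate client starts workflows by pointing them at the same task queue.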
With all that said, Temporal did not come without faults.
Challenges of Implementing Temporal
- Absence of a native multitenancy feature: While Temporal allows multiple tenants to live in different namespaces, true tenant isolation and role-based access control (RBAC) are lacking. Namespaces provide some level of segregation, but not the robust tenant isolation our use case needed, which made proper tenant management and access control challenging during implementation (see the short namespace sketch below).
- Potential for errors and downtimes: Despite Temporal's smooth operation in our production environment thus far, the surface area for unforeseen errors is large. Temporal's lineage traces back to Cadence, which Uber has battle-tested on millions of transactions and open-sourced, but issues nobody has anticipated may still surface over time.
- Alternatives: Alternate systems like Netflix Conductor offer similar orchestration capabilities, and the major cloud providers ship comparable services of their own. These alternatives give organizations additional options when seeking robust workflow orchestration solutions.
“Most concepts in these systems, like step function execution, for instance, aren’t unique to Temporal. AWS also employs similar approaches through their service called AWS Step Functions, and Azure offers similar functionality. However, while these services provide basic step functionality, they currently lack the depth and maturity found in Temporal.”
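For completeness, here is a minimal sketch of the namespace-per-tenant approach with the Python SDK, assuming the namespaces have already been registered on the cluster. It also shows why this is segregation rather than isolation: nothing in it enforces RBAC.

```python
import asyncio

from temporalio.client import Client


async def main() -> None:
    # One client per tenant namespace (the namespaces must already exist on the
    # cluster). Workflows, histories, and task queues are segregated per
    # namespace, but nothing here enforces role-based access control.
    tenant_a = await Client.connect("localhost:7233", namespace="tenant-a")
    tenant_b = await Client.connect("localhost:7233", namespace="tenant-b")
    print(tenant_a.namespace, tenant_b.namespace)


if __name__ == "__main__":
    asyncio.run(main())
```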
Scalability Over Complexity, Always
There’s one thing I’ve seen people struggle with: Do the benefits of scalability outweigh the challenges of managing a complex system?
My answer to that is simple:
Scalability over complexity, always! Scalability is crucial today. Complexity can always be worked around with abstraction to make it easier for both developers and end-users.
Let’s dig deeper into these two areas.
Scalability:
Scalability is vital in today's landscape, given the immense user base and transaction volumes we are expected to handle. Whether it's millions of app users or a multitude of events and transactions, scalability is key.
In distributed systems, scalability typically falls into two categories: vertical scaling and horizontal scaling.
Vertical scaling involves enhancing individual machine capabilities by boosting resources like RAM or CPU. However, this approach has limitations, as there's always a cap on how much a single machine can handle.
In contrast, horizontal scaling is more common. It involves creating a distributed setup with numerous smaller and more affordable machines. While each machine may have limited resources, such as 2GB of RAM and a single CPU core, deploying thousands of them in a cluster creates a scalable and cost-effective solution. This distributed architecture allows programs to run across the cluster, utilizing the combined resources of all machines.
Horizontal scaling is generally the preferred choice for achieving scalability in distributed systems due to its cost-effectiveness and flexibility.
Complexity:
When tackling complexity, we need to focus on two fronts: developer experience and end-user experience.
For developers, it's crucial to minimize cognitive overload by simplifying code and documenting complex algorithms effectively. Ensuring that all team members understand the systems they're working on is also vital.
On the user side, simplicity is key. Systems should be intuitive and user-friendly to minimize complexity.
For instance, let's consider Kubernetes. While it's a powerful tool, its complexity can be overwhelming for end-users at first. Google, however, has excelled at abstracting that complexity away.
For the end user, Kubernetes is made manageable through its CLI, kubectl. It is well structured, with predefined options and thorough documentation, and it even points you toward the correct command when you get one wrong. This simplifies interaction for developers and makes the platform easier to work with.
The key is abstraction. By concealing complexity and exposing only essential functionalities, Kubernetes ensures a smoother user experience.
“At the end of the day, balancing scalability with complexity is crucial, prioritizing scalability while ensuring that complexity remains manageable for both developers and end-users. This balance allows systems to grow seamlessly while remaining accessible to all stakeholders.”
The Distributed Systems Ecosystem is a Collaborative Effort
And I say this confidently because it always has been.
The distributed systems ecosystem thrives on collaboration, and one of the key figures in this collaborative effort is Leslie Lamport, widely celebrated as the father of principled distributed computing.
Lamport's seminal work on distributed systems, introducing concepts like Lamport Clocks (the logical clocks that later work extended into vector clocks), has laid the foundation for many modern distributed systems, including Temporal and Kafka. Over the decades, his contributions have been crucial to reasoning about ordering and timing across different systems.
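As a tiny illustration of the primitive at the heart of that work, here is a Lamport clock sketched in Python: no synchronized wall clocks, just a counter that preserves the happened-before ordering across messages.

```python
class LamportClock:
    """Minimal Lamport logical clock: orders events without synchronized wall clocks."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Attach the current timestamp to an outgoing message."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """On receipt, jump past the sender's timestamp so causality is preserved."""
        self.time = max(self.time, msg_time) + 1
        return self.time


# Two nodes exchanging one message: the receive is always ordered after the send.
a, b = LamportClock(), LamportClock()
t_send = a.send()           # node a is at 1
t_recv = b.receive(t_send)  # node b jumps to 2
assert t_recv > t_send
```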
The importance of open-sourcing such foundational research cannot be overstated. It allows organizations to build upon these principles, tailoring systems to their specific needs. While companies like Uber have built systems such as Cadence, Temporal's precursor, on top of this research, the underlying principles remain accessible to all, and that in itself is a godsend for architects like me.
This collaborative approach is a catalyst for innovation and allows organizations to make informed decisions about the technologies they adopt.
Access to white papers and publicly available research, like Amazon's papers on services such as Lambda and DynamoDB, is invaluable for architects and organizations. It not only offers insights into these technologies but also enables constructive dialogue and problem-solving.
This transparency builds confidence in decision-making, empowering architects to propose effective solutions and develop robust products based on solid principles. Ultimately, the open-source ethos reflects the spirit of collaborative research, paving the way for advancements in distributed systems and beyond.
Staying in the Loop with the Latest in Distributed Systems
Before diving into distributed systems, it's important to establish a solid foundation. You might encounter systems or software that hint at being distributed. This is where you should let your curiosity prompt further exploration.
Unfortunately, distributed systems aren't typically covered in depth in courses on Udemy or similar platforms. MIT's distributed systems course is available online, but it mainly offers theoretical knowledge rather than practical implementation insights.
To stay updated on the latest developments in distributed systems, here are a few strategies I follow that you can make use of too:
As a learning exercise, I often undertake practical exercises, such as developing toy projects that simulate distributed systems. These projects serve as valuable learning tools by allowing me to experiment with failure scenarios and gain a deeper understanding of distributed system concepts.
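One such toy project, boiled down to a few lines: a made-up coordinator that hands each task to randomly failing workers and reassigns it until someone succeeds, mirroring the central-host model described earlier.

```python
import random


def flaky_worker(worker_id: int, task: str) -> bool:
    """A toy worker that sometimes fails mid-task."""
    if random.random() < 0.4:
        print(f"worker {worker_id} failed on {task!r}, reporting to coordinator")
        return False
    print(f"worker {worker_id} completed {task!r}")
    return True


def coordinator(tasks: list[str], num_workers: int = 3) -> None:
    """Toy central host: reassigns a failed task to the next available worker."""
    for task in tasks:
        workers = list(range(num_workers))
        random.shuffle(workers)
        # any() stops at the first worker that succeeds.
        if not any(flaky_worker(w, task) for w in workers):
            print(f"{task!r} exhausted all workers, marking as unfulfilled")


coordinator(["ingest-order", "send-claim", "update-chart"])
```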
In cases where hands-on experimentation isn't feasible, I turn to reading and analyzing relevant blog posts or research papers, prioritizing white papers because they provide comprehensive insight into the decision-making behind various distributed system designs.
White papers authored by researchers will undoubtedly help you gain a thorough understanding of distributed system principles and practices. They detail choices and discuss applicability in different scenarios, making them invaluable resources for anyone venturing into this space.
Durable Execution Frameworks are the Next Goal
Despite research papers on distributed systems dating back to the 1980s and 1990s, we're still in the early stages of fully embracing distributed systems. While tools like Kafka and Temporal leverage distributed systems, many applications don't operate in a fully distributed manner.
There's a growing interest in durable execution frameworks to ensure programs can automatically recover from failures — a crucial requirement given the unpredictable nature of software. Currently, implementing such requirements is manual, verbose, and challenging. However, durable execution frameworks promise to significantly streamline this process.
They also offer the potential for infinitely parallelized execution, leading to more efficient and resilient applications. I anticipate a shift towards greater adoption of durable execution frameworks in the coming years, marking a pivotal development in distributed systems.
Looking ahead, there's a notable trend towards eliminating errors and prioritizing durability in distributed systems. The Resonate open-source library, offering SDKs for languages like Java and Python, serves as a durable execution framework for web applications and has gained traction since its launch last year. As someone involved in its development, I've witnessed the ambitious vision behind Resonate.
In the current landscape, web app developers often rely on traditional error-handling methods like try-catch blocks. However, the future of distributed systems demands a paradigm shift. Systems will automatically retry failed operations and seamlessly switch to alternative resources, optimizing resource utilization and ensuring uninterrupted operation, even in the face of failures.
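For contrast, here is a sketch of the hand-rolled retry-with-backoff boilerplate that application code carries today; durable execution frameworks move exactly this concern into the platform, as with the retry policy in the earlier Temporal sketch.

```python
import random
import time


def call_downstream_ehr() -> str:
    # Stand-in for a flaky network call to some external system.
    if random.random() < 0.5:
        raise ConnectionError("EHR endpoint unavailable")
    return "ok"


def with_manual_retries(attempts: int = 5, base_delay: float = 0.5) -> str:
    # The pattern most web apps hand-roll today: try, catch, back off, retry.
    for attempt in range(attempts):
        try:
            return call_downstream_ehr()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


print(with_manual_retries())
```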
With that said, adoption is still in its early stages, but I anticipate a significant shift in the next five to ten years. Distributed systems will become integral to all aspects of systems programming, alongside emerging technologies like AI.
Two major trends are currently shaping the tech landscape: advancements in AI technologies, especially Gen AI, and the pivotal role of distributed systems in systems programming.
Whether running machines in clusters or deploying MLOps, reliance on distributed systems is inevitable across all applications. They will form the infrastructure of modern applications, ensuring scalability, resilience, and efficiency across the board.
Embracing this evolution will pave the way for innovative advancements and ensure the robustness and adaptability of our digital infrastructure in the years to come.