Understand Distributed Systems

Objective: Understand the basic concepts of distributed systems and the role of Ice in implementing them in ArmarX.

Previous Tutorials: Create a "Hello World!" Component in ArmarX (C++)

Next Tutorials: Write a Server and Client Communicating via RPC (C++)

All problems in computer science can be solved by another level of indirection, except for the problem of too many layers of indirection. - David Wheeler

A fundamental property of ArmarX is that it realizes a distributed system. This fact has some important implications for its design and implementation, so we think it is useful to take some time here to explain the basic concepts. But first, let us understand the requirements of a framework for complex robot software.

If you are familiar with ROS (Robot Operating System), you might want to have a look at the comparison of concepts between ArmarX and ROS.

Structuring Robot Software: Understanding the Requirements

A humanoid robot's software has to continuously perform several tasks such as processing sensor data or controlling the robot's joints. Some tasks are completely independent, but some rely on the results of other tasks. Now, imagine you are implementing a cool new algorithm, e.g. for grasping an object, that should be added to the robot. Clearly, there is a need to logically structure the robot's software in a way that makes it easy to add your new algorithm.

One way to do that is to split the program logically into several workers. Each worker either continuously performs a task (e.g. processing camera images), or reacts to occurring events (e.g. a speech command). Of course, there must be a way for workers to communicate, i.e. to exchange data and send notifications.

At this point, you might already have a potential solution in mind: Threads. Let us see how viable that solution is.

Local Concurrency: Threads in a Process

You may already be familiar with threads in programming. A thread is a single worker working on your code with its own control flow (e.g. flow through if/else/for/while statements) and its own local variables (e.g. in a function). You can visualize it like a thread on a needle navigating through your code.

A thread resides in a process. A process is a running program (i.e. code in execution) with its own memory space; it has at least one thread, but can contain many more. The threads in a process run concurrently, each in its own context. However, they generally share their memory space: The process has only one virtual memory address space, and it is used by all threads. Do not be confused here: A program can (and should) be written in a way that clearly separates thread-local from shared variables. But in theory, every thread could access all variables, so it is easy to exchange data between them.

In this context, we can visualize the structure of a process like this:

Internal structure of a process.

As can be seen, threads can communicate over the memory of the process, i.e. by reading and writing shared variables in the memory. The thick outline of the process block represents the series of strong protection mechanisms the operating system (OS) implements for processes. An important one is that one process cannot directly access the memory of another process.

Scaling Things Up

Let us summarize what we discussed so far about threads:

  • Each thread is a separate thread of execution through the code.
  • Multiple threads live inside a process.
  • All threads in a process share the memory of the process.
  • Threads can easily exchange data by reading and writing shared variables (though doing so safely requires adequate synchronization mechanisms); the sketch below illustrates this.
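
As a minimal illustration (plain C++, independent of ArmarX), here are two threads that communicate by incrementing a shared counter, with a mutex providing the necessary synchronization:

#include <iostream>
#include <mutex>
#include <thread>

int main()
{
    int sharedCounter = 0;  // lives in the process's memory, visible to all threads
    std::mutex mutex;       // synchronizes access to the shared variable

    auto worker = [&](int increments)
    {
        for (int i = 0; i < increments; ++i)
        {
            std::lock_guard<std::mutex> lock(mutex);
            ++sharedCounter;  // communicate by writing shared memory
        }
    };

    std::thread threadA(worker, 1000);
    std::thread threadB(worker, 1000);
    threadA.join();
    threadB.join();

    std::cout << "Result: " << sharedCounter << "\n";  // prints 2000
    return 0;
}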

Sounds good, doesn't it? However, there are also some significant limitations that surface when increasing the system's complexity:

  1. How do you add your new code to the robot's process in the first place? You probably want your own separate project where you can experiment without requiring others to compile and maintain your code. Sure, software engineering has a solution to that: Plugins, which are often used to extend GUI-based applications such as IDEs and the ArmarX GUI. But plugins require dynamically loading (and potentially un- and re-loading) compiled libraries, which can become somewhat tricky during development.
  2. Imagine you are writing your code, compiling it and making a test run in simulation. You quickly find an error, fix your code, re-compile and run again. But you cannot just re-start a thread; instead you have to restart the whole process, including the complete simulation.
  3. Imagine that one computer is not enough to run the robot's complete software stack. So you might want to add more computers to the system. But one process is bound to a single computer, so you cannot distribute your process onto several computers.
  4. Imagine that you are a beginner when it comes to C++ (just for the sake of argument). In other languages (such as Java or Python), you would get an exception if you do something illegal such as accessing a null pointer. In C++, however, you would produce undefined behaviour, which usually results in a segmentation fault ("segfault" for short), which in turn usually makes your process crash spectacularly. And yes, it crashes the whole process. So if all tasks of the robot, including life-preserving ones (in terms of both the robot's state of execution and the health of people nearby), run as threads in the same process, and your thread is buggy and causes a crash, the entire robot crashes. Ouch.

So clearly, while wrapping your complete software stack in a single process has benefits, it also shows some serious limitations. How can we overcome them? One answer is, as is often the case: An additional layer of indirection.

Distributed Systems

In a distributed system, we use multiple processes to build our software stack. Imagine that each process roughly takes the role of a thread in the previous design. For example, there is one process fetching images from the camera and processing them, and another process is controlling the robot's joints. Of course, each process can still run multiple threads, but this is up to the developer of that process now, not something the framework requires.

You can visualize several processes communicating in a distributed system like this:

Processes in a distributed system.

An important aspect is that there is no shared memory space among the different "workers" anymore. As mentioned before, the OS provides strong protection for processes, e.g. processes cannot just directly access each other's memory, so it is more difficult to exchange data between processes than it is between threads of the same process. However, the OS also provides tools that allow processes to communicate: This is called inter-process communication (IPC). In addition, distributed systems often rely on a middleware, which provides application programmers with a higher-level interface for IPC than the OS does.
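
To give a flavor of what OS-level IPC looks like, here is a minimal POSIX sketch in which a child process sends a message to its parent through a pipe. A middleware such as Ice hides exactly this kind of low-level detail:

#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    int fds[2];
    if (pipe(fds) != 0) { return 1; }  // fds[0]: read end, fds[1]: write end

    if (fork() == 0)
    {
        // Child process: write a message into the pipe.
        const char msg[] = "hello from another process";
        write(fds[1], msg, sizeof(msg));
        return 0;
    }

    // Parent process: read the message from the pipe.
    char buffer[64] = {0};
    read(fds[0], buffer, sizeof(buffer) - 1);
    std::printf("received: %s\n", buffer);

    wait(nullptr);  // wait for the child to finish
    return 0;
}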

The inconvenient thing is that we now have to think about how workers communicate instead of just writing and reading shared variables. The good thing is that we now have to think about how workers communicate, because it forces the definition of clear interfaces and makes communication explicit (e.g. by sending messages instead of writing a shared variable).

Let us look at the limitations we found before and see whether we improved something:

  1. (Extension) To extend the running system, you can just start your new process. It can use other processes via their interfaces. If your process provides a common interface that is known to the running processes, they can also communicate with your process.
  2. (Restarting) After you re-compiled your code, you can just start your process to test it. The processes building the simulation can keep running.
  3. (More computers) Assuming processes can communicate over the network, you can distribute the different processes on several computers in a network.
  4. (Crashes) If your process crashes, it does not automatically tear down the whole system. Other processes might lose connection and receive errors, but they will not be automatically shut down by the OS.

Sounds good! Of course, adding such an extra layer also has downsides. The overall system complexity will be higher, and testing can become a bit more difficult when you have to start several processes instead of one. However, we think it is useful to understand the requirements that led to the design decision that a distributed system is adequate for a (humanoid) robot software framework.

So, let us look into the details of what this means.

Communication Paradigms in ArmarX

In ArmarX, there are two main communication paradigms: Remote procedure calls (RPC), and topics.

Remote Procedure Calls (RPC)

This first paradigm is a client-server communication. In a nutshell, a remote procedure call (RPC) is like a regular function call, but remote, i.e. you call a function in a different process via the network.

There are at least two processes: A server and one or multiple clients. The server offers a service that clients can use. More precisely, a client can send a request carrying the input data to the server. The server processes the request, and sends back a response carrying the output data to the client.

Remote procedure calls between a server and clients.

The service offered by the server is described by an interface. For example, imagine the service of generating a random number in an interval. You can imagine such an interface like this:

struct GenerateRandomNumberRequest
{
    /// The lower bound of the interval.
    int low = 0;
    /// The upper bound of the interval.
    int high = 0;
};

struct GenerateRandomNumberResponse
{
    /// The generated random number.
    int random;
};

interface RandomNumberGeneratorInterface
{
    /// Generate a random number in the interval [low, high].
    GenerateRandomNumberResponse generateRandomNumber(GenerateRandomNumberRequest req);
};

  1. There is an interface called RandomNumberGeneratorInterface.
  2. It has a single function generateRandomNumber().
  3. This function takes a GenerateRandomNumberRequest as input, which contains two integers (low and high) that define the allowed interval.
  4. Finally, the function returns a GenerateRandomNumberResponse containing the random number.

So conceptually, it is not that different from what you might expect from a classical programming language. The difference is that this function generateRandomNumber() can be called via the network from another process: the client.

Let us look at an example remote procedure call:

  1. A client builds a request, setting the input arguments to low = 1 and high = 10.
  2. The client sends the request to the server and waits for the response.
  3. The server receives the request.
  4. The server processes the request by generating a random number between 1 and 10 - say it generates the number 3.
  5. The server builds a response, passing it the result 3.
  6. The server sends the response back to the client.
  7. The client receives the response and gets the result 3, and continues.

Remote procedure calls between a server and clients.

Let us note some important properties of the client-server paradigm / remote procedure call:

  • It is one-to-many: There is one server, and any number of clients.
  • It is bidirectional: Data is sent from the client to the server and back.
  • By default, a remote procedure call is synchronous (or "blocking"): The client waits for the response before it continues execution. (However, there are ways for a client to send a request, continue execution right away, and collect the result later; see the sketch below.)
  • The server does not know the clients, but the clients know and depend on the server. In other words, the server can run on its own, but the clients require the server to run.
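
To make the call sequence above concrete, a client could look roughly like this in C++ when using Ice directly. This is only a sketch: the endpoint string, the proxy name, and the generated header are assumptions, and the ArmarX-specific way of obtaining proxies is shown in the next tutorials.

#include <Ice/Ice.h>
// #include "RandomNumberGenerator.h"  // header generated from the Slice file (name assumed)

int main(int argc, char* argv[])
{
    // Set up the Ice runtime for this process.
    Ice::CommunicatorHolder ich(argc, argv);

    // Obtain a proxy to the remote server (endpoint string is a made-up example).
    auto base = ich->stringToProxy("RandomNumberGenerator:default -p 10000");
    auto rng = Ice::checkedCast<RandomNumberGeneratorInterfacePrx>(base);

    // Build the request with the input arguments.
    GenerateRandomNumberRequest request;
    request.low = 1;
    request.high = 10;

    // Synchronous call: blocks until the server's response arrives.
    GenerateRandomNumberResponse response = rng->generateRandomNumber(request);

    // Asynchronous variant: returns a std::future, so the client can
    // continue right away and collect the result later.
    auto future = rng->generateRandomNumberAsync(request);
    GenerateRandomNumberResponse later = future.get();

    return 0;
}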

In one of the next tutorials, we will see how remote procedure calls can be implemented in ArmarX.

The graphic above also highlights the different software layers that are involved:

  • The application layer contains your "business code": This is what your application does, i.e. the logic code that solves your actual, concrete problem. You usually want this part of the code to be independent and simple to use. That is, it should only rely on dependencies (e.g. libraries) that are necessary to solve that sub-problem, and it should use features of the programming language (e.g. overloading) to allow convenient usage.

  • The transport layer fulfils the duty of transporting your data between processes, e.g. from the client to the server and from the server to the client. It does not know anything about your application logic, except what the data and interfaces look like.

  • Code at the boundary between the application and the transport layer often involves conversions between the types which are defined in the programming language and the types which are defined in the transport layer.

For example, imagine that you are dealing with mathematical 3D vectors of the form (x, y, z).

  • For your business code (application layer), you will want to use Eigen::Vector3f or Eigen::Vector3d in C++ and np.ndarray in Python. These types are powerful, as they use features of their respective programming language to provide efficient and convenient usage (think of mathematical operations, slicing, etc.). In addition, they are widely used and well-supported by other libraries. They are what we call business objects (BO).

  • The transport layer only cares about how your data types are structured, but not what they mean: It has no idea (and does not care) that two 3D vectors can be multiplied to return a scalar number. All it needs to do is transport the data from one point to the other. Also, the transport layer has to support different programming languages. Therefore, these data transfer objects (DTO) are usually very limited in their functionality (they are mere data containers). As a consequence, you do not want to use the DTOs in your business code apart from very simple cases. A sketch of such a boundary conversion follows below.
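
To illustrate such boundary code, here is a sketch of converting between an Eigen business object and a hypothetical DTO for a 3D vector (the type Vector3fDto and the conversion functions are made-up examples, not ArmarX API):

#include <Eigen/Core>

// Hypothetical DTO, as it might be generated from a Slice struct:
// a plain data container without any mathematical operations.
struct Vector3fDto
{
    float x = 0.0f;
    float y = 0.0f;
    float z = 0.0f;
};

// Conversions at the boundary between application and transport layer.
Vector3fDto toDto(const Eigen::Vector3f& bo)
{
    return Vector3fDto{bo.x(), bo.y(), bo.z()};
}

Eigen::Vector3f fromDto(const Vector3fDto& dto)
{
    return Eigen::Vector3f(dto.x, dto.y, dto.z);
}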

When you are designing your distributed application, it is good to think about what part of it is application/business code (usually the majority), and where it touches the transport layer. Ideally, the core of your business code (e.g. the math) can also be used in a non-distributed way (as a library), and the code connecting it to the transport layer is clearly separated from that core.

In principle, you can imagine the layer diagram of such a software architecture like this:

Layer diagram of a distributed software architecture.

Now, back to the actual topic!

Topics

The second paradigm is a publisher-subscriber communication. Other terms for the same concept are "producer-consumer", "provider-listener", and "signal-slot". In principle, topics realize the observer pattern. A topic is like a broadcasting channel:

  • Processes can subscribe to a topic in order to receive future messages. These processes are called subscribers.
  • Other processes can publish messages on the topic, which will be distributed to all current subscribers. These are publishers.

Topics allow publishers to broadcast messages to subscribers.

In contrast to a server in the client-server paradigm, a topic is not a process of its own. Conceptually, it is just a name and an associated topic interface. The interface defines which messages can be sent over the topic. Topic interfaces differ in one important aspect from service interfaces: They never return anything. This is because communication via topics is unidirectional: A publisher sends a message, but does not get a response. Therefore, the corresponding interface functions always have a void return type:

struct RandomNumberTopicMessage
{
    /// The random number to report.
    int random;
};

interface RandomNumberTopicInterface
{
    void reportRandomNumber(RandomNumberTopicMessage msg);
};
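
To connect this to the observer pattern mentioned above, here is a plain, in-process C++ analogy of a topic. This is only an illustration of the pattern, not how Ice implements topics:

#include <functional>
#include <iostream>
#include <vector>

// A local, in-process analogy of a topic: subscribers register callbacks,
// and publishers broadcast messages to all current subscribers.
class RandomNumberTopic
{
public:
    using Subscriber = std::function<void(int)>;

    void subscribe(Subscriber subscriber)
    {
        subscribers.push_back(std::move(subscriber));
    }

    void reportRandomNumber(int random)  // called by publishers
    {
        for (const auto& subscriber : subscribers)
        {
            subscriber(random);  // unidirectional: no return value
        }
    }

private:
    std::vector<Subscriber> subscribers;
};

int main()
{
    RandomNumberTopic topic;
    topic.subscribe([](int r) { std::cout << "subscriber A got " << r << "\n"; });
    topic.subscribe([](int r) { std::cout << "subscriber B got " << r << "\n"; });
    topic.reportRandomNumber(3);  // a publisher broadcasts a message
    return 0;
}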

Again, let us note some important properties of topic-based communication:

  • It is many-to-many: There can be any number of publishers and subscribers (including zero).
  • It is unidirectional: Data is sent only from publishers to subscribers, but not back.
  • Topics are asynchronous (or "non-blocking"): A publisher sends a message to the topic and immediately continues execution. The communication system takes care of the message until it has been distributed to all subscribers. This can happen at virtually any point in time; there is no way for a publisher to know whether and when the message arrived.
  • Publishers and subscribers do not know each other; they only know the topic. Therefore, topics can in principle foster loose coupling: Both publishers and subscribers can run independently of each other. However, it can also be harder to track who published a message and who receives it.
  • Topics are volatile: If a subscriber missed a message because it was started later, that message is lost.

Comparison: Remote Procedure Calls vs Topics

Let us summarize and compare communication via RPCs and topics again:

Remote Procedure Calls (RPC, Server-Client)        | Topics (Publisher-Subscriber)
---------------------------------------------------|-----------------------------------------------
1 server to n clients (1-to-many)                  | m publishers to n subscribers (many-to-many)
Bidirectional (client to server to client)         | Unidirectional (publishers to subscribers)
Synchronous by default (client waits for response) | Asynchronous (publisher continues immediately)
Some coupling: Server is stand-alone,              | Loose coupling: Publishers and subscribers
but clients depend on the server                   | are independent
Similar to a regular function call                 | Similar to the observer pattern

The Interface Definition Language

To realize the required inter-process communication, ArmarX relies on the RPC framework Ice by ZeroC (Ice also supports topics). Ice can exchange data between many programming languages. Therefore, it requires data types and interfaces to be defined in a language that is independent of any concrete programming language. In general, such a language is called an Interface Definition Language (IDL).

In Ice, this language is called Slice (Specification Language for Ice). The examples above are all written in Slice. Syntactically, it looks similar to C++, but of course the two languages are very different. Slice code is usually stored in .ice files. You should give the Slice documentation a brief read - it is not long, but it spares us from repeating everything here. However, for the sake of self-containment, we reproduce the most important statement in this context here:

Slice definitions are compiled for a particular implementation language by a compiler. The language-specific Slice compiler translates the language-independent Slice definitions into language-specific type definitions and APIs. These types and APIs are used by the developer to provide application functionality and to interact with Ice. The translation algorithms for various implementation languages are known as language mappings, and Ice provides a number of language mappings (for C++, C#, Java, JavaScript, Python and more).

Note that ArmarX itself mainly makes use of the C++ and Python mappings.
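
As a rough illustration of such a language mapping, the Slice struct GenerateRandomNumberRequest from above essentially maps to a plain C++ struct (simplified; the actual generated code also contains machinery for (de)serialization):

// Simplified sketch of what the C++ language mapping generates for the
// Slice struct GenerateRandomNumberRequest; real generated code includes
// additional support for transporting the data over the network.
struct GenerateRandomNumberRequest
{
    int low = 0;
    int high = 0;
};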

Next Up

In the next tutorials, we will see how both remote procedure calls and topics can be implemented in ArmarX.