Some thoughts on Unions

I’ve never used a Union. It’s been around since at least Cobol and not only have I never used one, I’ve never seen one used except for when people explain what they are. Structs and Unions were the two composite data types in C and each could contain a set of any C type. In addition to the other types, both Unions and Structs could contain Structs and Unions. Structs were the end-to-end layout of the datatypes they wrapped (their size is the sum of the sizes of data types) and Unions were the overlay of the datatypes (their size is the greatest size of the types they compose). Clearly being able to uniquely identify data in a container has its uses since it became the basis for inheritance in C++ while Unions languished in obscurity. To illustrate how composition with identity influenced inherited types in C++, let’s look at two classes.

struct A
{
  int iA;
};

struct B
{
  A a;
  int iB;
};

The memory layout of B is identical to the following C++ implementation of B.

struct A
{
  int iA;
};

struct B : A
{
  int iB;
};

Unions on the other hand don’t expect you to be able to retrieve more than one thing at a time from them. They may have been used quite a bit in the days when memory was constrained, but nowadays people don’t worry about having a single container for multiple types very often either because they can just use multiple types (through templates) or they eat the memory overhead of having a composite type. In the rare case where you have a large memory footprint and can’t take the cost of having wasted space in your composite, and  method templating doesn’t clarify the implementation, there is a better solution to Unions. In the boost library there is a type called the Variant which works in the same way as Unions and uses templates for type safety. I’ve used this type myself on a project and it see some use in the wild.

If Unions did have to support inheritance, it seems like it would be fairly straightforward. The following layout

union A
{
  int iA;
};

union B
{
  A a;
  int iB;
};

would have the same memory layout as

union A
{
  int iA;
};

union B : A
{
  int iB;
};

In these cases the union size would be 4 bytes because the largest datatype is 4 bytes. If dynamic type information needed to be made available through something like C++’s typeid() method then perhaps it would be larger than 4 bytes because a pointer to the Run Time Type Information system would need to be available. Overwriting the pointer would be a problem because all the types are aligned to the first byte of the object. The only way to preserve the pointer would be to store it at the very end of the object where it couldn’t be overwritten. By this I mean that a 4 byte union with a 4 byte type table pointer would need to be 8 bytes long, so that the pointer can live in the last 4 bytes without being overwritten when the first four bytes change. Isn’t this just the same as a struct with its way of laying out data in non-overlapping blocks? Yes, yes it is.

The C++ version of Unions doesn’t allow for inheritance, but it does actually allow for constructors, destructors, member functions, and access modifiers. Of course, all those things are stored end-to-end instead of overlapping, so really C++ unions are just structs with a union in them.

One edge case for C++ objects that would be required for inheritable unions is that an empty union union Something {}; would need to be at least 1 byte large. Stroustrup thoughtfully wrote this into the C++ specification so that no two objects could have the same memory address. Of course different compilers treat byte alignment differently, so this 1 byte is sometimes more.

As long as we’re exploring “C” type unions with inheritance, let’s consider how we might give them methods. If we were forced to come up with a memory scheme that allowed for Unions to be composed of methods it would necessarily have some peculiarities to it. Since all the methods would be aligned to the first byte, there could only be one entry point. If there were multiple entry points then it’s the same as having struct-like end-to-end composition. With only one entry point there would need to be only one exit. Multiple exists could exist if there was room to add conditional exits based on what method was called, but that’s a sort of simulated end-to-end layout. If there were only one entry and exit from all of the methods in the union then there could only be one parameter list as well. This is starting to sound like just one function instead of multiple methods, isn’t it? How could we make this more than one method though? Let’s create a fictional scenario where we can explore making this work. If we allocate 100 bytes for the methods, and 50 for the first method, 75 for the next and 100 for the last, then every method beyond the first must contain other methods. If the only exit is in the last method, then every method call must execute all the methods. If they have different return types the single exit point doesn’t work, so perhaps all variable manipulation should be done through references. This could look like this

unionObject
{
  void IsNull(bool& h, const int& i, int& j, int& k);
  void Halve(int& j, int& k);
  void Double(int& k);
};

The implementation of these methods might look like the following with a scope around each method that insulates it from variables name collisions from methods above it.

void IsNull(bool& h, const int& i, int& j, int& k)
{
  //check i, result in h
  void Halve(int& j, int& k)
  {
    //manipulate j
    void Double(int& k)
    {
       //manipulate k
    }
  }
}

I’ve written “manipulate _” but any method could manipulate any variable that is propagated bellow it. For instance, IsNull() has access to k, the only parameter to Double, and could manipulate it as it’s being passed down to Double(). Alternatively, scope might not exist like I supposed and each method could have complete access to the initial parameter list.

unionObject SomeName
{
  void IsNull(bool& h, const int& i, int& j, int& k);
  void Halve(bool& h, const int& i, int& j, int& k);
  void Double(bool& h, const int& i, int& j, int& k);
};

If this was intentionally a series of operations, and not just an attempt to duplicate the differentiation of single methods like in C++ structs, then perhaps having the methods laid out this way creates a useful hierarchy. Consider these methods to be first class datatypes, if Double() is overwritten during runtime then Halve() and IsNull() are also overwritten, and in this way both IsNull() and Halve() require Double() to be downstream of them. At the same time Double() doesn’t care if Halve() or is IsNull() are replaced with methods that are the same size (or less with nop padding) as long as the parameter list doesn’t change. If the Double() method was private, and there was a size limit on what methods Halve() and IsNull() could hold, then the access protection would ensure that the union object’s function was tied to the Double() method being executed last in a series of methods.

This is a peculiar way of getting a list of functions to execute in an order on a shared set of parameters. A more reasonable implementation might be to have an object that goes through a list of functions and manages parameters, execution, and whatnot. A rough implementation might look like

struct FunctionListManager
{
  typedef int (*func)(bool&, const int&, int&, int&);
  void AddOrReplaceFunction(int executionPosition, const func function)
  {
    if(function != NULL)
    {
      auto searchResult = functionList.find(executionPosition);
      if(searchResult == functionList.end())
        functionList.insert(std::make_pair(executionPosition, function));
      else
        searchResult->second = function;
    }
  }

  void ExecuteFunctions(bool& b, const int& i1, int& i2, int& i3) const
  {
    for(auto itr = functionList.cbegin(); itr != functionList.cend(); itr++)
    {
      itr->second(b, i1, i2, i3);
    }
  }
private:
  std::map<int, func> functionList;
};

There are two exposed methods and an exposed typedef for the function signature. After adding some functions to the list the ExecuteFunctions() method uses the the supplied parameters as inputs. If a function like Double() needed to be run last, then it could be explicitly run after the loop in ExecuteFunctions(). Despite the code overhead, this seems a lot safer than doing runtime checks on function sizes to determine if they can be added to the union without overwriting the private methods. This is also starting to look a bit like delegation.

There is even more complexity to consider if a method in the middle of the union, like Halve(), is declared private. Replacing Double() at runtime doesn’t make any sense anymore. Perhaps the rule is that everything beyond the first restricted access modifier is set to at least that level of protection?

I’ve almost beaten this subject to death, but the final blow is exploring actual method inheritance in a union object system. Consider the following code

union UnionObject1
{
  void IsNull(int& i);
};

union UnionObject2 : UnionObject1
{
  void Double(int& i);
};

If the memory space of IsNull() is overwritten by Double() then there is no discussion, but if the size of UnionObject2 is the sum of IsNull() and Double() then perhaps there is something to talk about. Execution order would likely be from base object up (like constructor order for objects), so IsNull() would execute first and then Double(). In multiple inheritance, either the size of the subclass is the sum of all the base classes or one of the base classes “wins” and overwrites the other. If they are the sum of the base classes then the execution order should be determined in some way by the inheritance order. On the other hand if one base class wins then the diamond problem no longer exists. I’m not sure there’s much more to explore here aside from vtable lookups but the treatment is derivative of what we’ve seen so far, either they’re overwritten (giving a sort of dynamic inheritance) or they are additive and must take the same parameters (already discussed).

So that seems to be the problems with extending an overlapping memory layout container to an object system. It seems fairly obvious why this hasn’t been done since identity is such a critical part of deterministic behavior. Perhaps there’s some place in the future for unions, but it won’t be because there’s an obvious inheritance system anxiously waiting to be extended to them.

Addendum

  • This probably isn’t worth mentioning but since everyone seems to use class in C++ examples I’d like to clarify any confusion. In the examples above, I use struct in the context of C++ member containers but I don’t explicitly mean that all objects of the types I describe must have the implicit access modifiers of a struct. The containers could just as well be classes and the examples would express the same ideas.