HumbleBlogger

Sunday, June 15, 2008

Distributed OSGi — tilting at windmills

I've been an enthusiastic champion of the advent of OSGi as a Java-based enterprise computing modularity standard. However, I have a very definite compartmentalization in mind as to what role OSGi should play — I draw the line at a single running JVM instance. I don't want to use OSGi as a cornerstone for distributed computing modularity.

In a recent eWeek article, though, we see there is a movement afoot to do precisely that:

Distributed OSGi Effort Progresses

This excerpt makes it quite clear what this initiative is all about:
The Distributed OSGi effort also seeks to enable "a service running in one OSGi framework to invoke a service running in another, potentially remote, OSGi framework (meaning a framework in a JVM)," Newcomer wrote. As the current OSGi standard only defines how services talk to each other within a single JVM, "extensions are needed to allow services to talk with each other across multiple JVMs — thus the requirements for distributed OSGi on which the design is based," he said.
I see such an effort as yet another attempt to reinvent what has already been tried several times before — distributed object computing — by the likes of such technologies as DCOM, CORBA, and even JEE/EJB (a la Session beans and RMI). What do these technologies all have in common? They have fallen into disuse after befuddling and frustrating many a programmer (I personally fell on the DCOM sword back in the mid-'90s).

Interface Calling Behavior Semantics

I have a by-line that I sometimes sign my blog postings with:
Let me get right to it by lobbing some grenades: I recognize two arch evils in the software universe – synchronous RPC and remoted interfaces of distributed objects.
The sentiment being expressed here is that it is wrong-headed to try to make synchronous method invocation transparent over the network. Yet that is the grand illusion that all these distributed object solutions strive to accomplish.

The problem is that an object interface that may be perfectly usable in a local JVM context — where call chain invocation can be sub-millisecond — will not have the same behavior semantics when invoked over a network connection. The method invocation may take 15 to 50 milliseconds on a good day; or may fail due to low-level transport errors (which never existed when it was being invoked in a local JVM context); or just time out without completing the call; or even never return/complete at all in any acknowledged fashion.
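To make that difference concrete, here is a minimal sketch (the AccountService names are invented for illustration, not taken from any of the systems discussed here) of the same logical operation expressed first as a local contract and then as an RMI-style remote contract, where the possibility of transport failure becomes part of every method signature:

    import java.math.BigDecimal;
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Local contract: a call is sub-millisecond and can only fail for
    // business reasons.
    interface AccountService {
        BigDecimal getBalance(String accountId);
    }

    // RMI-style remote contract: the very same operation now declares that
    // it can fail for purely transport-related reasons, so every caller
    // must anticipate latency, timeouts, and calls that never complete.
    interface RemoteAccountService extends Remote {
        BigDecimal getBalance(String accountId) throws RemoteException;
    }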

The consuming software code that used a method in a local JVM context now has to be designed to anticipate a wide range of different situations, as the calling contract of the method is radically different depending on the context in which it is invoked. The advocates of distributed object computing, however, want us to believe in a grand illusion: that modules which were written and proven out for use in a local JVM context can be shifted to be consumed in a distributed context — and where the consuming code, presumably, doesn't have to be any the wiser (or at least not very much wiser).

Of course, some may recall JEE/EJB Session beans, where methods invoked via RMI had the potential to raise exceptions related to I/O transport errors. The upshot is that one had to design software from the outset to be a consumer of a "distributed" object interface vs. just a consumer of its local interface. Also, it was not long before EJB developers discovered that an object interface that made sense for a local JVM calling context would become a performance liability in a distributed computing context.

Using the interface as designed gave rise to chatty round trips over the network, where, due to latency and unreliability, the software became visibly sluggish. It is most dismaying to see all the enterprise software systems that have sluggish and problematic user interface behavior due to the application being written on a foundation of synchronous use of distributed object interfaces. That distributed object Kool-Aid that folks drank proved to be spiked heavily with toxic radiator fluid.
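A hedged sketch of the chattiness problem (the CustomerFacade names are invented for illustration): a fine-grained interface that is perfectly natural in-process turns every property access into a network round trip once remoted, whereas a coarse-grained call hands back everything the client needs in one trip.

    // Fine-grained interface: natural for local OOP use, but once each call
    // costs 15 to 50 milliseconds over the wire, rendering a single customer
    // screen becomes a parade of round trips.
    interface CustomerFacade {
        String getName(String customerId);
        String getAddress(String customerId);
        String getPhone(String customerId);
        int getOpenOrderCount(String customerId);
    }

    // Coarse-grained alternative: one round trip returns a snapshot that the
    // client can render without any further network chatter.
    final class CustomerSummary implements java.io.Serializable {
        final String name;
        final String address;
        final String phone;
        final int openOrderCount;
        CustomerSummary(String name, String address, String phone, int openOrderCount) {
            this.name = name;
            this.address = address;
            this.phone = phone;
            this.openOrderCount = openOrderCount;
        }
    }

    interface CustomerQueryService {
        CustomerSummary getCustomerSummary(String customerId);
    }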

In essence, EJB developers found out that an object-oriented approach to software design could not be transparently shifted to a distributed computing context. The OOP software systems that tried to make that leap devolved into a quagmire of issues that had to be battled.

Interface Version Management

On the basis of this one failing alone, distributed object computing has been one of the most colossal architectural mistakes of the last 15 years in the IT industry. Yet the failings don't stop there — the other equally perplexing obstacle to this undertaking is object interface version management.

I tend to think that the versioning dilemma is perhaps even more insidious than the synchronous distributed method call semantics problem. One encounters the issues of call semantics fairly early on; the interface versioning dilemma, however, arises gradually over time, then mounts up to become one of the greatest headaches one battles in trying to keep deployed distributed software systems coherent.

One of the popular agile OOP developer practices of recent years is frequent refactoring of code. Indeed, all of the popular IDEs in use are adept at assisting the developer with refactoring. Refactoring may well be a good thing in a context where the development team gets to control all the deployment pieces that are impacted by such change. However, in a distributed computing context, which is usually heterogeneous, refactoring would just be asking for misery.

Making changes to how distributed software systems interact, where multiple development teams and/or companies are involved, is a process undertaking akin to wisdom tooth extraction (the difficult kind where the dentist has to work for a few hours to break the tooth apart and bring it out in pieces). The simplest of changes can be tedious to negotiate, politics of some variety often intrudes, and it is often challenging to schedule releases that synchronize well with one another so that deployment can occur.

As such, the notion of versioning distributed object interfaces has been proffered as the means of coping with this. One team can come out with a new and improved interface to an existing object and deploy it unilaterally. Other parties that devise software consuming the interface of said object can catch up to the new interface as they are able. In the meantime, the older interface remains in place so that existing deployed software keeps working.

On paper versioning looks like a workable approach — the significant distributed object solutions have all had provision for versioning interfaces. In practice it can even be done a few times. However, for large, complex enterprise software systems, maintaining interface versions gets to be burdensome. One of the reasons is that by their very nature object interfaces are very brittle. The information state that is exchanged tends to be explicitly coupled to the interface signature. It can be hard to significantly (or meaningfully) evolve the software implementation of an object without impacting the object's interface. Once that happens, the interface has to be versioned — and a sense of dread then sets in.
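To see why the dread sets in, consider a minimal sketch (all of the names are hypothetical) of what unilateral interface evolution looks like in practice: the old interface has to be kept alive for deployed consumers while the new one accretes alongside it.

    // Version 1, already deployed to consumers we do not control.
    interface OrderServiceV1 {
        void placeOrder(String sku, int quantity);
    }

    // Version 2 adds a currency argument. Because the exchanged information
    // state is coupled to the method signature, the change cannot be absorbed
    // quietly: a new interface must be published and the old one carried
    // indefinitely for end-points that have not yet caught up.
    interface OrderServiceV2 {
        void placeOrder(String sku, int quantity, String currencyCode);
    }

    // The server implementation ends up dragging every version forward.
    class OrderServiceImpl implements OrderServiceV1, OrderServiceV2 {
        public void placeOrder(String sku, int quantity) {
            placeOrder(sku, quantity, "USD"); // guess a default for old callers
        }
        public void placeOrder(String sku, int quantity, String currencyCode) {
            // ... actual business logic ...
        }
    }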

OOP Does Not Work Well In Distributed Context

As to distributed object computing, quagmire is the operative word here. Quite simply, OOP does not really work in a distributed computing context — too many difficult issues entangle for it to be worthwhile.

It is fascinating to see new generations of software engineers getting lured into re-inventing distributed object computing over and over and over again. And a lot of the same computer industry corporate players get right behind these endeavors every time. These kinds of systems become complex and grandiose — thus they seem to be excellent sticky fly traps for luring in developers. Think of the legions of developers (and their companies) that have floundered on DCOM, CORBA, JEE/EJB, WS-* (death star) — and now let's add Distributed OSGi. Distributed object computing is our industry's Don Quixote tilting-at-windmills endeavor.

Asynchronous Messaging and Loose-Coupling Message Format Techniques

So what is the alternative?

When it comes to distributed computing interactions, try following these guidelines:
  • Design the software from the outset around asynchronous interactions. (Retrofitting synchronous software designs to a distributed context is a doomed undertaking that will yield pathetic/problematic results.)
  • Prefer messaging to RPC- or RMI-style interface invocation.
  • Attempt to use messaging formats that are intrinsically non-brittle. If designed with forethought, messaging formats can later be enhanced without impacting existing deployed software systems. (The entire matter of versioning interfaces can be dodged.)
  • Build in robust handling (and even auto-recovery) of transport related failure situations.
  • Never let the user interface become unresponsive due to transport sluggishness or failure situations. A user interface needs to remain responsive even when over-the-wire operations are coughing and puking. (So distributed interaction I/O always needs to be done asynchronously to the application's GUI thread.)
  • Keep transport error handling orthogonal to the semantics of messaging interactions. (Don't handle transport error and recovery at every place in the application where a messaging interaction is being done. Centralize that transport error handling code to one place and do it very well just one time.)
A key point to appreciate is that transport error handling and recovery is very different from the matter of business domain logic error processing. The two should not be muddled together. A lot of software written around RPC- or RMI-style interface method invocation does exactly that, though. Messaging-centric solutions usually permit a single transport error handler to be registered, so that this need be coded only once. The rest of the application code can then concentrate on business domain logic processing, as the sketch below illustrates.
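Here is a minimal JMS sketch of that separation, assuming a broker-supplied ConnectionFactory and a queue named "orders" (both placeholders): transport failures surface in one registered ExceptionListener, while the message listener concerns itself only with business content.

    import javax.jms.*;

    public class OrderMessagingEndpoint {

        public void start(ConnectionFactory connectionFactory) throws JMSException {
            Connection connection = connectionFactory.createConnection();

            // Transport error handling lives in exactly one place...
            connection.setExceptionListener(new ExceptionListener() {
                public void onException(JMSException e) {
                    // log, alert, schedule reconnection, etc. -- no business
                    // logic is entangled here
                    System.err.println("transport failure: " + e);
                }
            });

            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(session.createQueue("orders"));

            // ...while the message listener deals purely with business content.
            consumer.setMessageListener(new MessageListener() {
                public void onMessage(Message message) {
                    try {
                        String body = ((TextMessage) message).getText();
                        // hand the order document to business logic here
                    } catch (JMSException e) {
                        // a malformed message is a business-level concern,
                        // handled locally rather than by the transport handler
                    }
                }
            });

            connection.start();
        }
    }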

AJAX and Flex Remoting

A messaging approach is a sound basis for designing distributed application architecture — especially when one does not control all the end-points. More recently I have been designing architecture for Flex-based web RIA applications. In these apps, the client uses Flex asynchronous remote method invocation to access services on the server. Adobe's BlazeDS is embedded in the server to facilitate remoting calls, marshaling of objects between ActionScript3 and Java, message push to the client, and bridging to a JMS message broker.

You may think that I'm not exactly following my own advice. However, there are special circumstances at play:
  • Flex I/O classes support asynchronous invocation, so the operation does not block the main GUI thread of the app.
  • Flex I/O classes invoke closures to process return results; also, a fault closure can be supplied to handle transport related errors. Consequently a programmer can write one very robust fault handling closure and reuse it in all I/O operations. Thus Flex does an excellent job of segregating business logic processing from transport-related error handling.
  • Flex client .swf files are bundled together with their Java services into the same .war deployment archive. Consequently, the client-tier and the server-tier are always delivered together and thus will not drift out of version compliance.
The last point is worth further remark: the way the two distributed end-point tiers are bound together into a single deployment unit makes for a situation that is nearly like delivering an old-school monolithic application. Flex supports Automation classes so that tools such as RIATest can be used to automate testing of the client UI. Consequently the client can be scripted to regression test against service call interactions. Thus the deployment unit is the subject of due QA attention, and even though the two tiers do not pass through the same compiler, at least they actually get tested together prior to release.

If the service call interfaces are refactored, then the client can be refactored at the same time. Typically this is even done within the same IDE (such as Eclipse with the Flex Builder plugin). The Flex code and the Java code each have refactoring support. Flex unit tests could then be used within the development context to verify call interface validity.

Google GWT applications have a similar characteristic, where asynchronous method invocation is supported for invoking services on the server tier. Client-tier Java code and service-tier Java code are developed conjointly and can be packaged into a single deployment unit.
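In hedged outline (the QuoteService names are invented for illustration), the GWT client-side calling convention looks roughly like this: the async counterpart of a remote service returns nothing and instead takes a callback, so return results and transport failures arrive on separate paths and the browser UI thread is never blocked.

    import com.google.gwt.user.client.rpc.AsyncCallback;

    // Hypothetical async counterpart to a server-side QuoteService; GWT pairs
    // each remote service interface with an *Async interface whose methods
    // return void and take a callback instead.
    interface QuoteServiceAsync {
        void getQuote(String symbol, AsyncCallback<String> callback);
    }

    class QuotePanel {
        void refreshQuote(QuoteServiceAsync service) {
            service.getQuote("OSGI", new AsyncCallback<String>() {
                public void onSuccess(String quote) {
                    // business-result handling only; runs back on the UI thread
                }
                public void onFailure(Throwable caught) {
                    // transport/server failure path, kept apart from the
                    // success path and reusable across calls
                }
            });
        }
    }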

AJAX web applications may be another case where the client tier and the server tier are often deployed together.

Conclusion

So the takeaways from this discussion are:
  • If you can't control both end-points, stick with messaging and loosely coupled message format design. Be very mindful of the versioning dilemma from the outset and plan with forethought. The best outcome is to be able to evolve message formats without breaking end-points that have not yet been upgraded (see the sketch after this list). Try very hard to dodge the burden of version management of installed end-points.
  • If you can deliver both end-points from a common deployment unit, then method invocation of remote object interfaces can be okay. However, stick with the technologies that support asynchronous I/O. Separating the concern of business logic processing of return results from that of transport fault handling is the ideal.
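As a closing sketch of what loosely coupled message formats buy you (the element names here are invented), a consumer that pulls out only the elements it understands, and ignores anything it does not recognize, keeps working when the producer later adds fields; this is the XML "duck typing" approach linked below.

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import java.io.ByteArrayInputStream;

    public class OrderMessageReader {
        // Reads only the elements this consumer understands; any elements a
        // newer producer adds are simply ignored rather than treated as a
        // version break.
        public static String readSku(String xml) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            NodeList nodes = doc.getElementsByTagName("sku");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        }

        public static void main(String[] args) throws Exception {
            // An older consumer keeps working even though the producer has
            // added a <giftWrap> element it knows nothing about.
            String xml = "<order><sku>ABC-123</sku><giftWrap>true</giftWrap></order>";
            System.out.println(readSku(xml)); // prints ABC-123
        }
    }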
Related Links

Building Effective Enterprise Distributed Software Systems

XML data "duck typing"

Flex Async I/O vs Java and C# Explicit Threading