OOXML update
September 23rd, 2008 under Digital Rights, OSS, Politics, rengolin, Web. [ Comments: 1 ]

A while ago I posted about how crap Microsoft’s “Open” OOXML is (GPL violations and redundancy, among other things).

Now the battle seems to have heated up: IBM threatened to step out of ISO (via Slashdot) if they don’t roll back the OOXML approval.

Well, they’re big and still quite powerful. MS is big too, but falling apart. Other companies would probably join them, especially those that were against the approval.

Microsoft is not only failing technically, with Vista and their web platform, but also financially. They probably spent too much on .NET, Vista and stupid patents. At least the European Patent Office went on strike (I’m really amazed) because they are “granting as many patents as possible to gain financially”. I wonder if the US patent office has ever considered that…

Nevertheless, it’s always good when a big company stands up against something bad and restrictive (for the future), although the reasons are seldom for the greater good. Let’s hope for the best.


On Workflows, State Machines and Distributed Services
September 21st, 2008 under Devel, Distributed, rengolin. [ Comments: none ]

I’ve been working with workflow pipelines, directly and indirectly, for quite a while now, and one thing is clearly visible: the way most people do it, it doesn’t scale.

Workflows (aka. Pipelines)

If you have a sequence of, say, 10 processes in a row (or a graph of dependencies) and you need to run your data through each and every one in the correct order to get the expected result, the first thing that comes to mind is a workflow. Workflows focus on data flow rather than on decisions, so there is little you can do with the data when it doesn’t fit your model. The result of such an inconsistency generally falls into two categories: discard until further notice, or re-run the current step until the data is correct.

Workflows are normally created ad hoc, from a few steps, but things always grow out of proportion. If you’re lucky, those few steps get wrapped into scripts as things grow, and you end up with a workflow of workflows instead of a huge list of steps to take.

The benefit of a workflow approach is that you can run it at your own pace and check that the data is still intact after each step. The downside is that it’s too easy to wrongly blame the last step for a problem in your data. The quality of the data could have been deteriorating over time, and the current step is merely the one that exposed it. Also, checking your data after each step is a boring and nasty job, and no one can guarantee you’ll pick up every single bug anyway.

It becomes worse when the data volume grows to a level where you can’t just eyeball it anymore. You’ll have to write syntax checkers for the intermediate results, and you’ll end up with thousands of logs and scripts just to keep the workflow running. This is the third stage: a meta-workflow for the workflow of workflows.

But this is not all; the worst problem is still to come… Your data is increasing by the minute and you’ll have to split it up one day, sooner rather than later. But as long as you have to manually feed your workflow (and hope the intermediate checks work fine), you’ll have to split the data manually too. If your case is simple and you can just run multiple copies of each step in parallel with each chunk of data, you’re in heaven (or rather, in hell later on). But that’s not always the case…

Now you’ve reached a point where you have to maintain a meta-workflow for your workflow of workflows and manually manage the parallelism, collisions and checks of your data, only to find out that a particular piece of code was ignoring a horrendous bug in your data after it had already gone public.

If you want to add a new feature or change the order of two processes… well… 100mg of prozac and good luck!

Refine your Workflow

Step 1: Get rid of manual steps. The first rule is that there can be no manual step, ever. If the final result is wrong, you turn on the debugger, read the logs, find the problem, fix it and turn it off again. If you can’t afford to have wrong data go live, then write better checkers or reduce the complexity of your data.

One way to reduce the complexity is to split the workflow into smaller independent workflows, each of which generates only a fraction of your final data. In a mission-critical environment, you’re better off with a tenth of it broken than the whole thing. Nevertheless, try to reduce the complexity of your pipeline, data structures and dependencies. When you have no changes to make, re-think the whole workflow; I’m sure you’ll find lots of problems on every iteration.

Step 2: Unitize each step. The important rule here is: each step must process one, and only one, piece of data. How does it help? Scalability.

Consider a multiple-data workflow (fig. 1 below), where you have to send the whole data set through, every time. If one of the processes is much slower than the others, you’ll have to split your data for that particular step and join it again for the others. Splitting your data once at the beginning and running multiple pipelines at the same time is a nightmare, as you’ll have to deal with the scrambled error messages yourself, especially if you still have manual checks around.

Multiple data Workflow
Figure 1: Split/join is required for each parallel step

On the other hand, if only one unit passes through each step (fig. 2), there is no need to split or join them and you can run as many parallel processes as you want.

Single data Workflow
Figure 2: No need to split/join
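
To make this concrete, below is a minimal C++ sketch of step B from figure 2. All names here (DataUnit, worker_b, the stage queues) are hypothetical placeholders, not anything from a real system: each worker copy pulls exactly one unit from its input, processes it and pushes it to the next step, so scaling a slow step is just a matter of starting more copies of the same worker.

// Minimal sketch of the "one unit per step" idea: each worker takes a single
// data unit, processes it, and hands it to the next stage.
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct DataUnit { std::string payload; };

std::queue<DataUnit> stage_b_input;            // units waiting for step B
std::queue<DataUnit> stage_c_input;            // units waiting for step C
std::mutex in_mtx, out_mtx;

std::optional<DataUnit> take_one() {
    std::lock_guard<std::mutex> lock(in_mtx);
    if (stage_b_input.empty()) return std::nullopt;
    DataUnit u = stage_b_input.front();
    stage_b_input.pop();
    return u;
}

void worker_b() {                              // run as many copies as needed
    while (auto unit = take_one()) {
        unit->payload += " processed-by-B";    // the actual work goes here
        std::lock_guard<std::mutex> lock(out_mtx);
        stage_c_input.push(*unit);             // no split/join: just pass it on
    }
}

int main() {
    for (int i = 0; i < 10; ++i)
        stage_b_input.push({"unit-" + std::to_string(i)});
    std::vector<std::thread> copies;
    for (int i = 0; i < 2; ++i)                // two copies of B, as in fig. 2
        copies.emplace_back(worker_b);
    for (auto& t : copies) t.join();
}

Notice there is no split or join anywhere: the parallelism lives entirely in how many copies of worker_b you start.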

Step 3: Use simple and stupid automatic checks. If possible, don’t code anything at all.

If data must be identical on two sides, run a checksum (CRC sucks, MD5 is good). If the syntax needs to be correct, run a syntax checker, preferably an automatic check based on a schema or ontology. If your file format is so complex or specially crafted that it needs a custom syntax check, re-write it to use standards (XML and RDF are good).

Another important point about automatic checking is that you shouldn’t have to watch your inbox waiting for an error message. Having the error in the subject line is already a pain, but having to grep for errors inside the body of the message? Oh god! I’ve lost a few lives already because of that…

Only mail when a problem occurs, and only send the message related to that specific problem. It’s ok to send a weekly or monthly report just in case the automatic checks miss something. Go on and check the data for yourself once in a while, and don’t worry: if things really screw up, your users will let you know!

Automation

But what’s the benefit of letting your pipeline automatically compute individual values and check for consistency if you still have to push the buttons? What you want now is a way of having more time to watch the pipeline flow and fix architectural problems (and take longer tea breaks), rather than putting out fires all the time. To calculate how many buttons you’d press, just multiply the number of data blocks you have by the number of steps… It’s a loooong way…

If you still like pressing buttons, that’s ok. Just skip step 2 above and all will be fine. Otherwise, keep reading…

To automate your workflow you have two choices: either you fire one complete workflow for each data block, or you flow the data through a set of distributed services.

Complete Workflow: State Machines

If your data is small or infrequent, or you just like the idea, you can use a state machine to build a complete workflow for each block of data. The concept is rather simple: you receive the data and fire it through the first state. The machine carries on, sending the changed data through all the necessary states and, at the end, you have your final data in place, checked and correct.

UML is pretty good at defining state machines. For instance, you can use a state diagram to describe how your workflow is organised, class diagrams to show how each process is constructed, and sequence diagrams to describe how processes talk to each other (preferably using a single technology). With UML you can generate code and vice-versa, which makes it very practical for live changes and documentation purposes.

The State design pattern gives you a very simple model (each state is of the same type) with only one decision point for where to go next (when changing states): the state itself. This gives you the power to change the connections between the states easily and with very (very) little work. It’ll also save you a lot on prozac.
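
As an illustration, here is a minimal sketch of the State pattern applied to a workflow. The states (Transform, Validate) and the Data struct are made up for the example: every state shares the same interface, and each state alone decides which state runs next, so re-wiring the workflow means changing one return statement.

// A minimal State-pattern sketch for a workflow engine. All class and
// function names are hypothetical placeholders.
#include <iostream>
#include <memory>
#include <string>

struct Data { std::string content; bool valid = true; };

class State {
public:
    virtual ~State() = default;
    // Process the data and return the next state (nullptr when finished).
    virtual std::unique_ptr<State> run(Data& data) = 0;
};

class Validate : public State {
public:
    std::unique_ptr<State> run(Data& data) override {
        data.valid = !data.content.empty();
        return nullptr;                            // end of the workflow
    }
};

class Transform : public State {
public:
    std::unique_ptr<State> run(Data& data) override {
        data.content += " [transformed]";
        return std::make_unique<Validate>();       // decide the next step here
    }
};

void run_workflow(Data& data) {
    std::unique_ptr<State> state = std::make_unique<Transform>();
    while (state)
        state = state->run(data);                  // the machine just follows
}

int main() {
    Data d{"block-42"};
    run_workflow(d);
    std::cout << d.content << (d.valid ? " OK" : " FAILED") << "\n";
}

The driver loop never changes; only the states know about each other, which is exactly what keeps re-ordering cheap.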

If you got this far you’re really interested in workflows or state machines, so I assume you also have a workflow of your own. If you do, and it’s a mess, I also believe that you absolutely don’t want to re-code all your programs just to use UML diagrams, queues and state machines. But you don’t need to.

Most programming languages allow you to spawn a shell and execute an arbitrary command. You can then manage the inter-process administration (creating/copying files, fifos, flags, etc.), execute the process and, at the end, check the data and choose the next step (based on the current state of your data).
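
For example, a step wrapper can be as simple as the sketch below: it shells out to an existing program (the script name and file names are hypothetical) and uses only the exit status to decide whether the data moves to the next state or to an error state.

// Minimal sketch of driving an existing command-line step from the workflow
// engine without re-coding it.
#include <cstdlib>
#include <iostream>
#include <string>

// Returns true when the external step succeeded (exit status 0 on POSIX).
bool run_step(const std::string& command) {
    int status = std::system(command.c_str());   // runs the command via the shell
    return status == 0;
}

int main() {
    // Hypothetical legacy step: reads input.dat, writes out.dat.
    if (!run_step("./legacy_step.sh input.dat out.dat")) {
        std::cerr << "step failed, routing data to the error state\n";
        return 1;
    }
    std::cout << "step succeeded, moving to the next state\n";
    return 0;
}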

This methodology is simple, powerful and straightforward, but it comes at a price. When you have too many data blocks flowing through, you end up with lots of copies of the same process being created and destroyed all the time. You can, however, leave the machine running and only feed it data blocks, but this still doesn’t scale the way we wanted in step 2 above.

Layered Workflow: Distributed Services

Now comes the holy (but very complex) grail of workflows. If your data is huge, constantly flowing, CPU-demanding and with awkward steps in between, you need to program with parallelism in mind. The idea is not complex, but the implementation can be monstrous.

In figure 2 above, you have three processes, A, B and C, running in sequence, and process B has two copies running because it takes twice as long as A and C. It’s that simple: the longer a step takes to finish, the more copies of it you run in parallel to keep the flow constant. It’s like sewage pipes: rain water can flow through small pipes, but house waste needs much bigger ones; later on, when you’ve filtered out the rubbish, you can use small pipes again.

So, what’s so hard about implementing this scenario? Well, first you have to take into account that those processes will be competing for resources. If they’re on the same machine, CPU will be a problem: on a dual core, the four processes above will share CPU, not to mention memory, cache, bus, etc. If you use a cluster, they’ll all compete for network bandwidth and space on shared filesystems.

So, the general guidelines for designing robust distributed automatic workflows are:

  • Use a layered state architecture. Design your machine in layers, separate the layers onto machines or groups of machines, and put a queue or a load-balancer between each layer (state). This will allow you to scale much more easily, as you can add more hardware to a specific layer without impacting the others. It also allows you to switch off defective machines or do maintenance on them with zero down-time.
  • One process per core. Don’t spawn more than one process per CPU core, as this will hurt performance in more ways than you can probably imagine. It’s just not worth it. Reduce the number of processes or steps, or just buy more machines.
  • Use generic interfaces. Use the same queue / load-balancer for all state changes and, if possible, make their interfaces (protocols) identical, so the previous state doesn’t need to know what’s in the next one and you can switch from one to another at zero cost. Also, make the states implement the same interface in case you don’t need queues or load-balancers for a particular state (see the sketch after this list).
  • Include monitors and health checks in your design. With such a complex architecture it’s quite easy to miss machines or processes failing. Separate reports into INFO, WARNING and ERROR, give them priorities or different colours on a web interface, and mail or SMS only the errors to you.
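
As a toy illustration of the “generic interfaces” guideline, the sketch below uses a single templated channel type for every layer. In a real deployment the channel would be a message queue or load-balancer between machines; take this as a single-process stand-in with hypothetical names.

// Minimal sketch of the "generic interface" guideline: every layer talks to
// the next through the same push/pop interface, so a state never needs to
// know what sits behind it.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

template <typename T>
class Channel {                       // the same interface for every layer
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(mtx_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    T pop() {                         // blocks until an item is available
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::queue<T> q_;
    std::mutex mtx_;
    std::condition_variable cv_;
};

// Each layer only sees its input and output channels, never its neighbours.
void layer_b(Channel<std::string>& in, Channel<std::string>& out) {
    while (true)                      // a real service would also handle shutdown
        out.push(in.pop() + " [B]");  // real work would replace this line
}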

As you can see, by providing layered load-balancing, you’re getting performance and high availability for free!

Every time data piles up in one layer, just increase the number of processes in it. If a machine breaks, it’ll be taken out of the rotation (automatically if you’re using queues; ping-driven or MAC-driven for load-balancers). Updating the operating system, your software or anything else on the machine is just a matter of taking it out, updating, testing and deploying it back again. Your service will never be off-line.

Of course, to get this benefit for real you have to remove all single points of failure, which is rarely possible. But to a certain degree, you can get high performance, high availability, load-balancing and scalability at a reasonably low cost.

The initial cost is big, though. Designing such a complex network and its sub-systems, providing all the safety checks and organizing a big cluster is not an easy (nor cheap) task, and definitely not one for inexperienced software engineers. But once it’s done, it’s done.

More Info

Wikipedia is a wonderful source of information, especially for the computer science field. Search for workflows, inter-process communication, queues, load-balancers, commodity clusters, process-driven applications, message passing interfaces (MPI, PVM) and some functional programming like Erlang and Scala.

It’s also a good idea to look for UML tools that generate code and vice-versa, RDF ontologies and SPARQL queries. XML and XSD are also very good for validating data formats. If you haven’t yet, take a good look at design patterns, especially the State pattern.

Bioinformatics and internet companies have a particular affection for workflows (hence my experience), so you may find numerous examples in both fields.


Shortlist for Computer Awards Announced
September 15th, 2008 under rvincoletto, Technology. [ Comments: 2 ]

Just a quick note to say Computer Awards has announced their shortlist for this year… and guess what… they think I deserve to be among the eight finalists…

Who knows… The winners will be announced at a glittering prize-giving ceremony to be held on 5 November.

Fingers crossed!


Intel’s Game Demo Contest announces winners
September 15th, 2008 under Devel, OSS, rengolin, Software. [ Comments: none ]

…and our friend Mauro Persano won in two categories: 2nd in Intel graphics and 5th in best game on the go.

The game, Protozoa, is a retro, Petri-dish-style frenetic shooter where you blast the hell out of the bacteria, viruses and protozoa that come your way. You can play with a PS2-style (two analogue sticks) controller, one stick for movement and the other for shooting, or just use the keyboard. The traditional timed power-ups and megalomaniac explosions raise the sense of nostalgia even more.

You can download the latest Windows version here, but don’t worry, it also runs pretty well under Wine.

Have fun!


Calliper, chalks and the axe!
September 10th, 2008 under Algorithms, Devel, Physics, rengolin. [ Comments: none ]

Years ago, when I was still studying physics at university in São Paulo, a biochemist friend stated one of the biggest truths about physics: “A physicist is one who measures with a calliper, marks with chalk and cuts with an axe!”

I didn’t get it until I went through some courses that teach how to use the mathematical tools available: extrapolate to the most infamous case, then expand in a series, take the first term and prove the theorem. If you keep the second term, you’re doing fine physics (but floating-point precision will screw it up anyway).

Only recently have I learnt that some scientists are getting a lot done by going in the opposite direction. While most molecular dynamics simulations are going down to the quantum level, taking ages to reach a merely reasonable result (by quantum standards), some labs are actually beating them in both speed and quality of results by focusing on software optimizations rather than going berserk on the physical model.

It’s not like the infamous Russian pen (which is a hoax, by the way); it’s just the usual over-engineering we see when people are trying to impress the rest of the world. The Russians themselves have done some pretty dumb simplifications, like the cucumber picker, and over-engineering, like the Screw Drive, which in the end created more problems than it solved.

Clearly, the situation in software development can be just as bad. The complexity of over-designed interfaces or over-engineered libraries can render a project useless in a matter of months. Working around it would grow the big ball of mud, and re-writing from scratch would take a long time, not to mention introduce more bugs than it solves.

Things that I’ve recently seen as over-engineering were:

  • Immutable objects (as arguments or on polymorphic lists): when you build some objects and feed them to polymorphic immutable lists (when creating a bigger object, for instance) and then need to change them afterwards, you have to copy, change and write them back.
    This is not only annoying but utterly inefficient when the list is big (and thousands of objects need to be copied back and forth). The way out is to use the Bridge pattern and create several read-write implementations of your objects, lists and whatever else you have, but that also adds a lot of code complexity and maintenance.
    My view of the matter is: protect your stuff from other people, not from yourself. As in “library consistent” or “package-wise consistent”.
  • Abuse of “standard algorithms”: ok, one of the important concepts in software quality is the use of standards. I’ve written about it myself, over and over. But, like water, using no standards will kill your project just as surely as abusing them.
    So, if you create a std::set that gives you the power of log(N) searches, why the heck would you use std::find_if ( begin(), end(), MyComparator() );, which gives you linear searches? Worse, that find was run before each and every insert! std::set guarantees O(N·log N) for inserting N elements, but the “standard fail-safe assurance” was turning that into O(N²). For what? To ensure no duplicate entries were ever inserted into the set, which was yet another thing already guaranteed by the container in question (see the sketch after this list).
    All in all, the programmer was only trying to follow the same pattern over the entire code. A noble cause, indeed.
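
Here is the gist of it in code (with a hypothetical set of names standing in for the real data): the “fail-safe” linear search adds an O(N) scan per element, while std::set::insert already reports, in logarithmic time, whether the element was new.

// Illustration of the point above: std::set already rejects duplicates, so a
// linear std::find_if before every insert only adds an O(N) scan per element.
#include <algorithm>
#include <set>
#include <string>

std::set<std::string> names;

// Over-engineered: linear search before an insert that already checks.
void add_slow(const std::string& n) {
    if (std::find_if(names.begin(), names.end(),
                     [&n](const std::string& s) { return s == n; }) == names.end())
        names.insert(n);               // O(N) find + O(log N) insert
}

// Enough: insert() returns whether the element was actually added.
bool add_fast(const std::string& n) {
    return names.insert(n).second;     // O(log N), duplicates rejected for free
}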

Now, I’m still deciding which is worse: over-engineering or under-engineering… Funny, though, both have very similar effects on our lives…


Happy birthday to GNU
September 3rd, 2008 under OSS, rengolin, Unix/Linux. [ Comments: 2 ]

25 years and growing strong, happy birthday!


 

