header image
Open Source and Profit
July 8th, 2013 under Corporate, Devel, Digital Rights, OSS, rengolin, World. [ Comments: 2 ]

I have written extensively about free, open source software as a way of life, and now reading back my own articles of the past 7 years, I realize that I was wrong on some of the ideas, or in the state of the open source culture within business and around companies.

I’ll make a bold statement to start, trying to get you interested in reading past the introduction, and I hope to give you enough arguments to prove I’m right. Feel free to disagree on the comments section.

The future of business and profit, in years to come, can only come if surrounded by free thoughts.

By free thoughts I mean free/open source software, open hardware, open standards, free knowledge (both free as in beer and as in speech), etc.

Past Ideas

I began my quest to understand the open source business model back in 2006, when I wrote that open source was not just software, but also speech. Having open source (free) software is not enough when the reasons why the software is free are not clear. The reason why this is so is that the synergy, that is greater than the sum of the individual parts, can only be achieved if people have the rights (and incentives) to reach out on every possible level, not just the source, or the hardware. I make that clear later on, in 2009, when I expose the problems of writing closed source software: there is no ecosystem in which to rely, so progress is limited and the end result is always less efficient, since the costs to make it as efficient are too great and would drive the prices of the software too high up to be profitable.

In 2008 I saw both sides of the story, pro and against Richard Stallman, on the views of the legitimacy of propriety control, being it via copyright licenses or proprietary software. I may have come a long way, but I was never against his idea of the perfect society, Richard Stallman’s utopia, or as some friends put it: The Star Trek Universe. The main difference between me and Stallman is that he believes we should fight to the last man to protect ourselves from the evil corporations towards software abuse, while I still believe that it’s impossible for them to sustain this empire for too long. His utopia will come, whether they like it or not.

Finally, in 2011 I wrote about how copying (and even stealing) is the only business model that makes sense (Microsoft, Apple, Oracle etc are all thieves, in that sense) and the number of patent disputes and copyright infringement should serve to prove me right. Last year I think I had finally hit the epiphany, when I discussed all these ideas with a friend and came to the conclusion that I don’t want to live in a world where it’s not possible to copy, share, derive or distribute freely. Without the freedom to share, our hands will be tied to defend against oppression, and it might just be a coincidence, but in the last decade we’ve seen the biggest growth of both disproportionate propriety protection and disproportional governmental oppression that the free world has ever seen.

Can it be different?

Stallman’s argument is that we should fiercely protect ourselves against oppression, and I agree, but after being around business and free software for nearly 20 years, I so far failed to see a business model in which starting everything from scratch, in a secret lab, and releasing the product ready for consumption makes any sense. My view is that society does partake in an evolutionary process that is ubiquitous and compulsory, in which it strives to reduce the cost of the whole process, towards stability (even if local), as much as any other biological, chemical or physical system we know.

So, to prove my argument that an open society is not just desirable, but the only final solution, all I need to do is to show that this is the least energy state of the social system. Open source software, open hardware and all systems where sharing is at the core should be, then, the least costly business models, so to force virtually all companies in the world to follow suit, and create the Stallman’s utopia as a result of the natural stability, not a forced state.

This is crucial, because every forced state is non-natural by definition, and every non-natural state has to be maintained by using resources that could be used otherwise, to enhance the quality of the lives of the individuals of the system (being them human or not, let’s not block our point of view this early). To achieve balance on a social system we have to let things go awry for a while, so that the arguments against such a state are perfectly clear to everyone involved, and there remains no argument that the current state is non-optimal. If there isn’t discomfort, there isn’t the need for change. Without death, there is no life.


Of all the bad ideas us humans had on how to build a social system, capitalism is probably one of the worst, but it’s also one of the most stable, and that’s because it’s the closest to the jungle rule, survival of the fittest and all that. Regulations and governments never came to actually protect the people, but as to protect capitalism from itself, and continue increasing the profit of the profitable. Socialism and anarchy rely too much on forced states, in which individuals have to be devoid of selfishness, a state that doesn’t exist on the current form of human beings. So, while they’re the product of amazing analysis of the social structure, they still need heavy genetic changes in the constituents of the system to work properly, on a stable, least-energy state.

Having less angry people on the streets is more profitable for the government (less costs with security, more international trust in the local currency, more investments, etc), so panis et circenses will always be more profitable than any real change. However, with more educated societies, result from the increase in profits of the middle class, more real changes will have to be made by governments, even if wrapped in complete populist crap. One step at a time, the population will get more educated, and you’ll end up with more substance and less wrapping.

So, in the end, it’s all about profit. If not using open source/hardware means things will cost more, the tendency will be to use it. And the more everyone uses it, the less valuable will be the products that are not using it, because the ecosystem in which applications and devices are immersed in, becomes the biggest selling point of any product. Would you buy a Blackberry Application, or an Android Application? Today, the answer is close to 80% on the latter, and that’s only because they don’t use the former at all.

It’s not just more expensive to build Blackberry applications, because the system is less open, the tools less advanced, but also the profit margins are smaller, and the return on investment will never justify. This is why Nokia died with their own App store, Symbian was not free, and there was a better, free and open ecosystem already in place. The battle had already been lost, even before it started.

But none of that was really due to moral standards, or Stallman’s bickering. It was only about profit. Microsoft dominated the desktop for a few years, long enough to make a stand and still be dominant after 15 years of irrelevance, but that was only because there was nothing better when they started, not by a long distance. However, when they tried to flood the server market, Linux was not only already relevant, but it was better, cheaper and freer. The LAMP stack was already good enough, and the ecosystem was so open, that it was impossible for anyone with a closed development cycle to even begin to compete on the same level.

Linux became so powerful that, when Apple re-defined the concept of smartphones with the iPhone (beating Nokia’s earlier attempts by light-years of quality), the Android system was created, evolved and dominated in less than a decade. The power to share made possible for Google, a non-device, non-mobile company, to completely outperform a hardware manufacturer in a matter of years. If Google had invented a new OS, not based on anything existent, or if they had closed the source, like Apple did with FreeBSD, they wouldn’t be able to compete, and Apple would still be dominant.

Do we need profit?

So, the question is: is this really necessary? Do we really depend on Google (specifically) to free us from the hands of tyrant companies? Not really. If it wasn’t Google, it’d be someone else. Apple, for a long time, was the odd guy in the room, and they have created an immense value for society: they gave us something to look for, they have educated the world on what we should strive for mobile devices. But once that’s done, the shareable ecosystem learns, evolves and dominate. That’s not because Google is less evil than Apple, but because Android is more profitable than iOS.

Profit here is not just the return on investment that you plan on having on a specific number of years, but adding to that, the potential that the evolving ecosystem will allow people to do when you’ve long lost the control over it. Shareable systems, including open hardware and software, allow people far down in the planing, manufacturing and distributing process to still have profit, regardless of what were your original intentions. One such case is Maddog’s Project Cauã.

By using inexpensive RaspberryPis, by fostering local development and production and by enabling the local community to use all that as a way of living, Maddog’s project is using the power of the open source initiative by completely unrelated people, to empower the people of a country that much needs empowering. That new class of people, from this and other projects, is what is educating the population of the world, and what is allowing the people to fight for their rights, and is the reason why so many civil uprisings are happening in Brazil, Turkey, Egypt.


All that creates instability, social unrest, whistle-blowing gone wrong (Assange, Snowden), and this is a good thing. We need more of it.

It’s only when people feel uncomfortable with how the governments treat them that they’ll get up their chairs and demand for a change. It’s only when people are educated that they realise that oppression is happening (since there is a force driving us away from the least-energy state, towards enriching the rich), and it’s only when these states are reached that real changes happen.

The more educated society is, the quicker people will rise to arms against oppression, and the closer we’ll be to Stallman’s utopia. So, whether governments and the billionaire minority likes or not, society will go towards stability, and that stability will migrate to local minima. People will rest, and oppression will grow in an oscillatory manner until unrest happens again, and will throw us into yet another minimum state.

Since we don’t want to stay in a local minima, we want to find the best solution not just a solution, having it close to perfect in the first attempt is not optimal, but whether we get it close in the first time or not, the oscillatory nature of social unrest will not change, and nature will always find a way to get us closer to the global minimum.


Is it possible to stay in this unstable state for too long? I don’t think so. But it’s not going to be a quick transition, nor is it going to be easy, nor we’ll get it on the first attempt.

But more importantly, reaching stability is not a matter of forcing us to move towards a better society, it’s a matter of how dynamic systems behave when there are clear energetic state functions. In physical and chemical systems, this is just energy, in biological systems this is the propagation ability, and in social systems, this is profit. As sad as it sounds…

Uno score keeper
March 31st, 2013 under Devel, OSS, rengolin, Software. [ Comments: none ]

With the spring not coming soon, we had to improvise during the Easter break and play Uno every night. It’s a lot of fun, but it can take quite a while to find a piece of clean paper and a pen that works around the house, so I wondered if there was an app for that. It turns out, there wasn’t!

There were several apps to keep card game scores, but every one was specific to the game, and they had ads, and wanted access to the Internet, so I decided it was worth it writing one myself. Plus, that would finally teach me to write Android apps, a thing I was delaying to get started for years.

The App

Adding new players

Card Game Scores

The app is not just a Uno score keeper, it’s actually pretty generic. You just keep adding points until someone passes the threshold, when the poor soul will be declared a winner or a loser, depending on how you set up the game. Since we’re playing every night, even the 30 seconds I spent re-writing our names was adding up, so I made it to save the last game in the Android tuple store, so you can retrieve it via the “Last Game” button.

It’s also surprisingly easy to use (I had no idea), but if you go back and forth inside the app, it cleans the game and start over a new one, with the same players, so you can go on as many rounds as you want. I might add a button to restart (or leave the app) when there’s a winner, though.

I’m also thinking about printing the names in order in the end (from victorious to loser), and some other small changes, but the way it is, is good enough to advertise and see what people think.

If you end up using, please let me know!

Download and Source Code

The app is open source (GPL), so rest assured it has no tricks or money involved. Feel free to download it from here, and get the source code at GitHub.

Distributed Compilation on a Pandaboard Cluster
February 13th, 2013 under Devel, Distributed, OSS, rengolin. [ Comments: 2 ]

This week I was experimenting with the distcc and Ninja on a Pandaboard cluster and it behaves exactly as I expected, which is a good thing, but it might not be what I was looking for, which is not. ;)

Long story short, our LLVM buildbots were running very slow, from 3 to 4.5 hours to compile and test LLVM. If you consider that at peak time (PST hours) there are up to 10 commits in a single hour, the buildbot will end up testing 20-odd patches at the same time. If it breaks in unexpected ways, of if there is more than one patch on a given area, it might be hard to spot the guilty.

We ended up just avoiding the make clean step, which put us around 15 minutes build+tests, with the odd chance of getting 1 or 2 hours tops, which is a great deal. But one of the alternatives I was investigating is to do a distributed build. More so because of the availability of cluster nodes with dozens of ARM cores inside, we could make use of such a cluster to speed up our native testing, even benchmarking on a distributed way. If we do it often enough, the sample might be big enough to account for the differences.

The cluster

So, I got three Pandaboards ES (dual Cortex-A9, 1GB RAM each) and put the stock Ubuntu 12.04 on them and installed the bare minimum (vim, build-essential, python-dev, etc), upgraded to the latest packages and they were all set. Then, I needed to find the right tools to get a distributed build going.

It took a bit of searching, but I ended up with the following tool-set:

  • distcc: The distributed build dispatcher, which knows about the other machines in the cluster and how to send them jobs and get the results back
  • CMake: A Makefile generator which LLVM can use, and it’s much better than autoconf, but can also generate Ninja files!
  • Ninja: The new intelligent builder which not only is faster to resolve dependencies, but also has a very easy way to change the rules to use distcc, and also has a magical new feature called pools, which allow me to scale job types independently (compilers, linkers, etc).

All three tools had to be compiled from source. Distcc’s binary distribution for ARM is too old, CMake’s version on that Ubuntu couldn’t generate Ninja files and Ninja doesn’t have binary distributions, full stop. However, it was very simple to get them interoperating nicely (follow the instructions).

You don’t have to use CMake, there are other tools that generate Ninja files, but since LLVM uses CMake, I didn’t have to do anything. What you don’t want is to generate the Ninja files yourself, it’s just not worth it. Different than Make, Ninja doesn’t try to search for patterns and possibilities (this is why it’s fast), so you have to be very specific on the Ninja file on what you want to accomplish. This is very easy for a program to do (like CMake), but very hard and error prone for a human (like me).


To use distcc is simple:

  1. Replace the compiler command by distcc compiler on your Ninja rules;
  2. Set the environment variable DISTCC_HOSTS to the list of IPs that will be the slaves (including localhost);
  3. Start the distcc daemon on all slaves (not on the master): distccd --daemon --allow <MasterIP>;
  4. Run ninja with the number of CPUs of all machines + 1 for each machine. Ex: ninja -j6 for 2 Pandaboards.

A local build, on a single Pandaboard of just LLVM (no Clang, no check-all) takes about 63 minutes. With distcc and 2 Pandas it took 62 minutes!

That’s better, but not as much as one would hope for, and the reason is a bit obvious, but no less damaging: The Linker! It took 20 minutes to compile all of the code, and 40 minutes to link them into executable. That happened because while we had 3 compilation jobs on each machine, we had 6 linking jobs on a single Panda!

See, distcc can spread the compilation jobs as long as it copies the objects back to the master, but because a linker needs all objects in memory to do the linking, it can’t do that over the network. What distcc could do, with Ninja’s help, is to know which objects will be linked together, and keep copies of them on different machines, so that you can link on separate machines, but that is not a trivial task, and relies on an interoperation level between the tools that they’re not designed to accept.

Ninja Pools

And that’s where Ninja proved to be worth its name: Ninja pools! In Ninja, pools are named resources that bundle together with a specific level of scalability. You can say that compilers scale free, but linkers can’t run more than a handful. You simply need to create a pool called linker_pool (or anything you want), give it a depth of, say, 2, and annotate all linking jobs with that pool. See the manual for more details.

With the pools enabled, a distcc build on 2 Pandaboards took exactly 40 minutes. That’s 33% of gain with double the resources, not bad. But, how does that scale if we add more Pandas?

How does it scale?

To get a third point (and be able to apply a curve fit), I’ve added another Panda and ran again, with 9 jobs and linker pool at 2, and it finished in 30 minutes. That’s less than half the time with three times more resources. As expected, it’s flattening out, but how much more can we add to be profitable?

I don’t have an infinite number of Pandas (nor I want to spend all my time on it), so I just cheated and got a curve fitting program (xcrvfit, in case you’re wondering) and cooked up an exponential that was close enough to the points and use the software ability to do a best fit. It came out with 86.806*exp(-0.58505*x) + 14.229, which according to Lybniz, flattens out after 4 boards (about 20 minutes).

Pump Mode

Distcc has a special mode called pump mode, in which it pushes with the C file, all headers necessary to compile it solely on the node. Normally, distcc will pre-compile on the master node and send the pre-compiled result to the slaves, which convert to object code. According to the manual, this could improve the performance 10-fold! Well, my results were a little less impressive, actually, my 3-Panda cluster finished in just about 34 minutes, 4 minutes more than without push mode, which is puzzling.

I could clearly see that the files were being compiled in the slaves (distccmon-text would tell me that, while there was a lot of “preprocessing” jobs on the master before), but Ninja doesn’t print times on each output line for me to guess what could have slowed it down. I don’t think there was any effect on the linker process, which was still enabled in this mode.


Simply put, both distcc and Ninja pools have shown to be worthy tools. On slow hardware, such as the Pandas, distributed builds can be an option, as long as you have a good balance between compilation and linking. Ninja could be improved to help distcc to link on remote nodes as well, but that’s a wish I would not press on the team.

However, scaling only to 4 boards will reduce a lot of the value for me, since I was expecting to use 16/32 cores. The main problem is again the linker jobs working solely on the master node, and LLVM having lots and lots of libraries and binaries. Ninja’s pools can also work well when compiling LLVM+Clang on debug mode, since the objects are many times bigger, and even on above average machine you can start swapping or even freeze your machine if using other GUI programs (browsers, editors, etc).

In a nutshell, the technology is great and works as advertised, but with LLVM it might not be yet the thing. It’s still more profitable to get faster hardware, like the Chromebooks, that are 3x faster than the Pandas and cost only marginally more.

Would also be good to know why the pump mode has regressed in performance, but I have no more time to spend on this, so I leave as a exercise to the reader. ;)

LLVM Vectorizer
February 12th, 2013 under Algorithms, Devel, rengolin. [ Comments: 2 ]

Now that I’m back working full-time with LLVM, it’s time to get some numbers about performance on ARM.

I’ve been digging the new LLVM loop vectorizer and I have to say, I’m impressed. The code is well structured, extensible and above all, sensible. There are lots of room for improvement, and the code is simple enough so you can do it without destroying the rest or having to re-design everything.

The main idea is that the loop vectorizer is a Loop Pass, which means that if you register this pass (automatically on -O3, or with -loop-vectorize option), the Pass Manager will run its runOnLoop(Loop*) function on every loop it finds.

The three main components are:

  1. The Loop Vectorization Legality: Basically identifies if it’s legal (not just possible) to vectorize. This includes checking if we’re dealing with an inner loop, and if it’s big enough to be worth, and making sure there aren’t any conditions that forbid vectorization, such as overlaps between reads and writes or instructions that don’t have a vector counter-part on a specific architecture. If nothing is found to be wrong, we proceed to the second phase:
  2. The Loop Vectorization Cost Model: This step will evaluate both versions of the code: scalar and vector. Since each architecture has its own vector model, it’s not possible to create a common model for all platforms, and in most cases, it’s the special behaviour that makes vectorization profitable (like 256-bits operations in AVX), so we need a bunch of cost model tables that we consult given an instruction and the types involved. Also, this model doesn’t know how the compiler will lower the scalar or vectorized instructions, so it’s mostly guess-work. If the vector cost (normalized to the vector size) is less than the scalar cost, we do:
  3. The Loop Vectorization: Which is the proper vectorization, ie. walking through the scalar basic blocks, changing the induction range and increment, creating the prologue and epilogue, promote all types to vector types and change all instructions to vector instructions, taking care to leave the interaction with the scalar registers intact. This last part is a dangerous one, since we can end up creating a lot of copies from scalar to vector registers, which is quite expensive and was not accounted for in the cost model (remember, the cost model is guess-work based).

All that happens on a new loop place-holder, and if all is well at the end, we replace the original basic blocks by the new vectorized ones.

So, the question is, how good is this? Well, depending on the problems we’re dealing with, vectorizers can considerably speed up execution. Especially iterative algorithms, with lots of loops, like matrix manipulation, linear algebra, cryptography, compression, etc. In more practical terms, anything to do with encoding and decoding media (watching or recording videos, pictures, audio), Internet telephones (compression and encryption of audio and video), and all kinds of scientific computing.

One important benchmark for that kind of workload is Linpack. Not only Linpack has many examples of loops waiting to be vectorized, but it’s also the benchmark that defines the Top500 list, which classifies the fastest computers in the world.


So, both GCC and Clang now have the vectorizers turned on by default with -O3, so comparing them is as simple as compiling the programs and see them fly. But, since I’m also interested in seeing what is the performance gain with just the LLVM vectorizer, I also disabled it and ran a clang with only  -O3, no vectorizer.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler Opt Avg. MFLOPS Diff
Clang -O3 2413 0.0%
GCC -O3 vectorize 2421 0.3%
Clang -O3 vectorize 3346 38.6%

This is some statement! The GCC vectorizer exists for a lot longer than LLVM’s and has been developed by many vectorization gurus and LLVM seems to easily beat GCC in that field. But, a word of warning, Linpack is by no means representative of all use cases and user visible behaviour, and it’s very likely that GCC will beat LLVM on most other cases. Still, a reason to celebrate, I think.

This boost mean that, for many cases, not only the legality if the transformations are legal and correct (or Linpack would have gotten wrong results), but they also manage to generate faster code at no discernible cost. Of course, the theoretical limit is around 4x boost (if you manage to duplicate every single scalar instruction by a vector one and the CPU has the same behaviour about branch prediction and cache, etc), so one could expect a slightly higher number, something on the order of 2x better.

It depends on the computation density we’re talking about. Linpack tests specifically the inner loops of matrix manipulation, so I’d expect a much higher ratio of improvement, something around 3x or even closer to 4x. VoIP calls, watching films and listening to MP3 are also good examples of densely packet computation, but since we’re usually running those application on a multi-task operating system, you’ll rarely see improvements higher than 2x. But general applications rarely spend that much time on inner loops (mostly waiting for user input and then doing a bunch of unrelated operations, hardly vectorizeable).

Another important aspect of vectorization is that it saves a lot of battery juice. MP3 decoding doesn’t really matter if you finish in 10 or 5 seconds, as long as the music doesn’t stop to buffer. But taking 5 seconds instead of 10 means that on the other 5 seconds the CPU can reduce its voltage and save battery. This is especially important in mobile devices.

What about ARM code?

Now that we know the vectorizer works well, and the cost model is reasonably accurate, how does it compare on ARM CPUs?

It seems that the grass is not so green on this side, at least not at the moment. I have reports that on ARM it also reached the 40% boost similar to Intel, but what I saw was a different picture altogether.

On a Samsung Chromebook (Cortex-A15) I got:

Compiler Opt Avg. MFLOPS Diff
Clang -O3 796 0.0%
GCC -O3 vectorize 736 -8.5%
Clang -O3 vectorize 773 -2.9%

The performance regression can be explained by the amount of scalar code intermixed with vector code inside the inner loops as a result of shuffles (movement of data within the vector registers and between scalar and vector registers) not being lowered correctly. This most likely happens because the LLVM back-end relies a lot on pattern-matching for instruction selection (a good thing), but the vectorizers might not be producing the shuffles in the right pattern, as expected by each back-end.

This can be fixed by tweaking the cost model to penalize shuffles, but it’d be good to see if those shuffles aren’t just mismatched against the patterns that the back-end is expecting. We will investigate and report back.


Got results for single precision floating point, which show a greater improvement on both Intel and ARM.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler Opt Avg. MFLOPS Diff
Clang -O3 2530 0.0%
GCC -O3 vectorize 3484 37.7%
Clang -O3 vectorize 3996 57.9%

On a Samsung Chromebook (Cortex-A15) I got:

Compiler Opt Avg. MFLOPS Diff
Clang -O3 867 0.0%
GCC -O3 vectorize 788 -9.1%
Clang -O3 vectorize 1324 52.7%

Which goes on to show that the vectorizer is, indeed, working well for ARM, but the costs of using the VFP/NEON pipeline outweigh the benefits. Remember than NEON vectors are only 128-bits wide and VFP only 64-bit wide, and NEON has no double precision floating point operations, so they’ll only do one double precision floating point operations per cycle, so the theoretical maximum depends on the speed of the soft-fp libraries.

So, in the future, what we need to be working is the cost model, to make sure we don’t regress in performance, and try to get better algorithms when lowering vector code (both by making sure we match the patterns that the back-end is expecting, and by just finding better ways of vectorizing the same loops).


Without further benchmarks it’s hard to come to a final conclusion, but it’s looking good, that’s for sure. Since Linpack is part of the standard LLVM test-suite benchmarks, fixing this and running it regularly on ARM will at least avoid any further regressions… Now it’s time to get our hands dirty!


K-means clustering
June 20th, 2012 under Algorithms, Devel, rengolin. [ Comments: none ]

Clustering algorithms can be used with many types of data, as long as you have means to distribute them in a space, where there is the concept of distance. Vectors are obvious choices, but not everything can be represented into N-dimensional points. Another way to plot data, that is much closer to real data, is to allow for a large number of binary axis, like tags. So, you can cluster by the amount of tags the entries share, with the distance being (only relative to others) the proportion of these against the non-sharing tags.

An example of tag clustering can be viewed on Google News, an example of clustering on Euclidean spaces can be viewed on the image above (full code here). The clustering code is very small, and the result is very impressive for such a simple code. But the devil is in the details…

Each red dots group is generated randomly from a given central point (draws N randomly distributed points inside a circle or radius R centred at C). Each centre is randomly placed, and sometimes their groups collide (as you can see on the image), but that’s part of the challenge. To find the groups, and their centres, I throw random points (with no knowledge of the groups’ centres) and iterate until I find all groups.

The iteration is very simple, and consists of two steps:

  1. Assignment Step: For each point, assign it to the nearest mean. This is why you need the concept of distance, and that’s a tricky part. With Cartesian coordinates, it’s simple.
  2. Update Step: Calculate the real mean of all points belonging to each mean point, and update the point to be at it. This is basically moving the supposed (randomly guessed) mean to it’s rightful place.

On the second iteration, the means, that were randomly selected at first, are now closer to a set of points. Not necessarily points in the same cluster, but the cluster that has more points assigned to any given mean will slowly steal it from the others, since it’ll have more weight when updating it on step 2.

If all goes right, the means will slowly move towards the centre of each group and you can stop when the means don’t move too much after the update step.

Many problems will arise in this simplified version, for sure. For instance, if the mean is exactly in between two groups, and both pull it to their centres with equally strong forces, thus never moving the mean, thus the algorithm thinks it has already found its group, when in fact, it found two. Or if the group is so large that it ends up with two or more means which it belongs, splitting it into many groups.

To overcome these deficiencies, some advanced forms of K-means take into account the shape of the group during the update step, sometimes called soft k-means. Other heuristics can be added as further steps to make sure there aren’t two means too close to each other (relative to their groups’ sizes), or if there are big gaps between points of the same group, but that kind of heuristics tend to be exponential in execution, since they examine every point of a group in relation to other points of the same group.

All in all, still an impressive performance for such a simple algorithm. Next in line, I’ll try clustering data distributed among many binary axis and see how k-means behave.

The doctor and the programmer
October 23rd, 2011 under Devel, Life, rengolin, World. [ Comments: none ]

About 15 years ago, when I was working on a dodgy Brazilian firm, I had a conversation with an older programmer that I never forgot. He said something along the lines of:

Medicine is way easier than computer science. Doctors are still using the same books, written decades ago, while we have to buy only the latest ones. Our reality gets rewritten every five years or so, and by the time you leave university, you’re already outdated.

There is a lot of merit in this argument. Even though the common cold’s strain changes every week, its symptoms are exactly the same. Cancer, HIV, malaria, Lupus and other big diseases are treated more or less the same way as they were when first treated, and GPs still give you aspirin/paracetamol/ibuprofen for any kind of pain.

Human anatomy, physiology and psychology doesn’t change at all. Broken legs, runny noses and swollen tonsils are the same on every person and they require the same treatment for everyone.

While doctors kill one patient when they do make a mistake, developers can kill hundreds if they happen to introduce a bug on the Airbus fault-tolerant fail-over system.

But recently, I had to re-think this through, and I have to say I’m not 100% in agreement any more.

When programmers change jobs, they get a few weeks to get used to the new system. They might get months to actually be productive as their peers and will mature within a few years working with the same piece of software. Programmers can run tests, regression tests and usability tests, unit tests, etc, which is something a bit complicated with human beings.

When a doctor gets a new patient, it’s like getting a new programming job. It’s the same language, but the system is completely different. It might take you weeks to start getting the prognosis right, and in a few months you’ll be able to get it right before your patient even tells you the symptoms.

The similarities are remarkable

Consultant programmers get new systems to work on every week. Like ER doctors. They do what they can, with the time they have and the solution is most of the time acceptable. A more stable doctor or programmer might look at the job and cry, but the job was done, the fire was put off and the “patient” is out of the door.

Family doctors, that were there when you were born, know you better than yourself. They know when your symptoms are only psychological and what cause that and when it’s going to go away. They rarely touch the “system”, but normally fix an unrelated bug and you’re good as new.

But not everybody is lucky enough to have such doctors. They are expensive, and there aren’t enough good doctors in the health system of any country to account for every family. Even if the doctor share a hundred families, it’s still very far from enough.

This is the reason that systems fail, and get half-fixed, and why most GPs will send you home with a paracetamol unless you’re dying in front of them.

If doctors and programmers had such a different world, the emergent behaviour wouldn’t be that similar, I believe.

FreeCell puzzles solver API
September 25th, 2011 under Algorithms, Devel, Fun, Games, rengolin. [ Comments: 2 ]

This is a little pet project I did a while ago. It’s a FreeCell puzzle‘s solver API.

The idea is to provide a basic validation engine and board management (pretty much like my old chess validation), so people can write FreeCell solvers on top of it. It has basic board setup (of multiple sizes), movement legalisation, and a basic Solver class, which you must derive to create your own solvers.

There’s even a BruteFroceSolver that can solve a few small boards, and that gives you an idea on how to create your own solvers. However, the API is not clear enough yet that children could start playing with it, and that’s the real goal of this project: to get kids interested in solving complex optimisation problems in an easy way.

Freecell is a perfect game for it. Most boards can be solved (only a handful of them were proven – by exhaustion – not solvable), some movements can be rolled back and you can freely re-use cards that have already been placed into the foundations (their final destination) back in the game again.

It’s out of the scope of this project to produce a full-featured graphic interface for kids, but making the API easy enough so they understand the concepts without dragging themselves into idiosyncrasies of C++ is important.

Compiler optimisations

The reason why I did this was to make some of the optimisations compiler engineers have to do more appealing to non-compiler engineers or children with a taste for complex problems. But what does this have to do with compilers? The analogy is a bit far-fetching and somewhat reverse, but it’s interesting nevertheless and it was worth the shot.

Programming languages are transformed into graphs inside the compiler, which should represent the intentions of the original programmer. This graphs are often optimised multiple times until you end up with a stream of instructions representing the source code in the machine language.

Ignore for now the optimisations on those graphs, and focus on the final part: selecting machine instructions that can represent that final graph in the machine language (assembly). This selection can pick any assembly instruction at will, but it has to put them into a very specific order to represent the same semantics (not just syntactic) of the original program. Since many instructions have side effects, pipeline interactions, special flags set or cleaned up, it’s not trivial to produce correct code if you don’t check and re-check all conditions every time. This is a known complex optimisation problem and can be responsible for changes in speed or code size in orders of magnitude.

What does it have to do with the Freecell puzzle? Well, in Freecell, you have a number of cards and you have to put them in a specific order, just like assembly instructions. But in this case, the analogy is reverse: the “sequence” is trivial, but the “instructions” are hard to get.

There are other similarities. For example, you have four free cells, and they can only hold one value at a time. They are similar to registers, and manipulating them, gives you a good taste of how hard it is to do register scheduling when building the assembly result. But in this case, it’s much harder to spill (move the data back to memory, or in this case, card back to cascades), since there are strict rules on how to move cards around.

Reusing cards from the foundations is similar to expanding single instructions into a sequence of them in order to circumvent pipeline stalls. In real compilers you could expand a multiply+add (very useful for digital signal processing) into two instructions: multiply and add, if that gives you some advantage on special cases on special chips. In Freecell, you can use a 9 on top of a 10, to move an 8 from another cascade and free up a card that you need to clean up your freecells (registers).

I’m sure you can find many more similarities, even if you have to skew the rules a bit (or completely reverse them), but that’s not the point. The point is to interest people into complex optimisation techniques without the hassle of learning a whole new section of computer science, especially if that section puts fear in most people in the first place.

2011 European User Group Meeting
July 19th, 2011 under Devel, rengolin. [ Comments: none ]

For all LLVM users in Europe that didn’t catch the announcement on the main list, be warned that we’re organising the European LLVM User Group Meeting in September, 16th, in London. All information necessary to register and submit your work is on the link.

It’s free of charge and we’ll also provide a happy-hour after the meeting for an extended networking. If you’re just curious about LLVM or work with it day to day or is thinking of starting your PhD with LLVM, please register and show up. You won’t regret. ;)

More information will be available on the link above and through the main mailing list. Stay tuned!

C++ class sizes #2
March 31st, 2011 under Devel, rengolin. [ Comments: 1 ]

In my previous post about class sizes, we visited the empty classes and how their derived classes change in size. Now, we’re going to look what the ABI have to say about tail padding in inherited classes.

Again, for normal (POD) structures, everything is as it looks like, same as in C. But for more complex (non-POD) classes, things change a bit. The Itanium C++ ABI has a rather complex rule for defining the offset and size of class members, but each target-specific ABI (like ARM’s) can have additional rules that make it very complex to define the size of a class member.


So, first, a brief explanation of POD. POD is a concept in C++ that states, for the purpose of layout, a C++ structure (that is more than simple C structures) are just like plain old data. This normally means that the structure behaves like data, ie. it can be created on zero-initialized memory, copied with memcpy without any side effect, etc.

But not being a POD doesn’t always means you have a complex virtual class with multiple inheritance. Sometimes a simple change can transform it into a non-POD, like adding a simple empty constructor to the base class.

For example, in both C and C++, the structure below has 8 bytes:

struct Foo {
  int a;
  char b;

because the structure alignment is the same as the most aligned member, which is int. Assuming int is 32 bits, the second member has size one but aligns to four, yielding a size 8 for the final structure.

In C++, you can derive from that structure, and the derived class is normally implemented by having the base class as sort of a member inside it.

So, the C++ struct Bar below:

struct Foo {
	int a;
	char b;

struct Bar : Foo {
	char c;

is similar (for layout purposes) to a C construct:

struct Bar {
	struct {
		int a;
		char b;
	char c;

So, both in C and in C++, the size of Bar is 12 bytes. Now, doing a slight change in Foo, we’re going to break that:

struct Foo {
	int a;
	char b;
	Foo() {}

struct Bar : Foo {
	char c;

You may be surprised to know that the size of Bar is still 8 bytes.

non-POD structures

Adding the constructor is one way of killing the POD-ness of a structure, but anything that adds some C++ flavour to it is enough. For instance, Bar itself is a non-POD, because it was derived from another structure. Adding simple methods (non constructors) is fine, but adding a virtual method makes it a non-POD (as you have to add the virtual table). Anything that leaves some room for implementation detail will be considered a violation of the POD contract and will transform your structure to a non-POD.

Ok, so what does that mean? If it just means your structures will be smaller, well, I’ll add a constructor to all my structures! But, no, the only change in in the tail padding. If you want to compress your structures, use packed structures, the tail padding rules are too complex to deal with them and still maintain your code clean.

So, the original Foo structure had 8 bytes. Four for the first int, Four for the char. But the char only used one of those four, so you have a 3-byte tail. When inheriting from a non-POD structure, the C++ ABI allows you to use the tail padding of the base class, providing your new type fits in there. In our example, the new type was char that fits in the 3-byte tail of the base class, so the final layout uses two of the four bytes and still have a 2-byte tail.

Further inheritance of the same type (adding a single char member) will still pack at the end of the base structure.


Every time you think you have understood the rules of C++ layout, go back to bit-fields and you’ll always be surprised.

Lets have a look at this example for a moment:

struct Foo {
	char a;
	int b : 8;

struct Bar : Foo {
	int c : 8;

The first member is a char, so one byte. The second is an int, so 4 bytes and the final layout is 5 bytes, aligned to int goes to 8 bytes, right? Well, no. Even without the constructor, the size of that structure is 4 bytes.

Compilers are smart enough (and the ABI allows them to be) to figure that the member b will have only 8 bits, so there is no point in wasting 6 more bytes just to follow the declared type.

You may not be surprised, now, to know that the size of Bar is still 4 bytes, since the member c in Bar gets tail-padded into the base Foo.

But what about this:

struct Bar  {
	struct {
		char a;
		char a1;
		char a2;
		int b : 4;
	int c : 4;

The three chars push the final member to the last byte in the block, of which only 4 bits are really used, so you have 4 bits of tail-padding. The second member c of Bar gets tail-padded into those four bits, since they fit, and the final size of that structure is 4 bytes.

I hope to have convinced you to never try to be smart and convert your structures and classes to arrays or pointers. Even disregarding virtual inheritance (which is probably coming in another post), the C++ ABI for structure layout is too complex to rely on and use it in your own code. Worse, because this is not in the standard, concrete implementations are free to change the details and sometimes, still following the generic ABI. You’ll find a few cases in the ARM ABI where they differ slightly from the ABI without actually being incompatible, sometimes not that lucky.

The final lesson is: be smart, use the language features and rely on the compiler to optimise things for you. If you’re not happy with the optimization, fill a bug rather than hack your code.

Why no MMORPG is good enough?
March 8th, 2011 under Devel, Fun, Games, rengolin, Software, Web. [ Comments: none ]

Massively multiplayer online role-playing game (MMORPG) are not new. The first I remember playing is the Legend Of the Red Dragon (LORD), but before that, of course, I’ve played other (real-life) multiplayer RPG games as well, and they were never quite the same thing.

Of course, at that time the graphic cards couldn’t quite compete with our imagination (not to mention connection speeds), but a lot has improved in both fronts, and lots of very good looking games have arrived, but still, there’s something missing. For years I couldn’t put my finger on it, but recently I think I nailed the issue: user driven content.

The interface

Most of the MMORGP are war games. World of Warcraft, LOTR online, Vendetta, Star Trek Online, Regnum and so many others rely on war to be fun. Of course, all of them have the side issues, some trade or silly missions, but the real fun is going to the battlefield.

If you look from the technical side of things, this is not surprising at all. Aside from good graphics, one of the hardest things to do in a game is a good interface. Very few games are perfect like Tetris. I gave Tetris to both my sons when they were about 2 years old and after about a minute they were already playing. There is no need to instructions, tutorials or any training and still today I find it quite fun. This is why it’s probably the most successful game in history.

But it’s all about the interface. It’s easy to create a seamless interface for Tetris. Try to create an interface for a strategy game that doesn’t require some hours of training, or an interface for first-person games that actually allows you to know where you are, or an interface for adventure games that doesn’t make you click in half-dozen places to get anything. Many have tried, all have failed.

At the dawn of internet games, strategy and quake were dominant. That’s because the internet wasn’t so fast and both were quite good in saving bandwidth. Quake had a special fix to avoid sending one packet for every bullet and only one packet when you press the trigger and another when you release it, the rest was up to the client.

But in war games, the interface is pretty much established. World of Warcraft didn’t invent anything, they just mixed Warcraft with Lara Croft (rubbish interface, by the way). Space ship games got the interface from Wing Commander (Vendetta got it from W.C. Privateer), Lord of the Rings and Regnum mixed Second Life with old-style RPG (even with the same deficiencies) and Star Trek Online copied from everyone above.

Now, the interface for a good strategy or adventure game is somewhat complicated. For a first-person 3D RPG, even worse. It doesn’t have to be mind controlled to be easy, nor you have to use 3D glasses or any immersion technology to be fun. Simplifying the game is one way, but then it’s even harder to make it more interesting.

It’s the user, stupid!

I can see only one way to add value to a game that is simple but still fun: user driven content.

You can enrich the world in which you’re immersed into. For instance, Zynga is quickly gathering an impressive amount of users by adding a lot of content. I don’t play those games, but people around me do and I can see why it’s so addictive. There are so many things to do and the frequency of updates is so impressive that it’s almost impossible not to be driven to it.

You might think that the games are too simple, and the graphics are average, but the interface is extremely simple, the challenges are simple, yet still challenging, and the number of paths you can choose for your character are enormous. In this case, the user experience is driven by his own choices. The content is so rich that each and every place is completely different from every other, solely driven by user choices.

Not many game companies (certainly not the indie ones) have time or money to do that. So, why are indie games generally more interesting than commercial ones? They go back to square one, simplify the game and optimise the interface. EA would never release something like Angry Birds or World of Goo, and yet those are the two best games I played in a long time. But, world of Goo is over and Angry Birds require constant attention (seasonal versions) to keep selling or making money (from ads).

They are missing the user content. It might not be their style, nor objective, but that’s a difference between Deep Purple and a one-hit-band.

Back to MMORGP

So, from all MMORPGs, there are many good looking, some with good challenges and a lot of slaughtering fun, but I tire quite quickly from playing them. The last I played was Vendetta. Quite good graphically, it has some reasonably accurate physics simulation (what drove me to it) but not accurate enough to keep me playing. The content tires too quickly to be fun after a few hours and even though I had 8 hours of free play, I spent less than two and dropped it.

This was always a pain, since Final Fantasy and the like, building up the character, hitting slugs for XP, fight-heal-run until you level up. Though Final Fantasy was better, as it normally would throw you on level 10 or so, so you didn’t need too much of levelling up. But why? Who likes beating 253 slugs to get 1000 experience points, going to level 2 and being able to equip a copper sword that doesn’t even cut a snail’s shell?

One of the best MMORGP experiences I had recently was Regnum. This is a free game made in Argentina and has a lot of content, good interface and a healthy community. They do the normal quest levelling up technique and it works quite well until level 15 or so. After that, it’s hitting bears, golems and phantoms for half a year until you can go outside and beat other users.

I got outside prematurely (couldn’t bother to wait) and the experience was also not great. Huge lag on big crowds, people disappearing in mid-air and appearing a few meters away, etc. But the most annoying of all was the content. It was always the same fort that we had to protect, always the same keep we had to attack, always the same talk on how our race is superior to your race, etc.

I haven’t seen Lord of the Rings (which sucks on Linux) or Star Trek Online (which doesn’t even run), but I bet they can get a bit further. They’re there to compete with World of Warcraft, not Regnum, but the fate will be the same: boring.

So, after quite a big rant, how would I do it?

User content

First, a memory refresh: all free first-person shooter I know of are a re-make of Quake. They use the same engine, the same world builders, the same techniques. On Debian repositories you’ll find at least a dozen, all of them running on some version of Quake. Nexuiz, Tremulous, Open Arena, Urban Terror, etc.

Not only the Quake engine is open source, but it was built to be extensible and that, even before the source was opened by ID. I made some levels for Doom, there were good editors at the time (1994?), probably there are full development suites today.

The user has the power to extend, create, evolve and transform your game in ways that you never thought possible. To think that only the few people you hire are good enough to create game content is to be, to say the least, naive.

Now, all those games are segmented. Nexuiz levels don’t connect to Tremulous levels. That’s because the mods (as they’re called) are independent. To be able to play all those different games you need to download a whole lot of data (objects, music, game logic, physics settings, etc) and each game has it radically different. Sarge with a rocket launcher would be invincible in most of other quake variants.

But that is, in my opinion, the missing link between short spurs of fun and long lasting enjoyment. I want to be able t build my world (like Zynga), but in a world with free movement (like Quake) with quests (like MMORPGs) made by the users themselves (like no FP-game I know) in a connected world. It cannot penalise those that connect seldom, or those that connect through text terminals, Android phones or browser users in any way.

As some games have started to understand, people like different things in games. I like building stuff and optimizing structures, some like carnage, others like to level up or wait 45 minutes for a virtual beef pie to be ready. You cannot have all that in one game if you’re the only content generator.

Also, if the engine is not open, people won’t be able to enhance it for their particular worlds. It doesn’t have to be open source, but it has to have a good API and an efficient plugin system. Tools to create mods and content is also essential, but the real deal is to give the users freedom to create their versions of the game and be able to connect them all.

If every game is also a server, people can create their small worlds inside the bigger world, that is in the central server. A business strategy would be, then, to host those worlds for people that really cared about them. Either host for free in exchange of ads and content generation, or paid hosting for the more serious users. You can also easily sell content, but more importantly, you can create a whole marketplace of content driven by the users! Read Neil Stephenson’s Snow Crash and you know what I mean.

I think Apple and Google have proven over and over that a market with apps generated by the users is very effective indeed! Intel is just following the same path with their new App store, so yes, this is a good business strategy. But, it’s still fun for a wider range of people, from game addicts to casual gamers, from heavy modders to passive Facebook users.

There are many ways of doing that, maybe not all of them will be successful, but at least from my point of view, until that day arrives, no game will be fun.

« Previous entries 

Creative Commons License
We Support



National Autistic Society

Royal Society for the Prevention of Cruelty to Animals


End Software Patents

See Also

The information in this weblog is provided “AS IS” with no warranties, and confers no rights.

This weblog does not represent the thoughts, intentions, plans or strategies of our employers. It is solely our opinion.

Feel free to challenge and disagree, and do not take any of it personally. It is not intended to harm or offend.

We will easily back down on our strong opinions by presentation of facts and proofs, not beliefs or myths. Be sensible.

Recent Posts