This information comes from a member of the team that worked with the RTS system. The following is in his words:
RTS
As you stated, RTS was a slick design. It was a system that was to run on 339 servers across the country and over 33k phones, with at least 26k users logged in to the production system. It was a shame that issues outside the main RTS software denied it the limelight. The visualization and transmission aspects were not part of the RTS system; the RTS system itself comprised:
All of which were based on tried and tested open technologies.
The Failure
The truth is that around 8PM on Monday the /var partition on the provisioning server (running CentOS, not Windows) filled up, and the underlying RDBMS failed as a result. It was a shame, because there was plenty of space on that server, just not in the place where it was needed. I can state that there was no hacking (nothing points to it). I can also state that RTS was not creating files; the partition was filled not by RTS data but by MySQL binary logs, which were being generated in situ because database replication was switched on. Once the provisioning server went down, new logins and requests for candidate data for a polling station could not be serviced. However, those individuals who had logged in at least once before, in accordance with the procedure, were able to send results to the other servers that were still up. This explains the “slow down” experienced after the provisioning server went down.
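To illustrate how simple the missing check was, here is a minimal sketch of a disk-usage check for a partition such as /var. It is not part of RTS or of the tooling the team used; the function names and the 90% threshold are assumptions chosen for illustration only.

```python
import shutil


def partition_usage(path):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_nearly_full(path, threshold=0.90):
    """Return True when usage of the filesystem holding `path` is at or above `threshold`.

    The 0.90 default is an illustrative assumption, not a recommended value.
    """
    return partition_usage(path) >= threshold


if __name__ == "__main__":
    # /var is the partition named in the account above; "/" is included for comparison.
    for mount in ("/var", "/"):
        try:
            fraction = partition_usage(mount)
        except FileNotFoundError:
            print(f"{mount}: not present on this machine")
            continue
        status = "NEARLY FULL" if is_nearly_full(mount) else "ok"
        print(f"{mount}: {fraction:.0%} used ({status})")
```

Run periodically against each partition, a check along these lines would flag a filling /var even when the server as a whole still has plenty of free space.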
The following are some figures of the success the system achieved when it was up:
It was a shame that the technical team could not deduce sooner that the issue that took down the server was a simple problem of disk space; the team spent time searching for the problem elsewhere (concurrency, the number of threads being spun up, the maximum number of connections on MySQL). What makes the team sad is that RTS missed its golden window to shine because of non-programming errors, and because of the time wasted before we were able to detect what was, for that matter, a trivial issue.
Other issues, eclipsed by the nature of the problems experienced, were procedural and process issues, namely:
The late delivery of the visualization software (on election day) meant that no real stakeholder testing happened. As a result, the spoilt, rejected and disputed vote figures were inflated by a factor of 8 for the presidency (8 candidates) due to a join error that could have been detected had the visualization been delivered on time.
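The actual schema and query behind the visualization are not given here, so the following is only a hypothetical reconstruction of how a join can inflate a per-station figure by the number of candidates. The table names, column names and vote counts are all invented for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two stations and 8 candidates per station."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE station_totals (station_id INTEGER PRIMARY KEY, rejected INTEGER);
        CREATE TABLE candidate_votes (station_id INTEGER, candidate TEXT, votes INTEGER);
        """
    )
    # One row per station: 5 rejected votes at station 1, 3 at station 2.
    conn.executemany("INSERT INTO station_totals VALUES (?, ?)", [(1, 5), (2, 3)])
    # Eight rows per station, one for each candidate.
    rows = [(s, f"candidate_{c}", 10) for s in (1, 2) for c in range(1, 9)]
    conn.executemany("INSERT INTO candidate_votes VALUES (?, ?, ?)", rows)
    return conn


def rejected_with_fanout(conn):
    """Sum rejected votes AFTER joining to candidate rows: each station is counted 8 times."""
    query = """
        SELECT SUM(s.rejected)
        FROM station_totals s
        JOIN candidate_votes c ON c.station_id = s.station_id
    """
    return conn.execute(query).fetchone()[0]


def rejected_correct(conn):
    """Sum rejected votes from the per-station table alone: each station is counted once."""
    return conn.execute("SELECT SUM(rejected) FROM station_totals").fetchone()[0]


if __name__ == "__main__":
    conn = build_demo_db()
    print("correct total:", rejected_correct(conn))
    print("inflated total:", rejected_with_fanout(conn))
```

In this sketch the true total is 8 rejected votes, while the joined query reports 64, because every station row is repeated once per candidate. Comparing the displayed totals against a handful of known stations during stakeholder testing is the kind of check that would expose such an error.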
So while RTS the system failed, the software to transmit and receive data was fine; it was the stuff around it that messed it up. I still believe that it is a great piece of software that ran under conditions that were not exactly optimal for its correct performance.