This information comes from a member of a team that worked with the RTS system. The following is in his words:
As you stated – RTS was a slick design: a system meant to run on 339 servers across the country, with over 33k phones and at least 26k users logged in to the production system. It was a shame that issues outside the main RTS software denied it the limelight. The visualization and transmission aspects were not part of the RTS system, which comprised:
- the mobile phone software – a J2ME application
- the web service processing the requests – a Servlet running on Glassfish
- Memcached – to cache data that was not changing
- the database – running on MySQL
All of which were based on tried and tested open technologies.
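The request path through those components can be sketched in a few lines. This is purely illustrative: a plain dict stands in for Memcached and sqlite3 for MySQL, and the table and function names are invented here – the real system used a J2ME client talking to a Servlet on Glassfish.

```python
import sqlite3

# sqlite3 stands in for MySQL; the schema and data are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE candidates (station_id INTEGER, name TEXT)")
db.execute("INSERT INTO candidates VALUES (1, 'candidate_a')")

cache = {}  # stand-in for Memcached: candidate lists do not change mid-election

def get_candidates(station_id):
    """Serve the candidate list from cache, falling back to the database."""
    if station_id not in cache:
        rows = db.execute(
            "SELECT name FROM candidates WHERE station_id = ?",
            (station_id,)).fetchall()
        cache[station_id] = [name for (name,) in rows]
    return cache[station_id]

print(get_candidates(1))  # first call hits the DB; repeat calls hit the cache
```

Caching the unchanging candidate data this way keeps read traffic off the database, which matters when 33k phones all fetch the same lists.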
The truth is that around 8 PM on Monday the /var partition on the provisioning server (running CentOS, not Windows) filled up, and the underlying RDBMS failed as a result. It was a shame because there was so much space on that server, just not in the correct (needed) place. I can state that there was no hacking (nothing points to it). I can also state that RTS was not creating files: the partition was filled not by RTS data but by MySQL binary logs generated in situ because database replication was switched on. This meant that once the provisioning server went down, no new logins were possible and requests for candidate data for a polling station could not be serviced. However, those individuals who had logged in at least once before, in accordance with the procedure, were still able to send results to the other servers that were up. This explains the “slowdown” experienced after the provisioning server went down.
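A trivial free-space check on the partition would have caught this before the RDBMS failed. A minimal sketch, assuming a Linux host where /var exists and a purely illustrative 1 GiB alert threshold (on CentOS the MySQL datadir, where replication binlogs accumulate, defaults to /var/lib/mysql):

```python
import shutil

def free_gib(path):
    # shutil.disk_usage reports the filesystem backing `path`
    return shutil.disk_usage(path).free / 2**30

if free_gib("/var") < 1.0:  # threshold is illustrative, not the real config
    # At this point the replication binlogs can be trimmed from the mysql
    # client with:  PURGE BINARY LOGS BEFORE NOW() - INTERVAL 1 DAY;
    # and capped going forward with expire_logs_days in my.cnf.
    print("WARNING: /var nearly full -- purge MySQL binary logs")
```

Run from cron every few minutes, a check like this turns a dead-server mystery into a routine alert.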
The following figures show the success the system achieved while it was up:
- 16,617 polling stations had reported the presidential race
- 8,130 polling stations had reported the governor race
- 8,229 polling stations had reported the senator race
- 11,020 polling stations had reported the national assembly race
- 8,613 polling stations had reported the women rep race
- 9,228 polling stations had reported the county assembly ward rep race
It was a shame that the technical team could not immediately deduce that the issue that took down the server was a simple problem of disk space; the team spent time searching for the problem elsewhere (concurrency, the number of threads being spun up, the maximum number of connections on MySQL). What makes the team sad is that RTS missed its golden window to shine due to non-programming errors, and due to the time wasted before we were able to detect that trivial issue.
Other issues, eclipsed by the nature of the problems experienced, were procedural and process issues, namely:
- The delivery of the phones and their configuration were not done in time. This had a negative effect on the rest of the process.
- Presiding officers should have logged in, downloaded the candidates and sent a dummy result on the eve of the election to prove the system worked. This was not done at all polling stations.
- The issuing of usernames and passwords was incomplete in some places – users were still asking for passwords to be reset two days after the election.
The late delivery of the visualization software (on election day) meant that no real stakeholder testing happened; thus the spoilt, rejected and disputed vote figures were inflated by a factor of 8 for the presidency (8 candidates) due to a join error that could have been detected had the visualization been delivered on time.
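That factor-of-8 inflation is exactly what an unconstrained join produces. A minimal sketch of the bug, using sqlite3 and an invented two-table schema (the real schema and queries are not known to me): cross-joining the 8-row candidates table into a station-level aggregate repeats each station's spoilt-vote row once per candidate.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Hypothetical minimal schema: 8 presidential candidates, plus one
# spoilt-vote tally per polling station (not tied to any candidate).
cur.execute("CREATE TABLE candidates (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO candidates VALUES (?, ?)",
                [(i, f"candidate_{i}") for i in range(1, 9)])
cur.execute("CREATE TABLE station_results (station_id INTEGER, spoilt INTEGER)")
cur.executemany("INSERT INTO station_results VALUES (?, ?)", [(1, 10), (2, 5)])

# Buggy aggregate: the cross join duplicates every station row 8 times.
bad = cur.execute(
    "SELECT SUM(spoilt) FROM station_results, candidates").fetchone()[0]
# Correct aggregate: spoilt votes belong to the station, not to a candidate.
good = cur.execute("SELECT SUM(spoilt) FROM station_results").fetchone()[0]
print(bad, good)  # 120 15 -- the buggy total is inflated by a factor of 8
```

A single end-to-end test against known dummy figures, of the kind the stakeholder testing would have provided, flags this immediately: the reported total is 8 times the expected one.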
So while RTS the system failed, the software to transmit and receive data was fine; the things around it let it down. I still believe that it is a great piece of software that ran under conditions that were not exactly optimal for its correct performance.