What went wrong? Reverse-engineering disaster

Last week, we nearly pushed a bad configuration into production, which would have broken some things and made some code changes live that were not ready. Nearly, but not quite: while we were relieved that we’d caught it in time, it was still demoralising to find out how close we had come to trouble, and a few brave souls had to work into the evening to roll back the change and make it right.

Rather than shouting and pointing fingers, the team came together, cracked open the Post-Its and Sharpies and set to engineering. The problem to be solved: what one thing could we change to make this problem less likely, or less damaging?

What happened?

The first step was for the team to build a cohesive view of what happened. We did that by using Post-Its on the wall to construct a timeline: everybody knew what they individually had done and seen, and now we could put all of that together to describe the sequence of events in context. Importantly, we described the events that occurred, not the people or feelings: “the tests passed in staging”, not “QA told me there wouldn’t be a problem”.

Yes, the tests passed, but was that before or after code changes were accepted? Did the database migration start after the tests had passed? What happened between a problem being introduced, and being discovered?

Why was that bad?

Now that we knew the timeline, we could start to look for correlation and insight. So the tests passed in staging: is that because the system was OK in staging, because the tests missed a case, because the wrong version of the system ran in testing, or because of a false negative in the test run? Is it expected that this code change would have been incorporated into that migration?

The timeline showed us how events met our expectations (“we waited for a green test run before starting the deployment”) or didn’t (“the tests passed despite this component being broken”, “these two components were at incompatible versions”). Where expectations were not met, we had a problem, and used the Five Whys to ask what the most…problemiest…problem was that led to the observed effect.

What do we need to solve?

We came out of this process with nine different things that contributed to our deployment issue. Nine problems are a lot to think about, so which is the most important or urgent to solve? Which one problem, if left unaddressed, is most likely to go wrong again or will do most damage if it does?

More sticky things were deployed as we dot-voted on the issues we’d raised. Each member of the team was given three stickers to distribute across one to three of the issues that seemed highest priority to solve: if one’s a stand-out catastrophe, you can put all three dots on that issue.

This focused us a great deal. After the dots were counted, one problem (gaps in our understanding of what changes went into the deployment) stood out above the rest. A couple of other problems had received a few votes, but weren’t as (un)popular: the remaining six issues had zero or one dot each.

I got one less problem without ya

Having identified the one issue we wanted to address, the remaining question was: what shall we do about it? The team opted to create a lightweight release checklist that could be used during deployment to help build the consistent view we need of what is about to be deployed. We found that we already have the information we need, so bringing it all into one place when we push a change will not slow us down much, while increasing our confidence that the deployment will go smoothly.

A++++ omnishambles; would calamity again

The team agreed that going through this process was a useful activity. It uncovered some process problems, and helped us to choose the important one to solve next. More importantly, it led us to focus on what we as a team did to get to that point and what we could do to get out of it, not on what any one person “did wrong” and on finding someone to blame.

Everyone agreed that we should be doing more of these root cause analyses. Which I suppose, weirdly, means that everybody’s looking forward to the next big problem.

Taming the Beast

Security is a journey – we’ve all heard it said, but how many of us believe it, and who knows where they’re trying to go? I think we do, and the destination is our next audit: we want to breeze through each audit like passing street lamps on a motorway.

At Wealth Wizards we deal with personal data. We provide financial guidance to customers, and to do that we need customers’ personal data. Their valuable personal data. Not just address and email (which are effectively freely available, business-card data) but information on savings, investments, tax details, health conditions, and so on. We don’t store credit card or bank details, but we do hold all the really personal stuff. What this means is that over time we will build up a large dataset of exactly the things that con men, attackers, villains and others want to get hold of. We know this is valuable, as do our customers, and we want our customers to trust us. We want to instil confidence that when a customer tells us something, it’s private and remains so. Security is important to us because without it, it doesn’t matter how much effort we put into building up our business: if the data were stolen and exposed, it could be our downfall.

One of the best ways for us to show prospective clients and customers that we’re serious about security is to show our credentials and accreditation: evidence that we have a rigorous process that stands up to a rigorous audit. ISO 27001 is designed to do just this, which is why we are working towards achieving it this year. However, anyone who knows ISO 27001 will know it’s a beast, and not for the faint of heart, so the trick is learning how to use ISO to our advantage instead of letting it work against us. We can use ISO as a framework to build up the policies and processes we use as a business. Instead of trying to fight it, we’re going to make it help us.

We don’t just want to bolt security on to what we’ve done; we want to build security into what we do. We deal with a lot of big companies. When we’re selling our products and things are getting close to signing contracts, those big companies (we’re talking tens of thousands of employees) start asking us about our processes, about our data security and, more importantly in their eyes, their data security. In other words, they start auditing us. No one likes an audit, but if you can show an auditor that you do care about things, and that you do have processes, then they tend to avoid asking the really difficult questions. And when you really do care, it doesn’t matter if they do ask the difficult questions, because you have an answer for them.

I’m currently going through the ‘Technical Measures’ questions with our team here and it feels endless: how can I prove that we did X, how can I prove why we did Y, how can I show what something looked like on this date compared with that date? Those are difficult questions to answer at the best of times, but more so when you’re running in an elastic environment where a server instance may only exist for a day or two. What’s becoming apparent, though, is that ISO is asking questions I actually want answered myself, regardless of which certification we go for. I, as a sysadmin, want a record of what happened, when and why. I also want to know that something happened because we made it happen. If I know this, then I can start to answer questions about why something doesn’t work at 3 am. So already I’m finding that while ISO is a beast, it can be tamed into a friendly beast. On our path to ISO, we will build the framework that defines the tasks we need to do to build security into our platform – a framework that will show the auditors what they want to see, along with the meat behind it to prove it’s not just paperwork.

By doing a true risk assessment of our business and technical environment, we start to build an accurate picture of our weaknesses, in terms of both security and our processes. Once we have identified these, we can start to build suitable responses. It looks overwhelming to begin with, but before long it becomes clear that the automation we’re building to allow hands-off delivery of our applications is also the solution we need to record what was deployed, when and why. The automation scripts are the perfect mechanism to build these audit trails, rather than relying on someone to manually ensure these actions are recorded!
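As an illustration, a deployment script can drop an audit record as its final step. This is a minimal sketch, not our actual tooling; the field names and log path are hypothetical:

```python
import json
import os
import time

def record_deployment(component, version, reason, log_path="deploy-audit.log"):
    """Append one JSON line per deployment, so "what was deployed, when
    and why" can be answered later without relying on anyone's memory."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "component": component,
        "version": version,
        "reason": reason,
        "deployed_by": os.environ.get("USER", "unknown"),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

Because the same automation that performs the deployment writes the record, the audit trail cannot be skipped or forgotten.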

How do we ensure there is a separation of concerns? That no one is putting back-door code into production? Why, peer review of the code (both application and infrastructure) allows us to enforce this programmatically! Suddenly ISO has become my friend. Sure, it’s still a beast, but it’s not blocking our delivery; it’s helping to define what processes we need, and therefore it’s starting to write our automation algorithms. How cool is that!? OK, perhaps cool is a little strong.

So while we’re still very much en route, I’m confident we’re on the right path and that the next audit will be us proving we’re secure, not hiding the things we don’t want seen. Don’t be afraid of the beast called ISO; embrace it and use it to your advantage.

Be part of our engineering team

If you’re a software developer who wants to swap the urban jungle of London for the rolling hills of Warwickshire, look no further than Wealth Wizards. You’ll join a dynamic team of clever people, be absorbed in an energetic atmosphere, and benefit from an excellent work-life balance. Plus, you’ll be working in a team that’s breaking ground in the world of robo advice and artificial intelligence.

If Wealth Wizards sounds like your sort of company, take a look at our careers page for all the info.

Are you based in the Midlands already? Do you like beer? And DevOps? Come to our DevHops MeetUp.

Using Ansible with WordPress

WordPress is a great tool to use when creating websites as it provides flexibility when managing content.

As you may be aware, one of the operational downsides of managing websites run on WordPress is how frequently new releases come out to patch vulnerabilities. This brings the pain and cost of having to upgrade your WordPress instance every other week.

As we are living in a world that thrives on automation, here at Wealth Wizards we thought it would be a good idea to automate the upgrade process using a configuration management tool, Ansible, combined with the power of the AWS APIs.

As our WordPress sites are deployed in AWS, we decided to use the AWS APIs to provision instances, manage snapshots, and configure and apply security groups. We then used various Ansible modules to install packages, update configs, encrypt and decrypt files pushed to and retrieved from AWS S3, and change permissions on files and directories as part of the upgrade process.

Switching from the traditional method of manually moving files with plugins and bash commands to an automated process has given us more control over our upgrades, and has reduced the time an upgrade takes from a day to around two hours, with most of that dedicated to AWS provisioning. Automating the process with Ansible also lets us upgrade multiple instances at once, instead of the traditional one instance at a time.
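The gating step in such a playbook – only touching instances whose WordPress core is behind the target release – can be sketched in a few lines of Python (a hypothetical helper, assuming dotted-integer version strings like WordPress uses):

```python
def parse_version(version):
    # "4.9.8" -> (4, 9, 8); tuple comparison then orders releases correctly,
    # so "4.10.0" sorts after "4.9.8" (a plain string compare would not).
    return tuple(int(part) for part in version.split("."))

def instances_to_upgrade(installed, target):
    """Given {hostname: installed_version}, return the hosts that need
    upgrading to the target WordPress release."""
    return [host for host, version in installed.items()
            if parse_version(version) < parse_version(target)]
```

For example, `instances_to_upgrade({"blog-1": "4.9.8", "blog-2": "5.0.1"}, "5.0.1")` returns `["blog-1"]`, and the playbook can then run against that list in parallel.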

Microservices make hotfixes easier

Microservices can ease the pain of deploying hotfixes to live due to the small and bounded context of each service.

Setting the scene

For the sake of this post, imagine that your system at work is written and deployed as a monolith. Now, picture the following situation: stakeholder – “I need this fix in before X, Y, and Z”. It’s not an uncommon one.
But let’s say that X, Y, and Z are all already in the mainline branch and deployed to your systems integration environment. This presents a challenge. There are various ways you could go about approaching this – some of them messier than others.

The nitty gritty

One approach would be to individually revert the X, Y, and Z commits in Git, implement the hotfix straight onto the mainline, and deploy the latest build from there. Then, when ready (and your hotfix has been deployed to production), you would need to go back and individually revert the reverts. A second deployment would be needed to bring your systems integration environment back to where it was (now with the hotfix in there too), and life carries on. Maybe there are better ways to do this, but one way or another it’s not difficult to see how much of a headache this can cause.

Microservices to the rescue!

But then you remember that you are actually using microservices and not a monolith after all. After checking, it turns out that X, Y and Z are all changes to microservices not affected by the hotfix. Great!
Simply fix the microservice in question, and deploy this change through your environments ahead of the microservices containing X, Y, and Z, and voila. To your stakeholders, it looks like a hotfix, but to you it just felt like every other release!


Of course, you could still end up in a situation where a change or two needs to be backed out of one or more of your microservice mainlines for a hotfix to go out. However, I’m betting this will happen less often, and be less of a headache, than with your old monolith.


Mars Attacks!!! Ack, Ack-Ack!

Last Tuesday we saw our first (recognised) DDoS attack. At 12:09 GMT we started to see an increase in XML-RPC GET requests against our marketing site, hosted on WordPress. We don’t serve XML-RPC, so we knew straight away this was invalid traffic.

By 12:11 GMT traffic volumes were well above what the system could handle and the ELBs started to return 503 responses. By 12:20 GMT the request rate was over 250 times higher than usual. At this point we were trying to establish what was causing the demand. We don’t currently have the highest coverage of monitoring over our marketing sites, so this took us a little while. Eventually, by 12:30, using the ELB logs, we had established that we were seeing requests from all over the world, all making GET requests to /xmlrpc.php. We don’t typically see requests from China, Serbia, Thailand and Russia, among others, so it was pretty obvious this was a straightforward DDoS attack.
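The triage we did by hand – spotting that one path dominated the traffic – is straightforward to script. A sketch, assuming the classic ELB access-log format, where the request line is the first double-quoted field:

```python
from collections import Counter
from urllib.parse import urlparse

def top_paths(log_lines, n=3):
    """Count request paths in classic ELB access-log lines and return the
    n most frequent, e.g. [("/xmlrpc.php", 9000), ...]."""
    counts = Counter()
    for line in log_lines:
        try:
            request = line.split('"')[1]   # e.g. 'GET http://host:80/xmlrpc.php HTTP/1.1'
            url = request.split()[1]       # the full request URL
            counts[urlparse(url).path] += 1
        except IndexError:
            continue                       # skip malformed lines
    return counts.most_common(n)
```

Run over the ELB logs for the incident window, this surfaces /xmlrpc.php at the top in seconds, rather than the twenty minutes it took us by eye.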

Shortly after 12:30 GMT the request rate dropped off just as quickly as it had started, and by 12:35 GMT it was over and the site had recovered. Either the botnet got bored, it had achieved its purpose (investigation into the consequences of the attack continues with our security partner), or AWS Shield did its free, little-known job and suppressed the attack…

Whatever led to the attack, it passed as quickly as it arrived, and from initial assessment had little purpose. At least we’ve had our first taste of an attack and will be able to better tackle the next one. In the meantime, we continue to analyse logs to determine if there was any more to the attack than a simple DDoS, or if there was something more malicious intended.
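One cheap hardening step that follows from this: since we never serve XML-RPC, the endpoint can be refused outright at the web server, before the request ever reaches PHP. A sketch assuming nginx fronts the WordPress instances (our actual setup may differ):

```nginx
# We never serve XML-RPC, so reject requests for it before they reach PHP.
location = /xmlrpc.php {
    return 403;
}
```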