3 technical challenges we just cannot get right
Saturday, August 3rd, 2013 | Limited, Programming
Having worked on a lot of different projects at a lot of different companies over the years, there are some issues that recur time after time. These are problems that maybe we just don’t have satisfactory patterns for, or developers have a blind spot for. But whatever the reason, they certainly merit more attention in order to reduce how frequently they arise.
Below, I’ve listed three with some suggestions as to how to help mitigate them.
Caching
Caching is essential to good performance in most applications. Remember your ABC – Always Be Caching. Indeed, if you ever want to sound clever in a discussion about system architecture, say “have you thought about caching that?”. Caching is the solution to almost everything. But it causes some problems too.
For the non-technical, caching is storing information in short term memory so it can be delivered faster. It’s like remembering things. You probably have a big pile of books at home, and you can’t remember most of them, so for most facts you have to look them up in a book. But you can remember a few things, so for the questions you get asked most, you memorise the facts so you can say them quicker. That’s basically what caching is.
The problem is that when you update something, the cache can end up serving the old content rather than the new. This results in data being out of date, and often results in web applications not appearing correctly because the HTML has updated but the cached CSS and JavaScript haven’t.
Here are a few tips to help mitigate such problems:
- Have scripts to clear out your various caches. Make sure you know what/where the cache layers are.
- Understand what cache headers you are sending via your web server.
- Have a mechanism for versioning files such as stylesheets and JavaScript when you release a new version (see the sketch after this list).
- Turning caching off on dev environments is fine, but make sure you have it turned on in your staging/pre-production environment so you can spot potential problems.
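As an illustration of the versioning tip above, here is a minimal sketch of cache busting in Python: it appends a hash of a file’s contents to the asset URL, so the URL changes whenever the file does. The `static` directory and URL layout are assumptions for the example, not taken from any particular framework.

```python
# Cache-busting sketch: append a content hash to asset URLs so a new release
# produces a new URL and browsers cannot keep serving a stale copy.
# The static/ directory and /static/ URL prefix are illustrative assumptions.
import hashlib
from pathlib import Path

STATIC_ROOT = Path("static")  # assumed location of CSS/JS files

def asset_url(relative_path: str) -> str:
    """Return a URL like /static/site.css?v=1a2b3c4d based on file contents."""
    content = (STATIC_ROOT / relative_path).read_bytes()
    version = hashlib.md5(content).hexdigest()[:8]
    return f"/static/{relative_path}?v={version}"

# In a template you would then write something like:
#   <link rel="stylesheet" href="/static/site.css?v=1a2b3c4d">
# so that when site.css changes, the query string changes and any cached copy
# is bypassed.
```

Once asset URLs change on every release, you can also serve those files with long-lived cache headers, because an out-of-date copy can never be served under a new URL.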
Timezones
While Britain insisted that time starts and stops in Greenwich, we did allow the rest of the world to make adjustments at hourly (or even half-hourly) intervals based on where they are in the world. This has caused no end of headaches for software developers. Even within one country, one problem that comes up time after time is that systems go wrong when the clocks are changed for daylight savings.
Every six months it all goes wrong, and we promise to fix it by the next time the clocks change, but usually never do. Even Apple got it wrong on their iPhone alarm – twice!
Here are a few tips to help mitigate such problems:
- Set your servers to GMT/UTC and leave them there, regardless of where you are in the world and whether or not DST is in effect.
- Use proper DateTime objects, and specify the timezone.
- When sending a DateTime string, use an ISO standard and be explicit as to what timezone it is in (see the sketch after this list).
- Unit test DST changes, using a mock system time if required.
- Perform manual testing on the DST switch.
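As an illustration of the UTC, ISO 8601 and DST-testing tips above, here is a minimal Python sketch. Europe/London and the 2013 clock-change date are just example values, and the assertion stands in for a real unit test.

```python
# Sketch: keep times in UTC, be explicit about zones when formatting, and
# exercise a DST boundary. Europe/London is used purely as an example zone.
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

LONDON = ZoneInfo("Europe/London")

# Store and pass around UTC; render as ISO 8601 with an explicit offset.
now_utc = datetime.now(timezone.utc)
print(now_utc.isoformat())                     # e.g. 2013-08-03T10:15:00+00:00

# Convert to local time only at the edges (display, user input).
print(now_utc.astimezone(LONDON).isoformat())  # e.g. 2013-08-03T11:15:00+01:00

# DST-boundary check: UK clocks went forward at 01:00 on 2013-03-31, so one
# hour of elapsed time straddling that point shifts the UTC offset by an hour.
before = datetime(2013, 3, 31, 0, 30, tzinfo=LONDON)
after = (before.astimezone(timezone.utc) + timedelta(hours=1)).astimezone(LONDON)
assert after.utcoffset() - before.utcoffset() == timedelta(hours=1)
```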
Load balancing
As systems scale, they inevitably need to spread across multiple processes, applications and bits of hardware. This brings a whole new layer of complexity, though: having to keep everything in sync. It’s very easy to lose consistency, and that almost always ends in one thing – say it with me everyone, “race conditions”!
There are two common problems we find in this situation. The first is that one node stops working while the other one is working fine – or maybe something as simple as one node having old code on it. You get intermittent problems, but when you try it, it works fine, or at least works fine 50% of the time: sometimes you are hitting the working node, and at other times you are hitting the broken one.
Here are a few tips to help mitigate such problems:
- Ensure you can (and ideally do automatically) test individual nodes (see the sketch after this list).
- Use a tool like Chef or Puppet to make sure all your nodes are consistent.
- Have the ability to easily take out nodes that are not working.
- Aggregate your logging where possible so you don’t have to tail them on each individual node.
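As a sketch of the first two tips, here is a small Python script that polls each node directly, bypassing the load balancer, and warns if the nodes disagree about the code version they are running. The node hostnames and a /health endpoint returning a JSON version field are assumptions for illustration.

```python
# Sketch: hit each node directly (bypassing the load balancer) and compare
# what each one reports. The node addresses and the /health endpoint
# returning {"version": "..."} are illustrative assumptions.
import json
from urllib.error import URLError
from urllib.request import urlopen

NODES = ["http://app1.internal:8080", "http://app2.internal:8080"]  # assumed hosts

def check_nodes() -> None:
    versions = {}
    for node in NODES:
        try:
            with urlopen(f"{node}/health", timeout=5) as response:
                versions[node] = json.load(response).get("version", "unknown")
        except URLError as error:
            versions[node] = f"DOWN ({error.reason})"
    for node, version in versions.items():
        print(f"{node}: {version}")
    if len(set(versions.values())) > 1:
        print("WARNING: nodes are not consistent")

if __name__ == "__main__":
    check_nodes()
```

Run from cron or your monitoring system, something like this catches the “one node has old code on it” case before your users do.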
The second is race conditions. These are particularly prevalent when using a master and slave database setup. You write the update to the master and then query for the results – but because your second query is a read, it will be passed to the slave, which may or may not have been updated with the new information yet.
Here are a few tips to help mitigate such problems:
- Have the ability to manually specify master or slave, so time-critical reads can be pointed at the master (see the sketch after this list).
- Ensure your staging/pre-production environment load balances everywhere your production environment does, so you can spot potential issues.
- Where possible, rely on one server for the time as different machines can become slightly out of sync.
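As a sketch of the first tip, here is a tiny Python router where reads default to the slave but can be forced to the master for time-critical read-after-write cases. The connection objects are stand-ins for whatever database client you actually use.

```python
# Sketch: route writes to the master, reads to the slave, and let the caller
# force a read to the master when it must see a write that just happened.
# The master/slave connection objects are stand-ins, not a real client API.

class ConnectionRouter:
    def __init__(self, master, slave):
        self._master = master
        self._slave = slave

    def for_write(self):
        return self._master

    def for_read(self, force_master: bool = False):
        # Reads that must see a write we just made go to the master;
        # everything else can tolerate a little replication lag.
        return self._master if force_master else self._slave

# Usage sketch (assuming DB-API style connections):
#   db = ConnectionRouter(master_conn, slave_conn)
#   db.for_write().cursor().execute("UPDATE users SET name = %s WHERE id = %s", ("Bob", 1))
#   cur = db.for_read(force_master=True).cursor()
#   cur.execute("SELECT name FROM users WHERE id = %s", (1,))
```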