Breaking changes are sad. We’ve all been there: someone else changes their API in a way you weren’t expecting, and now you have a live-ops incident you need to fix urgently to get your software working again. Of course, many of us are on the other side too: we build APIs that other people’s software relies on. There are many strategies to mitigate breaking changes (think SemVer or API versioning), but they all rely on the API developer identifying which changes are breaking.
That’s the bit we’ll focus on in this talk: how do you (a) get really good at identifying which changes might break someone’s integration, (b) help your API consumers build integrations that are resilient to these kinds of changes, and (c) release potentially breaking changes as safely as possible?
5. @paprikati_eng
Adding a mandatory field to an endpoint
Breaking apart a database transaction
Introducing a rate limit
Changing an error response string
Changing the timing of batch processing
Reducing the latency on an API call
How do assumptions develop?
01 Documentation
02 Support Articles & Blog Posts
03 Ad Hoc Communication
04 Industry Standards
05 Observed behaviour
Avoiding bad assumptions
01 Documentation
02 Support Articles & Blog Posts
03 Ad Hoc Communication
04 Industry Standards
05 Observed behaviour
Avoiding bad assumptions
Given that many integrators just look at the HTTP examples, naming is critical
Deliberately call out tripwires in your docs to combat pattern matching
Restrict and document your behaviour as explicitly as you can
Releasing a potentially breaking change
01 Pull Comms: updating docs or a changelog
02 Push Comms: newsletter or email to integrators
03 Ack’d Comms: wait for a positive response from integrators before rolling out a change
Escalate from 01 to 03 with the likelihood of a change being breaking
Can you make the change incremental?
Can you release the change into a test environment?
Can you easily roll back if there are unexpected consequences?
Releasing a potentially breaking change
CREDITS: This presentation template was adapted from a template by Slidesgo, including icons by Flaticon, and images by Unsplash
Editor’s Notes
I’m Lisa Karlin Curtis, born and bred in London.
I’m a software engineer at GoCardless working in our core-banking team.
I’m gonna be talking about how to stop breaking other people’s things
We’re going to start with a sad story.
A developer notices that they have an endpoint that has a really high latency compared to what they’d expect.
They find a performance issue in the code (essentially an exacerbated N+1 problem), and they deploy a fix.
The latency on the endpoint goes down by a half. The developer stares at the beautiful graph with a lovely cliff shape, feels good about themselves, and moves on.
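The kind of N+1 fix described might look like the sketch below. This is a hedged illustration, not GoCardless code: the toy `CountingDB` store and its `fetch_one`/`fetch_many` helpers are invented here purely to make the query counts visible.

```python
class CountingDB:
    """Toy in-memory payments store that counts queries,
    just to make the N+1 pattern visible."""
    def __init__(self, statuses):
        self.statuses = statuses  # {payment_id: status}
        self.queries = 0

    def fetch_one(self, payment_id):
        self.queries += 1
        return self.statuses[payment_id]

    def fetch_many(self, payment_ids):
        self.queries += 1
        return {p: self.statuses[p] for p in payment_ids}


def statuses_n_plus_one(db, payment_ids):
    # Before the fix: one query per payment, so N round-trips.
    return [db.fetch_one(p) for p in payment_ids]


def statuses_batched(db, payment_ids):
    # After the fix: a single query fetches every payment at once.
    found = db.fetch_many(payment_ids)
    return [found[p] for p in payment_ids]
```

Both functions return the same statuses; the batched version just collapses N round-trips into one, which is exactly the kind of change that halves an endpoint’s latency.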
Somewhere else in the world, another developer gets paged - their database CPU usage has spiked and it is struggling to handle the load.
So what happened here?
They start investigating - there’s no obvious cause. No recent changes, request volume is pretty much as expected.
They start scaling down queues to relieve the pressure, which solves the immediate issue. The database seems to have recovered.
Then they notice something strange. They’ve suddenly started processing webhooks much more quickly than they used to.
It turns out that our integrator had a webhook handler that would receive a webhook from us and then make a request back to find the status of the resource.
This was the endpoint that we had fixed earlier that day.
By the way, I’m going to use the word integrator a lot - what I mean is people who are integrating against the API that you are maintaining.
Sometimes that will be inside your company, sometimes it will be a customer.
Back to the story.
That webhook handler spent most of its time waiting for our response, before then updating its own database.
So the slow endpoint was essentially rate limiting the webhook handler’s interaction with its own database.
It’s worth noting that our webhooks are often a result of batch processes, so they are really spiky - we send lots of them in a short space of time, a couple of times a day
As the endpoint got faster, during those spikes, the webhook handler started to apply more load to the database than normal, to such an extent that an engineer got paged to resolve a service degradation.
The fix here is fairly simple: scale down the webhook handlers so they process fewer webhooks and the database usage returns to normal.
Or alternatively, beef up your database.
This shows us just how easy it is to accidentally break someone else’s thing - even if you’re trying to do right by your integrators.
When do we break things?
To set the scene, here are some examples of changes that have broken code in the past:
Traditional API changes - adding a mandatory field, removing an endpoint, changing validation logic - I think we’re all comfortable with this stuff
Introducing a rate limit / changing your rate limiting logic - docker did this recently and I think communicated really clearly, but it obviously impacted lots of their integrators
Changing an error string: At GoCardless we found a bug where we weren’t respecting the accept-language header on a few of our endpoints, and we fixed it, and one of our integrators raised a ticket saying that we’d broken their software - it turned out they were relying on us not translating that particular error.
Breaking apart a database transaction
Changing the timing of your batch processing
We can see from our logs that certain integrators create lots of payments ‘just-in-time’ - i.e. just before our daily payment run, so we know that changing our timings without communicating with them would cause significant issues
Reducing the latency on an API call
END SLIDE at about 5-6 mins
I’m gonna define a breaking change as something where I (the API developer) do a thing and someone’s integration breaks.
And that happens because an assumption made by that integrator is no longer correct.
When this happens, it’s easy to criticise the engineer who made that assumption.
Assumptions are inevitable - as a developer you really can’t get anywhere without them
Even if it is their fault, it’s often your problem. Possibly not if you’re Google or AWS (unless it’s Slack that you’ve killed), but for most companies, if your integrators are feeling pain then you’ll feel it too, either immediately or in the long term when you’re trying to renew contracts.
There are a few different ways that assumptions develop
Some of these are explicit: an integrator asks a question, gets an answer, and builds their system based on that answer.
The first step when you’re building an integration is often to look at the documentation.
Although it’s worth noting that people often skip to the examples and don’t actually read any of the text that you have slaved over, so you really need to make sure that your examples are genuinely representative.
They might also look at support articles and blog posts - either stuff you’ve published
Or maybe from a third party.
And then you have ad hoc communication
So what I mean by this is random emails or phone calls, maybe with a pre-sales team or your solution engineers,
it might be a conversation that gets had on a support ticket.
It might be emailing the friend that you have that used to work at the company
and all of that kind of ad hoc communication is still driving the assumptions that integrators make about how your software is going to behave.
Other assumptions are more implicit.
Industry standards are quite interesting: if you send me a JSON response, you’re going to give me an application/json Content-Type header.
So I don’t need to tell my HTTP client that it’s going to be JSON because it can work that out for itself, and as an integrator I’m going to assume that never changes.
Similarly, I assume that you will keep my secrets safe.
So if you tell me my access token was used to create something, I’ll assume it was me.
Generally this stuff is fine, but in some cases you can find yourself in trouble if these standards change
We had a really bad incident where we upgraded our HAProxy version, which was observing the new industry standard
And downcased all our outgoing HTTP headers.
According to the spec, HTTP header names should not be treated as case sensitive,
but a couple of key integrators had been relying on the previous behaviour and had a significant outage.
And that outage was actually exacerbated by the fact that their requests were being processed but they weren’t processing our responses,
and that meant that we had two systems that were out of sync in a really unfortunate way.
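A defensive, case-insensitive header lookup on the integrator side would have absorbed that change. A minimal sketch, assuming headers arrive as a plain dict (many real HTTP libraries already expose case-insensitive header mappings):

```python
def get_header(headers, name):
    """Case-insensitive header lookup: HTTP field names are defined
    as case-insensitive, so 'Content-Type' and 'content-type' must
    be treated as the same header."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get(name.lower())
```

With this in place, a proxy downcasing every outgoing header is invisible to the consumer.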
Observed behaviour
Skip to next slide!
As an integrator, you want the engineers who run the services you use to be constantly improving them and adding features,
but in a way you also want them to never touch anything, so you can be sure the behaviour won’t change.
As soon as a developer sees something, whether that’s
An undocumented header on an HTTP response
A batch process that happens at the same time every day
A particular API latency
They assume it’s reliable and build their systems accordingly.
Humans also pattern match really aggressively - not just in software but in all walks of life.
We find it very easy to convince ourselves that correlation = causation
And that means particularly if we can come up with an explanation of why A always means B, we are quick to accept and rely on it.
When you think about it, this is a bit bizarre - we are all employed to make changes to our own systems,
We should understand that they are constantly in flux.
We also all encounter interesting edge cases every day where someone has hit some incredibly unlikely scenario that’s caused your code to misbehave.
But we all assume that everyone else’s will stay exactly the same forever.
T-15 mins
None of this stuff is new. A great example of this is MS-DOS.
MS-DOS was released with a number of documented interrupts, calls, hooks - all that retro stuff - but early application developers found that they weren’t able to achieve everything they wanted.
This was made worse because Microsoft would use undocumented calls in their own software, so it was impossible to compete using only what was in the documentation.
So like all good engineers, they started decompiling the OS, and writing lists of undocumented information like Ralf Brown’s Interrupt List.
This information was shared, and using these undocumented features became so widespread that Microsoft couldn’t change anything without breaking all these applications that people used every day.
We can think of the interrupt list being analogous to someone writing a blog on medium called ‘10 things you didn’t know that X API could do’
Some of these assumptions are also unconscious.
Once something is stable for a while, we sort of just assume it will never break.
We also make our resourcing choices based on previous data, because napkin math is always quite haphazard.
So when I’m choosing how much CPU to allocate to my pod, I pick a number out of thin air, see what happens, and then change it until it’s happy.
That works fine as long as what that pod is being asked to do is reasonably consistent over time, but as we've discussed that's not always true.
We can think about this in our first story - the database had plenty of resource until our endpoint got faster
So if we want to stop breaking other people’s things, we need to help our integrators stop making bad assumptions.
Document edge cases
Discoverability is important - think about SEO and also search within your docs site
Don’t ever deliberately not document something. If it’s subject to change, call it out so there’s no ambiguity.
Keep your own religiously up-to-date and searchable
If you’ve got 3rd party blogs that are incorrect, try contacting the author or commenting with the fix needed to make the guide work, or point them at an equivalent page.
If you get unlucky, that 3rd party content can become the equivalent of ralf brown’s interrupt list.
Consistency is key.
If a developer wants to understand what might break things, they need to know what communication is going out, ideally in a super searchable format.
In my experience many B2B software companies end up emailing random PDFs around or creating shared slack channels, at which point the engineers working on the product don’t really stand a chance of knowing what assumptions might have been made as a result.
Follow them where you can
Flag really loudly if you can’t, or where the industry has not yet settled
There’s a lot to think about with observed behaviour
Naming is really important. Particularly when developers don’t read the docs and just look at the examples
An example is numbers that begin with zeros, which often get truncated (e.g. company registration numbers).
We also have a field in our API called ‘account_number_ending’, but unfortunately in Australia some account numbers have letters in them, which is pretty sad.
You can also try to draw attention to it in the docs - particularly by making the example include the edge case
Use documentation and communication to combat pattern matching
If you know you could change your batch timings, call that out in the docs ‘we currently run it once a day at 11am, but this is likely to change’
Expose information on your API that you might want to change - it’s a good flag.
Restrict your own behaviour both by documenting a limit and then implementing it in the code to ensure you keep to that commitment.
We had an issue at GoCardless where somebody that we integrate with started adding a lot of extra events to each webhook
And our webhook handlers ran out of memory because they were loading so much data.
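One way to make that commitment concrete is to enforce the documented limit in code, so the docs and the behaviour can’t drift apart. A minimal sketch, where the cap of 50 events per webhook is an illustrative number, not a real GoCardless limit:

```python
MAX_EVENTS_PER_WEBHOOK = 50  # illustrative documented limit


def build_webhook_payloads(events):
    """Split a batch of events into payloads that never exceed the
    documented cap, so integrators can safely size their handlers
    against the limit published in the docs."""
    return [events[i:i + MAX_EVENTS_PER_WEBHOOK]
            for i in range(0, len(events), MAX_EVENTS_PER_WEBHOOK)]
```

Because the sender enforces the same number it documents, an integrator who sizes memory for 50 events can never be surprised by 5,000.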
T - 11 mins
For complex products, it’s very unlikely that all your integrators will have avoided bad assumptions.
So we need to find strategies to mitigate the impact of our changes.
The first thing to remember is that a change isn’t either breaking or not breaking. If an integrator has done something strange enough, almost anything can be breaking.
This binary is historically used to assign blame: if it’s not ‘breaking’ then it’s the integrator’s fault.
As we discussed earlier, it may not be technically ‘your fault’ but it’s probably still your problem.
If your biggest customer’s integration breaks, the fact that you didn’t ‘break the rules’ will be little consolation to the engineers up all night trying to resolve it.
So instead of thinking about it as a yes/no question - we should think about it in terms of probabilities.
How likely is it that someone is relying on this behaviour.
Not all breaking changes are equal - yes some changes are 100% breaking (e.g. killing an endpoint).
But many are neither 0% nor 100%
Try to empathise with your integrators about what assumptions they might have made.
Use people in your organisation who are less familiar with the specifics than you are to rubber duck.
If possible, try and talk to some of them.
If you can, find ways to dogfood your APIs to find tripwires. This is particularly good as an onboarding exercise - it helps your new joiners immediately put themselves in the shoes of your integrators,
And helps you keep docs and guides up-to-date as well as introducing them to your product.
Sometimes you can even measure it - add observability to help you look for people relying on this undocumented behaviour - for example we can see a spike in Payment Create requests every day just before our payment run.
This can also help you identify which integrators will be impacted
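A sketch of that kind of measurement, with the 11am cutoff and 30-minute window borrowed from the batch-timing example earlier (both values are illustrative):

```python
from datetime import datetime, timedelta


def is_just_in_time(request_time, cutoff_hour=11,
                    window=timedelta(minutes=30)):
    """Flag requests that land just before the daily batch run.
    A daily spike of these suggests integrators are relying on the
    current timing, so changing it is a likely breaking change."""
    cutoff = request_time.replace(hour=cutoff_hour, minute=0,
                                  second=0, microsecond=0)
    return cutoff - window <= request_time < cutoff
```

Tagging requests this way (and grouping by integrator) turns “probably breaking” into a measured list of who would actually be affected.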
Scale your release approach depending on how many integrators you think have made the bad assumption.
We want to have different strategies to employ at different levels.
If we over communicate, we get into a ‘boy who cried wolf’ situation where no-one reads anything you send them, and their stuff ends up breaking anyway.
Surprisingly, the email in their inbox that they didn’t read doesn’t make them feel better.
Start at pull comms - updating docs or a changelog. This is useful to help integrators recover after they’ve found an issue
You can then upgrade to push comms - perhaps a newsletter or email.
This is where it gets tough - we all ignore emails every day - so try to make sure the content is as relevant as possible.
Don’t tell integrators about changes to features they don’t use, and try to resist the temptation to include marketing content in the developer-focussed comms.
Then if you’re really worried, you can use explicitly acknowledged comms.
This works well if you have a few key integrators you want to check in with before pulling the trigger.
T-5 mins
We can also mitigate the impact of a breaking change by releasing it in different ways.
If at all possible, you want to try and make changes incrementally to help give early warning signs to your integrators.
For example, apply the new behaviour to a % of requests.
That will help integrators avoid performance cliffs and could turn a potential outage into a minor service degradation.
Many integrators will have ‘near miss’ alerting to help them identify problems before they cause significant damage.
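A common way to implement the percentage rollout is deterministic hashing, so a given integrator always sees the same behaviour while the change ramps up. A sketch under assumed names (the bucketing scheme and `in_rollout` function are illustrative):

```python
import hashlib


def in_rollout(integrator_id, percent):
    """Deterministically bucket integrators into [0, 10000) and
    enable the new behaviour for the lowest `percent` of buckets.
    The same id always lands in the same bucket, so behaviour is
    stable for each integrator as `percent` is ramped up."""
    digest = hashlib.sha256(integrator_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 10000  # 0..9999
    return bucket < percent * 100
```

Ramping `percent` from 1 to 100 over days gives integrators’ near-miss alerting a chance to fire before the change hits all of their traffic.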
If you’ve got a test or sandbox environment, that’s also a great candidate. Making changes there (if integrators are actively using it) can act as the canary in the coal mine.
The final point is about rolling back - if your biggest integrator phones you and tells you that you’ve broken their integration, it’s really nice to have a kill switch in your back pocket to stop the bleeding.
Now that's obviously not always possible, because it totally depends on the nature of the change.
But it’s worth knowing what that kill switch is, and also being really clear internally about when that is and isn’t possible, so that as soon as that call comes in, you know what your options are.
The only way to truly avoid breaking other people’s things, is to not change anything at all, and often even that is not possible.
Also, we’d mostly be out of a job.
Instead, we should think in terms of managing risk.
We’ve talked about ways of preventing these issues by helping your integrators make good assumptions in the first place,
And how important it is to build and maintain a capability to communicate when you are making potentially breaking changes to help mitigate the impact
But, you aren’t a mind reader, and integrators are sometimes careless and under pressure, just like you.
So be cautious; assume that your integrators didn’t read the docs perfectly, or at all, and may have cut corners.
They may not have the observability of their systems that you might hope or expect.
You need to find the balance between caution and product delivery that’s right for your organisation.
For all the modern talk of ‘move fast and break things’, it is still painful when stuff breaks and it can take a lot of time and energy to recover.
Building trust with your integrators is critical to the success of a product, but so is delivering features.
We may not be able to completely stop breaking other people’s things, but we can definitely make it much less likely if we put the effort in.
I hope you’ve enjoyed the talk - thank you for listening!
Please find me on twitter at @paprikati_eng if you’d like to chat about anything we’ve covered today
Have a great day!