The path between Software and Site Reliability engineering is shorter than you think.

Pierrick Gicquelais
Apr 17, 2021

Well, is it really? I think so.

I am a software engineer, which means I spend my time coding and designing solutions to business problems. Nevertheless, for the past couple of years, my work has been shifting toward handling more and more operational situations on a day-to-day basis.

This is actually what could be defined as Site Reliability engineering, or SRE.


What is “SRE”?

According to its Wikipedia page, it can be defined as:

[The SRE] is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.
A site reliability engineer (SRE) will spend up to 50% of their time doing “ops” related work such as issues, on-call, and manual intervention […], the other 50% of their time on development tasks.

Basically, the job description is to package, deliver and operate.
The proper way to bridge these two worlds is to adopt the DevOps principles:

  1. Reduce organizational silos.
  2. Accept failure as normal.
  3. Implement gradual changes.
  4. Leverage tooling and automation.
  5. Measure everything.

Here is a list of low-effort changes you can apply to your day-to-day work as a software engineer to start down the path to site reliability engineering.

Reduce organizational silos

This is one of the key elements to work on: reduce the distance between the development team and the ops team (if it really exists).

Your coworkers may have told you about an issue with how your application is exposed, or about missing environment parameters it requires.
So how can you efficiently come up with solutions to these issues, even if you are neither a networking expert nor a Linux administrator?

In my opinion, the easiest way to fix this is to define and develop your application’s package together. Start writing your Dockerfile or Helm charts with the ops team. Nowadays, thinking container-first is a requirement. By doing so, you help your team properly understand how your application actually works, from the build stages to the deploy stages.

While the package is being developed, try out their tools, their CI/CD pipelines and their scripts: there may be room for little tweaks or improvements, and they could come from you.

Accept failure as normal

To fail is to learn. To fail over is to learn again. Over and over again.

Failure is common ground in the software engineering world: who could claim today to be able to develop a complete application from bottom to top without ever debugging? The same is true of site reliability. But how can someone help debug an application they did not develop? The solution is pretty clear: monitor and log every important phase of your project.

There are great tools for collecting and visualizing logs, such as Kibana, Fluentd or Splunk, but each of them needs fuel to work: logs. Web servers already come with pre-configured log formats, so use them! Also, try to find the right balance between too few and too many logs, and add log lines in your project (before your validation process in a view, or right after processing database transactions, for example).

You may also want to handle the panics and exceptions your application may throw. A tool like Sentry is very powerful for capturing the entire stack trace of an error, and integrating it into most platforms is a low-effort task.
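
As a minimal sketch of the two logging points suggested above (before validation and right after a database transaction), here is what it could look like with Python's standard `logging` module. The `create_order` handler, its payload fields and the in-memory `ORDERS` list are assumptions made up for illustration; they stand in for your own view and database.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("orders")

# Hypothetical in-memory "database", used only for illustration.
ORDERS = []


def create_order(payload: dict) -> int:
    """Validate a payload, store it, and log both steps."""
    # Log before validation so that rejected requests leave a trace too.
    logger.info("order received: keys=%s", sorted(payload))

    if "customer_id" not in payload or payload.get("quantity", 0) <= 0:
        logger.warning("order rejected: invalid payload")
        raise ValueError("invalid order payload")

    # Stand-in for a database transaction.
    ORDERS.append(payload)
    order_id = len(ORDERS)

    # Log right after the transaction, with the identifiers needed to debug later.
    logger.info("order stored: id=%d customer=%s", order_id, payload["customer_id"])
    return order_id
```

Two or three lines per request like these are usually enough to reconstruct what happened, without drowning the collector in noise.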

Implement gradual changes

Managing risk is one of the great soft skills for an SRE to embrace: it means being able to assess how risky a production release or an update could be.

There is no secret for this one: gradual, small changes should simply become the norm in your processes. No big migration or large core refactor is free of pitfalls, and what makes both of them dangerous is the inability to iterate fast. If your migration goes wrong, would you be able to roll back? If your refactor is pushed to production, would you be able to guarantee the same quality of service for all your endpoints or workers?

With strategies like blue-green deployments (or rolling updates in Kubernetes), you run two environments side by side, the current one and the new one, so you can test and verify that everything works on both before switching all traffic to a single one. Combined with small changes, this strategy fits a refactor nicely: you can migrate one endpoint at a time, for example.

Since being able to roll back is a key requirement in this process, conventions such as SemVer or Gitflow are powerful tools to help anyone understand which version of your application is currently deployed and how to handle upgrades and downgrades.
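
To make the idea concrete, here is a toy, in-process sketch of a blue-green switch. It is not how a load balancer or Kubernetes implements it; the `BlueGreenRouter` class, the two handler functions and the health check are all hypothetical names invented for this example. The point is only the shape of the process: verify the idle environment, switch traffic, and keep the previous environment around for rollback.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


class BlueGreenRouter:
    """Routes requests to the live environment; the other one stays idle."""

    def __init__(self, blue: Handler, green: Handler) -> None:
        self.environments: Dict[str, Handler] = {"blue": blue, "green": green}
        self.live = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def handle(self, request: str) -> str:
        return self.environments[self.live](request)

    def promote(self, health_check: Callable[[Handler], bool]) -> bool:
        """Switch traffic to the idle environment only if it passes the check."""
        candidate = self.idle
        if not health_check(self.environments[candidate]):
            return False  # keep serving from the current environment
        self.live = candidate
        return True

    def rollback(self) -> None:
        """Send traffic back to the previous environment."""
        self.live = self.idle


def old_version(request: str) -> str:
    return f"v1:{request}"


def new_version(request: str) -> str:
    return f"v2:{request}"


router = BlueGreenRouter(blue=old_version, green=new_version)
```

Calling `router.promote(check)` moves traffic to the new version only when `check` accepts it, and `router.rollback()` restores the previous one in a single step, which is what makes small, frequent releases cheap to undo.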
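
As an illustration of why a versioning convention helps, the sketch below classifies a move between two versions as an upgrade or a downgrade and tells which part changed. It assumes plain `MAJOR.MINOR.PATCH` strings (optionally prefixed with `v`) and deliberately ignores pre-release and build metadata; `parse_version` and `describe_change` are names made up for this example.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' (optionally prefixed with 'v') into integers."""
    major, minor, patch = version.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def describe_change(current: str, target: str) -> str:
    """Classify moving from `current` to `target`."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt == cur:
        return "no change"
    # Tuples compare element by element, which matches version precedence here.
    direction = "upgrade" if tgt > cur else "downgrade"
    if tgt[0] != cur[0]:
        level = "major"
    elif tgt[1] != cur[1]:
        level = "minor"
    else:
        level = "patch"
    return f"{level} {direction}"
```

A "major downgrade" is the kind of rollback that deserves a second look before running it, while a "patch downgrade" should be routine.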

Leverage tooling and automation

Measure and eliminate the toil.

There is nothing worse than doing the same thing over and over, every day. This labor is what keeps you away from the things you want or need to do.

By creating smart scripts or a CLI for your application, you make it more user-friendly at the very least, and also more accessible. It could seem trivial, but a CLI command that sends a request to your app, waits for a response, interprets it and then forwards the result to another endpoint opens the door to more automation in your day-to-day work.

There are a lot of tools in the wild that can help you automate: you can use Makefile targets (or Mage if you do not like make) to help you or your coworkers script common tasks around your application, build a useful CLI with Cobra, or check what Argo (or any CI/CD platform) can automate for you.

The article would be too long if we tried to list them all. Keep track of all the painful tasks you have to deal with every day, and try to eliminate them when you can.
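
Here is a rough sketch of such a command using only the Python standard library. Everything specific in it is an assumption for illustration: the `appctl` program name, the idea that the source endpoint returns JSON with a `status` field, and the shape of the payload sent to the destination.

```python
import argparse
import json
import urllib.request


def fetch(url: str, timeout: float) -> dict:
    """Send a GET request and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def interpret(body: dict) -> dict:
    """Reduce the response to the fields the next endpoint needs."""
    return {"healthy": body.get("status") == "ok"}


def forward(url: str, payload: dict, timeout: float) -> int:
    """POST the interpreted result as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="appctl", description="Relay a health check.")
    parser.add_argument("source", help="URL to query")
    parser.add_argument("destination", help="URL that receives the result")
    parser.add_argument("--timeout", type=float, default=5.0, help="seconds to wait")
    return parser


def main(argv=None) -> int:
    """Request, interpret, forward; the return value is meant as an exit code."""
    args = build_parser().parse_args(argv)
    result = interpret(fetch(args.source, args.timeout))
    status = forward(args.destination, result, args.timeout)
    return 0 if 200 <= status < 300 else 1
```

Wiring `main` to a console entry point is left out. Keeping `interpret` separate from the network calls makes the part that carries the logic easy to test, and the exit code lets a Makefile target or a CI job react to the result.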

Measure everything

And by everything, I mean everything.

Three key concepts fall within the SRE scope: service level indicators (SLIs), service level objectives (SLOs) and service level agreements (SLAs). Each of them is linked to the others: you cannot meet your objectives and agreements without correct and meaningful indicators.

These indicators must be directly observable and measurable by everyone, and should reflect the user experience. A great way to start measuring them is to follow 3 principles:

  • Log what matters. We already discussed the need to log meaningful information from your application. It still holds here, because those logs can help you build interesting indicators of how customers use your project.
  • Expose business metrics. These metrics are mandatory if you want to know how your business is doing. They could expose the number of error status codes your application returned to your customers, or simply the number of customers who abandon the payment process before entering their credit card details. The more metrics you expose, the better you will understand the usage, so use them generously. Prometheus and Thanos are powerful tools that let you achieve that.
  • Trace everything. This might seem obvious, but try to trace each request made to your application. It will let you know how and where a request went wrong, or why it takes minutes to respond. With tools like OpenTelemetry and Jaeger, tracing won’t be a problem for you anymore.
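
To show how an indicator feeds an objective, here is a small worked sketch. It assumes a request-based availability indicator and expresses the objective as a fraction (0.999 for "99.9%"); the function names and the figures in the usage note are illustrative, not taken from a real service.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget still unspent (negative once the SLO is missed)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo  # what the objective tolerates
    actual_failure = 1.0 - sli   # what the indicator measured
    return 1.0 - actual_failure / allowed_failure
```

With 9,995 good requests out of 10,000 and an objective of 0.999, the objective tolerates 10 failed requests and 5 were observed, so about half of the error budget is left. Without the indicator, neither the objective nor any agreement built on it can be checked.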
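
The last two principles can be sketched with the standard library alone. This is only a toy stand-in: a real setup would export to tools like the ones named above rather than keep data in process memory. The `observed` decorator, the `STATUS_COUNTS` counter, the `SPANS` list and the `checkout` handler are all hypothetical names for this example.

```python
import time
import uuid
from collections import Counter
from functools import wraps

# In-process stand-ins for a metrics backend and a trace exporter.
STATUS_COUNTS = Counter()
SPANS = []


def observed(handler):
    """Count response status codes and record one timed span per call."""

    @wraps(handler)
    def wrapper(*args, **kwargs):
        trace_id = uuid.uuid4().hex
        start = time.perf_counter()
        status = 500  # reported if the handler raises
        try:
            status = handler(*args, **kwargs)
            return status
        finally:
            SPANS.append(
                {
                    "trace_id": trace_id,
                    "name": handler.__name__,
                    "duration_s": time.perf_counter() - start,
                    "status": status,
                }
            )
            STATUS_COUNTS[status] += 1

    return wrapper


@observed
def checkout(cart_total: float) -> int:
    """Hypothetical handler returning an HTTP-like status code."""
    return 200 if cart_total > 0 else 400
```

After a few calls to `checkout`, `STATUS_COUNTS` answers "how many requests failed?" and `SPANS` answers "which one, and how long did it take?", which are the two questions the bullets above are about.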

Conclusion

By embracing the DevOps principles, one by one, day by day, what you do today as a software engineer can cross paths with site reliability engineering.

The list above covers only some of what makes up the life of a real site reliability engineer, but no one said you need to be perfect and do everything at once.

Keep working at your own pace, but never stop being curious.

“I have no special talents. I am only passionately curious.” Albert Einstein
