Reliability Engineering in Startups

Draft

I currently work at a startup where a resilient product is part of the value proposition. The product is used in an interactive setting between a seller and a buyer, where it manages compliance requirements. As such, we cannot afford to be down during this phase.

Startups are typically resource constrained, so every team member needs to take on diverse tasks. Furthermore, the requirements tend to change fast.

The 6 Shields in Resilient Software

We have at least six different shields we can use when considering reliability. Each shield defends against certain attacks an adversary can make on the product.

The shields are as follows:

Types
Tests
Architecture
Requirements
Monitoring
Processes

A team can decide to invest more or fewer resources in each of these shields.

In startups, the key is to manage the amount of effort that goes into each shield. The resources allocated should be proportional to what the business requires. Not more, not less.

Choose your Tier

For each shield there is a free tier, a paid tier, and an enterprise tier.

The free tier is where doing it takes the same amount of resources as not doing it, given a software developer with basic training.

The paid tier is where specific rules are applied in the shield in response to issues that have been observed.

The enterprise tier is to dogmatically use state-of-the-art techniques, or to go beyond basic requirements, in order to solve large classes of issues that have never been experienced.

For the typing shield: The free tier is to use a typed programming language and stick to its typing primitives.

The paid tier is to model business logic in types to the extent that this removes issues that have been seen in production.

The enterprise tier for typing would be to use special languages or DSLs to model parts of the domain in order to ensure that incorrect code cannot be deployed.
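
To make the step from the free to the paid tier concrete, here is a minimal sketch in Python with type hints. It assumes a static type checker such as mypy runs before deployment, since Python does not enforce the hints at runtime. All names (SellerId, BuyerId, ConsentStatus, Deal, can_proceed) are hypothetical and only illustrate a seller and buyer flow.

```python
from dataclasses import dataclass
from enum import Enum


# Free tier: stick to the language's typing primitives.
# Nothing stops a caller from swapping the two ids or misspelling the status.
def can_proceed_free(seller_id: str, buyer_id: str, consent: str) -> bool:
    return consent == "given"


# Paid tier: model the business rules in types (all names are hypothetical).
@dataclass(frozen=True)
class SellerId:
    value: str


@dataclass(frozen=True)
class BuyerId:
    value: str


class ConsentStatus(Enum):
    PENDING = "pending"
    GIVEN = "given"
    WITHDRAWN = "withdrawn"


@dataclass(frozen=True)
class Deal:
    seller: SellerId
    buyer: BuyerId
    consent: ConsentStatus


def can_proceed(deal: Deal) -> bool:
    # Only the three enum members exist, so a misspelled status cannot be built.
    return deal.consent is ConsentStatus.GIVEN


deal = Deal(SellerId("s-1"), BuyerId("b-1"), ConsentStatus.GIVEN)
print(can_proceed(deal))  # True
```

With a type checker in place, passing a BuyerId where a SellerId is expected is reported before the code runs, whereas the string-based version accepts swapped arguments silently.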

Summarized for all six shields:

Types:
- Free: Use a typed language.
- Paid: Model business requirements in types.
- Enterprise: Use DSLs or delegate code to strongly typed languages.
Tests:
- Free: Inline asserts or a limited testing framework.
- Paid: Maintain test coverage at a set level.
- Enterprise: Implement multiple types of tests, ranging from asserts and unit tests to end-to-end tests, and maintain coverage targets for each.
Architecture:
- Free: Project-level consistency in coding patterns.
- Paid: Architectural principles that relieve certain types of common errors.
- Enterprise: Follow well-defined architectures at the system level.
Requirements:
- Free: A wish list of features.
- Paid: Note down important requirements; for larger projects, jot down requirements docs.
- Enterprise: Write complete requirements documentation, with reviews, before starting development.
Monitoring:
- Free: Write to the console and check the provider's metrics overview.
- Paid: Have explicit tools to manage logs and performance monitoring.
- Enterprise: Full-fledged monitoring tools with on-call and complete alerting packages.
Processes:
- Free: Have a todo list.
- Paid: Use some sort of task management tool, use releases as needed, etc.
- Enterprise: Full-fledged process management using frameworks like Scrum and friends.
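
As one illustration of moving a shield from the free to the paid tier, monitoring can go from console writes to structured logs that a log management tool can index. The sketch below uses only Python's standard logging and json modules. The JsonFormatter class, the "context" field, and the deal id are assumptions made for the example, not part of any particular tool.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical field passed via the `extra` argument.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


# Free tier: write to the console.
print("deal signed d-42")

# Paid tier: structured output with searchable fields.
log = build_logger("deals")
log.info("deal signed", extra={"context": {"deal_id": "d-42"}})
```

The paid-tier line carries the same information as the print call, but the deal id is now a field that can be filtered on instead of text that has to be parsed.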

When to Move Tiers?

Moving tiers is a response to two things: resource allocation and need.

The free tiers are typical for new startups or small projects with small teams.

The paid tiers are typically a response to a growing team or a requirement to develop features faster.

The enterprise tier is called so because it is often in enterprises that these practices are needed. They are characterized by being resilient to employee churn. But they are also known to suck the life out of team members.

The main thing is to understand the costs associated with moving up a tier. To a lot of owners and project managers it might seem beneficial to employ the enterprise tier on the process shield when starting new projects. However, the result will often be alienated teams with low performance and a lot of requests to clean up code.

The Pseudo Shield

Reliability issues can often be solved by more than one shield, so one shield can stand in for another. A common example would be to skip typing and instead rely on some sort of manual testing to catch the errors caused by the missing types.

Process shields are typically easier to set up. They don't require changing code and their risk is low. Furthermore, they can typically be handled by non-engineering members of the team.

The downside, however, is that a team can only lift so much process. So process mitigations ought to move into the other shields as time passes.
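
One way such a move could look: a manual release checklist item is turned into an automated test. The sketch assumes a hypothetical rule, "every supported country has a compliance template", and hypothetical names (SUPPORTED_COUNTRIES, COMPLIANCE_TEMPLATES, missing_templates). In a real project the test function would be collected by a test runner such as pytest rather than called by hand.

```python
# Hypothetical data; in a real code base this would come from configuration.
SUPPORTED_COUNTRIES = frozenset({"DK", "SE", "NO"})
COMPLIANCE_TEMPLATES = {
    "DK": "templates/dk.md",
    "SE": "templates/se.md",
    "NO": "templates/no.md",
}


def missing_templates(countries, templates):
    """Return the countries that have no compliance template, sorted."""
    return sorted(set(countries) - set(templates))


def test_every_supported_country_has_a_template():
    # Replaces the manual pre-release check with one that runs with the test suite.
    missing = missing_templates(SUPPORTED_COUNTRIES, COMPLIANCE_TEMPLATES)
    assert missing == [], f"missing compliance templates for: {missing}"


test_every_supported_country_has_a_template()
```

The checklist item disappears from the process shield and the same guarantee now lives in the test shield, where it costs nothing to repeat.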