The problem with talking about code quality is that “quality” is subjective. The term implies something to measure or compare against, and that measure is often missing from the conversation. There are also two types of quality: functional and non-functional. Functional quality is simple to define: does it work as intended? Non-functional quality, meanwhile, is intangible, which is part of what makes it famously difficult to explain and quantify. Defining non-functional quality requires considering what the baseline of that quality is, and that baseline will depend on the level of abstraction.
Non-Functional Code Quality
We can define three levels of non-functional code quality, each varying in its level of abstraction, its relevant context, and how measurable it is:
- Code-Level: naming conventions (camelCase vs snake_case), formatting, etc
- Fuzzy-Level: complexity, architecture, flexibility, test suites
- Business-Level: matching business domain (ubiquitous language, domain modeling), meeting the company’s values
Code-Level Quality is low-impact, objective, and concrete; it borrows context from the language and ecosystem, and is easy to measure objectively, with black-and-white answers. Business-Level Quality is high-impact, subjective, and very abstract; it exists in the context of the market and the company's situation, and is measured in business results. Fuzzy-Level Quality is what most people think of when they talk about “code quality,” and it borrows aspects from both.
Code-Level and Business-Level Quality
Code-Level Quality context comes from the conventions of a programming language or ecosystem (e.g. indentation is 2 spaces in Ruby/JS, 4 in Rust, tabs in Go). It impacts functional quality only if it's so inconsistent that it hurts readability, stealing focus from more important topics. Code-Level Quality can almost entirely be checked with a good linter, which will fail code that doesn't match clearly defined rules. Holy wars aside, this is generally a Solved Problem. I personally install the most restrictive set of lint rules I can find and move on.
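As an illustration, a strict Ruby setup might pin these rules in a `.rubocop.yml` (the cop names are real RuboCop cops; the specific choices are arbitrary examples, not recommendations):

```yaml
# .rubocop.yml -- an illustrative strict configuration.
AllCops:
  NewCops: enable              # opt in to every newly added cop

Layout/IndentationWidth:
  Width: 2                     # the Ruby convention: 2 spaces

Style/StringLiterals:
  EnforcedStyle: double_quotes # pick one style; which one matters less
```

Because every rule is a clear pass/fail, the linter, not a human reviewer, gets to be the pedant.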
Business-Level Quality can impact product quality in intangible ways, with leaky abstractions forcing a suboptimal way of working and subtly impacting product viability. Its context is The Business. Business-Level Quality bridges the gap to Product Management, involving topics like company/product strategy. It covers questions like whether the architecture matches the business domain, or the language users use, and is impossible for computers to measure. The best discussion of these topics is the classic Domain Driven Design.
Fiddly Middly Quality
This leaves Fuzzy-Level Quality. Fuzzy-Level Quality is partly objective, partly subjective. Its impact can sometimes be negligible, and other times profound. Its context comes equally from the ecosystem and from the business. It's so fuzzy that even the parts that cannot be measured directly can sometimes be measured indirectly.
This partly-abstract evaluation of quality includes factors such as readability, maintainability, architecture, security, flexibility, stability, bug density, scalability, naming, complexity, test suites, and more. Each is worthy of its own discussion: what quality means in that context, its impact, how to measure it, and how to define its baseline. But diving into each individually would be missing the forest for the trees. Any discussion of quality on any of these metrics must first uncover which ones to prioritize.
Values
Even with infinite resources and time, it would be impossible to be good on every measure of quality. Many factors are fundamentally opposed; some are actually opposites. A scalable system will probably be inflexible and slow to change; moving as fast as possible means breaking things. We don’t have infinite resources, so we’ll fall far short of even that. When someone talks about quality code, they are implicitly ranking a set of these factors as what they mean when they say “good quality.”
Good Quality for a startup means prioritizing velocity; for NASA, it means stability and being bug-free. This brings us back to the fuzziness of this level of quality: while many factors of quality can be measured semi-objectively, the ranking of those factors comes from concerns more closely aligned with Business-Level Quality. The tradeoffs and rankings that are explicitly (and implicitly) incentivized in a given context are the values of that context.
How an organization defines “quality” — its small-v values — will result in a different weighting of each signal. The values should draw directly from the business, its domain, and its strategy. Small-v values are culture: they are the sum of incentives and behaviour, and they exist whether you intend them to or not.
There is tremendous insight to be gained about products and code by defining the values of each. VSCode values approachability, Emacs extensibility, Vim stability, Neovim velocity (https://www.murilopereira.com/the-values-of-emacs-the-neovim-revolution-and-the-vscode-gorilla/). Unlike the meaningless big-V Values that companies often say they have, real values are tradeoffs: choices between the many factors we could invest time and effort into. Engineering values should give guidance for what quality is, but often don't. The only way to determine the values, then, is to inspect how a company behaves and what it rewards. Often, what an org values is not what it should value.
When Quality Doesn’t Matter
There's an important modifier to many factors of quality: most are indirect measures of a future state (velocity, ease of introducing bugs, and so on). If the code works perfectly, with no bugs or security flaws, and no one ever looks at it again, then it doesn't matter if the indentation mixes tabs and spaces and the variables are named after the author's favourite Sesame Street characters.
The importance of non-functional code quality is a function of its churn: how often it is read and/or modified. Every change to code risks introducing another bug, and the more complex the code, the easier it is to introduce one. High-churn, high-complexity code is at high risk of bugs. Fuzzy-Level Quality is therefore important to monitor and improve, because doing so will decrease bugs.
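Churn is easy to approximate from version control. A minimal Ruby sketch, assuming a git repository and treating commits-per-file as the proxy:

```ruby
#!/usr/bin/env ruby
# Rank files by churn: how many commits have touched each one.
# Run from the root of a git repository.

churn = Hash.new(0)

# `git log --format= --name-only` emits just the paths changed by
# each commit, one per line.
`git log --format= --name-only`.each_line do |line|
  path = line.strip
  churn[path] += 1 unless path.empty?
end

# The ten highest-churn files are the best candidates for quality work.
churn.sort_by { |_path, count| -count }.first(10).each do |path, count|
  puts format("%4d  %s", count, path)
end
```

Cross-referencing that list with a complexity measure highlights exactly where quality effort will pay off.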
What is High Quality Code?
The best quality code is code that is consistently improving in the ways the business values, where the right amount of effort has been invested in the right places. Once-written, never-read code can be messy. What's important is that each time it is read, and certainly each time it is changed, it gets better. Leave the campsite better than you found it.
Software creation is almost always a discovery process, so a key priority is almost always velocity and flexibility. Simpler code is easier to read and understand, and harder to introduce bugs into. With few exceptions, these are what we, as an industry, value.
Beyond this, context matters. Startups will prioritize velocity over stability — some, even over security. Scale-ups will value scalability and approachability. Some large companies will value stability over velocity. Health care and aerospace will prioritize avoiding bugs, the complete opposite of a consumer-facing social media app.
You'll have to decide for yourself what quality means to you, to your industry, and to your company. You may find that your values don't align with those of your company. That's worth thinking about. Hopefully you can find words for it now, and perhaps you'll be able to be more explicit about what you mean the next time you talk about code quality.
PS: But Can We Measure It?
As a final note, it’s worth considering how code quality can be improved over time. The trick is making incentives explicit. Linters are a great example: defining a pass/fail rule connected to a PR status check means that rule will be enforced. That’s generally why Code-Level Quality is pretty much a Solved Problem. We’ve established that Business-Level Quality cannot really be measured. What about Fuzzy-Level Quality?
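As a sketch of what “connected to a PR status check” can look like (a hypothetical GitHub Actions workflow; any CI system with required checks works the same way):

```yaml
# .github/workflows/lint.yml -- illustrative, not prescriptive.
name: lint
on: [pull_request]

jobs:
  rubocop:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: "3.3"
          bundler-cache: true
      - run: bundle exec rubocop  # a non-zero exit fails the status check
```

Mark the check as required and the incentive is fully explicit: failing code cannot merge.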
What Can Be Measured?
To improve something, we must measure it. Complexity, such as cyclomatic/perceived complexity, is measurable. It’s also possible to measure code coverage, with special value for coverage of new (or modified) code. Security can be measured with periodic pentests. These can be multiplied by how often that code changes, its churn. These checks give decent proxies for “is this maintainable”: a long, complex method is probably less maintainable than a shorter one.
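RuboCop's Metrics cops, for example, turn complexity ceilings into pass/fail rules (a hedged sketch; the Max values here are arbitrary, not recommendations):

```yaml
# .rubocop.yml -- illustrative complexity ceilings.
Metrics/CyclomaticComplexity:
  Max: 7    # linearly independent paths through a method
Metrics/PerceivedComplexity:
  Max: 8    # an estimate of how hard a human finds the method to follow
Metrics/MethodLength:
  Max: 20   # lines per method
```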
[Cyclomatic] complexity is the number of linearly independent paths through a method. (https://docs.rubocop.org/rubocop/cops_metrics.html#metricscyclomaticcomplexity)
What’s hard, fascinating, and overlooked about Fuzzy-Level Quality is that word “probably.” All these factors are somewhat subjective, and the measures produce a shade of grey rather than a pass/fail. There’s also an opportunity there: it means that it’s possible to nudge code to be better by tracking these subjective measures over time.
Tooling
The available tooling for complexity and code coverage typically sets an arbitrary limit for how complex something can be, but it’s not like a method goes from 100% maintainable to 0% maintainable with one extra conditional. Human judgment is a factor, too: sometimes code measures as complex, but improving on that metric will actually hurt readability or maintainability.
Even the initial value doesn’t matter so much. Who knows if this will be read or changed much in future? Maybe the whole approach will be immediately thrown out. What matters most is that it improves over time.
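In the absence of off-the-shelf tooling, a team can hand-roll a crude ratchet: record the current measure as a baseline, fail any change that makes it worse, and tighten the baseline whenever it improves. A minimal Ruby sketch (the baseline file and the keyword-counting heuristic are illustrative stand-ins for a real metric):

```ruby
#!/usr/bin/env ruby
# A quality ratchet: the score may hold steady or improve, never regress.

BASELINE_FILE = ".complexity-baseline"
# Crude complexity proxy: count branching keywords and operators.
BRANCHES = /\b(?:if|elsif|unless|while|until|when|rescue)\b|&&|\|\|/

score = Dir.glob("**/*.rb").sum do |path|
  File.read(path).scan(BRANCHES).size
end

baseline = File.exist?(BASELINE_FILE) ? File.read(BASELINE_FILE).to_i : score

if score > baseline
  abort "Complexity rose from #{baseline} to #{score}: please simplify."
elsif score < baseline
  File.write(BASELINE_FILE, score.to_s) # lock in the improvement
  puts "Complexity improved: #{baseline} -> #{score}. Baseline tightened."
else
  puts "Complexity unchanged at #{score}."
end
```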
Our industry is missing effective tools to measure and enforce this subjective part of code quality. Since Code Climate Quality became abandonware, no tool has provided effective ways of subjectively measuring quality and ensuring grey measures of quality improve over time.
Humans Needed
Finally, it's important to be aware of what we cannot measure. We can trivially see if the test suite passes; it's far more difficult to know if the tests meet the real-life requirements. Abstract topics like architecture are harder, or impossible, to measure because they relate to offline, human concepts — concepts which may themselves be evolving as we discover more. These must be monitored and periodically evaluated by humans.