Quality through SQL Coding Standards

In my professional life I’ve been mostly working with SQL as a language to specify how data should be processed. In this article I will guide you through the process of improving the quality of your work by applying simple standards.

SQL has many practical implementations based upon a solid theoretical foundation. Of course, vendors vary in the way they’ve implemented set theory, as does our own SQL engine. But in general if you have had sufficient training in logic and math, you can easily formulate your queries and requests.

However, SQL can also easily be abused by applying it as a language in scenarios where it is just not fit. Trying to build a Turing machine in SQL? Sure, it is possible, but not recommended. Calculate sinus of a number using solely SQL? Go ahead, feasible, but I would recommend switching to a language more suitable for mathematical formulas.

And in the rest of cases where SQL is a good language of choice, there are multiple ways to achieve a correct result and even yet more ways to achieve a seemingly correct result.

The approach chosen is often related to the vendor-specific functionality, such as with Oracle 6 where you only had the rule-based optimizer and even slightly confusing semantic similarities and differences between ‘’ and null.

But even on a vendor-specific platform, there are a number of rules of thumb that when applied dramatically improve the correctness and maintainability of SQL statements. We require these rules to have been applied on code before we accept cases on our SQL engine (unless it is of course a case where applying the rules works around a problem in our SQL engine).

Use Column Layout

Although screens have become wider and wider, using columns to lay-out your code makes it easier to spot bugs. The lay-out contains always at least two columns as in (LinkedIn is not obviously not meant for layout):

from   TABLE
where  BOOLEAN
and    BOOLEAN

The first column contains the keywords like ‘select’, ‘from’, ‘update’, etc. The width in characters is determined by the length of the most frequent keyword ‘select’ plus one character for the space. Some combined keywords like ‘order by’ are longer and therefore split across two lines.

The comma-separator and other separators are always placed in prefix notation on the left side of the line. Why? Because combined with the line comment ‘–’ this makes it easy to comment out any line of code in the same way without thinking, whether it is an expression early in the list, in the middle or at the end.

The ‘and’ is also considered a separator. Sometimes the and-ed arguments are simple and short, and in that case it is justifiable to put them together on one line. But in general each ‘and’ has some business background and putting them each on an individual line gives you space for explain the business background in a line comment before the ‘and’.

The ‘and’ and ‘or’ are hard to read and understand when precedence is not explicitly given using ‘(’ and ‘)’. Also, or-optimization is a matter in itself. Combined with the relative low frequency of ‘or’ it is better to consider a subordinate of the and-s and format them as given in the example.

In expressions, always type a space after a comma, colon, semi-colon, question mark and exclamation mark. Just as in English. This always signals a good location for line breaks on a space-constrained viewing device. But proper spacing also helps the human eye understand the structure.

Want to learn more about the impact of lay-out on worker productivity and risk management? Although old, my personal favorite is still: Ergonomics at Work by David J. Oborne due to it’s accessibility, well-chosen samples and solid theoretical foundation. Available for a few dollars at Amazon.

Lower case where possible

An essential part of well maintainable and analyzable code is the use of mixed or lower case characters as opposed to upper case character. What? Yes, it is an essential part. Thorough investigation on programmer productivity in maintenance and coding by parties such as IBM and Microsoft have shown this to have a major impact on these aspects and therefore quality and costs.

The human eye does not scan each individual character, but recognizes the shape of a word. These two representations of words have similar shapes: “IBM” and “HAL” when looked at a from a distance. Both occupy a rectangular area. But these two are totally different: “ibm” and “hal”. “ibm” has a separate dot somewhere on hte left and an extending line above the text in the middle. Whereas “hal” lines above the text on the left and right.

Therefore, do not use uppercase only unless you can justify the use.

And even with mixed case, only apply it when you apply it consistently. For the same reasons as mentioned before, the human eye does not recognize easily “myBestMethod” and “mybestmethod” (or “my_best_method”) as the same.

Comment Style

I typically use comment possibilities for two unrelated types of activities:

  • Explain what the code should do in business language and/or describe non-obvious motivations to choose a specific implementation approach.
  • Tool for refactoring and debugging code, for instance when analyzing the data for discrepancies between my expectations and the actual outcome.

The comment for the first type of activity is typically long lasting, whereas the second type is temporary. To be able to comment out large parts of the code during the second type of activity, it is easier to use /*…*/ style. However, /*…*/ can not be nested, so for the first type of activity it is better to use another commenting style.

Luckily enough SQL supports also the line-comment mode using ‘–’. Therefore it provides better results when you use ‘–’ for comment describing what the code should do and /*…*/ solely for debugging, refactoring and analysis.

Inline Views

In the two-column layout described above, you can of course still have an inline view in a from-clause, such as:

from   ( select EXPRESSION
         from   TABLE

In these cases, just format the inline view also in two columns. After that, place the formatted inline view in column 2 after the ‘from’, effectively giving you three columns.

Spaces and Tabs

How tempting they may seem, the use of tabs is not wise unless you work in a very constrained environment. Historically, the tab was something found on a typewriter to rapidly advance the wagon. But the spacing varied depending on the typewriter model and make. Some typewriters were flexible, some had hard settings.

As with the transformation from horse-drawn to automobiles, some historical elements remained for ease of transition and the same holds for the tab key. The tab key on some platforms introduces a horizontal movement equal to two normal spaces, some for four and others jump to specific horizontal locations. Such differences still hold.

Therefore: do not use tabs and indent by column or two spaces instead.

Alias Naming Convention

Although there are more possibilities to improve code quality by applying standards, the last one in this version of the document is alias naming convention. At some day, you will be creating your first join, bringing data together from multiple sets/tables into one list. At that moment, you have to rely on your mind to recognize what columns have originated from what table. The SQL engine nicely tries to find out whether there is a unique hit and use the corresponding table. But that is no guarantee for success; the next maintainer is of course less smart than you are (or less nerd) and some other maintainer might add a column to one of the tables involved in the join with an identical column name as now uniquely contained in only one table.

Therefore, you should always add aliases to columns when you use joins, such as ‘pjt.code’ instead of ‘code’.

But how do you construct meaningful aliases? Aliases like ‘t’, ‘a’ are unclear and using the full table name like in ‘ExactOnlineRest.Projects.Projects.code’ makes your code harder to understand.

A practical way is to have the alias reflect the business contents of your data. Most SQL tables already directly stem from the original entities in the ERD like ‘Projects’ or ‘TransactionLines’. To get a meaningful, concise and short alias use the following approach:

  • Take the singular form of the table name (‘Projects’ becomes ‘Project’ and ‘TransactionLines’ becomes ‘TransactionLine’).
  • Use a space instead of camel case or a underscore to break up individual words (‘Project’ become ‘Project’ and ‘TransactionLine’ becomes ‘Transaction Line’).
  • When the result has two or more words: take the first character from the first word, the first character of the second word and the last character of the last word. So ‘Transaction Line’ becomes ‘tle’.
  • When the result has two or more syllables: take the first character of the first syllable, the first character of the second syllable and the last character of the word. So ‘Project’ becomes ‘pjt’.
  • When the result has three or more characters: take the first two characters and the last character. So ‘care’ become ‘cae’.
  • When all else fails: panic and choose something yourself.

Using such aliases enables you and future maintainers to easily understand your SQL code and more easily detects bug and possible issues.