More about encodings
To read about encodings in your code editor, consult this. You may also want to read this classic by Joel Spolsky. And if you really want to dive deep into the subject, you have to read Code by Charles Petzold.
Interpretation / Representation
Deep down, there are no such things as text or numbers, only “0s & 1s”. Programming languages free us from thinking with “0s & 1s” by providing syntaxes and basic functions for text, numbers and all kinds of data structures. But we still need to know about encodings in situations like data transmission or data storage.
In such situations, we have control over the data at its most fundamental level, the level of electrical signals. There are only two states: no current or some current. An encoding is a two-column table “Interpretation / Representation”, just like the Morse code. At that level, the term “code” is synonymous with “representation”.
The simplest table we can make is this one:
Interpretation | Representation (or code) | Physical reality |
---|---|---|
false | 0 | Barely any current |
true | 1 | Some noticeable current |
Representing false and true is all it takes to exhaust all the codes given by one binary signal. Such a signal is called a “bit”, which is actually, as Wikipedia tells it, a portmanteau of “binary digit”.
Bits = pieces
Personally, I always think about the term “bit” as really saying “piece”. First because the word “signal” seems to imply that a binary signal may convey rich meaning, when in fact there are only two codes. Second because, in practice, two codes is so severe a limitation that we need to piece together several morsels —or bits— to multiply the size of the table.
Each bit we add to the mix multiplies the total of codes by 2. For instance, with two bits, we have four codes, making it possible to represent all the suits of playing cards:
Interpretation | Code |
---|---|
club | 00 |
diamond | 01 |
heart | 10 |
spade | 11 |
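To make this concrete, here is a minimal sketch in JavaScript (the `encode` and `decode` tables, and their names, are mine, purely for illustration): it pairs each suit with its 2-bit code and derives the reverse table for decoding.

```js
// Encoding table: interpretation -> code (2-bit codes written as strings for readability).
const encode = { club: "00", diamond: "01", heart: "10", spade: "11" };

// Decoding table: code -> interpretation, obtained by swapping the two columns.
const decode = Object.fromEntries(
  Object.entries(encode).map(([interpretation, code]) => [code, interpretation])
);

console.log(encode.heart); // "10"
console.log(decode["01"]); // "diamond"
```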
It's that easy, and yet there are already many remarks to be made: about encodings in general, about how we pair interpretation and code, about how an encoding uses its codes, and about arbitrary decisions.
General remarks
- You have to know which encoding to use as decoding always works in essence.
This applies to every situation where you are exposed to encodings. The most common one is reading or writing a textual stream (a file, an HTTP connection, etc). But it also applies to your code editor since, after all, it is for reading and writing code files (by the way, you should use UTF8).
In essence, what you are working on is a sequence of bits. Why would decoding fail? It's just about assigning a specific meaning to a specific code. For instance, set the wrong encoding in your code editor. It won't break. It will display something. Whether the initial meaning is preserved is another question.
In practice, decoding can fail. As we will see, some encodings are more sophisticated than a two-columns table. Some codes can be forbidden, preventing the recovery of information if they appear in the binary data. Which will probably happen if you pick the wrong encoding.
What is important here is that decoding cannot fail in a meaningful way and tell you: “Oh, I think you should use another encoding”. Decoding is all about recovering the meaning: if you use the wrong encoding, you're wrong, end of story.
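To see it in action, here is a quick sketch with the NodeJS `Buffer` (the byte values are just an example): the same two bytes decode without any error under two different encodings, but only one decoding recovers the intended meaning.

```js
// Two bytes that happen to be the UTF8 code of "é".
const bytes = Buffer.from([0xC3, 0xA9]);

console.log(bytes.toString("utf8"));   // "é"  (the intended interpretation)
console.log(bytes.toString("latin1")); // "Ã©" (wrong encoding: no failure, wrong meaning)
```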
- You should not necessarily trust what you see to guess the encoding.
This is because two different encodings may have pairs in common. It is not a pedantic corner case: encodings have been used since day one, but the technical limitations of the old days prevented us from having many bits, and thus the number of pairs was limited.
With technical limitations being pushed back as we progress, common pairs are actually both likely and natural: we just extend an existing encoding, especially for backward-compatibility. That's exactly our situation: the once widely used ASCII encoding has been extended by the now widely used UTF8.
In the days of ASCII, computers would usually combine about 8 bits together, for a total of 256 codes. You won't encode many alphabets with this limited amount. Today, we easily combine 32 bits for nearly 4.3 billion codes.
In practice, it is safe to decode ASCII data by using UTF8 because all the pairs used in ASCII have been intentionally preserved in UTF8. The reverse, decoding with ASCII some binary data that was saved according to the UTF8 pairs, will only work on the common pairs, as the rest of the pairs are exclusive to UTF8, i.e. not covered by ASCII.
But what if this “rest of the pairs” is never used in the binary data? Then decoding with ASCII will work fine. This sounds contrived but that's the case with this very web page. Apart from some exceptional characters, you will recover close to 100% of the content if you decode it with ASCII.
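Here is that claim checked with the NodeJS `Buffer` (a sketch; the sample strings are made up): pure ASCII text survives a UTF8-encode / ASCII-decode round trip, while a string using a UTF8-exclusive pair does not.

```js
// Every pair used here is common to ASCII and UTF8: the round trip preserves the meaning.
const safe = Buffer.from("Hello, world", "utf8").toString("ascii");
console.log(safe); // "Hello, world"

// "é" only exists in the UTF8-exclusive pairs: decoding with ASCII mangles it.
const mangled = Buffer.from("héllo", "utf8").toString("ascii");
console.log(mangled === "héllo"); // false
```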
- A programming language uses several encodings but they are mostly out of your way and reach.
One for booleans, a default one for strings and at least one for numbers (usually several though, to have control over the size of numbers). Most languages won't let you mess with these encodings. But they will offer you direct control of the encoding for strings, for situations like writing to a file or receiving text from another program.
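With NodeJS for instance, that control point is typically an option at the boundary. A minimal sketch (the file name is made up):

```js
const fs = require("fs");

// Same encoding on both sides of the boundary, so the meaning is preserved.
fs.writeFileSync("example.txt", "héllo", { encoding: "utf8" });
const text = fs.readFileSync("example.txt", { encoding: "utf8" });
console.log(text); // "héllo"
```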
- The number of pairs can grow fast and so does the number of required bits.
A good example is the Latin alphabet: you need 26 pairs. And so 5 bits, since 2⁵ = 32 is enough while 2⁴ = 16 is not. But you want to represent text and thus you also need to represent the ten digits: 0, 1, 2… 8, 9. Since 26+10 = 36, you are over 32 and so you need a sixth bit, raising the total of codes to 64 (= 32×2 = 2⁶).
You may think you now have some room but then you realize that each letter is needed twice: in lowercase and in uppercase, bringing the total to 62 pairs. And since you also need whitespace characters, you add the most essential ones: the space and the newline, in order to tell words and paragraphs apart.
You now have reached the total of 64 characters and that's when you notice that you have no punctuation... So you need at least 7 bits. Of course, if you then want to add other alphabets, you would need much more than just 7 bits. See UTF8 for an example of a solution.
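If you want to double-check those counts, the number of bits needed for n pairs is ⌈log₂(n)⌉. A quick sketch (the helper name is mine):

```js
// Bits needed to get at least n distinct codes.
const bitsFor = (n) => Math.ceil(Math.log2(n));

console.log(bitsFor(26)); // 5: the Latin letters alone
console.log(bitsFor(36)); // 6: letters + the ten digits
console.log(bitsFor(64)); // 6: both cases + digits + space + newline, exactly full
console.log(bitsFor(65)); // 7: a single punctuation mark pushes us to 7 bits
```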
Remarks on pairing
- The pairing is arbitrary so one is free to choose any pairing. But once chosen, the pairing cannot change without migrating all the data.
The goal of an encoding is to move from one column to the other and vice-versa. Encoding data means going from the first column to the second. Decoding is the reverse. The point is to get back the same information as before the encoding. To use a different pairing, we have to migrate the data from the old representation to the new.
- There cannot be any ambiguity: an interpretation should not have several codes and a code should not point back to several interpretations.
Any ambiguity breaks the encoding/decoding symmetry and prevents us from being confident that we did get back the same information. That said, even with no ambiguity, the interpretation may be redundant with another one.
A good example is `-0`. A way to encode numbers is the sign-magnitude way: the sign (positive or negative) is given by the first bit, the rest of the bits represent the value. So, calling `c` the rest of the bits representing the number 0 (which is often all the bits at zero), we have the code `0c` of interpretation `+0` and the code `1c` of interpretation `-0`. They are redundant as they make no arithmetical difference. The redundancy is often managed in the language but you may have to deal with it yourself. In Javascript for instance, there is the same-value-zero equality but it's internal so, if you want that equality for yourself, you have to code it.
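Coding it yourself is a one-liner. A sketch (the name `sameValueZero` is mine; the language only uses the algorithm internally, for instance in `Array.prototype.includes`, `Map` and `Set`):

```js
console.log(0 === -0);         // true:  strict equality ignores the sign of zero
console.log(Object.is(0, -0)); // false: Object.is tells the two codes apart

// Same-value-zero: like ===, except that NaN equals NaN (and 0 still equals -0).
const sameValueZero = (x, y) => x === y || (Number.isNaN(x) && Number.isNaN(y));

console.log(sameValueZero(0, -0));    // true
console.log(sameValueZero(NaN, NaN)); // true
```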
- As the pairing is arbitrary, one can pick the properties one wants.
Considering again the encoding of suits, we can think of several representations:
Interpretation | Code A | Code B |
---|---|---|
club | 00 | 00 |
diamond | 01 | 10 |
heart | 10 | 11 |
spade | 11 | 01 |

Code A is simply done: sort the suits of the “Interpretation” column lexicographically and assign binary codes starting from 00 and counting up. Basically, code A is easy for the designer and no care is given to how it can be used.
Code B is a bit smarter: the codes of all black suits start with a `0` and the codes of all red suits start with a `1`. So a program can efficiently check the color, without verifying the full code.
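As a sketch (the constant names are mine), checking the color with code B comes down to a single bit test:

```js
// Code B from the table above, as 2-bit numbers.
const CODE_B = { club: 0b00, spade: 0b01, diamond: 0b10, heart: 0b11 };

// The first (most significant) bit alone gives the color: 0 is black, 1 is red.
const isRed = (code) => (code >> 1) === 1;

console.log(isRed(CODE_B.club));    // false
console.log(isRed(CODE_B.diamond)); // true
```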
Remarks on using codes
- Nothing mandates an encoding to use all the codes.
For instance, representing traffic lights requires two bits, as one bit offers only two codes and yet there are 3 colors: green, orange, red. But since two bits offer four codes, one code is left unused.
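A sketch of that case (the names are mine): three interpretations over two bits, with the fourth code simply left unpaired.

```js
// Only three of the four 2-bit codes are paired; 0b11 is simply not used.
const LIGHT = { green: 0b00, orange: 0b01, red: 0b10 };
console.log(Object.keys(LIGHT).length); // 3 interpretations for 4 possible codes
```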
A real world example is ASCII, where the designers intentionally left out some codes for later needs, instead of forcing themselves to fill these extra codes.
- An encoding can forbid some codes.
Such is the case of UTF8. It stems from the fact that UTF8 codes have several lengths: 8, 16, 24 or 32 bits. The bits are organized along some patterns depending on the intended length. Codes that do not follow the patterns are not valid UTF8 and are thus forbidden.
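You can observe a forbidden code with the standard `TextDecoder` in strict mode (a sketch; the byte values are just an example):

```js
const strict = new TextDecoder("utf-8", { fatal: true });

// 0xC3 announces a 16-bit code whose second byte must follow the 10xxxxxx pattern.
console.log(strict.decode(Buffer.from([0xC3, 0xA9]))); // "é": a valid pattern

try {
  strict.decode(Buffer.from([0xC3, 0x28])); // 0x28 breaks the pattern: forbidden code
} catch (error) {
  console.log(error instanceof TypeError); // true: decoding fails, but not meaningfully
}
```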
- Some codes may be unused for a purely technical reason, rather than by design.
This is the case of ASCII. At one point, all the hardware components of the majority of computers managed 8 bits or more. So one more bit than needed by ASCII. The first bit was usually set to `0` and the 7 remaining bits held the ASCII code. ASCII was even extended to take advantage of this 8th bit, with the Extended ASCII encoding.
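A quick illustration (a sketch): the ASCII code of “A” fits in 7 bits, so its 8-bit byte starts with a 0.

```js
const code = "A".charCodeAt(0);

console.log(code);                              // 65, below 128 like every ASCII code
console.log(code.toString(2).padStart(8, "0")); // "01000001": the 8th bit is the leading 0
```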
Remarks on arbitrary decisions
The question "which one is the first?" is an important question with arbitrary answers. There are two parts to this question:
- what is the context?
- are we talking about the first bit or about the first byte?
The two classic contexts are transmission and storage. The way I see it is that the context drives how we can forward the data. For transmission, we read the data part by part and so we often forward each part just as it is read. The order of reading is then preserved:
The question "which one is the first?" is an important question with arbitrary answers. The two classic contexts are transmission, where the order generally is preserved:
Considered | | Forwarded |
---|---|---|
A B C D | ⇒ | A B C D |
For storage, the data is often considered as a whole block that is pushed to storage. So, when pushing on the left, the first part to be forwarded is the right-most one, then the second right-most and so on. Thus, the order gets reversed:
Considered | | Forwarded |
---|---|---|
A B C D | ⇒ | D C B A |
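With the NodeJS `Buffer` (a sketch, with example byte values), the same four bytes read under both conventions give two different numbers:

```js
const bytes = Buffer.from([0x0A, 0x0B, 0x0C, 0x0D]);

console.log(bytes.readUInt32BE(0).toString(16)); // "a0b0c0d": big-endian reads A B C D
console.log(bytes.readUInt32LE(0).toString(16)); // "d0c0b0a": little-endian reads D C B A
```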
There are many ways to phrase all this so you should find one that works for you. To read more about it, look for the term “endianness”. It is a reference to Gulliver's Travels in which the author, Jonathan Swift, mocks such situations in politics and religion.
Basically, you face a problem with different answers, each presenting a decisive trait but none having better technical merit. So the issue is not that you lack a solution: all the answers have the power to solve the problem. The issue is that this is strictly all they do: none of them stands above the others by any objective criterion.
While everybody could just settle on one common answer and call it a day, what can and does happen is that everybody settles on the solution they prefer, and overplays its merits. Programming presents many such occasions, the classic one being programming languages themselves.
Another example is the byte. It is a special unit used to represent the minimal number of bits that a computer manages… somewhere. Preference can be given to memory or to instructions. Historically, the pragmatic concern of representing text was the decisive trait. Nowadays, a byte is 8 bits, by “evolutionary determination”.
I think it is the result of a two-step process. First, the value took hold by design and technical convergence: at least 7 bits are needed to represent Latin-based text, and 8-bit architectures came to be the dominant design. Then, “8 bits” was referenced so much that it became interchangeable with the original definition (representing a character).
As hardware progressed, the value never got updated to keep on par with the new hardware abilities. Updating it probably did not make any sense, because of competing character encodings, because no architecture was dominant enough to be deemed standard, and because of backward-compatibility issues.
Nowadays, I think the last point is the most important one, as UTF8 dominates character encodings, as architectures are overwhelmingly 32/64 bits and as we have piled layer upon layer of programs that we cannot safely change. That is the second step: natural petrification.
It happened for endianness too. As one answer on Stack Overflow states it: “endianness is always about bytes order not bits”. This is in direct opposition to what Danny Cohen said when he popularized the term “endianness”. But forty years separate the two statements.
It is important to understand that while your problem may have a clear and widespread solution, it is just a de-facto standard. The main problem is that documenting it can easily be overlooked. I recently had this issue when using the Buffer from NodeJS: it is never clearly stated that bytes are 8 bits long and that endianness is about bytes.
The lesson for me is two-fold. When using some code, check the documentation and write proper tests. When writing some code which embodies such arbitrary considerations, document it clearly and don't make too much of a fuss about its technical merits. In any case, do not gloss over the actual characteristics: their being well-known does not make them any less arbitrary.
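For the Buffer case, such tests can be as blunt as pinning down the assumptions (a sketch using the standard `assert` module):

```js
const assert = require("assert");

// Pin down the arbitrary choices the rest of the code relies on.
assert.strictEqual(Buffer.from([0x12, 0x34]).readUInt16BE(0), 0x1234); // endianness is about bytes
assert.strictEqual(Buffer.from([0x12, 0x34]).readUInt16LE(0), 0x3412);
assert.strictEqual(Buffer.from("A", "ascii").length, 1);               // one 8-bit byte per ASCII character
```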