Sometimes I look at things that already exist in the world and my eyes fill up with wonder and curiosity. I look at them and I ask myself: how did that thing come to exist exactly the way it is? Why was it done that way and not differently? I feel as if by answering those questions I get a little bit closer to those things, to the technology they represent and embody. And, by getting closer to the technology, in a way, I also get closer to a part of myself.
It saddnes me that in the context of computer science this point of view made of curiosity and wonder is most often negleted. Reflecting upon my personal experience studying computer science at university, especially during the first years, I felt very alienated by the fact that I had no idea regarding how most of the tools and technology I was using worked internally. Even at work, most of the time one is too busy developing new technology, that no time or resources are given to the simple yet foundamental act of rediscovering and reimplementing technologies that already exist.
Yet, I've always found an intrinsic value in the act of reinvent the wheel, so to speak. Because while it is true that the metaphorical wheel already exist, whatever it represents, it is also true that I did not build it, but someone else did. I believe that in building the wheel one gains much experience, not only about the specific technical work, but about life itself. Now, there are many ways of reinvent the wheel, but only one of them is always helpful, regardless of the context: that which comes out of a desire to learn and to play, and not to be too serious about it.
In this blog post I want to show how to build a wheel that has
already been built many times. A simple yet fascinating wheel that
has to do with base64 encoding and decoding. The wheel will be built
with the C
programming language.
#What is base64?
First of all, what do we mean with "base64"?
When we talk about base64 we're essentially talking about an encoding scheme, which, roughly speaking, is a mechanism that allows us to assign to each character of a particular alphabet a specific number in such a way as to allow the translation from numbers to character and vice versa.
Example: The simplest type of assignment between characters and numbers we could imagine is the following one
\[\begin{split}
A &\longleftrightarrow 00 \longleftrightarrow 00000 \\
B &\longleftrightarrow 01 \longleftrightarrow 00001 \\
C &\longleftrightarrow 02 \longleftrightarrow 00010 \\
& \;\;\;\;\;\;\;\;\;\;\;\vdots \\\
\end{split}\]
The takeway here is that with an encoding we can transform characters into numbers, which can then be memorized in a computers memory as a sequence of bits. Viewed under this light, an encoding is the abstraction needed to make our computers deal with characters.
To specify an encoding one needs to specify two things:
-
The alphabet used.
-
The mapping between the characters in the alphabet and the associated numbers.
The base64 encoding, not so surprisingly, has an alphabet made up of \(64\) characters plus an extra padding character. The characters used are shown below.
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
+/
= (extra padding character)
The specific mapping between characters and numbers can be analyzed in the original standardization of the encoding, which can be found in RFC4648.
In linux we have the base64
binary which allows for encoding and
decoding of arbitrary text.
which base64
/usr/bin/base64echo "Hello World!" | base64
SGVsbG8gV29ybGQhCg==echo "SGVsbG8gV29ybGQhCg==" | base64 -d
Hello World!
Our objective is to implement something similar to this: something that takes in input arbitrary bytes and produces as output the relative base64 encoding of those bytes. We're also interested in the decoding process, which goes from the base64 encoded bytes to the original bytes.
\[\begin{split} \text{Arbitrary Bytes} &\overset{\text{encoding}}{\longrightarrow} \text{Base64 bytes} \\ \\ \text{Arbitrary Bytes} &\overset{\text{decoding}}{\longleftarrow} \text{Base64 bytes} \\ \end{split}\]
#Why Base64?
Having discussed what do we mean exactly by base64, we can now ask the more difficult question: why do we need base64 encoding? What does it bring to the table?
The why questions are always the more important ones, especially before writing code. Why is it that people thought base64 was so needed that it had to be implemented and standardized? What kind of problem does it solve? Understanding the why is about understanding the bigger context in which that particular technology lives.
To understand the importance of base64 it is useful to remember
the concept of control characters, which sometimes are also called
non-printable characters. When we use our keyboard we typically
write out the characters that make up the alphabet of our natural
language, such as the latin characters abcdef...
with their
upper-case counterparts ABCDEFG...
, as well as the arabic numerals
we use to write out numbers 01234...
. These characters are called
printable characters, because typically they are directly printed
by the programs that process them so that we can see them directly
on the screen. Now, along side printable characters there are
other characters, which, as it was mentioned earlier, are called
control charaters. These other kind of characters are special in
the sense that they do not represent any common written symbol,
but rather they are used to change the behavior of the program
that receives them.
The simplest example of a control character is the line feed,
which in most programming languages is represented with the
sequence of characters "\n
". The actual byte value of the newline
is the hexadecimal byte 0x0a
, which in decimal is the number
10
. If we have a textual program, for example a program that
handles the graphics of a terminal, and such program reads the
byte 0x0a
, then it will not print a specific character on the
screen. Rather, it will insert a newline, so that later characters
will be printed on the next line of the prompt.
This is exactly what happens with the echo
command. When we
execute something with echo, the last character that is printed by
the program is a newline character (0x0a
) and this is why that is
printed immediately after is printend in a newline, on its
own. This behavior can be analyzed with hexdump
.
echo "Hello World!"
Hello World!echo "Hello World!" | hexdump -C
00000000 48 65 6c 6c 6f 20 57 6f 72 6c 64 21 0a |Hello World!.| 0000000d
If we instead do not want a newline to be printed at the end, we
can use the flag -n
. If we once again anylze the bytes printed by
the command, we can see that now there is no newline. This means
that the terminal program will print the characters of the new
prompt immediately after the !
character.
echo -n "Hello World!" | hexdump -C
00000000 48 65 6c 6c 6f 20 57 6f 72 6c 64 21 |Hello World!|
0000000c
Now, remember the output alphabet used by base64. It is made up of all characters which, in the basic ascii encoding, are all printable characters. And now we're able to answer the question: why base64?
The idea behind base64 encoding is to take an arbitrary stream of bytes, which could contain control characters and normal non-control characters, and convert such sequence into a sequence made up of only normal non-control characters. This encoding process has to be done without losing any information about the original sequence of bytes, so that we can always go back and decode to obtain the original bytes.
The value behind base64 encoding is that it allows the transfering of arbitrary data without worrying about how the program that is receving the data will process it. In particular, by encoding all characters using printable, non control characters, we make sure that the receving program will not interpret the incoming data as control characters and therefore it will not change its behavior in unpexpected ways. Once the data has been received, the program can then choose what to do about it.
Having introduced the proper what/s and the proper why/s we're now able to describe the two main procedures we need to implement:
-
The base64 encoding process, which goes from arbitrary bytes to base64 bytes.
\[\text{Base64Encode}(\text{RawData}) \longrightarrow \text{Base64Data}\]
-
The base64 decoding process, which goes from base64 bytes to the original bytes.
\[\text{Base64Decode}(\text{Base64Data}) \longrightarrow \text{RawData}\]
These two produces are related in the sense that whathever it is encoded by the encoding procedure, must be correctly decoded by the decoding procedure in order to obtain the original bytes.
\[\text{Base64Decode}(\text{Base64Encode}(\text{RawData})) \longrightarrow \text{RawData}\]
#base64_encode()
Let us now describe the encoding procedure. Before showing the
actual code we will describe how to manually encode the string
"Hello World!
" in base64.
To start off, remember that the ascii encoding tells us that each character of such sequence is a byte, and each byte is a number between \(0\) and \(255\). These numbers can be represented in various notation, such as hexadecimal notation or binary notation.
H -> 0x48 -> 01001000
e -> 0x65 -> 01100101
l -> 0x6C -> 01101100
l -> 0x6C -> 01101100
o -> 0x6F -> 01101111
-> 0x20 -> 00100000
W -> 0x57 -> 01010111
o -> 0x6F -> 01101111
r -> 0x72 -> 01110010
l -> 0x6C -> 01101100
d -> 0x64 -> 01100100
! -> 0x21 -> 00100001
Now, the main idea behind encoding in base64 is to consider the sequence of characters as a stream of bytes, each of which is made up of \(8\) bits. We want to split up this sequence in \(6\) bits blocks instead of the original \(8\) bit blocks. This means, essentially, that every \(3\) blocks of \(8\) bits we will have \(4\) blocks of \(6\) bits, since
\[3 \cdot 8 = 24 = 4 \cdot 6\]
Consider the first three bytes.
01001000 01100101 01101100
Let's split them in four blocks of six bits each.
010010 000110 010101 101100
The split is done simply by going from left to right and taking each time blocks of six bits. Sometimes we have to append bits taken from different bytes, but we always do it from left to right. And now, once we have all blocks of \(6\) bits, we extract from each of these block a character by taking the numerical value (written in base \(10\)) of each block and by using the previously shown base64 encoding table. In particular we have
010010 -> 18 -> S
000110 -> 6 -> G
010101 -> 21 -> V
101100 -> 44 -> s
And this means that the first four characters in the base64
encoding of our string will be SGVs
. We can now decode the rest.
-
First, we consider all the remaining sequence of eight bit blocks
01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100 00100001
-
Then, for each three block of eight bit each we transform them into four block of six bit each.
01101100 01101111 00100000 -> 011011 000110 111100 100000 01010111 01101111 01110010 -> 010101 110110 111101 110010 01101100 01100100 00100001 -> 011011 000110 010000 100001
-
Then we transform each six bit block into character by using the base64 encoding table
011011 -> 27 -> b 000110 -> 6 -> G 111100 -> 60 -> 8 100000 -> 32 -> g 010101 -> 21 -> V 110110 -> 54 -> 2 111101 -> 61 -> 9 110010 -> 50 -> y 011011 -> 27 -> b 000110 -> 6 -> G 010000 -> 16 -> Q 100001 -> 33 -> h
Putting it all together, we obtain the following SGVsbG8gV29ybGQh
.
\[\text{Hello World!} \longrightarrow \text{SGVsbG8gV29ybGQh}\]
which the same thing we obtain using the official binary base64
echo -n "Hello World!" | base64
SGVsbG8gV29ybGQh
The procedure we described works perfectly everytime we have a number of characters which is an even multiple of \(3\). In this case our string had \(12 = 3 \cdot 4\) characters, which is why in the end we obtained an encoded string made up of \(16\) characters. What happens when this is not the case? What happens when we want to encode a string who's length is not an even multiple of \(3\)?
Notice we have two cases left to handle, depending if the length of our string, when divided by \(3\), has a remainder of \(1\) or \(2\). These cases will be called padding cases.
-
Padding case n.1: This case happens when we have a remainder of \(1\). The simplest case of this example is when we want to encode the string "
a
". What we do in this case is that we use the first byte to extract two blocks of six bit each, which will represent two characters of the encoded base64 output, while the remaining two characters are filled with the padding character=
.a -> 01100001 -> 011000 01 -> 011000 -> 24 -> Y 010000 -> 16 -> Q
Notice the way we use the final 2 bits of the initial byte. We use them to fill a block of six bit each by filling the remaining bits as zero bits. At the end the output we get is
YQ==
as the base64 encoded output.echo -n "a" | base64
YQ== -
Padding case n.2: This case is similar to the previous one, but this time we're able to use more bytes of the original input and we only need to use a single padding character
=
. The simplest case is when we want to encode the string "aa
".aa -> 01100001 01100001 -> 011000 010110 0001 -> 011000 -> 24 -> Y 010110 -> 22 -> W 000100 -> 4 -> E
At the end we get
YWE=
as the base64 encoded output.echo -n "aa" | base64
YWE=
Having done a manual example, which is always important to do the first time around, we're now able to write the code of interest. In particular we will write a function with the following signature
void base64_encode(const char *in, const unsigned long in_len, char *out);
We are given three arguments in input, the in
pointer, which points
to the input string, the arbitrary sequence of bytes we want to
encode, the in_len
which represents the length of the input, and,
finally, the out
pointer, which points to a memory region which we
assume has already been allocated with the enough length in order
to contain the entire base64 of the input.
We will also use the following string, which encodes the base64 conversion table, going from numbers to characters in the base64 charset.
char *BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
As we can see, when we do BASE64[x]
we're getting the character
corresponding to the numerical value x
. Of course we have that 0 <= x <= 63
.
The implementation then is pretty straight-forward. The only thing that could take a bit of time to understand is the usage of the bit and shift operators.
void base64_encode(const char *in, const unsigned long in_len, char *out) {
int in_index = 0;
int out_index = 0;
while (in_index < in_len) {
// process group of 24 bit
// first 6-bit
out[out_index++] = BASE64[ (in[in_index] & 0xFC) >> 2 ];
if ((in_index + 1) == in_len) {
// padding case n.1
//
// Remaining bits to process are the right-most 2 bit of on the
// last byte of input. we also need to add two bytes of padding
out[out_index++] = BASE64[ ((in[in_index] & 0x3) << 4) ];
out[out_index++] = '=';
out[out_index++] = '=';
break;
}
// second 6-bit
out[out_index++] = BASE64[ ((in[in_index] & 0x3) << 4) | ((in[in_index+1] & 0xF0) >> 4) ];
if ((in_index + 2) == in_len) {
// padding case n.2
//
// Remaining bits to process are the right most 4 bit on the
// last byte of input. We also need to add a single byte of
// padding.
out[out_index++] = BASE64[ ((in[in_index + 1] & 0xF) << 2) ];
out[out_index++] = '=';
break;
}
// third 6-bit
out[out_index++] = BASE64[ ((in[in_index + 1] & 0xF) << 2) | ((in[in_index + 2] & 0xC0) >> 6) ];
// fourth 6-bit
out[out_index++] = BASE64[ in[in_index + 2] & 0x3F ];
in_index += 3;
}
out[out_index] = '\0';
return;
}
Let's consider for example how we compute the second block of six bits. The relative code is the following
// second 6-bit
out[out_index++] = BASE64[ ((in[in_index] & 0x3) << 4) | ((in[in_index+1] & 0xF0) >> 4) ];
In this line we're encoding the operation of taking the two right
most bit of the first byte and concatenating them with the four
left most bit of the second byte. The masks here are used to
extract these sequence of bits from the original number, and the
shift operations are used to place correctly the bits. Finally, the
OR operation is used to combine the two sequences of bits in order
to obtain the final and correct value. This value is then used as
an index into the BASE64
array, which gives us the corresponding
base64 character.
Consider the following two bytes
in[in_index] -> 01001000
in[in_index + 1] -> 01100101
We can break the previous computation into single steps as follows
(in[in_index] & 0x3) -> 01001000 & 00000011 -> 00000000
(in[in_index] & 0x3) << 4 -> 00000000 << 4 -> 00000000
(in[in_index+1] & 0xF0) -> 01100101 & 11110000 -> 01100000
(in[in_index+1] & 0xF0) >> 4 -> 01100000 >> 4 -> 00000110
Putting it all together we get
((in[in_index] & 0x3) << 4) | ((in[in_index+1] & 0xF0) >> 4)
-> 00000000 | 00000110
-> 00000110
-> 6
And therefore we obtain
BASE64[ 6 ] -> G
#base64_decode()
Having discussed the encoding procedure, we can now discuss the decoding procedure. This time we start from a sequence of bytes which have been obtained by following the base64 encoding procedure, and we want to extract the original sequence of bytes. As before, I will first show how to execute manually the algorithm, and then I will show the relative code.
Consider the string "SGVsbG8gV29ybGQh
" we obtained in the previous
step. We now want to decode it and obtain the original string.
To start off, there is already something we can immediately to in order to check if the string we're trying to decode is a valid base64 string. The idea is that the length of all valid base64 sequences must be a multiple of \(4\). In our case we have
\[\text{Length}(\text{SGVsbG8gV29ybGQh}) = 16\]
First we consider the ascii encoding for all characters of the base64 string. The idea behind base64 decoding is to start from the ascii encoding value of the various base64 characters and reconstruct the relative six bit value that was computed during the encoding process. This can be done by connecting the numbers between the ascii encoding table and the base64 encoding table.
For example, consider the character S
. In ascii encoding S
is
mapped to the hexadecimal number 0x53
, which in decimal is 83
. In
the base64 encoding table however S
is mapped to the number
18
. This creates the following mapping between 0x53
and 18
.
(ascii) (base64)
0x53 <---------> S <---------> 18
We do this for all characters of the original string in order to obtain the relative six bit value that was found during the encoding process.
S -> 0x53 -> 18 -> 010010
G -> 0x47 -> 6 -> 000110
V -> 0x56 -> 21 -> 010101
s -> 0x73 -> 44 -> 101100
b -> 0x62 -> 27 -> 011011
G -> 0x47 -> 6 -> 000110
8 -> 0x38 -> 60 -> 111100
g -> 0x67 -> 32 -> 100000
V -> 0x56 -> 21 -> 010101
2 -> 0x32 -> 54 -> 110110
9 -> 0x39 -> 61 -> 111101
y -> 0x79 -> 50 -> 110010
b -> 0x62 -> 27 -> 011011
G -> 0x47 -> 6 -> 000110
Q -> 0x51 -> 16 -> 010000
h -> 0x68 -> 33 -> 100001
Now we can combine sequences groups of four six bit blocks into sequences of three eight bit blocks.
010010 000110 010101 101100 -> 010010 00|0110 0101|01 101100
-> 01001000 01100101 01101100
And from each of these byte we can use the ascii encoding to get back the original value.
01001000 -> 72 -> H
01100101 -> 101 -> e
01101100 -> 108 -> l
The first three bytes of the original output are Hel
. We can do
the same procedure for all remaining bytes.
011011 000110 111100 100000 -> 011011 00|0110 1111|00 100000
->
-> 01101100 -> l
01101111 -> o
00100000 -> ' '
010101 110110 111101 110010 -> 010101 11|0110 1111|01 110010
->
-> 01010111 -> W
01101111 -> o
01110010 -> r
011011 000110 010000 100001 -> 011011 00|0110 0100|00 100001
->
-> 01101100 -> l
01100100 -> d
00100001 -> !
At the end we obtain the string "Hello World!
", which is the
correct one.
\[\text{SGVsbG8gV29ybGQh} \longrightarrow \text{Hello World!}\]
In terms of code, we define the following signature.
int base64_decode(const char *in, const unsigned long in_len, char *out);
The meaning of the various arguments is essentially the same as the encoding procedure, only that now the input is a base64 encoded string, while the output can be an arbitrary sequence of bytes. Once again, we assume that the output has already been allocated with the proper length.
To actually implement the decoding procedure we need to define a translation table which allows us to obtain the base64 numerical code of the various characters, that is, the numerical value of the various characters defined in the base64 table. The table is defined as follows
// This table implements the mapping from 8-bit ascii value to 6-bit
// base64 value and it is used during the base64 decoding
// process. Since not all 8-bit values are used, some of them are
// mapped to -1, meaning that there is no 6-bit value associated with
// that 8-bit value.
//
int UNBASE64[] = {
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0-11
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 12-23
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 24-35
-1, -1, -1, -1, -1, -1, -1, 62, -1, -1, -1, 63, // 36-47
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -2, // 48-59
-1, 0, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, // 60-71
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, // 72-83
19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, -1, // 84-95
-1, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, // 96-107
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, // 108-119
49, 50, 51, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 120-131
};
We have for example that UNBASE64[84] = 19
because in ascii
encoding the decimal value 84
represent the character T
, which, in
the base64 table, is mapped to the number 19
.
We are now ready too see the entire code for the decoding process.
int base64_decode(const char *in, const unsigned long in_len, char *out) {
int in_index = 0;
int out_index = 0;
char first, second, third, fourth;
assert(!(in_len & 0x03)); // input must be even multiple of 4
while( in_index < in_len ) {
// check if next 4 byte of input is valid base64
for (int i = 0; i < 4; i++) {
if (((int)in[in_index + i] > 131) || UNBASE64[ (int) in[in_index + i] ] == -1) {
fprintf(stderr, "Invalid base64 char, cannot decode: %c\n", in[in_index + i]);
return -1;
}
}
// extract all bits and reconstruct original bytes
first = UNBASE64[ (int) in[in_index] ];
second = UNBASE64[ (int) in[in_index + 1] ];
third = UNBASE64[ (int) in[in_index + 2] ];
fourth = UNBASE64[ (int) in[in_index + 3] ];
// reconstruct first byte
out[out_index++] = (first << 2) | ((second & 0x30) >> 4);
// reconstruct second byte
if (in[in_index + 2] != '=') {
out[out_index++] = ((second & 0xF) << 4) | ((third & 0x3C) >> 2);
}
// reconstruct third byte
if (in[in_index + 3] != '=') {
out[out_index++] = ((third & 0x3) << 6) | fourth;
}
in_index += 4;
}
out[out_index] = '\0';
return 0;
}
The workings of the code should now be clear. First we check if the input is an even multiple of \(4\), and then we iterate over all the blocks. At each iteration we transform a consecutive sequence of four blocks into three bytes of the output. To perform this transformation we first unbase the values to obtain the actual six bit values, and then we combine the six bit values in order to obtain the original eight bit values.
#Conclusion
Not much to say anymore, other than I hope it was an interesting and fun read. Have fun in your projects, and do not be afraid of trying to "reinvent the wheel", because learning something new is always a magical thing, regardless of how many times in the past it has already been done by other humans.