Adventures in Cryptography with Python – Base64

Base64 is a binary-to-text encoding rather than an encryption technique, but I thought it made sense to cover it in this series because it is so widely used, especially for transmitting data over the wire. The reason is that its alphabet uses only printable characters that are common to most character encodings, so encoded data survives channels that were designed for text.

Here is the Base64 index table:

Index Char | Index Char | Index Char | Index Char
    0 A    |    16 Q    |    32 g    |    48 w
    1 B    |    17 R    |    33 h    |    49 x
    2 C    |    18 S    |    34 i    |    50 y
    3 D    |    19 T    |    35 j    |    51 z
    4 E    |    20 U    |    36 k    |    52 0
    5 F    |    21 V    |    37 l    |    53 1
    6 G    |    22 W    |    38 m    |    54 2
    7 H    |    23 X    |    39 n    |    55 3
    8 I    |    24 Y    |    40 o    |    56 4
    9 J    |    25 Z    |    41 p    |    57 5
   10 K    |    26 a    |    42 q    |    58 6
   11 L    |    27 b    |    43 r    |    59 7
   12 M    |    28 c    |    44 s    |    60 8
   13 N    |    29 d    |    45 t    |    61 9
   14 O    |    30 e    |    46 u    |    62 +
   15 P    |    31 f    |    47 v    |    63 /

 

A string is converted to Base64 by taking the 8-bit binary representation of each character, joining the bits together, and slicing them into 6-bit units, since the Base64 alphabet has 2^6 = 64 symbols (indexes 0 to 63). Each 6-bit value is then looked up in the index table above. Let's take the string Sun as an example and see how it is represented in Base64:

 

Text          |     S    |     u     |     n       |
ASCII Code    |    083   |    117    |    110      |
Binary        | 01010011 |  01110101 |  01101110   |
6-bit         | 010100 | 110111 | 010101 | 101110  |
Base64 Index  |   20   |    55  |   21   |   46    |
Base64 encoded|    U   |    3   |   V    |    u    |
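
The slicing in the table can also be reproduced by hand in a few lines. This is only a sketch, written for Python 3: the names ALPHABET and encode_by_hand are made up for illustration, and it deliberately handles only inputs whose length is a multiple of 3 so that no padding is needed.

```python
# Base64 alphabet in index order (matches the index table above).
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def encode_by_hand(text):
    """Encode ASCII text whose length is a multiple of 3 (no padding)."""
    data = text.encode("ascii")
    if len(data) % 3 != 0:
        raise ValueError("this sketch does not handle padding")
    # Join the 8-bit binary form of every byte into one long bit string.
    bits = "".join(format(byte, "08b") for byte in data)
    # Slice into 6-bit units and look each one up in the alphabet.
    groups = [bits[i:i + 6] for i in range(0, len(bits), 6)]
    return groups, "".join(ALPHABET[int(group, 2)] for group in groups)


groups, encoded = encode_by_hand("Sun")
print(groups)   # ['010100', '110111', '010101', '101110']
print(encoded)  # U3Vu
```

The four 6-bit groups are the same ones shown in the table, and their integer values (20, 55, 21, 46) are the Base64 indexes.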

We can verify this by converting the string with Python (the "base64" string codec used here exists only in Python 2):

>>> "Sun".encode("base64")
'U3Vu\n'

 

The newline character that we see at the end of the output is added by the codec and is ignored when decoding. Whether we decode the string with or without the newline, we still get the same string back:

>>> "U3Vu\n".decode("base64")
'Sun'
>>> "U3Vu".decode("base64")
'Sun'
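
For readers on Python 3, where the string codec used above is not available, roughly the same checks can be done with the standard base64 module. This is a sketch of an equivalent, not the code from this post; note that the module works on bytes rather than str.

```python
import base64

# encodebytes mimics the legacy codec and appends a trailing newline.
print(base64.encodebytes(b"Sun"))  # b'U3Vu\n'
# b64encode returns the encoded bytes without a newline.
print(base64.b64encode(b"Sun"))    # b'U3Vu'

# By default b64decode discards characters outside the Base64 alphabet,
# so the trailing newline makes no difference.
with_newline = base64.b64decode(b"U3Vu\n")
without_newline = base64.b64decode(b"U3Vu")
print(with_newline, without_newline)  # b'Sun' b'Sun'
```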

 

The number of characters in the output has to be a multiple of 4. If it is not, one or two “=” characters are appended to make it so. For example, when we convert Earth to Base64 we see this in action:

>>> "Earth".encode("base64")
'RWFydGg=\n'
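
The amount of padding follows from the input length: every 3 input bytes become 4 output characters, so 1 leftover byte needs two “=” and 2 leftover bytes need one. A small Python 3 sketch of that arithmetic, where padding_needed is a hypothetical helper name and Moon is just an extra example word:

```python
import base64


def padding_needed(byte_count):
    """Number of "=" characters Base64 appends for an input of this size."""
    return (3 - byte_count % 3) % 3


for word in ("Sun", "Moon", "Earth"):
    encoded = base64.b64encode(word.encode("ascii")).decode("ascii")
    print(word, encoded, padding_needed(len(word)))
# Sun U3Vu 0
# Moon TW9vbg== 2
# Earth RWFydGg= 1
```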

 

Base64 Encoder

Sometimes, for various reasons, strings are Base64 encoded multiple times, and as you might have noticed by now, each pass increases the length of the output. The Base64 encoder that I wrote on top of the one built into Python takes the number of times you would like to encode your string. The code is pretty straightforward.

 

input_str = raw_input("Enter the string that you like to be base64 encoded:")
times = int(raw_input("How deep do you want it encoded:"))

output_str = input_str

for i in range(times):
    output_str = output_str.encode("base64")

print "Encoded string: ", output_str
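
To see how quickly the output grows, here is a rough Python 3 sketch of the same idea. It assumes the base64 module instead of the Python 2 codec, and encode_repeatedly is a made-up name. Unlike the script above, b64encode does not append a newline on each pass, so the intermediate strings differ slightly from what the script produces.

```python
import base64


def encode_repeatedly(text, times):
    """Base64-encode text the given number of times."""
    data = text.encode("utf-8")
    for _ in range(times):
        data = base64.b64encode(data)
    return data.decode("ascii")


for depth in range(4):
    result = encode_repeatedly("Sun", depth)
    print(depth, len(result), result)
# The lengths printed are 3, 4, 8 and 12.
```

Each pass turns every 3 bytes into 4 characters, so the output grows by roughly a third per level.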

 

And here is a sample run

 

Image showing sample run of Base64 encoder

 

Base64 Decoder

This is where it gets a little trickier, since while decoding I assume that I do not know how many times the text was encoded. I created a base string that contains all the valid characters in Base64 encoded strings and then take the Base64 encoded string as input:

 

base_64_encoding_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

input_str = raw_input("Enter the base64 encoded string that you would like to decode: ")

With the string to be decoded in hand, we enter a while loop and stay in it until we have a potential candidate for the original string. The basic logic is to try to decode the string; if decoding fails, we append an “=” to its end, increase the error count, and try again. A successful decode resets the error count, and the loop stops after three consecutive failures, that is, when we are left with a string that cannot be decoded.

 

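import binascii

error_count = 0          # consecutive failed decode attempts
depth = 0                # number of successful decodes so far
output_str = input_str   # best candidate so far

while error_count < 3:
    # Strip newlines and cut the string at the first non-Base64 character.
    input_str, is_end = ValidateAndSplit(input_str.replace('\n', ''))

    if is_end:
        break
    try:
        temp = input_str.decode("base64")
        input_str = temp
        output_str = temp
        depth = depth + 1
        error_count = 0
        print input_str
    except binascii.Error as err:
        # Possibly missing padding: add one "=" and try again.
        error_count = error_count + 1
        input_str = input_str + "="

print "Potential decoded string: ", output_str, "\nWith depth: ", depth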

The ValidateAndSplit method tries to remove unnecessary characters from the string to make sure we don’t go down a bad path, and it also tells us when we have potentially reached the end of our search:

 

def ValidateAndSplit(input_str):
    is_end = False
    n = len(input_str)
    if n < 1:
        is_end = True
        return input_str, is_end

    for i in range(n):
        c = input_str[i]
        location = base_64_encoding_characters.find(c)
        if location < 0 and c == " ":
            is_end = True
            break
        elif location < 0:
            data = input_str.split(c, 1)
            input_str = data[0]
            break

    return input_str, is_end
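
For comparison, here is a compact Python 3 sketch of the same decode-until-it-fails idea. It is not the code from this post: decode_repeatedly and max_depth are hypothetical names, the padding is computed up front instead of being added one “=” at a time, and it swaps in a different stopping rule, namely that the search ends as soon as a decode fails or the decoded bytes are not ASCII text.

```python
import base64
import binascii


def decode_repeatedly(text, max_depth=50):
    """Decode until the result is no longer valid Base64 text."""
    depth = 0
    current = text.strip()
    while current and depth < max_depth:
        # Pad to a multiple of 4 before attempting to decode.
        padded = current + "=" * (-len(current) % 4)
        try:
            # validate=True rejects characters outside the Base64 alphabet.
            raw = base64.b64decode(padded, validate=True)
            decoded = raw.decode("ascii")
        except (binascii.Error, UnicodeDecodeError):
            # Not decodable any further: treat current as the candidate.
            break
        current = decoded.strip()
        depth += 1
    return current, depth


print(decode_repeatedly("VTNWdQ=="))  # ('Sun', 2)
```

The ASCII check is only a heuristic. It stops at Sun here because decoding Sun= yields non-ASCII bytes, but it will not catch every case of over-decoding.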

Here’s a sample run of this decoder on the same string that we Base64 encoded 10 times earlier:

 

Image showing sample run of Base64 decoder

 

The problem with the current approach is that we might over-decode strings that consist of a single word, because such a string can itself look like valid Base64. One fix could be to reach out to an online dictionary and check whether we have already found a valid word.

 

The entire source code for this post can be found at https://github.com/abhishuk85/cryptography-plays

Any questions, comments or feedback are most welcome.