If you pay a special sort of attention, you may notice that the upper-case and lower-case alphabet characters on the ASCII table are always exactly offset from each other by 32.
For example, upper-case A is 65, while lower-case a is 97.
Because of the way binary place values work out, you can manipulate the case of a letter by setting or clearing the sixth bit, which is bit 5 (2^5 = 32; we start counting bits and everything else at 0, we’re programmers!).
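The arithmetic is easy to check for yourself. Here is a minimal sketch in C, assuming an ASCII execution character set; `case_offset` and `print_offsets` are hypothetical helper names, not part of the demo program:

```c
#include <stdio.h>

/* Distance between a lower-case letter and its upper-case twin.
   Assumes ASCII; case_offset is a hypothetical helper name. */
int case_offset(char lower, char upper)
{
    return lower - upper;
}

/* Print the offset for a couple of letter pairs next to 1 << 5. */
void print_offsets(void)
{
    printf("'a' - 'A' = %d\n", case_offset('a', 'A'));
    printf("'z' - 'Z' = %d\n", case_offset('z', 'Z'));
    printf("1 << 5    = %d\n", 1 << 5);
}
```

Calling `print_offsets()` from a `main` should print 32 on all three lines on any ASCII system.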
Decimal 32 expressed as hexadecimal is 0x20, and expressed as binary is 00100000.
This binary form is useful for clearing, setting or toggling bits:
| Operation | Operator | Mask bit | Effect on target bit |
|---|---|---|---|
| SET | OR (\|) | 1 | Forces bit to 1 |
| CLEAR | AND (&) | 0 | Forces bit to 0 |
| TOGGLE | XOR (^) | 1 | Flips bit (0→1, 1→0) |
| NO-OP | AND (&) | 1 | Leaves bit unchanged |
| NO-OP | OR (\|) | 0 | Leaves bit unchanged |
Read up on logic gates, particularly the truth tables section, for this to make more sense.
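The rows of the table can also be tried out directly. The following is a small sketch using the 0x20 mask; the names `CASE_BIT`, `set_bit`, `clear_bit` and `toggle_bit` are made up for illustration:

```c
#include <stdint.h>

#define CASE_BIT 0x20u  /* 0b00100000, i.e. bit 5 */

/* SET: OR with a 1 in the mask forces the bit to 1.
   The 0s elsewhere in the mask leave the other bits alone (OR no-op). */
uint8_t set_bit(uint8_t value)
{
    return value | CASE_BIT;
}

/* CLEAR: AND with a 0 in the mask forces the bit to 0.
   ~CASE_BIT is 0b11011111, so the 1s leave the other bits alone (AND no-op). */
uint8_t clear_bit(uint8_t value)
{
    return value & (uint8_t)~CASE_BIT;
}

/* TOGGLE: XOR with a 1 in the mask flips the bit. */
uint8_t toggle_bit(uint8_t value)
{
    return value ^ CASE_BIT;
}
```

Setting or clearing is idempotent (doing it twice changes nothing more), while toggling twice brings you back to the value you started with.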
What we’re building to here is the understanding that it’s incredibly easy and efficient to lower-case, upper-case or toggle the case of an ASCII letter using bitwise operations.
A little demo in C:
Compile and run the program to see the functions work…
$ clang -O0 -g -o charcase charcase.c
$ ./charcase
a
a
B
B
…and as you can imagine, we’re going to peek at this through a debugger as well.
% clang -O0 -g -o charcase charcase.c
% lldb ./charcase
(lldb) target create "./charcase"
Current executable set to './charcase' (x86_64).
(lldb) b letter_togglecase
Breakpoint 1: where = charcase`letter_togglecase + 10 at charcase.c:5:16, address = 0x000000000020167a
(lldb) run
Process 15598 launched: './charcase' (x86_64)
a
Process 15598 stopped
* thread #1, name = 'charcase', stop reason = breakpoint 1.1
frame #0: 0x000000000020167a charcase`letter_togglecase(a='A') at charcase.c:5:16
2
3 char letter_togglecase (char a)
4 {
-> 5 return a ^ 0x20;
6 }
7
8 char letter_lowercase (char a)
(lldb) disassemble -b
charcase`letter_togglecase:
0x201670 <+0>: 55 push rbp
0x201671 <+1>: 48 89 e5 mov rbp, rsp
0x201674 <+4>: 40 88 f8 mov al, dil
0x201677 <+7>: 88 45 ff mov byte ptr [rbp - 0x1], al
-> 0x20167a <+10>: 0f be 45 ff movsx eax, byte ptr [rbp - 0x1]
0x20167e <+14>: 83 f0 20 xor eax, 0x20
0x201681 <+17>: 5d pop rbp
0x201682 <+18>: c3 ret
(lldb) var -L
0x0000000820324e2f: (char) a = 'A'
(lldb) memory read -format b 0x0000000820324e2f -size 1 -count 1
0x820324e2f: 0b01000001
(lldb) memory read -format d 0x0000000820324e2f -size 1 -count 1
0x820324e2f: 65
(lldb) b 0x201681
Breakpoint 2: where = charcase`letter_togglecase + 17 at charcase.c:5:9, address = 0x0000000000201681
(lldb) continue
Process 15598 resuming
Process 15598 stopped
* thread #1, name = 'charcase', stop reason = breakpoint 2.1
frame #0: 0x0000000000201681 charcase`letter_togglecase(a='A') at charcase.c:5:9
2
3 char letter_togglecase (char a)
4 {
-> 5 return a ^ 0x20;
6 }
7
8 char letter_lowercase (char a)
(lldb) register read -format c al
al = 'a'
(lldb) register read -format d al
al = 97
(lldb) register read -format b al
al = 0b01100001
(lldb)
Note the xor eax, 0x20 instruction at offset <+14> in the function’s disassembly above.
eax is the full 32-bit register, whereas al is just its lowest byte, a subset of the same register.
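The eax/al relationship can be mimicked in plain C by treating a 32-bit integer as eax and truncating it to 8 bits to get al. This is a rough sketch that mirrors the movsx and xor pair from the disassembly; it does not touch real registers, and the names `toggle_in_eax` and `low_byte` are invented for illustration:

```c
#include <stdint.h>

/* Mimic `movsx eax, byte ptr [...]` followed by `xor eax, 0x20`:
   sign-extend the char to 32 bits, then flip bit 5. */
uint32_t toggle_in_eax(signed char a)
{
    int32_t eax = a;                /* sign-extension, like movsx */
    return (uint32_t)eax ^ 0x20u;   /* like xor eax, 0x20 */
}

/* al is the lowest 8 bits of eax. */
uint8_t low_byte(uint32_t eax)
{
    return (uint8_t)(eax & 0xFFu);
}
```

For 'A' (65) the "eax" value comes out as 97 and so does its low byte, matching what `register read` reported for al. The upper 24 bits only matter for the caller if it looks at more than the returned char.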