Converting High Level Languages to Machine Language

Introduction

High level languages (HLLs), like Python and Java, are designed to be written and read by humans. However, computer hardware can’t understand HLLs, so it translates HLL programs into machine language (ML).

An instruction set architecture (ISA), which serve as an interface between hardware and software, define every ML. Common ISAs include ARM, MIPS, and 80×86). MLs are languages consisting of two numbers—1s and 0s—in base-2 or “binary” format. These binary numbers are often referred to as bits or binary digits and represent electronic signals for on and off.

Confusing language

ML and “machine code” are often used as synonyms. However, machine language != machine code. A program’s ML is the language it is written in (e.g., MIPs), while a specific program converted to an ML is machine code. Think of machine code as a book (i.e., a program) written in English (i.e., a machine language).

TLDR: machine code is a specific instance of a program (that might be written in MIPS assembly), while MIPS assembly is machine language.

🌸👋🏻 Join 10,000+ followers! Let’s take this to your inbox. You’ll receive occasional emails about whatever’s on my mind—offensive security, open source, academics, boats, software freedom, you get the idea.

Program translation process

Here is the general overview of how computers translate HLL (C specifically) to ML format:

Created by Warren R. Carithers; styled by Olivia Gallucci.

Now, let’s dig a bit deeper.

Compiler

First, the compiler translates each HLL statement into assembly language statements (in this case, MIPS).

Assembly

Assembly language provides a symbolic representation of machine instructions; instructions are commands that computer hardware understands and obeys (Patterson and Hennessy ch. 2).

Warren R. Carithers’ swap function in MIPS assembly language.

li   $8, 4      # r8 = 4
mul  $8, $8,$4  # r8 = t0 * i 
add  $8, $8,$4  # r8 = r8 + base
lw   $9, 0($8)  # r9 = arr[i]
lw   $10,4($8)  # r10= arr[i + 1]
sw   $10,0($8)  # arr[i] = r10
sw   $9, 4($8)  # arr[i + 1] = r9
jr   $ra        # return

Assembler

Then, an assembler translates the symbolic instructions to binary instructions (ML format) inside an object module (again, ML format).

Carithers’ swap function as machine code (ML format) with the corresponding MIPS assembly language.

00110100000010000000000000000100        li      $8, 4            # r8 = 4
00000001000001010000000000011000        mul  $8, $8,$4     # r8 = t0 * i
00000000000000000100000000010010         
00000001000001000100000000100000        add   $8, $8,$4   # r8 = r8 + base
10001101000010010000000000000000        lw     $9, 0($8)    # r9 = arr[i]
10001101000010100000000000000100        lw     $10,4($8)   # r10= arr[i + 1]
10101101000010100000000000000000        sw     $10,0($8)  # arr[i] = r10
10101101000010010000000000000100        sw     $9, 4($8)   # arr[i + 1] = r9
00000011111000000000000000001000        jr       $ra            # return

One thing to note is that the instruction used (e.g., add, lw, or jr) along with the instruction set architecture determine what bits mean in a given instance. For example, 0001100 might represent 12, yet a computer may interpret 0001100 to represent an instruction to add two values in memory.

Humans do the same thing with characters and strings. By looking at the letters xxyzxx, a person doesn’t know if xxyzxx represents a person’s name, a computer password, or a coded instruction to a co-worker.

Linker

Once the assembler produces one or more object modules, the linker takes these object modules and library modules (e.g., ncurses and glib), and outputs an executable load module (ML format). This executable module is commonly referred to as an “executable” or “executable file,” and it can be run by users on the operating system.

Linker step in C:

gcc foo.o bar.o -o foo -lncurses

Why multiple object and library modules?

Separating objects and libraries help make the compilation process more efficient. N.B., that “module” and “file” are synonymous in this section.

As programs become larger and the number of source files increases, the time required to recompile and link all source files can become very long – often requiring minutes to hours. Instead of compiling an executable using a single step, a modular compilation approach can be used that separates the compiling and linking steps within the compilation process. In this approach, each source file is independently compiled into an object file. An object file contains machine instructions for the compiled code along with placeholders, often referred to as references, for calls to functions or accesses to variables or classes defined in other source files or libraries.
Bic, Lysecky, and Vahid

For example, the Linux kernel consists of thousands of object modules and libraries. Compiling every source file when only a few files were changed would be incredibly slow.

Thus, it makes sense to only compile files and their corresponding dependencies that have been or will need to be changed.

When all the necessary source files have been compiled, the linker creates a final executable by linking together the object and library files. The linker does this by finding placeholders inside the object files, and searching all of the linked object and library modules for referenced functions or variables.

The linker replaces each placeholder with a jump instruction to the referenced function or variable, which successfully links the two modules.

Using a modular compilation approach has the benefit of reducing the time required to recompile and link the program executable. Instead of recompiling all source files, only the source files that have been modified need to be recompiled to create an updated object file. The linker can then use these newly recompiled object files along with the previously compiled object files for any unmodified source files to create the updated executable.
Bic, Lysecky, and Vahid

Loader and memory

The loader is a program in the operating system; it gets the program ready to be ran in memory.

Once the program reaches the loader step, it is an executable (ML format). In other words, the file has to already be an executable in order to go into the loader.

Unfortunately, the loader and memory steps are out of scope for this post; check back for a post on this subject!

Conclusion

If you enjoyed this post, I recommend reading my other assembly posts. This post is the second part of my new series exploring low-level programming and computer architecture.

Citations

Bic, Lubomir. Operating Systems. Zyante, 2020.
Lysecky, Roman, and Frank Vahid. Programming in C. Zyante, 2020.
Patterson, David A., and John L. Hennessy. Computer Organization and Design MIPS Edition: The Hardware/Software Interface. Morgan Kaufmann, 2020.