Assembly Language

8086 Microprocessor Primer


Introduction

Machine code is the only language understood directly by the CPU in our computers. Machine code is simply a series of numbers, which the computer can interpret to do something useful.

Assembly language is a mnemonic representation of machine code. Machine code, being just a series of numbers, can be very challenging to learn; hence the introduction of assembly language. Assembly language is not just a simple mapping of numbers to words. It also contains many high-level-language type constructs to make data definition and program structuring easier.

Machine code    Assembly language
8b c3           mov ax,bx
f8              clc
d1 e8           shr ax,1
b4 4c           mov ah,4ch
cd 21           int 21h
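To make the pairing above concrete, here is a minimal sketch in Python (used here only for illustration; it is not part of the assembler toolchain). It treats the bytes listed above as plain numbers and looks them up in a small hand-written table. The names KNOWN_ENCODINGS and disassemble are hypothetical, and the table covers only these five instructions; a real disassembler decodes the operand bits instead of matching whole byte sequences.

```python
# Toy lookup table covering only the five encodings listed above.
KNOWN_ENCODINGS = {
    bytes.fromhex("8bc3"): "mov ax,bx",
    bytes.fromhex("f8"): "clc",
    bytes.fromhex("d1e8"): "shr ax,1",
    bytes.fromhex("b44c"): "mov ah,4ch",
    bytes.fromhex("cd21"): "int 21h",
}

def disassemble(code):
    """Translate a byte string made up only of the encodings above."""
    result = []
    pos = 0
    while pos < len(code):
        # Try a one-byte match first, then a two-byte match.
        for length in (1, 2):
            chunk = code[pos:pos + length]
            if chunk in KNOWN_ENCODINGS:
                result.append(KNOWN_ENCODINGS[chunk])
                pos += length
                break
        else:
            raise ValueError("unknown byte %02x at position %d" % (code[pos], pos))
    return result

program = bytes.fromhex("8b c3 f8 d1 e8 b4 4c cd 21")
for line in disassemble(program):
    print(line)
```

The point is only that the CPU sees the nine numbers in program, while the programmer reads and writes the five mnemonics.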

Just as a compiler is used to convert from a high-level language to machine code, an assembler converts assembly language programs to machine code. The most commonly used assemblers are Microsoft Assembler (MASM) and Turbo Assembler (TASM). MASM is probably the standard as far as assemblers go, since its syntax is accepted by almost all assemblers to some degree or other.

Although the language is formally called Assembly Language, in practice it is generally known as assembler.

Need and usage of assembler

Hexadecimal Number System

Dec   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Hex   0   1   2   3   4   5   6   7   8   9   a   b   c   d   e   f

Decimal digits range from 0 to 9. Hexadecimal digits range from 0 to 15, with the letters a to f standing for the values 10 to 15.

A typical decimal number can be represented as a sum of its components, e.g.

123 = 100 + 20 + 3
    = 1 x 100 + 2 x 10 + 3 x 1
    = 1 x 10 x 10 + 2 x 10 + 3 x 1

Since the number system is based on 10, each digit is multiplied by a successively higher power of 10, counting from the right. In a hexadecimal system, however, each digit is multiplied by a power of 16. Thus in hexadecimal,

12A = 100 + 20 + A (all in hexadecimal)
    = 1 x 16 x 16 + 2 x 16 + 10 x 1 (remember A = 10)
    = 298 (in decimal)
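The same positional calculation can be sketched in Python (chosen only for illustration). The function name hex_to_decimal is hypothetical; it does by hand what the built-in int(text, 16) does.

```python
HEX_DIGITS = "0123456789ABCDEF"

def hex_to_decimal(text):
    """Convert a hexadecimal string to a number using positional weights."""
    value = 0
    for char in text.upper():
        # Moving one digit to the right multiplies everything so far by 16.
        value = value * 16 + HEX_DIGITS.index(char)
    return value

print(hex_to_decimal("12A"))   # 1 x 256 + 2 x 16 + 10 x 1 = 298
```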

To convert a number from hexadecimal to decimal, use the above technique. To convert a number from decimal to hexadecimal, use the technique of repetitive division,

298 / 16 = 18 rem 10
 18 / 16 =  1 rem 2
  1 / 16 =  0 rem 1

Reading the remainders in reverse, we get 1 2 10 = 12A in hexadecimal.
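The repeated division can be sketched in Python as follows. This is a minimal illustration and decimal_to_hex is a hypothetical name; Python's built-in format(number, "X") gives the same result.

```python
HEX_DIGITS = "0123456789ABCDEF"

def decimal_to_hex(number):
    """Convert a non-negative number to hexadecimal by repeated division."""
    if number == 0:
        return "0"
    remainders = []
    while number > 0:
        number, remainder = divmod(number, 16)   # quotient and remainder
        remainders.append(HEX_DIGITS[remainder])
    # The remainders come out lowest digit first, so read them in reverse.
    return "".join(reversed(remainders))

print(decimal_to_hex(298))   # 12A
```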

Registers and the CPU

All calculations in an 8086 CPU are done in at most 16 bits. This means that the largest unsigned number that fits in a single register is 65535. This creates obvious problems for larger numbers. These problems can be overcome by various means, depending on the situation, as will be shown later.
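One common way around the 16-bit limit is to split a large number into 16-bit words and carry between them. The Python sketch below models that idea under the assumption that the inputs fit in 32 bits; add16 and add32 are hypothetical names. On the 8086 the same pattern is normally written with the ADD instruction for the low words and ADC (add with carry) for the high words.

```python
WORD_MASK = 0xFFFF   # largest value that fits in 16 bits (65535)

def add16(a, b, carry_in=0):
    """Add two 16-bit values; return (16-bit result, carry out)."""
    total = a + b + carry_in
    return total & WORD_MASK, total >> 16

def add32(a, b):
    """Add two 32-bit values using only 16-bit additions."""
    low, carry = add16(a & WORD_MASK, b & WORD_MASK)
    high, _ = add16(a >> 16, b >> 16, carry)   # carry from the low words
    return (high << 16) | low

print(add16(65535, 1))       # (0, 1): the result wraps to 0 and sets the carry
print(add32(70000, 70000))   # 140000
```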

The CPU has a number of temporary storage locations called registers. These registers can hold values while calculations are in progress, which makes the calculations faster since the registers are in the CPU itself. Registers are also used to keep track of important pieces of information, such as the position of the currently executing instruction in the program.

The 8086 CPU has the following registers :

AH,AL,AX    accumulator
BH,BL,BX    base register
CH,CL,CX    count register
DH,DL,DX    data register

SP          stack pointer
BP          base pointer

SI          source index
DI          destination index
IP          instruction pointer

CS          code segment
DS          data segment
ES          extra segment
SS          stack segment

All the registers are 16-bit registers except the ones ending in H or L, which are 8 bits wide. These are not separate registers but are used to access the high part (e.g. AH) or the low part (e.g. AL) of one of the general purpose 16-bit registers (AX, BX, CX, DX).
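The relationship between AX and its two halves can be modelled with shifts and masks. This is a minimal Python sketch; split_word and join_bytes are hypothetical names.

```python
def split_word(ax):
    """Return (AH, AL): the high and low bytes of a 16-bit value."""
    return (ax >> 8) & 0xFF, ax & 0xFF

def join_bytes(ah, al):
    """Rebuild the 16-bit value from its high and low bytes."""
    return ((ah & 0xFF) << 8) | (al & 0xFF)

ax = 0x4C00
ah, al = split_word(ax)
print("%02X %02X" % (ah, al))    # 4C 00

ax = join_bytes(ah, 0x01)        # changing AL leaves AH untouched
print("%04X" % ax)               # 4C01
```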

Number representation

Since all numbers used in assembler are generally 8-bit or 16-bit it is easier to represent them as hexadecimal rather than decimal. This means that all 8-bit numbers become 2-digit hexadecimal numbers and all 16-bit numbers become 4-digit hexadecimal numbers. This has other advantages, especially since a hexadecimal number can be broken down into binary more easily than a decimal number.

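Hexadecimal numbers are written with a trailing "h". So 10h is the value 16 in decimal. The "h" is required: without it the assembler normally reads the number as decimal. MASM and TASM also require a hexadecimal number to begin with a decimal digit, so a value whose first digit is a letter is written with a leading zero, e.g. 0FFh rather than FFh; otherwise the assembler treats it as a name.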
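The digit counts mentioned above (2 hexadecimal digits for 8 bits, 4 for 16 bits) and the easy breakdown into binary can be checked with a short Python sketch. The names to_hex and to_binary_groups are hypothetical.

```python
def to_hex(value, bits):
    """Format a value as fixed-width hexadecimal with a trailing h."""
    digits = bits // 4                     # each hex digit covers 4 bits
    return format(value, "0%dX" % digits) + "h"

def to_binary_groups(value, bits):
    """Show the binary form in groups of 4 bits, one group per hex digit."""
    text = format(value, "0%db" % bits)
    return " ".join(text[i:i + 4] for i in range(0, bits, 4))

print(to_hex(16, 8))                  # 10h
print(to_hex(0x4C00, 16))             # 4C00h
print(to_binary_groups(0x12A, 16))    # 0000 0001 0010 1010
```

Each group of four bits in the last line corresponds to exactly one hexadecimal digit (0, 1, 2, A), which is why hexadecimal converts to binary so much more directly than decimal does.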

Memory Organisation

High-level languages do not generally require the user to know about memory organisation because this is taken care of by the compiler. However, for assembler, it is important to have a good knowledge of how memory is used and can thus be manipulated by assembler programs.

Memory can be thought of as a vast collection of bytes. These bytes need to be organised in some efficient manner in order to be of any use. A simple scheme would be to order the bytes in a serial fashion and number them from 0 (or 1) to the end of memory. The numbers thus given to the individual positions in memory are called ADDRESSES. The problem with this approach is that towards the end of memory, the addresses become very large. For example, if a computer has 1 Megabyte of RAM, the highest address would be 1048575 (=1024*1024-1). This would not fit in a 16-bit register, and therefore addresses need to be stored in two registers.

The scheme used in the 8086 is called segmentation. Every address has two parts, a SEGMENT and an OFFSET. The segment indicates the start of a 64 kilobyte portion of memory, in multiples of 16 bytes. The offset indicates the position within the 64k portion.

absolute address = (segment * 16) + offset

Assuming we want to access a byte at absolute position 70000, we cannot use a segment of 0 because the offset would need to be 70000 and offsets cannot be greater than 65535. If we use a segment of 1000 (decimal), then the segment starts at 16000 and the offset needs to be 54000, which is feasible.
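The formula and the example can be checked with a short Python sketch. The names absolute_address and offset_for are hypothetical. Note that a segment value of 1000 in decimal and one of 1000h (4096 in decimal) give different offsets for the same absolute position.

```python
def absolute_address(segment, offset):
    """absolute address = (segment * 16) + offset"""
    return segment * 16 + offset

def offset_for(absolute, segment):
    """Return the offset needed to reach an absolute address from a segment."""
    offset = absolute - segment * 16
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("address is not reachable from this segment")
    return offset

print(offset_for(70000, 1000))      # 54000 (segment 1000 decimal)
print(offset_for(70000, 0x1000))    # 4464  (segment 1000h)

try:
    offset_for(70000, 0)            # would need an offset of 70000
except ValueError as error:
    print(error)
```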

Note that several different segment:offset pairs can point to the same absolute address. For example, consider the following segment:offset pairs (all values in decimal):

segment    offset    absolute address
1000       64        16064
1001       48        16064
1002       32        16064
1003       16        16064
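To see how many segment:offset pairs reach the same byte, the Python sketch below simply tries every 16-bit segment value. The function name equivalent_pairs is hypothetical, and the sketch ignores the wrap-around that occurs past the end of the 1 Megabyte address space.

```python
def equivalent_pairs(absolute):
    """List every (segment, offset) pair that maps to an absolute address."""
    pairs = []
    for segment in range(0x10000):            # all 16-bit segment values
        offset = absolute - segment * 16
        if 0 <= offset <= 0xFFFF:             # offset must fit in 16 bits
            pairs.append((segment, offset))
    return pairs

pairs = equivalent_pairs(16064)
print(len(pairs))           # 1005
print(pairs[1000:1004])     # [(1000, 64), (1001, 48), (1002, 32), (1003, 16)]
```

Every step of 1 in the segment moves the start of the 64k window by 16 bytes, so the offset shrinks by 16 to compensate.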

In the 8086 CPU, programs are identified by different segment numbers. Thereafter all references to memory within that program are made via the offset. This works remarkably well for small (<64k) programs. For larger programs multiple segments are used. The segment registers (CS, DS, ES, SS) are used for the purpose of indicating the relevant segments for a particular program.

CS indicates the start of the code of the program

DS indicates the beginning of the data storage section of memory

SS indicates the stack segment

ES is used for operations where data is transferred from one segment to another

In conjunction with the segment, memory can be addressed by using an offset. The most common form of addressing is by simply giving an offset, but a full address can usually be specified by indicating both the segment and the offset. If only an offset is given, the CPU uses a default segment register that depends on the kind of access. If data is being manipulated, the DS register is normally used, and if the program is jumping from one location to another, the CS register is used.

When a typical program is loaded into memory, the CS register is set to the segment that holds the program's code and the instruction pointer (IP) is set to the offset of the entry point, which is zero in the simplest case. As the program progresses, the IP is updated so that CS:IP always indicates the next instruction to be executed.

Software Interrupts and BIOS

Contrary to common belief, DOS is not only about command-line commands like DIR and TYPE. DOS is mainly a collection of procedures that can be accessed by any program. These procedures perform input and output between the computer and the keyboard, screen, disk, printer, etc. The programmer does not have to write lengthy routines to do these tasks in every assembler program. To use these routines, we call a SOFTWARE INTERRUPT. A number of values must be set in the particular registers and then an interrupt (INT) command must be issued. This then executes the relevant procedure.

For the most part DOS does not communicate with the hardware directly, since DOS is the same for all computers. Instead, DOS makes use of an even lower-level layer of software called the BIOS (Basic Input/Output System). This is another set of routines and variables that are built into the computer's hardware, in many cases in the ROM (read-only memory). Different computers have different BIOSes but they all provide the same basic functions. DOS then uses these functions to give programmers a richer set of procedures to use.

In some cases there are many subfunctions that are contained within a single procedure. To access these, a register is used to denote the particular subfunction. For example, before calling interrupt 21h, if the AH register is set to 30h then the DOS version number is returned.
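The idea of selecting a subfunction through a register can be pictured with a toy model in Python. This is purely an illustration and not how DOS is implemented: the dictionary regs stands in for the CPU registers, the names SUBFUNCTIONS and int_21h are hypothetical, and the version number returned is made up. The only facts taken from DOS are that subfunction 30h returns the major version in AL and the minor version in AH, and that subfunction 4ch ends the program.

```python
# Toy model only: the real services are routines inside DOS, not Python code.
def get_dos_version(regs):
    regs["al"] = 5    # hypothetical major version
    regs["ah"] = 0    # hypothetical minor version

def terminate_program(regs):
    regs["running"] = False

SUBFUNCTIONS = {0x30: get_dos_version, 0x4C: terminate_program}

def int_21h(regs):
    """Dispatch on the subfunction number found in AH."""
    subfunction = regs["ah"]
    if subfunction not in SUBFUNCTIONS:
        raise NotImplementedError("subfunction %02Xh is not modelled" % subfunction)
    SUBFUNCTIONS[subfunction](regs)

regs = {"ah": 0x30, "al": 0, "running": True}   # like: mov ah,30h
int_21h(regs)                                    # like: int 21h
print("DOS version %d.%d" % (regs["al"], regs["ah"]))
```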

The BIOS Data Area has a collection of variables about the current state of the computer. The BIOS Data Area is located at segment 0040h.

Sample Assembler Program #1

DOSSEG
.MODEL TINY
.STACK

.DATA

.CODE
ProgramStart:
mov ah,4ch ; end program
int 21h
END ProgramStart

This is a simple program which does absolutely nothing.

In assembler we have to explicitly perform many functions which are taken for granted in high-level languages. The most important of these is exiting from a program. Although it may seem obvious that a program ends when the code ends, there are many different techniques to end a program. Assemblers leave the choice of exit code to the user. One of the easiest techniques is to call interrupt 21h, subfunction 4ch.

First the file has to be created with a text editor such as EDIT. Then to assemble the source file into an object file we use an assembler, typically TASM. Finally the object file must be linked into an executable by using a linker, typically TLINK. Linking is an extra step that is included in the process in order to allow the programmer to use multiple object/source files for a single assembler program. This will be used extensively in later programs.

The commands in sequence would be :

EDIT SAMPLE1.ASM

(enter lines and save file)

TASM SAMPLE1

TLINK SAMPLE1

When you run the file by calling SAMPLE1, you will just be returned to the command prompt since the program is not meant to do anything more. If the computer hangs or does anything unusual, you ought to reboot and check your program for any typing errors; then re-assemble and relink the file.

DOSSEG is what is called an assembler directive. It is not an assembly language instruction, but a code telling the assembler to perform a certain task. In this case, it requests that the segments of the program be arranged in the standard order used by DOS and the Microsoft high-level languages. It is best left in place.

.MODEL instructs the assembler on how memory should be arranged. This is known as the memory model. It is not a feature specific to assemblers since even high-level languages need a memory model to be pre-defined. The memory model specifies how much of memory to use for the storage of code and data. A TINY memory model means that one 64k segment will be used for both code and data.

.STACK is a directive that sets aside a stack segment for the program. The stack is used for temporary data storage and procedure calls.

.DATA indicates the variable storage section

.CODE indicates the start of the actual assembly language code

ProgramStart: is an arbitrary name selected to indicate the entry point, i.e. where the program starts running. It is not always necessary, but it is better to always put it in than to leave it out.

mov ah,4ch is the first line of assembler code. The value 4C in hexadecimal is stored in the register AH.

int 21h is the second line of assembler code. The software interrupt 21h is called. This interrupt, when given the value of 4ch in AH (as is the case here), causes the program to exit immediately.

END denotes the end of the program. Although not necessary, it is advisable to put in the name of the entry point label as a parameter.

Sample Assembler Program #2

; sample02
; illustrates standard output interrupt
DOSSEG
.MODEL SMALL
.STACK

.DATA
aString db 'Hello World',13,10,'$'

.CODE
ProgramStart:
mov ax,SEG _DATA ; set data segment
mov ds,ax

mov ah,09h ; output message
mov dx,OFFSET aString
int 21h

mov ah,4ch ; terminate program
int 21h
END ProgramStart

Any line that starts with a ";" like the first two lines here is considered to be a comment. Comments can appear also at the end of any line, causing everything after the ";" to be ignored.

A string is declared in the .DATA section. The name of the string is set to aString and the type of the data is byte (DB = define byte). Although the string is longer than a single byte, the DB applies to every item that follows it, up to the next declaration, so each character and number is stored as one byte. At the end of the string are a 13 and a 10; these are the carriage return and linefeed characters that are used to go to the next line. The last '$' is needed by the output function to signal the end of the string.
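The byte layout of aString, and the role of the '$' terminator, can be pictured with a short Python sketch. The name text_before_dollar is hypothetical; it only mimics the way the output function stops at the first '$'.

```python
# The same bytes that the DB line places in memory.
a_string = b"Hello World" + bytes([13, 10]) + b"$"

def text_before_dollar(data, start=0):
    """Return the characters from start up to, but not including, the '$'."""
    end = data.index(b"$", start)
    return data[start:end].decode("ascii")

print(len(a_string))                     # 14 bytes: 11 letters, CR, LF, '$'
print(repr(text_before_dollar(a_string)))   # 'Hello World\r\n'
```

One consequence of this convention is that a '$' character cannot itself be printed with this output function.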

Before the program can use the data in the data segment, the DS register must first be set up appropriately. Unlike the CS register, which is always set when the program starts, the DS register must be explicitly set to point to the DATA segment. The first "mov" instruction gets the SEGment of the DATA segment and stores it in the AX register. The second "mov" instruction sets the DS register value from the AX register. Two instructions are necessary because the 8086 CPU has no instruction to load a constant value directly into a segment register; the value has to pass through a general purpose register or memory.

The value of 9 is inserted into the AH register to select sub-function 9 of the interrupt 21h DOS interrupts. This interrupt requires that the DS:DX segment:offset pair point to the string to be output. In this case, DS already points to the segment containing the string. So we just set the DX register to the OFFSET of the string.

Interrupt 21h is called to output the string and the program terminates like before.

Sample Assembler Program #3

; sample03
; illustrates setting of BIOS variable
; clears the NUMLock, CAPSLock, SCROLLLock flags
DOSSEG
.MODEL SMALL
.STACK

.DATA
aString db 'Lock keys reset !',13,10,'$'

.CODE
ProgramStart:
mov ax,SEG _DATA ; set data segment
mov ds,ax

mov ax,0040h ; set ES to point to BIOS Data Area
mov es,ax
mov byte ptr es:[0017h],0 ; reset keyboard flags

mov ah,09h ; output message
mov dx,OFFSET aString
int 21h

mov ah,4ch ; terminate program
int 21h
END ProgramStart

This program has three additional lines to illustrate the simplicity with which useful tasks can be accomplished in assembler.

First the value 0040h is stored in AX. Then this value is copied into the ES (extra segment) register.

The following line stores the value of 0 (zero) in the memory location ES:0017h. This is equivalent to 0040h:0017h since the ES register points to 0040h. This is a very interesting location as it holds, at any time, the current state of the CapsLock, NumLock, ScrollLock and shift keys on the keyboard. By changing this value to 0, the CapsLock and other lock states are switched off. The reason for the "byte ptr" is that the assembler cannot guess whether to use a byte or a word for the data storage. So we explicitly denote that this location points (ptr) to a byte.
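The effect on the flag byte can be modelled in Python. This sketch relies on the usual layout of the keyboard flag byte at 0040h:0017h (Scroll Lock in bit 4, Num Lock in bit 5, Caps Lock in bit 6), which is an assumption taken from standard BIOS documentation and not stated in this primer. The names below are hypothetical. It also shows a gentler alternative that clears only the three lock bits and leaves the shift-key bits alone.

```python
BDA_SEGMENT = 0x0040
KEYBOARD_FLAGS_OFFSET = 0x0017

# Assumed bit positions within the keyboard flag byte.
SCROLL_LOCK = 0x10
NUM_LOCK = 0x20
CAPS_LOCK = 0x40
LOCK_MASK = SCROLL_LOCK | NUM_LOCK | CAPS_LOCK

def clear_everything(flags):
    """What the sample program does: overwrite the whole byte with 0."""
    return 0

def clear_locks_only(flags):
    """Alternative: clear the three lock bits and keep the others."""
    return flags & ~LOCK_MASK & 0xFF

address = BDA_SEGMENT * 16 + KEYBOARD_FLAGS_OFFSET
print(hex(address))                     # 0x417

flags = 0x62                            # example: Caps Lock, Num Lock, left Shift
print(hex(clear_everything(flags)))     # 0x0
print(hex(clear_locks_only(flags)))     # 0x2
```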

This is an example of how assembler programs can do things that cannot be done normally in high-level languages. It also shows that programs can be very small and very fast.