1. ARM
Advanced RISC Machines
ARM Instruction Set
2. Stack Processing
• A stack is usually implemented as a linear data structure which
grows up (an ascending stack) or down (a descending stack) in
memory
• A stack pointer holds the address of the current top of the stack,
either by pointing to the last valid data item pushed onto the stack
(a full stack), or by pointing to the vacant slot where the next data
item will be placed (an empty stack)
• ARM multiple register transfer instructions support all four forms of
stacks
– Full ascending: grows up; base register points to the highest address
containing a valid item
– Empty ascending: grows up; base register points to the first empty
location above the stack
– Full descending: grows down; base register points to the lowest address
containing a valid item
– Empty descending: grows down; base register points to the first empty
location below the stack
3. • The ARM architecture uses the load-store multiple instructions to carry out
stack operations.
• The pop operation (removing data from a stack) uses a load multiple
instruction; similarly, the push operation (placing data onto the stack) uses a
store multiple instruction.
• When using a stack you have to decide whether the stack will grow up or
down in memory. A stack is either ascending (A) or descending (D).
Ascending stacks grow towards higher memory addresses; in contrast,
descending stacks grow towards lower memory addresses.
• When you use a full stack (F), the stack pointer sp points to an address that is
the last used or full location (i.e., sp points to the last item on the stack). In
contrast, if you use an empty stack (E) the sp points to an address that is the
first unused or empty location (i.e., it points after the last item on the stack).
• There are a number of load-store multiple addressing mode aliases available
to support stack operations (see Table). Next to the pop column is the actual
load multiple instruction equivalent.
4. For example, a full ascending stack would have the notation FA appended to the load
multiple instruction—LDMFA. This would be translated into an LDMDA instruction.
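The table mentioned on the previous slide is not reproduced in this copy. As a stand-in, the sketch below records the standard mapping from the stack aliases to the underlying load-store multiple addressing modes (IA, IB, DA, DB). It is written in Python purely as a lookup aid; the names STACK_ALIASES and translate are illustrative and not part of any ARM tool.

```python
# Stack-type suffix -> (pop instruction, push instruction), expressed with the
# underlying addressing modes: IA/IB = increment after/before,
# DA/DB = decrement after/before.
STACK_ALIASES = {
    "FA": ("LDMDA", "STMIB"),  # full ascending
    "FD": ("LDMIA", "STMDB"),  # full descending
    "EA": ("LDMDB", "STMIA"),  # empty ascending
    "ED": ("LDMIB", "STMDA"),  # empty descending
}

def translate(mnemonic):
    """Translate an alias such as 'LDMFA' into its addressing-mode form."""
    op, suffix = mnemonic[:3], mnemonic[3:]
    pop, push = STACK_ALIASES[suffix]
    return pop if op == "LDM" else push

print(translate("LDMFA"))  # LDMDA
print(translate("STMFD"))  # STMDB
```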
5. Example 20
The STMFD instruction pushes registers onto the stack, updating the sp. Figure shows a
push onto a full descending stack. You can see that when the stack grows the stack
pointer points to the last full entry in the stack.
PRE r1 = 0x00000002
r4 = 0x00000003
sp = 0x00080014
STMFD sp!, {r1,r4}
POST r1 = 0x00000002
r4 = 0x00000003
sp = 0x0008000c
NOTE: The stack pointer points to the last full entry in the stack.
6. Example 21
In contrast, the next figure shows a push operation on an empty stack using the STMED
instruction. The STMED instruction pushes the registers onto the stack but updates
register sp to point to the next empty location.
PRE r1 = 0x00000002
r4 = 0x00000003
sp = 0x00080010
STMED sp!, {r1,r4}
POST r1 = 0x00000002
r4 = 0x00000003
sp = 0x00080008
NOTE: sp points to the next empty location.
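The address arithmetic in Examples 20 and 21 can be checked with a small model. The sketch below is a Python approximation of STM with write-back for the four stack types, assuming 4-byte words and that the lowest-numbered register goes to the lowest address; the function name push and the dictionary used as memory are illustrative only.

```python
def push(kind, sp, values, mem):
    """Model STM<kind> sp!, {regs}.

    kind   : "FD", "ED", "FA" or "EA"
    values : register contents in ascending register order
    mem    : dict mapping word addresses to stored values
    Returns the updated stack pointer.
    """
    n = len(values)
    full = kind[0] == "F"
    if kind[1] == "D":                       # descending: sp moves down
        new_sp = sp - 4 * n
        low = new_sp if full else new_sp + 4
    else:                                    # ascending: sp moves up
        new_sp = sp + 4 * n
        low = sp + 4 if full else sp
    # The lowest-numbered register is stored at the lowest address.
    for i, value in enumerate(values):
        mem[low + 4 * i] = value
    return new_sp

mem = {}
print(hex(push("FD", 0x00080014, [2, 3], mem)))  # 0x8000c (Example 20)
print(hex(push("ED", 0x00080010, [2, 3], {})))   # 0x80008 (Example 21)
```

In the full descending case sp ends on the word holding r1, while in the empty descending case it ends one word below it, which is the difference the two notes describe.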
8. Stack Examples
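[Figure: memory between 0x3e8 and 0x418 after each of STMFD sp!, STMED sp!, STMFA sp! and STMEA sp! with the register list {r0,r1,r3-r5}, all starting from the same old SP at 0x400. The descending forms store r5 down to r0 at successively lower addresses and the ascending forms store r0 up to r5 at successively higher addresses; the empty forms start at the old SP itself, the full forms one word beyond it. The new SP is marked in each panel.]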
9. Load-Store Instructions
• Three basic forms to move data between ARM registers
and memory
– Single register load and store instruction
• A byte, a 16-bit half word, a 32-bit word
– Multiple register load and store instruction
• To save or restore workspace registers for procedure entry and exit
• To copy blocks of data
– Single register swap instruction
• A value in a register to be exchanged with a value in memory
• To implement semaphores to ensure mutual exclusion on accesses
10. Single Register Swap Instruction
• The swap instruction is a special case of a load-store
instruction. It swaps the contents of memory with the
contents of a register.
• This instruction is an atomic operation—it reads and writes a
location in the same bus operation, preventing any other
instruction from reading or writing to that location until it
completes.
• Swap cannot be interrupted by any other instruction or any
other bus access. We say the system “holds the bus” until the
transaction is complete.
12. Example 21
The swap instruction loads a word from memory into register r0 and overwrites the
memory with register r1.
PRE mem32[0x9000] = 0x12345678
r0 = 0x00000000
r1 = 0x11112222
r2 = 0x00009000
SWP r0, r1, [r2]
POST mem32[0x9000] = 0x11112222
r0 = 0x12345678
r1 = 0x11112222
r2 = 0x00009000
This instruction is particularly useful when implementing semaphores and mutual
exclusion in an operating system. You can see from the syntax that this instruction can
also have a byte size qualifier B, so this instruction allows for both a word and a byte
swap.
14. Concept of SEMAPHORE
• In computer science, a semaphore is a variable or abstract data type that
provides a simple but useful abstraction for controlling access by
multiple processes to a common resource in a parallel
programming environment.
• A semaphore, in its most basic form, is a protected integer variable that
can facilitate and restrict access to shared resources in a multi-processing
environment.
• The two most common kinds of semaphores are counting
semaphores and binary semaphores. Counting semaphores represent
multiple resources, while binary semaphores, as the name implies,
represent two possible states (generally 0 or 1; locked or unlocked).
15. • A semaphore can only be accessed using the following
operations: wait() and signal().
• wait() is called when a process wants access to a resource. This is equivalent
to an arriving customer trying to get an open table in a restaurant. If there is an open
table, that is, the semaphore is greater than zero, the customer can take that resource
and sit at the table. If there is no open table and the semaphore is zero, that process
must wait until one becomes available. signal() is called when a process is done using a
resource, or when the patron has finished the meal.
• The following is an implementation of this counting semaphore (where the value
can be greater than 1):
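The original listing is not reproduced in this copy. A minimal sketch of a counting semaphore with wait() and signal(), written in Python and assuming a condition variable for the blocking, might look like the following. The class name CountingSemaphore and the timeout parameter are illustrative additions, not the original implementation.

```python
import threading

class CountingSemaphore:
    """Counting semaphore: the value is the number of free resources."""

    def __init__(self, value):
        self._value = value
        self._cond = threading.Condition()

    def wait(self, timeout=None):
        """Block until the value is positive, then decrement it.

        Returns False if the timeout expires first (illustrative extra).
        """
        with self._cond:
            if not self._cond.wait_for(lambda: self._value > 0, timeout):
                return False
            self._value -= 1
            return True

    def signal(self):
        """Release one resource and wake one waiting caller."""
        with self._cond:
            self._value += 1
            self._cond.notify()

tables = CountingSemaphore(2)        # two open tables
print(tables.wait(), tables.wait())  # True True
print(tables.wait(timeout=0.01))     # False: no table free
tables.signal()                      # one patron leaves
print(tables.wait(timeout=0.01))     # True
```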
16. • In this implementation, a process wanting to enter its critical section has
to acquire the binary semaphore, which then gives it mutual exclusion
until it signals that it is done.
• For example, we have semaphore s, and two processes, P1 and P2 that
want to enter their critical sections at the same time. P1 first calls wait(s).
The value of s is decremented to 0 and P1 enters its critical section. While
P1 is in its critical section, P2 calls wait(s), but because the value of s is
zero, it must wait until P1 finishes its critical section and executes signal(s).
• When P1 calls signal, the value of s is incremented to 1, and P2 can then
proceed to execute in its critical section (after decrementing the
semaphore again). Mutual exclusion is achieved because only one process
can be in its critical section at any time.
17. Example 22
This example shows a simple data guard that can be used to protect data from being
written by another task. The SWP instruction “holds the bus” until the transaction is
complete.
loop
LDR r1, =semaphore ; load the address of the semaphore
MOV r2, #1
SWP r3, r2, [r1] ; hold the bus until complete
CMP r3, #1 ; was the semaphore already taken?
BEQ loop ; yes: try again
The address pointed to by the semaphore either contains the value 0 or 1. When the
semaphore equals 1, then the service in question is being used by another process. The
routine will continue to loop around until the service is released by the other process—
in other words, when the semaphore address location contains the value 0.
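The logic of this data guard can be modelled outside the processor. The sketch below is a Python approximation in which a lock stands in for "holding the bus" so that the read and the write of the swap cannot be separated; the names swp, try_acquire and release, and the dictionary used as memory, are hypothetical.

```python
import threading

_bus = threading.Lock()      # stands in for the bus being held during SWP
memory = {"semaphore": 0}    # 0 = service free, 1 = service in use

def swp(new_value, address):
    """Model SWP rd, rm, [rn]: return the old word and store the new one."""
    with _bus:
        old = memory[address]
        memory[address] = new_value
        return old

def try_acquire():
    """One pass of the loop: True if the old value was 0, so we now own it."""
    return swp(1, "semaphore") == 0

def release():
    """Mark the service as free again."""
    swp(0, "semaphore")

print(try_acquire())  # True: the semaphore was 0
print(try_acquire())  # False: already taken, the ARM code would loop
release()
print(try_acquire())  # True
```

Writing 1 when the semaphore is already 1 changes nothing, which is why the loop can retry the swap unconditionally.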
18. ARM instructions by instruction class
1. Data Processing Instructions
2. Branch Instructions
3. Load-Store Instructions
4. Software Interrupt Instruction
5. Program Status Register Instructions
19. Software Interrupt Instruction
Introduction
• The software interrupt instruction is used for calls to the operating system
and is often called a 'supervisor call'.
• It puts the processor into supervisor mode and begins executing
instructions from address 0x08.
Binary encoding
Bits 31-28: COND
Bits 27-24: OPCODE
Bits 23-0: 24-BIT (INTERPRETED) IMMEDIATE
20. Binary encoding
Description
To return to the instruction after the SWI, the system routine must not only copy r14_svc
back into the PC, but it must also restore the CPSR from SPSR_svc.
22. Example 23
Here we have a simple example of an SWI call with SWI number 0x123456, used by ARM
toolkits as a debugging SWI. Typically the SWI instruction is executed in user mode.
PRE cpsr = nzcVqift_USER
pc = 0x00008000
lr = 0x003fffff; lr = r14
r0 = 0x12
0x00008000 SWI 0x123456
POST cpsr = nzcVqIft_SVC
spsr = nzcVqift_USER
pc = 0x00000008
lr = 0x00008004
r0 = 0x12
Since SWI instructions are used to call operating system routines, you need some form of
parameter passing. This is achieved using registers. In this example, register r0 is used to pass
the parameter 0x12. The return values are also passed back via registers.
Code called the SWI handler is required to process the SWI call. The handler obtains the SWI
number using the address of the executed instruction, which is calculated from the link
register lr.
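As a rough illustration of that last step, the sketch below models how a handler could recover the SWI number in ARM state: the SWI instruction sits one word before the return address in lr, and the number is its low 24 bits. This is a Python model, not handler code; swi_number and the read_word callback are hypothetical names.

```python
def swi_number(lr, read_word):
    """Return the 24-bit SWI number of the instruction that set lr.

    lr        : return address saved in r14_svc (ARM state assumed)
    read_word : function returning the 32-bit word at a given address
    """
    instruction = read_word(lr - 4)    # the SWI is the word before lr
    return instruction & 0x00FFFFFF    # bits 23-0 hold the immediate

# SWI 0x123456 with the AL condition: cond = 0xE, bits 27-24 = 0xF.
code = {0x00008000: 0xEF123456}
print(hex(swi_number(0x00008004, code.__getitem__)))  # 0x123456
```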
23. ARM instructions by instruction class
1. Data Processing Instructions
2. Branch Instructions
3. Load-Store Instructions
4. Software Interrupt Instruction
5. Program Status Register Instructions (MSR, MRS)
(Self study!) Refer to Steve Furber
24. Byte organizations
• Little-endian mode:
- the lowest-order byte resides in the low-order bits of the word
• Big-endian mode:
- the lowest-order byte is stored in the highest bits of the word
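Seen from memory, the two modes differ in which byte of a word lands at the lowest address. The short Python sketch below shows this for one arbitrarily chosen word; it illustrates byte ordering in general and is not specific to any ARM core.

```python
word = 0x12345678

# Bytes in increasing address order for each mode.
little = word.to_bytes(4, "little")
big = word.to_bytes(4, "big")

print(little.hex())  # 78563412: lowest-order byte at the lowest address
print(big.hex())     # 12345678: highest-order byte at the lowest address
```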
26. Thumb Mode
• Thumb is a 16-bit instruction set
– Optimized for code density from C code
– Improved performance from narrow memory
– Subset of the functionality of the ARM instruction set
• Core has two execution states – ARM and Thumb
– Switch between them using BX instruction
• Thumb has characteristic features:
– Most Thumb instructions are executed unconditionally
– Many Thumb data processing instructions use a 2-address
format
– Thumb instruction formats are less regular than ARM
instruction formats, as a result of the dense encoding.
27. Thumb has higher code density!
• Code density is defined as the space taken up in memory by an executable
program.
• On average, a Thumb implementation of the same code takes up around 30%
less memory than the equivalent ARM implementation.
• Figure 4.1 shows the same divide code routine implemented in ARM and Thumb
assembly code. Even though the Thumb implementation uses more instructions,
the overall memory footprint is reduced. Code density was the main driving
force for the Thumb instruction set.
28. Even though the Thumb implementation uses more instructions, the overall memory
footprint is reduced.
Code density was the main driving force for the Thumb instruction set. Because
it was also designed as a compiler target, rather than for hand-written assembly
code, we recommend that you write Thumb-targeted code in a high-level
language like C or C++.
29. Thumb Register Usage
• In Thumb state, you do not have direct access to all registers.
• Only the low registers r0 to r7 are fully accessible.
• The higher registers r8 to r12 are only accessible with MOV, ADD, or
CMP instructions.
• CMP and all the data processing instructions that operate on low
registers update the condition flags in the cpsr.
33. Thumb Instruction Entry and Exit
T bit, bit 5 of CPSR
If T = 1, the processor interprets the instruction stream as 16-bit Thumb
instructions
If T = 0, the processor interprets it as standard ARM instructions
Thumb Entry
ARM cores start up, after reset, executing ARM instructions
Executing a Branch and Exchange instruction (BX)
Sets the T bit if the bottom bit of the specified register was set
Switches the PC to the address given in the remainder of the register
Thumb Exit
Executing a Thumb BX instruction
34. ARM-Thumb Interworking
• ARM-Thumb interworking is the name given to the method of
linking ARM and Thumb code together for both assembly and
C/C++.
• To call a Thumb routine from an ARM routine, the core has to
change state. This state change is shown in the T bit of the
cpsr.
• The BX and BLX branch instructions cause a switch between
ARM and Thumb state while branching to a routine.
• The BX lr instruction returns from a routine, also with a state
switch if necessary.
35. • There are two versions of the BX or BLX instructions: an ARM
instruction and a Thumb equivalent.
• The ARM BX instruction enters Thumb state only if bit 0 of the
address in Rn is set to binary 1; otherwise it enters ARM state. The
Thumb BX instruction does the same.
Syntax: BX Rn
BLX Rn | label
36. Interworking Instructions
• Interworking is achieved using the Branch Exchange instructions
– In Thumb state
BX Rn
– In ARM state (on Thumb-aware cores only)
BX<condition> Rn
where Rn can be any register (R0 to R15)
• This performs a branch to an absolute address in the 4 GB address space
by copying Rn to the program counter
• Bit 0 of Rn specifies the state to change to
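The effect of BX on the program counter and the T bit can be summarised in a few lines. The sketch below is a simplified Python model that only separates bit 0 from the rest of the address; it ignores the alignment requirements on ARM-state targets, and the function name bx is illustrative.

```python
def bx(rn):
    """Return (new_pc, thumb_state) for BX Rn.

    Bit 0 of Rn selects the state; the remaining bits give the target.
    """
    thumb = bool(rn & 1)
    new_pc = rn & ~1 & 0xFFFFFFFF   # bit 0 is not part of the address
    return new_pc, thumb

print(bx(0x00008001))  # (32768, True): enter Thumb state at 0x8000
print(bx(0x00009000))  # (36864, False): enter ARM state at 0x9000
```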
38. Example 24
;Start off in ARM state
CODE32
ADR r0,Into_Thumb+1 ;generate branch target
;address & set bit 0
;hence arrive Thumb state
BX r0 ;branch exchange to Thumb
…
CODE16 ;assemble subsequent as Thumb
Into_Thumb …
ADR r5,Back_to_ARM ;generate branch target to
;word-aligned address,
;hence bit 0 is cleared.
BX r5 ;branch exchange to ARM
…
CODE32 ;assemble subsequent as ARM
Back_to_ARM …
40. ARM data instructions
ADD Add
ADC Add with carry
SUB Subtract
SBC Subtract with carry
RSB Reverse subtract; RSB r0,r1,r2 sets r0 = r2 - r1
RSC Reverse subtract with carry
MUL Multiply
MLA Multiply and accumulate
• MLA r0,r1,r2,r3 sets r0 = r1 x r2 + r3
41. ARM data instructions
AND Bit-wise and
ORR Bit-wise or
EOR Bit-wise exclusive-or
BIC Bit clear
• BIC r0,r1,r2 sets r0 to r1 AND NOT r2
- uses the second source operand as a mask: where a bit in the
mask is 1, the corresponding bit in the first source
operand is cleared
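The bit-clear operation is easy to check numerically. The sketch below models BIC on 32-bit values in Python; the function name bic and the constant MASK32 are illustrative.

```python
MASK32 = 0xFFFFFFFF

def bic(r1, r2):
    """Model BIC r0,r1,r2: clear in r1 every bit that is set in r2."""
    return r1 & ~r2 & MASK32

print(hex(bic(0xFF, 0x0F)))  # 0xf0: the low four bits are cleared
```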
42. ARM data instructions
LSL Logical shift left (zero fill)
LSR Logical shift right (zero fill)
ASL Arithmetic shift left
ASR Arithmetic shift right, copies the sign bit
ROR Rotate right
RRX Rotate right extended with C, performs a 33-bit rotate
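The difference between these shifts shows up most clearly on a value with the top bit set. The sketch below models them on 32-bit words in Python, assuming shift amounts between 0 and 31; the function names mirror the mnemonics but are otherwise illustrative.

```python
MASK32 = 0xFFFFFFFF

def lsl(x, n):
    """Logical shift left, zero fill."""
    return (x << n) & MASK32

def lsr(x, n):
    """Logical shift right, zero fill."""
    return (x & MASK32) >> n

def asr(x, n):
    """Arithmetic shift right: the sign bit is copied into the vacated bits."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32            # reinterpret as a negative number
    return (x >> n) & MASK32

def ror(x, n):
    """Rotate right within 32 bits."""
    n %= 32
    x &= MASK32
    return ((x >> n) | (x << (32 - n))) & MASK32

def rrx(x, carry):
    """33-bit rotate right by one through the carry flag.

    Returns (result, new_carry).
    """
    x &= MASK32
    return (carry << 31) | (x >> 1), x & 1

print(hex(lsr(0x80000000, 4)))  # 0x8000000
print(hex(asr(0x80000000, 4)))  # 0xf8000000
print(hex(ror(0x00000001, 1)))  # 0x80000000
```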
43. ARM comparison instructions
• only set the values of the NZCV bits
CMP Compare
CMN Negated compare,
uses an addition to set the status bits
TST Bit-wise test, a bit-wise AND
TEQ Bit-wise negated test, an exclusive-or
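To make the flag behaviour concrete, the sketch below computes the NZCV bits that CMP would produce for two 32-bit operands. It is a Python model under the usual ARM convention that C after a subtraction means "no borrow"; the function name cmp_flags is illustrative.

```python
MASK32 = 0xFFFFFFFF

def cmp_flags(a, b):
    """Return (N, Z, C, V) as set by CMP a, b, which computes a - b."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31                         # sign bit of the result
    z = int(result == 0)                     # result is zero
    c = int(a >= b)                          # carry = NOT borrow (unsigned)
    v = ((a ^ b) & (a ^ result)) >> 31       # signed overflow
    return n, z, c, v

print(cmp_flags(2, 3))           # (1, 0, 0, 0)
print(cmp_flags(5, 5))           # (0, 1, 1, 0)
print(cmp_flags(0x80000000, 1))  # (0, 0, 1, 1)
```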
45. ARM load-store instructions
LDR Load
STR Store
LDRH Load half-word
STRH Store half-word
LDRSH Load half-word signed
LDRB Load byte
STRB Store byte
ADR Set register to address
46. C Assignments in ARM Instructions
• x = (a + b) - c;
• using r0 for a, r1 for b, r2 for c, and r3 for x.
• using r4 as the register for indirect addressing
• load values of a, b, and c into registers
• store value of x back to memory
47. C Assignments in ARM Instructions
x = (a + b) - c;
ADR r4,a ; get address for a
LDR r0,[r4] ; get value of a
ADR r4,b ; get address for b, using r4
LDR r1,[r4] ; load value of b
ADD r3,r0,r1 ; set result for x to a + b
ADR r4,c ; get address for c
LDR r2,[r4] ; get value of c
SUB r3,r3,r2 ; complete computation of x
ADR r4,x ; get address for x
STR r3,[r4] ; store x at proper location
48. C Assignments in ARM Instructions
• y = a*(b + c);
• using r0 for both a and b, r1 for c, and r2 for y
• use r4 to store addresses for indirect
addressing
49. C Assignments in ARM Instructions
y = a*(b + c);
ADR r4,b ; get address for b
LDR r0,[r4] ; get value of b
ADR r4,c ; get address for c
LDR r1,[r4] ; get value of c
ADD r2,r0,r1 ; compute partial result of y=b+c
ADR r4,a ; get address for a
LDR r0,[r4] ; get value of a
MUL r2,r0,r2 ; compute final value of y=a*(b+c) (Rd differs from Rm, as cores before ARMv6 require)
ADR r4,y ; get address for y
STR r2,[r4] ; store value of y at proper location
50. C Assignments in ARM Instructions
• z = (a << 2) | (b & 15);
• using r0 for a, r1 for b and z,
• r4 for addresses
51. C Assignments in ARM Instructions
z = (a << 2) | (b & 15);
ADR r4,a ; get address for a
LDR r0,[r4] ; get value of a
MOV r0,r0,LSL #2 ; perform shift (a << 2)
ADR r4,b ; get address for b
LDR r1,[r4] ; get value of b
AND r1,r1,#15 ; perform logical AND (b & 15)
ORR r1,r0,r1 ; compute final value of z
ADR r4,z ; get address for z
STR r1,[r4] ; store value of z
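The three translations can be sanity-checked by following the register values step by step. The sketch below does this in Python with 32-bit wraparound, leaving out the loads and stores; the function names x_of, y_of and z_of and the sample values are illustrative.

```python
MASK32 = 0xFFFFFFFF

def x_of(a, b, c):
    """x = (a + b) - c"""
    r3 = (a + b) & MASK32       # ADD r3,r0,r1
    return (r3 - c) & MASK32    # SUB r3,r3,r2

def y_of(a, b, c):
    """y = a*(b + c)"""
    r2 = (b + c) & MASK32       # ADD r2,r0,r1
    return (r2 * a) & MASK32    # MUL keeps the low 32 bits of the product

def z_of(a, b):
    """z = (a << 2) | (b & 15)"""
    r0 = (a << 2) & MASK32      # MOV with LSL by 2
    r1 = b & 15                 # AND r1,r1,#15
    return r0 | r1              # ORR

print(x_of(5, 7, 2), y_of(3, 4, 5), z_of(3, 0x2A))  # 10 27 14
```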
Editor's Notes
N flag: SUB r0, r1, r2 where r1 < r2.
Z flag: SUB r0, r1, r2 where r1 = r2 (also used for results of logical operations).
C flag: ADD r0, r1, r2 where r1 + r2 > 0xFFFFFFFF.
V flag: ADD r0, r1, r2 where r1 + r2 > 0x7FFFFFFF (if the numbers are signed, the ALU sign bit will be corrupted; for example, 0x7FFFFFFF + 0x00000001 = 0x80000000, which is correct for unsigned values but wrong for signed ones).
Each instruction is one word (32 bits), so each stage in the pipeline holds one word, in other words 4 bytes; hence the offsets of 4 and 8 used here.
Most instructions execute in a single cycle, which helps to keep the pipeline operating efficiently; it stalls only if the executing instruction takes several cycles.
Thus in every cycle the processor can be loading one instruction, decoding another, and executing a third.
Typically the PC can be assumed to be the current instruction plus 8. Cases where this does not hold include:
When an exception is taken, the address stored in LR varies (see the Exception Handling module for more details).
When the PC is used in some data processing operations, its value is unpredictable.
The 4-bit condition field refers to the values of the appropriate bits in the CPSR, as indicated on the slide.
Most instructions are assembled with the default condition code "always" (1110, AL). This means the instruction will be executed irrespective of the flags in the CPSR.
The "never" code (1111, NV) is reserved. This family of conditions will be redefined for future use in other ARM devices. Use MOV r0, r0 as a NOP operation.
Conditional instructions aid code density.