Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.
Chapter 6. Understanding Stack Machines
6.2 ARCHITECTURAL DIFFERENCES FROM CONVENTIONAL MACHINES
The obvious difference between stack machines and conventional machines is the use of 0-operand stack addressing instead of register or memory based addressing schemes. This difference, when combined with support for quick subroutine calls, makes stack machines superior to conventional machines in the areas of program size, processor complexity, system complexity, processor performance, and consistency of program execution.
6.2.1 Program size
A popular saying is that "memory is cheap." Anyone who has watched the historically rapid growth in memory chip sizes knows that the amount of memory available on a processor can be expected to increase dramatically with time.
The trouble is that even as memory chip capacity increases, the size of the problems that people are calling upon computers to solve is growing at an even faster rate. This means that the size of programs and their data sets is growing even faster than available memory size. Further aggravating the situation is the widespread use of high level languages for all phases of programming. This results in bulkier programs, but of course improves programmer productivity.
Not surprisingly, this explosion in program complexity leads to a seeming contradiction, the saying that "programs expand to fill all available memory, and then some." The amount of program memory available for an application is fixed by the economics of the actual cost of the memory chips and printed circuit board space. It is also affected by mechanical limits such as power, cooling, or the number of expansion slots in the system (limits which also figure into the economic picture). Even with an unlimited budget, electrical loading considerations and the speed-of-light wiring delay limit place an ultimate limit on the number of fast memory chips that may be used by a processor. Small program sizes reduce memory costs, component count, and power requirements, and can improve system speed by allowing the cost effective use of smaller, higher speed memory chips. Additional benefits include better performance in a virtual memory environment (Sweet & Sandman 1982, Moon 1985), and a requirement for less cache memory to achieve a given hit ratio. Some applications, notably embedded microprocessor applications, are very sensitive to the costs of printed circuit board space and memory chips, since these resources form a substantial proportion of all system costs (Ditzel et al. 1987b).
The traditional solution for a growing program size is to use a hierarchy of memory devices with a series of capacity/cost/access-time tradeoffs. A hierarchy might consist of (from cheapest/biggest/slowest to most expensive/smallest/fastest): magnetic tape, optical disk, hard disk, dynamic memory, off-chip cache memory, and on-chip instruction buffer memory. So a more correct version of the saying that "memory is cheap" might be that "slow memory is cheap, but fast memory is very expensive indeed."
The memory problem comes down to one of supplying a sufficient quantity of memory, fast enough to support the processor, at a cost that can be afforded. This is achieved by fitting as much of the program as possible into the fastest level of the memory hierarchy.
The usual way to manage the fastest level of the memory hierarchy is by using cache memories. Cache memories work on the principle that a small section of a program is likely to be used more than once within a short period of time. Thus, the first time a small group of instructions is referenced, it is copied from slow memory into the fast cache memory and saved for later use. This decreases the access delay on the second and subsequent accesses to program fragments. Since cache memory has a limited capacity, any instruction fetched into cache is eventually discarded when its slot must be used to hold a more recently fetched instruction. The problem with cache memory is that it must be big enough to hold enough program fragments long enough for the eventual reuse to occur.
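This principle is small enough to capture in a few lines of C. The sketch below models a direct-mapped cache with invented sizes and an invented access trace, chosen only for illustration: the early blocks are reused while they are still resident (hits), and the later conflicting blocks show how eviction eventually discards old entries (misses).

#include <stdio.h>

#define SLOTS 8

int main(void) {
    int tag[SLOTS], valid[SLOTS] = {0}, hits = 0, misses = 0;
    int trace[] = { 0, 1, 2, 0, 1, 2, 40, 41, 0, 1 };   /* block numbers touched */
    int n = sizeof trace / sizeof trace[0];

    for (int i = 0; i < n; i++) {
        int slot = trace[i] % SLOTS;                     /* direct-mapped placement */
        if (valid[slot] && tag[slot] == trace[i]) {
            hits++;                                      /* reuse within the window */
        } else {
            misses++;                                    /* fetch, evicting old block */
            valid[slot] = 1;
            tag[slot] = trace[i];
        }
    }
    printf("hits=%d misses=%d\n", hits, misses);
    return 0;
}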
A cache memory that is big enough to hold a certain number of instructions, called the "working set," can significantly improve system performance. How does the size of a program affect this performance increase? If we assume a given number of high level language operations in the working set, consider the effect of increasing the compactness of the encoding of instructions. Intuitively, if the sequence of instructions to accomplish a high level language statement is more compact on machine A than on machine B, then machine A needs a smaller number of bytes of cache to hold the instructions generated for the same source code as machine B. This means that machine A needs a smaller cache to achieve the same average memory response time performance.
By way of example, Davidson and Vaughan (1987) suggest that RISC computer programs can be up to 2.5 times bigger than CISC versions of the same programs (although other sources, especially RISC vendors, would place this number at perhaps 1.5 times bigger). They also suggest that RISC computers need a cache size that is twice as large as a CISC cache to achieve the same performance. Furthermore, a RISC machine with twice the cache of a CISC machine will still generate twice the number of cache misses (since a constant miss ratio generates twice as many misses for twice as many cache accesses), resulting in a need for higher speed main memory devices as well for equal performance. This is corroborated by the rule of thumb that a RISC processor in the 10 MIPS (Million RISC Instructions Per Second) performance range needs 128K bytes of cache memory for satisfactory performance, while high end CISC processors typically need no more than 64K bytes.
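To make the arithmetic concrete, here is a small C calculation of the effect just described. The fetch counts and the miss ratio are assumptions chosen only for illustration, not measurements; the point is simply that with a constant miss ratio, twice the instruction fetches means twice the misses.

#include <stdio.h>

int main(void) {
    double miss_ratio   = 0.05;    /* assumed equal for both machines */
    double cisc_fetches = 1.0e6;   /* instruction fetches, assumed    */
    double risc_fetches = 2.0e6;   /* twice as many, per the text     */

    printf("CISC misses: %.0f\n", cisc_fetches * miss_ratio);
    printf("RISC misses: %.0f\n", risc_fetches * miss_ratio);
    return 0;
}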
Stack machines have much smaller programs than either RISC or CISC machines. Stack machine programs can be 2.5 to 8 times smaller than CISC code (Harris 1980, Ohran 1984, Schoellkopf 1980), although there are some limitations to this observation, discussed later. This means that a RISC processor's cache memory may need to be bigger than a stack processor's entire program memory to achieve comparable memory response times! As anecdotal evidence of this effect, consider the following situation: while Unix/C programmers on RISC processors are unhappy with less than 8M to 16M bytes of memory, and want 128K bytes of cache, Forth programmers are still engaged in heated debate as to whether more than 64K bytes of program space is really needed on stack machines.
Small program size on stack machines not only decreases system costs by eliminating memory chips, but can actually improve system performance. This happens by increasing the chance that an instruction will be resident in high speed memory when needed, possibly by using the small program size as a justification for placing an entire program in fast memory.
How can it be that stack processors have such small memory requirements? There are two factors that account for the extremely small program sizes possible on stack machines. The more obvious factor, and the one usually cited in the literature, is that stack machines have small instruction formats. Conventional architectures must specify not only an operation on each instruction, but also operands and addressing modes. For example, a typical register-based machine instruction to add two numbers together might be:
Add R1,R2
This instruction must not only specify the Add opcode, but also the fact that the addition is being done on two registers, and that the registers are R1 and R2.
On the other hand, a stack-based instruction set need only specify an ADD opcode, since the operands have an implicit address of the current top of the stack. The only time an operand is present is when performing a load or store instruction, or pushing an immediate data value onto the stack. The WISC CPU/16 and Harris RTX 32P use 8 and 9 bit opcodes, respectively, yet have many more opcodes than are actually needed to run programs efficiently. Loosely encoded instructions found on the other processors discussed in this book, exemplified by the Novix NC4016, allow packing two or more operations into the same instruction with little sacrifice in code density over a byte-oriented machine.
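A minimal C sketch of a zero-operand instruction set is shown below. The opcodes and the program are invented for illustration, not taken from any of the processors above; the point is that only LIT carries an operand field, while ADD and MUL are bare opcodes whose operands are implicitly the top stack elements.

#include <stdio.h>

enum { LIT, ADD, MUL, HALT };   /* each opcode would fit in one byte */

int main(void) {
    /* (3 + 4) * (5 + 6): only the LIT instructions carry operands. */
    int prog[] = { LIT, 3, LIT, 4, ADD, LIT, 5, LIT, 6, ADD, MUL, HALT };
    int stack[16], sp = 0, pc = 0;

    for (;;) {
        switch (prog[pc++]) {
        case LIT:  stack[sp++] = prog[pc++];             break;
        case ADD:  sp--; stack[sp-1] += stack[sp];       break;
        case MUL:  sp--; stack[sp-1] *= stack[sp];       break;
        case HALT: printf("result = %d\n", stack[sp-1]); return 0;
        }
    }
}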
A less obvious, but actually more important, reason for stack machines having more compact code is that they efficiently support code with many frequently reused subroutines, often called threaded code (Bell 1973, Dewar 1975). While such code is possible on conventional machines, the execution speed penalty is severe. In fact, one of the most simple compiler optimizations for both RISC and CISC machines is to compile procedure calls as in-line macros. This, added to most programmers' experience that too many procedure calls on a conventional machine will destroy program performance, leads to significantly larger programs on conventional machines.
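The following C fragment sketches the threaded code idea under invented names (it models the technique, not any particular Forth system): the compiled "program" is nothing more than a list of addresses of routines to be run in turn, so every reuse of a subroutine costs only one cell of program memory.

#include <stdio.h>

typedef void (*word_t)(void);

static int stack[16], sp = 0;

static void push2(void) { stack[sp++] = 2; }
static void push3(void) { stack[sp++] = 3; }
static void add(void)   { sp--; stack[sp-1] += stack[sp]; }
static void print(void) { printf("%d\n", stack[--sp]); }

int main(void) {
    /* The compiled program: one address per operation, however large
       the routines themselves may be. */
    word_t thread[] = { push2, push3, add, print, NULL };
    for (word_t *ip = thread; *ip; ip++)
        (*ip)();
    return 0;
}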
On the other hand, stack oriented machines are built to support procedure calls efficiently. Since all working parameters are always present on a stack, procedure call overhead is minimal, requiring no memory cycles for parameter passing. On most stack processors, procedure calls take one clock cycle, and procedure returns take zero clock cycles in the frequent case where they are combined with other operations.
There are several qualifications associated with the claim that stack machines have more compact code than other machines, especially since we are not presenting the results of a comprehensive study here. Program size measures depend largely on the language being used, the compiler, and programming style, as well as the instruction set of the processor being used. Also, the studies by Harris, Ohran, and Schoellkopf were mostly for stack machines that used variable length instructions, while the machines described in this book use 16 or 32 bit fixed length instructions. Counterbalancing the fixed instruction length is the fact that processors running Forth can have smaller programs than other stack machines. The programs are smaller because they use frequent subroutine calls, allowing a high degree of code reuse within a single application program. And, as we shall see in a later section, the fixed instruction length for even 32-bit processors such as the RTX 32P does not cost as much program memory space as one might think.
6.2.2 Processor and system complexity
When speaking of the complexity of a computer, two levels are important: processor complexity and system complexity. Processor complexity is the amount of logic (measured in chip area, number of transistors, etc.) in the actual core of the processor that does the computations. System complexity considers the processor embedded in a fully functional system which contains support circuitry, the memory hierarchy, and software.
CISC computers have become substantially more complex over the years. This complexity arises from the need to be very good at all their many functions simultaneously. A large degree of their complexity stems from an attempt to tightly encode a wide variety of instructions using a large number of instruction formats. Added complexity comes from their support of multiple programming and data models. Any machine that is reasonably efficient at processing COBOL packed decimal data types on a time sliced basis while running double-precision floating point FORTRAN matrix operations and LISP expert systems is bound to be complex!
The complexity of CISC machines is partially the result of encoding instructions to keep programs relatively small. The goal is to reduce the semantic gap between high level languages and the machine to produce more efficient code. Unfortunately, this may lead to a situation where almost all available chip area is used for the control and data paths (for example, the Motorola 680x0 and Intel 80x86 products). Additionally, an argument made by RISC proponents is that CISC designs may be paying a performance penalty as well as a size penalty.
The extremes to which some CISC processors take the complexity of the core processor may seem excessive, but they are driven by a common and well founded goal: establishment of a consistent and simple interface between hardware and software. The success that this approach can have is demonstrated by the IBM System/370 line of computers. This computer family encompasses a vast range of price and performance, from personal computer plug-in cards to supercomputers, all with the same assembly language instruction set.
The clean and consistent interface between hardware and software at the assembly language level means that compilers need not be excessively complex to produce reasonable code, and that they may be reused among many different machines of the same family. Another advantage of CISC processors is that, since instructions are very compact, they do not require a large cache memory for adequate system performance. So, CISC machines have traded off increased processor complexity for reduced system complexity.
The concept behind RISC machines is to make the processor faster by reducing its complexity. To this end, RISC processors have fewer transistors in the actual processor control circuitry than CISC machines. This is accomplished by having simple instruction formats and instructions with low semantic content; they don't do much work, but don't take much time to do it. The instruction formats are usually chosen to correspond with the requirements for running a particular programming language and task, typically integer arithmetic in the C programming language.
This reduced processor complexity is not without a substantial cost. Most RISC processors have a large bank of registers to allow quick reuse of frequently accessed data. These register banks must be dual-ported memory (allowing two simultaneous accesses at different addresses) to allow fetching both source operands on every cycle. Furthermore, because of the low semantic content of their instructions, RISC processors need much higher memory bandwidth to keep instructions flowing into the CPU. This means that substantial on-chip and system-wide resources must be devoted to cache memory to attain acceptable performance. Also, RISC processors characteristically have an internal instruction pipeline. This means that extra hardware or compiler techniques must be provided to manage the pipeline. Special attention and extra hardware resources must be used to ensure that the pipeline state is correctly saved and restored when interrupts are received.
Finally, different RISC implementation strategies make significant demands on compilers, such as: scheduling pipeline usage to avoid hazards, filling branch delay slots, and managing allocation and spilling of the register banks. While the decreased complexity of the processor makes it easier to get trouble-free hardware, even more complexity shows up in the compiler. This is bound to make compilers complex as well as expensive to develop and debug.
The reduced complexity of RISC processors comes, then, with an offsetting (perhaps even more severe) increase in system complexity.
Stack machines strive to achieve a balance between processor complexity and system complexity. Stack machine designs realize processor simplicity not by restricting the number of instructions, but rather by limiting the data upon which instructions may operate: all operations are on the top stack elements. In this sense, stack machines are "reduced operand set computers" as opposed to "reduced instruction set computers."
Limiting the operand selection instead of how much work the instruction may do has several advantages. Instructions may be very compact, since they need specify only the actual operation, not where the sources are to be obtained. The on-chip stack memory can be single ported, since only a single element needs to be pushed or popped from the stack per clock cycle (assuming the top two stack elements are held in registers). More importantly, since all operands are known in advance to be the top stack elements, no pipelining is needed to fetch operands. The operands are always immediately available in the top-of-stack registers. As an example of this, consider the T and N registers in the NC4016 design, and contrast these with the dozens or hundreds of randomly accessible registers found on a RISC machine.
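This arrangement can be modeled in a few lines of C. The following is a rough sketch of the idea, not a description of the NC4016 itself: with the top two elements held in registers T and N, a binary operation needs no operand fetch at all, and the single-ported stack memory sees at most one access per operation.

#include <stdio.h>

static int T, N;              /* top-of-stack and next-on-stack registers */
static int mem[32], sp = 0;   /* single-ported on-chip stack memory       */

static void push(int x) { mem[sp++] = N; N = T; T = x; }  /* one memory write */
static void add(void)   { T = N + T; N = mem[--sp]; }     /* one memory read  */

int main(void) {
    push(3);
    push(4);
    add();                    /* operands were already in T and N */
    printf("top of stack = %d\n", T);
    return 0;
}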
Having implicit operand selection also simplifies instruction formats. Even RISC machines must have multiple instruction formats. Consider, though, that stack machines have few instruction formats, even to the extreme of having only one instruction format for the RTX 32P. Limiting the number of instruction formats simplifies instruction decoding logic, and speeds up system operation.
Stack machines are extraordinarily simple: 16-bit stack machines typically use only 20 to 35 thousand transistors for the processor core. In contrast, the Intel 80386 chip has 275 thousand transistors and the Motorola 68020 has 200 thousand transistors. Even taking into account that the 80386 and 68020 are 32-bit machines, the difference is significant.
Stack machine compilers are also simple, because instructions are very consistent in format and operand selection. In fact, most compilers for register machines go through a stack-like view of the source program for expression evaluation, then map that information onto a register set. Stack machine compilers have that much less work to do in mapping the stack-like version of the source code into assembly language. Forth compilers, in particular, are well known to be exceedingly simple and flexible.
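As a sketch of why this mapping is so easy, consider that for a stack machine a post-order walk of the expression tree already is the object code. The tree layout and the LIT mnemonic below are invented for illustration.

#include <stdio.h>

typedef struct node { char op; int val; struct node *l, *r; } node;

static void emit(node *n) {
    if (n->op == 0) { printf("LIT %d\n", n->val); return; }
    emit(n->l);               /* generate code for the operands first... */
    emit(n->r);
    printf("%c\n", n->op);    /* ...then one zero-operand instruction    */
}

int main(void) {
    /* (1 + 2) * (3 + 4) */
    node a = {0, 1}, b = {0, 2}, c = {0, 3}, d = {0, 4};
    node s1 = {'+', 0, &a, &b}, s2 = {'+', 0, &c, &d};
    node m  = {'*', 0, &s1, &s2};
    emit(&m);                 /* prints the complete stack program       */
    return 0;
}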
Stack computer systems are also simple as a whole. Because stack programs are so small, exotic cache control schemes are not required for good performance. Typically the entire program can fit into cache-speed memory chips without the complexity of cache control circuitry.
In those cases where the program and/or data is too large to fit in affordable memory, a software-managed memory hierarchy can be used: frequently used subroutines and program segments can be placed in high speed memory, while infrequently used program segments are placed in slow memory. Inexpensive single-cycle calls to the frequently used sections in the high speed memory make this technique very effective.
The Data Stack acts as a data cache for most purposes, such as in procedure parameter passing, and data elements can be moved in and out of high speed memory under software control as desired. While a traditional data cache, and to a lesser extent an instruction cache, might give some speed improvements, they are certainly not required, nor even desirable, for most small- to medium-sized applications.
Stack machines, therefore, achieve reduced processor complexity by limiting the operands available to the instruction. This does not force a reduction in the number of potential instructions available, nor does it cause an explosion in the amount of support hardware and software required to operate the processor. The result of this reduced complexity is that stack computers have more room left for program memory or other special purpose hardware on-chip. An interesting implication is that, since stack programs are so small, program memory for many applications can be entirely on-chip. This on-chip memory is faster than off-chip cache memory would be, eliminating the need for complex cache control circuitry while sacrificing none of the speed.
6.2.3 Processor performance
Processor performance is a very tricky area to talk about. Untold energy has been spent debating which processor is better than another, often based on sketchy evidence from questionable benchmarks, heated by the flames of self interest and product loyalty (or purchase rationalization).
Some of the reasons that comparisons are so hard stem from the question of application area. Benchmarks that measure performance at integer arithmetic are not adequate for floating point performance, business applications, or symbolic processing. About the best that one can hope for when using a benchmark is to claim that processor A is better than processor B when installed in the given hardware (with associated caches, memories, disks, clock speeds, etc.), using the given operating systems, using the given compilers, using the given source programming language, but only when running the benchmark that was measured. Clearly, measuring the performance of different machines is a difficult undertaking.
Measuring the performance of radically different architectures is even harder. At the core of this difficulty is quantifying how much work is done by a single instruction. Since the amount of work done by a polynomial evaluation instruction on a VAX is different from that of a register-to-register move on a RISC machine, the whole concept of "Instructions Per Second" is tenuous at best (even when normalized to a standardized instruction measure, using those same benchmarks that we don't really trust). Adding to the problem is that different processors are built using different technology (bipolar, ECL, SOS, NMOS, and CMOS, with varying feature sizes) and different levels of design sophistication (expensive full-custom layout, standard cell automated layout, and gate array layout). Yet, the very concept of comparing architectures requires subtracting out the effects of differences in implementation technologies. Furthermore, performance varies greatly with the characteristics of the software being executed. The problem is that in real life, the effectiveness of a particular computer is measured not only by processor speed, but also by the quality and performance of the system hardware, operating system, programming language, and compiler.
All these difficulties should lead the reader to the conclusion that the problem of finding exact performance measures is not going to be resolved here. Instead, we shall concentrate on a discussion of some reasons why stack machines can be made to go faster than other types of machines on an instruction-by-instruction basis, why stack machines have good system speed characteristics, and what kinds of programs stack machines are well suited to.
6.2.3.1 Instruction execution rate
Figure 6.1(a) -- Instruction phase overlapping -- raw instruction phases.
The most sophisticated RISC processors boast that they have the highest possible instruction execution rate -- one instruction per processor clock cycle. This is accomplished by pipelining instructions into some sequence of instruction address generation, instruction fetch, instruction decode, data fetch, instruction execute, and data store cycles, as shown in Figure 6.1a. This breakdown of instruction execution accelerates overall instruction flow, but introduces a number of problems. The most significant of these problems is management of data to avoid hazards caused by data dependencies. This problem comes about when one instruction depends upon the result of the previous instruction. This can create a problem, because the second instruction must wait for the first instruction to store its results before it can fetch its own operands. There are several hardware and software strategies to alleviate the impact of data dependencies, but none of them completely solves it.
Stack machines can execute programs as quickly as RISC machines, perhaps even faster, without the data dependency problem. It has been said that register machines are more efficient than stack machines because register machines can be pipelined for speed while stack machines cannot, since each instruction depends on the result left on the stack by the previous instruction. The whole point is, however, that stack machines do not need to be pipelined to get the same speed as RISC machines.
Consider how the RISC machine instruction pipeline can be modified when it is redesigned for a stack machine. Both machines need to fetch the instruction, and on both machines this can be done in parallel with processing previous instructions. For convenience, we shall lump this phase in with instruction decoding. RISC and some stack machines need to decode the instruction, although stack machines such as the RTX 32P do not need to perform conditional operations to extract parameter fields from the instruction or choose which format to use, and are therefore simpler than RISC machines.
In the next step of the pipeline, the major difference becomes apparent. RISC machines must spend a pipeline stage accessing operands for the instruction after (at least some of) the decoding is completed. A RISC instruction specifies two or more registers as inputs to the ALU for the operation. A stack machine does not need to fetch the data; the data will be waiting on top of the stack when needed. This means that, as a minimum, the stack machine can dispense with the operand fetch portion of the pipeline. Actually, the stack access can also be made faster than the register access, because a single-ported stack can be made smaller, and therefore faster, than a dual-ported register memory.
The instruction execute portions of both the RISC and stack machine are judged to be about the same, since the same sort of ALU can be used by both systems. But, even in this area some stack machines can gain an advantage over RISC machines by precomputing ALU functions based on the top-of-stack elements before the instruction is even decoded, as is done on the M17 stack machine.
The operand storage phase takes another pipeline stage in some RISC designs, since the result must be written back into the register file. These writes conflict with reads that need to take place for new instructions beginning execution, causing delays or the need for a triple-ported register file. This can require holding the ALU output in a register, then using that register in the next clock cycle as a source for the register file write operation. Conversely, the stack machine simply deposits the ALU output result in the top-of-stack register and is done. An additional problem is that extra data forwarding logic must be provided in a RISC machine to prevent waiting for the result to be written back into the register file when the ALU output is needed as an input for the next instruction. A stack machine always has the ALU output available as one of the implicit inputs to the ALU.
Figure 6.1(b) -- Instruction phase overlapping -- typical RISC machine.
Figure 6.1(c) -- Instruction phase overlapping -- typical stack machine.
Figure 6.1b shows that RISC machines need at least three pipeline stages, and perhaps four, to maintain the same throughput: instruction fetch, operand fetch, and instruction execute/operand store. Also, we have noted that there are several problems inherent in the RISC approach, such as data dependencies and resource contention, that are simply not present in the stack machine. Figure 6.1c shows that stack machines need only a two-stage pipeline: instruction fetch and instruction execute.
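The rough cycle model below illustrates the comparison, under two simplifying assumptions that are ours rather than the text's: every instruction depends on its predecessor, and each RISC data dependency costs one stall cycle (real designs hide some of this with forwarding logic).

#include <stdio.h>

int main(void) {
    int n = 10;                /* dependent instructions in a chain   */
    int risc_stages  = 4;      /* per Figure 6.1b                     */
    int stack_stages = 2;      /* per Figure 6.1c                     */
    int stall = 1;             /* assumed penalty per data dependency */

    int risc  = risc_stages  + (n - 1) + stall * (n - 1);
    int stack = stack_stages + (n - 1);   /* operands already on stack */

    printf("RISC : %d cycles\n", risc);
    printf("stack: %d cycles\n", stack);
    return 0;
}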
What this all means is that there is no reason that stack machines should be any slower than RISC machines in executing instructions, and there is a good chance that stack machines can be made faster and simpler using the same fabrication technology.
6.2.3.2 System performance
System performance is even more difficult to measure than raw processor performance. System performance includes not only how many instructions can be performed per second on straight-line code, but also speed in handling interrupts, context switches, and system performance degradation caused by factors such as conditional branches and procedure calls. Approaches such as the Three-Dimensional Computer Performance technique (Rabbat et al. 1988) are better measures of system performance than the raw instruction execution rate.
RISC and CISC machines are commonly constructed to execute straight-line code as the general case. Frequent procedure calls can seriously degrade the performance of these machines. The cost of a procedure call not only includes the cost of saving the program counter and fetching a different stream of instructions, but also the cost of saving and restoring registers, arranging parameters, and any pipeline breaking that may occur. The very existence of a structure called the Return Address Stack should imply how much importance stack machines place upon flow-of-control structures such as procedure calls. Since stack machines keep all working variables on a hardware stack, the setup time required for preparing parameters to pass to subroutines is very low, usually a single DUP or OVER instruction.
Conditional branches are a difficult thing for any processor to handle. The reason is that instruction prefetching schemes and pipelines depend upon uninterrupted program execution to keep busy, and conditional branches force a wait while the branch outcome is being resolved. The only other option is to forge ahead on one of the possible paths in the hope that there is nondestructive work to be done while waiting for the branch to take effect. RISC machines handle the conditional branch problem by using a "branch delay slot" (McFarling & Hennessy 1986) and placing a nondestructive instruction or no-op, which is always executed, after the branch.
Stack machines handle branches in different manners, all of which result in a single-cycle branch without the need for a delay slot and the compiler complexity that it entails. The NC4016 and RTX 2000 handle the problem by specifying memory faster than the processor cycle. This means that there is time in the processor cycle to generate an address based on a conditional branch and still have the next instruction fetched by the end of the clock cycle. This approach works well, but runs into trouble as processor speed increases beyond affordable program memory speed.
The FRISC 3 generates the condition for a branch with one instruction, then accomplishes the branch with the next instruction. This is actually a rather clever approach, since a comparison or other operation is needed before most branches on any machine. Instead of just doing the comparison operation (usually a subtraction), the FRISC 3 also specifies which condition code is of interest for the next branch. This moves much of the branching decision into the comparison instruction, and only requires the testing of a single bit when executing the succeeding conditional branch.
The RTX 32P uses its microcode to combine comparisons and branches into a two-instruction-cycle combination that takes the same time as a comparison instruction followed by a conditional branch. For example, the combination = 0BRANCH can be combined into a single 4-microcycle (2 instruction cycle) operation.
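The effect of such a fusion can be sketched in C, with invented scaffolding standing in for the microcode: because the comparison result feeds the branch decision directly, no flag value ever needs to be materialized on the stack.

#include <stdio.h>

static int stack[16], sp = 0;

static void push(int x) { stack[sp++] = x; }

/* Fused "= 0BRANCH": pop two values and choose the next instruction
   address directly.  0BRANCH takes the branch when "=" would have
   left a false flag, i.e. when the two values differ. */
static int eq_0branch(int fallthrough, int target) {
    int b = stack[--sp], a = stack[--sp];
    return (a == b) ? fallthrough : target;
}

int main(void) {
    /* BEGIN ... i 5 = 0BRANCH <begin>, sketched as a C loop. */
    int i = 0, pc = 0;
    while (pc == 0) {
        i++;
        push(i);
        push(5);
        pc = eq_0branch(1, 0);    /* falls through to pc=1 once i == 5 */
    }
    printf("loop exited at i = %d\n", i);
    return 0;
}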
Interrupt handling is much simpler on stack machines than on either RISC or CISC machines. On CISC machines, complex instructions that take many cycles may be so long that they need to be interruptible. This can force a great amount of processing overhead and control logic to save and restore the state of the machine in the middle of an instruction. RISC machines are not too much better off, since they have a pipeline that needs to be saved and restored for each interrupt. They also have registers that need to be saved and restored in order to give the interrupt service routine resources with which to work. It is common to spend several microseconds responding to an interrupt on a RISC or CISC machine.
Stack machines, on the other hand, can typically handle interrupts within a few clock cycles. Interrupts are treated as hardware invoked subroutine calls. There is no pipeline to flush or save, so the only thing a stack processor needs to do to process an interrupt is to insert the interrupt response address as a subroutine call into the instruction stream, and push the interrupt mask register onto the stack while masking interrupts (to prevent an infinite recursion of interrupt service calls). Once the interrupt service routine is entered, no registers need be saved, since the new routine can simply push its data onto the top of the stack. As an example of how fast interrupt servicing can be on a stack processor, the RTX 2000 spends only four clock cycles (400 ns) between the time an interrupt request is asserted and the time the first instruction of the interrupt service routine is being executed.
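The sequence just described can be sketched as follows. The data structures are invented stand-ins for the hardware (a model of the idea, not of the RTX 2000): the interrupt becomes an injected subroutine call, and the only state saved is the mask pushed onto the return stack.

#include <stdio.h>

static int rstack[16], rsp = 0;   /* return address stack                 */
static int mask = 0;              /* interrupt mask "register"            */

static int take_interrupt(int pc, int isr) {
    rstack[rsp++] = pc;           /* subroutine call: save return address */
    rstack[rsp++] = mask;         /* push the current mask...             */
    mask = 1;                     /* ...and mask further interrupts       */
    return isr;                   /* next fetch comes from the ISR        */
}

static int return_from_interrupt(void) {
    mask = rstack[--rsp];         /* restore the saved mask               */
    return rstack[--rsp];         /* resume the interrupted code          */
}

int main(void) {
    int pc = 100;                           /* running user code          */
    pc = take_interrupt(pc, 500);           /* interrupt request arrives  */
    printf("servicing at %d, mask=%d\n", pc, mask);
    pc = return_from_interrupt();
    printf("resumed at %d, mask=%d\n", pc, mask);
    return 0;
}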
Context switching is perceived as being slower for a stack machine than for other machines. However, as experimental results presented later will show, this is not the case.
A final advantage of stack machines is that their simplicity leaves room for algorithm specific hardware on customized microcontroller implementations. For example, the Harris RTX 2000 has an on-chip hardware multiplier. Other examples of application specific hardware for semicustom components might be an FFT address generator, A/D or D/A converters, or communication ports. Features such as these can significantly reduce the parts count in a finished system and dramatically decrease program execution time.
6.2.3.3 Which programs are most suitable?
The types of programs which stack machines process very efficiently include: subroutine intensive programs, programs with a large number of control flow structures, programs that perform symbolic computation (which often involves intensive use of stack structures and recursion), programs that are designed to handle frequent interrupts, and programs designed for limited memory space.
6.2.4 Program execution consistency
Advanced RISC and CISC machines rely on many special techniques that give them statistically higher performance over long time periods without guaranteeing high performance during short time periods. System design techniques that have these characteristics include: instruction prefetch queues, complex pipelines, scoreboarding, cache memories, branch target buffers, and branch prediction buffers. The problem is that these techniques cannot guarantee increased instantaneous performance at any particular time. An unfortunate sequence of external events or internal data values may cause bursts of cache misses, queue flushes, and other delays. While high average performance is acceptable for some programs, predictably high instantaneous performance is important for many real time applications.
Stack machines use none of these statistical speedup techniques to achieve good system performance. As a result of the simplicity of stack machine program execution, stack machines have very consistent performance at every time scale. As we shall see in Chapter 8, this has a significant impact on real time control applications programming.
Phil Koopman -- koopman@cmu.edu