by Jan van Lunteren
Traditional general-purpose processors seem finally to have reached their limits. The substantial performance improvements achieved year after year over the past decades are diminishing as processor technology approaches physical limits and has to deal with growing power consumption. The tough requirements of several emerging applications now call for rethinking fundamental aspects of the processor architecture that have existed for almost half a century.
At the IBM Zurich Research Laboratory, we have started a research project that examines whether there are opportunities for novel processor concepts which deviate from the traditional ‘Von Neumann’ processor architecture. The objective of this project is to realize a new type of programmable coprocessor that is optimized for applications that run into performance problems when being executed on a conventional processor. The initial focus is on tasks that operate on streams of data such as XML processing, pattern-matching (eg, for spam filtering), compression and encryption. One of the main focal points is to optimize the handling of conditional branches, which is an important factor affecting the performance of state-of-the-art, deeply-pipelined and superscalar processor architectures, as this directly influences the fill rate of the pipeline and execution units.
In a traditional ‘von Neumann’ processor architecture, instructions are identified and selected for execution based on their memory addresses using a program counter. Because the program counter is incremented by default after the execution of each instruction, the basic “instruction execution flow” is sequential in nature and is usually only changed by branch, procedure-call and return instructions. This is illustrated in Figure 1(a). Conditional branch instructions can change the instruction execution flow based on the evaluation of typically one simple condition (eg greater than, less than, or equal to). If multiple conditions or a more complex condition (eg “is a given character a legal name character in a given programming language”) need to be evaluated, then these have to be translated into a sequence of multiple instructions and conditional branches.
|
Figure 1(b) shows the new concept in its most general form: a processor architecture in which each instruction is associated with a set of multiple conditions. In each execution cycle, the conditions for all instructions in the current instruction group are evaluated, and the instruction for which all conditions match is selected for execution. If the conditions for multiple instructions match, a priority scheme is applied. Special “branch” instructions allow a jump to a different group of instructions. Because the execution of the instructions will typically affect the evaluation of the conditions in subsequent cycles, this will determine the actual instruction execution flow. This is illustrated in Figure 1(c), which shows an example of a specific embodiment of this concept in the form of an enhanced state-transition diagram. This diagram involves numerous potential execution paths, in which the actual path taken during execution of the program is determined by real-time evaluation of the various sets of conditions associated with the state transitions, and the instructions associated with the transitions chosen are selected for execution. (Note that the state-transition diagram shown in Figure 1(c) is small and simple; actual diagrams can be substantially larger and more complex.)
A novel programmable, state-machine technology, called B-FSM (BaRT-based Finite State Machine), has made practical implementations of the concept described feasible for clock frequencies into the gigahertz range. Based on the B-FSM technology, the novel processor concept has already been successfully applied to design prototypes of an XML accelerator coprocessor and a pattern-matching engine for intrusion detection. The latter engine will be capable of matching thousands of patterns simultaneously at processing rates of up to 10 Gb/s for FPGA (Field-Programmable Gate Array) and beyond 20 Gb/s for ASIC (Application-Specific Integrated Circuit) implementations, achieving a storage efficiency that is a factor 10 to 500 better than conventional algorithms. Ongoing research is directed at the design of a common instruction set that enables the coprocessor to be used for a range of applications. Other topics for further investigation are also the programming model and compiler.
Links:
http://www.research.ibm.com/XML/IBM_Zurich_XML_Accelerator_Engine_paper_2004May04.pdf
http://www.zurich.ibm.com/~jvl/
Please contact:
Jan van Lunteren, IBM Research GmbH, Zurich Research Laboratory, Switzerland
E-mail: jvlzurich.ibm.com