The Design Flow Process
Computing devices are increasingly present in all areas of human activity, and as an immediate consequence they have to
support a wide array of functional behaviour. Designers are furthermore confronted with the need to make their devices not
only smaller but also more powerful as requirements increase. During the design process, the designer has to identify which
application, or which part of an existing application, can be hardwired in order to obtain maximal performance. There is no
fundamental trade-off between hardware and software, only a choice of which functions are implemented at which level.
Objective 1: Code Profiling and Cost Modeling
Let's assume a designer is planning to develop an architecture with dedicated hardware to perform a number of time-consuming tasks. The first challenge for the designer is to determine which functions could be hardwired (see Figure 1). There are evidently several functions that could be selected, and each of them should be considered. The goal of this stage is to assess to what extent hardwired functions would increase the overall performance and what it would cost to build them. One way to support this process is to offer a profiler that can collect and analyse execution traces of the program. The profiler will use this information in combination with human directives to propose a number of candidate code segments. A cost model will be used to assess how these code segments translate into hardware designs, taking into account configuration delays, area usage, power consumption, etc. Such a cost model makes it possible to filter out those functions that are unlikely to deliver the anticipated improvement. The cost model will be multidimensional, meaning that the human designer can specify which constraints (memory, real time, power, ...) need to be taken into account when evaluating the different candidate functions.
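The filtering step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Candidate` fields, the Amdahl-style speedup estimate, and the example figures are hypothetical and do not describe the actual profiler or cost model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one candidate code segment."""
    name: str
    time_share: float          # fraction of total run time (from the profiler)
    hw_speedup: float          # estimated speedup of the segment in hardware
    area_slices: int           # estimated area usage
    power_mw: float            # estimated power consumption
    config_delay_share: float  # configuration delay as a fraction of run time


def overall_speedup(c: Candidate) -> float:
    """Amdahl-style estimate: untouched code + accelerated segment + config delay."""
    new_time = (1.0 - c.time_share) + c.time_share / c.hw_speedup + c.config_delay_share
    return 1.0 / new_time


def filter_candidates(cands, max_area, max_power, min_speedup):
    """Keep candidates that satisfy every designer constraint, best first."""
    kept = [
        c for c in cands
        if c.area_slices <= max_area
        and c.power_mw <= max_power
        and overall_speedup(c) >= min_speedup
    ]
    return sorted(kept, key=overall_speedup, reverse=True)


# Made-up figures, purely for illustration.
candidates = [
    Candidate("dct", 0.60, 10.0, 3000, 200.0, 0.02),
    Candidate("huffman", 0.05, 20.0, 500, 50.0, 0.03),
    Candidate("fft", 0.30, 5.0, 9000, 400.0, 0.01),
]
selected = filter_candidates(candidates, max_area=8000, max_power=500.0, min_speedup=1.1)
print([c.name for c in selected])
```

In this sketch the constraints act as independent dimensions: a candidate is dropped as soon as it violates one of them, which mirrors the idea that the designer chooses which constraints matter for a given evaluation.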
FIGURE 1: Identifying Candidate Functions
Objective 2: Code Transformations and Optimizations
The candidate segments undergo a series of optimizations and transformations that aim at converting the sequential algorithm, targeted at execution on a general-purpose processor (GPP), into a parallelized form suitable for hardware implementation. These optimizations and transformations are divided into two groups: graph restructuring and loop parallelization.
- Graph restructuring: Starting from the dataflow graph of the application, different clusters of operations are identified that will be implemented as new instructions. The clusters identified within the graph are usually Multiple Input Multiple Output (MIMO) subgraphs with specific properties that are used as guidelines during graph exploration. One of these properties, convexity, guarantees that the new instructions can be scheduled feasibly while respecting the dependencies.
- Loop parallelization: After restructuring the candidate segments and forming the new instructions, loops containing those instructions are revisited in order to exploit parallelization opportunities. For this purpose, (standard) loop transformations have to be applied, taking into account constraints such as available area and latency. Examples of potential loop transformations are loop unrolling, software pipelining, loop tiling, etc.
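To make the convexity property concrete, here is a minimal sketch of a convexity test on a dataflow graph. It uses the usual definition (a cluster is convex if no path between two of its nodes passes through a node outside the cluster); the graph representation and function names are assumptions chosen for illustration, not the workbench's actual data structures.

```python
def reachable(sources, adjacency):
    """Return every node reachable from `sources` by following `adjacency`."""
    seen = set()
    stack = list(sources)
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_convex(cluster, successors):
    """True if no outside node lies on a path between two cluster nodes."""
    # Build the reversed graph so ancestors can be found the same way.
    predecessors = {}
    for src, dsts in successors.items():
        for dst in dsts:
            predecessors.setdefault(dst, []).append(src)
    below = reachable(cluster, successors)    # descendants of the cluster
    above = reachable(cluster, predecessors)  # ancestors of the cluster
    # An outside node that is both a descendant and an ancestor breaks convexity.
    return not ((below & above) - set(cluster))


# Diamond-shaped dataflow graph: a feeds b and c, which both feed d.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(is_convex({"a", "b", "c"}, graph))  # no outside node between cluster nodes
print(is_convex({"a", "b", "d"}, graph))  # c lies on the path a -> c -> d
```

A non-convex cluster such as {a, b, d} could not be issued as one instruction: it would need the result of c before it finishes, while c in turn needs the result of a from inside the cluster.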
Objective 3: The Retargetable Compiler
Once the choice has been made and a function f(.) is identified, the code containing the f(.) logic needs to be removed from the original source code and replaced by an appropriate FPGA call (see Figure 3). The instruction set needs to be extended with the appropriate instructions for setting up the FPGA and starting its computation. For this purpose an interface needs to be developed between the software program and the FPGA implementation of f(.). This boils down to modifying the compiler so that it includes the appropriate function calls that set up the FPGA, transfer the required parameters, and receive the result(s) of the execution. The retargeted compiler can then generate the appropriate machine code of the original program containing the f(.)-function call. The compiler will evidently also have to take into account all the scheduling issues in case the FPGA is integrated in a pipelined or superscalar processing architecture. Additionally, dedicated compiler optimizations and transformations address the specific features of FPGAs, such as reconfiguration latency and partial and run-time reconfiguration.
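As a rough illustration of the call replacement, the sketch below lowers one call to f(.) into the four phases named above: set up the FPGA, transfer parameters, execute, and fetch the result. The mnemonics `FPGA_SET`, `XFER_IN`, `FPGA_EXEC`, `XFER_OUT` and the transfer registers `x0`, `x1`, ... are hypothetical placeholders, not the instruction set extension of any real architecture.

```python
def lower_fpga_call(kernel, args, result):
    """Return a pseudo-instruction sequence for one hardware call.

    All mnemonics and register names are hypothetical placeholders.
    """
    seq = [f"FPGA_SET {kernel}"]                # configure the FPGA for this kernel
    for index, arg in enumerate(args):
        seq.append(f"XFER_IN x{index}, {arg}")  # copy each parameter to the FPGA side
    seq.append(f"FPGA_EXEC {kernel}")           # start the hardware computation
    seq.append(f"XFER_OUT {result}, x0")        # copy the result back
    return seq


# Lower the call r5 = f(r3, r4) and print the resulting sequence.
for line in lower_fpga_call("f", ["r3", "r4"], "r5"):
    print(line)
```

Keeping the set-up as a separate step from execution is what later allows the compiler to schedule it early and hide part of the reconfiguration latency.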
FIGURE 3: Generating Executable Code
Objective 4: VHDL Generation
The selected functions for hardware implementation have to be described in a Hardware Description Language (HDL) such as Verilog or VHDL. As shown in Figure 4, we distinguish between three possible paths: manual, IP library, or automated code generation.
- For critical code segments that require very high quality of the implemented hardware, manual VHDL or Verilog code should be written.
- Whenever an external IP library is available for the selected code segments, the hardware models can be directly instantiated from that library.
- The third possibility, automated code generation, is envisioned for fast prototyping and fast performance estimation during design space exploration. Even though the current state of the art in automated HDL generation is not yet comparable to manually crafted HDL, certain optimisations can be applied in order to obtain a high-quality hardware model in a fraction of the time needed for a manual implementation.
FIGURE 4: VHDL Generation Process
Objective 5: Integration and Validation
The generated machine code can run on an existing FPGA such as the Virtex-II Pro, which has up to 4 embedded PowerPC processors on chip. The Virtex-II Pro board eliminates the need for a simulator and enables us to test and compare immediately. This evaluation provides detailed statistical information on the overall performance of the proposed architecture. It has to take into account all factors that determine performance, such as the set-up time of the FPGA, and can for instance yield recommendations on where to place the set-up functions.
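The influence of set-up placement can be shown with a small back-of-the-envelope sketch. The timing model and all figures are assumptions for illustration only; it compares performing the FPGA set-up before every call with performing it once, which presumes the configuration stays loaded between calls.

```python
def accelerated_time(sw_total, calls, sw_kernel, hw_exec, setup, setup_per_call):
    """Total run time once the kernel executes on the FPGA (all times in ms)."""
    sw_rest = sw_total - calls * sw_kernel     # code that stays in software
    setups = calls if setup_per_call else 1    # set-up before each call, or once
    return sw_rest + calls * hw_exec + setups * setup


def speedup(sw_total, calls, sw_kernel, hw_exec, setup, setup_per_call):
    """Overall speedup relative to the pure software run."""
    return sw_total / accelerated_time(
        sw_total, calls, sw_kernel, hw_exec, setup, setup_per_call
    )


# Made-up figures: 1000 ms program, 100 calls of 8 ms each in software,
# 1 ms per call in hardware, 5 ms per FPGA set-up.
per_call = speedup(1000.0, 100, 8.0, 1.0, 5.0, setup_per_call=True)
hoisted = speedup(1000.0, 100, 8.0, 1.0, 5.0, setup_per_call=False)
print(f"set-up before every call: {per_call:.2f}x")
print(f"set-up performed once:    {hoisted:.2f}x")
```

With these illustrative numbers the set-up time dominates when it is repeated for every call, which is the kind of effect the measured evaluation is meant to expose.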
This process is repeated for each of the potential f(.)-functions after which a justified choice can be made. When multiple changes are introduced, studying the impact of individual components on the overall performance becomes very challenging given the combinatorial and interdependent nature of the problem.
This assumes a high degree of integration between the different components discussed above. The integrated workflow that the workbench should cover is depicted in Figure 5 and runs from code profiling up to execution on the Virtex-II Pro board using the Molen architecture.
FIGURE 5: The overall workflow of the reconfigurable computing workbench