This project focuses on the investigation
of architectures for embedded processors. More specifically,
we investigate the possibilities of extending embedded processors, which
increasingly resemble general-purpose processors, in order to improve
their performance in a specific application area. Examples of application
areas are multimedia processing (graphics, video, sound, text) and network
processing. In this project, a wide range of Master thesis subjects can
be found, including (but not limited to) the following:
1. Porting the Linux Operating System
to an XUP Virtex-II Pro Board
Context: Embedded systems are becoming more complex over
time in order to provide the needed functionalities and achieve the required
performance levels. In addition, adaptation to changing environments (e.g.,
moving a device from one location to another or targeting different markets
with the same device) is becoming increasingly important. With the advent of reconfigurable
platforms, it is possible to compose a hardware system that performs the necessary
functionalities, ideally with the required performance, and, more importantly, that can change its
functionalities over time.
The MOLEN project focuses on the design of embedded systems by combining a general-purpose
processor with reconfigurable hardware. The purpose of the chosen platform is to gain
increased flexibility through the execution of non-time-critical operations on the
general-purpose processor and to achieve increased performance by implementing time-critical
operations on the reconfigurable hardware.
Managing the different hardware and functionalities of the MOLEN platform in a more automatic
way, and extending it beyond a single application, requires an operating system (OS).
Problem statement: The MOLEN architecture can be implemented on different Xilinx boards: the ML310 board
and the Xilinx University Program (XUP) board. Xilinx provides an embedded OS for the ML310
board but not for the XUP board. The student should investigate the possibilities of porting
the Linux Operating System to the XUP board with the MOLEN architecture inside.
Expected effort: The student is expected to perform some literature research on the details of the
MOLEN architecture, embedded OSs, and the XUP board (specifically the differences between this
board and the ML310). The student should then be able to understand the different steps needed
to port the OS to the MOLEN platform and the required changes to both the Linux kernel and the
hardware base system design. The student should already have knowledge of the Linux OS,
particularly of compiling kernels and kernel programming.
Context: The MOLEN project focuses on the design
of embedded processors by combining a general-purpose processor with reconfigurable
hardware, e.g., field-programmable gate arrays (FPGAs). The purpose of
the chosen platform is to gain increased flexibility through the execution
of non-time-critical operations on the general-purpose processor and to
achieve increased performance by implementing time-critical operations
on the reconfigurable hardware. Without going into much detail, the approach
entails a one-time architectural extension that is able to support any
application-specific data processing to be implemented on the reconfigurable
hardware. This is achieved by utilizing reconfiguration microcode
and execution microcode.
Problem statement: Investigate, in particular for partially
reconfigurable hardware, the possibilities of converting a reconfiguration
file into reconfiguration microcode and develop a tool to automatically
perform this operation.
Expected effort: The student is expected to perform some
literature research on both topics of microcode and synthesis for FPGAs.
The student should then be able to understand how reconfiguration files
are generated for partially reconfigurable hardware structures. Utilizing
this knowledge, a tool must be developed (using ANSI C programming) that
automatically generates reconfiguration microcode. The reconfiguration
microcode structure is not fixed, thus leaving the student the possibility
to define it.
Context: Recent multimedia algorithms require huge
amounts of data to be transferred and processed in real-time. Examples
are the motion estimation algorithms in MPEG, which require whole blocks
of visual data to be accessed randomly and processed as fast as possible.
While various designs of processing units, capable of handling arrays of
visual data, have been proposed, feeding these units with data is still
a substantial performance bottleneck. Traditionally, visual data is stored
in a scan-line manner in linearly addressable memories. However, in MPEG
this data is processed in blocks, which are not aligned with the
scan-line scheme. Hence, data reordering is required, which is a time-consuming
(i.e., performance-restricting) process in conventional memory organizations.
To meet the MPEG requirements for high data throughput, new data memory
organizations are required.
Problem statement: Propose a hardware mechanism for solving
(avoiding) the data reordering problem when accessing 2D visual data. Subsequently,
design the memory organization for the proposed mechanism and implement
it in a cost-effective manner.
Expected effort: The student should design a memory organization
capable of accessing sub-matrices (blocks) of visual data out of a 2D addressable
memory buffer. The thesis should include an investigation of different
memory organizations to solve the problem, a discussion of the trade-offs
involved, and the criteria for the final solution. A cost-effective, scalable
design solution should be proposed, and the choice should be well argued. More
specifically, the student is expected to:
1. Make a thorough exploration of the problem and related literature and refine
the initial design requirements.
2. Propose (refer to) different design solutions.
3. Establish criteria for design evaluation and propose a cost-effective design solution.
4. Implement the design in HDL and validate it by simulations (or on an FPGA prototype
chip).
Preliminary simulations indicate access times for some candidate
memory organizations implemented on FPGA in the range of 10-20 ns. The
final output of the thesis should be a complete, scalable design of a memory
organization implemented on a platform FPGA (Xilinx Virtex-II).
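The reordering problem described above can be illustrated with a small sketch. It first shows that a block stored in scan-line order is scattered over several non-contiguous address runs, and then checks one possible module assignment function under which every pixel of a block falls into a different memory module. The function names, the 720-pixel line width, and the assignment function itself are assumptions chosen for illustration; they are not the organization this thesis topic is expected to produce.

```python
def scanline_addresses(row, col, size, width):
    """Linear addresses of a size x size block whose top-left pixel is
    (row, col), in a frame stored line by line with `width` pixels per line."""
    return [(row + i) * width + (col + j)
            for i in range(size) for j in range(size)]


def contiguous_runs(addresses):
    """Number of separate contiguous address runs that must be fetched."""
    ordered = sorted(addresses)
    return 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if b != a + 1)


def module_of(i, j, rows, cols):
    """Hypothetical module assignment: pixel (i, j) is stored in one of
    rows * cols independent memory modules."""
    return (i % rows) * cols + (j % cols)


def block_is_conflict_free(row, col, rows, cols):
    """True if all pixels of the rows x cols block at (row, col) map to
    distinct modules, so the block could be read in a single parallel access."""
    modules = {module_of(row + i, col + j, rows, cols)
               for i in range(rows) for j in range(cols)}
    return len(modules) == rows * cols


if __name__ == "__main__":
    block = scanline_addresses(row=3, col=5, size=8, width=720)
    print("address runs for one 8x8 block:", contiguous_runs(block))
    print("conflict-free at (3, 5):", block_is_conflict_free(3, 5, 8, 8))
```

With this assignment, any block of the chosen dimensions is conflict-free regardless of its position, at the cost of needing as many modules as there are pixels in a block. Exploring cheaper trade-offs is part of the design space the thesis is meant to investigate.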
Context: The project involves the review of a number of standards
to establish the various functions required for lossless compression of digital data
(examples of these standards are gzip and zip). These functions will be categorized according
to their hardware requirements. After this, an architecture for these functions should be
proposed and developed.
Finally, an implementation of the architecture should be proposed.
5. Design of functional
units for multimedia and telecommunication applications
Context: Multimedia and telecommunication applications have particular computational needs
to be satisfied under real-time constraints and with low power consumption. To perform such
tasks, applications may use a large variety of hardware resources spanning from microprocessors
and digital signal processors to specialized functional units. For example, to improve performance
and fulfill the multimedia application requirements, recent general purpose microprocessors for
workstations and personal computers use special built-in hardware.
In this MS project we intend to select some computations that are specific to telecommunication
and multimedia applications, e.g., DCT, Huffman encoding, motion estimation, etc., and design
functional units to perform such tasks. High performance as well as low power consumption are the
main envisaged design constraints. We plan to follow the entire design trajectory from algorithm
to layout. The project will involve, apart from the chip design (from VHDL to layout), research effort
on improved algorithms and organizations for the selected tasks.
Context: The MOLEN project focuses on the
design of embedded processors by combining a general-purpose processor
with reconfigurable hardware, e.g., field-programmable gate arrays
(FPGAs). The purpose of the chosen platform is to gain increased
flexibility through the execution of non-time-critical operations on
the general-purpose processor and to achieve increased performance by
implementing time-critical operations on the reconfigurable hardware.
Without going into much detail, a co-processor-like execution engine
needs to be built next to the general-purpose processor to handle
time-critical operations.
Problem statement: Investigate and specify a
co-processor that runs next to the MOLEN processor in handling
particular time-critical operations (specified during the project).
Expected effort: The student is expected to perform
some literature research on both topics of microcode and synthesis for
FPGAs. The student should also familiarize himself/herself with the synthesis
toolchain in order to implement the co-processor. Existing knowledge
of computer architectures, VHDL, and programming is advantageous to
this project.
Context: The MOLEN project focuses on the
design of embedded processors by combining a general-purpose processor
with reconfigurable hardware, e.g., field-programmable gate arrays
(FPGAs). The purpose of the chosen platform is to gain increased
flexibility through the execution of non-time-critical operations on
the general-purpose processor and to achieve increased performance by
implementing time-critical operations on the reconfigurable hardware.
Without going into too much detail, the reconfigurable hardware
implementation can consist of custom-built hardware units or a simple
co-processor extending the capabilities of the general-purpose
processor.
Problem statement: Investigate and implement a hybrid
simulator that is able to combine the simulation of code running on
the general-purpose processor with (micro)'code' running on the
co-processor.
Expected effort: The student is expected to perform
some literature research on topics of code simulation and investigate
the applicability of existing tools (compilers, simulators, etc.).
Subsequently, a toolchain needs to be built to support the envisioned
hybrid simulator. Existing knowledge of computer architectures and
C programming is advantageous to this project.
Context:
Multimedia applications require substantial computational resources. To
accelerate them, recent general-purpose microprocessors utilize
special hardware. This type of hardware can extend a processor's
ISA or can exploit specific DSP accelerators. The Multifunctional
Architecture for Video and Image Processing (MAVIP) is a complete design
that can be used to accelerate multimedia applications. It is
oriented towards architectures consisting of a master processor tightly
coupled with hardware accelerators, making it particularly suitable for
the MOLEN processor.
Problem statement:
Create a co-processor that couples the MAVIP architecture with the MOLEN
processor.
Expected effort:
The student is expected to perform some literature research on MAVIP,
MOLEN, microcode, and synthesis for FPGAs. The student will modify
MOLEN's microcode/microarchitecture in order to be able to support the
co-processor and also familiarize himself/herself with the synthesis
toolchain in order to implement it. Background knowledge of computer
architectures, VHDL, and programming is required to realize this project
successfully.
Doing a Master thesis on a subject
that is not mentioned above is possible as long as it still fits inside
the MOLEN project.
The field of embedded processor design
is large in the sense that many (new) applications can be targeted. This
results in having different embedded processors for different application
areas. We are open to suggestions and the possibility exists that a Master
thesis is done in a totally different application area than the two mentioned
above.
Interested?
The
theme
1. A fault injector
Context: Smaller feature size, greater chip density, and minimal power
consumption all lead to an increased number of faults in computing systems. The Computer Engineering
laboratory is investigating architectural techniques to tolerate such faults. For this research, a software
tool that injects faults (a fault injector) into a processor simulator (e.g. the sim-outorder simulator of
the SimpleScalar simulation toolset) is needed.
Project: The goal of this project is to extend the simulator of a
processor with fault injection capability. First, existing fault injection theory should be investigated.
Thereafter, some appropriate fault models should be selected and implemented. The simulator
should be extended to accept command-line arguments specifying the desired fault injection
properties, such as which resources can fail (e.g., memory bus, functional units, register files,
etc.), the fault frequency, the type of faults (transient or permanent), and others.
It should also be possible to collect fault statistics after a simulation.
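The requirements above (command-line selection of fault properties, transient versus permanent faults, and statistics collection) can be sketched as follows. This is a minimal illustration under stated assumptions: the option names, the `FaultInjector` class, and the single-bit fault model are hypothetical and are not part of SimpleScalar or any existing tool. A real injector would hook such a `read` filter into the simulator's register file, bus, or functional unit models.

```python
import argparse
import random


def parse_args(argv):
    """Hypothetical command-line options describing the faults to inject."""
    parser = argparse.ArgumentParser(description="fault injection options")
    parser.add_argument("--resource", default="regfile",
                        choices=["regfile", "membus", "fu"])
    parser.add_argument("--fault-type", default="transient",
                        choices=["transient", "permanent"])
    parser.add_argument("--fault-rate", type=float, default=0.0)
    parser.add_argument("--seed", type=int, default=0)
    return parser.parse_args(argv)


def force_bit(value, bit, stuck):
    """Force one bit of value to the stuck level (0 or 1)."""
    return value | (1 << bit) if stuck else value & ~(1 << bit)


class FaultInjector:
    """Corrupts values read from a simulated resource (single-bit faults only)."""

    def __init__(self, fault_type, fault_rate, seed=0, width=32):
        self.fault_type = fault_type
        self.fault_rate = fault_rate
        self.width = width
        self.rng = random.Random(seed)   # seeded so that runs are repeatable
        self.stuck_bits = {}             # location -> (bit, stuck level)
        self.injected = 0                # statistic: number of injected faults

    def read(self, location, value):
        """Return the value as the simulated processor would observe it."""
        if location in self.stuck_bits:
            bit, stuck = self.stuck_bits[location]
            return force_bit(value, bit, stuck)
        if self.rng.random() >= self.fault_rate:
            return value                 # no fault on this access
        bit = self.rng.randrange(self.width)
        self.injected += 1
        if self.fault_type == "permanent":
            stuck = self.rng.randrange(2)
            self.stuck_bits[location] = (bit, stuck)
            return force_bit(value, bit, stuck)
        return value ^ (1 << bit)        # transient fault: one-off bit flip


if __name__ == "__main__":
    args = parse_args(["--fault-type", "transient", "--fault-rate", "0.1"])
    injector = FaultInjector(args.fault_type, args.fault_rate, args.seed)
    observed = [injector.read("r%d" % (i % 8), 0) for i in range(1000)]
    print("faults injected:", injector.injected)
```

Seeding the random number generator keeps simulations reproducible, which matters when fault statistics from different runs have to be compared.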
Context: The TOP500 Supercomputer Sites webpage
( http://www.top500.org/ )
presents the world's best high-performance computers. The LINPACK Benchmark
is used as a yardstick for performance. Companies like IBM, HP, NEC and
Intel (ASCI Red) are presented there with their top supercomputers. The
interesting question is: Can a university student team with inexpensive
off-the-shelf hardware components beat the industry supercomputers
on LINPACK? The DelftLinpack-1 processor uses the power of reconfigurable
hardware in an attempt to answer this question. A state-of-the-art Xilinx
FPGA (XC2VP50) incorporating reconfigurable logic and four PowerPC
general-purpose cores will be used to implement the DelftLinpack-1 machine.
Problem statement: 1. Analysis of the LINPACK benchmark and its synthesis on the provided
technology (Xilinx).
2. Organization design and performance estimation of Delft Linpack 1.
3. Memory and cache design for Delft Linpack 1.
4. Design of the Delft Linpack 1 array of floating-point multipliers.
Expected effort: We need 4 students to do their master
thesis on this project.
Student 1. The student will analyze the
LINPACK benchmark structure and dataflow. The most time-consuming
pieces (so-called hot spots) are to be highlighted. A VHDL description
and synthesis of those hot spots for the Xilinx technology is the final
result of the project.
Student 2. Based on the 216 available
dedicated 18-bit x 18-bit multiplier blocks, the student should produce
an organization for Delft Linpack 1 and produce emulated results to validate
the expected LINPACK performance in MFLOPS. The student will be using
state-of-the-art Xilinx development tools.
Student 3. The student will investigate
the memory and cache organization of Delft Linpack 1 needed to guarantee
minimal fetch and store delays for LINPACK. The results will be validated
on a real hardware board.
Student 4. To outperform the TOP500 machines,
Delft Linpack 1 uses highly parallel hardware. An array of multipliers
(approx. 128 stages) is the heart of the calculation engine. The goal of
the project is to design such an array to produce one result each machine
cycle and to deal optimally with the floating-point format of the data used
in the LINPACK benchmark.
3. Realistic Architectural Comparison
for modern processors
Context: Benchmarks, such as SPEC, i-bench and
many others, are used to compare processor performance. However, they
all focus on different application-specific types of computing.
Benchmarks that compare architectural features such as pipeline stages, branch
prediction and caching do not exist. The Delft Benchmark project aims
to investigate the existing benchmarks, analyze their operation, and
produce a set of new tools focused on architectural analysis.
Problem statement: 1. Investigation and Analysis of existing
benchmarks.
2. Investigation and design of benchmarks
for different Architecture facilities
Expected effort: 1. The student is expected to investigate
the existing benchmarks on different processors and analyze the produced
results with the architectural differences in mind. The result of this
project is an exhaustive study and analysis of the benchmark results.
2. The student will evaluate different
architectural aspects and investigate how independently different aspects
can be tested, e.g., how to test branching independently of the cache organization.
Several architectural benchmarks for different architectural aspects,
e.g., pipeline, caching, branching, etc., will be produced as a result of this
project.
4. Analysis and Mapping of HMMER onto the Cell BE Processor
Context: Heterogeneous multi-core architectures aim at improving application performance and energy efficiency by means of parallelization and some degree of specialization. Bioinformatics applications, known for having a lot of parallelism available, can benefit from this type of architecture by exploiting both data-level parallelism (DLP) and thread-level parallelism (TLP). An example of such a heterogeneous multi-core architecture is the Cell BE processor, which contains a PowerPC core connected to 8 pure SIMD cores with local scratch pad memories. It is therefore very interesting to investigate how much benefit bioinformatics applications can obtain from such an architecture. This analysis can also give insights into how we should design future multi-core processors. HMMER is a popular bioinformatics application that uses Hidden Markov Models to perform multiple sequence alignment of proteins/DNA.
Problem statement: Analyze the performance of different parallelization approaches of HMMER on Cell BE.
Expected effort: The student is expected to start by studying bioinformatics in general and then HMMER in depth. This application (written in C) should then be ported to the Cell BE, which also requires studying the architecture and its programming model.
1. Automatic Compiler Support
For Instruction Set Extension
Context: Reconfigurable components such as Field
Programmable Gate Arrays (FPGAs) are becoming increasingly popular and
pose specific architectural challenges. One of those challenges is to automate
the machine code generation process so that a proposed architecture extended
with FPGAs can be tested and its performance evaluated by executing benchmark
programs. To this end, the compiler needs to be retargeted in order
to match the new architecture. There exist tools that automate a part of
this retargeting process, namely the instruction selection phase. The goal
of this MSc topic is to integrate the tool OLIVE into a compiler.
Problem statement: Integration of the OLIVE code generator
generator into the compiler included in the Delft WorkBench project and writing
a specification for the PowerPC instruction set with some extensions.
Expected effort: A good working knowledge of C++ is assumed
and the student must be willing to study compiler theory and should understand
the Stanford University Intermediate Format (SUIF) front-end and MachSUIF
back-end. The following tasks need to be carried out:
Task 1 : describe and thoroughly study
OLIVE.
Task 2 : implement a pass for the instruction
selection phase into the back-end using OLIVE.
Task 3 : write a specification for PowerPC
(and maybe for MIPS) so as to include reconfigurable units.
1. Agents are synonymous with distributed,
asynchronous processes that perform some kind of function. They have
become increasingly popular, especially in internet-related environments.
An important example is Grid computing, in which the internet is
seen as a massively parallel machine. The challenge in grid computing is
to develop efficient load balancing algorithms. The goal of this research
is to develop algorithms for balancing the workload on a grid-like architecture.
2. Agents can also be used in a
wide variety of other applications, ranging from information search to
price negotiation. The goal of this topic is to propose a universal
architecture of an agent, irrespective of the specific task or function
the agent is supposed to fulfil.
3. Minimal Agents As Building
Blocks For Distributed Computing Systems
Context: Grid and cluster computing are widely
studied as alternatives to supercomputers. The basic principle is to connect
a large number of low-performance machines, such as ordinary PCs, and use
them as one large computing system. This evidently poses specific challenges
with respect to load balancing, synchronization, routing, etc. This project
aims to look at agent-based techniques in order to address some of the
above issues. Agents are pieces of hardware or software that behave autonomously,
perform a certain task, and can interact with other agents or the outside
world. The main idea is to have an agent manage the resources of a local
PC or computing node and negotiate computing time and tasks with other
agents or some kind of central platform where tasks and resources are dispatched.
Problem statement: Extend an existing simulation environment
based on the notion of minimal agents, so that architectural changes are
easily introduced and their impact investigated. More specifically, routing,
load balancing and resource allocation will be studied. What additional
mechanisms, such as bidding platforms and strategies, communication protocols,
etc., are required?
Expected effort: The following tasks will need to be performed:
Task 1 : describe, on the basis of the
literature, the architecture of minimal agents.
Task 2 : extend an existing Java simulation
environment to study one of the following issues : routing, load balancing
and resource allocation.
Task 3 : evaluate different algorithms
for the problem at hand and assess the overall performance of the computing
system.
4. Delay Modeling Framework
for CMOS Threshold Logic circuits
Context: The increasing demand for high-speed digital
arithmetic in hardwired processors has shifted research efforts towards
highly customized alternative circuit techniques and specific computer
arithmetic algorithms. Among them, the Threshold Logic (TL) paradigm has
received increasing attention in recent years since the basic TL
gate can perform more complex and wider functions (in terms of number of
input variables) than the usual Boolean CMOS gates. Basic CMOS Boolean
gates have been well studied and there is a wide range of delay models
supported by all commercial timing analyzers (e.g., Synopsys (TM), Cadence
(TM), Mentor Graphics (TM)). In contrast, Threshold Logic gates lack such
a diversity of delay models. Moreover, there is a need for a software framework
capable of rapid estimation of the critical path delay in a Threshold Logic
circuit, without relying on time-expensive circuit simulations.
Problem statement: Develop a program capable of assisting
the evaluation of the critical path and critical path delay of an arbitrary
Threshold Logic circuit.
Expected effort: The student is expected first to perform
literature research in order to become familiar with the subject of Threshold
Logic and CMOS gate delay modelling. Second, the student should perform
several benchmark circuit simulations in order to extract Threshold Logic
gate delay model parameters. Finally, the student should develop a program
capable of accurately estimating the critical-path delay of an arbitrary
Threshold Logic circuit.
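As an illustration of the final step, critical-path estimation on a gate-level netlist amounts to a longest-path search through an acyclic graph. The sketch below assumes a deliberately simple delay model in which each gate has one fixed delay. The netlist format, the function name, and the delay values are hypothetical; a realistic Threshold Logic delay model would depend on parameters such as fan-in, weights, and load, which the thesis is expected to extract from circuit simulations.

```python
def critical_path(fanin, delay):
    """Return (critical delay, path) for a combinational circuit.

    fanin maps each gate to the list of signals that drive it. Signals that
    do not appear as keys are treated as primary inputs arriving at time 0.
    delay maps each gate to its (assumed fixed) propagation delay.
    """
    arrival = {}  # signal -> (latest arrival time, path that leads to it)

    def visit(signal):
        if signal not in arrival:
            drivers = fanin.get(signal, [])
            if not drivers:
                arrival[signal] = (delay.get(signal, 0.0), [signal])
            else:
                # The slowest driver determines when this gate can switch.
                time, path = max((visit(d) for d in drivers),
                                 key=lambda entry: entry[0])
                arrival[signal] = (time + delay.get(signal, 0.0),
                                   path + [signal])
        return arrival[signal]

    return max((visit(gate) for gate in fanin), key=lambda entry: entry[0])


if __name__ == "__main__":
    # Hypothetical three-gate circuit with delays in nanoseconds.
    netlist = {"g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"]}
    gate_delay = {"g1": 1.5, "g2": 2.0, "g3": 1.0}
    total, path = critical_path(netlist, gate_delay)
    print(total, " -> ".join(path))
```

Because each signal's arrival time is computed once and cached, the run time grows linearly with the number of connections, which is what makes such an estimate much faster than circuit simulation. The recursive formulation is only suitable for modestly deep circuits; an iterative topological traversal would be preferable for large netlists.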
Context: With the advent of mobile computing and
communications platforms, computer graphics seems to be becoming a ubiquitous
application. Consequently, there is much interest from the designers of
general-purpose microprocessors, media processors, and specialized hardware
in providing cost-effective real-time 3-D computer graphics capabilities.
Our Computer Engineering laboratory is actively involved in the research
and development of a low-power, low-cost hardware graphics accelerator
for an industrial partner in the Molen project framework. In particular,
in the rasterization stage of a 3-D graphics pipeline, a division per fragment
(pixel) is performed at least for the texture coordinates computation.
When implemented for general divisors, division is a very slow, complex
procedure with a huge latency and significant cost, and is therefore power hungry.
Fortunately, in computer graphics, due to the limitations of the human visual
system, a reasonable amount of error is allowed in the division computation
process without introducing noticeable artifacts in the final computer-generated
image.
Problem statement: Devise a low-power, low-cost, fixed-point
division-like hardware algorithm by choosing the proper operand representation,
by tweaking the operand bit width (precision), or by finding alternative approximate
functions that produce roughly the same results as a true division algorithm.
Expected effort: The student is expected to perform some
literature research to become familiar with the subject of division
algorithms and implementations, and with several topics of computer graphics.
Then, the candidate hardware algorithm will be modeled in SystemC or VHDL
and will be embedded, tested, and evaluated in a full-fledged experimental
framework (provided) for an OpenGL-compliant rasterization engine. Further,
if the opportunity presents itself, it might even be possible to measure
the quality metrics of the synthesized hardware model on the actual silicon (FPGA).
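One family of approximate functions of the kind mentioned in the problem statement replaces the division by a multiplication with a reciprocal taken from a small lookup table. The sketch below is a software model of that idea, offered only as an example of the design space. The table size (`TABLE_BITS`), the fixed-point format (`FRAC_BITS`), and the function name are assumptions, not the algorithm the project will arrive at.

```python
TABLE_BITS = 6    # the table is indexed by the 6 bits below the divisor's leading one
FRAC_BITS = 16    # reciprocals are stored as fixed-point values scaled by 2**16

# Each entry approximates 1/m at the midpoint of one interval of m in [1, 2).
RECIPROCAL = [
    round((1 << FRAC_BITS) / (1 + (index + 0.5) / (1 << TABLE_BITS)))
    for index in range(1 << TABLE_BITS)
]


def approx_div(numerator, divisor):
    """Approximate numerator // divisor with a table lookup, a multiply, and a shift."""
    if divisor <= 0 or numerator < 0:
        raise ValueError("this sketch handles non-negative numerators "
                         "and positive divisors only")
    # Normalize: divisor = m * 2**shift with 1 <= m < 2.
    shift = divisor.bit_length() - 1
    if shift >= TABLE_BITS:
        index = (divisor >> (shift - TABLE_BITS)) & ((1 << TABLE_BITS) - 1)
    else:
        index = (divisor << (TABLE_BITS - shift)) & ((1 << TABLE_BITS) - 1)
    # Multiply by the tabulated reciprocal and undo both scalings.
    return (numerator * RECIPROCAL[index]) >> (FRAC_BITS + shift)


if __name__ == "__main__":
    for n, d in [(1000, 7), (4095, 33), (255, 255)]:
        print(n, d, approx_div(n, d), n // d)
```

With 64 table entries, the relative error of the reciprocal stays below about 0.8 percent, plus at most one unit from the final truncation. Whether that is visually acceptable, and how table size trades against area and power, is exactly the kind of question the evaluation framework is meant to answer.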
Context: The area of current ASIC and microprocessor
chips is dominated by on-chip memory. Product yield and reliability therefore
are effectively determined by that on-chip memory. Memory tests are used
to weed out and/or repair defective memories. However, the failure behavior
of memories is much more complicated than that of digital logic. Therefore,
in addition to the stuck-at and bridging fault models, many other fault
models are being used. Industrial memory test results indicate that the
established fault models do not cover all faults. Because of that, research
is being performed in memory fault modeling and test design.
Recent memory tests, however, are
becoming increasingly complex in order to cover larger classes of
fault models. This has reached the point where manual verification of the
fault coverage is becoming impractical and/or error prone. Because of that,
there is a strong need for a tool that is able to verify whether a given
set of faults is detected by a given test.
Problem statement: Design and implement a tool that
accepts fault models and tests as inputs and produces as output a coverage
matrix for each of the tests.
Expected effort: The student is expected to perform a literature
study of current memory test verifiers, the way fault models are being
specified (using fault primitives), and the way memory tests are being
specified (using march, and possibly other, notation).
A flexible and extendable structure should
be designed such that the simulator can be updated to cope with different
memory designs (e.g., multi-port memories), new fault models and tests.
The end result should be such that the established memory tests can be
verified, using the current state of the art in fault modeling.
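To make the idea of a coverage matrix concrete, the sketch below simulates two well-known march tests, MATS+ and MATS++, against a few single-cell faults and records which faults each test is guaranteed to detect. The data format for tests and faults, the function names, and the restriction to stuck-at and transition faults are simplifying assumptions. The tool envisioned here would accept fault primitives and march notation in a far more general form.

```python
def detects(test, size, fault, initial):
    """True if at least one read of the march test returns an unexpected value."""
    kind, victim = fault
    memory = [initial] * size

    def write(address, value):
        if address == victim:
            if kind == "SA0":
                value = 0
            elif kind == "SA1":
                value = 1
            elif kind == "TF_up" and memory[address] == 0:
                value = 0      # the cell cannot make a 0 -> 1 transition
            elif kind == "TF_down" and memory[address] == 1:
                value = 1      # the cell cannot make a 1 -> 0 transition
        memory[address] = value

    def read(address):
        if address == victim and kind in ("SA0", "SA1"):
            return 0 if kind == "SA0" else 1
        return memory[address]

    for order, operations in test:
        addresses = range(size) if order == "up" else range(size - 1, -1, -1)
        for address in addresses:
            for op in operations:
                if op[0] == "w":
                    write(address, int(op[1]))
                elif read(address) != int(op[1]):
                    return True
    return False


def coverage(tests, faults, size):
    """Coverage matrix: test name -> fault -> detected for every initial state."""
    return {
        name: {fault: all(detects(test, size, fault, init) for init in (0, 1))
               for fault in faults}
        for name, test in tests.items()
    }


# Each march element is (address order, operations); "r0" means read, expect 0.
MATS_PLUS = [("up", ["w0"]), ("up", ["r0", "w1"]), ("down", ["r1", "w0"])]
MATS_PLUS_PLUS = [("up", ["w0"]), ("up", ["r0", "w1"]), ("down", ["r1", "w0", "r0"])]

if __name__ == "__main__":
    faults = [(kind, 2) for kind in ("SA0", "SA1", "TF_up", "TF_down")]
    matrix = coverage({"MATS+": MATS_PLUS, "MATS++": MATS_PLUS_PLUS}, faults, 4)
    for name, row in matrix.items():
        print(name, row)
```

A fault counts as covered only if it is detected for both initial cell values, because the memory content before the test is unknown. Under this model, the extra read at the end of MATS++ is what catches the down-transition fault that MATS+ can miss.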
Contact persons: Georgi Gaydadjiev / A.J. van de Goor
Phone: 0152-786168 / 0152-786172 or 0182-529798
Email: g.n.gaydadjiev@its.tudelft.nl
a.j.vandegoor@its.tudelft.nl