(1) INTRODUCTION
1.1 Need for configurable ALUs :
The configuration of the Arithmetic Logic Unit has effect on the computation efficiency of the executing programs . It is a fact that a given hardware ALU configuration may be better suited for execution of a given computation structure . Thus maximum computation efficiency can be achieved if the underlying hardware configuration can be adjusted as required by the program structure . Since a program is sequence of algorithmic steps with varying computation structure , the need for such a reconfiguration i.e , changing from one ALU configuration to another , would thus arise at different program points corresponding to these steps .
The problem of identifying the optimal configurations that match the needs of the underlying computation at different steps in program is very complex . But if such configurable computing engines can be designed then power of these architectures can be maximally used . The success of these architectures critically depends on the effectiveness of the compiler to suitably configure the hardware to the computation needs .
1.2 Overview of the reconfiguration framework :
Configurable Arithmetic Logic Units offer opportunities for adapting the underlying hardware to support the varying amount of parallelism in the computation . A configuration is defined as a given hardware implementation of different operators along with their multiplicities . Thus if the computation structure contains the same operators , the configuration is maximally utilized .
The framework for reconfiguration of the ALUs aims at generating optimal configurations that match to the operator requirements of the code to be executed . Operator parallelism can be achieved by generating different configurations .
The automatic compilation framework for configuration analysis exploits operator parallelism within loop nests . The focus of this framework is on performing configuration analysis to minimize costly reconfiguration overheads . In the framework initially some operator and loop transformations are carried out . This helps in reusing the ALU configuration if possible .
The next part of the framework is a two pass algorithm .The first pass attempts to generate either maximal cutsets or maximally parallel configurations by performing analysis of the program dependency graph (PDG) of a loop nest . Here , a cutest is defined as a group of statements that execute under a given configuration . The second pass analyses the trade-offs between the costs and the benefits of reconfigurations across different cutsets and attempts to eliminate the reconfiguration overheads by merging the cutsets .
(2) BASIC CONCEPTS AND DEFINITIONS
2.1 Operator Parallelism :
Operator parallelism within loops is one of the important parallelism that can be exploited on these reconfigurable architectures. In a loop when all the instances of the single operator are executed simultaneously , we say that operator parallelism is used . For example consider the following loop :
for i = 1 to 100
s1 : A[i] = B[i] * C[i]
s2 : D[i] = A[i] + B[i]
end for
This loop contains forward dependence due to second statement requiring A[i] that is computed in first statement .This loop could be split into two for-all loops preserving forward dependence due to A[i] and then operator parallelism within each loop nest could be exploited by suitably configuring the ALU to generate code as follows :
ALU configuration (i) :
for i = 1 to 100
s1 : A[i] = B[i] * C[i]
end for
ALU configuration (ii) :
for i = 1 to 100
s1 : D[i] = A[i] + B[i]
end for
The ALU configuration (i) has 100 multipliers for supporting 100 concurrent multiplications and the ALU configuration (ii) has 100 adders for supporting 100 concurrent additions . Thus all instances of each single operator are executed in parallel . This loop transformation is possible only if the ALU can be reconfigured at the entry point of the given loop nest .The loops then execute faster when splitted to achieve operator parallelism . However , configuring the hardware at the entry point of each loop nest has an overhead associated with it . Thus reconfiguring the ALU for each loop to achieve operator parallelism would be beneficial only if this overhead is minimal . The expensive reconfigurations could be minimized if the generated configurations are able to span a large amount of computation without sacrificing the parallelism . The framework to be discussed focuses on this issue .
2.2 Definitions :
The notations and definitions used in the following configuration framework are discussed in this section .
2.2.1 Configuration :
A set of operators along with their multiplicities that are implemented on the hardware to support computation at a given time is known as a configuration . A configuration is represented as a set of tuples . For example a configuration consisting of 30 adders and 40 dividers can be represented as follows :
{ < + , 30 > , < / , 40 > } .
2.2.2 Cutset :
A group of program statements that execute under a particular configuration is known as a cutset . The cutsets are represented using set notation . For example , if statements S1 , S2 and S3 execute under the configuration { < * , 20 > , < + , 50 > } , they are denoted as a cutset { S1 , S2 , S3 } .
2.2.3 Maximal cutset :
A cutset that maximally uses a given configuration is known as maximal cutset . In other words , one cannot add any more statements satisfying dependency constraints to this cutset without incrementing the corresponding configuration . For example , consider the cutset { S1 , S2 , S3 } and corresponding configuration { < * , 50 > , < + , 100 > } . Also assume that statements S4 and S5 are the only possible statements which could be added to this cutset satisfying the dependency constraints . Assume that S4 and S5 both need 100 concurrent dividers which are not present in the above configuration . Thus neither S4 or S5 can be added to the current cutset without incrementing the configuration . Thus the cutset { S1 , S2 , S3 } is maximal with respect to the above configuration { < * , 50 > , < + , 100> } .
2.2.4 Maximally parallel configuration :
A configuration is said to be maximally parallel with respect to a statement if it maximally supports the operator parallelism present in a given statement . Suppose a statement exhibits operator parallelism of 100 concurrent multiplications and corresponding possible configurations are as follows :
{ < * , 50 > , < + , 50 > } and { < * , 70 > }. The second configuration { < * , 70 > } is maximal since it maximally supports the operator parallelism of 70 multiplications , whereas the first configuration supports only 50 concurrent multiplications and is not maximal .
(3) COMPILATION FRAMEWORK FOR CONFIGURABLE SYSTEMS :
3.1 Introduction :
The compiler for the configurable computing systems need to generate code which is optimized inorder to match the underlying ALU configuration . Initially the loop nests in the code are studied to generate maximal configuration . Depending on the suggested configuration , the code is optimized . The resulting final code will give optimum performance gain .
3.2 Compiler Analysis Phases :
The compiler for the configurable systems consists of four distinct phases . They are as follows :
1. Loop and operator transformations .
2. Configuration Analysis .
3. Architecture Dependent Optimizations .
4. Code Generation .
The compiler analysis phases are depicted in the figure 3.1.
The description of each phase is as follows :
3.2.1 Loop and operator transformations :
This is the first phase , which is a preprocessing phase to the analysis phase. This phase takes a Program Dependency Graph (PDG) as an input and performs oop transformations , such as loop splitting , statement reordering , etc. that could help in generating maximal cutsets and maximally parallel configurations . It also undertales operator transformation, which exposes more opportunities for configuration reuse .
3.2.2 Configuration Analysis :
This phase analyses the trade-offs between the costs of reconfiguration and performance gains due to the reconfiguration done for exploiting operator parallelism within loop nests . Space time cost models of operators are used for compile time analysis by this phase. The phase uses a two pass solution .
In the first pass , maximal cutsets or maximally parallel configurations are generated . In the second pass an attempt is made to merge the cutsets and the corresponding configurations if the reconfiguration costs outweighs the benefits . Finally , this phase generates a PDG annotated with reconfiguration directives .
FIG . 3. 1 COMPILER ANALYSIS PHASES
3.2.3 Architecture Dependent Optimizations :
This phase performs architecture specific optimizations targeted toward the specific reconfigurable system .
3.2.4 Code generation :
This phase generates code for execution on the reconfigurable hardware .
The framework which is described next , covers the first two phases of this compiler analysis .
(4) OPERATOR AND LOOP TRANSFORMATIONS :
4.1 Introduction
This is the first phase of the compiler as stated in the previous chapter . Depending on the computational demands of a loop , a given set of statements in a loop may demand certain hardware configuration for their best execution . Once the statements are grouped in terms of their preferred configurations , it is important to analyse whether or not the two groups should be assigned to two different configurations . The cost of reconfiguration is practically quiet high . So the number of reconfigurations should be kept minimum . Thus program transformations should be done to expose more opportunities for configuration reuse . The total number of reconfigurations that need to be performed during execution of loops can be reduced through certain operator and loop transformations which act as a preprocessing step.
4.2 Operator Transformations :
The motivation for operator transformation is the reuse of the previous configurations generated to reduce reconfigurations . Consider the loop nests given in fig 4.1 , which has to be executed on the hardware .
L1 : for i = 1 to 100
A[i] = B[i] + C[i] + D[i] + E[i] + x + y
end for
L2 : for i = 1 to 100
A[i] = A[i-1] * 6
end for
FIG . 4.1
Loop nest L1 consists of three parallel additions per iteration and thus the preferred configuration for L1 would be as follows { < - , 300 > } .Loop nest L2 contains a multiplication operation and its preferred configuration would be { < * , 100 > } . A reconfiguration would have to be done between the two loop nests . If we transform the multiplication operator into a series of additions , as shown in fig. 4.2 , reconfiguration can be avoided .The preferred configuration demand of L2 is now { < + , 300 >} , same as L1 .
L1 : for i = 1 to 100
A[i] = B[i] + C[i] + D[i] + E[i] + x + y
end for
L2 : for i = 1 to 100
A[i] = A[i-1] + A[i-1] + A[i-1] + A[i-1] + A[i-1] + A[i-1]
end for
FIG . 4.2
Although the conventional cost of implementing a multiplication through a series of additions is higher , since reconfiguration costs are high , the overall performance is better by transforming operators as above . Each complex operation can thus be broken down into simpler constituent operations that can be executed in parallel . Since the hardware can support a higher multiplicity of simple operations and such operations can span a range of complex operations , we can eliminate redundant reconfigurations by operator transformations .
4.3 Loop Transformations :
The motivation for performing loop transformations is to expose parallelism in the statements to optimally utilize the hardware space for parallel execution of operations . At the same time loop transformations should not lead to reconfiguration to occur within the loops , but rather reconfigurations to occur between loop boundaries .
A very useful transformation is a for-all loop transformation , where a given loop is split into a sequence of for-all loops . Each for-all loop , therefore , would exhibit maximum operator parallelism and each boundary between the for-all loops could be a potential reconfiguration point . The loops having loop carried dependencies cannot be split into for-all loops . For such loops , loop transformations as loop peeling , scalar expansion when performed , allow certain dependencies to be removed so that the resulting loop can then be split into a sequence of for-all loops. Reordering statements in a loop can also break loop carried dependencies and produce a candidate loop . For example consider the loop given below . There is loop carried dependence due to B[i] between S1 and S2 .
for i = 1 to N
A[i] = B[i] + C[i] / / S1
B[i+1] = C[i] + D[i] / / S2
end for
Using combination of statement reordering , loop peeling and the
index-set shifting , we reorganize the loop as follows :
A[1] = B[1] + C[1]
for i = 2 to N
B[i-1] = C[i-1] + D[i-1]
A[i] = B[i] + C[i]
end for
A[N+1] = B[N] + C[N]
Thus through this transformation , loop carried dependence has been removed and the resulting loop can be easily split into for-all loops.
4.4 Statement Reordering for Configuration Reuse :
Since each configuration consists of a set of operators along with their multiplicities , one may attempt a transformation where statements with ‘ similar ‘ operators are moved together through statement reordering . The program dependencies must be preserved in doing the reordering .
Consider the following loop nest :
for i = 1 to 100 Cutset Configuration
a[i] = b[i] * c[i] - d[i] / / S1 {S1} { < * ,100 > , < - , 100 > }
e[i] = a[i] - c[i] / / S2 { S2 , S3 } { < - , 100 > , < + , 100> }
f[i] = a[i] + e[i] / /S3
g[i] = a[i] * b[i] / /S4 {S4} { < * , 100 > }
end for
In the above example , in the absence of statement reordering the compiler attempts to generate three configurations as shown above . But as one can see the statements S1 and S4 share the same operator < * > and statements S1 and S2 share the same operator < - > . The dependency graph is as shown below in fig 4.3 . From the PDG , we note that S4 can be moved before S2 to increase the configuration locality , but S3 cannot be moved before S2 . The loop can be legally transformed to
for i = 1 to 100 Cutset Configuration
a[i] = b[i] * c[i] - d[i] / / S1 {S1, S4 , S2} { < * ,100 > , < - , 100 > }
g[i] = a[i] * b[i] / / S4
e[i] = a[i] - c[i] / /S2
f[i] = a[i] + e[i] / /S3 {S3} { < + , 100 > }
end for
This results in a fewer number of configurations utilizing operator locality . Such transformations for improving operator locality increase configuration reuse and reduce reconfigurations .
4.5 Order of Transformations :
The above transformations are applied by the compiler discussed in the following sequence :
Ø First , intraprocedural loop nests are normalized and an attempt is made to convert all the loops into for-all loops .This allows loop nests to be split ( fission ) freely , if necessary by the following phases .
Ø Next , statement reordering is performed to create operator locality within loop nests . This transformation reduces the number of possible reconfigurations found by following phases .
Ø Next , operator transformations are carried out to enhance operator locality as illustrated above . This further eliminates reconfigurations between two loop nests .
Ø Finally , we carry out loop fission on legal for-all boundaries.
(5) CONFIGURATION ANALYSIS - PASS 1:
5.1 Introduction
This is the next phase of the compiler which focuses on analyzing the PDG to determine the cutsets and the corresponding configurations. In this chapter we discuss the first pass of this phase The goal of the first pass is to generate maximal cutsets and maximally parallel configurations.
5.2 Cutset and Configuration Generation
This is the first pass of the configuration analysis. The input to the first pass is a PDG and a compile time cost model of the reconfigurable ALU which includes execution time of operators , RPU ( Reconfigurable Processing Unit ) space occupied by each operator , total reconfigurable space and reconfiguration time. The algorithm generates the cutsets for the loop and , for each cutset the corresponding configuration is generated. The idea behind the algorithm is to pack as many statements as possible in one configuration to minimize reconfiguration overhead constrained by hardware space.
The first pass involving analysis of the program using the PDG constructed at loop nest level. Each node of the PDG corresponds to a statement. The nodes of the PDG are annoted with the data elements referenced and the operations involved on these data elements. The directed edges in the PDG represent the data dependence from one node to another in the direction of the edge. The nodes in the PDG are assigned level numbers that represent the depth of the node in the PDG. A node has level no-1 if it is not data dependent on any of the statements in the loop. In the PDG the nodes with the same level no are not data dependent on each other and , hence , can be executed in any order.
The PDG of the loop is traversed level by level in the order of the flow dependence in the program. The amount of space required by each statement is given by the multiplicity of each of its operator and the space needed to implement it.
e.g.:-
If a statement involves one division and one addition and the loop count is hundred , we will require hundred multipliers and hundred adders to execute all instances of the statements in parallel.
The algorithm to generate maximal cutsets or maximally parallel configurations uses two heuristic approaches:
1. Configuration Match
2. Hardware Space Requirement.
The PDG’s are treversed in topological order and cutsets and configurations are generated incrementally. At every level , the algorithm invokes the selected heuristic to determine the order of the statements at that level for inclusion in the current cutset being incremented. The heuristic return the statements which are ordered as per one of the criteria maximal cutsets or maximally parallel configurations.
The algorithm then attempts inclusion of the statements in the sorted order and the corresponding current configuration is incremented accordingly if needed. When a statement is included in the current cutset , the corresponding configuration may or may not have to be incremented. The former case occurs if the operator parallelism demands of the statement included are already met by the current configuration. If the demands are not met the configuration is incremented. The incrementation can be done only when the space constraint is met. In case the space constraint is not met a new cutset and configuration will be opened by the algorithm.
5.2.1 Complexity Of The First Pass
We analyse the complexity of the algorithm based on the number of statements in the PDG. The algorithm traverses the PDG level by level. Let there be maxlevel levels in the PDG and let mi denote the no. of statements at level i of the PDG. The total no. of statements , N in the PDG is given by :
For each level , statements are sorted based on the chosen heuristic. This sorting takes O(mi log(mi) ). The statements are then included in the cutset in the sorted order . The worst case occurs when all the N statements are present in the same level of the PDG. The worst case complexity of the algorithm is therefore given by : O( N log (N )).
5.2.2 Heuristics Required For The First Pass
The two heuristic approaches used by the algorithm are motivated by the following two factors which can affect the efficiency of the program execution on the hardware :
1. The number of reconfigurations performed :-
Since reconfiguration overhead is significant , it is very important to keep the number of reconfigurations to a minimum .
2. Hardware space utilization:-
The hardware space should be utilized to the maximum
extent to fully exploit available parallelism within each configuration.
The heuristics are used to order the selection of statements in current configuration. They are discussed below:
a. Configuration Match:-
The motivation behind this heuristic is to maximally utilize current configuration to generate maximal cutsets. Thus , this heuristic attempts to group the statements that have ‘similar’ needs for operator parallelism together into a cutset and generate corresponding configuration. The heuristic orders statements that have a better match with current configuration before those statements that have a worst match. In other words it attempts to order statements so that there is a maximal match between the operator support available in the current configuration and the operator support required by the statements.
Assume that the algorithm is in the middle of selecting the next statement for inclusion in the current cutset and it calls this heuristic to get an ordered set of candidate statements . Let the current configuration be comprised of operators Oi and Oj . Let there be three statements S1 , S2 , S3 , which are candidates for inclusion in the current cutset for the above configuration . Let S1 require operators Ok , Oi . Let S2 require operators Oi , Oj , and S3 require operators Om , On , where Oi , Oj ,Ok , Om ,On are different .
First , the heuristic determines the match and unmatch count of each statement as follows :
For S1 , one of the required operators Oi is available in the current configuration ( a match ) and the other is not ( an unmatch) . Thus , its match and unmatch counts are both one. For S2 both the operators are available ( two matches ) and , for S3 , none match ( two unmatches ) . Then ordering is performed on statements using match and unmatch counts . Statement Si > Sj ( Si is placed before Sj ) if unmatch count of Si is lower than that of Sj . If the unmatch count is equal , then Si > Sj if the match count of Si is greater than that of Sj .
Based on the above criteria the given statements will be ordered as S2 > S1 > S3 . Thus S2 is a better candidate for inclusion than both S1 and S3 and will be considered first by the main algorithm to decide whether or not it should be included in the current cutset .
In the above heuristic , apart from generating maximal cutsets , the other motivation is to enforce maximal overlap between the new and the old configuration , to maximally utilize partial reconfiguration when space constraint is not met . For example , in the above illustration if the space constraint is not met for the candidate S1 , one would start a new cutset and a new configuration < Ok , Oi > . One can see that this new configuration has a good amount of overlap with the previous configuration < Oi , Oj > . Thus , reconfiguration would be partial resulting in minimal reconfiguration overheads.
b. Hardware space requirement :
The motivation behind the second heuristic is to generate maximally parallel configurations ( rather than maximal cutsets as in the first case ) by utilizing the hardware space judiciously . This heuristic orders statements such that those which require higher space are ordered before those needing less space. Since the amount of space required is directly proportional to the available operator parallelism within a statement , this approach ensures that the configuration is generated to fully utitlize the available operator parallelism within a given statement . This approach contrasts with the previous one in that we focus on generating locally optimal configurations by maximally fulfilling parallelism demands of the current statement . The cutsets and the corresponding configuration are then generated until the space constraint is met . A new cutset and configuration is generated if the space constraint is not met . This enables maximum utilization of hardware space.
The following figure summarizes the first pass :
(6) CONFIGURATION ANALYSIS - PASS 2:
6.1 Introduction
This is the next phase of the compiler which focuses on analyzing the PDG to determine the cutsets and the corresponding configurations. In this chapter we discuss the second pass of this phase . The goal of the second pass is to analyse trade-offs between gains and costs of reconfigurations and to merge configurations when cost exceeds gains.
6.2 Reconfiguration Minimization :
Once the PDG has been completely traversed and cutsets generated in the first pass , now we traverse over the cutsets. The cutsets are analysed taking two adjacent ones at a time to determine if they should be merged or not .
In the above figure C1 denotes the execution cost of Cutset-1 under Config-1 and C2 of Cutset-2 under Config-2. Let RC12 denote cost of reconfiguration from Config-1 to Config-2.Let the Cmerged denote the cost of the merged cutset ( Cutset-1 , Cutset-2 ) under a common configuration. If Cmerged is less than ( C1 + C2 + RC12 ) , the cutsets are merged and the common configuration would be comprised of all the operators that are present in the configuration of the given two cutsets.
e.g.:-
If the configuration of Cutset-1 is given as :
{ < * , 100> , < + , 100 > }
and that of Cutset-2 is:
{ < / ,100 >, < - ,100 > }
then the common configuration would be comprised of {*, / , +, -} . The multiplicity of each operator is computed based on the total chip space .
6.2.1 Complexity Of Second Pass
In this algorithm we take two cutsets at a time and compare the execution costs of them under their configuration and the execution cost of the merged cutset under a common configuration. The complexity of computation of execution cost of the cutsets is linear on the number of statements in the given PDG. The complexity of computation of execution cost of the merged cutsets is again linear on the number of statements in the given PDG. Therefore the complexity of the algorithm is O ( N ) , where N is the number of statements in the PDG.
Overall complexity of the Configuration Analysis is:
O ( N log( N ) + N )
(7) AN ILLUSTRATION :
This chapter illustrates the working of the algorithm for configuration analysis.
Problem: We consider the following loop nest:
for i = 1 to 100
A[i] = A[i] * a + b; / / S1
B[i] = A[i] + B[i] * b ; / / S2
C[i] = A[i] / c + a ; / /S3
D[i] = B[i] * c / D[i] ;/ /S4
E[i] = B[i] * C[i] / c ; / /S5
G[i] = F[i] + C[i] ; / /S6
end for
Solution:-
I.PDG: The program dependency graph of the given problem is as given below:-
In this loop there are 6 statements which is represented by each node in the PDG. Each node is annotated with operators involved in the statement. Edges in the graph denote flow dependence. The algorithm is applied on this PDG.
II. PASS 1
The PDG is traversed in topological order starting from S1 to generate cutsets and configurations. Some terms used in the algorithm are as follows:
1> Totalcutsetconfig : Denoting the configuration decided for the current cutset.
2> Currentcutset
3> Availablespace : The hardware space available at each level in the PDG
4> Totalspace :Total hardware space.
5> Currentavailconfig : Configuration found at previous level.
Totalcutsetconfig and Currentcutset are initially set to empty. Availablespace and Totalspace are initially say 100 units. We use the heuristic configuration match for the algorithm.
We assume the space and time measures for the operators as given in the table:
Operator Time Space
DIV 3.5 10
ADD 1 1
SUB 1 1
MUL 3 4
We start the algorithm from S1 as follows:
Level-1:
Statement = S1
Currentavailconfig = 0
Currentcutset = 0
Process : A new cutset is created and S1 is added to it.
Space Required: The space taken for the multiplier and the adder in S1 is 5 as per the table.
Parallelism possible: Since available space is 100 we can have 20 adders and multipliers. Thus 20 instances of S1 will execute in parallel and will take 5 steps to complete execution of 100 instances of S1.
Results:
Totalcutsetconfig: { < * , 20 > , < + , 20 > }
Availablespace = 0
Currentcutset = {S1}
Level-2:
Statement1 = S2
Currentavailconfig = Previous level’s Totalcutsetconfig
{ < * , 20 > , < + , 20 > }
Currentcutset = S1
Process : The configuration matching heuristic will order S2 before S3 as its configuration requirement < *, +> matches with Currentavailconfig. So configuration requirements of S2 is subset of Currentavailconfig.
Parallelism possible: S2 requires 100 multipliers and adders each same as S1 it will take 5 steps to complete the execution of S2 with 20 instances at a time.
Results:
Totalcutsetconfig: { < * , 20 > , < + , 20 > }
Availablespace = 0
Currentcutset = {S1, S2}
Statement2 = S3
Currentavailconfig = { < * , 0 > , < + , 0 > }
Since S2 is using all the operators in the current configuration.
Currentcutset = { S1 , S2 }
Process : S3 requires a divider which is not present in Currentavailconfig. No additional space is also present. So a new cutset and new configuration is created.So Availablespace is set to 100.
Parallelism possible: A divider and a adder will take 11 units of space so , 9 dividers and adders can be generated. 9 instances of S3 will execute in parallel and 12 steps will take to complete 100 instances.
Results:
Totalcutsetconfig: { < / , 9 > , < + , 9 > }
Availablespace = 100 - 99 =1
Currentcutset = {S3}
Level-3:
Statement1 = S5
Currentavailconfig = Previous level’s Totalcutsetconfig
{ < / , 9 > , < + , 9 > }
Currentcutset = {S3}
Process : The configuration matching heuristic will order S5 which implies perfect match with Currentavailconfig followed by S6 whose configuration requirement is subset of Currentavailconfig. S4 will be ordered last as it requires a multiplier.
Parallelism possible: S5 requires 100 dividers and adders each .But there is no additional space available .So S5 will take 12 steps same as S3 .
Results:
Totalcutsetconfig: { < / , 9 > , < + , 9 > }
Availablespace = 100 – 99 =1
Currentcutset = {S3, S5}
Statement2 = S6
Currentavailconfig = { < / , 9 > , < + , 9 > }
Currentcutset = { S3 , S5 }
Process : S6 requires 100 adders since its configuration requirement <+> is subset of Currentavailconfig,S6 is added to Currentcutset.
Parallelism possible: 9 instances of S6 will execute in parallel as 9 adders are available and 12 steps will take to complete 100 instances.
Results:
Currentcutset = {S3,S5,S6}
Statement3 = S4
Currentavailconfig = { < / , 9 > , < + , 9 > }
Currentcutset = {S3,S5,S6}
Process : S4 requires a multiplier which is not present in Currentavailconfig. No additional space is also present. So a new cutset and new configuration is created.So Availablespace is set to 100.
Parallelism possible: A divider and a multiplier will take 14 units of space so , 7 dividers and multipliers can be generated. 7 instances of S4 will execute in parallel and 15 steps will take to complete 100 instances.
Results:
Totalcutsetconfig: { < * , 7 > , < / , 9 > }
Currentcutset = {S4}
Results of PASS 1:
Cutsets and configurations generated are as follows:
Cutset Configuration
{S1, S2} { < *,20> , < + ,20> }
{S3, S5, S6} { < /,9> , < + ,9> }
{S4} { < *,7> , < / ,7> }
III. PASS 2
This is reconfiguration minimization step in which the cutsets found in pass-1 are analysed and merged if the cost of merging is less than the cost of reconfiguration.
(8) CONCLUSION :
By this framework the loops are partitioned into cutsets and configurations are generated. The total execution time including reconfiguration overhead can be computed for each loop. Speedups can also be computed. In most of the cases it is seen that the speed-ups obtained are very good except in one case where the reconfigurable processor does worse than the host processor.With this framework it is seen that only a small fraction of hardware space is left unused.
Thus it can be concluded that a lot of operator parallelism exists within loops and can be effectively exploited by performing compiler analysis to utilize the power of reconfigurable ALU’s.
(9) BIBLIOGRAPHY
1.IEEE Transactions On Parallel And Distrbuted Systems, volume 13, no. 1 , Jan 2002
Automatic Compilation Of Loops To Exploit Operator Parallelism On Configurable Arithmetic Logic Units.
-Narasimhan Ramasubramanian,
-Ram Subramanian,
-Santosh Pande
2.www.suif.stanford.edu
Global Optimizations for Parallelism and Locality on Scalable Parallel Machines.
-Jennifer M. Anderson
- Monica S. Lam
Tuesday, January 6, 2009
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment