In previous work, we showed that dynamic scheduling strategies for distributing and-work and or-work in parallel logic programming systems yield better performance than a fixed assignment of processors to exploit and-parallelism and or-parallelism. We then suggested that combining compile-time granularity information with dynamic scheduling strategies could improve the performance of the system even further [44,43,46].
In this work we show that compile-time granularity information can in fact improve the performance of parallel logic programming systems that otherwise rely on completely dynamic scheduling strategies.
Our target system is Andorra-I [111,82,112,43], a parallel logic programming system that exploits both or-parallelism (ORP) and and-parallelism (ANDP). ANDP in Andorra-I is exploited determinately, i.e., only goals that match at most one clause in the program are allowed to execute in parallel.
To obtain compile-time granularity information, we used the ORCA system [8], which generates simplified granularity information, based on Tick's algorithm [98], from the goals and clauses of a Prolog-like program.
We modified the Andorra-I system to interpret ORCA's output. Our results show that dynamic scheduling strategies can indeed benefit from compile-time granularity information.
The ORCA system is responsible for generating an annotated Prolog program that is later compiled to abstract code, the Andorra-I VRAM code [88], and for providing granularity information about clauses and goals to the Andorra-I engine.
The ORCA system consists of three main parts. The first part reads the Prolog program and creates a table with the names of all procedures and their respective clauses, together with a list of calls to clauses; the table is organised so that each entry holds the complexity of a procedure and of the clauses belonging to it. The second part computes the complexity measures for the clauses and procedures. The third part annotates the source program with the complexity information.
The algorithm used is based on Tick's algorithm [99], with some modifications to increase precision. Complexity is measured in number of resolutions: the complexity of a fact in the database is 1; the complexity of a clause is the sum of the complexities of the goals in its body plus one (accounting for the head call); and the complexity of a procedure is the sum of the complexities of the clauses belonging to it. The complexity of a recursive call is computed as the complexity of the corresponding procedure, with each recursive call counted as having weight 1.
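The rules above can be sketched as a small recursive traversal. This is only an illustration of the counting scheme, not ORCA's implementation: the program representation, the helper names, and the example predicates (`app/3`, `nrev/2`) are our own assumptions.

```python
# Hedged sketch of the Tick-style complexity measure described above.
# A program is a dict mapping a procedure name (e.g. 'app/3') to its
# list of clauses; each clause is the list of goals in its body, so a
# fact is an empty body. Names are illustrative, not ORCA's actual API.

def clause_complexity(program, proc, body, seen):
    # One resolution for the head call, plus the cost of every body goal.
    total = 1
    for goal in body:
        if goal == proc or goal in seen:
            total += 1           # (mutually) recursive call: weight 1
        elif goal in program:
            total += procedure_complexity(program, goal, seen | {proc})
        else:
            total += 1           # builtin or unknown goal: one resolution
    return total

def procedure_complexity(program, proc, seen=frozenset()):
    # Procedure complexity is the sum of its clause complexities.
    return sum(clause_complexity(program, proc, body, seen)
               for body in program[proc])

# Illustrative program: each procedure has a base fact plus one
# recursive clause; nrev/2 also calls app/3 in its recursive clause.
program = {
    'app/3':  [[], ['app/3']],
    'nrev/2': [[], ['nrev/2', 'app/3']],
}
print(procedure_complexity(program, 'app/3'))   # fact (1) + clause (1+1) = 3
print(procedure_complexity(program, 'nrev/2'))  # fact (1) + clause (1+1+3) = 5+1 = 6
```

The `seen` set plays the role of cutting off mutual recursion at weight 1, mirroring how direct recursion is handled.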
As an example, figure 2 shows the annotation for the clauses shown in figure 3.
The complexity of d/2 is given by the annotation '__c_d/2', which uses Prolog syntax; the complexity of samples/2 is given by '__c_samples/2'. The first argument of this annotation is the complexity of the procedure; the second is a list with the complexity of each clause in the procedure; the third is the recursive value of the procedure (if it has a recursive call); and the last is a list of complexities for mutually recursive calls. As none of the clauses shown in figure 3 has mutually recursive calls, this last argument is an empty list.
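To make the four-argument layout concrete, the following snippet prints one annotation in that shape. The helper and all the numeric values are invented for illustration; they are not output taken from figure 2.

```python
# Hypothetical illustration of the four-argument annotation layout
# described above. The helper and the concrete numbers are invented
# for illustration; they are not actual ORCA output.

def format_annotation(name, proc_cplx, clause_cplxs, rec_value, mutual):
    # Arguments: procedure complexity, per-clause complexities,
    # recursive value, and mutual-recursion complexities.
    return "'__c_{}'({}, {}, {}, {}).".format(
        name, proc_cplx, clause_cplxs, rec_value, mutual)

# e.g. a procedure samples/2 with two clauses of complexity 1 and 5,
# recursive value 5, and no mutually recursive calls:
print(format_annotation('samples/2', 6, [1, 5], 5, []))
# -> '__c_samples/2'(6, [1, 5], 5, []).
```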
The numbers in each argument are computed from an analysis of the whole program, which we do not show here.
ORCA gives information on clauses and on goal invocations. This information maps directly onto the Andorra-I compilation process:
The engine was adapted to access granularity information. This mainly consisted of adapting the loader and supporting the new instruction fields.
As this work concentrates on the reconfigurer, no modifications were made to the work-distribution algorithms of the and-scheduler or the or-scheduler; they were changed only to provide information to the reconfigurer.
The reconfigurer was implemented using two different strategies [44]. One, the efficiency-guided strategy, makes decisions based on past information and on the efficiency of the workers in a team [46]. The other, the work-guided strategy, makes decisions based on instantaneous information about the amounts of and-work and or-work available [45].
In this work we concentrate only on the work-guided strategy. The information this strategy uses relies on estimates of work sizes. For example, to measure the amount of and-work, the reconfigurer takes from the and-scheduler the length of the run queue of goals. This means that we keep no information about the predicted amount of work below each goal in the queue, which may lead to a decision that allocates workers to a team working on a short branch of the execution tree. This can cause load imbalance, since other sources of work may need more workers.
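The limitation above can be seen in a minimal sketch of a work-guided split, where and-work is approximated by the run-queue length alone. The function name and the proportional-split rule are our own assumptions for illustration; this is not Andorra-I's actual reconfigurer code.

```python
# Minimal sketch of a work-guided worker split under the stated
# limitation: and-work is approximated by the length of the run queue
# of goals, with no estimate of the work below each goal. Illustrative
# only; not Andorra-I's actual algorithm.

def split_workers(total_workers, and_queue_len, or_work_len):
    # Allocate workers proportionally to the visible work counts.
    total_work = and_queue_len + or_work_len
    if total_work == 0:
        return 0, total_workers
    and_workers = round(total_workers * and_queue_len / total_work)
    return and_workers, total_workers - and_workers

# A team with 6 goals queued looks "bigger" than one with 2 open
# choice points, even if each of those 6 goals finishes almost
# immediately: precisely the load-imbalance risk described above.
print(split_workers(8, 6, 2))   # -> (6, 2)
```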
Our experiments were run on a Sun SPARCstation-20 with 4 processors; we used only three processors and left one for OS duties. We ran two versions of Andorra-I: the original version, which does not use any kind of compile-time information, and a version using the compile-time information generated by ORCA.
Our results show that using granularity information can improve performance. The improvement ranges from 1% (for the queens application with a 10x10 board size, bqu10) to 168% (for the cyphertext program, cypher). The fibonacci program shows no improvement, because it is a deterministic application containing only and-parallelism; as we do not use granularity information in the and-scheduler, we observe only the overhead of supporting the ORCA information.