In any parallel system, the scheduler is a very important component because it is responsible for the efficient utilisation of the processors by the system. By efficient utilisation we mean using only the resources that the application actually requires. A reasonable scheduling strategy should also complete a parallel application in a time close to the earliest possible time. In parallel logic programming systems, the execution of logic programs generally produces irregular computations that can be mapped onto trees. Some applications produce very ``bushy'' trees, while others produce ``thinner'' trees. The nodes of these trees are distributed among the computing resources to be executed.
In or-parallel systems the nodes of the execution tree are or-nodes (also called choicepoints) that contain alternatives left to be executed. A collection of or-nodes connected in some hierarchical order forms an or-tree. Finding or-work, i.e., finding new alternatives to execute, is the task performed by an or-scheduler. The or-scheduler deals with problems such as which alternative to select next, or which branches need to be pruned.
And-parallel systems generate the same kind of trees as or-parallel systems, but with and-nodes. A collection of and-nodes connected in some hierarchical order forms an and-tree. In and-parallel systems, the {\it and-scheduler} is responsible for selecting the next goal to be executed.
In systems that exploit both and-parallelism and or-parallelism there is a new issue to be tackled: what kind of work should be chosen next, and-work or or-work? There is not much research on this question, because the combination of and- and or-parallelism is relatively new and there are not many implemented systems that exploit both forms of parallelism. Yet another issue to be tackled, which also applies to and-parallel and or-parallel systems, is the efficient utilisation of resources, i.e., utilising only the resources required by the application. This is a very important issue in multi-user parallel environments.
Systems that exploit both and- and or-parallelism usually generate {\it and-or trees}, although different ways of exploiting and-parallelism may lead to different conceptual trees.
A task, i.e., the unit of work a scheduler allocates to a processor, is any work done by the system engine between two invocations of the scheduler. Task sizes may therefore vary, depending on the implementation of the system. A task switch is performed when the scheduler is invoked to find work and the worker actually switches to another task.
Next we discuss some or-only and and-only scheduling techniques used in well known parallel logic programming systems that exploit either and- or or-parallelism, as well as and/or scheduling techniques proposed for systems that exploit both forms of parallelism.
In a parallel logic programming system the function of an or-scheduler is to choose the best choicepoint from which to take an alternative to be explored, in order to make better use of resources. For both all-solutions and single-solution applications the scheduler should, at an abstract level, keep the height of the execution or-tree to the minimum possible (in order to achieve speedups relative to the sequential execution). At a practical level, it should not incur too high a cost in task switching, and it should cope with a limited number of processors in the system. Another very important issue is to reduce the amount of interaction between the workers. Some schedulers also deal with the problem of ``speculativeness'': whereas some branches of the search tree must be executed compulsorily, others (speculative or-branches) might be pruned by a cut or commit operator. A branch of the or-tree can also be speculative if we can find ways of knowing that it does not lead to a valid solution.
Several schedulers have been proposed to distribute or-work in or-parallel logic programming systems. In shared memory implementations such as Aurora~\cite{aurora:ngc} and Muse~\cite{Ali90}, or-schedulers try to minimise the frequency of task switching and the amount of interaction between the workers. Implementations of other models~\cite{warren:models,disz-lusk-overbeek:anlwam} use methods that keep the cost of task switching constant at the expense of non-constant costs for task creation or variable access. Gupta~\cite{Gupta90} gives a comprehensive taxonomy of or-parallel execution models for logic programs and discusses the issues of variable access time, task switching cost and task creation time. Task switching costs are directly related to scheduling. It is known from~\cite{Ali93} that it is more efficient to keep the cost of task switching non-constant while keeping the other costs -- variable access and task creation -- constant. This section concentrates on describing or-schedulers for parallel systems whose cost of scheduling is non-constant, so reducing the cost of task switching is one of the main subjects discussed here.
Unlike schedulers for distributed systems, schedulers for shared memory systems do not need to tackle the processor communication problem. Instead, the greatest problem to be tackled by an or-parallel scheduler is to control the frequency of task switching in systems where this operation does not take constant time. Two problems of equal importance arise indirectly when the frequency of task switching is not controlled: cache corruption and scheduling overheads. These problems can also occur in distributed memory systems, where the penalty for not maintaining locality of reference is more serious, incurring memory accesses that can be 10,000 times slower than a shared memory access.
There are several different schedulers for or-parallel systems, but the basic strategy is, in general, the same: give preference to mandatory work; if there is no mandatory work, select the least speculative work and repeatedly seek better work. This is the algorithm that seems to fit all current or-parallel systems and has proved to be the best so far. Around this basic algorithm, different schedulers implement some or all of these steps in different ways.
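To make this generic strategy concrete, the sketch below (written in Python purely as illustration) shows one possible reading of the decision an idle or busy worker makes: mandatory work always wins, and a worker holding speculative work keeps looking for a less speculative task. The {\tt Task} representation and the numeric speculation levels are our own invention, not taken from any of the systems discussed below.

\begin{verbatim}
# A minimal sketch of the generic or-scheduling strategy described
# above: prefer mandatory work; otherwise take the least speculative
# task and keep looking for better (less speculative) work.
# The Task class and speculation levels are illustrative only.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    speculation: int   # 0 = mandatory, higher = more speculative

def pick_task(available):
    """Return the best task from the list of available tasks."""
    if not available:
        return None
    # Mandatory work (speculation 0) always wins; otherwise take the
    # least speculative alternative currently known.
    return min(available, key=lambda t: t.speculation)

def worker_loop(available, current=None):
    """One scheduling decision for a (possibly busy) worker."""
    best = pick_task(available)
    if best is None:
        return current          # no work around: keep what we have
    if current is None:
        return best             # idle worker: take anything
    if current.speculation > 0 and best.speculation < current.speculation:
        return best             # switch to less speculative work
    return current              # keep the task we already have

if __name__ == "__main__":
    pool = [Task("deep-right-branch", 3), Task("left-branch", 1)]
    print(worker_loop(pool))                   # idle worker takes the left branch
    print(worker_loop(pool, Task("spec", 2)))  # busy worker upgrades to speculation 1
\end{verbatim}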
The literature describes several strategies for different schedulers, for different models and even for the same model. For example, the Aurora system has at least five scheduling strategies that compete with each other for parallel efficiency. The strategies vary mainly in the way work is chosen in the execution search tree. The schedulers also use different data structures that affect the performance of the or-parallel system in different ways. Some of them, the most recent ones, deal with speculative work. All these strategies are dynamic, i.e., decisions are made based on information collected at runtime. There is at least one known or-scheduler design that is aided by information generated at compile time~\cite{wai-keong92}; it was implemented as a simulator and its results were compared with the Aurora or-parallel system.
Dynamic or-schedulers can be classified as demand-driven or work-driven. Demand-driven schedulers are usually more efficient because they do not require a busy worker to interrupt its own work to publish newly created work. In work-driven systems workers publish their newly created work, which can affect their own performance, since they have to spend time publishing that work. In demand-driven schedulers it is the responsibility of an {\bf idle} processor to look for work, interrupting a busy processor only once, to take the work.
Scheduling strategies can also be classified as preemptive or non-preemptive. Non-preemptive strategies work on a task until its completion or failure. Preemptive strategies may voluntarily suspend any task at any time. Preemption is a very important feature when dealing with speculative work, because processors can try to move to a less speculative task.
There have been many attempts to speed up the execution of or-parallel logic programming systems through various or-scheduling strategies. If the parallel system assumes the same operational semantics as the Prolog language, the scheduler needs to guarantee that the parallel execution of any Prolog program will produce the same output as its sequential counterpart. This is not a trivial task, especially if we observe that many Prolog programs are written to run efficiently on a single processor, making the parallelisation task very difficult. The main problems involved in keeping the execution of Prolog programs correct in a parallel environment come from the use of pruning operators (cut and commit) and from side-effect predicates (I/O, for example).
The first attempts at scheduling in parallel Prolog systems did not consider the problems caused by pruning operators and side-effect predicates. The first or-schedulers tried to address other issues, such as the kind of dispatching the scheduler should use: bottom-most, i.e., take alternatives from the deepest (most recently created) choicepoint (this implements a depth-first search); top-most, i.e., take alternatives from the nodes closer to the root of the or-tree (this implements a breadth-first search); or a combination of the two. Some or-parallel engines (e.g., Aurora and Muse) divide the tree into two parts: private and public. This way each processor can happily work on its private nodes until the search is exhausted and there are no more live nodes in the private region, in which case the processor will try a public backtrack. If there is no public work left, new or-work is made public either on demand from idle workers or according to the choice of a busy worker.
Dispatching on top-most nodes was believed to reduce the public part of the tree and therefore the cost of moving around. It was also believed that nodes closer to the root hold more work. The disadvantage of this approach is that it may cause too frequent task switching if the tasks are small. On the other hand, dispatching on bottom-most nodes opens large public sections, so that work can be found by backtracking (a minor task switch) rather than by moving to another branch (a major task switch).
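The difference between the two dispatching policies can be illustrated with a minimal sketch, assuming a public branch represented simply as a list of live nodes ordered from the root to the deepest node (the representation and node names are invented):

\begin{verbatim}
# Sketch of top-most versus bottom-most dispatching on one public
# branch, represented simply as a list of live nodes ordered from the
# root to the deepest node.  Node names and the list representation
# are illustrative, not taken from any particular system.

def topmost_dispatch(public_branch):
    """Take work from the node nearest the root (breadth-first flavour)."""
    return public_branch[0] if public_branch else None

def bottommost_dispatch(public_branch):
    """Take work from the deepest node (depth-first flavour); favours
    finding further work by local backtracking rather than by moving
    to another branch."""
    return public_branch[-1] if public_branch else None

branch = ["root", "n1", "n2", "deepest"]
print(topmost_dispatch(branch))     # -> 'root'
print(bottommost_dispatch(branch))  # -> 'deepest'
\end{verbatim}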
Next we describe some of the schedulers proposed and used in or-parallel Prolog systems. Four of them are used in the Aurora system, one of them is used in the Muse system, and the others are used for other or-parallel systems.
The Argonne scheduler was one of the first attempts to overcome the problem of efficiently executing a Prolog program in or-parallel. It also dealt with data locality by giving each process the job of finding its own next task, with very limited use of global structures. This is a very important approach, followed by almost all its successor or-schedulers. The scheduler was inspired by ideas of Warren~\cite{warren:sri}, but did not try to address the problems generated by the execution of cuts or side-effect predicates.
The main characteristics of this scheduler are:
\begin{itemize} \item there is no global dispatcher and the use of global structures is very limited; \item the nodes of the tree are divided into ``public'' and ``private''. A process exploring a node in the private part of the tree can run at the same speed as a normal sequential Prolog engine; \item the scheduler's job is restricted to coordinating the movement of workers in the public part of the tree, leaving the private part to be managed by the main engine; \item its strategy is to dispatch on top-most nodes in order to reduce the cost of moving around the execution tree and to minimise the size of the public region of the tree. It was also assumed that top-most dispatching would (heuristically) distribute larger pieces of or-work (high granularity work). \end{itemize}
This approach aims at minimising synchronisation between workers and at making the system scalable by avoiding global structures. It failed to achieve very good efficiency because the system had a race problem: all processors would migrate towards the same piece of or-work, and by the time a worker arrived the work might have disappeared, taken by another worker. This strategy favours a breadth-first search, which is good for all-solutions problems, but not so good for programs that contain speculative work.
The Manchester scheduler was developed in parallel with the Argonne scheduler and within the same project. Like the Argonne scheduler, the Manchester scheduler dispatches on top-most nodes and divides the tree into public and private parts. It differs from the Argonne scheduler in the way it selects nodes from the public part of the tree. Its main concern is the number of bindings that need to be installed and de-installed in the binding array, since Aurora uses the SRI model~\cite{warren:sri} to implement or-parallelism. It maintains a candidate node that is made public whenever the available public work is exhausted, and matches each candidate node with the idle worker that is closest to it. In this way, the Manchester scheduler avoids the race problem that existed in the Argonne scheduler.
This scheduler is one step ahead of the Argonne scheduler in that it deals with pruning operators and side-effect predicates.
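As an illustration of the node-selection criterion of the Manchester scheduler described above, the sketch below approximates the number of bindings to de-install and install in the binding array by the tree distance between an idle worker's current position and a candidate node; the path representation and function names are hypothetical, not the actual Manchester algorithm.

\begin{verbatim}
# A sketch of the kind of cost measure used when matching an idle
# worker to a candidate node: the number of bindings that would have
# to be de-installed and installed in the binding array, approximated
# here by the tree distance between the worker's current position and
# the candidate.  Paths are hypothetical lists of node identifiers.

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def migration_cost(worker_path, candidate_path):
    """Bindings to undo back to the common ancestor plus bindings to
    install down to the candidate node."""
    shared = common_prefix_len(worker_path, candidate_path)
    return (len(worker_path) - shared) + (len(candidate_path) - shared)

def best_candidate(worker_path, candidates):
    return min(candidates, key=lambda c: migration_cost(worker_path, c))

worker = ["root", "a", "b", "c"]
candidates = [["root", "a", "d"], ["root", "x", "y", "z", "w"]]
print(best_candidate(worker, candidates))  # -> ['root', 'a', 'd'] (cost 3 vs 7)
\end{verbatim}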
The Muse scheduler was the first to implement the idea of bottom-most dispatching. It tries to deal with several kinds of applications, including those that contain speculative computation. Workers only search for leftmost work when they perform voluntary suspension; if they are searching for work because they are idle, the strategy is to look for the richest worker, i.e., the one that has most live nodes on its branch. Voluntary suspension is done at time intervals given by the user. The left-to-right ordering between workers is maintained by an array, indexed by worker, that records the number of workers to each worker's left. At suspension time a number of rightmost workers suspend and try to find work to their left, updating the array entries as they do so. The left-to-right order of the tree is thus maintained on demand, and not all workers perform voluntary suspension, just the \(n\) rightmost ones. No voluntary suspension is applied when there are idle workers. One of the problems of the Muse scheduler is that voluntary suspension is also applied to mandatory work.
The Muse scheduler seems to be very successful when dealing with speculative applications.
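The two selection criteria used by Muse -- the richest worker for idle workers, the leftmost work for voluntarily suspending workers -- can be sketched as follows; the per-worker counts of live nodes and of workers to the left are invented for illustration:

\begin{verbatim}
# Sketch of the two selection criteria described for the Muse
# scheduler: an idle worker asks the "richest" worker (the one with
# most live nodes on its branch), whereas a voluntarily suspending
# worker looks for the leftmost work.  The counts below are invented.

workers = {
    "w0": {"live_nodes": 2, "left_count": 0},
    "w1": {"live_nodes": 7, "left_count": 1},
    "w2": {"live_nodes": 4, "left_count": 2},
}

def richest_worker(ws):
    """Target for idle workers: most live nodes on its branch."""
    return max(ws, key=lambda w: ws[w]["live_nodes"])

def leftmost_worker(ws):
    """Target when suspending voluntarily: fewest workers to its left."""
    return min(ws, key=lambda w: ws[w]["left_count"])

print(richest_worker(workers))   # -> 'w1'
print(leftmost_worker(workers))  # -> 'w0'
\end{verbatim}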
The Bristol scheduler was designed primarily to investigate the Muse bottom-most dispatching strategy when applied to the Aurora system. Until then, only top-most dispatching had been used within Aurora, because it was believed that keeping the public part of the tree to a minimum would incur less overhead. As the Muse system was obtaining very good speedups with its bottom-most dispatching strategy, the Aurora implementors started to investigate strategies for bottom-most dispatching. Another motivation for the design of the Bristol scheduler was to overcome some of the disadvantages and reduce the overheads of the Manchester scheduler. A performance study of the Aurora system with the Manchester scheduler~\cite{PeterSz89} pointed to some solutions for improving the performance of Aurora.
The Bristol scheduler followed a long path before it dealt with all the problems of scheduling or-parallel work in Aurora. It first implemented different strategies that could be adjusted manually, together with voluntary suspension. This was a step towards dealing with speculative work: if workers can suspend, they can abandon a task in order to start a less speculative one.
Several strategies have been used by the Bristol or-scheduler (e.g., leftmost and richest worker selection~\cite{Beaumont90}), but the reported results show that the one that gives the best speedups is the hybrid strategy~\cite{Beau91}, where {\bf busy} workers every so often voluntarily suspend to seek less speculative work in the leftmost part of the tree, while {\bf idle} workers seek work from the richest workers. More recently the Bristol scheduler started using pruning operator counters and scope information~\cite{bogdan90} to determine which work is speculative and which is not~\cite{Beau93}. The main algorithm of the Bristol scheduler follows the basic algorithm mentioned before: workers always give preference to mandatory work, and keep seeking better work whenever they are working on speculative work.
Whilst producing good speedups for speculative programs, this scheduler also produces good speedups for programs without any speculative work.
The Bristol scheduler also tackles a problem that was not approached by any of its predecessors. It incorporates a scheme to control resource utilisation by releasing resources whenever they are not needed and requesting them again when the application requires them. Observations of the machine workload showed that efficiency can be much improved by a policy of temporarily releasing workers back to the operating system when there is not enough work available.
The Dharma scheduler's main design goal is the efficient scheduling of speculative work, and it borrows its main ideas from the Wavefront scheduler~\cite{brand:wavefront}. The Dharma scheduler considers the whole tree as speculative and tries to concentrate workers on the deepest leftmost part of the tree. It achieves efficiency by implementing a branch-level dispatching algorithm instead of a node-level one. The branch-level dispatching algorithm uses a data structure, called the active chain, that links the tips of the branches in the public part of the tree. This structure avoids scanning nodes of the execution tree, and consequently avoids locking operations, while maintaining the left-to-right order of the branches in the tree. Each element of the active chain, a tip, represents a branch of the execution tree and contains references to its left and right siblings, its type, and information about the branch itself.
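A minimal sketch of an active chain in this spirit is given below: a doubly linked chain of branch tips kept in left-to-right order, so that the leftmost branch can be found without scanning the nodes of the tree. The field names and the insertion interface are hypothetical, not Dharma's actual data structure.

\begin{verbatim}
# A sketch of an "active chain": a doubly linked chain of branch tips
# kept in left-to-right order, so that the leftmost (least speculative)
# branch can be found without scanning the execution tree.
# Field names and the insertion interface are hypothetical.

class Tip:
    def __init__(self, branch_id):
        self.branch_id = branch_id
        self.left = None     # tip of the branch immediately to the left
        self.right = None    # tip of the branch immediately to the right

class ActiveChain:
    def __init__(self):
        self.leftmost = None

    def insert_right_of(self, tip, new_tip):
        """Add a new branch tip immediately to the right of `tip`
        (or as the leftmost tip when `tip` is None)."""
        if tip is None:
            new_tip.right = self.leftmost
            if self.leftmost:
                self.leftmost.left = new_tip
            self.leftmost = new_tip
            return
        new_tip.left, new_tip.right = tip, tip.right
        if tip.right:
            tip.right.left = new_tip
        tip.right = new_tip

    def remove(self, tip):
        """Remove a tip when its branch finishes, fails or is pruned."""
        if tip.left:
            tip.left.right = tip.right
        else:
            self.leftmost = tip.right
        if tip.right:
            tip.right.left = tip.left

chain = ActiveChain()
a, b = Tip("a"), Tip("b")
chain.insert_right_of(None, a)     # a becomes the leftmost branch
chain.insert_right_of(a, b)        # b is created to the right of a
print(chain.leftmost.branch_id)    # -> 'a'
chain.remove(a)                    # a is pruned; b becomes leftmost
print(chain.leftmost.branch_id)    # -> 'b'
\end{verbatim}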
The scheduler allows voluntary suspension of workers at a given suspension interval, so that they can seek better work to their left. The number of workers to be suspended depends on a parameter given by the user. A new implementation is under way that tries to set the suspension interval automatically by using information from pruning operator counters, as in the recent version of the Bristol scheduler.
The most recent results presented by Sindaha~\cite{raed:thesis} suggest that the strategy and implementation technique have several advantages over the other current schedulers that deal with speculative applications, and that the scheduler also behaves quite well for non-speculative applications. By using the active chain there is less contention when searching for work in leftmost order, and the rate of communication between workers is reduced, because workers do not need to interrupt each other to look for work in the nodes. Synchronisation is still needed to make work public, and an extra overhead is introduced because the tips need to be updated for every operation that adds or removes a branch of the search tree, and for pruning operations.
This strategy and implementation seem to be the most successful so far in dealing with both speculative and non-speculative applications.
Other or-parallel scheduling techniques have been proposed, but they have either not been implemented or only been simulated. Wai-Keong~\cite{wai-keong92} reports an or-scheduling strategy that uses a heuristic task distribution based on assigning ``weights'' to the alternative clauses. The algorithm that assigns the weights is similar to Tick's granularity analysis algorithm~\cite{Tick90b}: each call to a goal counts as weight 1, and each recursive call also counts as weight 1. These weights are assigned to the clauses at compile time by annotating the Prolog program, and the scheduler then uses this information to select tasks to spawn. This technique was originally used by Tick~\cite{Tick90b} to control the scheduling of and-parallel goals in FGHC, but Wai-Keong used it to control the spawning of or-parallel alternatives.
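The weight assignment itself is simple enough to sketch; the toy clause representation below (a head name plus a list of body goal names) is our own, and a real implementation would of course work on the compiler's internal form of the program:

\begin{verbatim}
# Sketch of the simple compile-time weight assignment described above:
# every body goal contributes weight 1, and a recursive call also
# contributes just 1 (it is not expanded).

def clause_weight(head, body_goals):
    # Each call, recursive or not, counts as 1, so the weight is simply
    # the number of body goals.
    return sum(1 for _ in body_goals)

def annotate(program):
    """Return {(head, clause_index): weight} for a toy program."""
    weights = {}
    for head, clauses in program.items():
        for i, body in enumerate(clauses):
            weights[(head, i)] = clause_weight(head, body)
    return weights

# nrev/2 as a toy example:
#   nrev([],[]).
#   nrev([H|T],R) :- nrev(T,RT), append(RT,[H],R).
program = {"nrev/2": [[], ["nrev/2", "append/3"]]}
print(annotate(program))   # -> {('nrev/2', 0): 0, ('nrev/2', 1): 2}
\end{verbatim}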
Another technique is reported in~\cite{Lin92}, where a method is proposed to remove the structural imbalance of programs by global analysis (basically by unfolding/flattening recursive predicates). By removing structural imbalance the author assumes that the or-work becomes evenly distributed during the execution. The following rules are applied to distribute work:
\begin{itemize} \item eager-splitting strategy: at each choicepoint where \(m\) processors are present, assume there are \(n\) valid choices. \(m\) tasks are created and assigned evenly to the \(m\) processors. If \(n \geq m\), each task contains \(\frac{n}{m}\) choices, and the remaining choices are randomly included in some of the tasks. If \(n < m\), each processor randomly picks one choice. \item lazy-splitting strategy: at each choicepoint, two tasks are created, each assigned to half of the processors. If the choices are not evenly divisible, the remaining choices are treated in a way similar to the eager-splitting rule. \end{itemize}
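A sketch of the eager-splitting rule, under the stated assumptions, is given below; the only liberty taken is that the remaining choices go to the first tasks rather than to randomly chosen ones:

\begin{verbatim}
# Sketch of the eager-splitting rule: n valid choices at a choicepoint
# are split among m processors.  When n >= m each task gets n//m
# choices and the remainder is spread over the first tasks (rather
# than random ones); when n < m each processor just gets one choice.

def eager_split(choices, m):
    n = len(choices)
    if n < m:
        return [[c] for c in choices]          # one choice per task
    base, extra = divmod(n, m)
    tasks, start = [], 0
    for i in range(m):
        size = base + (1 if i < extra else 0)  # first tasks absorb the remainder
        tasks.append(choices[start:start + size])
        start += size
    return tasks

print(eager_split(list(range(7)), 3))   # -> [[0, 1, 2], [3, 4], [5, 6]]
print(eager_split(list(range(2)), 4))   # -> [[0], [1]]
\end{verbatim}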
There are a number of systems that exploit and-parallelism in logic programs. Some of them exploit parallelism explicitly~\cite{Cram90,clmcg82}. Others, for philosophical reasons or in order to preserve the portability of Prolog programs, exploit parallelism implicitly~\cite{iclp91e,bevemyr93,Herm86a}.
In systems that implement IAP the main problem in and-scheduling is to implement backtracking inside a CGE correctly. If a goal is selected to execute in and-parallel, and the implementation is WAM-based with contiguous stacks, then on backtracking a worker may not be able to resume a previous goal, because another goal may be executing on the stack above the goal that needs to be restarted. This is shown in Figure~\ref{trapped}. From the goals {\tt a \& b \& c}, worker 1 (w1) selects goals {\tt a} and {\tt c}, while worker 2 (w2) selects goal {\tt b}. Assume that worker 1 finishes goal {\tt a} before worker 2 finishes goal {\tt b}; worker 1 then starts goal {\tt c}. Now assume that worker 2 finishes the execution of goal {\tt b} and needs to backtrack to find another solution for {\tt a}, but {\tt a} cannot continue, because goal {\tt c} has not finished yet. This illustrates the trapped goal problem.
\begin{figure}[htb] \goodrule \centerline{\psfig{figure=trapped_goal.ps}} \caption{\label{trapped}\sc The trapped goal problem} \goodrule \end{figure}
The strategy currently used by the \&-Prolog system is random selection of parallel goals from any agent's stack. As explained above, this strategy is dangerous if we allow goals inside a CGE to backtrack: it can cause the problem of ``trapped'' goals, where the topmost goal in an agent's stack backtracks on top of an older goal that has not finished execution yet. There are three approaches to solving this problem. The first~\cite{Herm87} was designed for the \&-Prolog system itself, and consists of only selecting a goal for parallel execution if it is the oldest one pushed on the goal list; WAM agents thus execute goals only in sequential Prolog order. The second approach is to use a more flexible scheduling strategy that, instead of restricting selection of goals by age, can select any goal. This is achieved by maintaining a separator marker between goals that can be executed in parallel; this scheduling strategy is described by Shen~\cite{Shen93} and used in the DDAS system. A third approach, which will probably be used by the PBA implementation, is to employ a memory management scheme that allocates memory segments dynamically instead of using contiguous chunks of memory.
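The first approach, selecting only the oldest goal available, can be sketched as follows; the goal-list representation and the field names are hypothetical:

\begin{verbatim}
# A sketch of selecting only the oldest available goal, which prevents
# a newer goal from being stacked on top of an older, still-unfinished
# goal and later trapping it on backtracking.

from dataclasses import dataclass

@dataclass
class ParGoal:
    name: str
    age: int            # smaller age = pushed earlier (older)
    taken: bool = False

def select_goal_oldest_first(goal_list):
    """Only hand out a goal if it is the oldest not-yet-taken goal,
    i.e. goals are started in the sequential Prolog order."""
    pending = [g for g in goal_list if not g.taken]
    if not pending:
        return None
    oldest = min(pending, key=lambda g: g.age)
    oldest.taken = True
    return oldest

goals = [ParGoal("a", 0), ParGoal("b", 1), ParGoal("c", 2)]
print(select_goal_oldest_first(goals).name)   # -> 'a'
print(select_goal_oldest_first(goals).name)   # -> 'b' (never 'c' before 'b')
\end{verbatim}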
Parlog, as mentioned before, is a concurrent logic language with declarative and operational semantics different from those of Prolog. Users need to write mode declarations for every clause in the program, and need to use guards to prevent don't-know non-determinism.
Crammond~\cite{Cram90b} describes three different algorithms for scheduling and-goals in the parallel Parlog system and analyses their effect on performance. The algorithms vary basically in the way the pool of goals is implemented. The three strategies are shown below:
\begin{itemize} \item maintain a local run queue per worker; \item maintain local run queues with private and shared parts; \item maintain local run queues with a single global run queue. \end{itemize}
Crammond's experiments were performed on a Sequent Symmetry machine. The paper reports that local run queues with private regions produce the best results in terms of locality and locking overhead, that the amount of work taken from the local run queues is more evenly proportioned, and that this algorithm produced the most consistent execution times.
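One plausible reading of the second strategy -- a per-worker run queue with a private and a shared part -- is sketched below. The policy shown (the owner tops up the shared part from the oldest private goals when it runs low, and other workers steal only from the shared part) is an illustration, not Crammond's exact algorithm, and a real implementation would need locking on the shared part:

\begin{verbatim}
# A rough sketch of a per-worker run queue with a private and a shared
# part.  The publishing policy is one plausible reading of the idea,
# not the exact algorithm from the paper.

from collections import deque

class LocalRunQueue:
    def __init__(self, share_threshold=2):
        self.private = deque()        # only the owning worker touches this
        self.shared = deque()         # other workers may steal from here
        self.share_threshold = share_threshold

    def push(self, goal):             # owner creates a new goal
        self.private.append(goal)

    def pop_local(self):              # owner takes its next goal
        self._publish_if_needed()
        if self.private:
            return self.private.pop()
        return self.shared.popleft() if self.shared else None

    def steal(self):                  # called by an idle worker
        return self.shared.popleft() if self.shared else None

    def _publish_if_needed(self):
        # move the oldest private goals into the shared part when it runs low
        while len(self.shared) < self.share_threshold and len(self.private) > 1:
            self.shared.append(self.private.popleft())

q = LocalRunQueue()
for g in ["g1", "g2", "g3", "g4"]:
    q.push(g)
print(q.pop_local())   # owner works depth-first on its newest goal -> 'g4'
print(q.steal())       # an idle worker steals the oldest published goal -> 'g1'
\end{verbatim}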
A similar scheme was implemented in Andorra-I, but it appeared that, for some programs, goals that produced a lot of determinate work were always being put in the private region of the local run queue, so that work was not being distributed proportionally among the workers. And-scheduling in Andorra-I is discussed in the next chapter.
FGHC (Flat Guarded Horn Clauses) is a language based on GHC~\cite{ueda-cp-book}. Like Parlog, FGHC is a concurrent logic language that supports committed-choice non-determinism and stream communication. It differs from Parlog in that Parlog requires the program to be annotated with mode declarations, and Parlog also requires compile-time analysis to guarantee that the guards are safe, i.e., have at least one solution.
KL1, a language based on FGHC whose parallel implementation is the kernel language for the PIM machine, requires static scheduling. Programs written in KL1 are annotated with KL1 primitives that inform the system which parts of the program code can be parallelised, i.e., the program is partitioned by the programmer. Scheduling in this system is therefore static, relying on the user to partition the program correctly so that parallelism can be exploited.
Tick~\cite{Tick90b} studies several scheduling strategies for selecting goals in FGHC with the aid of a very simple compile-time granularity analysis algorithm. The idea of the algorithm is to assign weights to the clauses in the program: the weight of each goal call in the body of a clause is one, and recursive calls are also assigned weight one. The scheduler of the FGHC parallel emulator running on a Sequent Symmetry machine was adapted to use this information. Demand-distribution methods are discussed, including {\it oldest-first} and {\it heaviest-first} distributions. His results did not show good speedups, due to factors such as high system overheads, the low cost of spawning a goal on a shared memory multiprocessor, and the increase in synchronisation caused by the method. His conclusion was that it does not matter what kind of selection is used to choose which goals to execute next. Newer results have not been reported in the literature.
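The two demand-distribution methods can be contrasted with a small sketch, reusing compile-time weights of the kind computed in the earlier weight-assignment sketch; the pool of goals and the weights are invented:

\begin{verbatim}
# Sketch comparing the two demand-distribution methods mentioned
# above.  The goals, their ages and their weights are illustrative.

goals = [
    {"name": "g_old_light", "age": 0, "weight": 1},
    {"name": "g_new_heavy", "age": 2, "weight": 5},
    {"name": "g_mid",       "age": 1, "weight": 3},
]

def oldest_first(pool):
    """Hand an idle worker the goal created earliest."""
    return min(pool, key=lambda g: g["age"])

def heaviest_first(pool):
    """Hand an idle worker the goal with the largest estimated weight."""
    return max(pool, key=lambda g: g["weight"])

print(oldest_first(goals)["name"])    # -> 'g_old_light'
print(heaviest_first(goals)["name"])  # -> 'g_new_heavy'
\end{verbatim}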
There are few reports in the literature of systems that exploit both and- and or-parallelism~\cite{ecrc:pepsys,Kale87}. Most of them are not very mature systems and therefore do not yet tackle scheduling problems, so very few researchers have worked on distributing and- and or-work. So far, only the problems of scheduling for and-parallel-only systems and for or-parallel-only systems have been tackled; with the most recent systems that aim at exploiting both and- and or-parallelism, the new issue is what kind of work to distribute, and-work or or-work. Some proposed solutions use some kind of global analysis of the program to predict the amount of work to be done in each branch~\cite{DebLinHer,Tick90b,Caslog}. These algorithms work at compile time by computing argument size relations for goals and clauses and generating recurrence equations to be used by the scheduler at runtime. However, these solutions have not been applied to any known parallel logic programming system.
Another solution is to consider that all Prolog applications, when executed in parallel, generate a certain amount of work that can be mapped onto a statistical distribution~\cite{Lin92}. Assuming that all tasks follow the same probability distribution, an algorithm is used that computes the probability of each branch in order to distribute tasks evenly among the processors. Although this technique is applied to a model that aims to exploit and- and or-parallelism, it only addresses the distribution of or-parallel work.
Another solution, used by Gao et al.~\cite{Gao91}, involves generating feedback information for the system by profiling previous interactive sessions. The proposed solution tries to solve almost all problems by using a combination of static, dynamic and hybrid approaches to obtain granularity sizes of Prolog programs, feeding these results into the system together with the most frequently used modes for calls to predicates. The scheduling strategy for or-parallel tasks is to choose the nodes nearest to the root; the strategy for choosing and-goals is based on goal granularity (given at compile time). Nothing is reported about how to decide between and- and or-work, and no results are reported at all.
The latest systems that aim at exploiting and- and or-parallelism and are being implemented are ACE~\cite{Gupta94}, ParAKL~\cite{ParAKLb}, PBA~\cite{PBA92} and Penny~\cite{Penny94}. Some results of a study of and-scheduling strategies for the ParAKL system are reported by Serr~\cite{Serr93}. Results for ACE have not been published yet.
To the best of my knowledge, Andorra-I is the first and/or-parallel logic programming system to have reported results from running real applications, and the first reasonably mature implementation of a system that exploits both and- and or-parallelism. Andorra-I exploits or-parallelism as in Aurora, using the binding array technique of the SRI model to handle the different conditional bindings in each branch of the search tree. It exploits a simple form of and-parallelism by allowing only determinate goals to be executed in parallel. If goals try to bind a common variable to different values, the whole branch fails.
A processing element in Andorra-I is called a worker. Workers are forked at the beginning of the execution, and each worker is associated with a processor in the parallel system. This differs from other logic programming systems, in that the system does not allow more processes to be forked than there are processors available, in order to avoid the overheads of context switching and the interruption of busy workers. Workers in Andorra-I work much as in the Aurora system, but instead of a single worker exploring a branch, several workers cooperate in a team in order to exploit any and-parallelism in the branch. The first version of Andorra-I used a fixed configuration of workers into teams: once the user started the execution with a given configuration, it never changed. If the application produced a tree that did not suit the chosen configuration, Andorra-I would lose parallelism.
The novel contribution of this thesis is to add to Andorra-I a reconfigurer whose objective is to rearrange workers into teams according to the parallelism present in the application. This strategy allows workers to be redeployed from one team to another to look for work. It also decides {\bf what} kind of work to choose -- and-work or or-work -- in order to achieve reasonable load balancing and better overall performance. The reconfigurer also tackles an issue that is new to parallel logic programming systems, whether and-parallel or or-parallel: the efficient utilisation of resources by the parallel system.
The initial decision algorithm was work-guided, i.e., idle workers would decide to take and-work or or-work depending on the amount of work currently available in the system. This strategy proved reasonably good, as the results demonstrate. However, the amount of work available in the system is not always a good indication of the {\bf actual} amount of work to be done. A scheduler based on such a measure would only be good if we had information on the granularity of the program, possibly generated through compile-time analysis. A more reasonable strategy is therefore to allow workers to move freely whenever they become inefficient in a team.
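The flavour of such a work-guided decision can be sketched as follows; the counters, the tie-breaking rule and the return values are illustrative and are not the actual Andorra-I reconfigurer code:

\begin{verbatim}
# A sketch of a work-guided decision: an idle worker compares the
# amount of and-work and or-work currently visible in the system and
# picks the kind that seems more plentiful.  All names and the policy
# details are invented for illustration.

def choose_work_kind(andwork_per_team, or_alternatives, teams):
    """Return 'and' to join the busiest team, or 'or' to take an
    or-alternative and start working on another branch."""
    best_team = max(range(teams), key=lambda t: andwork_per_team[t]) if teams else None
    and_available = andwork_per_team[best_team] if teams else 0
    if and_available == 0 and or_alternatives == 0:
        return "idle", None
    # Compare raw counts of available work items; note that these counts
    # say nothing about the *size* (granularity) of each item, which is
    # exactly the weakness discussed in the text.
    if and_available >= or_alternatives:
        return "and", best_team
    return "or", None

print(choose_work_kind([4, 1], or_alternatives=2, teams=2))  # -> ('and', 0)
print(choose_work_kind([0, 0], or_alternatives=3, teams=2))  # -> ('or', None)
\end{verbatim}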
We dedicate the next chapter to a thorough description of the Andorra Model and the Andorra-I system, since this is the target system to which we made our main contribution.
Systems that exploit or-parallelism are quite mature, and their scheduling techniques improve with each new algorithm and implementation. Researchers have studied or-schedulers extensively, and it is becoming clear what is required of an or-scheduler to make an or-parallel system efficient. The general conclusions are that low communication between workers is important in order to perform well with both speculative and non-speculative applications, that we need an efficient way of keeping the left-to-right order of the branches in the tree, and that we need an efficient algorithm for finding the deepest leftmost work. It is also clear that an or-parallel system can benefit from schedulers that are capable of releasing processors when the application does not need them. The Bristol scheduler, the Dharma scheduler and the Muse scheduler have been successful in doing their jobs.
Other or-scheduling methods, such as the method proposed by Lin~\cite{Lin92}, are valid, although they make some assumptions that are not easy to accept in logic programming. For example, structural imbalance is an intrinsic characteristic of logic programs, so removing it is not an easy task, and doing so at runtime can introduce too many overheads. This strategy is hence more suitable for a static scheduling model than for a dynamic one. The efficiency of such systems is arguable, because the scheduler needs to be centralised in order to keep track of the number of idle processors and the number of tasks not yet taken at any given point of the execution, and to allocate all idle processors to the work at once.
And-schedulers have not been studied as much as or-schedulers, but some conclusions can be drawn. It is not a very good idea to maintain a global pool of goals for scheduling and-goals. Moreover, in systems that exploit IAP it is also important to prevent the trapped goal problem.
The method used by Tick~\cite{Tick90b}, which introduced a compile-time technique to help runtime decisions, although very simple, proved not very effective for scheduling and-parallelism. It might be useful for scheduling or-parallelism, though, as Wai-Keong~\cite{wai-keong92} tried to show. Collecting granularity information for programs can be a good idea if one can spot points in the program that present regular behaviour; this would help both and- and or-schedulers in some cases. In cases where nothing can be concluded about the shape of the execution tree, a strategy that does not rely on compile-time analysis data needs to be used anyway. Assuming that the amount of work in an application follows a given probability distribution seems unrealistic, given that the shapes of Prolog search trees vary widely. The idea of profiling is interesting and can be a source of future research, although we can only profile and draw conclusions for some classes of applications, and not generalise to all applications.
With respect to scheduling both and- and or-parallelism, as we observed in the previous section, Andorra-I~\cite{Yang93} is the only and/or-parallel logic programming system that tackles this problem. So far the results have been quite encouraging: Andorra-I is capable of running a wide range of applications that contain both and- and or-parallelism, or either kind alone, and of achieving good speedups without the need to manually adjust the number of workers exploiting and- and or-parallelism.
In this chapter we presented the concept of scheduling in parallel logic programming systems. We discussed schedulers for or-work and schedulers for and-work. We concentrated first on the best known schedulers for or-work: the Muse scheduler, the Bristol scheduler, and the Dharma scheduler. We then gave an overview of schedulers for and-work by describing how goals are selected in committed-choice systems, in Andorra-I, and in \&-Prolog. Currently, there is no system that employs scheduling strategies for distributing both and-work and or-work, except for the ones implemented for Andorra-I. There are one or two proposals of schedulers to distribute both and-work and or-work, but they were not implemented and have some drawbacks. The PEPSys system, which exploits and-parallelism and or-parallelism, has independent schedulers for and-work and or-work, but does not deal with the problem of scheduling the two kinds of work together.
Lastly, we discussed the scheduling models and scheduling systems presented, with emphasis on scheduling in Andorra-I, which is the target system of this thesis.