Tuesday, January 6, 2009

Applications of Clockless Chips


1. High performance.
2. Low power dissipation.
3. Low noise and low electro-magnetic emission.
4. A good match with heterogeneous system timing.


Figure 5.1 Synchronous circuit
5.1 Asynchronous for High Performance :

In an asynchronous circuit the next computation step can start immediately after the previous step has completed: there is no need to wait for a transition of the clock signal. This leads, potentially, to a fundamental performance advantage for asynchronous circuits, an advantage that increases with the variability in delays associated with these computation steps. However, part of this advantage is canceled by the overhead required to detect the completion of a step. Furthermore, it may be difficult to translate local timing variability into a global system performance advantage.

Data-dependent delays :
The delay of the combinational logic circuit shown in Figure 5.1 depends on the current state and on the values of the primary inputs. The worst-case delay, plus some margin for flip-flop delays and clock skew, is a lower bound for the clock period of a synchronous circuit. The actual delay is therefore always less than the clock period, and sometimes much less.



Figure 5.2 N-bit ripple-carry adder (a) and a self-timed version (b)

A simple example is an N-bit ripple-carry adder (Figure 5.2). The worst-case delay occurs when 1 is added to 2^N − 1: the carry then ripples all the way from FA_1 to FA_N. In the best case there is no carry ripple at all, for example when adding 1 to 0. Assuming random inputs, the average length of the longest carry-propagation chain is bounded by log₂N. For a 32-bit ripple-carry adder the average length is therefore at most 5, yet the clock period must accommodate the worst case of 32 stages, more than six times longer.

It is this average length that determines the average-case delay of an asynchronous ripple-carry adder, which we consider next. In an asynchronous circuit the variation in delays can be exploited by detecting the actual completion of the addition. Most practical solutions use dual-rail encoding of the carry signal (Figure 5.2(b)); the addition has completed when all internal carry signals have been computed, that is, when each pair (cf_i, ct_i) has made a monotonic transition from (0, 0) to (0, 1) (carry = false) or to (1, 0) (carry = true). Dual-rail encoding of the carry signal has also been applied to a carry-bypass adder. When the inputs and outputs are dual-rail encoded as well, completion can be observed directly at the outputs of the adder.
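The log₂N average can be checked with a short simulation. The sketch below is not from the source; the function name and trial count are illustrative. It measures the longest run of carry-propagate positions (bit positions where the two operand bits differ), a standard proxy for the length of the longest carry-ripple chain, for random 32-bit operands and compares it with the worst case of N stages.

```python
import random

def longest_propagate_run(a: int, b: int, n: int) -> int:
    """Longest run of 'propagate' positions (a_i XOR b_i == 1): a standard
    proxy for the length of the longest carry-ripple chain."""
    run = longest = 0
    for i in range(n):
        if ((a >> i) ^ (b >> i)) & 1:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest

N, TRIALS = 32, 100_000
avg = sum(longest_propagate_run(random.getrandbits(N), random.getrandbits(N), N)
          for _ in range(TRIALS)) / TRIALS
print(f"average longest propagate run for N={N}: {avg:.2f}  (worst case: {N})")
```

For N = 32 the measured average comes out a little below 5, consistent with the log₂N bound, while a synchronous adder must still budget its clock period for all 32 stages.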

Elastic pipelines:

In general it is not easy to translate a local asynchronous advantage in average-case performance into a system-level performance advantage. Today's synchronous circuits are heavily pipelined and retimed; critical paths are carefully balanced, and little room is left for an asynchronous benefit. Moreover, an asynchronous benefit of this kind must be weighed against the possible overhead of completion signaling and asynchronous control.

In an asynchronous (elastic) pipeline, each stage has its own latch controller. The controller communicates exclusively with the controllers of the immediately preceding and succeeding stages by means of handshake signaling, and it controls the state of the stage's data latches (transparent or opaque). Between a request and the following acknowledge the corresponding data wires must be kept stable.
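To make this handshake behaviour concrete, the following behavioural sketch (not from the source; the stage delays and token counts are purely illustrative) models an elastic pipeline with Python threads. Bounded queues stand in for the handshake channels between latch controllers: a stage takes a token only when its predecessor offers one, and can pass its result on only when its successor has room, so a slow stage automatically stalls the stages that feed it.

```python
import queue
import random
import threading
import time

def stage(inp, out, max_delay):
    """One pipeline stage: take a token when the predecessor offers one
    (request), work on it for a variable time, then hand it on as soon as
    the successor has room (acknowledge)."""
    while True:
        token = inp.get()                          # blocks until the previous stage offers data
        if token is None:                          # shutdown marker
            out.put(None)
            return
        time.sleep(random.uniform(0, max_delay))   # data-dependent processing time
        out.put(token)                             # blocks while the next stage is still 'opaque'

# Bounded queues play the role of handshake channels between latch controllers.
channels = [queue.Queue(maxsize=1) for _ in range(4)]
for i, max_delay in enumerate([0.002, 0.010, 0.001]):
    threading.Thread(target=stage,
                     args=(channels[i], channels[i + 1], max_delay)).start()

for token in range(8):
    channels[0].put(token)                         # the environment offers tokens as fast as accepted
channels[0].put(None)

for token in iter(channels[3].get, None):
    print("completed", token)
```

Because every channel holds at most one token, data is never overwritten before it has been acknowledged, which is the software analogue of keeping the data wires stable between request and acknowledge.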


5.2 Asynchronous for Low Power :

Dissipating when and where active:
The classic example of a low-power asynchronous circuit is a frequency divider. A D flip-flop with its inverted output fed back to its input divides an incoming (clock) frequency by two (Figure 5.3(a)). A cascade of N such divide-by-two elements (Figure 5.3(b)) divides the incoming frequency by 2^N.

Figure 5.3 Divide-by-two element (a) and divide-by-2^N circuit (b)
The second element runs at only half the rate of the first one and hence dissipates only half the power; the third one dissipates only a quarter, and so on. Hence, the entire asynchronous cascade consumes, over a given period of time, slightly less than twice the power of its head element, independent of N. That is, fixed power dissipation is obtained.

In contrast, a similar synchronous divider would dissipate power in proportion to N. A cascade of 15 such divide-by-two elements is used in watches to convert a 32 kHz crystal clock down to a 1 Hz clock.
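A back-of-the-envelope sketch (illustrative only, not from the source) makes this contrast concrete: in the asynchronous cascade stage k toggles at f/2^k, so the total activity stays below 2f regardless of N, whereas in a synchronous realization every flip-flop's clock input is driven at the full input rate, so activity grows linearly with N.

```python
def async_toggles_per_second(f_in: float, n_stages: int) -> float:
    """Total output-toggle rate of an asynchronous divide-by-two cascade:
    stage k only toggles at f_in / 2**k."""
    return sum(f_in / 2 ** k for k in range(n_stages))

def sync_clock_loads_per_second(f_in: float, n_stages: int) -> float:
    """In a synchronous realization every stage's clock input is driven
    f_in times per second, whether or not its state changes."""
    return f_in * n_stages

f = 32_768  # 32 kHz watch crystal
for n in (4, 15):
    print(f"N={n:2d}: async ~{async_toggles_per_second(f, n):,.0f} transitions/s "
          f"(< 2*f = {2*f:,}), synchronous clock loads {sync_clock_loads_per_second(f, n):,.0f}/s")
```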
The low-power potential of asynchronous circuits depends on the application. For example, in a digital filter where the clock rate equals the data rate, all flip-flops and all combinational circuits are active during each clock cycle, and little or nothing can be gained by implementing the filter as an asynchronous circuit. In many digital signal processing functions, however, the clock rate exceeds the data (signal) rate by a large factor, sometimes by several orders of magnitude; the clock frequency is chosen that high to accommodate sequential algorithms that share resources over subsequent computation steps. In such circuits only a small fraction of the registers change state during a clock cycle, and this fraction may be highly data dependent. Here an asynchronous implementation can offer vastly improved electrical efficiency, which leads directly to prolonged battery life.

One application for which asynchronous circuits can save power is a Reed-Solomon error corrector operating at audio rates, as demonstrated at Philips Research Laboratories. Two different asynchronous realizations of this decoder (single-rail and dual-rail) were compared with a synchronous (product) version. The single-rail version was clearly superior, consuming only a fifth of the power of the synchronous version.

A second example is the infrared communications receiver IC designed at Hewlett-Packard/Stanford. The receiver IC draws only leakage current while waiting for incoming data, but can start up as soon as a signal arrives so that it loses no data. Also, most modules operate well below the maximum frequency of operation.

The filter bank of a digital hearing aid was the subject of another successful demonstration, this time by the Technical University of Denmark in cooperation with Oticon Inc. They re-implemented an existing filter bank as a fully asynchronous circuit; the result was a factor-of-five reduction in power consumption.

A fourth application is a pager in which several power-hungry sub-circuits were redesigned as asynchronous circuits.

5.3 Asynchronous for Low Noise and Low Emission :

Sub-circuits of a system may interact in unintended and often subtle ways. For example, a digital sub-circuit generates voltage noise on the power-supply lines or induces currents in the silicon substrate. This noise may affect the performance of an analog-to-digital converter that draws power from the same supply or is integrated on the same substrate. Another example is a digital sub-circuit that emits electromagnetic radiation at its clock frequency (and its higher harmonics), and a radio receiver sub-circuit that mistakes this radiation for a radio signal.
Due to the absence of a clock, asynchronous circuits may have better noise and EMC (Electro-Magnetic Compatibility) properties than synchronous circuits. This advantage can be appreciated by analyzing the supply current of a clocked circuit in both the time and frequency domains.
The circuit activity of a clocked circuit is usually maximal shortly after the productive clock edge; it gradually fades away, and the circuit must become quiescent before the next productive clock edge. Viewed differently, the clock signal modulates the supply current. Owing to parasitic resistance and inductance in the on-chip and off-chip supply wiring, this causes noise on the on-chip power and ground lines.
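The same argument can be illustrated with a crude numerical model; the sketch below is an assumption-laden toy (the pulse shape, duty cycle, and random spreading are invented), not measured data. It compares the spectrum of a supply current that bursts after every clock edge with the spectrum of the same average activity spread randomly in time: the clocked current concentrates its energy at the clock frequency and its harmonics, while the clockless current's spectrum is much flatter.

```python
import numpy as np

fs, f_clk, duration = 1_000_000, 10_000, 0.02   # sample rate, clock rate, 20 ms window
t = np.arange(0, duration, 1 / fs)

# Clocked: a burst of supply current right after every clock edge,
# dying out well before the next edge.
sync_current = ((t * f_clk) % 1.0 < 0.2).astype(float)

# Clockless: the same average activity, but spread at random instants.
rng = np.random.default_rng(0)
async_current = (rng.random(t.size) < sync_current.mean()).astype(float)

freqs = np.fft.rfftfreq(t.size, 1 / fs)
for name, current in (("clocked", sync_current), ("clockless", async_current)):
    spectrum = np.abs(np.fft.rfft(current - current.mean()))
    peak = freqs[np.argmax(spectrum)]
    print(f"{name:9s}: strongest spectral line near {peak / 1e3:.1f} kHz, "
          f"peak/mean ratio {spectrum.max() / spectrum.mean():.0f}")
```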



5.4 Heterogeneous Timing :

Two on-going trends affect the timing of a system-on-a-chip: the relative increase of interconnect delays versus gate delays, and the rapid growth of design reuse. Their combined effect is an increasingly heterogeneous organization of system-on-chip timing. As Figure 5.4 shows, gate delays decrease rapidly with each technology generation. By contrast, the delay of a piece of interconnect of fixed, modest length increases, soon leading to a dominance of interconnect delay over gate delay. The introduction of additional interconnect layers and new materials (copper and low-dielectric-constant insulators) may slow this trend somewhat. Nevertheless, new circuits and architectures are required to circumvent these parasitic limitations.

Figure 5.4 Gate delay versus interconnect delay across technology generations
For example, across-chip communication may no longer fit within a single clock period of a processor core.

Heterogeneous system timing will pose considerable design challenges for system-level interconnect, including buses, FIFOs, switch matrices, routers, and multi-port memories. Asynchrony makes it easier to interconnect modules running at different clock frequencies, without worrying about synchronization problems, differences in clock phase and frequency, or clock skew. Hence, new opportunities will arise for asynchronous interconnect structures and protocols. Once asynchronous on-chip interconnect structures are accepted, the threshold for attaching asynchronous clients to these interconnects is lowered as well. Mixed synchronous-asynchronous circuits also hold promise.
Asynchronous logic circuits (Stop the clocks) :
As the name suggests, a clockless design does away with the cardinal rule of chip design: that everything marches to the beat of an oscillating crystal “clock”. For a 1 GHz chip, this clock ticks one billion times a second, and all of the chip’s processing units co-ordinate their actions with these ticks to ensure that they remain in step. Asynchronous, or “clockless”, designs, in contrast, allow different parts of a chip to work at different speeds, sending data to and from each other as and when appropriate.
Clockless processors, also called asynchronous or self-timed, don’t use the oscillating crystal that serves as the regularly “ticking” clock that paces the work done by traditional synchronous processors. Rather than waiting for a clock tick, clockless-chip elements hand off the results of their work as soon as they are finished.


Figure 2.1


2.1 How clockless chips work :

There are no purely asynchronous chips yet. Instead, today’s clockless processors are actually clocked processors with asynchronous elements. Clockless elements use perfect clock gating, in which circuits operate only when they have work to do, not whenever a clock ticks. Instead of clock-based synchronization, local handshaking controls the passing of data between logic modules. In a memory read, for example, the asynchronous processor places the address of the stored data it wants onto the address bus and issues a request for the information. The memory reads the address off the bus, finds the information, and places it on the data bus. The memory then acknowledges that the data is available. Finally, the processor grabs the information from the data bus.
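This read sequence is essentially a request/acknowledge handshake, and the sketch below models it with Python threading events. The class and method names are hypothetical, and the return-to-zero half of a four-phase protocol is omitted for brevity.

```python
import threading

class HandshakeBus:
    """Toy model of the read transaction described above: a bundled-data
    request/acknowledge exchange between a processor and a memory."""

    def __init__(self, memory):
        self.memory = memory
        self.req = threading.Event()
        self.ack = threading.Event()
        self.addr = None
        self.data = None

    def memory_side(self):
        self.req.wait()                      # wait for the processor's request
        self.data = self.memory[self.addr]   # find the information, drive the data bus
        self.ack.set()                       # acknowledge: the data bus is now valid

    def processor_read(self, addr):
        self.addr = addr                     # place the address on the address bus
        self.req.set()                       # issue the request
        self.ack.wait()                      # wait for the memory's acknowledge
        return self.data                     # grab the information from the data bus

bus = HandshakeBus(memory={0x40: 1234})
threading.Thread(target=bus.memory_side).start()
print(bus.processor_read(0x40))              # -> 1234
```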

According to Jorgenson, “Data arrives at any rate and leaves at any rate. When the arrival rate exceeds the departure rate, the circuit stalls the input until the output catches up.”

The many handshakes themselves require more power than a clock’s operations. However, clockless systems more than offset this because, unlike synchronous chips, each circuit uses power only when it performs work.

2.2 Clockless advantages :

In synchronous designs, the data moves on every clock edge, causing voltage spikes. In clockless chips, data doesn’t all move at the same time, which spreads out current flow, thereby minimizing the strength and frequency of spikes and emitting less EMI. Less EMI reduces both noise-related errors within circuits and interference with nearby devices.

2.2.1 Power efficiency, responsiveness, and robustness :

Because asynchronous chips have no clock and each circuit powers up only when used, asynchronous processors use less energy than synchronous chips by providing only the voltage necessary for a particular operation.

According to Jorgenson, clockless chips are particularly energy-efficient for running video, audio, and other streaming applications — data-intensive programs that frequently cause synchronous processors to use considerable power. Streaming applications have frequent periods of dead time — such as when there is no sound or when video frames change very little from their immediate predecessors — and little need for running error-correction logic. During this inactive time, asynchronous processors use very little power. And because clockless processors activate only the circuits needed to handle data, the unused circuits remain ready to respond quickly to other demands.

Asynchronous chips also run cooler and have fewer and lower voltage spikes, so they are less likely to experience temperature-related problems and are more robust. Because they use handshaking, clockless chips give data time to arrive and stabilize before circuits pass it on. This contributes to reliability because it avoids the rushed data handling that central clocks sometimes necessitate, according to University of Manchester Professor Steve Furber, who runs the Amulet project.



2.2.2 Simple, efficient design :

Logic modules can be developed without regard to compatibility with a central clock frequency, which makes the design process easier. Also, because asynchronous processors don’t need specially designed modules that all work at the same clock frequency, they can use standard components. This enables simpler, faster design and assembly.

The recent use of both domino logic and the delay-insensitive mode in asynchronous processors has created a fast approach known as integrated pipelines mode.
Domino logic improves performance because a system can evaluate several lines of data at a time in one cycle, as opposed to the typical approach of handling one line in each cycle. Domino logic is also efficient because it acts only on data that has changed during processing, rather than acting on all data throughout the process. The delay-insensitive mode allows an arbitrary time delay for logic blocks. “Registers communicate at their fastest common speed. If one block is slow, the blocks that it communicates with slow down,” said Jorgenson. This gives a system time to handle and validate data before passing it along, thereby reducing errors.
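A quick numeric illustration of the quoted behaviour (the block delays are invented and handshake overhead is ignored): in steady state such a chain hands over one token per slowest-block delay, so the whole pipeline settles at the rate of its slowest block.

```python
# Illustrative block delays in nanoseconds (assumed numbers, not from the text).
block_delays_ns = [1.2, 0.8, 3.5, 0.9]

# In steady state a delay-insensitive chain hands over one token per
# slowest-block delay, ignoring handshake overhead.
throughput_mhz = 1e3 / max(block_delays_ns)
print(f"effective rate ~{throughput_mhz:.0f} MHz, set by the {max(block_delays_ns)} ns block")
```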
