M-7 -- 6.3 Proposal of two Executable Datatypes REAL

Go backward to 6.2 Proposal of an Abstract Datatype REAL
Go up to 6 Towards a Datatype REAL

6.3 Proposal of two Executable Datatypes REAL

Proposal Based on [KM81].

We suggest to specify a datatype resp. a family of datatypes ExecutableRealKulischMiranker depending on roundings r, which we present later. These datatypes consist of

a sort S[b,l,e1,e2] which corresponds to the set S^*:=S(b,l,e1,e2)^*,
the arithmetic operations +, -, *, /, where these operations respect the law RG for the rounding r, and
a rounded sum operator, which computes the value r(SUM_i=1ⁿ x_i), where n > 1 and x_i e S^* for 1 < i < n.

At the first sight it seems to be doubtful that formula (RG) can be met by an executable specification. In order to determine the approximation r(x ° y), the exact but unknown result x ° y seems to be required. But [KM81] shows by means of algorithms for ° e {+,*,-,/} that whenever x ° y is not representable in the finite floating point system, it is sufficient to replace it by an appropriate and representable value x ^õ y. The latter will have the property r(x ° y) = r(x ^õ y). Thus the algorithms compute first the value x ^õ y in a sligthly larger floating point system and then apply a rounding r on this value to obtain x ° _r y = r( x ° y) = r(x ^õ y). The algorithms presented in [KM81] work for the roundings

monotone downwardly directed rounding \¯/:
forall x e R^*: \¯/(x):= max{ y e S^* | y < x }.
monotone upwardly directed rounding /\:
forall x e R^*: /\(x):= min{ y e S^* | y > x }.
monotone rounding toward zero []_b:

forall x e R^*, x > 0: []_b(x):= \¯/(x) and

forall x e R^*, x < 0: []_b(x):= - []_b(-x).
monotone rounding away from zero []₀:

forall x e R^*, x > 0: []₀(x):= /\(x) and

forall x e R^*, x < 0: []₀(x):= - []₀(-x).

roundings []_µ, for µ e {1,2, ..., b-1}:

forall x e R^*, x e [0, b^{e1 -1}): []_µ(x):= 0,

forall x e R^*, x > b^{e1 -1}: []_µ(x):= {
\¯/(x) , x e [\¯/(x), s_µ(x))

/\(x) , x e [s_µ(x),\¯/(x)],

forall x e R^*, x < 0: []_µ(x):= - []_µ(-x),

where

forall x e R^*: s_µ:= \¯/(x) + (/\(x) - \¯/(x) ) ·µ/b.

If b is an even number, []_b/2 denotes the rounding to the nearest floating point number.

The monotone directed roundings \¯/ and /\ are needed for interval computations - see e.g. [Moo88] for a discussion of this topic.

Further, [KM81] present algorithms to compute the result of the rounded sum operator for all of the above mentioned roundings.

Thus we suggest to specify ExecutableRealKulischMiranker for all theses roundings. This leads - depending on the algorithms presented in [KM81] - to one specification with a parameter for the rounding or to a family of specifications, where each specification "implements" a special rounding.

Proposal Based on the IEEE Standards.

For a comprehensive discussion of the IEEE standards we refer to [Gol91]. He summarizes the standards as follows:

There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires b=2, l=24 for single precision and l=53 for double precision [IEE87]. It also specifies the precise layout of bits in a single and double precision. IEEE 854 allows either b=2 or b=10 and unlike 754, does not specify how floating-point numbers are encoded into bits [Cod84]. It does not require a particular value for l, but instead it specifies constraints on the allowable values of l for single and double precision.⁸

The standard IEEE 754 e.g. defines the following format parameters:

parameter single single extended double double extended

l 24 > 32 53 > 64

e2 +127 > +1023 +1023 > 16383

e1 -126 < +1022 -1022 < 16382

Both IEEE standards cover the basic operations +,-,*,/, and also specify square root, remainder and the two conversions between integer and floating point numbers and between the internal format and decimal output. They require all these operations to be exactly rounded. That is, the result must be computed exactly and then rounded⁹. There are several rounding modes to be provided: the default is rounding to the nearest floating point number. But also rounding toward 0, rounding toward + infty and rounding toward - infty are covered.

We suggest to specify both standards as ExecutableReal754 and ExecutableReal854 with suitable parameters like single, double, base, etc.

Comparison of the Proposed Executable Datatypes.

The datatype ExecutableRealKulischMiranker and both specifications based on the IEEE standards, ExecutableReal754 and ExecutableReal854, are on the one hand independent of the specifications BasicReal and AbstractExecutableReal as they rely on floating-point systems, which can be realized in CASL without any connection to the sort Real. On the other hand they are closely related with these specifications by the roundings.

The main differences between the two proposals are:

underlying floating point systems:

Both proposals are based on finite floating point systems. This is the reason for the phrase "executable" in their names.

The first proposal includes the values - infty and + infty, while in the second the minimal and maximal elements are real numbers.
In the floating point system of [KM81] the parameters base b, length l, minimal exponent e1, and maximal exponent e2 can be freely chosen, while the IEEE standards choose the base b e {2, 10}, and give at least constraints on the values of l, e1, and e2.

roundings:

The computer arithmetic of [KM81] is based on the theory of screens and roundings. It has a larger variety of roundings than the IEEE standards (even if base b = 2).

operations:

A significant difference between both proposals is the rounded sum operator of [KM81]. This operator is important in numerical computations that deal with vectors: to compute the scalar product of two vectors with n components, with the rounded sum operator only n+1 roundings are necessary (n roundings for the multiplications and one rounding for the sum), while the IEEE standards need 2n-1 roundings.

Compared with the proposal based on [KM81] the IEEE standards include as additional operations square root, remainder and the two conversions.

transcendental functions:

The table maker's dilemma is the reason for the different treatment of the transcendental functions in both approaches: While [KM81] does not treat them at all, the IEEE standards do not require them to be exactly rounded.

CoFI Note: M-7 -- Version: 0.2 -- 13 April 1999.
Comments to till@informatik.uni-bremen.de

forall x e R^*, x > 0:	`[]`_b(x):= \¯/(x) and
forall x e R^*, x < 0:	`[]`_b(x):= - `[]`_b(-x).

forall x e R^*, x > 0:	`[]`₀(x):= /\(x) and
forall x e R^*, x < 0:	`[]`₀(x):= - `[]`₀(-x).

parameter	single	single extended	double	double extended
l	24	> 32	53	> 64
e2	+127	> +1023	+1023	> 16383
e1	-126	< +1022	-1022	< 16382