Scalar IEEE 754 Float API

group scalar_f32_api

Functions

void f32_unpack(int32_t *mantissa, exponent_t *exp, const float input)

Unpack an IEEE 754 single-precision float into a 32-bit mantissa and exponent.

Example

// Unpack 1.52345246 * 10^(-5)
float val = 1.52345246e-5f;
int32_t mant;
exponent_t exp;
f32_unpack(&mant, &exp, val);

printf("%ld * 2^(%d) <-- %e\n", (long) mant, exp, val);

Parameters:

mantissa – [out] Unpacked output mantissa
exp – [out] Unpacked output exponent
input – [in] Float value to be unpacked
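
The decomposition can be illustrated with a portable sketch built on the standard frexpf() and ldexpf(). The helper name sketch_unpack_f32, the local exponent_t typedef, and the 24-bit mantissa scale are assumptions made for illustration; the scaling that f32_unpack() itself applies to the mantissa is not stated here.

```c
#include <math.h>
#include <stdint.h>

typedef int exponent_t;  // assumption: local stand-in for the library's exponent type

// Hypothetical helper: split a finite float so that input == mantissa * 2^exponent.
static void sketch_unpack_f32(int32_t *mantissa, exponent_t *exponent, const float input)
{
    int e;
    const float m = frexpf(input, &e);    // input = m * 2^e, with 0.5 <= |m| < 1 or m == 0
    *mantissa = (int32_t) ldexpf(m, 24);  // a float significand holds at most 24 bits, so this is exact
    *exponent = e - 24;                   // compensate for the 2^24 scale applied above
}
```

With this scaling, 1.0f unpacks to a mantissa of 8388608 (2^23) and an exponent of -23, and multiplying the two back together reproduces the input without rounding.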

void f32_unpack_s16(int16_t *mantissa, exponent_t *exp, const float input)

Unpack an IEEE 754 single-precision float into a 16-bit mantissa and exponent.

Example

// Unpack 1.52345246 * 10^(-5)
float val = 1.52345246e-5f;
int16_t mant;
exponent_t exp;
f32_unpack_s16(&mant, &exp, val);

printf("%d * 2^(%d) <-- %e\n", mant, exp, val);

Note

This operation may result in a loss of precision.

Parameters:

mantissa – [out] Unpacked output mantissa
exp – [out] Unpacked output exponent
input – [in] Float value to be unpacked
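
The precision loss noted above follows from fitting a 24-bit float significand into 16 bits. The sketch below shows one way this could look in portable C; sketch_unpack_f32_s16, the local exponent_t typedef, and the use of truncation rather than rounding are assumptions, not the library's documented behavior.

```c
#include <math.h>
#include <stdint.h>

typedef int exponent_t;  // assumption: local stand-in for the library's exponent type

// Hypothetical helper: approximate a finite float as mantissa * 2^exponent with a 16-bit mantissa.
static void sketch_unpack_f32_s16(int16_t *mantissa, exponent_t *exponent, const float input)
{
    int e;
    const float m = frexpf(input, &e);    // input = m * 2^e, with 0.5 <= |m| < 1 or m == 0
    *mantissa = (int16_t) ldexpf(m, 15);  // keeps the top 15 magnitude bits; lower bits are truncated
    *exponent = e - 15;                   // compensate for the 2^15 scale applied above
}
```

Under these assumptions the relative error of the reconstructed value is below 2^-14, while values such as 0.5f that need few significant bits survive unchanged.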

float_s32_t f32_to_float_s32(const float x)

Convert an IEEE 754 float to a float_s32_t.

Parameters:

x – [in] Input value

Throws ET_ARITHMETIC:

Raised if x is infinite or NaN

Returns:

float_s32_t representation of x

float_s32_t f64_to_float_s32(const double x)

Convert an IEEE 754 double to a float_s32_t.

Note

This operation may result in precision loss.

Parameters:

x – [in] Input value

Throws ET_ARITHMETIC:

Raised if x is infinite or NaN

Returns:

float_s32_t representation of x
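
To see where the precision loss comes from, consider that a double carries a 53-bit significand while a signed 32-bit mantissa holds 31 magnitude bits. The following sketch is illustrative only: sketch_float_s32 is a hypothetical stand-in for float_s32_t (its mant and exp fields are assumed), and truncation is assumed in place of whatever rounding the library uses. It expects a finite input.

```c
#include <math.h>
#include <stdint.h>

typedef int exponent_t;  // assumption: local stand-in for the library's exponent type

// Hypothetical stand-in for float_s32_t: value = mant * 2^exp.
typedef struct {
    int32_t mant;
    exponent_t exp;
} sketch_float_s32;

// Hypothetical helper: convert a finite double, keeping only 31 magnitude bits.
static sketch_float_s32 sketch_f64_to_float_s32(const double x)
{
    int e;
    const double m = frexp(x, &e);     // x = m * 2^e, with 0.5 <= |m| < 1 or m == 0
    sketch_float_s32 r;
    r.mant = (int32_t) ldexp(m, 31);   // |m * 2^31| < 2^31, so the cast cannot overflow
    r.exp = e - 31;                    // compensate for the 2^31 scale applied above
    return r;
}
```

In this sketch 0.75 converts exactly, whereas 1.0 / 3.0 comes back slightly smaller than the original double because its low-order significand bits are discarded.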

float f32_sin(const float theta)

Get the sine of a specified angle.

Computes \(sin(\theta)\) using the power series expansion of \(sin()\) truncated to 8 terms.

This implementation is meant to make optimal use of the XS3 floating-point unit.

Parameters:

theta – [in] Angle \(\theta\) to compute the sine of (in radians)

Throws ET_ARITHMETIC:

Raised if \(\theta\) is infinite or NaN

Returns:

Sine of the angle \(\theta\)
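
As a rough illustration of an 8-term truncation, the sketch below evaluates the Maclaurin series of sine through the \(\theta^{15}\) term using Horner's rule in \(\theta^2\). It assumes that "8 terms" means the first eight non-zero terms, and it performs no range reduction; sketch_sin is a hypothetical name and is not the library's implementation.

```c
// Coefficients (-1)^k / (2k+1)! for k = 0..7.
static const float sketch_sin_coef[8] = {
    1.0f,
    -1.0f / 6.0f,
    1.0f / 120.0f,
    -1.0f / 5040.0f,
    1.0f / 362880.0f,
    -1.0f / 39916800.0f,
    1.0f / 6227020800.0f,
    -1.0f / 1307674368000.0f,
};

// Hypothetical helper: sin(theta) ~= theta * sum_k c_k * (theta^2)^k, k = 0..7.
static float sketch_sin(const float theta)
{
    const float z = theta * theta;
    float s = 0.0f;
    for (int k = 7; k >= 0; --k) {
        s = s * z + sketch_sin_coef[k];  // Horner step in z = theta^2
    }
    return theta * s;
}
```

The first omitted term is \(\theta^{17}/17!\), so the truncation error stays small for \(|\theta| \le \pi\) and grows quickly beyond that range unless the angle is reduced first.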

float f32_cos(const float theta)

Get the cosine of a specified angle.

Computes \(cos(\theta) = sin(\theta+\frac{\pi}{2})\) using the power series expansion of \(sin()\) truncated to 8 terms.

This implementation is meant to make optimal use of the XS3 floating-point unit.

Parameters:

theta – [in] Angle \(\theta\) to compute the cosine of (in radians)

Throws ET_ARITHMETIC:

Raised if \(\theta\) is infinite or NaN

Returns:

Cosine of the angle \(\theta\)

float f32_log2(const float x)

Get the base-2 logarithm of the specified value.

This function computes \(log_2(x)\) using the power series expansion of \(log_2()\) truncated to 11 terms.

Parameters:

x – [in] Input value \(x\) to get the logarithm of.

Throws ET_ARITHMETIC:

Raised if \(x\) is infinite or NaN

Returns:

\(log_2(x)\)
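
The exact series and range reduction used by f32_log2() are not given here. As one plausible shape for such a computation, the sketch below splits \(x = a \cdot 2^p\) with frexpf(), sums 11 terms of the series for \(\ln(1+u)\) with \(u = a - 1\), and converts to base 2. The name sketch_log2 and every step of the method are assumptions; the input is assumed to be finite and positive.

```c
#include <math.h>

// Hypothetical helper: log2(x) = p + ln(a) / ln(2), where x = a * 2^p and 0.5 <= a < 1.
static float sketch_log2(const float x)
{
    int p;
    const float a = frexpf(x, &p);
    const float u = a - 1.0f;  // -0.5 <= u < 0

    // ln(1 + u) ~= sum_{k=1}^{11} (-1)^(k+1) * u^k / k, evaluated with Horner's rule.
    float s = 0.0f;
    for (int k = 11; k >= 1; --k) {
        const float c = ((k & 1) ? 1.0f : -1.0f) / (float) k;
        s = c + u * s;
    }
    const float ln_a = u * s;

    return (float) p + ln_a * 1.44269504f;  // 1.44269504 ~= 1 / ln(2)
}
```

With this plain series the worst case is \(a = 0.5\), where the truncation error is on the order of \(10^{-4}\); a production implementation would likely use a tighter range reduction or tuned coefficients to do better with the same number of terms.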

float f32_power_series(const float x, const float b[], const unsigned N)

Compute power series summation using specified coefficients.

This function is used to compute the sum of terms in a power series, truncated to \(N\) terms, starting with the \(x^0\) term.

b is an \(N\)-element vector of coefficients \(\bar b\) which are multiplied by the corresponding powers of \(x\).

\(N\) is the length of \(\bar b\) and number of terms to sum together.

Operation Performed:

\[\begin{flalign*} & a \leftarrow \sum_{k=0}^{N-1}\left( x^k \cdot b_k \right) && \end{flalign*}\]

Parameters:

x – [in] Input value \(x\).
b – [in] Vector of coefficients \(\bar b\).
N – [in] Number of power series terms to sum.

Throws ET_ARITHMETIC:

Raised if \(x\) or any element of \(\bar b\) is infinite or NaN.

Returns:

\(a\), the sum of the first \(N\) power series terms.
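
A sum of this form is commonly evaluated with Horner's rule, which avoids computing each power of \(x\) separately. The sketch below is a plain C illustration of the operation described above; sketch_power_series is a hypothetical name, and whether f32_power_series() uses Horner's rule internally is not stated here.

```c
// Hypothetical helper: a = b[0] + b[1]*x + ... + b[N-1]*x^(N-1), via Horner's rule.
static float sketch_power_series(const float x, const float b[], const unsigned N)
{
    float a = 0.0f;
    for (unsigned k = N; k > 0; --k) {
        a = a * x + b[k - 1];  // fold in coefficients from highest order to lowest
    }
    return a;  // 0 when N == 0
}
```

For example, with coefficients {1, 2, 3} and \(x = 2\) the result is \(1 + 2 \cdot 2 + 3 \cdot 4 = 17\).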

float f32_normA(exponent_t *p, const float x)

Get a representation of the input \(x\) in normalized form A.

This function is used internally to transform a float value into a representation required for certain purposes.

In particular, this function behaves much like frexpf(): the returned value \(a\) is either \(0\) or satisfies \(0.5 \le \left| a \right| < 1.0\), and the output exponent \(p\) is such that \(x = a \cdot 2^{p}\).

In anticipation that future work may require alternative “normalized” representations, this form is being defined here as form A.

Parameters:

p – [out] Output exponent \(p\)
x – [in] Input value \(x\)

Throws ET_ARITHMETIC:

Raised if \(x\) is infinite or NaN.

Returns:

\(a\) in normalized form A.
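
Since form A is described in terms of frexpf(), its contract can be sketched directly on top of that standard function. The helper name sketch_normA and the local exponent_t typedef are assumptions for illustration, and the sketch omits the ET_ARITHMETIC check on non-finite input.

```c
#include <math.h>

typedef int exponent_t;  // assumption: local stand-in for the library's exponent type

// Hypothetical helper: return a with x == a * 2^(*p), where a == 0 or 0.5 <= |a| < 1.
static float sketch_normA(exponent_t *p, const float x)
{
    int e;
    const float a = frexpf(x, &e);
    *p = e;
    return a;
}
```

For instance, 12.0f normalizes to \(a = 0.75\) with \(p = 4\), because \(0.75 \cdot 2^{4} = 12\).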