32-bit Scalar API

group scalar_s32_api

Defines

S32_SQRT_MAX_DEPTH: Maximum bit-depth that s32_sqrt() can calculate.

Functions

float s32_to_f32(const int32_t mantissa, const exponent_t exp)

Pack a floating point value into an IEEE 754 single-precision float.

The value returned is the nearest representable approximation to \( m \cdot 2^{p} \) where \(m\) is mantissa and \(p\) is exp.

Example

// Pack -12345678 * 2^{-13} into a float
int32_t mant = -12345678;
exponent_t exp = -13;
float val = s32_to_f32(mant, exp);

printf("%e <-- %ld * 2^(%d)\n", val, (long) mant, (int) exp);

Note

This operation may result in a loss of precision.

Parameters:

mantissa – [in] Mantissa of value to be packed
exp – [in] Exponent of value to be packed

Returns:

float representation of input value
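When checking results off-target, the conversion can be modelled with the standard C function ldexpf(). The following is a minimal sketch of such a reference model, not the library's implementation. The names ref_s32_to_f32 and ref_s32_to_f32_check are hypothetical, and exponent_t is assumed to be equivalent to int.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical reference model: approximates mantissa * 2^exp as a float.
 * Assumes exponent_t is equivalent to int. */
float ref_s32_to_f32(int32_t mantissa, int exp)
{
    /* Converting a 32-bit mantissa to float may round, because a float
     * carries only 24 significant bits. This is the loss of precision
     * mentioned in the note above. */
    return ldexpf((float) mantissa, exp);
}

/* Small self-check using values that a float can hold exactly. */
void ref_s32_to_f32_check(void)
{
    assert(ref_s32_to_f32(3, 2) == 12.0f);     /*  3 * 2^2  */
    assert(ref_s32_to_f32(-5, -1) == -2.5f);   /* -5 * 2^-1 */
}
```

Mantissas with a magnitude below \(2^{24}\) convert exactly; larger ones are rounded to the nearest representable float.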

int16_t s32_to_s16(exponent_t *a_exp, const int32_t b, const exponent_t b_exp)

Convert a 32-bit floating-point scalar to a 16-bit floating-point scalar.

Converts a 32-bit floating-point scalar, represented by the 32-bit mantissa b and exponent b_exp, into a 16-bit floating-point scalar, represented by the 16-bit returned mantissa and output exponent a_exp.

Parameters:

a_exp – [out] Output exponent
b – [in] 32-bit input mantissa
b_exp – [in] Input exponent

Returns:

16-bit output mantissa
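The relationship between the two representations can be illustrated with a small model. This sketch assumes that the mantissa is only ever scaled down (never up) and that discarded bits are truncated toward zero; the library may normalize or round differently. The name ref_s32_to_s16 is hypothetical, and exponent_t is assumed to be equivalent to int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of narrowing a 32-bit mantissa to 16 bits.
 * Assumption: scale down only, truncating toward zero. */
int16_t ref_s32_to_s16(int *a_exp, int32_t b, int b_exp)
{
    assert(a_exp != NULL);

    int shr = 0;
    /* Halve the mantissa until it fits in the int16_t range. */
    while (b > INT16_MAX || b < INT16_MIN) {
        b /= 2;   /* each halving is compensated by +1 on the exponent */
        shr++;
    }

    *a_exp = b_exp + shr;
    return (int16_t) b;
}
```

For example, the input \(65536 \cdot 2^{0}\) becomes \(16384 \cdot 2^{2}\) under this model, which represents the same value.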

int32_t s32_sqrt(exponent_t *a_exp, const int32_t b, const exponent_t b_exp, const unsigned depth)

Compute the square root of a 32-bit floating-point scalar.

b and b_exp together represent the input \(b \cdot 2^{b\_exp}\). Likewise, a and a_exp together represent the result \(a \cdot 2^{a\_exp}\).

depth indicates the number of most significant bits (MSbs) of the result to calculate. Smaller values execute more quickly at the cost of reduced precision. The maximum valid value for depth is S32_SQRT_MAX_DEPTH.

Operation Performed:

\[\begin{flalign*} a \cdot 2^{a\_exp} \leftarrow \sqrt{\left( b \cdot 2^{b\_exp} \right)} && \end{flalign*}\]

Parameters:

a_exp – [out] Output exponent \(a\_exp\)
b – [in] Input mantissa \(b\)
b_exp – [in] Input exponent \(b\_exp\)
depth – [in] Number of most significant bits to calculate

Returns:

Output mantissa \(a\)
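To illustrate how depth trades precision for work, the sketch below computes a square root one result bit at a time, starting from the most significant bit. It is a hypothetical model only: the library's algorithm is not described here, REF_SQRT_MAX_DEPTH is an assumed stand-in for S32_SQRT_MAX_DEPTH, only positive inputs are handled, and exponent_t is assumed to be equivalent to int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REF_SQRT_MAX_DEPTH 31u  /* assumption; stands in for S32_SQRT_MAX_DEPTH */

/* Hypothetical bit-by-bit model of a depth-limited square root (b > 0 only). */
int32_t ref_s32_sqrt(int *a_exp, int32_t b, int b_exp, unsigned depth)
{
    assert(a_exp != NULL && b > 0);
    assert(depth >= 1 && depth <= REF_SQRT_MAX_DEPTH);

    /* Normalize so that bit 30 of the mantissa is set. */
    uint32_t m = (uint32_t) b;
    int exp = b_exp;
    while ((m & (1u << 30)) == 0) { m <<= 1; exp--; }

    /* The exponent must be even so it can be halved exactly. If it is odd,
     * fold one factor of 2 into the scale applied to the mantissa. */
    unsigned scale = 30;
    if (exp % 2 != 0) { scale = 29; exp++; }
    const uint64_t target = (uint64_t) m << scale;

    /* Greedily set result bits 30, 29, ... while r*r stays <= target.
     * Each iteration resolves one more significant bit. */
    uint32_t r = 0;
    for (unsigned i = 0; i < depth; i++) {
        const uint32_t candidate = r | (1u << (30 - i));
        if ((uint64_t) candidate * candidate <= target) r = candidate;
    }

    /* sqrt(b * 2^b_exp) = sqrt(target) * 2^(exp/2 - 15) */
    *a_exp = exp / 2 - 15;
    return (int32_t) r;
}
```

With a smaller depth, the low-order bits of the returned mantissa are simply left at zero in this model, so the result is a truncated version of the full-depth result.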

int32_t s32_inverse(exponent_t *a_exp, const int32_t b)

Compute the multiplicative inverse (reciprocal) of a 32-bit integer.

b represents the integer \(b\). a and a_exp together represent the result \(a \cdot 2^{a\_exp}\).

Operation Performed:

\[\begin{flalign*} a \cdot 2^{a\_exp} \leftarrow \frac{1}{b} && \end{flalign*}\]

If \(b\) is the mantissa of a fixed- or floating-point value with an implicit or explicit exponent \(b\_exp\), then

Fixed- or Floating-point

\( \begin{aligned} \frac{1}{b \cdot 2^{b\_exp}} &= \frac{1}{b} \cdot 2^{-b\_exp} \\ &= a \cdot 2^{a\_exp} \cdot 2^{-b\_exp} \\ &= a \cdot 2^{a\_exp - b\_exp} \end{aligned} \)

and so \(b\_exp\) should be subtracted from the output exponent \(a\_exp\).

Parameters:

a_exp – [out] Output exponent \(a\_exp\)
b – [in] Input integer \(b\)

Returns:

Output mantissa \(a\)
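The exponent adjustment described above can be sketched as follows. Both functions are hypothetical models rather than the library's implementation; in particular, the scaling chosen for the result mantissa (magnitude between roughly \(2^{29}\) and \(2^{30}\)) is an assumption, and exponent_t is assumed to be equivalent to int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: returns a and sets *a_exp so that
 * a * 2^(*a_exp) approximates 1/b. */
int32_t ref_s32_inverse(int *a_exp, int32_t b)
{
    assert(a_exp != NULL && b != 0);

    /* k = index of the highest set bit of |b| */
    const int64_t mag = b < 0 ? -(int64_t) b : (int64_t) b;
    int k = 0;
    while ((mag >> (k + 1)) != 0) k++;

    /* 2^(30+k) / b has a magnitude of at most 2^30, so it fits in int32_t. */
    *a_exp = -(30 + k);
    return (int32_t) ((INT64_C(1) << (30 + k)) / b);
}

/* Reciprocal of the value b * 2^b_exp: subtract b_exp from the exponent. */
int32_t ref_s32_inverse_scaled(int *a_exp, int32_t b, int b_exp)
{
    const int32_t a = ref_s32_inverse(a_exp, b);
    *a_exp -= b_exp;
    return a;
}
```

For instance, with \(b = 4\) and \(b\_exp = -3\) the input value is \(0.5\); the model returns \(a = 2^{30}\) with an adjusted exponent of \(-29\), which represents \(2.0\).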

int32_t s32_mul(exponent_t *a_exp, const int32_t b, const int32_t c, const exponent_t b_exp, const exponent_t c_exp)

Compute the product of two 32-bit floating-point scalars.

a and a_exp together represent the result \(a \cdot 2^{a\_exp}\).

b and b_exp together represent the input \(b \cdot 2^{b\_exp}\).

c and c_exp together represent the input \(c \cdot 2^{c\_exp}\).

Operation Performed:

\[\begin{flalign*} a \cdot 2^{a\_exp} \leftarrow \left( b\cdot 2^{b\_exp} \right) \cdot \left( c\cdot 2^{c\_exp} \right) && \end{flalign*}\]

Parameters:

a_exp – [out] Output exponent \(a\_exp\)
b – [in] First input mantissa \(b\)
c – [in] Second input mantissa \(c\)
b_exp – [in] First input exponent \(b\_exp\)
c_exp – [in] Second input exponent \(c\_exp\)

Returns:

Output mantissa \(a\)
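The operation can be modelled by forming the exact 64-bit product and then scaling it back into 32 bits. This is a sketch under assumptions: how the library chooses the output exponent and whether it rounds are not stated above, so the model simply halves the product (truncating toward zero) until it fits. The name ref_s32_mul is hypothetical, and exponent_t is assumed to be equivalent to int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a mantissa/exponent multiply. */
int32_t ref_s32_mul(int *a_exp, int32_t b, int32_t c, int b_exp, int c_exp)
{
    assert(a_exp != NULL);

    int64_t p = (int64_t) b * c;   /* exact 64-bit product */
    int shr = 0;

    /* Halve the product until it fits in int32_t. */
    while (p > INT32_MAX || p < INT32_MIN) {
        p /= 2;
        shr++;
    }

    /* Exponents add; each halving is compensated in the exponent. */
    *a_exp = b_exp + c_exp + shr;
    return (int32_t) p;
}
```

For example, \((3 \cdot 2^{2}) \cdot (5 \cdot 2^{-1}) = 12 \cdot 2.5 = 30\), which this model returns as \(15 \cdot 2^{1}\).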

sbrad_t radians_to_sbrads(const radian_q24_t theta)

Convert angle from radians to a modified binary representation.

Some trig functions, such as sbrad_sin(), do not take an angle specified in radians (e.g. radian_q24_t). Instead they require their argument to be a modified representation of the angle, an sbrad_t. The modified binary representation takes into account various properties of the \(sin(\theta)\) function to simplify certain operations.

For any angle \(\theta\) there is a unique angle \(\alpha\) where \(-1\le\alpha\le1\) and \(sin(\frac{\pi}{2}\alpha) = sin(\theta)\). This function maps the input angle \(\theta\) onto the corresponding angle \(\alpha\) in that region and returns the result in a Q1.31 format.

In this library, the unit of the resulting angle \(\alpha\) is referred to as an ‘sbrad’. ‘brad’ because \(\alpha\) is a kind of binary angular measurement, and ‘s’ because the symmetries of \(sin(\theta)\) are what’s being accounted for.

Parameters:

theta – [in] Input angle \(\theta\), in radians (Q8.24)

Returns:

Output angle \(\alpha\), in sbrads
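The folding of \(\theta\) onto \(\alpha\) can be sketched in floating point. This model is hypothetical and returns \(\alpha\) as a double rather than as an sbrad_t, so that no assumption about the fixed-point encoding is needed. The names ref_radians_to_alpha and REF_PI are not part of the library.

```c
#include <assert.h>
#include <stdint.h>

#define REF_PI 3.14159265358979323846

/* Hypothetical model of the angle folding described above.
 * Input: angle in radians as Q8.24. Output: alpha in [-1, 1] as a double. */
double ref_radians_to_alpha(int32_t theta_q24)
{
    /* Angle in quarter turns: t = theta / (pi/2). */
    double t = ((double) theta_q24 / 16777216.0) / (REF_PI / 2.0);

    /* sin has period 4 in these units; reduce t to [-2, 2). */
    t -= 4.0 * (double) (long) (t / 4.0);
    if (t >= 2.0) t -= 4.0;
    if (t < -2.0) t += 4.0;

    /* Reflect about +1 or -1, using sin(pi/2 * (2 - t)) == sin(pi/2 * t). */
    if (t > 1.0) t = 2.0 - t;
    else if (t < -1.0) t = -2.0 - t;

    assert(t >= -1.0 && t <= 1.0);
    return t;
}
```

For example, \(\theta = \frac{3\pi}{4}\) maps to \(\alpha \approx 0.5\), since \(sin(\frac{3\pi}{4}) = sin(\frac{\pi}{4})\).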

q2_30 sbrad_sin(const sbrad_t theta)

Compute the sine of the specified angle.

This function computes \(sin(\frac{\pi}{2}\theta)\), returning the result in Q2.30 format.

The input angle \(\theta\) must be expressed in sbrads (sbrad_t), and must represent a value between \(\pm 0.5\) inclusive, as a Q1.31 value.

Operation Performed:

\[\begin{flalign*} & sin(\frac{\pi}{2}\theta) && \end{flalign*}\]

Parameters:

theta – [in] Input angle \(\theta\), in sbrads (see radians_to_sbrads)

Returns:

Sine of the specified angle in Q2.30 format.

q2_30 sbrad_tan(const sbrad_t theta)

Compute the tangent of the specified angle.

This function computes \(tan(\frac{\pi}{2}\theta)\), returning the result in Q2.30 format.

The input angle \(\theta\) must be expressed in sbrads (sbrad_t), and must represent a value between \(\pm 0.25\) inclusive, as a Q1.31 value.

Operation Performed:

\[\begin{flalign*} & tan(\frac{\pi}{2}\theta) && \end{flalign*}\]

Parameters:

theta – [in] Input angle \(\theta\), in sbrads (see radians_to_sbrads)

Returns:

Tangent of the specified angle in Q2.30 format.

q2_30 q24_sin(const radian_q24_t theta)

Compute the sine of the specified angle.

This function computes \(sin(\theta)\), returning the result in Q2.30 format.

Operation Performed:

\[\begin{flalign*} & sin(\theta) && \end{flalign*}\]

Parameters:

theta – [in] Input angle \(\theta\), in radians (Q8.24)

Returns:

\(sin(\theta)\) as a Q2.30
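When preparing inputs for, or interpreting outputs of, the Q8.24 and Q2.30 functions in this group, simple conversion helpers can be useful. The helpers below are a sketch and are not part of the library; they assume that radian_q24_t and q2_30 are 32-bit signed integers with 24 and 30 fractional bits respectively.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert radians to Q8.24, rounding to nearest. */
int32_t ref_double_to_q24(double radians)
{
    const double scaled = radians * 16777216.0;   /* 2^24 */
    const double rounded = scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5;

    /* The rounded value must be representable as int32_t. */
    assert(rounded > -2147483649.0 && rounded < 2147483648.0);
    return (int32_t) rounded;
}

/* Hypothetical helper: interpret a Q2.30 value as a double. */
double ref_q30_to_double(int32_t q30)
{
    return (double) q30 / 1073741824.0;           /* 2^30 */
}
```

A Q2.30 result of \(2^{30}\) therefore corresponds to \(1.0\), and a Q8.24 input of \(2^{24}\) corresponds to one radian.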

q2_30 q24_cos(const radian_q24_t theta)

Compute the cosine of the specified angle.

This function computes \(cos(\theta)\), returning the result in Q2.30 format.

Operation Performed:

\[\begin{flalign*} & cos(\theta) && \end{flalign*}\]

Parameters:

theta – [in] Input angle \(\theta\), in radians (Q8.24)

Returns:

\(cos(\theta)\) as a Q2.30

float_s32_t q24_tan(const radian_q24_t theta)

Compute the tangent of the specified angle.

This function computes \(tan(\theta)\). The result is returned as a float_s32_t containing a mantissa and exponent.

The value of \(tan(\theta)\) is considered undefined where \(\theta=\frac{\pi}{2}+k\pi\) for any integer \(k\). An exception will be raised if \(\theta\) meets this condition.

Operation Performed:

\[\begin{flalign*} & tan(\theta) && \end{flalign*}\]

Parameters:

theta – [in] Input angle \(\theta\), in radians (Q8.24)

Throws ET_ARITHMETIC:

Raised if \(tan(\theta)\) is undefined.

Returns:

\(tan(\theta)\) as a float_s32_t

q2_30 q30_exp_small(const q2_30 x)

Compute \(e^x\) for Q2.30 value near \(0\).

This function computes \(e^x\) where \(x\) is a fixed-point value with 30 fractional bits.

This function implements \(e^x\) using a truncated power series, and is only intended to be used for inputs in the range \(-0.5 \le x \le 0.5\).

The output is also in the Q2.30 format.

For the range \(-0.5 \le x \le 0.5\), the maximum observed error (compared to exp(double) from math.h) was 2 (which corresponds to \(2^{-29}\)).

For the range \(-1.0 \le x \le 1.0\), the corresponding maximum observed error was 324, or approximately \(2^{-21}\).

To compute \(e^x\) for \(x\) outside of \(\left[-0.5, 0.5\right]\), use float_s32_exp().

Operation Performed:

\[\begin{flalign*} & y \leftarrow e^x && \end{flalign*}\]

Parameters:

x – [in] Input value \(x\)

Returns:

\(y\)
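A truncated power series for \(e^x\) in Q2.30 arithmetic can be sketched as below. This is an illustration of the technique only: the number of terms (12) and the truncating division are assumptions, so the error figures quoted above for the library do not necessarily apply to this model. The names ref_q30_exp_small and REF_Q30_ONE are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define REF_Q30_ONE (INT64_C(1) << 30)   /* 1.0 in Q2.30 */

/* Hypothetical model: e^x = sum over k of x^k / k!, in Q2.30. */
int32_t ref_q30_exp_small(int32_t x)
{
    /* Intended input range: -0.5 <= x <= 0.5 */
    assert(x >= -(REF_Q30_ONE / 2) && x <= REF_Q30_ONE / 2);

    int64_t term = REF_Q30_ONE;   /* k = 0 term: x^0 / 0! = 1.0 */
    int64_t sum = term;

    for (int k = 1; k <= 12; k++) {
        /* term_k = term_(k-1) * x / k, keeping 30 fractional bits */
        term = (term * x) / REF_Q30_ONE / k;
        sum += term;
    }

    return (int32_t) sum;
}
```

Because each term is obtained from the previous one, only one multiplication and two divisions are needed per term, and the terms shrink quickly when \(|x| \le 0.5\).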

q8_24 q24_logistic(const q8_24 x)

Evaluate the logistic function at the specified point.

This function computes the value of the logistic function \(y =\frac{1}{1+e^{-x}}\). This is a sigmoidal curve bounded below by \(y = 0\) and above by \(y = 1\).

The input \(x\) and output \(y\) are both Q8.24 fixed-point values.

If speed is greatly preferred to precision, q24_logistic_fast() can be used instead.

Operation Performed:

\[\begin{flalign*} & y \leftarrow \frac{1}{1+e^{-x}} && \end{flalign*}\]

Parameters:

x – [in] Input value \(x\)

Returns:

\(y\)

q8_24 q24_logistic_fast(const q8_24 x)

Evaluate the logistic function at the specified point.

This function computes the value of the logistic function \(y =\frac{1}{1+e^{-x}}\). This is a sigmoidal curve bounded below by \(y = 0\) and above by \(y = 1\).

The input \(x\) and output \(y\) are both Q8.24 fixed-point values.

This implementation trades off precision for speed, approximating results in a piece-wise linear manner. If a precise result is desired, q24_logistic() should be used instead.

Operation Performed:

\[\begin{flalign*} & y \leftarrow \frac{1}{1+e^{-x}} && \end{flalign*}\]

Parameters:

x – [in] Input value \(x\)

Returns:

\(y\)
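A piece-wise linear approximation of the logistic function in Q8.24 can be sketched as follows. The segment boundaries and slopes here are illustrative values chosen for this sketch (power-of-two slopes, joined continuously); the segments used by the library are not stated above. The names ref_q24_logistic_pwl and REF_Q24_ONE are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define REF_Q24_ONE (INT64_C(1) << 24)   /* 1.0 in Q8.24 */

/* Hypothetical piece-wise linear logistic approximation in Q8.24. */
int32_t ref_q24_logistic_pwl(int32_t x)
{
    /* Work with |x| in 64 bits so that INT32_MIN is handled safely. */
    const int64_t ax = x < 0 ? -(int64_t) x : (int64_t) x;
    int64_t y;

    if (ax >= 5 * REF_Q24_ONE)
        y = REF_Q24_ONE;                          /* saturate at 1.0     */
    else if (3 * ax >= 7 * REF_Q24_ONE)
        y = ax / 32 + 27 * REF_Q24_ONE / 32;      /* |x| >= 7/3          */
    else if (ax >= REF_Q24_ONE)
        y = ax / 8 + 5 * REF_Q24_ONE / 8;         /* 1 <= |x| < 7/3      */
    else
        y = ax / 4 + REF_Q24_ONE / 2;             /* |x| < 1             */

    /* Symmetry of the logistic function: f(-x) = 1 - f(x). */
    if (x < 0) y = REF_Q24_ONE - y;

    assert(y >= 0 && y <= REF_Q24_ONE);
    return (int32_t) y;
}
```

In this sketch an input of \(1.0\) yields \(0.75\), whereas the exact logistic value is approximately \(0.731\), which illustrates the precision traded away for speed.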

void s32_to_chunk_s32(int32_t a[VPU_INT32_EPV], int32_t b)

Broadcast an integer to a vector chunk.

This function broadcasts the input \(b\) to the 8 elements of \(\bar a\).

Operation Performed:

\[\begin{flalign*} & a_k \leftarrow b && \end{flalign*}\]

Parameters:

a – [out] Output chunk \(\bar a\)
b – [in] Input value \(b\)

Throws ET_LOAD_STORE:

Raised if a is not double word-aligned (See Note: Vector Alignment)
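A plain C model of the broadcast is shown below. It is a sketch only: REF_INT32_EPV is a hypothetical stand-in for VPU_INT32_EPV (8 elements, per the description above), and double-word alignment is assumed to mean 8-byte alignment.

```c
#include <assert.h>
#include <stdint.h>

#define REF_INT32_EPV 8   /* elements per chunk, per the description above */

/* Hypothetical model: write b to every element of the chunk. */
void ref_s32_to_chunk_s32(int32_t a[REF_INT32_EPV], int32_t b)
{
    /* Mirror the alignment requirement (assumed to be 8 bytes). */
    assert(((uintptr_t) a % 8) == 0);

    for (int k = 0; k < REF_INT32_EPV; k++) {
        a[k] = b;
    }
}
```

In C11, a suitably aligned buffer can be declared with `_Alignas(8) int32_t chunk[8];`.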

void q30_powers(q2_30 a[], const q2_30 b, const unsigned N)

Get the first \(N\) powers of \(b\).

This function computes the first \(N\) powers (starting with \(b^0\)) of the Q2.30 input \(b\). The results are output as \(\bar a\), also in Q2.30 format.

Operation Performed:

\[\begin{split}\begin{flalign*} & a_0 \leftarrow 2^{30} = \mathtt{Q30(1.0)} \\ & a_k \leftarrow round\left(\frac{a_{k-1}\cdot b}{2^{30}}\right) \\ & \qquad\text{for }k \in \{1..N-1\} && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output \(\bar a\)
b – [in] Input \(b\)
N – [in] Number of elements of \(\bar a\) to compute
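The recurrence above can be modelled directly in C. This sketch assumes round-half-up rounding and an arithmetic right shift for negative products; the library's exact rounding mode is not stated. The name ref_q30_powers is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a[k] = b^k in Q2.30, for k = 0 .. N-1. */
void ref_q30_powers(int32_t a[], int32_t b, unsigned N)
{
    assert(a != NULL || N == 0);
    if (N == 0) return;

    a[0] = INT32_C(1) << 30;   /* b^0 = 1.0 in Q2.30 */

    for (unsigned k = 1; k < N; k++) {
        /* Product of two Q2.30 values has 60 fractional bits. */
        const int64_t p = (int64_t) a[k - 1] * b;

        /* Add half an LSb, then drop 30 fractional bits. Right-shifting a
         * negative value is assumed to be an arithmetic shift. */
        a[k] = (int32_t) ((p + (INT64_C(1) << 29)) >> 30);
    }
}
```

With \(b = 0.5\) (that is, \(2^{29}\)) and \(N = 4\), the output is \(2^{30}, 2^{29}, 2^{28}, 2^{27}\), representing \(1, 0.5, 0.25, 0.125\).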

void s32_odd_powers(int32_t a[], const int32_t b, const unsigned count, const right_shift_t shr)

Fill vector with odd powers of \(b\).

This function populates the elements of output vector \(\bar a\) with the odd powers of input \(b\). The first count odd powers of \(b\) are output. The highest power output will be \(2\cdot\mathtt{count}-1\).

The 64-bit product of each multiplication is right-shifted by shr bits and truncated to the 32 least significant bits. If \(b\) is a fixed-point value with shr fractional bits, then each \(a_k\) will have the same Q-format as input \(b\). shr must be non-negative.

This function neither rounds nor saturates results. It is up to the user to ensure overflows are avoided.

A typical use case is computing a power series for a function with odd symmetry.

Operation Performed:

\[\begin{split}\begin{flalign*} & b_{sqr} = \frac{b^2}{2^{\mathtt{shr}}} \\ & a_0 \leftarrow b \\ & a_k \leftarrow \frac{a_{k-1} \cdot b_{sqr}}{2^{\mathtt{shr}}} \\ & \qquad\text{for } k \in \{1, 2, 3, ..., \mathtt{count} - 1\} && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output vector \(\bar a\)
b – [in] Input \(b\)
count – [in] Number of elements to output.
shr – [in] Number of bits to right-shift 64-bit products.
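The operation can be modelled in plain C as shown below. This is a sketch, not the library's implementation: right_shift_t is assumed to be equivalent to int, right shifts of negative values are assumed to be arithmetic, and the name ref_s32_odd_powers is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a[k] = b^(2k+1), rescaled by shr after each product. */
void ref_s32_odd_powers(int32_t a[], int32_t b, unsigned count, int shr)
{
    assert(shr >= 0 && shr < 64);
    assert(a != NULL || count == 0);
    if (count == 0) return;

    /* b^2, right-shifted and truncated to 32 bits (no rounding, no saturation). */
    const int32_t b_sqr = (int32_t) (((int64_t) b * b) >> shr);

    a[0] = b;   /* first odd power: b^1 */
    for (unsigned k = 1; k < count; k++) {
        /* Each step multiplies by b^2, raising the power by two. */
        a[k] = (int32_t) (((int64_t) a[k - 1] * b_sqr) >> shr);
    }
}
```

With \(b = 0.5\) in Q2.30 (\(2^{29}\)), shr = 30 and count = 3, the output is \(2^{29}, 2^{27}, 2^{25}\), representing \(0.5, 0.125, 0.03125\), that is \(b^1, b^3, b^5\).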