16-Bit Block Floating-Point API

group bfp_s16_api

Functions

void bfp_s16_init(bfp_s16_t *a, int16_t *data, const exponent_t exp, const unsigned length, const unsigned calc_hr)

Initialize a 16-bit BFP vector.

This function initializes each of the fields of BFP vector a.

data points to the memory buffer used to store elements of the vector, so it must be at least length * 2 bytes long, and must begin at a word-aligned address.

exp is the exponent assigned to the BFP vector. The logical value associated with the kth element of the vector after initialization is \( data_k \cdot 2^{exp} \).

If calc_hr is false, a->hr is initialized to 0. Otherwise, the headroom of the BFP vector is calculated and used to initialize a->hr.

Parameters:

a – [out] BFP vector to initialize
data – [in] int16_t buffer used to back a
exp – [in] Exponent of BFP vector
length – [in] Number of elements in the BFP vector
calc_hr – [in] Boolean indicating whether the HR of the BFP vector should be calculated
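
As a rough illustration of the mantissa/exponent relationship described above, the following plain-C sketch decodes a mantissa buffer into logical values \( data_k \cdot 2^{exp} \). It does not call the library; scale_pow2() and decode_s16_model() are hypothetical helper names introduced only for this example.

```c
#include <stdint.h>

/* Multiply x by 2^exp using repeated doubling/halving. This is exact for the
 * small exponents used here and avoids depending on <math.h>. */
double scale_pow2(double x, int exp)
{
    for (; exp > 0; --exp) x *= 2.0;
    for (; exp < 0; ++exp) x *= 0.5;
    return x;
}

/* Hypothetical helper (not a library function):
 * out[k] = data[k] * 2^exp for each of `length` elements. */
void decode_s16_model(const int16_t *data, int exp, unsigned length, double *out)
{
    for (unsigned k = 0; k < length; ++k) {
        out[k] = scale_pow2((double)data[k], exp);
    }
}
```

With data = {3, -8, 0} and exp = -2, the decoded values are 0.75, -2.0 and 0.0.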

bfp_s16_t bfp_s16_alloc(const unsigned length)

Dynamically allocate a 16-bit BFP vector from the heap.

If allocation was unsuccessful, the data field of the returned vector will be NULL, and the length field will be zero. Otherwise, data will point to the allocated memory and the length field will be the user-specified length. The length argument must not be zero.

Neither the BFP exponent, headroom, nor the elements of the allocated mantissa vector are set by this function. To set the BFP vector elements to a known value, use bfp_s16_set() on the returned BFP vector.

BFP vectors allocated using this function must be deallocated using bfp_s16_dealloc() to avoid a memory leak.

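To initialize a BFP vector using static memory allocation, use bfp_s16_init() instead.

Parameters:

length – [in] Number of elements in the BFP vector to allocate

Returns:

The allocated 16-bit BFP vector (data is NULL if allocation failed)

void bfp_s16_dealloc(bfp_s16_t *vector)

Release a dynamically allocated 16-bit BFP vector.

Use this function to free the heap memory allocated by bfp_s16_alloc().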

See also

bfp_s16_alloc

Parameters:

vector – [in] BFP vector to be deallocated.

void bfp_s16_set(bfp_s16_t *a, const int16_t b, const exponent_t exp)

Set all elements of a 16-bit BFP vector to a specified value.

The exponent of a is set to exp, and each element’s mantissa is set to b.

After performing this operation, all elements will represent the same value \(b \cdot 2^{exp}\).

a must have been initialized (see bfp_s16_init()).

Parameters:

a – [out] BFP vector to update
b – [in] New value each mantissa is set to
exp – [in] New exponent for the BFP vector

headroom_t bfp_s16_headroom(bfp_s16_t *b)

Get the headroom of a 16-bit BFP vector.

The headroom of a vector is the number of bits its elements can be left-shifted without losing any information. It conveys information about the range of values that vector may contain, which is useful for determining how best to preserve precision in potentially lossy block floating-point operations.

In a BFP context, headroom applies to mantissas only, not exponents.

In particular, if the 16-bit mantissa vector \(\bar x\) has \(N\) bits of headroom, then for any element \(x_k\) of \(\bar x\)

\(-2^{15-N} \le x_k < 2^{15-N}\)

And for any element \(X_k = x_k \cdot 2^{x\_exp}\) of the BFP vector \(\bar X\)

\(-2^{15 + x\_exp - N} \le X_k < 2^{15 + x\_exp - N} \)

This function determines the headroom of b, updates b->hr with that value, and then returns b->hr.

Parameters:

b – BFP vector to get the headroom of

Returns:

Headroom of BFP vector b
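
The definition of headroom given above can be sketched in plain C as a count of redundant sign bits. This is a model for illustration only, not the library's implementation; headroom_s16_model() and vect_headroom_s16_model() are hypothetical names, and the choice to cap the result at 15 (for example for an all-zero vector) is an assumption of this sketch.

```c
#include <stdint.h>

/* Headroom of one 16-bit mantissa: the number of bits below the sign bit
 * that merely repeat the sign bit (0..15). */
unsigned headroom_s16_model(int16_t x)
{
    /* Fold negative values onto non-negative ones so that only leading
     * zeros need to be counted. */
    uint16_t v = (uint16_t)(x < 0 ? ~x : x);
    unsigned hr = 0;
    /* Bit 15 is the sign bit; count how many bits below it are also clear. */
    for (int bit = 14; bit >= 0 && !((v >> bit) & 1u); --bit) {
        ++hr;
    }
    return hr;
}

/* Headroom of a vector: the smallest headroom of any of its elements. */
unsigned vect_headroom_s16_model(const int16_t *data, unsigned length)
{
    unsigned hr = 15;
    for (unsigned k = 0; k < length; ++k) {
        unsigned h = headroom_s16_model(data[k]);
        if (h < hr) hr = h;
    }
    return hr;
}
```

For instance, a mantissa of 1 has 14 bits of headroom, 16384 has none, and the vector {100, -3, 7} has 8 bits, which agrees with the bound \(-2^{15-N} \le x_k < 2^{15-N}\) for \(N = 8\).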

void bfp_s16_use_exponent(bfp_s16_t *a, const exponent_t exp)

Modify a 16-bit BFP vector to use a specified exponent.

This function forces BFP vector \(\bar A\) to use a specified exponent. The mantissa vector \(\bar a\) will be bit-shifted left or right to compensate for the changed exponent.

This function can be used, for example, before calling a fixed-point arithmetic function to ensure the underlying mantissa vector has the needed Q-format. As another example, this may be useful when communicating with peripheral devices (e.g. via I2S) that require sample data to be in a specified format.

Note that this sets the current encoding, and does not fix the exponent permanently (i.e. subsequent operations may change the exponent as usual).

If the required fixed-point Q-format is QX.Y, where Y is the number of fractional bits in the resulting mantissas, then the associated exponent (and value for parameter exp) is -Y.

a points to input BFP vector \(\bar A\), with mantissa vector \(\bar a\) and exponent \(a\_exp\). a is updated in place to produce resulting BFP vector \(\bar{\tilde{A}}\) with mantissa vector \(\bar{\tilde{a}}\) and exponent \(\tilde{a}\_exp\).

exp is \(\tilde{a}\_exp\), the required exponent. \(\Delta{}p = \tilde{a}\_exp - a\_exp\) is the required change in exponent.

If \(\Delta{}p = 0\), the BFP vector is left unmodified.

If \(\Delta{}p > 0\), the required exponent is larger than the current exponent and an arithmetic right-shift of \(\Delta{}p\) bits is applied to the mantissas \(\bar a\). When applying a right-shift, precision may be lost by discarding the \(\Delta{}p\) least significant bits.

If \(\Delta{}p < 0\), the required exponent is smaller than the current exponent and a left-shift of \(-\Delta{}p\) bits is applied to the mantissas \(\bar a\). When left-shifting, saturation logic will be applied such that any element that can’t be represented exactly with the new exponent will saturate to the 16-bit saturation bounds.

The exponent and headroom of a are updated by this function.

Operation Performed:

\[\begin{split}\begin{flalign*} & \Delta{}p = \tilde{a}\_exp - a\_exp \\ & \tilde{a_k} \leftarrow sat_{16}( a_k \cdot 2^{-\Delta{}p} ) \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{A} \text{ (in elements) } && \end{flalign*}\end{split}\]

Parameters:

a – [inout] Input BFP vector \(\bar A\) / Output BFP vector \(\bar{\tilde{A}}\)
exp – [in] The required exponent, \(\tilde{a}\_exp\)
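
The three cases for \(\Delta{}p\) can be modelled in plain C as follows. This is a sketch under stated assumptions rather than the library's code: sat16_model() and use_exponent_model() are hypothetical names, the saturation bounds are assumed to be the symmetric range [-32767, 32767], right-shifts are assumed to round toward negative infinity, and the headroom update is omitted.

```c
#include <stdint.h>

/* Saturate to the symmetric 16-bit range [-32767, 32767] (assumed bounds). */
int16_t sat16_model(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < -INT16_MAX) return -INT16_MAX;
    return (int16_t)x;
}

/* Re-encode `length` mantissas, currently using exponent *exp, to use new_exp. */
void use_exponent_model(int16_t *mant, unsigned length, int *exp, int new_exp)
{
    int delta = new_exp - *exp;            /* the Delta-p of the text */
    if (delta == 0) return;                /* vector is left unmodified */

    int shift = delta < 0 ? -delta : delta;
    if (shift > 15) shift = 15;            /* larger shifts give the same result */

    for (unsigned k = 0; k < length; ++k) {
        int32_t m = mant[k];
        if (delta > 0) {
            /* Right-shift rounding toward negative infinity, written without
             * relying on implementation-defined shifts of negative values. */
            m = (m >= 0) ? (m >> shift)
                         : -((-m + (1 << shift) - 1) >> shift);
        } else {
            m = m * (1 << shift);          /* left-shift by -delta bits */
        }
        mant[k] = sat16_model(m);
    }
    *exp = new_exp;
}
```

Starting from mantissas {1000, -1000, 3} with exponent -4, requesting exponent -2 yields {250, -250, 0}, losing the two least significant bits. Requesting exponent -4 for {3000, -3000} with exponent 0 saturates both elements, to 32767 and -32767.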

void bfp_s16_shl(bfp_s16_t *a, const bfp_s16_t *b, const left_shift_t b_shl)

Apply a left-shift to the mantissas of a 16-bit BFP vector.

Each mantissa of input BFP vector \(\bar B\) is left-shifted b_shl bits and stored in the corresponding element of output BFP vector \(\bar A\).

This operation can be used to add or remove headroom from a BFP vector.

b_shl is the number of bits that each mantissa will be left-shifted. This shift is signed and arithmetic, so negative values for b_shl will right-shift the mantissas.

a and b must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b.

Note that this operation bypasses the logic protecting the caller from saturation or underflows. Output values saturate to the symmetric 16-bit range (the open interval \((-2^{15}, 2^{15})\)). To avoid saturation, b_shl should be no greater than the headroom of b (b->hr).

Operation Performed:

\[\begin{split}\begin{flalign*} & a_k \leftarrow sat_{16}( \lfloor b_k \cdot 2^{b\_shl} \rfloor ) \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} \\ & \qquad\text{ and } b_k \text{ and } a_k \text{ are the } k\text{th mantissas from } \bar{B}\text{ and } \bar{A}\text{ respectively} && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
b_shl – [in] Signed arithmetic left-shift to be applied to mantissas of \(\bar B\).

void bfp_s16_add(bfp_s16_t *a, const bfp_s16_t *b, const bfp_s16_t *c)

Add two 16-bit BFP vectors together.

Add together two input BFP vectors \(\bar B\) and \(\bar C\) and store the result in BFP vector \(\bar A\).

a, b and c must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b or c.

Operation Performed:

\[\begin{flalign*} \bar{A} \leftarrow \bar{B} + \bar{C} && \end{flalign*}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
c – [in] Input BFP vector \(\bar C\)

void bfp_s16_add_scalar(bfp_s16_t *a, const bfp_s16_t *b, const float c)

Add a scalar to a 16-bit BFP vector.

Add a real scalar \(c\) to input BFP vector \(\bar B\) and store the result in BFP vector \(\bar A\).

a and b must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b.

Operation Performed:

\[\begin{split}\begin{flalign*} & \bar{A} \leftarrow \bar{B} + c \\ && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
c – [in] Input scalar \(c\)

void bfp_s16_sub(bfp_s16_t *a, const bfp_s16_t *b, const bfp_s16_t *c)

Subtract one 16-bit BFP vector from another.

Subtract input BFP vector \(\bar C\) from input BFP vector \(\bar B\) and store the result in BFP vector \(\bar A\).

a, b and c must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b or c.

Operation Performed:

\[\begin{flalign*} \bar{A} \leftarrow \bar{B} - \bar{C} && \end{flalign*}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
c – [in] Input BFP vector \(\bar C\)

void bfp_s16_mul(bfp_s16_t *a, const bfp_s16_t *b, const bfp_s16_t *c)

Multiply one 16-bit BFP vector by another element-wise.

Multiply each element of input BFP vector \(\bar B\) by the corresponding element of input BFP vector \(\bar C\) and store the results in output BFP vector \(\bar A\).

a, b and c must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b or c.

Operation Performed:

\[\begin{split}\begin{flalign*} & A_k \leftarrow B_k \cdot C_k \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C} && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
c – [in] Input BFP vector \(\bar C\)

void bfp_s16_macc(bfp_s16_t *acc, const bfp_s16_t *b, const bfp_s16_t *c)

Multiply one 16-bit BFP vector by another element-wise and add the result to a third vector.

Operation Performed:

\[\begin{split}\begin{flalign*} & A_k \leftarrow A_k + B_k \cdot C_k \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C} && \end{flalign*}\end{split}\]

Parameters:

acc – [inout] Input/Output accumulator BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
c – [in] Input BFP vector \(\bar C\)

void bfp_s16_nmacc(bfp_s16_t *acc, const bfp_s16_t *b, const bfp_s16_t *c)

Multiply one 16-bit BFP vector by another element-wise and subtract the result from a third vector.

Operation Performed:

\[\begin{split}\begin{flalign*} & A_k \leftarrow A_k - B_k \cdot C_k \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C} && \end{flalign*}\end{split}\]

Parameters:

acc – [inout] Input/Output accumulator BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
c – [in] Input BFP vector \(\bar C\)

void bfp_s16_scale(bfp_s16_t *a, const bfp_s16_t *b, const float alpha)

Multiply a 16-bit BFP vector by a scalar.

Multiply input BFP vector \(\bar B\) by scalar \(\alpha \cdot 2^{\alpha\_exp}\) and store the result in output BFP vector \(\bar A\).

a and b must have been initialized (see bfp_s16_init()), and must be the same length.

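alpha is the scalar multiplier, written here as \(\alpha \cdot 2^{\alpha\_exp}\). Because the signature takes alpha as a plain float, \(\alpha\) and \(\alpha\_exp\) denote the mantissa and exponent of that floating-point value rather than struct fields.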

This operation can be performed safely in-place on b.

Operation Performed:

\[\begin{flalign*} \bar{A} \leftarrow \bar{B} \cdot \left(\alpha \cdot 2^{\alpha\_exp}\right) && \end{flalign*}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
alpha – [in] Scalar by which \(\bar B\) is multiplied

void bfp_s16_abs(bfp_s16_t *a, const bfp_s16_t *b)

Get the absolute values of elements of a 16-bit BFP vector.

Compute the absolute value of each element \(B_k\) of input BFP vector \(\bar B\) and store the results in output BFP vector \(\bar A\).

a and b must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b.

Operation Performed:

\[\begin{split}\begin{flalign*} A_k \leftarrow \left| B_k \right| \\ \qquad\text{for } k \in 0\ ...\ (N-1) \\ \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)

float_s32_t bfp_s16_sum(const bfp_s16_t *b)

Sum the elements of a 16-bit BFP vector.

Sum the elements of input BFP vector \(\bar B\) to get a result \(A = a \cdot 2^{a\_exp}\), which is returned. The returned value has a 32-bit mantissa.

b must have been initialized (see bfp_s16_init()).

Operation Performed:

\[\begin{split}\begin{flalign*} & A \leftarrow \sum_{k=0}^{N-1} \left( B_k \right) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

b – [in] Input BFP vector \(\bar B\)

Returns:

\(A\), the sum of elements of \(\bar B\)

float_s64_t bfp_s16_dot(const bfp_s16_t *b, const bfp_s16_t *c)

Compute the inner product of two 16-bit BFP vectors.

Adds together the element-wise products of input BFP vectors \(\bar B\) and \(\bar C\) for a result \(A = a \cdot 2^{a\_exp}\), where \(a\) is the 64-bit mantissa of the result and \(a\_exp\) is its associated exponent. \(A\) is returned.

b and c must have been initialized (see bfp_s16_init()), and must be the same length.

Operation Performed:

\[\begin{split}\begin{flalign*} & a \cdot 2^{a\_exp} \leftarrow \sum_{k=0}^{N-1} \left( B_k \cdot C_k \right) \\ & \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C} && \end{flalign*}\end{split}\]

Parameters:

b – [in] Input BFP vector \(\bar B\)
c – [in] Input BFP vector \(\bar C\)

Returns:

\(A\), the inner product of vectors \(\bar B\) and \(\bar C\)
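
As a reference model for checking results, the inner product can be computed exactly on the mantissas, with the result exponent being the sum of the input exponents. The sketch below is plain C and does not use the library; model_float_s64_t is a hypothetical stand-in for the returned mantissa/exponent pair, and the library itself may choose a different (equivalent or rounded) mantissa and exponent.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit-mantissa scalar: value = mant * 2^exp. */
typedef struct {
    int64_t mant;
    int exp;
} model_float_s64_t;

/* Reference inner product of two equal-length 16-bit mantissa vectors
 * with exponents b_exp and c_exp. */
model_float_s64_t dot_s16_model(const int16_t b[], int b_exp,
                                const int16_t c[], int c_exp,
                                unsigned length)
{
    /* (b_k * 2^b_exp) * (c_k * 2^c_exp) = (b_k * c_k) * 2^(b_exp + c_exp) */
    model_float_s64_t a = { 0, b_exp + c_exp };
    for (unsigned k = 0; k < length; ++k) {
        a.mant += (int64_t)b[k] * (int64_t)c[k];
    }
    return a;
}
```

For mantissas {1, 2, 3} with exponent -1 and {4, 5, 6} with exponent 2, the model returns mantissa 32 with exponent 1, that is, a logical value of 64.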

void bfp_s16_clip(bfp_s16_t *a, const bfp_s16_t *b, const int16_t lower_bound, const int16_t upper_bound, const int bound_exp)

Clamp the elements of a 16-bit BFP vector to a specified range.

Each element \(A_k\) of output BFP vector \(\bar A\) is set to the corresponding element \(B_k\) of input BFP vector \(\bar B\) if it is in the range \( [ L \cdot 2^{bound\_exp}, U \cdot 2^{bound\_exp} ] \), otherwise it is set to the nearest value inside that range.

a and b must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b.

Operation Performed:

\[\begin{split}\begin{flalign*} & A_k \leftarrow \begin{cases} L \cdot 2^{bound\_exp} & B_k < L \cdot 2^{bound\_exp} \\ U \cdot 2^{bound\_exp} & B_k > U \cdot 2^{bound\_exp} \\ B_k & otherwise \end{cases} \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
lower_bound – [in] Mantissa of the lower clipping bound, \(L\)
upper_bound – [in] Mantissa of the upper clipping bound, \(U\)
bound_exp – [in] Shared exponent of the clipping bounds
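
Because the clipping bounds carry their own exponent, it can help to check expected results on logical values. The following plain-C reference is a sketch for that purpose only; clip_scale_pow2() and clip_s16_reference() are hypothetical names, and the function works in double precision rather than reproducing the library's fixed-point arithmetic.

```c
#include <stdint.h>

/* Multiply x by 2^exp without depending on <math.h>. */
double clip_scale_pow2(double x, int exp)
{
    for (; exp > 0; --exp) x *= 2.0;
    for (; exp < 0; ++exp) x *= 0.5;
    return x;
}

/* Hypothetical reference: logical value of one element after clipping to
 * [lower_bound * 2^bound_exp, upper_bound * 2^bound_exp]. */
double clip_s16_reference(int16_t b_mant, int b_exp,
                          int16_t lower_bound, int16_t upper_bound,
                          int bound_exp)
{
    double value = clip_scale_pow2((double)b_mant, b_exp);
    double lo = clip_scale_pow2((double)lower_bound, bound_exp);
    double hi = clip_scale_pow2((double)upper_bound, bound_exp);
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}
```

With lower_bound = -1, upper_bound = 1 and bound_exp = 4, the range is [-16, 16]: an element with mantissa 100 and exponent -2 (logical value 25) clips to 16, while mantissa 10 (logical value 2.5) is unchanged.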

void bfp_s16_rect(bfp_s16_t *a, const bfp_s16_t *b)

Rectify a 16-bit BFP vector.

Each element \(A_k\) of output BFP vector \(\bar A\) is set to the corresponding element \(B_k\) of input BFP vector \(\bar B\) if it is non-negative, otherwise it is set to \(0\).

a and b must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b.

Operation Performed:

\[\begin{split}\begin{flalign*} & A_k \leftarrow \begin{cases} 0 & B_k < 0 \\ B_k & otherwise \end{cases} \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)

void bfp_s16_to_bfp_s32(bfp_s32_t *a, const bfp_s16_t *b)

Convert a 16-bit BFP vector into a 32-bit BFP vector.

Increases the bit-depth of each 16-bit element \(B_k\) of input BFP vector \(\bar B\) to 32 bits, and stores the 32-bit result in the corresponding element \(A_k\) of output BFP vector \(\bar A\).

a and b must have been initialized (see bfp_s16_init() and bfp_s32_init()), and must be the same length.

Operation Performed:

\[\begin{split}\begin{flalign*} & A_k \overset{32-bit}{\longleftarrow} B_k \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)

void bfp_s16_sqrt(bfp_s16_t *a, const bfp_s16_t *b)

Get the square roots of elements of a 16-bit BFP vector.

Computes the square root of each element \(B_k\) of input BFP vector \(\bar B\) and stores the results in output BFP vector \(\bar A\).

a and b must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b.

Operation Performed:

\[\begin{split}\begin{flalign*} & A_k \leftarrow \sqrt{B_k} \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Notes

Only the XMATH_BFP_SQRT_DEPTH_S16 (see xmath_conf.h) most significant bits of each result are computed.
This function only computes real roots. For any \(B_k < 0\), the corresponding output \(A_k\) is set to \(0\).

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)

void bfp_s16_inverse(bfp_s16_t *a, const bfp_s16_t *b)

Get the inverses of elements of a 16-bit BFP vector.

Computes the inverse of each element \(B_k\) of input BFP vector \(\bar B\) and stores the results in output BFP vector \(\bar A\).

a and b must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b.

Operation Performed:

\[\begin{split}\begin{flalign*} & A_k \leftarrow B_k^{-1} \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)

float_s32_t bfp_s16_abs_sum(const bfp_s16_t *b)

Sum the absolute values of elements of a 16-bit BFP vector.

Sum the absolute values of elements of input BFP vector \(\bar B\) for a result \(A = a \cdot 2^{a\_exp}\), where \(a\) is a 32-bit mantissa and \(a\_exp\) is its associated exponent. \(A\) is returned.

b must have been initialized (see bfp_s16_init()).

Operation Performed:

\[\begin{split}\begin{flalign*} & A \leftarrow \sum_{k=0}^{N-1} \left| B_k \right| \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

b – [in] Input BFP vector \(\bar B\)

Returns:

\(A\), the sum of absolute values of elements of \(\bar B\)

float bfp_s16_mean(const bfp_s16_t *b)

Get the mean value of a 16-bit BFP vector.

Computes \(A = a \cdot 2^{a\_exp}\), the mean value of elements of input BFP vector \(\bar B\), where \(a\) is the 16-bit mantissa of the result, and \(a\_exp\) is its associated exponent. \(A\) is returned.

b must have been initialized (see bfp_s16_init()).

Operation Performed:

\[\begin{split}\begin{flalign*} & A \leftarrow \frac{1}{N} \sum_{k=0}^{N-1} \left( B_k \right) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

b – [in] Input BFP vector \(\bar B\)

Returns:

\(A\), the mean value of \(\bar B\)’s elements

float_s64_t bfp_s16_energy(const bfp_s16_t *b)

Get the energy (sum of squares of elements) of a 16-bit BFP vector.

Computes \(A = a \cdot 2^{a\_exp}\), the sum of squares of elements of input BFP vector \(\bar B\), where \(a\) is the 64-bit mantissa of the result, and \(a\_exp\) is its associated exponent. \(A\) is returned.

b must have been initialized (see bfp_s16_init()).

Operation Performed:

\[\begin{split}\begin{flalign*} & A \leftarrow \sum_{k=0}^{N-1} \left( B_k^2 \right) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

b – [in] Input BFP vector \(\bar B\)

Returns:

\(A\), \(\bar B\)’s energy

float_s32_t bfp_s16_rms(const bfp_s16_t *b)

Get the RMS value of elements of a 16-bit BFP vector.

Computes \(A = a \cdot 2^{a\_exp}\), the RMS value of elements of input BFP vector \(\bar B\), where \(a\) is the 32-bit mantissa of the result, and \(a\_exp\) is its associated exponent. \(A\) is returned.

The RMS (root-mean-square) value of a vector is the square root of the mean of the squares of the vector’s elements.

b must have been initialized (see bfp_s16_init()).

Operation Performed:

\[\begin{split}\begin{flalign*} & A \leftarrow \sqrt{\frac{1}{N}\sum_{k=0}^{N-1} \left( B_k^2 \right) } \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

b – [in] Input BFP vector \(\bar B\)

Returns:

\(A\), the RMS value of \(\bar B\)’s elements

float bfp_s16_max(const bfp_s16_t *b)

Get the maximum value of a 16-bit BFP vector.

Finds \(A\), the maximum value among elements of input BFP vector \(\bar B\). \(A\) is returned by this function.

b must have been initialized (see bfp_s16_init()).

Operation Performed:

\[\begin{split}\begin{flalign*} & A \leftarrow max\left(B_0, B_1, ..., B_{N-1} \right) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

b – [in] Input vector

Returns:

\(A\), the value of \(\bar B\)’s maximum element

void bfp_s16_max_elementwise(bfp_s16_t *a, const bfp_s16_t *b, const bfp_s16_t *c)

Get the element-wise maximum of two 16-bit BFP vectors.

Each element of output vector \(\bar A\) is set to the maximum of the corresponding elements in the input vectors \(\bar B\) and \(\bar C\).

a, b and c must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b, but not on c.

Operation Performed:

\[\begin{split}\begin{flalign*} & A_k \leftarrow max(B_k, C_k) \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C} && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
c – [in] Input BFP vector \(\bar C\)

float bfp_s16_min(const bfp_s16_t *b)

Get the minimum value of a 16-bit BFP vector.

Finds \(A\), the minimum value among elements of input BFP vector \(\bar B\). \(A\) is returned by this function.

b must have been initialized (see bfp_s16_init()).

Operation Performed:

\[\begin{split}\begin{flalign*} & A \leftarrow min\left(B_0, B_1, ..., B_{N-1} \right) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Parameters:

b – [in] Input vector

Returns:

\(A\), the value of \(\bar B\)’s minimum element

void bfp_s16_min_elementwise(bfp_s16_t *a, const bfp_s16_t *b, const bfp_s16_t *c)

Get the element-wise minimum of two 16-bit BFP vectors.

Each element of output vector \(\bar A\) is set to the minimum of the corresponding elements in the input vectors \(\bar B\) and \(\bar C\).

a, b and c must have been initialized (see bfp_s16_init()), and must be the same length.

This operation can be performed safely in-place on b, but not on c.

Operation Performed:

\[\begin{split}\begin{flalign*} & A_k \leftarrow min(B_k, C_k) \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C} && \end{flalign*}\end{split}\]

Parameters:

a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
c – [in] Input BFP vector \(\bar C\)

unsigned bfp_s16_argmax(const bfp_s16_t *b)

Get the index of the maximum value of a 16-bit BFP vector.

Finds \(a\), the index of the maximum value among the elements of input BFP vector \(\bar B\). \(a\) is returned by this function.

If i is the value returned, then the maximum value in \(\bar B\) is ldexp(b->data[i], b->exp).

Operation Performed:

\[\begin{split}\begin{flalign*} & a \leftarrow argmax_k\left(b_k\right) \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Notes

If there is a tie for maximum value, the lowest tying index is returned.

Parameters:

b – [in] Input vector

Returns:

\(a\), the index of the maximum value from \(\bar B\)
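
Since every element shares one exponent and \(2^{exp}\) is positive, comparing mantissas is equivalent to comparing logical values. A plain-C sketch of the search, including the tie-breaking rule from the note above, might look like this; argmax_s16_model() is a hypothetical name and the function operates on a bare mantissa buffer rather than a bfp_s16_t.

```c
#include <stdint.h>

/* Index of the largest mantissa in data[0..length-1].
 * `length` is assumed to be at least 1. */
unsigned argmax_s16_model(const int16_t data[], unsigned length)
{
    unsigned best = 0;
    for (unsigned k = 1; k < length; ++k) {
        /* Strict '>' keeps the lowest index when several elements tie. */
        if (data[k] > data[best]) {
            best = k;
        }
    }
    return best;
}
```

For the mantissas {3, 9, -2, 9} the model returns index 1, not 3.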

unsigned bfp_s16_argmin(const bfp_s16_t *b)

Get the index of the minimum value of a 16-bit BFP vector.

Finds \(a\), the index of the minimum value among the elements of input BFP vector \(\bar B\). \(a\) is returned by this function.

If i is the value returned, then the minimum value in \(\bar B\) is ldexp(b->data[i], b->exp).

Operation Performed:

\[\begin{split}\begin{flalign*} & a \leftarrow argmin_k\left(b_k\right) \\ & \qquad\text{for } k \in 0\ ...\ (N-1) \\ & \qquad\text{where } N \text{ is the length of } \bar{B} && \end{flalign*}\end{split}\]

Notes

If there is a tie for minimum value, the lowest tying index is returned.

Parameters:

b – [in] Input vector

Returns:

\(a\), the index of the minimum value from \(\bar B\)

headroom_t bfp_s16_accumulate(split_acc_s32_t a[], const exponent_t a_exp, const bfp_s16_t *b)

Accumulate a 16-bit BFP vector into a 32-bit accumulator vector.

This function is used for efficiently accumulating a series of 16-bit BFP vectors into a 32-bit vector. Each call to this function adds a BFP vector \(\bar B\) into the persistent 32-bit accumulator vector \(\bar A\).

Eventually the value of \(\bar A\) will be needed for something other than simple accumulation, which requires converting from the XS3-native split accumulator representation given by the split_acc_s32_t struct, into a standard vector of int32_t. This can be accomplished using vect_s32_merge_accs(). From there, the int32_t vector can be dropped to a 16-bit vector with vect_s32_to_vect_s16() if needed.

Note, in order for this operation to work, \(\mathtt{b\_exp} - \mathtt{a\_exp}\) must be no greater than \(14\).

Operation Performed:

\[\begin{flalign*} \bar{A} \leftarrow \bar{A} + \bar{B} && \end{flalign*}\]

Proper use of this function requires some book-keeping on the part of the caller. In particular, the caller is responsible for tracking the exponent and monitoring the headroom of the accumulator vector \(\bar A\).

Usage

To begin a sequence of accumulation, start by clearing the contents of \(\bar A\) to all zeros. Then, an appropriate exponent for \(\bar A\) must be chosen. The only hard constraint is that the accumulator exponent, \(\mathtt{a\_exp}\) must be within \(14\) of \(\bar B\)’s exponent, \(\mathtt{b\_exp}\). If \(\mathtt{b\_exp}\) is unknown, the caller may choose to wait until the first \(\bar B\) is available before initializing \(\mathtt{a\_exp}\).

As vectors are accumulated into \(\bar A\) with multiple calls to this function, it becomes possible for \(\bar A\) to saturate for some element. Each call to this function returns the headroom of \(\bar A\) (note: no more than 15 bits of headroom will be reported). If \(\bar A\) has at least 1 bit of headroom, then a call to this function is guaranteed not to saturate.

The larger \(\mathtt{a\_exp}\) is compared to each \(\mathtt{b\_exp}\), the more 16-bit vectors can be accumulated before saturation becomes possible (and by virtue of that, the more efficiently accumulation can take place). On the other hand, as long as \(\mathtt{a\_exp} \le \mathtt{b\_exp}\), there is no precision loss during accumulation. It is the responsibility of the caller to manage this trade-off.

If and when this function reports that \(\bar A\) has 0 headroom, if further accumulation is needed, the caller can handle this by increasing \(\mathtt{a\_exp}\). Increasing \(\mathtt{a\_exp}\) will require that the contents of the mantissa vector \(\bar a\) be right-shifted to avoid corrupting the value of \(\bar A\), making room for further accumulation in the process. Shifting the split accumulators can be accomplished with a call to vect_split_acc_s32_shr().

Finally, when accumulation is complete or the accumulator values must be used elsewhere, the split accumulator vector can be converted to a simple int32_t vector with a call to vect_s32_merge_accs().

Parameters:

a – [inout] Mantissas of accumulator vector \(\bar A\)
a_exp – [in] Exponent of accumulator vector \(\bar A\)
b – [in] Input vector \(\bar B\)

Returns:

Headroom of \(\bar A\) (up to 15 bits)
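
The book-keeping described under Usage can be modelled in plain C with an ordinary int32_t array standing in for the split accumulators. This is only a sketch of the arithmetic: acc_headroom_model(), accumulate_model() and acc_rescale_model() are hypothetical names, the symmetric 32-bit saturation bounds and the floor rounding of right-shifts are assumptions, and none of this reflects how the XS3 split-accumulator format is actually stored.

```c
#include <stdint.h>

/* Redundant sign bits of a 32-bit value (0..31). */
unsigned acc_headroom_model(int32_t x)
{
    uint32_t v = (uint32_t)(x < 0 ? ~x : x);
    unsigned hr = 0;
    for (int bit = 30; bit >= 0 && !((v >> bit) & 1u); --bit) {
        ++hr;
    }
    return hr;
}

/* acc[k] += b[k] * 2^(b_exp - a_exp). Returns the headroom of acc, capped
 * at 15. Assumes -14 <= b_exp - a_exp <= 14. */
unsigned accumulate_model(int32_t acc[], int a_exp,
                          const int16_t b[], int b_exp, unsigned length)
{
    int shl = b_exp - a_exp;
    unsigned hr = 31;
    for (unsigned k = 0; k < length; ++k) {
        int64_t term = b[k];
        if (shl >= 0) {
            term *= ((int64_t)1 << shl);
        } else {
            /* Floor division, matching an arithmetic right-shift. */
            int64_t div = (int64_t)1 << -shl;
            term = (term >= 0) ? term / div : -((-term + div - 1) / div);
        }
        int64_t sum = (int64_t)acc[k] + term;
        if (sum > INT32_MAX) sum = INT32_MAX;    /* assumed symmetric bounds */
        if (sum < -INT32_MAX) sum = -INT32_MAX;
        acc[k] = (int32_t)sum;

        unsigned h = acc_headroom_model(acc[k]);
        if (h < hr) hr = h;
    }
    return hr > 15 ? 15 : hr;
}

/* Caller-side step: raise *a_exp by shr (assumed < 31) and right-shift the
 * accumulators to match, which frees up shr bits of headroom. */
void acc_rescale_model(int32_t acc[], int *a_exp, unsigned length, unsigned shr)
{
    int64_t div = (int64_t)1 << shr;
    for (unsigned k = 0; k < length; ++k) {
        int64_t v = acc[k];
        acc[k] = (int32_t)((v >= 0) ? v / div : -((-v + div - 1) / div));
    }
    *a_exp += (int)shr;
}
```

In this model, an accumulator with at least 1 bit of headroom has magnitude below \(2^{30}\), and a single term is at most \(2^{15} \cdot 2^{14} = 2^{29}\) in magnitude, so one further accumulation cannot saturate. That is consistent with the guarantee and the limit of 14 stated above. When the model reports 0 headroom, calling acc_rescale_model() with shr = 1 restores one bit before the next accumulation.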