Advancing the capabilities of xcore

In 2008, XMOS shipped its first multi-core chip: the XS1. This processor reinvented real-time I/O, enabling low-latency interactions between software and I/O interfaces. Working in software, designers were able to act on hardware signals with latencies lower than a microsecond. Delivering on the application demands of both USB Audio and robotics, the first-generation XS1 offered a valuable solution to the challenges of a diverse marketplace.

The second-generation processor followed in 2015: xcore®-200 delivered improved performance and quadrupled the capacity for full-precision signal processing. With up to eight times the memory of XS1, xcore-200 was able to implement high-demand signal processing tasks, such as microphone beamforming.

Jump forward to 2020 and the announcement of our third-generation architecture: xcore®.ai. This cost-effective crossover processor enables developers to undertake day-to-day signal processing, communications, and control, and to implement a neural network classifier, all on a single device. The same hardware that performs the signal processing can be repurposed to implement deep convolutional neural networks that have been trained with TensorFlow or any other machine learning framework.

XCORE.AI OVERVIEW

Architecturally, xcore.ai introduces a vector unit and a floating-point unit. Together these enable fast execution of DSP, block-floating-point, and inference engines, as well as algorithms that rely on the dynamic range of floating-point numbers. Whether you need to track values using a Kalman filter or run a large FIR filter over your data, xcore.ai can do both: up to 1,400 MFLOP/s and 9,000 to 36,000 MMAC/s, depending on accuracy.
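To make the block-floating-point idea concrete, here is a minimal Python sketch in which a vector of integer mantissas shares a single exponent. The function names `to_bfp` and `from_bfp` and the 16-bit mantissa width are assumptions chosen for illustration; they are not part of any XMOS library.

```python
import math

def to_bfp(values, mantissa_bits=16):
    """Encode floats as integer mantissas sharing one exponent (hypothetical helper)."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0] * len(values), 0
    # Choose the exponent so the largest magnitude still fits in a signed mantissa.
    limit = 2 ** (mantissa_bits - 1) - 1
    exponent = math.ceil(math.log2(peak / limit))
    scale = 2.0 ** exponent
    return [round(v / scale) for v in values], exponent

def from_bfp(mantissas, exponent):
    """Decode a block back to floats: value = mantissa * 2**exponent."""
    return [m * 2.0 ** exponent for m in mantissas]

mantissas, exponent = to_bfp([1.0, -0.5, 0.25])
restored = from_bfp(mantissas, exponent)
```

Because every element in the block uses the same exponent, the arithmetic on the mantissas is plain integer arithmetic, which is the kind of work a vector unit handles well. The shared exponent only needs adjusting when the block's peak value changes.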

Physically, xcore.ai has much lower standby power than xcore-200, and it adds an on-chip application PLL and other features that reduce the bill of materials (BOM) for a full solution. The minimum BOM around an xcore.ai device comprises power supplies, a crystal, and a flash memory. The device can power down to a state low enough to comply with USB suspend mode, and power back up almost instantaneously.

In terms of PHYs, xcore.ai boasts a high-speed USB PHY, a MIPI PHY, and an external memory interface to make the memory model more flexible. The external memory interface is useful for applications that need large data buffers, such as networking packet buffers or audio buffers. The MIPI PHY is capable of receiving data at a rate of up to 1.5 Gbit/s on one or two data lanes. The USB PHY is capable of high-speed traffic, up to 480 Mbit/s.

In terms of GPIO, xcore.ai offers between 29 and 128 GPIO pins that can be used at 1.8 V or 3.3 V, depending on the package chosen. As with previous generations, it is up to the system designer to decide what to use the GPIO pins for. Serial data, parallel data, and complex protocols can all be implemented in software, with a large set of I/O libraries available to implement the most common protocols.

THE ARCHITECTURE IN ACTION

A system that can interpret audio signals (e.g. spot a keyword) and forward any voice command to the cloud has to perform four distinct tasks:

  1. Read the microphone data out
  2. Run Fourier transforms over windows of input data, calculating signal strength in frequency bands over time
  3. Take the resulting image, and run it through a neural network
  4. Communicate any voice commands out over Wi-Fi

The xcore.ai chip can perform all four tasks simultaneously, with the first three run in hard real-time.

For the first task, we use one of the standard libraries to input audio samples. This could be the I2S or TDM library if the audio signals come in over a serial line, the PDM library if the signal is PDM-encoded, or one of the more exotic interfaces such as ADAT or S/PDIF. Since all these libraries are implemented in software, xcore.ai has full flexibility to support any variant of any of those protocols. Once the signal has been read at its native frequency (whether that be 48 kHz I2S or 3,072 kHz PDM), a sample rate converter can downsample it to a frequency suitable for voice recognition, such as 16 kHz.
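The downsampling step can be illustrated with a short Python sketch that converts 48 kHz samples to 16 kHz by low-pass filtering and then keeping every third sample. The filter design (a 31-tap Hamming-windowed sinc with a 7 kHz cutoff) and the names `lowpass_taps` and `decimate` are assumptions made for this example; a production sample rate converter would typically use a longer or multi-stage filter.

```python
import math

def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass; cutoff is a fraction of the input sample rate."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalise to unity gain at DC

def decimate(samples, factor, taps):
    """Low-pass filter, computing only every `factor`-th output sample."""
    out = []
    for start in range(0, len(samples) - len(taps) + 1, factor):
        window = samples[start:start + len(taps)]
        # The taps are symmetric, so no time reversal is needed for the convolution.
        out.append(sum(s * t for s, t in zip(window, taps)))
    return out

# Assumed cutoff of 7 kHz, just below the 8 kHz Nyquist limit of the 16 kHz output.
taps = lowpass_taps(num_taps=31, cutoff=7 / 48)
one_second = [1.0] * 48_000           # constant (DC) test signal sampled at 48 kHz
voice_rate = decimate(one_second, 3, taps)
```

Filtering before discarding samples prevents content above 8 kHz from aliasing into the voice band. Computing only every third output, rather than filtering everything and then throwing two thirds away, cuts the multiply-accumulate count by the decimation factor.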

Stage two involves pre-processing the data. The vector unit that forms the backbone of the xcore.ai architecture provides a vectorised complex multiplication instruction, which can calculate two complex products in every clock cycle. Combined with support for efficient headroom management, this instruction enables real-to-complex or complex-to-complex Fast Fourier Transforms (FFTs) to be computed quickly. If the whole chip did nothing but calculate FFTs, it could perform just under 1 million complex FFTs per second (256 points each), and roughly double that for real-to-complex FFTs. Typically, we then calculate the magnitude of the frequency spectrum, and approximate a log for each magnitude.
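As a rough illustration of this pre-processing stage, the following Python sketch runs a textbook radix-2 FFT over one 256-sample frame and takes the log of each magnitude. It uses floating-point arithmetic and an exact `math.log2` rather than the fixed-point, headroom-managed arithmetic and approximate log described above, and the names `fft` and `log_magnitudes` are chosen for this example only.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Each butterfly needs one complex multiplication by a twiddle factor.
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def log_magnitudes(frame):
    """Log2 magnitude of the non-negative frequency bins of a real frame."""
    spectrum = fft(frame)
    bins = spectrum[: len(frame) // 2 + 1]
    # The small offset avoids log2(0) for empty bins.
    return [math.log2(abs(b) + 1e-12) for b in bins]

N = 256
tone_bin = 32  # 32 * 16000 / 256 = 2000 Hz at a 16 kHz sample rate
frame = [math.cos(2 * math.pi * tone_bin * n / N) for n in range(N)]
features = log_magnitudes(frame)
peak_bin = features.index(max(features))
```

The complex multiplication inside each butterfly is the operation the vector unit accelerates. Stacking the log-magnitude vectors of successive frames side by side produces the time-frequency image that the neural network consumes in the next stage.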

In stage three, the resulting magnitudes are stored in 8-bit values, and the same vector unit that executes parallel complex multiplications (four times 32-bit real and 32-bit complex) can execute 32 bytewise multiplications in a single clock cycle. In addition, it can load one of the operands from memory, and accumulate all the results into an accumulator. Because these steps are combined in a single instruction, the chip is able to perform 30 billion multiply-accumulates per second. It is specifically optimised to execute convolutional layers, where the same data is multiplied with a multitude of convolutions. A typical network, such as MobileNet V1, can be executed in 2.2 ms.
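The bytewise multiply-accumulate can be modelled in a few lines of Python. This is a software model of the idea only, not the exact semantics of the hardware instruction: the quantisation scheme, the saturation behaviour, and the names `quantise_int8`, `vector_mac` and `dot_int8` are all assumptions made for illustration.

```python
VECTOR_LANES = 32  # bytewise multiplications per step, as described in the text

def quantise_int8(values, scale):
    """Map real values to signed 8-bit integers, saturating at the int8 limits."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def vector_mac(accumulator, data, weights):
    """Model one step: multiply 32 int8 pairs and add all products to the accumulator."""
    assert len(data) == len(weights) == VECTOR_LANES
    return accumulator + sum(d * w for d, w in zip(data, weights))

def dot_int8(data, weights):
    """Dot product of two int8 vectors whose length is a multiple of 32."""
    acc = 0
    for i in range(0, len(data), VECTOR_LANES):
        acc = vector_mac(acc, data[i:i + VECTOR_LANES], weights[i:i + VECTOR_LANES])
    return acc

features = quantise_int8([0.5] * 64, scale=0.01)  # each value becomes 50
kernel = [2] * 64                                 # hypothetical int8 weights
activation = dot_int8(features, kernel)
```

In a convolutional layer the same `features` vector would be passed to `dot_int8` once per kernel, which is why keeping the data in place and streaming different weights past it pays off.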

Finally, the Wi-Fi task. Unlike all the other tasks, this is not a hard real-time task, so we execute it under the control of FreeRTOS. We communicate with a Wi-Fi chip over SPI, and run the TCP and other tasks inside FreeRTOS. FreeRTOS virtualises the logical cores of the xcore.ai processor and provides the standard programming environment that is required for the client side.

All these capabilities come in a 60-pin QFN package (7×7 mm). Larger packages with more GPIO are also available; these can connect to an external LPDDR memory for applications that are more memory-demanding. Two or more xcore.ai chips can be placed side by side on a board for applications that need more compute.

To find out more or register your interest in our xcore.ai Alpha programme, please visit the xcore.ai page on our website. Here you’ll find technical and commercial whitepapers alongside primary reports that further explore the AIoT market.
