Peeling the onion on “power”
The battle with battery life (or lack of) is a familiar one. Typically, when people ask about power, I find they’re really thinking about energy consumption. This may seem like a trivial point, but it’s not: power is an instantaneous measure and energy is a measure of power consumed over time – which determines how long our batteries last, and how quickly we consume our natural resources.
We’re much more focused on the source of our energy today. Innovation in renewable technology and energy efficiency is both exciting and critical. Clean energy and carbon emission targets are more than a nice-to-have: they’re becoming a matter of legislation. That legislation is increasingly leading to limits on the energy consumption characteristics of the products that we can buy on our high streets.
Many contemporary “low-power” edge AI processing devices (like our own xcore.ai) have significant processing resources capable of multiple operations per clock cycle. These can consume tens of watts in the instant the processing unit is activated (typically by a clock pulse). If this level of consumption were continuous, intelligence at the edge would not be practical. But consumption isn’t continuous: power is drawn in the picoseconds around the clock edge, after which consumption drops by several orders of magnitude. Overall, the energy consumed (the integral of power over time) remains low.
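To make the distinction concrete, here is a minimal sketch of the arithmetic for a pulsed load, assuming a simple two-level waveform: a brief peak around the clock edge and a much lower draw for the rest of the cycle. Every figure below is a made-up illustration, not a measurement of xcore.ai or any other device.

```python
def energy_per_cycle(p_peak_w, t_active_s, p_idle_w, period_s):
    """Energy (joules) over one clock period for a two-level power waveform."""
    if not 0 <= t_active_s <= period_s:
        raise ValueError("active time must fit within the clock period")
    # Integral of power over the period: peak part plus idle part.
    return p_peak_w * t_active_s + p_idle_w * (period_s - t_active_s)


def average_power(p_peak_w, t_active_s, p_idle_w, period_s):
    """Average power (watts) is the energy per period divided by the period."""
    return energy_per_cycle(p_peak_w, t_active_s, p_idle_w, period_s) / period_s


# Hypothetical figures, chosen only to illustrate the calculation.
P_PEAK_W = 20.0       # instantaneous draw around the clock edge
T_ACTIVE_S = 50e-12   # 50 ps of switching activity per cycle
P_IDLE_W = 0.02       # draw for the remainder of the cycle
PERIOD_S = 2e-9       # one period of a 500 MHz clock

avg = average_power(P_PEAK_W, T_ACTIVE_S, P_IDLE_W, PERIOD_S)
print(f"peak power: {P_PEAK_W:.1f} W, average power: {avg:.3f} W")
```

With these assumed values the average sits far below the peak, which is the whole point: the battery sees the integral, not the spike.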
Recognition that it’s about energy, not power, gets us through the skin to the onion beneath. So far so good.
System-level power management
The next layer of consideration is about modes. Let’s use an analogy of a daily commute (distant though that seems to us today). Here’s the most efficient way I can imagine making my trip to the office from my front door:
- My car knows I’m nearby, starts its engine, unlocks when I’m immediately next to it, and is ready to go at the click of my seatbelt.
- The minor roads are all clear, the traffic lights are green, and I have my own private lane on the motorway, so my car can maintain the most efficient speed for the whole journey.
At the end of my utopian commute, my car has already detected that the nearest empty parking space is waiting for me. My car can now return to a sleep mode, ready for the next time it senses my presence.
Let’s analyse this a little. When my car sensed that it was needed, it began to ready itself for its “mission mode” and started its engine. One implication is that the car had sufficient intelligence to know that it was required. This could have been anything from recognising the proximity of the keys to recognising me as a human being and identifying my intent. Since this process involves both the sensors themselves and the inference of my identity and intent, providing this facility must be considered at a system level.
Another implication is that the resources required for my car to start its engine, and the time taken to do so, do not compromise the perfect commute. If my car engine took 20 minutes to start, or consumed enormous amounts of energy to do so, I would have to leave my car running. In electronic system terms, the energy and time taken to go from a powered-off state to a mission-mode state define its usefulness. Start-up, or boot time, is a critical parameter in low-energy embedded systems.
Many Linux-based application processors can take multiple seconds to reach their mission mode. If these processors are responsible for timing-critical functions, like voice interfaces, then the only option is for them to be permanently switched on. The knock-on impact on the system-level “standby” mode (synonymous with low power) is that devices such as TVs continue to consume multiple watts, if not tens of watts.
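The trade-off can be sketched with some simple bookkeeping. The model below is an assumption of mine rather than a description of any particular product: a device either stays on for the whole window, or sleeps and pays a boot cost on every wake-up. All names and numbers are hypothetical.

```python
def always_on_energy_j(p_active_w, window_s):
    """Energy used if the device never leaves mission mode."""
    return p_active_w * window_s


def duty_cycled_energy_j(p_sleep_w, p_active_w, boot_energy_j, boot_time_s,
                         active_time_s, wakeups, window_s):
    """Energy used if the device sleeps between wake-ups."""
    awake_s = wakeups * (boot_time_s + active_time_s)
    if awake_s > window_s:
        raise ValueError("wake-ups do not fit in the window")
    sleep_s = window_s - awake_s
    return (p_sleep_w * sleep_s
            + wakeups * (boot_energy_j + p_active_w * active_time_s))


def can_sleep(boot_time_s, max_response_latency_s):
    """Sleeping is only an option if the device can boot within the deadline."""
    return boot_time_s <= max_response_latency_s


# Hypothetical one-hour window with 20 short interactions.
WINDOW_S = 3600.0
on = always_on_energy_j(p_active_w=2.0, window_s=WINDOW_S)
cycled = duty_cycled_energy_j(p_sleep_w=0.01, p_active_w=2.0,
                              boot_energy_j=0.5, boot_time_s=0.05,
                              active_time_s=5.0, wakeups=20,
                              window_s=WINDOW_S)
print(f"always on: {on:.0f} J, duty cycled: {cycled:.2f} J")
print("fast boot may sleep:", can_sleep(0.05, max_response_latency_s=0.2))
print("slow boot may sleep:", can_sleep(3.0, max_response_latency_s=0.2))
```

The saving only exists if `can_sleep` is true. A processor that takes seconds to boot fails that check for a voice interface, and so it is forced into the always-on figure regardless of how good its sleep mode looks on paper.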
Typically, when considering low-energy systems, we look at power consumption in the various power states, without considering the intelligence and agility with which we can move between them. In many cases, system requirements preclude the use of low-power modes simply due to poor system design choices. Appropriate attention needs to be paid to power management at a system level, by choosing sensor and processing solutions that deliver an appropriate amount of intelligence.
Are your eyes watering yet? Let’s move on…
Abstraction — the enemy of efficiency
There is no point in getting to market with a perfect product if it is too late, too expensive, or simply doesn’t do what it should (which is often a corollary of being late). When designing embedded systems, we consistently prioritise ease of use and fast time to market over low energy.
A classic example is the operating system — system software that provides many generic system services to the application (for example scheduling, multitasking, resource management and communications). Even in embedded systems, which typically deploy lightweight real-time operating systems (RTOS), the RTOS can consume 25% to 50% of the total energy consumed by the processor.
The list of ‘inefficient’ abstractions goes on:
- High-level programming languages and their associated compilers can produce code that is less efficient than that which is programmed in the processor’s native machine code
- The use of an instruction-set architecture (a processor) delivers substantial flexibility, but at the expense of the efficiency that could be gained with a purpose-built hardware solution
- The use of “standard cells” and automated place-and-route tools in the design of the silicon chip compromises the efficiency that would be gained with transistor-level design and a custom layout
- The use of synchronous circuits compromises the efficiencies that could be gained from asynchronous design.
This list could be much longer, and each of these abstractions could be broken down into further compromises that have been made to manage complexity.
Should this change? Probably not. Our customers are becoming ever more demanding of a diverse range of features, delivered to market quickly. The flexibility of programmability in a high-level language is key to satisfying these needs.
Consciously or not, we continue to prioritise product fit, time to market and cost efficiency over power. We need to pause and put on the swimming goggles (I’m serious, it works), and then keep peeling.
The processor architecture
To return to our driving analogy, we are interested in efficiency, not raw speed. Typically, we need to perform useful work in an acceptable time, with minimal energy consumption. If the acceptable time is our car at maximum speed, then our options are limited — we simply have to put our foot down, sacrificing all of the optimisations that may have made our car more efficient at other speeds.
The interesting area to look at is the typical case, where the processor is not flat out and opportunities exist to reduce energy consumption. Before we consider these options, we have to consider the flexibility of the types of workloads that are typical in embedded systems — specifically around their schedule. Software engineers use the terms hard and soft real time to describe the importance of getting a task done in a certain timeframe. Most software engineers would agree that the removal of their hand from under a steam hammer is a hard real-time deadline, whereas they might regard their coffee break as a soft real-time deadline.
In a typical system, where the many tasks that comprise an application must share one or two processors, the hard and soft real-time deadlines can interfere with one another and create uncertainty, sometimes called jitter. An empirical way of addressing this is to run the processing resources faster than they need to go — a bit like driving faster than required to mitigate the risk of unforeseen traffic. The problem is that a) this is seldom guaranteed to work in all circumstances and b) more energy is being used and energy reduction strategies are compromised.
In a highly timing-deterministic architecture like xcore, things are different. Tasks are allocated across more processing resources that have absolutely predictable timing behaviour. Since we know the exact operating parameters of the xcore and firmware during the design phase, we can optimise execution to the exact needs of the application and therefore the optimum energy. As in our analogy, you get your own lane on the motorway, so you can be much more certain of driving conditions and optimise your speed to maximise efficiency and arrive just in time.
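As a rough illustration of what “arrive just in time” means for clock speed, the sketch below compares the frequency needed when the worst-case cycle count is known exactly with the frequency chosen when a safety margin is added to cover timing uncertainty. The function, the cycle count, and the 50% margin are illustrative assumptions, not xcore figures.

```python
def required_frequency_hz(worst_case_cycles, deadline_s, margin=0.0):
    """Clock frequency needed to finish worst_case_cycles within deadline_s.

    margin is the fractional over-provisioning added to cover timing
    uncertainty; 0.0 means the cycle count is trusted exactly.
    """
    if deadline_s <= 0:
        raise ValueError("deadline must be positive")
    if margin < 0:
        raise ValueError("margin cannot be negative")
    return worst_case_cycles * (1.0 + margin) / deadline_s


# Hypothetical task: one million cycles of work, due every 10 ms.
CYCLES = 1_000_000
DEADLINE_S = 0.010

deterministic_hz = required_frequency_hz(CYCLES, DEADLINE_S)
padded_hz = required_frequency_hz(CYCLES, DEADLINE_S, margin=0.5)

print(f"known timing:   {deterministic_hz / 1e6:.0f} MHz")
print(f"with 50% slack: {padded_hz / 1e6:.0f} MHz")
```

The padded figure is the cost of uncertainty. A higher clock usually also demands a higher supply voltage, which is where most of the energy penalty comes from.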
By knowing the exact requirements of the processing workload under all circumstances, we can make very accurate adjustments to the operating conditions of the device, but we are getting ahead of ourselves — more on that later.
Energy optimisation should be possible without compromising ease of use, otherwise the clever features that we introduce at hardware level simply won’t get used in our rush to get to market. This is never truer than at the implementation layer, where an understanding of the underlying architecture may be necessary to extract the best performance.
Nonetheless, there are some aspects of implementation that are well known to embedded system engineers. One of them is the amount of memory that is on chip vs off chip. As a rule, the energy consumed by an application running from internal memory is significantly lower than the same application running from external memory. The amount of energy consumed by an application running on a shared-memory architecture will also typically be lower than on a distributed-memory system. Put simply, moving data around consumes energy – particularly moving it on and off chip.
The second is the instruction-set architecture. With careful design, instructions can minimise memory bandwidth and enable complex operations with minimal execution cycles, enabling the operating frequency to be reduced. For example, on xcore.ai the rotating accumulator in our vector unit enables a single-cycle load, multiply, and accumulate to perform matrix computations extremely efficiently so that we can keep the operating frequency down.
On the logic side, we can use power and clock gating to ensure that parts of the circuit are not activated when they do not need to be. This can be for coarse static reasons — “I don’t need a USB interface in this application” — or dynamically — “if I administer a clock pulse to this logic element, it will consume energy but the output won’t change, so I won’t clock it”.
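The dynamic case can be illustrated with a toy software model. This is only a sketch of the idea, assuming a single register that is clocked when its incoming value differs from the value it already holds; real clock gating is implemented in hardware, and the function name here is hypothetical.

```python
def count_clock_pulses(inputs, gated):
    """Count clock pulses delivered to a register fed by `inputs`.

    Ungated: the register is clocked on every cycle.
    Gated: the clock is enabled only when the incoming value differs
    from the stored value, so redundant pulses are suppressed.
    Returns (pulses_delivered, final_stored_value).
    """
    stored = None
    pulses = 0
    for value in inputs:
        if not gated or value != stored:
            pulses += 1      # a pulse reaches the register
            stored = value   # the register captures the input
    return pulses, stored


# Hypothetical input stream with long runs of unchanged data.
stream = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]

ungated_pulses, ungated_final = count_clock_pulses(stream, gated=False)
gated_pulses, gated_final = count_clock_pulses(stream, gated=True)

print("ungated pulses:", ungated_pulses)
print("gated pulses:  ", gated_pulses)
print("same final state:", ungated_final == gated_final)
```

Both versions end in the same state; the gated one simply avoids the pulses that would not have changed anything.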
On xcore.ai, the clocking can be very finely controlled. As instructions flow through the pipeline, a single clock pulse can follow them, ensuring that all spurious energy usage is minimised. The beauty of clock gating at this level is that it is completely invisible to the programmer: it happens automatically on a clock-tick-by-clock-tick basis. In fact, the abstraction is so clean that the xcore could be implemented asynchronously (without any clocks at all) without affecting the programming model. But I digress; asynchronous logic implementation certainly warrants its own onion.
Finally, there is the physical layer: exploiting the effects of voltage and temperature on the physics of the silicon process. This typically involves a frequent assessment of the operating parameters of the silicon with respect to the application workload, and varying the voltage and frequency to deliver precisely the required performance.
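A common first-order model for this is the CMOS dynamic power relation, P = a·C·V²·f, which means the dynamic energy for a fixed number of cycles scales with V². The sketch below uses that model to compare two operating points. The table of frequency and voltage pairs, the capacitance, and the workload are all hypothetical, and the model ignores leakage and temperature effects.

```python
# Hypothetical (frequency in Hz, supply voltage in V) pairs, sorted ascending.
OPERATING_POINTS = [
    (200e6, 0.75),
    (400e6, 0.85),
    (600e6, 0.95),
    (800e6, 1.05),
]


def pick_operating_point(required_hz):
    """Return the slowest operating point that still meets required_hz."""
    for freq_hz, v_dd in OPERATING_POINTS:
        if freq_hz >= required_hz:
            return freq_hz, v_dd
    raise ValueError("no operating point is fast enough")


def task_energy_j(c_eff_f, v_dd, cycles, activity=1.0):
    """Dynamic energy for `cycles` clock cycles: a * C * V^2 per cycle."""
    return activity * c_eff_f * v_dd ** 2 * cycles


# Hypothetical workload: 3 million cycles due every 10 ms.
CYCLES = 3_000_000
DEADLINE_S = 0.010
C_EFF_F = 1e-9  # assumed effective switched capacitance

exact_hz = CYCLES / DEADLINE_S   # requirement known precisely
padded_hz = exact_hz * 1.5       # 50% over-provisioned

_, v_exact = pick_operating_point(exact_hz)
_, v_padded = pick_operating_point(padded_hz)

e_exact = task_energy_j(C_EFF_F, v_exact, CYCLES)
e_padded = task_energy_j(C_EFF_F, v_padded, CYCLES)

print(f"exact:  {v_exact:.2f} V, {e_exact * 1e3:.3f} mJ")
print(f"padded: {v_padded:.2f} V, {e_padded * 1e3:.3f} mJ")
print(f"energy ratio: {e_exact / e_padded:.3f}")
```

Under these assumptions the same work costs less energy at the lower operating point purely because of the voltage term, which is why knowing the true requirement matters so much.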
Of course, these optimisations presume that the system designer knows the precise required performance. As mentioned previously, most traditional architectures encourage over-provisioning in an attempt to ensure that requirements are met in all circumstances. The best power management strategies cannot compensate for these approximations.
On xcore.ai we combine the means to measure the runtime parameters of the silicon with the ability to precisely understand the needs of the application to build the most efficient execution environment, and save the most energy.
Next time you are assessing the lowest-power solution, I hope that you will look beyond the basic parameters in the datasheet and take a system view — to think about all of the different considerations that impact overall power consumption. It might save you a few tears in the end.