Cortex A57 - Performance and Power - ARM A53/A57/T760 investigated

We once again turn to SPECint2000 for an estimated performance score of the architecture - again, keeping in mind that this a deprecated and retired benchmark on the PC space. When we look at the results of the SPEC scores, the gains are indeed smaller than what the A53 got over the A7. We get

Cortex A57 - Synthetic Performance

We once again turn to SPECint2000 for an estimated performance score of the architecture - again, keeping in mind that this a deprecated and retired benchmark on the PC space.

SPECint2000 - Estimated Scores
 Apple A8Tegra K1-64
(Denver)
Exynos 5430
(A15)
Exynos 5433
(A57)
5433 > 5430
%Advantage
164.gzip842126970381315%
175.vpr12281312782112044%
176.gcc181018841222154928%
177.mesa 21871308166628%
179.art 51462113257424%
181.mcf14201746983119223%
186.crafty20211470854114934%
197.parser1129119272884116%
252.eon193323421585209632%
253.perlbmk16661818994125827%
254.gap182118441279146615%
255.vortex171625671283165232%
256.bzip212341468898102715%
300.twolf163317851102126016%

When we look at the results of the SPEC scores, the gains are indeed smaller than what the A53 got over the A7. We get a relatively even 25% average boost throughout the various sub-tests of SPECint2000. Considering that the Exynos 5430 ran at 1.8GHz versus 1.9GHz for the 5433, this slight advantage goes down to 18% average when normalizing for clock speeds. The 177.mesa and 179.art tests are normally not part of the SPECint2000 suite but part of SPECfp2000 (being floating points tests), and usually don't get taken into account in the integer score of SPEC.

GeekBench 3 - Integer Performance
 A15 (ARMv7)A57 (AArch32)% Advantage
AES ST68.4 MB/s1330 MB/s1844%
AES MT267.9 MB/s4260 MB/s1490%
Twofish ST64.4 MB/s81.9 MB/s27%
Twofish MT249.6 MB/s440.5 MB/s76%
SHA1 ST187.6 MB/s464.2 MB/s147%
SHA1 MT712.2 MB/s2020 MB/s183%
SHA2 ST80.5 MB/s121.9 MB/s51%
SHA2 MT316.0 MB/528.3 MB/s67%
BZip2Comp ST3.81 MB/s4.88 MB/s28%
BZip2Comp MT13.5 MB/s19.3 MB/s43%
Bzip2Decomp ST5.93 MB/7.41 MB/s25%
Bzip2Decomp MT22 MB/s29.7 MB/s35%
JPG Comp ST14.7 MP/s19.3 MPs31%
JPG Comp MT60.8 MP/s88.8 MP/s46%
JPG Decomp ST35.5 MP/s43.5 MP/s22%
JPG Decomp MT120.3 MP/s149.6 MP/s24%
PNG Comp ST990.7 MP/s1110 MP/s12%
PNG Comp MT3.65 MP/s4.57 MP/s25%
PNG Decomp ST15.3 MP/s19.1 MP/s25%
PNG Decomp MT53.6 MPs78.8 MP/s47%
Sobel ST49.7 MP/s58.6 MP/s17%
Sobel MT162.5 MP/s221.3 MP/s36%
Lua ST0.975 MB/s1.24 MB/s27%
Lua MT2.59 MB/s2.48 MB/s-4.25%
Dijkstra ST4.05 Mpairs/s5.23 Mpairs/s29%
Dijkstra MT11.5 Mpairs/s17.1 Mpairs/s48%

GeekBench's integer benchmarks paint a similar picture - if we disregard the huge boost to the cryptography scores we see an average advantage of 31% for the Exynos 5433's A57 cores, or 29% when we normalize for clock speeds.

There's not much to say here - the IPC improvements on the A57 seem to bring an average of 20-30% improvement on a per-clock basis. The pure integer benchmarks shouldn't change too much with AArch64 or A57, as most advantages of the chip are in FP workloads with the wider FP units.

GeekBench 3 - Floating Point Performance
 A15 (ARMv7)A57 (AArch32)% Advantage
BlackScholes ST3.78 Mnodes/s4.37 Mnodes/s15%
BlackScholes MT14.8 Mnodes/s20.4 Mnodes/s37%
Mandelbrot ST1.02 GFLOPS1.14 GFLOPS11%
Mandelbrot MT4.01 GFLOPS5.09 GFLOPS27%
Sharpen Filter ST810.7 MFLOPS1030 MFLOPS27%
Sharpen Filter MT2.99 GFLOPS4.31 GFLOPS44%
Blur Filter ST0.93 GFLOPS1.27 GFLOPS36%
Blur Filter MT3.61 GFLOPS5.03 GFLOPS39%
SGEMM ST1.88 GFLOPS1.81 GFLOPS-3.72%
SGEMM MT6.41 GFLOPS6.1 GFLOPS-4.84%
DGEMM ST0.626 GFLOPS0.573 GFLOPS-8.41%
DGEMM MT2.2 GFLOPS2.29 GFLOPS4%
SFFT ST0.948 GFLOPS1.1 GFLOPS16%
SFFT MT3.65 GFLOPS4.56 GFLOPS25%
DFFT ST678 MFLOPS1.02 MFLOPS50%
DFFT MT2.18 GFLOPS3.46 GFLOPS58%
N-Body ST322.3 Kpairs/s370.4 Kpairs/s15%
N-Body MT1.24 Mpairs/s1.44 Mpairs/s16%
Ray Trace ST1.51 MP/s1.7 MP/s12%
Ray Trace MT5.79 MP/s6.65 MP/s15%

GeekBench's FP improvements are on the conservative side, where we see a rough 21% overall improvement.
It's here that things might improve in the future as AArch64 compiled tasks should allow for more effective register usage.

Power Consumption

Similar to the little core power consumption measurements, we repeat the exercise of trying to isolate power to a single cluster. The problem of threads spilling over to the other cluster is still prevalent: instead of blocking the GTS parameters to force threads to remain on the little cores, I force the mechanism to avoid migrating down processes and make them stay on the big cores. The issue here is that although most processes stay on the big cluster, there's still inevitable activity on the little cores. This is due to core0 of the system being treated as a special case and many kernel-related tasks being forcefully scheduled on that specific core. I expect this overhead to be in the 10-100mW range.

As opposed to the little cluster which ran at the same frequency in both our comparison SoCs, the A57 cluster on the 5433 runs 100MHz faster than the 5430's A15 cores. As such, I did a power measurement on both the stock 1.9GHz that the phone ships with, and at a limited 1.8GHz to be able to have a comparison figure at the same frequency as the A15 cores in the 5430.

The difference in power is quite staggering. What jumps out immediately as out of the norm is the relatively low power consumption the 5430 is able to achieve. In the past we've seen A15 cores consume well north of 1.5W per core, something I've verified in the Exynos 5410 and Kirin 920. The combination of r3 A15 silicon IP and 20nm in the 5430 seems to have dramatically lowered the power consumption of the A15 to levels comparable to Qualcomm's Krait cores. It seems Samsung has gained a lot of experience with the A15 over the years and fed this back into the 5430, resulting in basically twice the power efficiency over past SoCs such as the 5420. But while the A15 numbers are interesting, we're here to check out the A57.

Alas, it doesn't look as good for the new cores. On a pure per-core basis it seems the A57 is about twice as power hungry. This is a significant figure that is quite worrisome. Investigating the voltages of the big cores on both SoCs reveals that the 5433 increases the voltage on the A57 cores by a rough 75mV across most used frequencies. We're averaging 1175mV at 1800MHz across the various speed bins, and reaching up to 1262mV on the worst speed bins at 1.9GHz. This results in some very high figures for the stock speeds of the 5433, being able to momentarily reach over 7W load on the big cluster. The thermal management doesn't allow this state to be sustained for more than 10 seconds when loading four threads, with the frequencies quickly dropping to the first thermal throttling levels.

I repeated my analysis over all frequency states and core count combinations, providing me with a detailed power curve for the A57 CPUs in the Exynos 5433:

I was able to catch the effect of temperature over power consumption in these tests: over the period of a minute power would continuously increase as the silicon heated up. After only 10 seconds of load the consumption would increase by 5%. Frequencies above 1600MHz especially suffer from this effect, as static power leakage seems to be increasing a whole lot on these states, so I wasn't able to account for the measured power to dynamic leakage alone. This might also be an effect of overhead elsewhere on the system; I imagine that the PMIC is working very hard at these kind of loads.

In any case, this points to the importance of having responsive and smart power management mechanisms that take into consideration not only frequency and voltage in their power modeling but also temperature. Luckily the Exynos 5433's power management is one of the best out there, having a full power arbitration system in place that does not allow the higher power figures for longer than short little bursts. Most work done should find itself in the lower frequencies, enabling much higher efficiency levels. The battery savings mode of the phone also caps the frequency of the A57 cores at 1400MHz, allowing for a very reasonable maximum 3.3W cap on the big cores while still providing excellent performance.

Again, we're able to get some rough estimates on the per-core power consumption figures for both SoCs running at equal frequency:

Estimated Big Per-Core Power Consumption @ 1.8GHz
 A15 (Exynos 5430)A57 (Exynos 5433)Increase
Big per-core load~750mW~1480mW97%

The A57 cores being an ARMv8 evolution of the A15 with only minor architectural additions beyond that shouldn't be that much bigger than the A15. Indeed we can see this in the block sizes of the SoCs:

A15 vs A57 Block Sizes
 A15 (Exynos 5430)A57 (Exynos 5433)Scaling Factor
Core1.67mm²2.05mm²1.22
Cluster14.50mm²15.10mm²1.04

We're seeing a much more conservative 1.22x scaling, less than the 1.75x found in the A53 cores. The total cluster size increases only by 4%, pointing out that Samsung made some improvements elsewhere on the cluster.

I repeat the XML test of BaseMark OS II to be able to give an impression of real-world performance/power.

BaseMark OS II - XML Parsing Energy Efficiency
 PerformanceEnergyPerformance/Energy
factor
A15 (Exynos 5430)99.69MB/s19.75mWh~5.04
A57 (Exynos 5433)155.29MB/s27.72mWh~5.60

As opposed to the A53 which lost perf/W over the A7, the A57 actually gains efficiency over the A15. This shouldn't make sense given that the active power of the A57 seems to be roughly double that of the A15 on these specific SoCs. We go back to why I choose this particular test: the nature of the the XML test allows for cores to exercise their idle power states in a realistic manner.

ARM claims the A53 and A57 bring better and more fine-grained pipeline power- and clock-gating as part of the architectures, and my suspicion is that this is what we are seeing here in practice. I don't have other fitting devices at my disposal to be able to make an apples-to-apples comparison, but it seems that in daily real-world usages the A57 is able to outperform the A15 quite a bit in terms of power efficiency.

While we investigated both the A53 and A57 separately as individual clusters in the SoC, I was also very interested to see how the XML test would fare in the default big.LITTLE mode. We repeat the power measurement but leave the GTS parameters as configured by Samsung.

BaseMark OS II - XML Parsing Energy Efficiency
 PerformanceEnergyPerformance/Energy
factor
A7 (Exynos 5430)77.93MB/s10.56mWh~7.38
A15 (Exynos 5430)99.69MB/s19.75mWh~5.04
A7+A15 (Exynos 5430)76.53MB/s12.85mWh~5.95
 
A53 (Exynos 5433)109.36MB/s17.11mWh~6.39
A57 (Exynos 5433)155.29MB/s27.72mWh~5.60
A53+A57 (Exynos 5433)96.85MB/s18.77mWh~5.15

What we're seeing here in our particular XML power efficiency test is that the whole mechanism falls flat on its face. For both the Exynos 5430 and Exynos 5433, the practical performance drops below what even the little cores would have been able to achieve on their own. While on the Exynos 5430 we lose efficiency over the little cores, it's still an improvement over letting the A15 cores handle the load alone. However when looking at the numbers of the 5433, things look bad. Running this benchmark on the default A53+A57 GTS mechanism not only is worse than simply running it on the little A53 cores, but it's also worse than letting the benchmark stay on the A57 cores. I could see the load jumping around between big and little cluster in a frequent manner.

As to why this performance degradation happens is not certain. I've been told that this may have been related to the way the benchmark was programmed and possible data-dependencies coming into play that cause an unusal overhead on the CCI. This is an interesting thought as this would be one particular case where ARM's newly announced CCI-500 interconnect would help out with the introduction of a snoop filter on the interconnect. I certainly did not aim for such a result in my search for a test-case micro-benchmark, but it raises the question of how many other real-world situations run into such bottlenecks.

It's also important to note that these numbers cannot be used as an argument against the efficiency of HMP itself. Running the big cores only is not a realistic use-case as there are vast power efficiency disadvantages in idle scenarios. A proper comparison would be if I had run the SoCs in a cluster-migration scheme and used that as a comparison point against the HMP operation. Sadly this is not a technically viable scenario for me to reproduce due to software limitations in the firmware. As such, these numbers are more of a representation of the A57 cores' efficiency rather than that of big.LITTLE and HMP as a whole.

I've had some concerns raised to me for the fact that I don't have a fixed-load benchmark test in regards to power efficiency, so I went ahead and ran a power efficiency test on some of our web-browser tests. As a reminder, these tests may represent a real-world workload in terms of computations that are being done, but they don't represent a proper scaling load as we would find in real-world usage. For that, I have included some benchmarks in the battery section of the review which we'll investigate later in the article.

Stock Browser - Web Tests Power Efficiency
 ScoreMean PowerScore/Power
factor
Kraken (E5430)4973ms2.505W~80.27
SunSpider (E5430)449ms2.997W~743.13
Octane (E5430)57622.156W~2.67
 
Kraken (E5433)4160ms3.144W~76.45
SunSpider (E5433)395ms3.458W~732.11
Octane (E5433)74662.924W~2.55

Before discussing the results I'd like to mention that I saw a huge discrepancy between Chrome and the stock browser. In Chrome, which I checked was indeed running the exactly same build version for both the Note 4 and the Alpha, I could see the Alpha consistently outperforming the Note 4 in all the tests. Due to this I deemed Chrome to be extremely unreliable as an apples-to-apples comparison of efficiency and reverted to the stock browser. Here I could see proper performance scaling that we would actually expect from the new core architectures and the clock advantage.

As for the browser efficiency scores themselves, what I see is that the Exynos 5430 again beats the 5433 in power efficiency, although this margin is just 4%. I have to reiterate again that we're dealing with Samsung's sixth generation of A15 SoC, and that's if we don't count consumer TV SoCs which also had A15 cores deployed over 2 years ago. The Exynos 5433 is only Samsung's second (after their unreleased GH7 server SoC) A57 implementation that we're seeing and it is also running at higher voltages on an immature process. I fully expect this gap to reverse in the future as we see vendors gain more experience with the IP and optimize their layouts and implementation.

I've talked in more detail on how currently HMP is implemented by vendors in the Huawei Honor 6 review which can be read here. For those who are still unfamiliar with the subject a quick explanation is that the current mechanism in the kernel that is used for controlling HMP operation is "GTS", or Global Task Scheduling. This mechanism is developed by Linaro, a consortium of ARM vendors which collaborate on the open-source software development side of SoCs. GTS is a Linux kernel modification that enables the scheduler to differentiate between the little and big cores of a system, and migrates tasks depending on their load.

The current mechanism as of late 2014 and found in all commercially available devices determines the decision of migrating a task on a target cluster with help of a geometric load value of a given task, a value that is defined by the kernel scheduler. The window in which the load is averaged depends on the vendor, but we've seen mostly 16ms in the case of Huawei and 32ms in Samsung's implementations. If a scheduler entity (a process) surpasses a certain upper threshold, the system migrates that entity to a fast cluster. Similarly, if it falls below a lower threshold, it is migrated to the slow cluster.

The problem with this approach is that the scheduler is totally unaware of the power consumption of the hardware when doing a migration. As a reminder, this process is done transparently in hardware via help of a cache-coherent bus interconnect such as ARM's CCI-400.

Not only does this cause migration overhead on the CCI due to cache flushes, but it messes with the idle mechanisms of the CPU. When a lower number of high load threads get migrated away from the big cluster and back to it, the cluster is allowed to enter its cluster power collapse state that also shuts down the L2 cache. This turning off and on of the cluster in repeated and high-frequency fashion consumes more power than had the load simply stayed on the big cluster and dealt with the active power penalty of the A57.

Samsung's implementation of GTS worsens this behavior as they employ a boosting mechanism that is tied with the interactive CPU frequency governor and user-space triggers such as switching applications or touchscreen events via the touch-booster mechanism. The boost here temporarily lowers the migrations thresholds to very low levels (15% down, 40% up) to be able to improve device fluidity.

In my opinion this is overkill as trivial tasks such as scrolling through an application like Reddit Sync (with already loaded content) causes the processes to reside on the big cluster. It's possible to disable this behavior via the power savings mode of the device. In my example of the Reddit app there is no discernable loss of fluidity when just scrolling through posts, and the big cores remain idle, making for much lower power consumption. The disadvantage of the power savings mode is that it caps the CPU to 1.4GHz and the GPU to 350MHz. I wish Samsung would provide a two-stage power savings mode where we could control both the boost mechanic and the clock frequency limits separately.

It's been over two years since the first big.LITTLE SoCs were introduced, and even if we ignore the first generations such as the Exynos 5410 and 5420 that were limited to cluster-migration schemes, ARM has had plenty of time to produce a more effective software stack that takes advantage of big.LITTLE. The fact that the Note 4 is using a 14-month old implementation of GTS that is lacking some key improvements in the scheduler in terms of power management demonstrates a grave problem of the big.LITTLE approach, as the software represents the Achilles' Heel of the technology and severely limits its potential.

ARM and Linaro are working on a solution for upstream Linux that has been discussed for well over a year. To this day the power-aware scheduler still hasn't seen a definitive implementation, and when we brought up the issue with ARM they admitted that progress has been slow due to the fundamental changes the solution requires and the many stakeholders it involves. The process has been complicated by the fact that ARM wants to get it right the first time and work in a generic solution that can be carried over to future ARM SoCs, a process that takes longer than implementing a one-off non-upstream solution for a specific product.

Meanwhile ARM is also calling attention to their Intelligent Power Allocation project for upstream Linux. Like the new power-aware scheduling solution, it too has yet to make it upstream. IPA has already seen limited adoption in shipping products in at least Samsung's devices, and ARM believes it can mitigate some of the issues related to the scheduler. While IPA is an excellent thermal management implementation, it cannot serve as a replacement for a proper mechanism that tries to optimize power efficiency in HMP operation, as it has no control over where a task is placed on the CPUs. Also, IPA is only enabled when the SoC passes a certain threshold temperature and any arbitration below that is still left to the kernel scheduler.

In any case, I'd wager we won't be seeing ARM's power aware scheduling solution in devices for well over another 18 months at minimum. Meanwhile, Qualcomm is entering the big.LITTLE ring with its own Snapdragon 810 with a fully working, custom-developed, and power-aware scheduler implementation that alleviates many of the issues I've identified here in this piece.

In the future, I hope Samsung SLSI invests more time and R&D in its software implementations, as being first to market with the hardware and relying on Linaro's basic software stack simply isn't enough if they wish to continue using big.LITTLE as the corner-stone of their SoCs.

ncG1vNJzZmivp6x7orrAp5utnZOde6S7zGiqoaenZIV4fZdoq6GdXaiurr%2FUp55mn5GhrrnFjKemrZ1daXqmxNinpqxloprDqrHWaG0%3D

 Share!