How to make an ‘uncheatable’ benchmark

Is there such a thing as an ‘uncheatable’ benchmark? Cheating is nothing new in benchmarking, as seen with SPEC (for PCs) or Dhrystone (for embedded processors) a decade ago. More recently, the benchmark wars have resurfaced with news that certain smartphone manufacturers, including Samsung and HTC, have been rigging results on particular top-end models. It appears that they detect when a benchmark (such as GFXBench, Basemark, or AnTuTu) is running, and then raise the chipset frequency and relax the thermal constraints to produce higher scores; a sketch of this pattern follows below.
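To make the trick concrete, here is a minimal sketch in Kotlin of how such detection might work. This is an assumption-laden illustration, not any vendor’s actual code: the GFXBench and Basemark package identifiers and the sysfs paths are illustrative, and real implementations live in vendor firmware rather than in app-level code like this.

```kotlin
import java.io.File

// Package names a vendor's firmware might special-case. The GFXBench and
// Basemark identifiers here are assumptions; AnTuTu's is the published one.
val benchmarkPackages = setOf(
    "com.antutu.ABenchMark",          // AnTuTu
    "com.glbenchmark.glbenchmark27",  // GFXBench (assumed identifier)
    "com.rightware.BasemarkOSII"      // Basemark (assumed identifier)
)

// Reads the highest frequency the CPU supports.
fun readMaxFreq(): String =
    File("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq")
        .readText().trim()

// Called (hypothetically) whenever the foreground app changes.
fun onForegroundAppChanged(packageName: String) {
    if (packageName in benchmarkPackages) {
        // Pin the minimum CPU frequency to the maximum, so the governor
        // never scales down while the benchmark runs. This requires root
        // or firmware-level access; it is shown only to illustrate the trick.
        File("/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq")
            .writeText(readMaxFreq())
    }
}
```

The point of the sketch is that a fixed, publicly identifiable workload is trivially easy to special-case: a simple package-name lookup is enough to switch a device into a mode no real user ever sees.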

In response, benchmark designers are starting to take action to protect their methodologies by making public statements and, in some cases, even delisting offending devices (Futuremark is a recent example). To better understand why manufacturers would risk doing this, we need to take a step back and look at the dynamics that have shaped the chipsets running today’s mobile devices over recent years.

Why we need good device benchmarks

Chip making is a hard and expensive undertaking, typically costing millions of dollars per revision. To be worth this investment, a chip must meet both the performance and battery-life requirements of real-world applications.

How firms meet the market demand today

Firms usually follow a competitor’s positioning and listen to what their marketing teams say. This is typically followed by an evaluation of the gap in hardware capabilities (performance and power), budgeting for development and testing, and finally, planning for market entry. Once all this falls into place, towards the end of the design cycle, performance testing begins, using popular benchmarks such as (but not limited to) SPEC (PC), EEMBC (embedded processors), and GFXBench, Basemark and Browsermark (mobile). The biggest pitfall here is that millions of dollars are spent on hardware design that is largely dictated by market requirements for performance (which is good) but calibrated against synthetic benchmarks (which is bad). There is too much focus on a competitor’s benchmark scores instead of on market requirements.

The semiconductor market is rapidly evolving, fuelled by the growth of the mobile ecosystem. It is time we paused and developed a benchmark that helps firms deliver products aligned with what customers actually want.

There is a need for a new benchmark


The growth of Android

Android is growing faster than iOS or any other mobile OS. This growth has had the combined support of several chip makers, OEMs and IP vendors, allowing Android to proliferate throughout many diverse markets. Every revision of the OS expects more performance from the underlying chipset. This was true in the PC era as well, but it is now happening at a faster pace. With an open model, Android has enabled hardware vendors to participate in the ecosystem democratically, which has led to further innovation, with higher-performance and more efficient Android chipsets coming to market more frequently.

The rise of Asian chip makers

Companies such as MediaTek, MStar and Spreadtrum have dramatically changed the mobile chipset landscape. They have sharply reduced the cost and time of chip design and manufacturing, and created new business models and markets for low-end Android phones. Besides dominating the low-end Android device market in China, they are now competing globally with the big players, such as Qualcomm. Their new chipsets pack more cores and better performance; they are ready to take on highly demanding real-world applications such as 3D games, HD video playback, computational photography and augmented reality. The result is that high-performance chipsets, capable of meeting the demands of next-generation applications, are entering the semiconductor market faster than ever before.

The mobile application ecosystem

Thanks to the app stores from both Apple and Google and to cross-platform SDKs, even individuals with basic programming skills are turning into developers. And the hottest application category is gaming. Over 32% of time spent on a mobile device is dedicated to gaming, and gaming now accounts for more than 70% of app revenues. The effect of this trend is that multiple new games enter the marketplace every day. However, no two games are the same: there is a huge difference in performance requirements between a simple racing game and the highly demanding AAA titles that are now coming to mobile. Moreover, the top games in the market change regularly, which shows that there is no dominant game that consistently retains the top spot. Therefore, there cannot be a ‘fixed’ yet ‘representative’ benchmark, because the performance requirements of these games are constantly changing.

With the market changing so dramatically over the last few years, it is no wonder that the likes of Samsung and HTC have resorted to fixing their benchmark scores to stay on top while newcomers try to depose them. To this end, the term ‘uncheatable’ has now crept into the benchmarking lexicon.

The current benchmarking methodology is broken

[Images: GFX test content; GameBench test content]

Semiconductor firms continue to treat a single synthetic application, or a suite of them, as representative of real-world applications. There is an inherent flaw in this assumption. As explained above, a synthetic benchmark can never be truly representative of a real-world application, especially in this rapidly evolving mobile market. In addition, benchmark code is available to the whole ecosystem (from OEMs to chip makers and beyond), making manipulation and cheating (such as overclocking, modifying the underlying architecture, or designing hardware for improved benchmark performance) inevitable. With new Android OS versions arriving every 6-9 months, new games built on different technologies and game engines entering the app stores every day, and new chipsets from multiple vendors with varied IP (CPU/GPU) released every 6-9 months, using a synthetic benchmark to represent such a rich and complex mobile ecosystem is flawed.

A true hardware benchmark should track the market, follow the applications, chipsets and OS, and highlight bottlenecks and gaps in the performance and capabilities of chipsets. A benchmark should expose the chipsets that have the best out-of-the-box performance for applications such as games, browsers and the OS itself. This helps the user and developer communities. The best-designed chips will always rank high in truly representative tests, and these chipsets will benefit the entire developer and user ecosystem. Vendors who lure developers into developing only for their chipsets will end up fragmenting the Android market further. This doesn’t happen in the iOS domain, and that is one of its greatest strengths and one of the reasons for its success to date; it is something the Android ecosystem should aspire to.

What is different about the GameBench methodology for benchmarking?

We follow the current market for the latest games and devices. The most representative benchmark is a suite of real applications (games) that drive the mobile market in terms of volume, revenue and user interest. Why rely on a synthetic application plagued with pitfalls when the market can provide real answers? This is how we do it:

  • We pick the top devices from the market and test them out of the box. We do not root the devices, overclock them or change any settings a regular user wouldn’t. We intend to preserve the state of each device as calibrated by the OEM. Our device list changes on a regular basis, and we will publish rankings quarterly based on the best devices available at that point.

  • We pick the top (and most popular) games from the Google Play store, across multiple genres such as first-person shooter (FPS), racing and running games. These genres stress the performance capabilities and battery life of most devices. The games will change on a regular basis, and we plan to publish rankings of the top-performing games in the near future.

  • We download the games onto the devices and launch them through our GameBench app, which runs non-invasively in the background. The devices are tested at our HQ (Bristol, UK) by several game testers (amateur, intermediate and power gamers). Testers are also rotated on a regular basis.

  • We collect several metrics on the device under controlled and repeatable conditions and compute the scores within our app; a sketch of this kind of aggregation follows below.
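As an illustration of that last step, here is a minimal Kotlin sketch of how per-session metrics might be aggregated. The Sample type, the metric names and the 20% stability threshold are assumptions for the sake of illustration, not GameBench’s published formula.

```kotlin
import kotlin.math.abs

// One metric sample taken while a real game is running.
data class Sample(val fps: Double)

// Median FPS is robust to brief loading-screen stalls.
fun medianFps(samples: List<Sample>): Double {
    require(samples.isNotEmpty()) { "need at least one sample" }
    val sorted = samples.map { it.fps }.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 0) (sorted[mid - 1] + sorted[mid]) / 2
           else sorted[mid]
}

// FPS stability: the share of samples within 20% of the median
// (the 20% threshold is an assumed value for illustration).
fun fpsStability(samples: List<Sample>): Double {
    val median = medianFps(samples)
    return samples.count { abs(it.fps - median) <= 0.2 * median }
        .toDouble() / samples.size
}

fun main() {
    // A toy session: mostly smooth ~60 fps with one loading-screen dip.
    val session = listOf(58.0, 60.0, 59.5, 31.0, 60.0).map(::Sample)
    println("median FPS    = ${medianFps(session)}")
    println("fps stability = ${fpsStability(session)}")
}
```

A median-plus-stability pairing is one plausible design choice here: a mean would let a single loading-screen stall drag the whole score down, whereas the median captures typical gameplay smoothness and the stability figure separately penalises erratic frame rates.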

The GameBench methodology is ‘unbiased’ and ‘representative’

[Image: GameBench scores]

There is no room for manipulation of a game, of a device, or of how the game is played on the device. Our objective is to follow the market closely and to regularly publish the list of the best devices and games available to everyone in the market.

How does this benefit the mobile ecosystem? Our methodology highlights the true winners and exposes the cheaters. In addition, we will provide detailed reports and feedback on why a score is low or high, and showcase a few case studies.

Our primary objective is to provide the mobile gaming ecosystem (developers and chip makers) with the Tools (mobile apps), Insights (detailed reports) and Knowledge (GameBench rankings, scores and comparisons) to maximize game performance and battery life on Android devices. We are happy to share other collateral as we go along.

What others are saying about us: