Complexity of Artificial Intelligence (AI) Benchmarking


IQ Testing for AI?

How do you benchmark your Intelligence Quotient (IQ)? You use a variety of metric types and average the scores, like smearing butter over bread. Why so many metrics? Because an individual is complex and intelligence can mean many things. AI benchmarks are analogous but can be even more complex. Why? Because an AI developed in one environment may be deployed across diverse platforms, where hardware, firmware, dedicated AI processors and physical architectures all influence the benchmark; even the layout and cooling of the hardware will affect the benchmarking of an AI system. And that assumes an isolated AI system: when cloud or remote AI computing is used in Software as a Service (SaaS) deployments, where lightweight client-side platforms are desirable, connectivity and lag become influencing factors too. Then there is the design of the neural network itself, the number of layers of decision making and the time spent training the AI; these compound the complexity of benchmarking and make a concise definition a problem in itself.

Breaking it down

Before considering all the other factors, let's take a look at the neural network. A neural network can take dozens of different structures, and careful selection is needed because one shape will process decisions more effectively than another depending on the use case. The structure is fundamentally a set of nodes, whose weighted values are optimised, plus the connections between nodes in different layers. Deep learning becomes deeper as more layers of nodes are added to the neural network. One way of thinking about this is to relate layers to the number of moves a chess player thinks ahead. For instance, if a chess player thinks 'n' moves ahead, yet the opponent thinks n+1 or more ahead, then the latter can effectively out-think the former. However, there is an extreme computational cost, because every variation must be assessed non-linearly as data passes from one side of the neural net to the other. This is normally why AI machine learning (ML) is conducted server side.
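The node-and-weights structure described above can be sketched in a few lines. This is a minimal illustration only; the layer sizes, NumPy usage and class name are my own assumptions, not anything from a specific system:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

class FeedForwardNet:
    """Minimal fully connected network: each layer is a weight matrix
    connecting every node in one layer to every node in the next."""

    def __init__(self, layer_sizes):
        # One weight matrix and one bias vector per pair of adjacent layers.
        self.weights = [rng.normal(0, 0.1, (m, n))
                        for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
        self.biases = [np.zeros(n) for n in layer_sizes[1:]]

    def forward(self, x):
        # A forward pass crosses the net one layer at a time; every layer
        # added ("thinking one move further ahead") adds another matrix
        # multiply, which is where the computational cost compounds.
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            x = relu(x @ w + b)
        return x @ self.weights[-1] + self.biases[-1]

# Illustrative shape: 784 inputs, two hidden layers, 2 output classes.
net = FeedForwardNet([784, 128, 64, 2])
out = net.forward(rng.normal(size=784))
print(out.shape)
```

Adding entries to the layer-size list deepens the network, and the cost of each forward pass grows with the product of adjacent layer widths, which is why shape selection matters.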

So, assuming you pick an optimised neural network shape, the right variables to compute and the number of layers needed to filter those variables, there are still a few other things to consider.

The Learning Process

Let’s say you want to identify whether an image fed into a program shows a shoe or a sock. One layer may be interested in the different profiles of the objects, another in their colour. Great, but how does the neural network learn to filter this data correctly? We feed it images that have been labelled using a database of content. The program first presents each image as an array of filtered grayscale values, and whenever the neural network gets a picture wrong it readjusts its node weightings (and likewise for any other layer filters). The same data is passed through the neural network again and again until the error calculated for the latest iteration falls below an accepted value, at which point the machine learning is stopped.
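That loop, feed labelled data, measure the error, readjust weightings, repeat until the error drops below an accepted value, can be shown with a toy single-layer classifier. The synthetic data, learning rate and error threshold here are all illustrative assumptions, not the article's actual system:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy labelled dataset: 64-value "grayscale" feature vectors,
# label 0 = sock, 1 = shoe (synthetic, linearly separable).
X = rng.normal(size=(200, 64))
true_w = rng.normal(size=64)
y = (X @ true_w > 0).astype(float)

# A single layer of weighted nodes standing in for one filter layer.
w = np.zeros(64)
b = 0.0
lr = 0.1
target_error = 0.05  # the "accepted value" at which learning stops

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10_000):
    p = sigmoid(X @ w + b)            # forward pass: probability of "shoe"
    error = np.mean((p > 0.5) != y)   # fraction of images classified wrongly
    if error < target_error:          # stop once error is below the accepted value
        break
    grad = X.T @ (p - y) / len(y)     # readjust node weightings after wrong answers
    w -= lr * grad
    b -= lr * np.mean(p - y)

print(f"stopped after {epoch} epochs, error {error:.3f}")
```

The same data passes through the model repeatedly, and training halts as soon as the computed error falls under the threshold, exactly the stopping behaviour discussed next.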

The first question everyone asks is: why stop? Just as with any simulation, there is a theoretical limit due to approximation, which in turn relates to the lack of further variables to assess. Interestingly, this means there is a probability that the AI will at some point give the wrong answer, much as a person might drive through a red light they thought was green. Taking this further: if a 'self-driving' car is permitted on the road and it kills someone due to this error (which may be lower than calculated human error), then in terms of tort law it is the driver of the car who is liable, not the software or product manufacturer; this raises the question of whether we will ever see true self-driving cars, a topic for another day.

Other Factors

AI benchmarking can be difficult depending on the platforms the AI operates on. Another challenge is connectivity, as lag and connection issues will affect the benchmarking of online applications. Offline applications avoid these problems, but they do not (effectively) allow for real-time training. In most applications this is not too vexing, but for self-learning robots designed for emergency or rescue work it is problematic. For example, many rescue robots currently do not use Wi-Fi for connectivity and instead use an equally problematic hardwired link between operator and robot. If an AI-based self-learning robot could function without the wire, it could move more quickly to all locations within a structure during, say, nuclear reactor emergency work.

Power Consumption

One AI benchmarking process is to assess the power consumption of a neural network hosted on a defined system against others. While this does not take into account the error of the decision-making process or connectivity, it is one way of assessing performance in relation to reaching a solution to a given problem. Lower energy consumption would suggest a more elegant neural network with better-optimised weighting values. While everything could be poked and prodded, much like the conjecture surrounding benchmarking of computer hardware performance and its associated attributes, it is a useful AI benchmark, particularly for client-side platforms that use lightweight devices.
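Direct power measurement needs external meters or hardware counters, but the comparison itself can be sketched with wall-clock time per inference as a proxy: multiply time per query by the measured average draw in watts to estimate joules per query. The network shapes and run counts below are illustrative assumptions:

```python
import time
import numpy as np

rng = np.random.default_rng(2)

def make_net(layer_sizes):
    """Random weight matrices standing in for a trained network."""
    return [rng.normal(size=(m, n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def infer(net, x):
    # Plain forward pass with ReLU activations.
    for w in net:
        x = np.maximum(0.0, x @ w)
    return x

def benchmark(net, runs=200):
    """Average wall-clock seconds per inference on this system.
    Energy estimate: result * measured average draw (W) = joules/query."""
    x = rng.normal(size=784)
    start = time.perf_counter()
    for _ in range(runs):
        infer(net, x)
    return (time.perf_counter() - start) / runs

lean = make_net([784, 64, 10])            # leaner candidate network
heavy = make_net([784, 1024, 1024, 10])   # heavier candidate network
print(f"lean:  {benchmark(lean) * 1e6:.1f} us/inference")
print(f"heavy: {benchmark(heavy) * 1e6:.1f} us/inference")
```

On the same defined system, the leaner network finishes each query sooner and so, at comparable power draw, spends less energy per solution, which is the kind of comparison this benchmark makes.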