Reasons why Dealing with Data is Difficult

Written by Slingshot Simulations

Published on August 5, 2020

Why is data so difficult to work with? We turn to our developer team at Slingshot for the answers…

Dealing with data on a daily basis, our developer team start us off with some thoughts relating to Big Data in the context of simulation and Digital Twins and some of the challenges and values data carries today.

To Start…

Joe – Tech Strategist

Data is said to be one of the most valuable resources available today, for companies and individuals alike. However, this value is subjective and relies on the data being accurate. Data that might be worth astronomical amounts of money to one company can be useless to another, even if it were free.

Generating data becomes easier every year. However, producing data that is useful for its intended purpose can be much more difficult. Terabytes of unverified data can be less useful than a couple of kilobytes of data that is known to be accurate, because unverified data increases the risk of propagating errors through the system.

Dr Richard Kavanagh – Head of Cloud and Data Dev

Data is all around us and in many different forms. It is being created in ever-increasing volumes and it must be processed in a timely fashion whilst being kept accurate and up to date. It can hold tremendous value and can be used to drive decision making in various ways, including the creation of Digital Twins.

However, data is rarely useful in its raw form. To make data useful, it needs to go through a process of mining the data from its source (extraction), preparing the data so it's usable, for example by cleaning it (transformation), and then loading it into a system where it can be used (loading). Unsurprisingly, the traditional name for this process is Extract, Transform, Load (ETL).
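As a rough illustration of the three stages, here is a minimal Python sketch. The CSV content, site names and table name are made up for the example, and a real pipeline would involve far more validation than the single check shown here.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: pedestrian counts with one blank and one malformed value.
RAW_CSV = """location,pedestrians
Site A,120
Site B,
Site C,abc
Site A,95
"""

def extract(text):
    """Extract: read raw rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: keep only rows whose count parses as a non-negative integer."""
    clean = []
    for row in rows:
        value = (row["pedestrians"] or "").strip()
        if value.isdigit():
            clean.append((row["location"].strip(), int(value)))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts (location TEXT, pedestrians INTEGER)"
    )
    conn.executemany("INSERT INTO counts VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
totals = conn.execute(
    "SELECT location, SUM(pedestrians) FROM counts GROUP BY location ORDER BY location"
).fetchall()
print(totals)
```

Note that the transform step silently drops the two bad rows. Whether to drop, repair or flag such rows is a design decision that depends on what the data is for.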

But this kind of process, whilst familiar to Data Scientists and Technologists alike, is somewhat of a mystery to many across multiple industries, and raises many questions such as:

How was the data collected?
What was it collected for?
How was it originally input into the computer?
What validation has the data undergone?
How big is it? Where is it located? Is there an API?
Is the data complete or partial?
When was it collected? Is the data up to date?
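One way to keep these questions in view is to record the answers alongside each data set. The sketch below is purely illustrative: the `DatasetRecord` class and its field names are hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class DatasetRecord:
    """Hypothetical provenance record answering the questions above."""
    name: str
    collection_method: str        # how was the data collected?
    original_purpose: str         # what was it collected for?
    validated: bool               # has it undergone any validation?
    size_bytes: int               # how big is it?
    location: str                 # where is it located, and is there an API?
    complete: bool                # is it complete or partial?
    collected_on: Optional[date]  # when was it collected (None if unknown)?

    def age_in_days(self, today: date) -> Optional[int]:
        """Days since collection, or None when the collection date is unknown."""
        if self.collected_on is None:
            return None
        return (today - self.collected_on).days

# Made-up example values, for illustration only.
record = DatasetRecord(
    name="example traffic survey",
    collection_method="manual roadside count",
    original_purpose="junction planning",
    validated=False,
    size_bytes=48_000,
    location="shared drive, no API",
    complete=False,
    collected_on=date(2020, 1, 1),
)
print(record.age_in_days(date(2020, 8, 5)))
```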

Before doing deep dives into specific areas of data, it is worth knowing the variety of data sources that are used in general and the qualities each data source carries. The ones we will briefly cover in this blog are:

Map data – from OpenStreetMap, Google Maps, Microsoft Azure Maps
Terrain data – from NASA SRTM, LIDAR, Sentinel
Traffic data – from the UK’s Department for Transport (DfT)
Pedestrian counts
Pollution data
Employment data – from the UK’s Office for National Statistics (Nomis)
Bus route data
Postcodes and addresses

The Place and Perception of Data

Even domain experts may struggle to analyse and verify data if it is not provided in a specific format and/or visualised in a given way.

If you are tracking the movement of a fleet of ships over time, a spreadsheet of geospatial coordinates may not be meaningful until the data is overlaid onto a map. On the other hand, visualising passengers moving around a public transport system may not be useful if the end goal is to verify the number of passengers passing through a ticket gate.

The optimal approach is often the amalgamation of several different visualisation techniques, each adding its own layer of relevance to the data.

Map data includes sources such as OpenStreetMap, a collaborative effort to build an accurate and up-to-date map of the world from the contributions of many different individuals across the globe. This brings its own challenges: inconsistencies in how well everyone applies the rules on how data should be gathered, how those rules might change over time, and how the rules and the data structure itself have to be generic enough to cope with the various countries in which the data is collected and their different rules of the road.

Terrain data raises different questions about how it was collected: was it from a Shuttle mission, as with SRTM, or from a plane flying over the rooftops?

What happens when there are reflections or missing spots in the data, and how do you cope with areas where nothing is known? This issue of missing data can also be seen in other data sets such as addresses and postcodes. What happens when there is a new road or a new house? Is there an alternative, more up-to-date data source that can be used instead? How do you merge different data sources? The SRTM data, for example, has a resolution of around 25m, whereas LIDAR data may be much finer, e.g. 0.5m. Address data can be formatted slightly differently from one source to the next, so how do you know which addresses are the same and which are just very similar?
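To make the address question concrete, the sketch below normalises two address strings and scores how similar they are using Python's standard `difflib` module. The abbreviation list and the example addresses are invented for illustration, and this is only one of many possible matching approaches.

```python
import re
from difflib import SequenceMatcher

# Illustrative abbreviations only; real address data needs a much fuller list.
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}

def normalise(address):
    """Lower-case, strip punctuation and expand a few common abbreviations."""
    words = re.sub(r"[^a-z0-9 ]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

def similarity(a, b):
    """Return a score between 0 and 1 for two addresses after normalisation."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

# Same address written two ways, versus two neighbouring houses.
same = similarity("12 High St.", "12 high street")
near = similarity("12 High Street", "14 High Street")
print(same, near)
```

The first pair normalises to identical strings, while the second pair still scores highly despite being two different houses. A similarity score alone is therefore not enough to decide that two records refer to the same place.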

Alex Trout – Application Engineer

Everything happens somewhere. Therefore, all data has the potential to be linked to a physical location and given a spatial aspect. Utilising this spatial aspect of data is often the key to using it in a significant way, making strong links between events and places, or spotting patterns that can be explained only by geographic features.

Knowing where something is more likely to happen gives a better opportunity to manage it effectively, targeting funding or resources for example, and this simple concept can be applied to everything from natural disasters to social issues. Considering geographic information when studying data was introduced in the mid-1800s to help solve the cholera outbreak in London and has since been applied to almost all areas of science, planning, and business.

Visual representations of geographic data sets can convey information much more succinctly than other formats and can facilitate quick, well-informed decision making. This presentation of geographic data can then be extended to include a temporal aspect, either in a series of data visualisations or in animated models that display the change in data over both space and time.

Future predictions from modelling outcomes are often included for the purposes of studying the effects of planning decisions or when considering the prevention or mitigation of future events. The potential of using geographic information effectively only grows stronger as the availability of data increases. As the tech surrounding simulation and modelling progresses to generate more accurate predictions, the visualisation of geographic data sets will likely be the cornerstone of good decision making and effective planning.

Context and Complexity of Data

The way the data is collected is also very important.

For example, traffic data relies quite often upon traffic surveys of a particular area, which might be as simple as someone sitting at the side of the road and counting cars. Assuming they did this accurately, numerous questions arise about when the survey was performed…

Was it at a weekend, or was some other event on that means the data collected is not representative of normal traffic flows? Is there a seasonal effect, such as pedestrian counts taken during Christmas or the summer holidays? There may be something more subtle too, such as whether it is term time, especially when the count is done near a school or university.
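Some of these checks can be automated once the survey date is known. The following sketch flags a few of the factors mentioned above. The term dates are hypothetical placeholders, and using the month of December as a stand-in for Christmas footfall is a deliberately crude assumption.

```python
from datetime import date

# Hypothetical term dates; real ones vary by school and local authority.
TERMS = [
    (date(2020, 1, 6), date(2020, 4, 3)),
    (date(2020, 9, 7), date(2020, 12, 18)),
]

def survey_context(day):
    """Describe factors that may make a count on this day unrepresentative."""
    return {
        "weekend": day.weekday() >= 5,  # Monday is 0, so Saturday is 5 and Sunday is 6
        "term_time": any(start <= day <= end for start, end in TERMS),
        "december": day.month == 12,    # crude proxy for Christmas footfall
    }

print(survey_context(date(2020, 8, 5)))
print(survey_context(date(2020, 12, 12)))
```

Local events such as roadworks, matches or festivals cannot be derived from the date alone and would need a separate data source.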

Imogen Hetherington – Head of Product

In previous blog posts, data has been described as “food” for a digital twin. The quality and quantity of the data that a digital twin consumes will determine how well it replicates its real-world counterpart, and how accurate its predictions will be. The amount of data needed grows steeply with the complexity of a model.

Take, for example, a digital twin of a single car driving along a route in a road system. We can do this fairly accurately with minimal data: the model just needs to know what the route is and how fast the car is going. The car’s behaviour can be modelled on these two pieces of information.
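A minimal sketch of that single-car model might look like the following. It assumes a constant speed and a route given as straight-line segments between waypoints in metres; both are simplifications chosen for the example, and the function name is our own.

```python
def position_on_route(route, speed_mps, elapsed_s):
    """Return the (x, y) position of a car after elapsed_s seconds.

    route is a list of (x, y) waypoints in metres; speed is constant.
    """
    remaining = speed_mps * elapsed_s  # distance still to travel along the route
    for (x0, y0), (x1, y1) in zip(route, route[1:]):
        length = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        if length > 0 and remaining <= length:
            fraction = remaining / length
            return (x0 + fraction * (x1 - x0), y0 + fraction * (y1 - y0))
        remaining -= length
    return route[-1]  # the car has reached the end of the route

# 100m east, then 50m north; 12 seconds at 10 m/s covers 120m.
route = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0)]
x, y = position_on_route(route, speed_mps=10.0, elapsed_s=12.0)
print(x, y)
```

Everything the model needs fits in two inputs. As soon as a second car is added, the function would also need to know about the other vehicle, which is where the data requirements start to grow.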

If we increased the number of cars and routes traversing the system, we would have to account for how the cars interact with each other. This would require more data about the road system.

At the level of complexity of an entire city centre or road network, there are far more questions to answer. Where are people going? What transport method are they using? What are the most popular routes? Where are the traffic lights and how often do they change? Where are the pedestrian crossings? What cars are people driving and how much pollution do they produce?

Some of these questions can be answered with a single data set, but others are more complex. Human behaviour, for example, is often unpredictable, so a digital twin of a pedestrian system might need masses of data passed through an algorithm to obtain a true-to-life model.

If we consider pollution and emissions, these vehicles will (or should) have undergone testing to establish their emissions bands with respect to the likes of NOx. These test results are often represented in Simulink models of powertrain systems, which can in turn be integrated with full vehicle simulations. These simulations, or digital twins, can therefore be used to predict the pollution from vehicles over their lifetime. This can be coupled with sensors around roads, or with satellite data, to give a measure of how much pollution is in the environment, particularly near roads.

There are of course other data sources that have not been intended for consumption by digital twins but would augment what’s already been modelled.

Regional employment data is one example. Its original aim is to show performance statistics for councils and local authority regions. It can, however, be used to infer things within the digital twin, such as journeys made to and from work and the prosperity of the region.

Buses and their routing data are examples of data being repurposed for a digital twin. The routes can be placed on a map and data regarding passenger numbers can be obtained. This data usually comes from bus companies, whose main interest is counting the number of people getting on the bus, yet getting off is an equally important phase when modelling a journey. Ticket sales or counts of bus passes used are a good indicator of the numbers getting on, but there is no such mechanism for passengers leaving the bus and adding to regional footfall. This idea of turning numbers into routes extends to traffic counts and pedestrian surveys as well. A traffic count shows how many cars went past a particular point and is very useful for that point, but it does not show where they were coming from or going to. This is an aspect of missing data, or more accurately of repurposed data, which may offer only partial insights into what the digital twin is trying to solve.

Once data has been obtained, there need to be measures in place to verify it, ensuring it is both accurate and suitable for its intended purpose. The process of data verification can be extremely difficult and time consuming, especially in black box systems, and will often require a domain expert alongside ground truth data to even begin to analyse the results.

The Issue of Scale

Dr David McKee – CEO/CTO

Summary

Underpinning all these concepts and challenges is the issue of scale. How many gigabytes of data are we expecting to collect, or could it be in the region of terabytes or even petabytes per second? At Slingshot, in a recent project modelling an area of 100 sq. km, we have been generating in the region of 250GB in the space of 10 minutes, and we’re working on several projects that are significantly larger than that. At the larger end we have the likes of CERN, who according to their most recent reports can generate in the region of 5 terabytes per second.
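To put those two figures side by side, the short calculation below converts them to a common unit. It assumes decimal units (1TB = 1,000GB) and, for the daily figure, that the 10-minute rate is sustained around the clock, which is an assumption made here for illustration rather than a measured result.

```python
def rate_gb_per_second(gigabytes, seconds):
    """Average data rate in gigabytes per second."""
    return gigabytes / seconds

slingshot = rate_gb_per_second(250, 10 * 60)    # 250GB in 10 minutes
cern = 5 * 1000                                 # 5 TB/s expressed in GB/s
per_day_tb = slingshot * 24 * 60 * 60 / 1000    # if sustained for 24 hours

print(f"{slingshot:.2f} GB/s, about {per_day_tb:.0f} TB per day if sustained")
print(f"The CERN figure is about {cern / slingshot:,.0f} times higher")
```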

At the Digital Twin Consortium we are working closely with some of the world’s leading tech and simulation companies, including Microsoft, DELL, and ANSYS to explore the truly Big Data platforms that need to be architected to support these digital twins from perspectives of data storage, compute, communication and transmission, as well as security and trustworthiness.