How to simulate big datasets in Python using parallel computing – Part I : Find rate of a time-dependent parameter

Posted by on May 19, 2020 in Science & Technology

When you start a new role as a data scientist in an SME, one of the challenges faced is the existence of a good quality dataset to work with. On several occasions, it could be a while before such data becomes available. This was the reason that made me take an existing, low-quantity dataset and simulate millions of time-series observations using Python in a meaningful way (more about that later). I have broken down the simulation steps into 3 posts –

  1. Dataset and finding the rate of change (this post)
  2. Generating samples of rates using parallel computing
  3. Simulating new observations using parallel computing

Code

You can find the full code and associated dataset file on my Github here.

Dataset

Your company has a limited set of data collected continuously from, say, sensors deployed in a process monitoring environment. For example, say paramX is measured every 30 seconds and the values are recorded as follow –

TimeStamp, paramX
06/08/2015 12:33:00, 199.84
06/08/2015 12:33:30, 199.14
06/08/2015 12:34:00, 199.96
06/08/2015 12:34:30, 200.14
06/08/2015 12:35:00, 199.94
06/08/2015 12:35:30, 199.66
...

Your main objective is to simulate this time-dependent dataset, enough to build a machine learning or forecasting model. There are a few ways of doing this –

  • Approach #1: Simulate each variable in the data univariately
  • Approach #2: Simulate each variable in the data multivariately

Approach #1 will mainly consist of finding the rate of change of each variable in the original dataset and simulating each variable independently. On the other hand, approach #2 will be similar to approach #1 but will include maintaining correlation between variables while simulating them. This would be essential to preserve the scientific relationship between variables, as they will be monitoring a process and a change in one variable might have an effect on other variable(s).

Whichever approach you want to follow, finding the rate of change of each variable is common between them, and I will cover that in this part of the series.

Finding the rate of change

In simple terms, rate is change in quantity A with respect to another quantity B. Common examples are speed (change in distance with respect to change in time) and accelaration (change in speed with respect to time). Usually, rate is calculated with respect to a time quantity. In our dataset, we are interested in finding the rate of parameter X (paramX) with respect to time. We notice that the time format is not a numeric type and that would be critical in our calculation. In Python, we can convert time format to seconds using the following piece of code –

index = pd.to_datetime(data.iloc[:, 0]).values.astype(np.int64)//10**9
index = pd.to_datetime(index, unit="s")

Here, data is a pandas dataframe containing our original dataset.

Next we add index to data

data['index_col'] = index

and calculate the rate as below –

# get parameter values
values = pd.Series(data.iloc[:, 1].values, index=index)

# Find rate as difference of two consecutive values divided by
# their date difference (in seconds)
rate = values.diff()*(1/data['index_col'].diff().values)

We can save this rate as an additional column, do some cleaning and arrive at the final dataset as below –

TimeStamp, paramX, rate
06/08/2015 12:33:00, 199.84, 0
06/08/2015 12:33:30, 199.14, -0.023333333
06/08/2015 12:34:00, 199.96, 0.027333333
06/08/2015 12:34:30, 200.14, 0.006
06/08/2015 12:35:00, 199.94, -0.006666667
06/08/2015 12:35:30, 199.66, -0.009333333
...

Conclusion

That’s it, we have successfully calculated rate of change of a time-dependent dataset. We will use this rate of change in generating a large sample of rates using parallel computing (my next post).

Thank you for reading.