In November 2017, Kx acquired the rights to PyQ from Enlightenment Research and made it freely available under the Apache 2.0 licence. PyQ provides seamless integration of a Python interpreter and a kdb+ server in a single process, and works neatly from Jupyter notebooks, allowing users to benefit from the best of both technologies simultaneously.

We have combined PyQ with our TorQ-FX package to illustrate the power of these two technologies by analysing large quantities of foreign exchange, commodities and index data. This post is a Jupyter notebook, so you can see exactly what we've done and try it for yourself.

NumPy and Pandas

In Python everything is an object and typing is dynamic. This increases ease of use, but comes at the cost of speed, as every variable is stored with the overhead of a full Python object. Even a single integer carries information specifying its reference count, type, size and digit value.

For large arrays or tables of data this structure is very inefficient, since all the values in a table column tend to be of the same type anyway. Enter numpy. numpy arrays are densely packed arrays of a fixed type, so you get the benefits of locality of reference. Additionally, many numpy operations are implemented in C, avoiding the cost of Python loops, pointer indirection and per-element dynamic type checking.
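As a rough illustration of that overhead, here's a quick sketch comparing the per-element memory cost of a plain Python list with the equivalent numpy array (exact figures vary by platform and Python version):

import sys
import numpy as np

# a Python list of a million integers: each element is a full Python object,
# and the list itself also stores a pointer to every one of them
py_list = list(range(1000000))
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# the equivalent numpy array packs the same values as raw 8-byte machine integers
np_array = np.arange(1000000)
array_bytes = np_array.nbytes

print("list:  ~{:.0f} bytes per element".format(float(list_bytes) / len(py_list)))
print("numpy: ~{:.0f} bytes per element".format(float(array_bytes) / len(np_array)))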

The other key piece of the data analysis toolkit in modern Python is pandas. pandas combines fast fixed-type arrays into high-level relational data structures (tables!), and provides the syntax and tools for doing practical, real-world data analysis on them.
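As a minimal sketch (the column names here are purely illustrative), a DataFrame groups named, fixed-type columns into one table-like object that can be queried and aggregated:

import numpy as np
import pandas as pd

# a DataFrame is a collection of named, fixed-type columns
trades = pd.DataFrame({'price': np.random.rand(5),
                       'size': np.random.randint(1, 100, 5)})

print(trades.dtypes)           # each column keeps its own numpy dtype
print(trades['price'].mean())  # column operations are vectorized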

kdb+ is designed from the ground up to cope with large amounts of time-series data. It makes very similar design choices to numpy: data is stored in fixed-type arrays by default with very little overhead, and it uses fast vectorized operations. With kdb+ we also get much more database functionality, including the ability to query data on disk as well as in memory using an SQL-like syntax.

To quickly demonstrate the significance of these different data structures, compare the speed of a simple summation of 20,000,000 random numbers:

In [1]:
import random

# create a python list of 20m numbers and sum
large_array = [random.random() for x in range(20000000)]
%timeit sum(large_array)
1 loop, best of 3: 529 ms per loop
In [2]:
import numpy as np

# create a numpy array of 20m numbers
large_array = np.random.rand(20000000)                    
%timeit np.sum(large_array)
10 loops, best of 3: 22.2 ms per loop
In [3]:
from pyq import q
# create array of 20m numbers in q and sum
%q large_array:20000000?1.0
%q \t sum(large_array)
Out[3]:
24

The speed advantage of the vectorized array approach used by both numpy and kdb+ is obvious: the summation takes roughly 20 ms in each, compared to over 500 ms for the pure-Python version.

The Best of Both Worlds with PyQ

pandas and numpy are fantastic data analysis tools, and the ecosystem of tools in python for machine learning, statistics and data visualisation is second to none. However, these tools are limited to in-memory processing: once a dataset is too big to fit in memory you're out of luck! This is where pyq comes in. It allows us to combine kdb+ and python seamlessly in a single process, taking full advantage of the strengths of both. pyq queries on our data are executed by kdb+, and the results can be stored as shared memory objects that either python or kdb+ can use; this avoids an expensive serialization step to transfer data between processes.
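A minimal sketch of that shared-memory behaviour, assuming a recent PyQ version (numpy.asarray on a simple q vector is documented as a zero-copy view, so changes made on the python side are visible to q):

import numpy as np
from pyq import q

# build a simple integer vector on the q side
vec = q.til(10)

# numpy.asarray on a simple K vector shares the underlying memory rather than copying it
arr = np.asarray(vec)
arr[::2] = 0        # modify the data through the numpy view...

print(vec)          # ...and the q object reflects the change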

For this post we've prepared a large historical FX database using our TorQ-FX package; if you want to try it out yourself you'll need to download a database first. The full history for this dataset runs to hundreds of gigabytes, so in-memory processing is out of the question unless you have some very expensive hardware.

First we load the database into our session (adjust the path to match where your database is stored):

In [4]:
%%q 
system"l /home/username/Gain-FX/fxhdb/";

With a well structured database mapped into memory it's very easy and fast to run simple queries and only pull the necessary information into memory. As a very simple example we can pull out the closing prices of every USD pair in 2017:

In [5]:
%%q -o fxdata
0!select 
        ClosingPrice:last RateBid 
    by 
        date,CurrencyPair 
    from 
        gainfx
    where 
        date within (2017.01.01;2018.01.01),
        CurrencyPair like "*USD"

The result is now saved as fxdata, and can be manipulated by python as well as kdb+:

In [6]:
import datetime

# index the date column and add 1 day (in python!!)
fxdata['date'] + datetime.timedelta(days=1)
Out[6]:
2017.01.03D00:00:00.000000000 2017.01.03D00:00:00.000000000 2017.01.03D00:00:..

We can easily visualize the data using any of the many fantastic libraries available (here we are using plotly):

In [7]:
import pandas as pd
import cufflinks as cf 
import plotly.offline as py
cf.go_offline()

# pivot to the necessary data shape
data = pd.DataFrame(dict(fxdata.flip)).pivot(index='date', 
                                             columns='CurrencyPair', 
                                             values='ClosingPrice')
# and plot a simple line chart
data.iplot(kind='scatter',
           title='2017 USD Rates',
           yrange=[0.5, 2],
           showlegend=False)

Statistical Analysis in Python

Python also gives us access to tools that make it easy to carry out statistical analysis such as a Pearson correlation. Pearson correlations are a quick way to explore complex relationships between a large number of variables. The Pearson correlation coefficient measures the strength of the linear relationship between two variables; its value varies between -1 and 1, representing a perfect negative correlation and a perfect positive correlation respectively.
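For reference, the coefficient is just the covariance of the two series scaled by the product of their standard deviations. Here is a quick sketch with made-up data, checked against numpy's built-in corrcoef:

import numpy as np

# Pearson coefficient from its definition: covariance divided by the
# product of the two standard deviations
def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

a = np.random.rand(1000)
b = 2 * a + 0.1 * np.random.rand(1000)   # strongly positively related to a

print(pearson(a, b))                     # close to +1
print(np.corrcoef(a, b)[0, 1])           # numpy's built-in agrees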

The fundamental issue with time series modeling is that the data will typically be non-stationary, i.e. the mean, variance or covariance of the data will be a function of time. Before any statistical test is carried out the data must be stationarized.
To check whether the data is stationary, an Augmented Dickey-Fuller test can be used. This is a type of unit root test that determines how strongly a time series is defined by a trend. It fits an autoregressive model and optimizes an information criterion across multiple lagged difference terms. The result is interpreted via the calculated p-value: if the p-value is greater than 0.05 we cannot reject the null hypothesis that the data is non-stationary, while a p-value below 0.05 suggests the data is stationary.

We import the adfuller function from statsmodels to make this easy. Running the test on one pair:

In [8]:
from statsmodels.tsa.stattools import adfuller

"p-value from Dickey-Fuller test: {}".format(adfuller(data.iloc[0:,4].dropna(), autolag='AIC')[1])
Out[8]:
'p-value from Dickey-Fuller test: 0.618905959025'

We can see the p-value is much larger than the 0.05 limit, therefore we cannot reject the null hypothesis of the Dickey-Fuller test that the daily time series is non-stationary. Before any further statistical analysis is carried out we must stationarize our data by removing the trend and seasonality. Firstly a non-linear log transformation is applied; this removes large variations caused by rates being on significantly different orders of magnitude. Then we apply first-order differencing. Let's see if it works:

In [9]:
stat_data = np.log(data)
stat_data = stat_data - stat_data.shift()
stat_data.dropna(inplace=True)

"p-value from Dickey-Fuller test: {}".format(adfuller(stat_data.iloc[0:,4], autolag='AIC')[1])
Out[9]:
'p-value from Dickey-Fuller test: 7.38822603744e-27'

The p-value is now well below the 0.05 threshold, so our data can be assumed to be stationary and ready for analysis. Let's plot a heat map of the correlation between the different pairs:

In [10]:
import plotly.graph_objs as go

# map USD pair codes to full names
column_map = {'BCO/USD': 'Brent Crude Oil', 'COR/USD': 'Corn', 'COT/USD': 'Cotton',
              'CUU/USD': 'Copper', 'HGO/USD': 'Heating Oil', 'NSX/USD': 'NASDAQ100',
              'NTG/USD': 'Natural Gas', 'PDU/USD': 'Palladium', 'PTU/USD': 'Platinum',
              'SOY/USD': 'Soybean', 'SPX/USD': 'S&P 500', 'SUG/USD': 'Sugar',
              'UDX/USD': 'DJIA', 'WHT/USD': 'Wheat', 'WTI/USD': 'Crude Oil',
              'XAG/USD': 'Silver', 'XAU/USD': 'Gold', 'AUD/USD': 'AUD',
              'EUR/USD': 'EUR', 'GBP/USD': 'GBP', 'NZD/USD': 'NZD'}
stat_data = stat_data.rename(columns=column_map)

heatmap = go.Heatmap(z=stat_data.corr(method='pearson').values,
                     x=stat_data.columns,
                     y=stat_data.columns,
                     zmin=-1,
                     zmax=1,
                     colorbar=dict(title='Pearson Coeff.'))
fig = go.Figure(data=[heatmap], 
                layout=go.Layout(autosize=True,
                                 width = 540,
                                 height = 500))
py.iplot(fig)

The results of this heat map are fascinating. The strongest correlations are highlighted in red, while weak or slightly negative correlations are in blue.

There are many interesting patterns in here, but a few jump out:

  • The precious metals gold and silver are very closely correlated, as they're both used as a store of value
  • The Australian dollar has the highest correlation of any currency with the precious metals, as Australia is a major producer of both
  • The DJIA and gold have a slight negative correlation, likely because investors move to gold in down markets

We have only scratched the surface of what you can do by combining kdb+ and python with pyq. Python is home to many of the most powerful machine learning tools, such as tensorflow, and with pyq you can easily bring your data to these tools too!