This post is an extension of the earlier post PyQ – Put your Python where your Data Is which focuses on the technical advantages of using PyQ, its features and how to use them.

This mini-project utilises embedPy, released at the end of 2017 as the mirror package to PyQ, which allows Python functions to be called in q as well as the jupyterq kernel for executing code using Jupyter Notebook. We will uncover the stories behind a real data set, and discover the advantages of combining Python for visuals with a kdb+ backend for powerful number crunching. The focus is not on the underlying code but it will be available to look at along the way.

The Dataset

In 2015, the NYC Taxi & Limousine Commission publicly released a dataset containing 1 billion taxi journeys spanning from 2009-2017. It contains information such as: the time, pick-up and drop-off locations (longitude and latitude are not available from 2015 due to privacy concerns), the trip distance and the fare breakdown.

The data is available as monthly csv files, which were saved down into a kdb+ database using a modified version of the loader from the Kx cookbook. The total size of the database is around 80gb with no compression, way too large to process and save down in one go on my laptop. Luckily, kdb+'s simple partition structure and smart loading meant I was able to download, process and save down in batches; making sure to delete processed csv files to make room for new data.

The resulting kdb+ database is partitioned by date, with a sorted attribute on the pick-up time as there is no natural column to part on (a database also containing Uber or green taxi journeys would potentially warrant a parted attribute on the cab type).

Data Cleaning

With a billion rows, it was inevitable that there would be some erroneous data due to sensor or data entry errors. Methods to correct the data included:

  • Removing journeys with:
    • base & total fares < \$2.50 (the base fare was increased from \$2 in 2004).
    • fare distance < 0
    • base fare + tolls + surcharge + tips ≠ total fare
    • unrealistic average trip speeds (over 100 mph)
    • unrealistic tip percentages (over 10000%)
  • Collating identical categorical values with different entered names (e.g. combining CASH, CSH, CAS, Cas into CASH)
In [1]:
/ Utility functions

/ https://github.com/KxSystems/mlnotebooks/tree/master/utils
\l funcs.q
\l graphics.q
\l hdb
\c 300 300

/ Import Python libraries
mdates:.p.import`matplotlib.dates;
sns:.p.import`seaborn;
display:.p.import`IPython.display;

/ Change the default matplotlib styling
plt[`:style][`:use]"ggplot";

/ Function to covert between python and q datetimes
todatetime:{x . {(`year`mm`dd`hh`uu`ss$x),(x mod"j"$1e9)div 1000}`timestamp$y}[.p.import[`datetime;`:datetime;>]];

/ Functions to convert between dataframes and tables
/ https://github.com/KxSystems/embedPy/blob/master/tests/pandas.t
tab2df:{
  r:.p.import[`pandas;`:DataFrame.from_dict;flip 0!x][@;cols x];
  $[count k:keys x;r[`:set_index]k;r]
 }

df2tab:{n:$[.p.isinstance[x`:index;.p.import[`pandas]`:RangeIndex]`;0;x[`:index.nlevels]`];n!flip $[n;x[`:reset_index][];x][`:to_dict;`list]`}

/ Return dictionary of discrete values and occurances
freqwithcond:{[t;f;v;p;w]
  p:$[s:system "s";(s;0Ni)#p;p];
  d:{[t;f;v;w;x;p] x+0^(!/) value flip 0!?[t;enlist[(=;f;p)],$[1=count w;enlist w;1=count first w;enlist w;w];enlist[`v]!enlist v;enlist[`n]!enlist (count;`i)]}[t;f;v;w];
  #[;r] asc key r:(+/) d/[()!();] peach p
 };

/ Normalise distribution
freqwithcondnorm:{[t;f;v;p;w]
  d % sum d:freqwithcond[t;f;v;p;w]
 }

Cash is King Cash used to be King

Paying for things with card is becoming ever more convenient, faster and popular: taxi payments are no exception. The proportion of payments completed using card has risen from 20% in 2009 to around 60% in 2015, equivalent to a year on year growth rate of 20%. Interestingly card payments dip every December as customers turn to cash during the festive period.

In [2]:
t:update n:n%sum[n] by month from
    select n:count i
    by date.month,
        paymenttype 
    from yellow where date.year<2015,paymenttype in `CARD`CASH;

d:0!exec distinct[paymenttype]#paymenttype!n by month:month from t;

(`l1;`l2) set' {plt[`:plot_date][;;`linestyle pykw `solid;`markersize pykw 3] . exec (mdates[`:date2num] todatetime each month;n) from t where paymenttype=x} each `CARD`CASH;
plt[`:gcf][][`:set_size_inches] (16;5);
plt[`:gca][][`:set_xlabel] "Year";
plt[`:gca][][`:set_ylabel] "Proportion of Journeys";
.p.set[`l1;.p.unwrap l1];
.p.set[`l2;.p.unwrap l2];
plt[`:legend][.p.eval"l1+l2";("Card";"Cash");`loc pykw 0];
plt[`:show][];

Distribution of key variables

To quickly get a preliminary understanding of how the key variables (total fare, distance, tip amount) are represented in the dataset we can plot a histogram of each variable.

In [3]:
subplots:plt[`:subplots][3;1;`figsize pykw (14;10)];
fig:subplots[@;0];
ax1:subplots[@;1][`$":__getitem__"].p.eval","sv string enlist 0;
ax2:subplots[@;1][`$":__getitem__"].p.eval","sv string enlist 1;
ax3:subplots[@;1][`$":__getitem__"].p.eval","sv string enlist 2;

t1:freqwithcondnorm[`yellow;`date;`farecost;date;(<;`farecost;80)];
ax1[`:hist][key t1;160;`weights pykw value t1];
ax1[`:set_title]"Fare";
ax1[`:set_xlabel]"Total Fare ($)";
ax1[`:set_ylabel]"Fraction of Journeys";

t2:freqwithcondnorm[`yellow;`date;`distance;date where date.year=2010;(<;`distance;25)];
ax2[`:hist][key t2;250;`weights pykw value t2];
ax2[`:set_title]"Distance";
ax2[`:set_xlabel]"Distance (miles)";
ax2[`:set_ylabel]"Fraction of Journeys";

t3:freqwithcondnorm[`yellow;`date;`tip;date where date.year=2010;(<;`tip;10)];
ax3[`:hist][key t3;50;`weights pykw value t3];
ax3[`:set_title]"Tip";
ax3[`:set_xlabel]"Tip ($)";
ax3[`:set_ylabel]"Fraction of Journeys";

fig[`:tight_layout][];
plt[`:show][];

The distribution of total fare mostly follows a skewed normal distribution with a minimum fare of \$2.50, median at \$8.48 and trailing off at around \$45 with two unexplained peaks at \$45 and \$52 standing out.

Distance also follows a skewed normal distribution with a median of 1.71 miles. The high number of 0-0.1 mile journeys, dominated by 0 mile journeys, is unusual; they are likely to be outliers but could also correspond to people changing their minds or forgetting something. As we would expect 0 mile trips to cost the base rate, the number of 0 mile journeys should be lower or equal to the number of journeys costing the base fare. Querying for the number of 0 mile and \$2.50 journeys respectively reveals that it is in fact larger, so must include some outliers.

It is harder to see a pattern for the distribution of tips since it is so heavily dominated by the quantity of no tips. This is surprising given the reputation of the American tipping culture. Removing these and replotting below gives us a distribution peaking at multiples of \$0.50 as customers seemingly prefer to tip in rounded numbers.

In [4]:
display[`:display] tab2df t:(`$"0 Mile Journeys";`$"$2.5 Journeys") xcol select sum distance=0,sum farecost=2.50e
    from yellow where date.year<2015,(distance=0)|farecost=2.50e;
0 Mile Journeys $2.5 Journeys
6948968 3699243
In [5]:
subplots:plt[`:subplots][`figsize pykw (14;3)];
fig:subplots[@;0];
ax1:subplots[@;1]

t:freqwithcondnorm[`yellow;`date;`tip;date where date.year=2010;((>;`tip;0);(<;`tip;10))];
ax1[`:hist][key t;50;`weights pykw value t];
ax1[`:set_title]"Non-zero Tips";
ax1[`:set_xlabel]"Tip ($)";
ax1[`:set_ylabel]"Frequency";
fig[`:tight_layout][];
plt[`:show][];

Splitting out payments by the type and whether a tip was recorded or not reveals an interesting statistic: only 0.05% of cash transactions include a tip compared to the vast majority of card payments (~95%). Upon further research, it seems that either cash tips do not have a standard way of being processed or that cabbies prefer to not declare cash tips as they would be taxed.

Whatever the reason, as zero tips make up such a huge proportion of the dataset and will have a large impact on any tip-based analysis, only non-zero tips will be included in these calculations.

The other payment options are DIS (disputed), NOC (no charge) and UNK (unknown).

In [6]:
display[`:display] tab2df t:exec distinct[paymenttype]#paymenttype!n by Tip:tip from select n:count i
    by value paymenttype, tip:`$("Non-Zero Tip";"Zero Tip") tip=0
    from yellow where date.year<2015;
CARD CASH CRD CSH DIS NA NOC UNK
Non-Zero Tip 410671567 215220 29927326 1165 2191 291 4903 1052539
Zero Tip 12674028 544478485 1098069 22748684 484860 38336 1837628 14491

Returning to the unusual peaks we saw in the fare distribution from earlier, we can plot the longitude and latitude of the corresponding journeys to look for patterns. Trips costing \$45 or \$52 were plotted on a scatter map with the pick-up and drop-off shown in orange and blue respectively.

In [7]:
subplots:plt[`:subplots][1;2;`sharey pykw 1b;`figsize pykw (14;10)];
fig:subplots[@;0];

t1:-50000?select x:(pick-uplong,'dropofflong), y:(pick-uplat,'dropofflat) from yellow where farecost in 45 52e;
ax1:subplots[@;1][`$":__getitem__"].p.eval","sv string  enlist 0;
ax1[`:set_title]"All \$45/\$52 Trips";
ax1[`:scatter][;;`s pykw .02;`alpha pykw .6]'[flip t1`x;flip t1`y];
ax1[`:set_ylabel]"Latitude";

t2:-50000?select x:(pick-uplong,'dropofflong), y:(pick-uplat,'dropofflat)
    from yellow
    where
    farecost in 45 52e,
    not any all ((pick-uplat;dropofflat) within (40.760;40.780);(pick-uplong;dropofflong) within (-73.890;-73.8500)),
    not any all ((pick-uplat;dropofflat) within (40.620;40.670);(pick-uplong;dropofflong) within (-73.830;-73.7300));
ax2:subplots[@;1][`$":__getitem__"].p.eval","sv string enlist 1;
ax2[`:set_title]"Excluding those to/from Airport";
ax2[`:scatter][;;`s pykw .02;`alpha pykw .6]'[flip t2`x;flip t2`y];

(ax1;ax2).\:(`:set_xlim;(-74.06;-73.77));
(ax1;ax2).\:(`:set_ylim;(40.61;40.91));
(ax1;ax2).\:(`:set_facecolor;`black);
(ax1;ax2).\:(`:grid;0b);

fig[`:text][0.5;0;`Longitude;`ha pykw `center];
fig[`:tight_layout][];
plt[`:show][];

As we might have expected, clusters around the airports are clearly visible, with JFK in the bottom right and LaGuardia just right of the centre. Trips to and from the airports often demand a fixed rate and are very popular, resulting in a high density of journeys. The fact that the Taxi and Limousine Commission voted to raise the flat-rate fee between Manhattan and JFK Airport from \$45 to \$52 in 2012 backs up our findings.

The middle of Manhattan mostly contains drop-offs and the bright orange I-495 out of Midtown is used mostly for pick ups. This could be because it is much quicker for passengers to flag down taxis here than further into the Boroughs given the number of taxis that use it.

After filtering journeys starting or ending at any of the airports, Jersey City to the west of Manhattan has a higher intensity and consists of almost exclusively of drop-offs. Although Jersey City isn't that far away from Manhattan island, the fact that it is in a different state and subsequent trips are difficult for drivers to find means these trips are also charged at a higher fixed rate.

Tip Percentage as a function of time

The average tip percentage for each month in the dataset is plotted below.

A fare rise was introduced on Tuesday 4th September 2012, the first since 2006 and estimated to average around 17% per ride. The effect on tip percentages was instant with an overnight drop of 0.8 percentage points followed by sustained lower tips, although over time it seems that they will return to their previous levels.

In [8]:
subplots:plt[`:subplots][2;1;`figsize pykw 14 8];
fig:subplots[@;0];

t1:0!select tippcnt:100*avg (tip%total-tip) except 0w by date.month from yellow where tip<>0;
ax1:subplots[@;1][`$":__getitem__"].p.eval","sv string  enlist 0;
l1:ax1[`:plot_date][mdates[`:date2num] todatetime each exec month from t1;exec tippcnt from t1;`color pykw "tab:red";`linestyle pykw `solid;`markersize pykw 3];
ax1[`:set_xlabel] `$"Month";
ax1[`:set_ylabel] `$"Tip Percentage";

t2:0!select fare:avg total-tip except 0w by date.month from yellow where tip<>0;
ax2:ax1[`:twinx][];
l2:ax2[`:plot_date][mdates[`:date2num] todatetime each exec month from t2;exec fare from t2;`color pykw "tab:blue";`linestyle pykw `solid;`markersize pykw 3];
ax2[`:grid]0b;
ax2[`:set_ylabel]"Pre-tip Fare";
.p.set[`l1;.p.unwrap l1];
.p.set[`l2;.p.unwrap l2];
ax1[`:legend][.p.eval"l1+l2";("Tip Percentage";"Pre-tip Fare");`loc pykw 0];

t3:0!select tippcnt:100*avg (tip%total-tip) except 0w by date from yellow where date.month in 2012.09m, tip<>0;
ax3:subplots[@;1][`$":__getitem__"].p.eval","sv string  enlist 1;
ax3[`:plot_date][mdates[`:date2num] todatetime each exec date from t3;exec tippcnt from t3;`linestyle pykw `solid;`markersize pykw 3];
ax3[`:set_xlabel] `$"Date";
ax3[`:set_ylabel] `$"Tip Percentage";
ax3[`:set_title] `$"September 2012";
ax3[`:annotate]["Price Rise";`xy pykw (.p.py2q .p.unwrap mdates[`:date2num]todatetime 2012.09.04;18.55);
            `xytext pykw (.p.py2q .p.unwrap mdates[`:date2num]todatetime 2012.09.01;18.7);
             `arrowprops pykw enlist[`facecolor]!enlist `black];

fig[`:tight_layout][];
plt[`:show][];

The second graph shows the daily average tip percentage during August 2012 and after the drop a weekly cycle develops with a low occuring each Friday.

We can go even further and bucket the trips by the day of the week and hour of the day and use a heatmap to easily visualise the three dimensions.

In [9]:
subplots:plt[`:subplots][3;1;`sharey pykw 1b;`figsize pykw (14;8)];
fig:subplots[@;0];

t1:0!select tippcnt:100*avg (tip%total-tip) except 0w by dayofweek:date mod 7,hour:dropofftime.hh from yellow where date.year within 2010 2014,tip<>0;
ax1:subplots[@;1][`$":__getitem__"].p.eval","sv string enlist 0;
ax1[`:set_title]"Tip Percentage";
sns[`:heatmap][;`ax pykw ax1]tab2df update `Saturday`Sunday`Monday`Tuesday`Wednesday`Thursday`Friday dayofweek from exec distinct[t1`hour]#hour!tippcnt by dayofweek:dayofweek from t1:update `$string hour from t1;
ax1[`:set_ylabel][""];

t2:0!select distance:avg distance by dayofweek:date mod 7,hour:dropofftime.hh from yellow where date.year within 2010 2014,tip<>0;
ax2:subplots[@;1][`$":__getitem__"].p.eval","sv string enlist 1;
sns[`:heatmap][;`ax pykw ax2] tab2df update `Saturday`Sunday`Monday`Tuesday`Wednesday`Thursday`Friday dayofweek from exec distinct[t2`hour]#hour!distance by dayofweek:dayofweek from t2:update `$string hour from t2;
ax2[`:set_title]"Fare";
ax2[`:set_ylabel][""];

t3:0!select total:avg total by dayofweek:date mod 7,hour:dropofftime.hh from yellow where date.year within 2010 2014,tip<>0;
ax3:subplots[@;1][`$":__getitem__"].p.eval","sv string enlist 2;
sns[`:heatmap][;`ax pykw ax3] tab2df update `Saturday`Sunday`Monday`Tuesday`Wednesday`Thursday`Friday dayofweek from exec distinct[t3`hour]#hour!total by dayofweek:dayofweek from t3:update `$string hour from t3;
ax3[`:set_title]"Distance";
ax3[`:set_xlabel]["Hour of Day"];
ax3[`:set_ylabel][""];

fig[`:tight_layout][];
plt[`:show][];

During weekdays there are signs that commuter are tight on tips with the period just after work (18:00-20:00) being the worst for cabbies given that the average fare is also relatively low during the period.

Weekends show almost an inversion of the weekday trend, with morning journeys resulting in the largest tip percentages while late night journeys (8pm-3am) are lowest: perhaps passengers heading into Manhattan for a night out are eager to save a bit of cash?

EmbedPy is an absolutely fantastic tool, enabling more efficient and meaningful data analysis by cutting down query time and improving understanding by doing what kdb+ does best: bundling data and the methods together.

I would recommend you try it out and see what stories you can uncover in this rich dataset.