Using Pandas and Python to Merge Basketball Lineup Data

Over the past few weeks I've been working on a number of improvements to the SpatialJam Lineup-O-Matic. My previous blog post covers how the tool was originally created, the outcome was usable, but a bit clunky and cumbersome for users to navigate efficiently. This is because the interface was relying on the data visualization tool Tableau to do much of the processing and heavy lifting. While Tableau is a great platform, asking it to process hundreds of thousands of rows of play by play data was unrealistic.

To work around this, I utilized both Python and the open-source library Pandas to process the data prior to pushing it to Tableau servers.

I mentioned in my previous post some of the basic ideas SpatialJam is using to extract the lineup data from the FIBA Live Stats interface and while this process has been amended slightly to create a cleaner output, the principals remain the same. The biggest improvement with the new script it the ability to understand whether a particular lineup is playing offense or defense and record the statistics accordingly. This means for every line of play by play (for example a shot attempt or an assist occurring), there are now two lines of data added to the database, one for the lineup that recorded the statistic and one for the lineup the statistic was recorded against.

The resulting database is the processed using Pandas which allows the grouping of lineups and merging of statistics before pushing to the Tableau interface.

Pandas has the ability to quickly process large datasets by creating DataFrame objects, SpatialJam's tool utilities this functionality to group every line of Play by Play which has been attributed to a particular lineup and both sum and calculate statistics.

Firstly the input file is read - in this case a csv file containing all the lineup data extracted from the previous tool:

import pandas as pd
import numpy as np
#Read CSV
inputFile = pd.read_csv('C:\\...\\NBL\\combined.csv', encoding = "ISO-8859-1", low_memory=False)

Once imported Pandas can now aggregate the data using the groupby function:

df = inputfile.groupby([columnstosortby]).agg(aggregationmethod)

In the case of the lineup data, we're wanting to group the data by a number of columns; The lineup itself (a string field containing all 5 players on the court seperated by the ' | ' character), the team they play for and the team they were playing against. Because we're factoring in the opponent here, it means that a particular lineup will potentially appear in the final data multiple times if they were played against a number of different opponents - Tableau will take care of this for us.

Instead of just aggregating a single field, Pandas can work with many fields all at once, which is good in this instance as we're wanting to summarise a number of statistics that each lineup has recorded, not just +/-:

aggr = {
    '(+/-)': {'total': 'sum'},
    '2PA': {'total': 'sum'},
    '2PM': {'total': 'sum'},
    '3PA': {'total': 'sum'},
    '3PM': {'total': 'sum'},
    'FGA': {'total': 'sum'},
    'FGM': {'total': 'sum'},
    'FTA': {'total': 'sum'},
    'FTM': {'total': 'sum'},
    'dREB': {'total': 'sum'},
    'oREB': {'total': 'sum'},
    'AST': {'total': 'sum'},
    'STL': {'total': 'sum'},
    'BLK': {'total': 'sum'},
    'TOV': {'total': 'sum'},
    'PTS': {'total': 'sum'},
    'o2PA': {'total': 'sum'},
    'o2PM': {'total': 'sum'},
    'o3PA': {'total': 'sum'},
    'o3PM': {'total': 'sum'},
    'oFGA': {'total': 'sum'},
    'oFGM': {'total': 'sum'},
    'oFTA': {'total': 'sum'},
    'oFTM': {'total': 'sum'},
    'odREB': {'total': 'sum'},
    'ooREB': {'total': 'sum'},
    'oAST': {'total': 'sum'},
    'oSTL': {'total': 'sum'},
    'oBLK': {'total': 'sum'},
    'oTOV': {'total': 'sum'},
    'oPTS': {'total': 'sum'}
out = inputFile.groupby(['lineup', 'team', 'opponent', 'P1', 'P2', 'P3', 'P4', 'P5 ']).agg(aggr)

Here Pandas is summing a range of statistics if the fields being grouped by match.

At this stage we also want to calculate a number of metrics not processed in the initial pull of data from the FIBA System. To do this a new field is created, the formula for a particular metric is calculated and the result added to each lineup row

#Calculate Possesions
out['oPOS'] = (.96*(out['FGA']+out['TOV']+.44*out['FTA']-out['oREB']))
out['dPOS'] = (.96*(out['oFGA']+out['oTOV']+.44*out['oFTA']-out['ooREB']))

#Calculate Efficency
out['oEFF'] = ((out['PTS']/(.96*(out['FGA']+out['TOV']+.44*out['FTA']-out['oREB'])))*100)
out['dEFF'] = ((out['oPTS']/(.96*(out['oFGA']+out['oTOV']+.44*out['oFTA']-out['ooREB'])))*100)

#Calculate Metrics
out['eFG%'] = ((out['FGM']+0.5*out['3PM'])/out['FGA'])
out['oeFG%'] = ((out['oFGM']+0.5*out['o3PM'])/out['oFGA'])
out['2P%'] = (out['2PM']/out['2PA'])
out['o2P%'] = (out['o2PM']/out['o2PA'])
out['3P%'] = (out['3PM']/out['3PA'])
out['o3P%'] = (out['o3PM']/out['o3PA'])
out['TOV%'] = (out['TOV']/(out['FGA']+.44*out['FTA']+out['TOV']))
out['oTOV%'] = (out['oTOV']/(out['oFGA']+.44*out['oFTA']+out['oTOV']))

And that's about it. Pandas makes the processing of this large dataset effortless and super quick and Tableau can now be left to only have to think about visualizing the data without having to worry about having to do any of the intensive lineup grouping or calculating of metrics.

The resulting dataset can be seen here

Of Note:

The +/- statistic is also now recorded during the initial data extract, something that was not possible with version 1 of the Lineup-O-Matic. Every time a point is scored by a given lineup, the value of those points (1, 2 or 3pts) is recorded for that lineup (+) and against (-) the defensive lineup currently on the floor. There is of course an issue that arises with this approach; it's not uncommon for a substitution to occur prior to or during free throw shooting meaning the points for and against will be attributed to the incorrect players. Consider the example play by play below:

Player 1 fouls Player 2 while Player A was shooting
Player 1 is subbed out for Player 3
Player A makes both free throws
Play continues...

In this case the (-2) points would be attributed to Player 3 as he/she was on the court when the free throws occurred, rather than against Player 1 who committed the foul that lead to the (-2) points.

This is a problem I'm yet to solve and will require some complex logic to work around. While it will only result in a small and probably insignificant difference for the lineup data, it's still something that will require work.