Pivot tables in Python / pandas

Matt Luther

In this post we'll go over the basic functions that the Python data library pandas provides for reorganizing and regrouping tables of data.

We'll start by importing the module.

In [1]:
import pandas as pd

Some examples of DataFrames

The basic object we manipulate with pandas is the DataFrame. We can think of this as 2d array, matrix, or table holding values. We can (optionally) provide columns and row names.

Here's an example.

In [2]:
pd.DataFrame([[1,2],[3,4]], columns=['column A', 'column B'])
column A column B
0 1 2
1 3 4

In practice we would usually create the DataFrame by reading in values from some outside source, such as a .csv file. Pandas will try to infer types, but notice that things are pretty general. If we wanted to, we could create tables like the following, where the types within columns aren't even consistent.

In [3]:
0 1
0 1.0 a
1 3.7 [b, 7]

This last example mimics a typical looking database table for something like sales data. We'll prepare the data in steps as a usual python list sales, and then create the DataFrame which we'll call sales_data.

In [4]:
columns = ['unique sale number', 'item name', 'color', 'total $ cost']

sales = []
sales.append([1, 'hat', 'red', 9.05])
sales.append([2, 'ball', 'blue', 12.34])
sales.append([3, 'hat', 'green', 11.27])
sales.append([4, 'ball', 'red', 8.72])

sales_data = pd.DataFrame(sales, columns=columns)
unique sale number item name color total $ cost
0 1 hat red 9.05
1 2 ball blue 12.34
2 3 hat green 11.27
3 4 ball red 8.72


We will use the last example in the previous section to discuss the pivot_table method of DataFrame.

First let's talk for a moment about what pivot tables do.

Notice that the table above is a row-by-row record of transactions. Each row in the table corresponds to one complete record of the information. We could think of the information as being organized by sale, or by sale number.

It can be useful to organize this data in other ways. For example, the items of the same type ("hat" or "ball") are separated in the table. If we were interested in grouping together the sales according to which type of item was purchased, then we would want to group the "hat" sales together, and group the "ball" sales together. We could think of the desired layout as being organized by item name, rather than by sale number. Changing to an alternative layout like this is an example of constructing a pivot table.

It will be useful throughout this to think of forming a pivot table as forming a table with a hiararchy of row labels or of column labels. The hierarchy is determined by how we decide to group the values from our original columns. This will be more clear once we get to examples.

Forming a pivot table with pandas

We can get pandas to form a pivot table for our DataFrame by calling the pivot or pivot_table methods and providing parameters about how we would like the resulting table organized. We can also use the stack and unstack methods to "flip" columns and rows of the resulting pivot tables to help control their layout.

We'll discuss each of these options below, using the table above as an example.

The pivot method

For example, we can use the pivot method to specify which of the original columns we would like to organize by in our new DataFrame. It's easiest to see this with an example.

Say we want to group the sales together according to item name. One way to do this is to call sales_data.pivot with the parameter index set to "item name". This will make our new DataFrame have a row for each different kind of item name in our original sales data. That is, we will have a row for "ball" and a row for "hat".

We also have to specify a columns parameter. For now, we can just set it to be "unique sale number". This means that our resulting DataFrame will look like several copies of a table with rows for "ball" and "hat" and columns for the sales numbers 1,2,3,4. This basic table structure will be copied for each of the remaining columns from our data, namely "color" and "total \$ cost". The values of the table will be filled with the color and total \$ cost values.

Let's see what it looks like.

In [5]:
sales_data.pivot(index="item name", columns="unique sale number")
color total $ cost
unique sale number 1 2 3 4 1 2 3 4
item name
ball NaN blue NaN red NaN 12.34 NaN 8.72
hat red NaN green NaN 9.05 NaN 11.27 NaN

We can see how this results in a table that is organized "by item name".

Notice that the column labels are now in a hierarchy: there are column labels "color" and "total $ cost", and then these are broken down in the next level by the values of unique sale number. This is important to help understand stack and unstack in the next section.

Also notice that many of the values are NaN. This is because many of the positions of the table do not have matching information from the original sales data.

For example, consider the top-left NaN value. That position of the table must correspond to the value of "the color of the ball with sale number 1", because that is what the row and column labels determine for that position. But sale number 1 did not involve a ball, so there is no value to provide for that entry. In contrast, sale number 1 was a hat, so there was a color value "red" to provide in the bottom-left position.

If we'd like to only see one of the remaining columns (that is, "color" or "total $ cost" in the above example), then we can use the values parameter.

In [6]:
sales_data.pivot(index="item name", columns="unique sale number", values="color")
unique sale number 1 2 3 4
item name
ball NaN blue NaN red
hat red NaN green NaN

Notice that in this case, we only see the color values. This should also help understand the layout of the pivot table where we do not specify a values parameter.

The stack and unstack methods.

The stack and unstack methods flip the layout of a DataFrame by moving whole levels of columns to rows, or whole levels of rows to columns. These are particularly useful to help manipulate the hierarchies we form when making pivot tables, but they can be applied any time.

To help understand this, remember the example above where our pivot table had a hierarchy of column labels. In that case, the column labels were broken down into multiple levels, one for "color" and "total $ cost", and another level for the unique sale number values.

Again, examples are easier to understand than the description.

Let's go back to the first pivot example:

In [7]:
pivoted_data = sales_data.pivot(index="item name", columns="unique sale number")
color total $ cost
unique sale number 1 2 3 4 1 2 3 4
item name
ball NaN blue NaN red NaN 12.34 NaN 8.72
hat red NaN green NaN 9.05 NaN 11.27 NaN

Right now, the sales numbers are written across as column labels. If we'd prefer that they were written downward as row labels, we can call stack on the DataFrame.

In [8]:
color total $ cost
item name unique sale number
ball 2 blue 12.34
4 red 8.72
hat 1 red 9.05
3 green 11.27

The stack method by default takes the last level in the column breakdown and turns it into the last row breakdown. So, we've achieved our goal.

If we call stack again, it will move the remaining column level. This will result in there not being any more columns. This is possible to do, and returns something reasonable. Note that this is no longer a DataFrame object however, it is a Series object.

In [9]:
item name  unique sale number              
ball       2                   color            blue
                               total $ cost    12.34
           4                   color             red
                               total $ cost     8.72
hat        1                   color             red
                               total $ cost     9.05
           3                   color           green
                               total $ cost    11.27
dtype: object

Unstack is similar to stack, but moves row levels to column levels.

One more thing to note is that these methods can be passed a parameter to specify which level in the hierarchy to move. As mentioned above, by default they will move the "last" level.

For example, if we start with our pivot table pivoted_data, and this time use stack on the 0 level, we will move the "color" and "total $ cost" labels, rather than the unique sale number labels.

In [10]:
sales_data.pivot(index="item name", columns="unique sale number").stack(0)
unique sale number 1 2 3 4
item name
ball color NaN blue NaN red
total $ cost NaN 12.34 NaN 8.72
hat color red NaN green NaN
total $ cost 9.05 NaN 11.27 NaN

This lets us control the order in which the stacking occurs. Compare the following result of two stack calls with the Series we obtained above.

In [11]:
sales_data.pivot(index="item name", columns="unique sale number").stack(0).stack()
item name                unique sale number
ball       color         2                      blue
                         4                       red
           total $ cost  2                     12.34
                         4                      8.72
hat        color         1                       red
                         3                     green
           total $ cost  1                      9.05
                         3                     11.27
dtype: object

We'll see in the pivot_table section below that we can simultaneously form a pivot table and stack it in one method call. But before we do that, we'll motivate pivot_table with an example of how pivot can fail.

A pivot problem

Let's add one more entry to our sales records to demonstrate a problem. We will append another list to sales and rebuild our DataFrame.

In [12]:
sales_data = pd.DataFrame(sales, columns=columns)
unique sale number item name color total $ cost
0 1 hat red 9.05
1 2 ball blue 12.34
2 3 hat green 11.27
3 4 ball red 8.72
4 5 hat green 14.99

We've added another green hat sale, but at a different cost.

This seems innocent enough, but now there's a problem if we try to use pivot in certain ways.

Consider what happens if we try to organize the data by color and item name. When we get to the green hats, the table would want a unique sale value and a total $ cost value. But with our data, there are two options, the green hat with sale number 3 and cost 11.27, and the one with sale number 5 and cost 14.99. The table doesn't have a way to decide between these, so the pivot method just throws an error.

In [ ]:
sales_data.pivot(index="item name", columns="color")

# ValueError: Index contains duplicate entries, cannot reshape

Of course, there are still meaningful ways to organize the data. The pivot_table method will help us deal with this problem.

The pivot_table method

The pivot_table method lets us specify more parameters and get a bit more control over the layout of the resulting table. We can use it to simultaneously pivot and stack, or to provide default values when entries are missing, or to aggregate values when there are multiple competing values.

By providing a fill_value parameter, we can set the default when values are missing. For example, if we just wanted to clean up our earlier example by replacing all NaNs with a blank space, we could do the following. More generally, we could use this to replace missing values with meaningful entries (for example, 0 if we wanted to calculate total sales sums).

We can also provide an aggfunc parameter, which is an aggregation function to apply to the multiple values which might appear for a position in the table. This lets us avoid the problem mentioned in the previous section. There are two things to note:

  • The aggfunc will apply to values even if there is only one for that position.
  • It defaults to the numpy mean function, which expects numerical input.

Together, these things mean that by default pivot_table will end up disregarding non-numeric input. But, as we're about to demonstrate, this can be avoided by providing your own aggfunc.

In the next example, we just reproduce our first pivot example using the new method and some extra parameters. We will set fill_value to be the empty string, so that we don't see any NaNs, and we will pass the identity function as our aggfunc parameter, so that we do not process the data.

In [14]:
sales_data.pivot_table(index="item name", columns="unique sale number", fill_value="", aggfunc=lambda x: x)
color total $ cost
unique sale number 1 2 3 4 5 1 2 3 4 5
item name
ball blue red 12.34 8.72
hat red green green 9.05 11.27 14.99

Next we'll see how pivot_table can elegantly perform stacking.

The index and columns parameters for pivot_table can take lists, not just single column labels. So, if we would like the rows to be broken down by item name, and then by unique sale number, we can just list these in that order as our index parameter.

In [15]:
sales_data.pivot_table(index=["item name", "unique sale number"], aggfunc=lambda x:x)
color total $ cost
item name unique sale number
ball 2 blue 12.34
4 red 8.72
hat 1 red 9.05
3 green 11.27
5 green 14.99

We've reproduced the earlier layout that required both pivot and stack.

We can also continue to provide a columns parameter, which can either be one of the column labels, like before, or can also be a list if we want to form a hierarchy. This lets us quickly structure our DataFrame like below.

In [16]:
sales_data.pivot_table(index=["item name", "unique sale number"], columns="color", fill_value="")
total $ cost
color blue green red
item name unique sale number
ball 2 12.34
4 8.72
hat 1 9.05
3 11.27
5 14.99

To demonstrate the aggfunc parameter better, we'll now form the pivot table that pivot couldn't in the previous section. For this example, we will use Python's sum as our aggregation function.

In [17]:
sales_data.pivot_table(index="item name", columns="color", fill_value="", aggfunc=sum)
unique sale number total $ cost
color blue green red blue green red
item name
ball 2 4 12.34 8.72
hat 8 1 26.26 9.05

Notice that the green hat entry contains the sum of the costs, like we wanted.

However, we obtained an 8 for the sale number in the green hat position. This was computed as the sum of the sales numbers, which is fairly meaningless. This shows that care should be taken to remember which function was used for aggregation, since it will be used for all values.

Multiple functions can be passed to aggfunc in a list, but this will form another level in the column hierarchy, broken down by the name of the function. Below, we demonstrate this with both the sum and min functions.

In [18]:
sales_data.pivot_table(index="item name", columns="color", fill_value="", aggfunc=[sum, min])
sum min
unique sale number total $ cost unique sale number total $ cost
color blue green red blue green red blue green red blue green red
item name
ball 2 4 12.34 8.72 2 4 12.34 8.72
hat 8 1 26.26 9.05 3 1 11.27 9.05

There doesn't seem to be a convenient way to use different aggregation functions depending on the column labels.

One final parameter we will mention for the pivot_table method is the margins parameter. This is a boolean that defaults to False, but if we set it to True, the resulting DataFrame will also include total sums along the rows and columns. The totals appear in entries whose column or row labels are "All".

In [19]:
sales_data.pivot_table(index=["item name","unique sale number"], columns="color", fill_value="", aggfunc=sum, margins=True)
total $ cost
color blue green red All
item name unique sale number
ball 2 12.34 12.34
4 8.72 8.72
hat 1 9.05 9.05
3 11.27 11.27
5 14.99 14.99
All 12.34 26.26 17.77 56.37

The last table we've made seems nice. We'll finish by removing the sales number data and specifying that we only want the total cost values.

To do this, we remove the "unique sale number" argument from index, and pass the "total $ cost" label in the values parameter.

In [20]:
sales_data.pivot_table(index="item name", columns="color", values="total $ cost", fill_value="", aggfunc=sum, margins=True)
color blue green red All
item name
ball 12.34 8.72 21.06
hat 26.26 9.05 35.31
All 12.34 26.26 17.77 56.37