In [1]:

```
import pandas as pd
```

The basic object we manipulate with pandas is the DataFrame. We can think of this as 2d array, matrix, or table holding values. We can (optionally) provide columns and row names.

Here's an example.

In [2]:

```
pd.DataFrame([[1,2],[3,4]], columns=['column A', 'column B'])
```

Out[2]:

In [3]:

```
pd.DataFrame([[1,'a'],[3.7,['b',7]]])
```

Out[3]:

*sales*, and then create the DataFrame which we'll call *sales_data*.

In [4]:

```
columns = ['unique sale number', 'item name', 'color', 'total $ cost']
sales = []
sales.append([1, 'hat', 'red', 9.05])
sales.append([2, 'ball', 'blue', 12.34])
sales.append([3, 'hat', 'green', 11.27])
sales.append([4, 'ball', 'red', 8.72])
sales_data = pd.DataFrame(sales, columns=columns)
sales_data
```

Out[4]:

We will use the last example in the previous section to discuss the *pivot_table* method of *DataFrame*.

First let's talk for a moment about what pivot tables do.

Notice that the table above is a row-by-row record of transactions. Each row in the table corresponds to one complete record of the information. We could think of the information as being organized by sale, or by sale number.

It can be useful to organize this data in other ways. For example, the items of the same type ("hat" or "ball") are separated in the table. If we were interested in grouping together the sales according to which type of item was purchased, then we would want to group the "hat" sales together, and group the "ball" sales together. We could think of the desired layout as being organized by item name, rather than by sale number. Changing to an alternative layout like this is an example of constructing a pivot table.

It will be useful throughout this to think of forming a pivot table as forming a table with a hiararchy of row labels or of column labels. The hierarchy is determined by how we decide to group the values from our original columns. This will be more clear once we get to examples.

We can get *pandas* to form a pivot table for our DataFrame by calling the *pivot* or *pivot_table* methods and providing parameters about how we would like the resulting table organized. We can also use the *stack* and *unstack* methods to "flip" columns and rows of the resulting pivot tables to help control their layout.

We'll discuss each of these options below, using the table above as an example.

For example, we can use the *pivot* method to specify which of the original columns we would like to organize by in our new DataFrame. It's easiest to see this with an example.

Say we want to group the sales together according to item name. One way to do this is to call *sales_data.pivot* with the parameter *index* set to *"item name"*. This will make our new DataFrame have a row for each different kind of item name in our original sales data. That is, we will have a row for "ball" and a row for "hat".

We also have to specify a *columns* parameter. For now, we can just set it to be *"unique sale number"*. This means that our resulting DataFrame will look like several copies of a table with rows for "ball" and "hat" and columns for the sales numbers 1,2,3,4. This basic table structure will be copied for each of the remaining columns from our data, namely "color" and "total \$ cost". The values of the table will be filled with the color and total \$ cost values.

Let's see what it looks like.

In [5]:

```
sales_data.pivot(index="item name", columns="unique sale number")
```

Out[5]:

We can see how this results in a table that is organized "by item name".

Notice that the column labels are now in a hierarchy: there are column labels "color" and "total $ cost", and then these are broken down in the next level by the values of unique sale number. This is important to help understand *stack* and *unstack* in the next section.

Also notice that many of the values are *NaN*. This is because many of the positions of the table do not have matching information from the original sales data.

For example, consider the top-left *NaN* value. That position of the table must correspond to the value of "the color of the ball with sale number 1", because that is what the row and column labels determine for that position. But sale number 1 did not involve a ball, so there is no value to provide for that entry. In contrast, sale number 1 was a hat, so there was a color value "red" to provide in the bottom-left position.

If we'd like to only see one of the remaining columns (that is, "color" or "total $ cost" in the above example), then we can use the *values* parameter.

In [6]:

```
sales_data.pivot(index="item name", columns="unique sale number", values="color")
```

Out[6]:

*values* parameter.

The *stack* and *unstack* methods flip the layout of a DataFrame by moving whole levels of columns to rows, or whole levels of rows to columns. These are particularly useful to help manipulate the hierarchies we form when making pivot tables, but they can be applied any time.

To help understand this, remember the example above where our pivot table had a hierarchy of column labels. In that case, the column labels were broken down into multiple levels, one for "color" and "total $ cost", and another level for the unique sale number values.

Again, examples are easier to understand than the description.

Let's go back to the first pivot example:

In [7]:

```
pivoted_data = sales_data.pivot(index="item name", columns="unique sale number")
pivoted_data
```

Out[7]:

*stack* on the DataFrame.

In [8]:

```
pivoted_data.stack()
```

Out[8]:

The *stack* method by default takes the last level in the column breakdown and turns it into the last row breakdown. So, we've achieved our goal.

If we call *stack* again, it will move the remaining column level. This will result in there not being any more columns. This is possible to do, and returns something reasonable. Note that this is no longer a *DataFrame* object however, it is a *Series* object.

In [9]:

```
pivoted_data.stack().stack()
```

Out[9]:

*Unstack* is similar to *stack*, but moves row levels to column levels.

One more thing to note is that these methods can be passed a parameter to specify which level in the hierarchy to move. As mentioned above, by default they will move the "last" level.

For example, if we start with our pivot table *pivoted_data*, and this time use *stack* on the *0* level, we will move the "color" and "total $ cost" labels, rather than the unique sale number labels.

In [10]:

```
sales_data.pivot(index="item name", columns="unique sale number").stack(0)
```

Out[10]:

*stack* calls with the *Series* we obtained above.

In [11]:

```
sales_data.pivot(index="item name", columns="unique sale number").stack(0).stack()
```

Out[11]:

*pivot_table* section below that we can simultaneously form a pivot table and stack it in one method call. But before we do that, we'll motivate *pivot_table* with an example of how *pivot* can fail.

Let's add one more entry to our sales records to demonstrate a problem. We will append another list to *sales* and rebuild our DataFrame.

In [12]:

```
sales.append([5,'hat','green',14.99])
sales_data = pd.DataFrame(sales, columns=columns)
sales_data
```

Out[12]:

We've added another green hat sale, but at a different cost.

This seems innocent enough, but now there's a problem if we try to use *pivot* in certain ways.

Consider what happens if we try to organize the data by color and item name. When we get to the green hats, the table would want a unique sale value and a total $ cost value. But with our data, there are two options, the green hat with sale number 3 and cost 11.27, and the one with sale number 5 and cost 14.99. The table doesn't have a way to decide between these, so the *pivot* method just throws an error.

In [ ]:

```
sales_data.pivot(index="item name", columns="color")
# ValueError: Index contains duplicate entries, cannot reshape
```

*pivot_table* method will help us deal with this problem.

The *pivot_table* method lets us specify more parameters and get a bit more control over the layout of the resulting table. We can use it to simultaneously pivot and stack, or to provide default values when entries are missing, or to aggregate values when there are multiple competing values.

By providing a *fill_value* parameter, we can set the default when values are missing. For example, if we just wanted to clean up our earlier example by replacing all *NaN*s with a blank space, we could do the following. More generally, we could use this to replace missing values with meaningful entries (for example, 0 if we wanted to calculate total sales sums).

We can also provide an *aggfunc* parameter, which is an aggregation function to apply to the multiple values which might appear for a position in the table. This lets us avoid the problem mentioned in the previous section. There are two things to note:

- The
*aggfunc*will apply to values even if there is only one for that position. - It defaults to the numpy
*mean*function, which expects numerical input.

Together, these things mean that by default *pivot_table* will end up disregarding non-numeric input. But, as we're about to demonstrate, this can be avoided by providing your own *aggfunc*.

In the next example, we just reproduce our first *pivot* example using the new method and some extra parameters. We will set *fill_value* to be the empty string, so that we don't see any *NaN*s, and we will pass the identity function as our *aggfunc* parameter, so that we do not process the data.

In [14]:

```
sales_data.pivot_table(index="item name", columns="unique sale number", fill_value="", aggfunc=lambda x: x)
```

Out[14]:

Next we'll see how *pivot_table* can elegantly perform stacking.

The *index* and *columns* parameters for *pivot_table* can take lists, not just single column labels. So, if we would like the rows to be broken down by item name, and then by unique sale number, we can just list these in that order as our *index* parameter.

In [15]:

```
sales_data.pivot_table(index=["item name", "unique sale number"], aggfunc=lambda x:x)
```

Out[15]:

We've reproduced the earlier layout that required both *pivot* and *stack*.

We can also continue to provide a *columns* parameter, which can either be one of the column labels, like before, or can also be a list if we want to form a hierarchy. This lets us quickly structure our DataFrame like below.

In [16]:

```
sales_data.pivot_table(index=["item name", "unique sale number"], columns="color", fill_value="")
```

Out[16]:

*aggfunc* parameter better, we'll now form the pivot table that *pivot* couldn't in the previous section. For this example, we will use Python's *sum* as our aggregation function.

In [17]:

```
sales_data.pivot_table(index="item name", columns="color", fill_value="", aggfunc=sum)
```

Out[17]:

Notice that the green hat entry contains the sum of the costs, like we wanted.

However, we obtained an 8 for the sale number in the green hat position. This was computed as the sum of the sales numbers, which is fairly meaningless. This shows that care should be taken to remember which function was used for aggregation, since it will be used for all values.

Multiple functions can be passed to *aggfunc* in a list, but this will form another level in the column hierarchy, broken down by the name of the function. Below, we demonstrate this with both the *sum* and *min* functions.

In [18]:

```
sales_data.pivot_table(index="item name", columns="color", fill_value="", aggfunc=[sum, min])
```

Out[18]:

There doesn't seem to be a convenient way to use different aggregation functions depending on the column labels.

One final parameter we will mention for the *pivot_table* method is the *margins* parameter. This is a boolean that defaults to *False*, but if we set it to *True*, the resulting DataFrame will also include total sums along the rows and columns. The totals appear in entries whose column or row labels are "All".

In [19]:

```
sales_data.pivot_table(index=["item name","unique sale number"], columns="color", fill_value="", aggfunc=sum, margins=True)
```

Out[19]:

The last table we've made seems nice. We'll finish by removing the sales number data and specifying that we only want the total cost values.

To do this, we remove the *"unique sale number"* argument from *index*, and pass the *"total $ cost"* label in the *values* parameter.

In [20]:

```
sales_data.pivot_table(index="item name", columns="color", values="total $ cost", fill_value="", aggfunc=sum, margins=True)
```

Out[20]: