menu

Brief introduction to pandas

Despite being chubby little cute animals, pandas is also a popular, powerful and easy to use open source library for data analysis/manipulation, built on top of the Python.

In this article I will present and discuss basic features and functionality for this library using the dataset of players statistics on the series of the NBA finals 2021 right after the Bucks defeated the Suns in game 5 on the road, and marked their third triumph against 2 from the Suns.

Installation

First of all, let's install pandas, which can be done in most systems via pip:

pip install pandas

or conda:

conda install pandas

After installing it, the first step is to load the library using the regular python import statement:

import pandas as pd

I believe that for the sake of practicality and to avoid overlapping of built-in methods, like map(), all(), any(), filter(), max() and min(), the library is usually imported under the pd alias.

Dataframes and Series

Pandas stores data in the form of two main structures: Dataframes and Series. Series is a one-dimensional labelled array capable of holding data of any type. Dataframes on the other hand are two-dimensional structures, size-mutable, potentially used to store tabular data (a collection of series).

Using the NBA finals 2021 dataset as example, each "column" in the dataset, separaretly, is structured as a Series and the collection os these "columns" are treated as a Dataframe.

Series and Dataframes in Pandas

Load data

Pandas provides parsers for many different types of input files (.csv, .xls, .sql, .stata, .dta and etc), assuming that you and I have very limited amount of time these days, for the scope of this introduction I will focus on the read_csv() method, used for reading Comma Separated Values file (.csv) .

For more details on the many parameters available for this method please visit the official documentation.

df = pd.read_csv("nba_finals_gm5_2021.csv")

In general .csv files use a comma (,) as a column separator, but is common that other programs use a tab (\t), pipe (|), semicolon (;) and other characters. Pandas allows it manually specify the character used to separate columns in your input file via the spec parameter:

Columns names are automatically read from the first line of the input file. For headless input files, set header=None in order to skip this behaviour. Alternativey, for manually setting column names, set header=0 and use the parameter names to set the new column names as an ordered list.

By default pandas will automatically assign data types to the values stored in all columns. The dtype parameter accepts a dictionary and allows one to overwrite this behaviour.

Basic data exploring

After loaded, the number of dimensions (number of rows and columns) as well as the labels assigned to the columns of a dataframe can be accessed via the shape and columns attributes:

df.shape
(25, 23)


df.columns
Index(['PLAYER', 'TEAM', 'GP', 'MIN', 'FGM', 'FGA', 'FG%', '3PM',
       '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'OREB', 'DREB', 'REB',
       'AST', 'TOV', 'STL', 'BLK', 'PF', 'PTS', '+/-'],
      dtype='object')

Descriptive statistics for all columns on the dataframe can be generated using the .describe() method:

df.describe()
              GP        MIN        FGM        FGA         FG%        3PM        3PA  ...        AST       TOV       STL        BLK         PF        PTS      +/-
count  25.000000  25.000000  25.000000  25.000000   25.000000  25.000000  25.000000  ...  25.000000  25.00000  25.00000  25.000000  25.000000  25.000000  25.0000
mean    3.760000  19.780000   3.504000   7.284000   41.508000   1.008000   2.548000  ...   1.804000   0.86400   0.55600   0.280000   1.572000   9.224000   0.0040
std     1.762574  15.547267   3.716235   7.304718   27.301494   1.028721   2.487489  ...   2.662781   0.98737   0.54626   0.432049   1.328696   9.629005   3.6464
min     1.000000   0.300000   0.000000   0.000000    0.000000   0.000000   0.000000  ...   0.000000   0.00000   0.00000   0.000000   0.000000   0.000000  -7.0000
25%     2.000000   4.100000   1.000000   1.700000   30.000000   0.000000   0.000000  ...   0.000000   0.00000   0.00000   0.000000   0.400000   2.000000  -3.3000
50%     5.000000  16.600000   2.200000   6.000000   44.600000   0.800000   2.200000  ...   0.800000   0.60000   0.60000   0.000000   1.200000   6.000000   0.0000
75%     5.000000  36.500000   4.600000   9.600000   54.300000   1.800000   4.000000  ...   2.000000   1.20000   1.00000   0.400000   2.400000  13.000000   3.6000
max     5.000000  42.800000  12.000000  24.200000  100.000000   3.000000   8.200000  ...   9.000000   3.60000   1.80000   1.400000   3.800000  32.200000   5.4000

Values stored on each column can be directly accessed using the columns label as a key to the dataframe, similar to the key/value association of python dictionaries:

df['PLAYER']
0               Devin Booker
1                 Chris Paul
2              Deandre Ayton
3              Mikal Bridges
4                Jae Crowder
5            Cameron Johnson
6              Cameron Payne
7               Torrey Craig
8             Frank Kaminsky
9          Ty-Shon Alexander
10               Abdel Nader
11               Dario Saric
12     Giannis Antetokounmpo
13           Khris Middleton
14              Jrue Holiday
15               Brook Lopez
16           Pat Connaughton
17              Bobby Portis
18               P.J. Tucker
19               Bryn Forbes
20              Jordan Nwora
21               Jeff Teague
22             Elijah Bryant
23               Sam Merrill
24    Thanasis Antetokounmpo
Name: PLAYER, dtype: object

Multiple columns can be retrieve at the same time if a list of values is provided:

df[['PLAYER', 'PTS']]
                    PLAYER   PTS
0             Devin Booker  30.0
1               Chris Paul  21.0
2            Deandre Ayton  15.2
3            Mikal Bridges  13.0
4              Jae Crowder  11.0
5          Cameron Johnson   9.6
6            Cameron Payne   6.8
7             Torrey Craig   3.4
8           Frank Kaminsky   2.0
9        Ty-Shon Alexander   2.0
10             Abdel Nader   0.0
11             Dario Saric   0.0
12   Giannis Antetokounmpo  32.2
13         Khris Middleton  25.4
14            Jrue Holiday  17.6
15             Brook Lopez  11.8
16         Pat Connaughton  11.0
17            Bobby Portis   6.0
18             P.J. Tucker   4.8
19             Bryn Forbes   3.0
20            Jordan Nwora   3.0
21             Jeff Teague   1.8
22           Elijah Bryant   0.0
23             Sam Merrill   0.0
24  Thanasis Antetokounmpo   0.0

TIP: If a single value is used as input, pandas returns a Series object and if a list of columns is provided, the output returned in the form of a Dataframe.

Use python list slice operations to filter based on row index, like selecting only the first dozen rows from our dataframe of NBA data:

df[['PLAYER', 'PTS']][0:12]
               PLAYER   PTS
0        Devin Booker  30.0
1          Chris Paul  21.0
2       Deandre Ayton  15.2
3       Mikal Bridges  13.0
4         Jae Crowder  11.0
5     Cameron Johnson   9.6
6       Cameron Payne   6.8
7        Torrey Craig   3.4
8      Frank Kaminsky   2.0
9   Ty-Shon Alexander   2.0
10        Abdel Nader   0.0
11        Dario Saric   0.0

When dealing with arge number of rows, use .head and .tail() to display top and bottom rows. In the same context, the .sample function can be used to retrieve a random sample of items from an axis of object. The number of items to be "sampled" is specified via the n or frac parameters (a specific number or a a fraction of the data, respectively).

df.head()
          PLAYER  TEAM  GP   MIN   FGM   FGA  ...  TOV  STL  BLK   PF   PTS  +/-
0   Devin Booker  Suns   5  39.0  11.4  24.2  ...  2.6  1.0  0.4  3.8  30.0  5.4
1     Chris Paul  Suns   5  36.9   8.8  16.2  ...  3.6  0.6  0.2  2.8  21.0 -0.6
2  Deandre Ayton  Suns   5  37.8   6.0  10.4  ...  1.2  1.4  1.4  3.6  15.2  3.2
3  Mikal Bridges  Suns   5  30.6   4.6   8.4  ...  1.4  0.8  0.6  1.2  13.0  4.4
4    Jae Crowder  Suns   5  36.5   3.4   8.0  ...  1.0  1.2  1.0  3.4  11.0  1.2


df.tail()
                    PLAYER   TEAM  GP   MIN  FGM  FGA  ...  TOV  STL  BLK   PF  PTS  +/-
20            Jordan Nwora  Bucks   1   1.3  1.0  1.0  ...  0.0  0.0  0.0  0.0  3.0  1.0
21             Jeff Teague  Bucks   5  10.8  0.4  2.0  ...  0.6  0.6  0.0  0.4  1.8 -1.4
22           Elijah Bryant  Bucks   1   0.3  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
23             Sam Merrill  Bucks   1   1.3  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  1.0
24  Thanasis Antetokounmpo  Bucks   1   1.6  0.0  2.0  ...  0.0  0.0  0.0  1.0  0.0  1.0


df.sample(frac=0.10)
             PLAYER   TEAM  GP   MIN   FGM   FGA  ...  TOV  STL  BLK   PF   PTS  +/-
16  Pat Connaughton  Bucks   5  31.4   3.8   7.2  ...  0.6  0.2  0.0  2.4  11.0  4.4
15      Brook Lopez  Bucks   5  24.0   4.6   9.6  ...  1.0  0.6  0.8  2.2  11.8 -3.0
13  Khris Middleton  Bucks   5  42.8  10.0  22.4  ...  2.8  1.0  0.0  2.2  25.4 -0.8

By default, values will be stored in the same order they are read from their input source. pandas allows for controlling the order of entries in a dataframe using the sort_values() method. This function requires a string or a list of columns identifier. The order (ascending or descending) can be controlled via the ascending parameter.

df.sort_values(['PTS','PLAYER'], ascending=[False, True])
                    PLAYER   TEAM  GP   MIN   FGM   FGA    FG%  3PM  3PA    3P%  FTM   FTA    FT%  OREB  DREB   REB  AST  TOV  STL  BLK   PF   PTS  +/-
12   Giannis Antetokounmpo  Bucks   5  39.3  12.0  19.6   61.2  0.4  2.4   16.7  7.8  13.2   59.1   4.0   9.0  13.0  5.6  1.6  1.4  1.2  3.2  32.2  4.0
0             Devin Booker   Suns   5  39.0  11.4  24.2   47.1  2.2  6.8   32.4  5.0   5.8   86.2   1.2   2.4   3.6  3.8  2.6  1.0  0.4  3.8  30.0  5.4
13         Khris Middleton  Bucks   5  42.8  10.0  22.4   44.6  3.0  8.2   36.6  2.4   2.8   85.7   0.4   6.2   6.6  5.4  2.8  1.0  0.0  2.2  25.4 -0.8
1               Chris Paul   Suns   5  36.9   8.8  16.2   54.3  2.2  4.2   52.4  1.2   1.6   75.0   0.6   2.2   2.8  8.8  3.6  0.6  0.2  2.8  21.0 -0.6
14            Jrue Holiday  Bucks   5  40.8   7.0  17.8   39.3  1.8  5.6   32.1  1.8   2.0   90.0   1.2   4.4   5.6  9.0  2.0  1.8  0.8  2.0  17.6  5.0
2            Deandre Ayton   Suns   5  37.8   6.0  10.4   57.7  0.0  0.0    0.0  3.2   3.4   94.1   2.4  10.8  13.2  2.0  1.2  1.4  1.4  3.6  15.2  3.2
3            Mikal Bridges   Suns   5  30.6   4.6   8.4   54.8  1.8  4.0   45.0  2.0   2.0  100.0   0.6   3.2   3.8  1.0  1.4  0.8  0.6  1.2  13.0  4.4
15             Brook Lopez  Bucks   5  24.0   4.6   9.6   47.9  1.0  3.6   27.8  1.6   2.0   80.0   1.8   3.0   4.8  0.2  1.0  0.6  0.8  2.2  11.8 -3.0
4              Jae Crowder   Suns   5  36.5   3.4   8.0   42.5  2.8  6.0   46.7  1.4   1.8   77.8   0.4   7.2   7.6  2.0  1.0  1.2  1.0  3.4  11.0  1.2
16         Pat Connaughton  Bucks   5  31.4   3.8   7.2   52.8  3.0  6.0   50.0  0.4   0.4  100.0   1.6   3.8   5.4  1.2  0.6  0.2  0.0  2.4  11.0  4.4
5          Cameron Johnson   Suns   5  23.9   3.2   6.0   53.3  1.8  3.8   47.4  1.4   1.4  100.0   0.8   2.4   3.2  1.0  0.8  0.4  0.4  2.4   9.6 -7.0
6            Cameron Payne   Suns   5  16.6   3.0   7.0   42.9  0.6  2.4   25.0  0.2   0.4   50.0   0.4   2.2   2.6  2.0  0.8  0.8  0.0  1.2   6.8 -6.0
17            Bobby Portis  Bucks   5  15.4   2.2   6.0   36.7  1.0  2.2   45.5  0.6   0.6  100.0   2.0   2.2   4.2  0.2  0.2  0.8  0.2  0.8   6.0  3.6
18             P.J. Tucker  Bucks   5  30.3   2.0   3.8   52.6  0.8  1.4   57.1  0.0   0.0    0.0   2.0   1.4   3.4  1.2  0.4  1.0  0.0  3.8   4.8 -3.4
7             Torrey Craig   Suns   5  12.6   1.2   3.0   40.0  0.6  2.0   30.0  0.4   0.8   50.0   0.4   1.2   1.6  0.2  0.4  0.0  0.0  1.4   3.4 -4.8
19             Bryn Forbes  Bucks   3   7.3   1.0   3.3   30.0  1.0  3.0   33.3  0.0   0.0    0.0   0.0   0.3   0.3  0.0  0.3  0.3  0.0  0.7   3.0 -3.3
20            Jordan Nwora  Bucks   1   1.3   1.0   1.0  100.0  1.0  1.0  100.0  0.0   0.0    0.0   0.0   1.0   1.0  0.0  0.0  0.0  0.0  0.0   3.0  1.0
8           Frank Kaminsky   Suns   3   6.2   1.0   1.7   60.0  0.0  0.0    0.0  0.0   0.0    0.0   0.3   1.0   1.3  0.7  0.3  0.0  0.0  0.3   2.0 -3.3
9        Ty-Shon Alexander   Suns   1   1.3   1.0   1.0  100.0  0.0  0.0    0.0  0.0   0.0    0.0   0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0   2.0 -1.0
21             Jeff Teague  Bucks   5  10.8   0.4   2.0   20.0  0.2  0.6   33.3  0.8   1.2   66.7   0.4   0.6   1.0  0.8  0.6  0.6  0.0  0.4   1.8 -1.4
10             Abdel Nader   Suns   2   4.1   0.0   0.5    0.0  0.0  0.5    0.0  0.0   0.0    0.0   0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.5   0.0 -3.5
11             Dario Saric   Suns   1   2.4   0.0   1.0    0.0  0.0  0.0    0.0  0.0   0.0    0.0   1.0   0.0   1.0  0.0  0.0  0.0  0.0  0.0   0.0  4.0
22           Elijah Bryant  Bucks   1   0.3   0.0   0.0    0.0  0.0  0.0    0.0  0.0   0.0    0.0   0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0   0.0  0.0
23             Sam Merrill  Bucks   1   1.3   0.0   0.0    0.0  0.0  0.0    0.0  0.0   0.0    0.0   0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0   0.0  1.0
24  Thanasis Antetokounmpo  Bucks   1   1.6   0.0   2.0    0.0  0.0  0.0    0.0  0.0   0.0    0.0   1.0   2.0   3.0  0.0  0.0  0.0  0.0  1.0   0.0  1.0

FYI: most of the commands I have shown so far, can also be "chained" into "one-liners" to form more specific subsets of data. For example, if we want to retrieve the top 5 scoring players from our dataset we can do something like the following:

df.sort_values(['PTS','PLAYER'], ascending=[False, True])[0:5][['PLAYER','TEAM','PTS']]
                   PLAYER   TEAM   PTS
12  Giannis Antetokounmpo  Bucks  32.2
0            Devin Booker   Suns  30.0
13        Khris Middleton  Bucks  25.4
1              Chris Paul   Suns  21.0
14           Jrue Holiday  Bucks  17.6

the functions are applied in order from left to right:

  1. sort_values sort the dataframe by PTS in descending order (highest values first and lowest for last)
  2. then we use a slicing operation with [0:5] to select the first 5 entries from the dataset sorted in the previous step
  3. finally we select only PLAYER, TEAM and PTS to be displayed using the indexing operator ([])

For columns storing categorical variabels (string), the value_counts() method is very handy for identify the number of entries per categorical value. From our NBA dataframe, we can identify the number of players in each team by:

df['TEAM'].value_counts()
Bucks    13
Suns     12
Name: TEAM, dtype: int64

Finally, the .drop method allows one to remove entries from a dataframe:

df.shape
(25, 23)

df.drop(0).shape
(24, 23)

df.drop('GP', axis=1).shape
(25, 22)

By default, drop deletes the row for a particular index, and this behaviour can be altered using the axis parameter.

.loc[] and iloc()

Among other methods implemented within pandas, .loc[] and .iloc[] are very useful for making slightly more complex queries with great flexibility.

.loc[]

.loc[] is a label based method for querying data from a dataframe. In this case, one is required to provide the name of the row and/or column to be selected. Using our NBA dataset, one can draw a few examples of operations that we can perform using .loc[]:

  • To check for players on the Milwalkee Bucks which average at least 10 points in the series:
df.loc[(df.TEAM == 'Bucks') & (df.PTS >= 10)]
                   PLAYER   TEAM  GP   MIN   FGM   FGA  ...  TOV  STL  BLK   PF   PTS  +/-
12  Giannis Antetokounmpo  Bucks   5  39.3  12.0  19.6  ...  1.6  1.4  1.2  3.2  32.2  4.0
13        Khris Middleton  Bucks   5  42.8  10.0  22.4  ...  2.8  1.0  0.0  2.2  25.4 -0.8
14           Jrue Holiday  Bucks   5  40.8   7.0  17.8  ...  2.0  1.8  0.8  2.0  17.6  5.0
15            Brook Lopez  Bucks   5  24.0   4.6   9.6  ...  1.0  0.6  0.8  2.2  11.8 -3.0
16        Pat Connaughton  Bucks   5  31.4   3.8   7.2  ...  0.6  0.2  0.0  2.4  11.0  4.4
[5 rows x 23 columns]

TIP: Each condition should be wrapped within parenthesis and operators and (&) and or (|) are valid.

TIP: Use .str.contains() for substring searching. In the example below, we are looking for the greek brothers by their last name:

df.loc[(df['PLAYER'].str.contains("Antetokounmpo"))]
                    PLAYER   TEAM  GP   MIN   FGM  ...  STL  BLK   PF   PTS  +/-
12   Giannis Antetokounmpo  Bucks   5  39.3  12.0  ...  1.4  1.2  3.2  32.2  4.0
24  Thanasis Antetokounmpo  Bucks   1   1.6   0.0  ...  0.0  0.0  1.0   0.0  1.0
[2 rows x 23 columns]
  • To select a range of rows based on their index:
df.loc[0:5]
            PLAYER  TEAM  GP   MIN   FGM   FGA   FG%  3PM  3PA   3P%  FTM  FTA    FT%  OREB  DREB   REB  AST  TOV  STL  BLK   PF   PTS  +/-
0     Devin Booker  Suns   5  39.0  11.4  24.2  47.1  2.2  6.8  32.4  5.0  5.8   86.2   1.2   2.4   3.6  3.8  2.6  1.0  0.4  3.8  30.0  5.4
1       Chris Paul  Suns   5  36.9   8.8  16.2  54.3  2.2  4.2  52.4  1.2  1.6   75.0   0.6   2.2   2.8  8.8  3.6  0.6  0.2  2.8  21.0 -0.6
2    Deandre Ayton  Suns   5  37.8   6.0  10.4  57.7  0.0  0.0   0.0  3.2  3.4   94.1   2.4  10.8  13.2  2.0  1.2  1.4  1.4  3.6  15.2  3.2
3    Mikal Bridges  Suns   5  30.6   4.6   8.4  54.8  1.8  4.0  45.0  2.0  2.0  100.0   0.6   3.2   3.8  1.0  1.4  0.8  0.6  1.2  13.0  4.4
4      Jae Crowder  Suns   5  36.5   3.4   8.0  42.5  2.8  6.0  46.7  1.4  1.8   77.8   0.4   7.2   7.6  2.0  1.0  1.2  1.0  3.4  11.0  1.2
5  Cameron Johnson  Suns   5  23.9   3.2   6.0  53.3  1.8  3.8  47.4  1.4  1.4  100.0   0.8   2.4   3.2  1.0  0.8  0.4  0.4  2.4   9.6 -7.0
  • To update the value of any given columns:
df['STATUS'] = ""
df.loc[(df['PTS'] >= 10) & (df['REB'] >= 10), 'STATUS'] = 'Double-Double'

print(df[['PLAYER', 'PTS', 'REB', 'AST', 'STATUS']])
                    PLAYER   PTS   REB  AST         STATUS
0             Devin Booker  30.0   3.6  3.8
1               Chris Paul  21.0   2.8  8.8
2            Deandre Ayton  15.2  13.2  2.0  Double-Double
3            Mikal Bridges  13.0   3.8  1.0
4              Jae Crowder  11.0   7.6  2.0
5          Cameron Johnson   9.6   3.2  1.0
6            Cameron Payne   6.8   2.6  2.0
7             Torrey Craig   3.4   1.6  0.2
8           Frank Kaminsky   2.0   1.3  0.7
9        Ty-Shon Alexander   2.0   0.0  0.0
10             Abdel Nader   0.0   0.0  0.0
11             Dario Saric   0.0   1.0  0.0
12   Giannis Antetokounmpo  32.2  13.0  5.6  Double-Double
13         Khris Middleton  25.4   6.6  5.4
14            Jrue Holiday  17.6   5.6  9.0
15             Brook Lopez  11.8   4.8  0.2
16         Pat Connaughton  11.0   5.4  1.2
17            Bobby Portis   6.0   4.2  0.2
18             P.J. Tucker   4.8   3.4  1.2
19             Bryn Forbes   3.0   0.3  0.0
20            Jordan Nwora   3.0   1.0  0.0
21             Jeff Teague   1.8   1.0  0.8
22           Elijah Bryant   0.0   0.0  0.0
23             Sam Merrill   0.0   0.0  0.0
24  Thanasis Antetokounmpo   0.0   3.0  0.0

In the above example, we created a new column STATUS with default value as an empty string. Then we define the set of conditions we were looking for as the first parameter of .loc[], the second parameter is the name of the column we want to change and the last component is the new-value.

Since a double-double is a status that occurs when a player reaches two digits in two out of three different categories PTS, REB and AST, we are better off writing the last example as:

df.loc[((df['PTS'] >= 10) & (df['REB'] >= 10)) | \
       ((df['PTS'] >= 10) & (df['AST'] >= 10)) | \
       ((df['AST'] >= 10) & (df['REB'] >= 10)), 'STATUS'] = 'Double-Double'

In fact, this modification did not made a difference to the output, since Deandre Ayton and Giannis Antetokounmpo are the only players averaging a double-double in this series.

.iloc[]

Different from .loc[], .iloc[] is an indexed based selecting method and it uses indexing to select specific rows/columns. Some useful ways of using .iloc[]:

  • to select specific rows using a list of indices:
indices2select = [0,2,3,4,7,6]
df.iloc[indices2select]
          PLAYER  TEAM  GP   MIN   FGM   FGA   FG%  3PM  3PA   3P%  FTM  FTA    FT%  OREB  DREB   REB  AST  TOV  STL  BLK   PF   PTS  +/-
0   Devin Booker  Suns   5  39.0  11.4  24.2  47.1  2.2  6.8  32.4  5.0  5.8   86.2   1.2   2.4   3.6  3.8  2.6  1.0  0.4  3.8  30.0  5.4
2  Deandre Ayton  Suns   5  37.8   6.0  10.4  57.7  0.0  0.0   0.0  3.2  3.4   94.1   2.4  10.8  13.2  2.0  1.2  1.4  1.4  3.6  15.2  3.2
3  Mikal Bridges  Suns   5  30.6   4.6   8.4  54.8  1.8  4.0  45.0  2.0  2.0  100.0   0.6   3.2   3.8  1.0  1.4  0.8  0.6  1.2  13.0  4.4
4    Jae Crowder  Suns   5  36.5   3.4   8.0  42.5  2.8  6.0  46.7  1.4  1.8   77.8   0.4   7.2   7.6  2.0  1.0  1.2  1.0  3.4  11.0  1.2
7   Torrey Craig  Suns   5  12.6   1.2   3.0  40.0  0.6  2.0  30.0  0.4  0.8   50.0   0.4   1.2   1.6  0.2  0.4  0.0  0.0  1.4   3.4 -4.8
6  Cameron Payne  Suns   5  16.6   3.0   7.0  42.9  0.6  2.4  25.0  0.2  0.4   50.0   0.4   2.2   2.6  2.0  0.8  0.8  0.0  1.2   6.8 -6.0
  • to select a range of columns and rows simultaneously:
df.iloc[2:9, 0:5]
            PLAYER  TEAM  GP   MIN  FGM
2    Deandre Ayton  Suns   5  37.8  6.0
3    Mikal Bridges  Suns   5  30.6  4.6
4      Jae Crowder  Suns   5  36.5  3.4
5  Cameron Johnson  Suns   5  23.9  3.2
6    Cameron Payne  Suns   5  16.6  3.0
7     Torrey Craig  Suns   5  12.6  1.2
8   Frank Kaminsky  Suns   3   6.2  1.0

In the example above, the first parameter of .iloc[] defines the set of rows to be selected based on a range of indices (2:9), and the second parameter defines the columns that only the first 5 columns will be retrieved (0:5). In both cases, the input values could have also been a list of indices.

Conclusion

Here I explored some of the basic concepts behind pandas, one of the most popular and powerful Python libraries for data analysis/manipulation. I used a sample of data with players statistics on the NBA finals 2021 up to game 5 to explore the structure of objects used to store data, basic properties and methods for data exploring, and auxiliary methods for more complex queries.

The code samples and data presented in this article are available at github.

Thank you for reading and if you are interested in learning more about pandas, here is a non-exhaustive list of resources with more information and tips: