Brief introduction to pandas
Despite being chubby little cute animals, pandas is also a popular, powerful and easy to use open source library for data analysis/manipulation, built on top of the Python.
In this article I will present and discuss basic features and functionality for this library using the dataset of players statistics on the series of the NBA finals 2021 right after the Bucks defeated the Suns in game 5 on the road, and marked their third triumph against 2 from the Suns.
Installation
First of all, let's install pandas, which can be done in most systems via pip
:
pip install pandas
or conda
:
conda install pandas
After installing it, the first step is to load the library using the regular python import statement:
import pandas as pd
I believe that for the sake of practicality and to avoid overlapping of built-in methods, like map()
, all()
, any()
, filter()
, max()
and min()
, the library is usually imported under the pd
alias.
Dataframes and Series
Pandas stores data in the form of two main structures: Dataframes and Series. Series is a one-dimensional labelled array capable of holding data of any type. Dataframes on the other hand are two-dimensional structures, size-mutable, potentially used to store tabular data (a collection of series).
Using the NBA finals 2021 dataset as example, each "column" in the dataset, separaretly, is structured as a Series and the collection os these "columns" are treated as a Dataframe.
Load data
Pandas provides parsers for many different types of input files (.csv, .xls, .sql, .stata, .dta and etc), assuming that you and I have very limited amount of time these days, for the scope of this introduction I will focus on the read_csv()
method, used for reading Comma Separated Values file (.csv) .
For more details on the many parameters available for this method please visit the official documentation.
df = pd.read_csv("nba_finals_gm5_2021.csv")
In general .csv files use a comma (,) as a column separator, but is common that other programs use a tab (\t
), pipe (|
), semicolon (;
) and other characters. Pandas allows it manually specify the character used to separate columns in your input file via the spec
parameter:
Columns names are automatically read from the first line of the input file. For headless input files, set header=None
in order to skip this behaviour. Alternativey, for manually setting column names, set header=0
and use the parameter names
to set the new column names as an ordered list.
By default pandas will automatically assign data types to the values stored in all columns. The dtype
parameter accepts a dictionary and allows one to overwrite this behaviour.
Basic data exploring
After loaded, the number of dimensions (number of rows and columns) as well as the labels assigned to the columns of a dataframe can be accessed via the shape
and columns
attributes:
df.shape
(25, 23)
df.columns
Index(['PLAYER', 'TEAM', 'GP', 'MIN', 'FGM', 'FGA', 'FG%', '3PM',
'3PA', '3P%', 'FTM', 'FTA', 'FT%', 'OREB', 'DREB', 'REB',
'AST', 'TOV', 'STL', 'BLK', 'PF', 'PTS', '+/-'],
dtype='object')
Descriptive statistics for all columns on the dataframe can be generated using the .describe()
method:
df.describe()
GP MIN FGM FGA FG% 3PM 3PA ... AST TOV STL BLK PF PTS +/-
count 25.000000 25.000000 25.000000 25.000000 25.000000 25.000000 25.000000 ... 25.000000 25.00000 25.00000 25.000000 25.000000 25.000000 25.0000
mean 3.760000 19.780000 3.504000 7.284000 41.508000 1.008000 2.548000 ... 1.804000 0.86400 0.55600 0.280000 1.572000 9.224000 0.0040
std 1.762574 15.547267 3.716235 7.304718 27.301494 1.028721 2.487489 ... 2.662781 0.98737 0.54626 0.432049 1.328696 9.629005 3.6464
min 1.000000 0.300000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.00000 0.00000 0.000000 0.000000 0.000000 -7.0000
25% 2.000000 4.100000 1.000000 1.700000 30.000000 0.000000 0.000000 ... 0.000000 0.00000 0.00000 0.000000 0.400000 2.000000 -3.3000
50% 5.000000 16.600000 2.200000 6.000000 44.600000 0.800000 2.200000 ... 0.800000 0.60000 0.60000 0.000000 1.200000 6.000000 0.0000
75% 5.000000 36.500000 4.600000 9.600000 54.300000 1.800000 4.000000 ... 2.000000 1.20000 1.00000 0.400000 2.400000 13.000000 3.6000
max 5.000000 42.800000 12.000000 24.200000 100.000000 3.000000 8.200000 ... 9.000000 3.60000 1.80000 1.400000 3.800000 32.200000 5.4000
Values stored on each column can be directly accessed using the columns label as a key to the dataframe, similar to the key/value association of python dictionaries:
df['PLAYER']
0 Devin Booker
1 Chris Paul
2 Deandre Ayton
3 Mikal Bridges
4 Jae Crowder
5 Cameron Johnson
6 Cameron Payne
7 Torrey Craig
8 Frank Kaminsky
9 Ty-Shon Alexander
10 Abdel Nader
11 Dario Saric
12 Giannis Antetokounmpo
13 Khris Middleton
14 Jrue Holiday
15 Brook Lopez
16 Pat Connaughton
17 Bobby Portis
18 P.J. Tucker
19 Bryn Forbes
20 Jordan Nwora
21 Jeff Teague
22 Elijah Bryant
23 Sam Merrill
24 Thanasis Antetokounmpo
Name: PLAYER, dtype: object
Multiple columns can be retrieve at the same time if a list of values is provided:
df[['PLAYER', 'PTS']]
PLAYER PTS
0 Devin Booker 30.0
1 Chris Paul 21.0
2 Deandre Ayton 15.2
3 Mikal Bridges 13.0
4 Jae Crowder 11.0
5 Cameron Johnson 9.6
6 Cameron Payne 6.8
7 Torrey Craig 3.4
8 Frank Kaminsky 2.0
9 Ty-Shon Alexander 2.0
10 Abdel Nader 0.0
11 Dario Saric 0.0
12 Giannis Antetokounmpo 32.2
13 Khris Middleton 25.4
14 Jrue Holiday 17.6
15 Brook Lopez 11.8
16 Pat Connaughton 11.0
17 Bobby Portis 6.0
18 P.J. Tucker 4.8
19 Bryn Forbes 3.0
20 Jordan Nwora 3.0
21 Jeff Teague 1.8
22 Elijah Bryant 0.0
23 Sam Merrill 0.0
24 Thanasis Antetokounmpo 0.0
TIP: If a single value is used as input, pandas returns a Series object and if a list of columns is provided, the output returned in the form of a Dataframe.
Use python list slice operations to filter based on row index, like selecting only the first dozen rows from our dataframe of NBA data:
df[['PLAYER', 'PTS']][0:12]
PLAYER PTS
0 Devin Booker 30.0
1 Chris Paul 21.0
2 Deandre Ayton 15.2
3 Mikal Bridges 13.0
4 Jae Crowder 11.0
5 Cameron Johnson 9.6
6 Cameron Payne 6.8
7 Torrey Craig 3.4
8 Frank Kaminsky 2.0
9 Ty-Shon Alexander 2.0
10 Abdel Nader 0.0
11 Dario Saric 0.0
When dealing with arge number of rows, use .head
and .tail()
to display top and bottom rows. In the same context, the .sample
function can be used to retrieve a random sample of items from an axis of object. The number of items to be "sampled" is specified via the n
or frac
parameters (a specific number or a a fraction of the data, respectively).
df.head()
PLAYER TEAM GP MIN FGM FGA ... TOV STL BLK PF PTS +/-
0 Devin Booker Suns 5 39.0 11.4 24.2 ... 2.6 1.0 0.4 3.8 30.0 5.4
1 Chris Paul Suns 5 36.9 8.8 16.2 ... 3.6 0.6 0.2 2.8 21.0 -0.6
2 Deandre Ayton Suns 5 37.8 6.0 10.4 ... 1.2 1.4 1.4 3.6 15.2 3.2
3 Mikal Bridges Suns 5 30.6 4.6 8.4 ... 1.4 0.8 0.6 1.2 13.0 4.4
4 Jae Crowder Suns 5 36.5 3.4 8.0 ... 1.0 1.2 1.0 3.4 11.0 1.2
df.tail()
PLAYER TEAM GP MIN FGM FGA ... TOV STL BLK PF PTS +/-
20 Jordan Nwora Bucks 1 1.3 1.0 1.0 ... 0.0 0.0 0.0 0.0 3.0 1.0
21 Jeff Teague Bucks 5 10.8 0.4 2.0 ... 0.6 0.6 0.0 0.4 1.8 -1.4
22 Elijah Bryant Bucks 1 0.3 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
23 Sam Merrill Bucks 1 1.3 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0
24 Thanasis Antetokounmpo Bucks 1 1.6 0.0 2.0 ... 0.0 0.0 0.0 1.0 0.0 1.0
df.sample(frac=0.10)
PLAYER TEAM GP MIN FGM FGA ... TOV STL BLK PF PTS +/-
16 Pat Connaughton Bucks 5 31.4 3.8 7.2 ... 0.6 0.2 0.0 2.4 11.0 4.4
15 Brook Lopez Bucks 5 24.0 4.6 9.6 ... 1.0 0.6 0.8 2.2 11.8 -3.0
13 Khris Middleton Bucks 5 42.8 10.0 22.4 ... 2.8 1.0 0.0 2.2 25.4 -0.8
By default, values will be stored in the same order they are read from their input source. pandas allows for controlling the order of entries in a dataframe using the sort_values()
method. This function requires a string or a list of columns identifier. The order (ascending or descending) can be controlled via the ascending
parameter.
df.sort_values(['PTS','PLAYER'], ascending=[False, True])
PLAYER TEAM GP MIN FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% OREB DREB REB AST TOV STL BLK PF PTS +/-
12 Giannis Antetokounmpo Bucks 5 39.3 12.0 19.6 61.2 0.4 2.4 16.7 7.8 13.2 59.1 4.0 9.0 13.0 5.6 1.6 1.4 1.2 3.2 32.2 4.0
0 Devin Booker Suns 5 39.0 11.4 24.2 47.1 2.2 6.8 32.4 5.0 5.8 86.2 1.2 2.4 3.6 3.8 2.6 1.0 0.4 3.8 30.0 5.4
13 Khris Middleton Bucks 5 42.8 10.0 22.4 44.6 3.0 8.2 36.6 2.4 2.8 85.7 0.4 6.2 6.6 5.4 2.8 1.0 0.0 2.2 25.4 -0.8
1 Chris Paul Suns 5 36.9 8.8 16.2 54.3 2.2 4.2 52.4 1.2 1.6 75.0 0.6 2.2 2.8 8.8 3.6 0.6 0.2 2.8 21.0 -0.6
14 Jrue Holiday Bucks 5 40.8 7.0 17.8 39.3 1.8 5.6 32.1 1.8 2.0 90.0 1.2 4.4 5.6 9.0 2.0 1.8 0.8 2.0 17.6 5.0
2 Deandre Ayton Suns 5 37.8 6.0 10.4 57.7 0.0 0.0 0.0 3.2 3.4 94.1 2.4 10.8 13.2 2.0 1.2 1.4 1.4 3.6 15.2 3.2
3 Mikal Bridges Suns 5 30.6 4.6 8.4 54.8 1.8 4.0 45.0 2.0 2.0 100.0 0.6 3.2 3.8 1.0 1.4 0.8 0.6 1.2 13.0 4.4
15 Brook Lopez Bucks 5 24.0 4.6 9.6 47.9 1.0 3.6 27.8 1.6 2.0 80.0 1.8 3.0 4.8 0.2 1.0 0.6 0.8 2.2 11.8 -3.0
4 Jae Crowder Suns 5 36.5 3.4 8.0 42.5 2.8 6.0 46.7 1.4 1.8 77.8 0.4 7.2 7.6 2.0 1.0 1.2 1.0 3.4 11.0 1.2
16 Pat Connaughton Bucks 5 31.4 3.8 7.2 52.8 3.0 6.0 50.0 0.4 0.4 100.0 1.6 3.8 5.4 1.2 0.6 0.2 0.0 2.4 11.0 4.4
5 Cameron Johnson Suns 5 23.9 3.2 6.0 53.3 1.8 3.8 47.4 1.4 1.4 100.0 0.8 2.4 3.2 1.0 0.8 0.4 0.4 2.4 9.6 -7.0
6 Cameron Payne Suns 5 16.6 3.0 7.0 42.9 0.6 2.4 25.0 0.2 0.4 50.0 0.4 2.2 2.6 2.0 0.8 0.8 0.0 1.2 6.8 -6.0
17 Bobby Portis Bucks 5 15.4 2.2 6.0 36.7 1.0 2.2 45.5 0.6 0.6 100.0 2.0 2.2 4.2 0.2 0.2 0.8 0.2 0.8 6.0 3.6
18 P.J. Tucker Bucks 5 30.3 2.0 3.8 52.6 0.8 1.4 57.1 0.0 0.0 0.0 2.0 1.4 3.4 1.2 0.4 1.0 0.0 3.8 4.8 -3.4
7 Torrey Craig Suns 5 12.6 1.2 3.0 40.0 0.6 2.0 30.0 0.4 0.8 50.0 0.4 1.2 1.6 0.2 0.4 0.0 0.0 1.4 3.4 -4.8
19 Bryn Forbes Bucks 3 7.3 1.0 3.3 30.0 1.0 3.0 33.3 0.0 0.0 0.0 0.0 0.3 0.3 0.0 0.3 0.3 0.0 0.7 3.0 -3.3
20 Jordan Nwora Bucks 1 1.3 1.0 1.0 100.0 1.0 1.0 100.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 3.0 1.0
8 Frank Kaminsky Suns 3 6.2 1.0 1.7 60.0 0.0 0.0 0.0 0.0 0.0 0.0 0.3 1.0 1.3 0.7 0.3 0.0 0.0 0.3 2.0 -3.3
9 Ty-Shon Alexander Suns 1 1.3 1.0 1.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 -1.0
21 Jeff Teague Bucks 5 10.8 0.4 2.0 20.0 0.2 0.6 33.3 0.8 1.2 66.7 0.4 0.6 1.0 0.8 0.6 0.6 0.0 0.4 1.8 -1.4
10 Abdel Nader Suns 2 4.1 0.0 0.5 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 -3.5
11 Dario Saric Suns 1 2.4 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0
22 Elijah Bryant Bucks 1 0.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
23 Sam Merrill Bucks 1 1.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
24 Thanasis Antetokounmpo Bucks 1 1.6 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 3.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
FYI: most of the commands I have shown so far, can also be "chained" into "one-liners" to form more specific subsets of data. For example, if we want to retrieve the top 5 scoring players from our dataset we can do something like the following:
df.sort_values(['PTS','PLAYER'], ascending=[False, True])[0:5][['PLAYER','TEAM','PTS']]
PLAYER TEAM PTS
12 Giannis Antetokounmpo Bucks 32.2
0 Devin Booker Suns 30.0
13 Khris Middleton Bucks 25.4
1 Chris Paul Suns 21.0
14 Jrue Holiday Bucks 17.6
the functions are applied in order from left to right:
sort_values
sort the dataframe by PTS in descending order (highest values first and lowest for last)- then we use a slicing operation with
[0:5]
to select the first 5 entries from the dataset sorted in the previous step - finally we select only PLAYER, TEAM and PTS to be displayed using the indexing operator ([])
For columns storing categorical variabels (string), the value_counts()
method is very handy for identify the number of entries per categorical value. From our NBA dataframe, we can identify the number of players in each team by:
df['TEAM'].value_counts()
Bucks 13
Suns 12
Name: TEAM, dtype: int64
Finally, the .drop
method allows one to remove entries from a dataframe:
df.shape
(25, 23)
df.drop(0).shape
(24, 23)
df.drop('GP', axis=1).shape
(25, 22)
By default, drop
deletes the row for a particular index, and this behaviour can be altered using the axis
parameter.
.loc[] and iloc()
Among other methods implemented within pandas, .loc[]
and .iloc[]
are very useful for making slightly more complex queries with great flexibility.
.loc[]
.loc[]
is a label based method for querying data from a dataframe. In this case, one is required to provide the name of the row and/or column to be selected. Using our NBA dataset, one can draw a few examples of operations that we can perform using .loc[]
:
- To check for players on the Milwalkee Bucks which average at least 10 points in the series:
df.loc[(df.TEAM == 'Bucks') & (df.PTS >= 10)]
PLAYER TEAM GP MIN FGM FGA ... TOV STL BLK PF PTS +/-
12 Giannis Antetokounmpo Bucks 5 39.3 12.0 19.6 ... 1.6 1.4 1.2 3.2 32.2 4.0
13 Khris Middleton Bucks 5 42.8 10.0 22.4 ... 2.8 1.0 0.0 2.2 25.4 -0.8
14 Jrue Holiday Bucks 5 40.8 7.0 17.8 ... 2.0 1.8 0.8 2.0 17.6 5.0
15 Brook Lopez Bucks 5 24.0 4.6 9.6 ... 1.0 0.6 0.8 2.2 11.8 -3.0
16 Pat Connaughton Bucks 5 31.4 3.8 7.2 ... 0.6 0.2 0.0 2.4 11.0 4.4
[5 rows x 23 columns]
TIP: Each condition should be wrapped within parenthesis and operators and (&
) and or (|
) are valid.
TIP: Use .str.contains()
for substring searching. In the example below, we are looking for the greek brothers by their last name:
df.loc[(df['PLAYER'].str.contains("Antetokounmpo"))]
PLAYER TEAM GP MIN FGM ... STL BLK PF PTS +/-
12 Giannis Antetokounmpo Bucks 5 39.3 12.0 ... 1.4 1.2 3.2 32.2 4.0
24 Thanasis Antetokounmpo Bucks 1 1.6 0.0 ... 0.0 0.0 1.0 0.0 1.0
[2 rows x 23 columns]
- To select a range of rows based on their index:
df.loc[0:5]
PLAYER TEAM GP MIN FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% OREB DREB REB AST TOV STL BLK PF PTS +/-
0 Devin Booker Suns 5 39.0 11.4 24.2 47.1 2.2 6.8 32.4 5.0 5.8 86.2 1.2 2.4 3.6 3.8 2.6 1.0 0.4 3.8 30.0 5.4
1 Chris Paul Suns 5 36.9 8.8 16.2 54.3 2.2 4.2 52.4 1.2 1.6 75.0 0.6 2.2 2.8 8.8 3.6 0.6 0.2 2.8 21.0 -0.6
2 Deandre Ayton Suns 5 37.8 6.0 10.4 57.7 0.0 0.0 0.0 3.2 3.4 94.1 2.4 10.8 13.2 2.0 1.2 1.4 1.4 3.6 15.2 3.2
3 Mikal Bridges Suns 5 30.6 4.6 8.4 54.8 1.8 4.0 45.0 2.0 2.0 100.0 0.6 3.2 3.8 1.0 1.4 0.8 0.6 1.2 13.0 4.4
4 Jae Crowder Suns 5 36.5 3.4 8.0 42.5 2.8 6.0 46.7 1.4 1.8 77.8 0.4 7.2 7.6 2.0 1.0 1.2 1.0 3.4 11.0 1.2
5 Cameron Johnson Suns 5 23.9 3.2 6.0 53.3 1.8 3.8 47.4 1.4 1.4 100.0 0.8 2.4 3.2 1.0 0.8 0.4 0.4 2.4 9.6 -7.0
- To update the value of any given columns:
df['STATUS'] = ""
df.loc[(df['PTS'] >= 10) & (df['REB'] >= 10), 'STATUS'] = 'Double-Double'
print(df[['PLAYER', 'PTS', 'REB', 'AST', 'STATUS']])
PLAYER PTS REB AST STATUS
0 Devin Booker 30.0 3.6 3.8
1 Chris Paul 21.0 2.8 8.8
2 Deandre Ayton 15.2 13.2 2.0 Double-Double
3 Mikal Bridges 13.0 3.8 1.0
4 Jae Crowder 11.0 7.6 2.0
5 Cameron Johnson 9.6 3.2 1.0
6 Cameron Payne 6.8 2.6 2.0
7 Torrey Craig 3.4 1.6 0.2
8 Frank Kaminsky 2.0 1.3 0.7
9 Ty-Shon Alexander 2.0 0.0 0.0
10 Abdel Nader 0.0 0.0 0.0
11 Dario Saric 0.0 1.0 0.0
12 Giannis Antetokounmpo 32.2 13.0 5.6 Double-Double
13 Khris Middleton 25.4 6.6 5.4
14 Jrue Holiday 17.6 5.6 9.0
15 Brook Lopez 11.8 4.8 0.2
16 Pat Connaughton 11.0 5.4 1.2
17 Bobby Portis 6.0 4.2 0.2
18 P.J. Tucker 4.8 3.4 1.2
19 Bryn Forbes 3.0 0.3 0.0
20 Jordan Nwora 3.0 1.0 0.0
21 Jeff Teague 1.8 1.0 0.8
22 Elijah Bryant 0.0 0.0 0.0
23 Sam Merrill 0.0 0.0 0.0
24 Thanasis Antetokounmpo 0.0 3.0 0.0
In the above example, we created a new column STATUS with default value as an empty string. Then we define the set of conditions we were looking for as the first parameter of .loc[]
, the second parameter is the name of the column we want to change and the last component is the new-value.
Since a double-double is a status that occurs when a player reaches two digits in two out of three different categories PTS, REB and AST, we are better off writing the last example as:
df.loc[((df['PTS'] >= 10) & (df['REB'] >= 10)) | \
((df['PTS'] >= 10) & (df['AST'] >= 10)) | \
((df['AST'] >= 10) & (df['REB'] >= 10)), 'STATUS'] = 'Double-Double'
In fact, this modification did not made a difference to the output, since Deandre Ayton and Giannis Antetokounmpo are the only players averaging a double-double in this series.
.iloc[]
Different from .loc[]
, .iloc[]
is an indexed based selecting method and it uses indexing to select specific rows/columns. Some useful ways of using .iloc[]
:
- to select specific rows using a list of indices:
indices2select = [0,2,3,4,7,6]
df.iloc[indices2select]
PLAYER TEAM GP MIN FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% OREB DREB REB AST TOV STL BLK PF PTS +/-
0 Devin Booker Suns 5 39.0 11.4 24.2 47.1 2.2 6.8 32.4 5.0 5.8 86.2 1.2 2.4 3.6 3.8 2.6 1.0 0.4 3.8 30.0 5.4
2 Deandre Ayton Suns 5 37.8 6.0 10.4 57.7 0.0 0.0 0.0 3.2 3.4 94.1 2.4 10.8 13.2 2.0 1.2 1.4 1.4 3.6 15.2 3.2
3 Mikal Bridges Suns 5 30.6 4.6 8.4 54.8 1.8 4.0 45.0 2.0 2.0 100.0 0.6 3.2 3.8 1.0 1.4 0.8 0.6 1.2 13.0 4.4
4 Jae Crowder Suns 5 36.5 3.4 8.0 42.5 2.8 6.0 46.7 1.4 1.8 77.8 0.4 7.2 7.6 2.0 1.0 1.2 1.0 3.4 11.0 1.2
7 Torrey Craig Suns 5 12.6 1.2 3.0 40.0 0.6 2.0 30.0 0.4 0.8 50.0 0.4 1.2 1.6 0.2 0.4 0.0 0.0 1.4 3.4 -4.8
6 Cameron Payne Suns 5 16.6 3.0 7.0 42.9 0.6 2.4 25.0 0.2 0.4 50.0 0.4 2.2 2.6 2.0 0.8 0.8 0.0 1.2 6.8 -6.0
- to select a range of columns and rows simultaneously:
df.iloc[2:9, 0:5]
PLAYER TEAM GP MIN FGM
2 Deandre Ayton Suns 5 37.8 6.0
3 Mikal Bridges Suns 5 30.6 4.6
4 Jae Crowder Suns 5 36.5 3.4
5 Cameron Johnson Suns 5 23.9 3.2
6 Cameron Payne Suns 5 16.6 3.0
7 Torrey Craig Suns 5 12.6 1.2
8 Frank Kaminsky Suns 3 6.2 1.0
In the example above, the first parameter of .iloc[]
defines the set of rows to be selected based on a range of indices (2:9
), and the second parameter defines the columns that only the first 5 columns will be retrieved (0:5
). In both cases, the input values could have also been a list of indices.
Conclusion
Here I explored some of the basic concepts behind pandas, one of the most popular and powerful Python libraries for data analysis/manipulation. I used a sample of data with players statistics on the NBA finals 2021 up to game 5 to explore the structure of objects used to store data, basic properties and methods for data exploring, and auxiliary methods for more complex queries.
The code samples and data presented in this article are available at github.
Thank you for reading and if you are interested in learning more about pandas, here is a non-exhaustive list of resources with more information and tips: