Getting started¶
Installation¶
pandas-select
is a Python-only package hosted on PyPI.
It can be installed via pip:
pip install pandas-select
Dataset¶
The introduction to pandas-select is based on the Pokemon with stats. This dataset is a listing of all Pokemon species as of mid-2016, containing data about their types and statistics.
In [1]: import pandas as pd
In [2]: pd.set_option('display.max_rows', 5)
In [3]: df = pd.read_csv(
...: "https://raw.githubusercontent.com/jeffzi/pokemonData/master/Pokemon.csv"
...: )
...:
In [4]: df = df.set_index("Name")
In [5]: df
Out[5]:
Number Type1 Type2 ... Speed Generation Legendary
Name ...
Bulbasaur 1 Grass Poison ... 45 1 False
Ivysaur 2 Grass Poison ... 60 1 False
... ... ... ... ... ... ... ...
Blacephalon 806 Fire Ghost ... 107 7 False
Zenaora 807 Electric NaN ... 143 7 False
[894 rows x 12 columns]
Basics¶
In [6]: from pandas_select import *
The simplest possible usage is:
In [7]: df[StartsWith("Type")]
Out[7]:
Type1 Type2
Name
Bulbasaur Grass Poison
Ivysaur Grass Poison
... ... ...
Blacephalon Fire Ghost
Zenaora Electric NaN
[894 rows x 2 columns]
In [8]: df.loc[:, df.columns.str.startswith("Type")]
Out[8]:
Type1 Type2
Name
Bulbasaur Grass Poison
Ivysaur Grass Poison
... ... ...
Blacephalon Fire Ghost
Zenaora Electric NaN
[894 rows x 2 columns]
Index can be selected as well. loc()
must be used, on top of
axis="index"
, to tell pandas we are selecting on index
.
In [9]: df.loc[Contains("chu", axis="index")]
Out[9]:
Number Type1 Type2 ... Speed Generation Legendary
Name ...
Pikachu 25 Electric NaN ... 90 1 False
Raichu 26 Electric NaN ... 110 1 False
Pichu 172 Electric NaN ... 60 2 False
Smoochum 238 Ice Psychic ... 65 2 False
[4 rows x 12 columns]
In [10]: df.loc[df.index.str.contains("chu")]
Out[10]:
Number Type1 Type2 ... Speed Generation Legendary
Name ...
Pikachu 25 Electric NaN ... 90 1 False
Raichu 26 Electric NaN ... 110 1 False
Pichu 172 Electric NaN ... 60 2 False
Smoochum 238 Ice Psychic ... 65 2 False
[4 rows x 12 columns]
Logical operators¶
&
, |
, ^
, ~
operators are supported on selectors. See API reference.
In [11]: df[~AllNumeric()] # same as df[HasDtype(exclude="number")]
Out[11]:
Type1 Type2 Legendary
Name
Bulbasaur Grass Poison False
Ivysaur Grass Poison False
... ... ... ...
Blacephalon Fire Ghost False
Zenaora Electric NaN False
[894 rows x 3 columns]
In [12]: df[StartsWith("Type") | "Legendary"]
Out[12]:
Type1 Type2 Legendary
Name
Bulbasaur Grass Poison False
Ivysaur Grass Poison False
... ... ... ...
Blacephalon Fire Ghost False
Zenaora Electric NaN False
[894 rows x 3 columns]
In [13]: cols = df.select_dtypes(exclude="number").columns
In [14]: df[cols]
Out[14]:
Type1 Type2 Legendary
Name
Bulbasaur Grass Poison False
Ivysaur Grass Poison False
... ... ... ...
Blacephalon Fire Ghost False
Zenaora Electric NaN False
[894 rows x 3 columns]
In [15]: df.loc[:, df.columns.str.startswith("Type") | (df.columns == "Legendary")]
Out[15]:
Type1 Type2 Legendary
Name
Bulbasaur Grass Poison False
Ivysaur Grass Poison False
... ... ... ...
Blacephalon Fire Ghost False
Zenaora Electric NaN False
[894 rows x 3 columns]
Filters¶
pandas-select
is also helpful to filter row values.
For example, let’s find out which are the strongest legendary pokemons. We define strong as being Legendary and having at least one stat above 100.
In [16]: stats = AllNumeric() & ~AnyOf("Total")
In [17]: has_strong_stat = Anywhere(lambda stat: stat > 100, columns=stats)
In [18]: df.loc[has_strong_stat & df["Legendary"]]
Out[18]:
Number Type1 Type2 ... Speed Generation Legendary
Name ...
Articuno 144 Ice Flying ... 85 1 True
Zapdos 145 Electric Flying ... 100 1 True
... ... ... ... ... ... ... ...
Necrozma Dawn Wings 800 Psychic Ghost ... 77 7 True
NecrozmaUltra 800 Psychic Dragon ... 129 7 True
[79 rows x 12 columns]
In [19]: stats = [col for col in df.select_dtypes("number").columns if col != "Total"]
In [20]: df_stats = df[stats]
In [21]: has_strong_stat = df_stats.where(df_stats > 100).notnull().any(axis="columns")
In [22]: df.loc[has_strong_stat & df["Legendary"]]
Out[22]:
Number Type1 Type2 ... Speed Generation Legendary
Name ...
Articuno 144 Ice Flying ... 85 1 True
Zapdos 145 Electric Flying ... 100 1 True
... ... ... ... ... ... ... ...
Necrozma Dawn Wings 800 Psychic Ghost ... 77 7 True
NecrozmaUltra 800 Psychic Dragon ... 129 7 True
[79 rows x 12 columns]
Hierarchical indexing¶
In vanilla pandas indexing with hierarchical index is challenging as soon as you step out of slicers or exact selection.
label selectors are compatible with MultiIndex
out of the box.
They also have a level
argument to target a specific level in the hierarchy.
First, we’ll set a MultiIndex
on the pokemon dataset:
In [23]: df_mi = df.reset_index().set_index(["Generation", "Number", "Name"])
In [24]: df_mi
Out[24]:
Type1 Type2 ... Speed Legendary
Generation Number Name ...
1 1 Bulbasaur Grass Poison ... 45 False
2 Ivysaur Grass Poison ... 60 False
... ... ... ... ... ...
7 806 Blacephalon Fire Ghost ... 107 False
807 Zenaora Electric NaN ... 143 False
[894 rows x 10 columns]
- Now, we can select pokemon with a name containing “eon” and part of the 1st or 6th
generation.
In [25]: eon_mask = Contains("eon", axis="index", level="Name")
In [26]: df_mi.loc[eon_mask]
Out[26]:
Type1 Type2 Total ... SpecialDef Speed Legendary
Generation Number Name ...
1 5 Charmeleon Fire NaN 405 ... 65 80 False
134 Vaporeon Water NaN 525 ... 95 65 False
... ... ... ... ... ... ... ...
4 471 Glaceon Ice NaN 525 ... 95 65 False
6 700 Sylveon Fairy NaN 525 ... 130 60 False
[13 rows x 10 columns]
In [27]: gen_mask = AnyOf([1, 6], axis="index", level="Generation")
In [28]: df_mi.loc[eon_mask & gen_mask]
Out[28]:
Type1 Type2 ... Speed Legendary
Generation Number Name ...
1 5 Charmeleon Fire NaN ... 80 False
134 Vaporeon Water NaN ... 65 False
135 Jolteon Electric NaN ... 130 False
136 Flareon Fire NaN ... 65 False
6 700 Sylveon Fairy NaN ... 60 False
[5 rows x 10 columns]
In [29]: eon_mask = df_mi.index.get_level_values("Name").str.contains("eon")
In [30]: df_mi.loc[eon_mask]
Out[30]:
Type1 Type2 Total ... SpecialDef Speed Legendary
Generation Number Name ...
1 5 Charmeleon Fire NaN 405 ... 65 80 False
134 Vaporeon Water NaN 525 ... 95 65 False
... ... ... ... ... ... ... ...
4 471 Glaceon Ice NaN 525 ... 95 65 False
6 700 Sylveon Fairy NaN 525 ... 130 60 False
[13 rows x 10 columns]
In [31]: df_mi_copy = df_mi.reset_index()
In [32]: gen_mask = df_mi_copy["Generation"].isin([1, 6])
In [33]: df_mi_copy[eon_mask & gen_mask].set_index(["Generation", "Number", "Name"])
Out[33]:
Type1 Type2 ... Speed Legendary
Generation Number Name ...
1 5 Charmeleon Fire NaN ... 80 False
134 Vaporeon Water NaN ... 65 False
135 Jolteon Electric NaN ... 130 False
136 Flareon Fire NaN ... 65 False
6 700 Sylveon Fairy NaN ... 60 False
[5 rows x 10 columns]
Scikit-learn integration¶
sklearn.compose.ColumnTransformer
was added to scikit-learn
in version 0.20. It allows combining the outputs of multiple transformer objects used
on column subsets of the data into a single feature space.
There is also a helper sklearn.compose.make_column_selector()
to map columns
based on datatype or the column names with a regex.
We clean up the dataset:
In [34]: from sklearn.compose import make_column_selector, make_column_transformer In [35]: from sklearn.preprocessing import OneHotEncoder, StandardScaler In [36]: df = df.reset_index() In [37]: df.loc[:, StartsWith("Type")] = df.loc[:, StartsWith("Type")].fillna("") In [38]: df Out[38]: Name Number Type1 ... Speed Generation Legendary 0 Bulbasaur 1 Grass ... 45 1 False 1 Ivysaur 2 Grass ... 60 1 False .. ... ... ... ... ... ... ... 892 Blacephalon 806 Fire ... 107 7 False 893 Zenaora 807 Electric ... 143 7 False [894 rows x 13 columns]
Similarly to sklearn.compose.make_column_selector()
, pandas-select
selectors
can be fed to sklearn.compose.ColumnTransformer
via the wrapper
ColumnSelector
.
In [39]: ct = make_column_transformer(
....: (StandardScaler(), ColumnSelector(AllNumeric() & ~AnyOf("Generation"))),
....: (OneHotEncoder(), ColumnSelector(AllNominal() | AllBool() | "Generation")),
....: )
....:
In [40]: ct.fit_transform(df).shape
Out[40]: (894, 948)
In [41]: to_encode = ["object", "bool"]
In [42]: if pd.__version__ >= "1.0.0":
....: to_encode.append("string")
....:
In [43]: ct = make_column_transformer(
....: (
....: StandardScaler(),
....: make_column_selector(r"^(?!Generation).*$", dtype_include=["number"]),
....: ),
....: (OneHotEncoder(), make_column_selector(dtype_include=to_encode)),
....: (OneHotEncoder(), make_column_selector("Generation")),
....: )
....:
In [44]: ct.fit_transform(df).shape
Out[44]: (894, 948)
pandas-select
makes the intent clearer and enables more complex selection.
Order¶
Selectors preserve the column order found in the DataFrame, except
for the Exact
selector.
In [45]: df
Out[45]:
Name Number Type1 ... Speed Generation Legendary
0 Bulbasaur 1 Grass ... 45 1 False
1 Ivysaur 2 Grass ... 60 1 False
.. ... ... ... ... ... ... ...
892 Blacephalon 806 Fire ... 107 7 False
893 Zenaora 807 Electric ... 143 7 False
[894 rows x 13 columns]
In [46]: df[["Type2", "Type1"]]
Out[46]:
Type2 Type1
0 Poison Grass
1 Poison Grass
.. ... ...
892 Ghost Fire
893 Electric
[894 rows x 2 columns]
In [47]: df[AnyOf(["Type2", "Type1"])]
Out[47]:
Type1 Type2
0 Grass Poison
1 Grass Poison
.. ... ...
892 Fire Ghost
893 Electric
[894 rows x 2 columns]
Logical operators can be used to force a particular order.
In [48]: df[AllNumeric()]
Out[48]:
Number Total HP Attack ... SpecialAtk SpecialDef Speed Generation
0 1 318 45 49 ... 65 65 45 1
1 2 405 60 62 ... 80 80 60 1
.. ... ... .. ... ... ... ... ... ...
892 806 570 53 127 ... 151 79 107 7
893 807 600 88 112 ... 102 80 143 7
[894 rows x 9 columns]
In [49]: df["Generation" | AllNumeric()]
Out[49]:
Generation Number Total HP ... Defense SpecialAtk SpecialDef Speed
0 1 1 318 45 ... 49 65 65 45
1 1 2 405 60 ... 63 80 80 60
.. ... ... ... .. ... ... ... ... ...
892 7 806 570 53 ... 53 151 79 107
893 7 807 600 88 ... 75 102 80 143
[894 rows x 9 columns]