Getting started¶

Installation¶

pandas-select is a Python-only package hosted on PyPI. It can be installed via pip:

pip install pandas-select

Dataset¶

The introduction to pandas-select is based on the Pokemon with stats. This dataset is a listing of all Pokemon species as of mid-2016, containing data about their types and statistics.

In [1]: import pandas as pd

In [2]: pd.set_option('display.max_rows', 5)

In [3]: df = pd.read_csv(
   ...:     "https://raw.githubusercontent.com/jeffzi/pokemonData/master/Pokemon.csv"
   ...: )
   ...: 

In [4]: df = df.set_index("Name")

In [5]: df
Out[5]: 
             Number     Type1   Type2  ...  Speed  Generation  Legendary
Name                                   ...                              
Bulbasaur         1     Grass  Poison  ...     45           1      False
Ivysaur           2     Grass  Poison  ...     60           1      False
...             ...       ...     ...  ...    ...         ...        ...
Blacephalon     806      Fire   Ghost  ...    107           7      False
Zenaora         807  Electric     NaN  ...    143           7      False

[894 rows x 12 columns]

Basics¶

In [6]: from pandas_select import *

The simplest possible usage is:

pandas-select

In [7]: df[StartsWith("Type")]
Out[7]: 
                Type1   Type2
Name                         
Bulbasaur       Grass  Poison
Ivysaur         Grass  Poison
...               ...     ...
Blacephalon      Fire   Ghost
Zenaora      Electric     NaN

[894 rows x 2 columns]

vanilla

In [8]: df.loc[:, df.columns.str.startswith("Type")]
Out[8]: 
                Type1   Type2
Name                         
Bulbasaur       Grass  Poison
Ivysaur         Grass  Poison
...               ...     ...
Blacephalon      Fire   Ghost
Zenaora      Electric     NaN

[894 rows x 2 columns]

Index can be selected as well. loc() must be used, on top of axis="index", to tell pandas we are selecting on index.

pandas-select

In [9]: df.loc[Contains("chu", axis="index")]
Out[9]: 
          Number     Type1    Type2  ...  Speed  Generation  Legendary
Name                                 ...                              
Pikachu       25  Electric      NaN  ...     90           1      False
Raichu        26  Electric      NaN  ...    110           1      False
Pichu        172  Electric      NaN  ...     60           2      False
Smoochum     238       Ice  Psychic  ...     65           2      False

[4 rows x 12 columns]

vanilla

In [10]: df.loc[df.index.str.contains("chu")]
Out[10]: 
          Number     Type1    Type2  ...  Speed  Generation  Legendary
Name                                 ...                              
Pikachu       25  Electric      NaN  ...     90           1      False
Raichu        26  Electric      NaN  ...    110           1      False
Pichu        172  Electric      NaN  ...     60           2      False
Smoochum     238       Ice  Psychic  ...     65           2      False

[4 rows x 12 columns]

Logical operators¶

&, |, ^, ~ operators are supported on selectors. See API reference.

pandas-select

In [11]: df[~AllNumeric()]  # same as df[HasDtype(exclude="number")]
Out[11]: 
                Type1   Type2  Legendary
Name                                    
Bulbasaur       Grass  Poison      False
Ivysaur         Grass  Poison      False
...               ...     ...        ...
Blacephalon      Fire   Ghost      False
Zenaora      Electric     NaN      False

[894 rows x 3 columns]

In [12]: df[StartsWith("Type") | "Legendary"]
Out[12]: 
                Type1   Type2  Legendary
Name                                    
Bulbasaur       Grass  Poison      False
Ivysaur         Grass  Poison      False
...               ...     ...        ...
Blacephalon      Fire   Ghost      False
Zenaora      Electric     NaN      False

[894 rows x 3 columns]

vanilla

In [13]: cols = df.select_dtypes(exclude="number").columns

In [14]: df[cols]
Out[14]: 
                Type1   Type2  Legendary
Name                                    
Bulbasaur       Grass  Poison      False
Ivysaur         Grass  Poison      False
...               ...     ...        ...
Blacephalon      Fire   Ghost      False
Zenaora      Electric     NaN      False

[894 rows x 3 columns]

In [15]: df.loc[:, df.columns.str.startswith("Type") | (df.columns == "Legendary")]
Out[15]: 
                Type1   Type2  Legendary
Name                                    
Bulbasaur       Grass  Poison      False
Ivysaur         Grass  Poison      False
...               ...     ...        ...
Blacephalon      Fire   Ghost      False
Zenaora      Electric     NaN      False

[894 rows x 3 columns]

Filters¶

pandas-select is also helpful to filter row values.

For example, let’s find out which are the strongest legendary pokemons. We define strong as being Legendary and having at least one stat above 100.

pandas-select

In [16]: stats = AllNumeric() & ~AnyOf("Total")

In [17]: has_strong_stat = Anywhere(lambda stat: stat > 100, columns=stats)

In [18]: df.loc[has_strong_stat & df["Legendary"]]
Out[18]: 
                     Number     Type1   Type2  ...  Speed  Generation  Legendary
Name                                           ...                              
Articuno                144       Ice  Flying  ...     85           1       True
Zapdos                  145  Electric  Flying  ...    100           1       True
...                     ...       ...     ...  ...    ...         ...        ...
Necrozma Dawn Wings     800   Psychic   Ghost  ...     77           7       True
NecrozmaUltra           800   Psychic  Dragon  ...    129           7       True

[79 rows x 12 columns]

vanilla

In [19]: stats = [col for col in df.select_dtypes("number").columns if col != "Total"]

In [20]: df_stats = df[stats]

In [21]: has_strong_stat = df_stats.where(df_stats > 100).notnull().any(axis="columns")

In [22]: df.loc[has_strong_stat & df["Legendary"]]
Out[22]: 
                     Number     Type1   Type2  ...  Speed  Generation  Legendary
Name                                           ...                              
Articuno                144       Ice  Flying  ...     85           1       True
Zapdos                  145  Electric  Flying  ...    100           1       True
...                     ...       ...     ...  ...    ...         ...        ...
Necrozma Dawn Wings     800   Psychic   Ghost  ...     77           7       True
NecrozmaUltra           800   Psychic  Dragon  ...    129           7       True

[79 rows x 12 columns]

Hierarchical indexing¶

In vanilla pandas indexing with hierarchical index is challenging as soon as you step out of slicers or exact selection.

label selectors are compatible with MultiIndex out of the box. They also have a level argument to target a specific level in the hierarchy.

First, we’ll set a MultiIndex on the pokemon dataset:

In [23]: df_mi = df.reset_index().set_index(["Generation", "Number", "Name"])

In [24]: df_mi
Out[24]: 
                                  Type1   Type2  ...  Speed  Legendary
Generation Number Name                           ...                  
1          1      Bulbasaur       Grass  Poison  ...     45      False
           2      Ivysaur         Grass  Poison  ...     60      False
...                                 ...     ...  ...    ...        ...
7          806    Blacephalon      Fire   Ghost  ...    107      False
           807    Zenaora      Electric     NaN  ...    143      False

[894 rows x 10 columns]

Now, we can select pokemon with a name containing “eon” and part of the 1st or 6th: generation.

pandas-select

In [25]: eon_mask = Contains("eon", axis="index", level="Name")

In [26]: df_mi.loc[eon_mask]
Out[26]: 
                              Type1 Type2  Total  ...  SpecialDef  Speed  Legendary
Generation Number Name                            ...                              
1          5      Charmeleon   Fire   NaN    405  ...          65     80      False
           134    Vaporeon    Water   NaN    525  ...          95     65      False
...                             ...   ...    ...  ...         ...    ...        ...
4          471    Glaceon       Ice   NaN    525  ...          95     65      False
6          700    Sylveon     Fairy   NaN    525  ...         130     60      False

[13 rows x 10 columns]

In [27]: gen_mask = AnyOf([1, 6], axis="index", level="Generation")

In [28]: df_mi.loc[eon_mask & gen_mask]
Out[28]: 
                                 Type1 Type2  ...  Speed  Legendary
Generation Number Name                        ...                  
1          5      Charmeleon      Fire   NaN  ...     80      False
           134    Vaporeon       Water   NaN  ...     65      False
           135    Jolteon     Electric   NaN  ...    130      False
           136    Flareon         Fire   NaN  ...     65      False
6          700    Sylveon        Fairy   NaN  ...     60      False

[5 rows x 10 columns]

vanilla

In [29]: eon_mask = df_mi.index.get_level_values("Name").str.contains("eon")

In [30]: df_mi.loc[eon_mask]
Out[30]: 
                              Type1 Type2  Total  ...  SpecialDef  Speed  Legendary
Generation Number Name                            ...                              
1          5      Charmeleon   Fire   NaN    405  ...          65     80      False
           134    Vaporeon    Water   NaN    525  ...          95     65      False
...                             ...   ...    ...  ...         ...    ...        ...
4          471    Glaceon       Ice   NaN    525  ...          95     65      False
6          700    Sylveon     Fairy   NaN    525  ...         130     60      False

[13 rows x 10 columns]

In [31]: df_mi_copy = df_mi.reset_index()

In [32]: gen_mask = df_mi_copy["Generation"].isin([1, 6])

In [33]: df_mi_copy[eon_mask & gen_mask].set_index(["Generation", "Number", "Name"])
Out[33]: 
                                 Type1 Type2  ...  Speed  Legendary
Generation Number Name                        ...                  
1          5      Charmeleon      Fire   NaN  ...     80      False
           134    Vaporeon       Water   NaN  ...     65      False
           135    Jolteon     Electric   NaN  ...    130      False
           136    Flareon         Fire   NaN  ...     65      False
6          700    Sylveon        Fairy   NaN  ...     60      False

[5 rows x 10 columns]

Scikit-learn integration¶

sklearn.compose.ColumnTransformer was added to scikit-learn in version 0.20. It allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.

There is also a helper sklearn.compose.make_column_selector() to map columns based on datatype or the column names with a regex.

We clean up the dataset:

In [34]: from sklearn.compose import make_column_selector, make_column_transformer

In [35]: from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [36]: df = df.reset_index()

In [37]: df.loc[:, StartsWith("Type")] = df.loc[:, StartsWith("Type")].fillna("")

In [38]: df
Out[38]: 
            Name  Number     Type1  ... Speed  Generation  Legendary
0      Bulbasaur       1     Grass  ...    45           1      False
1        Ivysaur       2     Grass  ...    60           1      False
..           ...     ...       ...  ...   ...         ...        ...
892  Blacephalon     806      Fire  ...   107           7      False
893      Zenaora     807  Electric  ...   143           7      False

[894 rows x 13 columns]

Similarly to sklearn.compose.make_column_selector(), pandas-select selectors can be fed to sklearn.compose.ColumnTransformer via the wrapper ColumnSelector.

pandas-select

In [39]: ct = make_column_transformer(
   ....:     (StandardScaler(), ColumnSelector(AllNumeric() & ~AnyOf("Generation"))),
   ....:     (OneHotEncoder(), ColumnSelector(AllNominal() | AllBool() | "Generation")),
   ....: )
   ....: 

In [40]: ct.fit_transform(df).shape
Out[40]: (894, 948)

vanilla

In [41]: to_encode = ["object", "bool"]

In [42]: if pd.__version__ >= "1.0.0":
   ....:     to_encode.append("string")
   ....: 

In [43]: ct = make_column_transformer(
   ....:     (
   ....:         StandardScaler(),
   ....:         make_column_selector(r"^(?!Generation).*$", dtype_include=["number"]),
   ....:     ),
   ....:     (OneHotEncoder(), make_column_selector(dtype_include=to_encode)),
   ....:     (OneHotEncoder(), make_column_selector("Generation")),
   ....: )
   ....: 

In [44]: ct.fit_transform(df).shape
Out[44]: (894, 948)

pandas-select makes the intent clearer and enables more complex selection.

Order¶

Selectors preserve the column order found in the DataFrame, except for the Exact selector.

In [45]: df
Out[45]: 
            Name  Number     Type1  ... Speed  Generation  Legendary
0      Bulbasaur       1     Grass  ...    45           1      False
1        Ivysaur       2     Grass  ...    60           1      False
..           ...     ...       ...  ...   ...         ...        ...
892  Blacephalon     806      Fire  ...   107           7      False
893      Zenaora     807  Electric  ...   143           7      False

[894 rows x 13 columns]

In [46]: df[["Type2", "Type1"]]
Out[46]: 
      Type2     Type1
0    Poison     Grass
1    Poison     Grass
..      ...       ...
892   Ghost      Fire
893          Electric

[894 rows x 2 columns]

In [47]: df[AnyOf(["Type2", "Type1"])]
Out[47]: 
        Type1   Type2
0       Grass  Poison
1       Grass  Poison
..        ...     ...
892      Fire   Ghost
893  Electric        

[894 rows x 2 columns]

Logical operators can be used to force a particular order.

In [48]: df[AllNumeric()]
Out[48]: 
     Number  Total  HP  Attack  ...  SpecialAtk  SpecialDef  Speed  Generation
0         1    318  45      49  ...          65          65     45           1
1         2    405  60      62  ...          80          80     60           1
..      ...    ...  ..     ...  ...         ...         ...    ...         ...
892     806    570  53     127  ...         151          79    107           7
893     807    600  88     112  ...         102          80    143           7

[894 rows x 9 columns]

In [49]: df["Generation" | AllNumeric()]
Out[49]: 
     Generation  Number  Total  HP  ...  Defense  SpecialAtk  SpecialDef  Speed
0             1       1    318  45  ...       49          65          65     45
1             1       2    405  60  ...       63          80          80     60
..          ...     ...    ...  ..  ...      ...         ...         ...    ...
892           7     806    570  53  ...       53         151          79    107
893           7     807    600  88  ...       75         102          80    143

[894 rows x 9 columns]