6. Pandas#

6.1 Introduction#

With the standard Python functions and the numpy library, you already have access to powerful tools to process data. However, you’ll find that organizing data using them might still be confusing and messy… so let us introduce you to pandas: a Python library specialized in data organization. Its functions are simple to use, and they achieve a lot. Furthermore, pandas was built on top of the numpy library, using some of their functions and data structures. This makes pandas fast. The pandas library is often used in Data Science and Machine Learning to organize data that are used as input in other functions, of other libraries. For example, you store and organize an Excel file using pandas data structures, apply statistical analysis using SciPy, and then plot the result using matplotlib.

In this section, we’ll introduce you to the basic pandas data structures: the Series and DataFrame objects; and how to store data in them. In pandas, a Series represents a list, and DataFrame represents a table.

As always, let’s import some libraries. We’ll use numpy only for comparison.

import pandas as pd 
import numpy as np

6.2 Series#

We start with pandas Series, since a DataFrame is made out of Series; retrieving a row or a column from a DataFrame results in a Series. A Series object is a numpy ndarray used to hold one-dimensional data, like a list. We create a Series object using its constructor pd.Series(). It can be called by using a list that you want to convert into a pandas series. Unlike numpy arrays, a Series may hold data of different types.

First we create a list containing elements of various types. Then we construct a Series and a numpy.array using our list. Finally we compare the type of each element in my_series and my_nparray.

my_list = ['begin', 2, 3/4, "end"]

my_series = pd.Series(data=my_list)
my_nparray = np.array(my_list)

for i in range(len(my_list)):
    print('----type of each element----')
    print(f'my_series element #{i} => {type(my_series[i])}')
    print(f'my_nparray element #{i} => {type(my_nparray[i])}\n')
----type of each element----
my_series element #0 => <class 'str'>
my_nparray element #0 => <class 'numpy.str_'>

----type of each element----
my_series element #1 => <class 'int'>
my_nparray element #1 => <class 'numpy.str_'>

----type of each element----
my_series element #2 => <class 'float'>
my_nparray element #2 => <class 'numpy.str_'>

----type of each element----
my_series element #3 => <class 'str'>
my_nparray element #3 => <class 'numpy.str_'>

As expected, the numpy array changed all elements to one type; in this case, strings. As mentioned in Section 5.1, in Notebook 5, a numpy array cannot hold data of different types. Note that a pandas series is, by default, printed more elaborately.

print(my_series)
print('-----------------')
print(my_nparray)
0    begin
1        2
2     0.75
3      end
dtype: object
-----------------
['begin' '2' '0.75' 'end']

The values of a series can be accessed and sliced using the iloc() function:

my_series.iloc[1:]
1       2
2    0.75
3     end
dtype: object
my_series.iloc[[2,len(my_series)-1]]
2    0.75
3     end
dtype: object

Labeling Series#

So far we have referred to values within a list or array using indexing, but that might be confusing. With pandas Series, you can refer to your values by labeling their indices. Labels allow you to access the values in a more informative way, similar to dictionaries; depicted in Section 2.3, in Notebook 2.

We create the indices of the same size as the list since we want to construct our Series object with and use the index option in the Series constructor. Note that our entries can be called both ways

my_index_labels = ["My first entry", "1","2","END"]
my_labeled_Series = pd.Series(data=my_list, index=my_index_labels)

print(my_labeled_Series[0] == my_labeled_Series["My first entry"])
True
C:\Users\gui-win10\AppData\Local\Temp\ipykernel_24848\1771852785.py:4: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  print(my_labeled_Series[0] == my_labeled_Series["My first entry"])

pandas can automatically create labels of indices if we construct a Series using a dictionary with labeled entries.

my_dictionary = {"a list": [420, 10],"a float": 380/3, 
                 "a list of strings": ["first word", "Second Word", "3rd w0rd"] }
my_Series = pd.Series(my_dictionary)
print(my_Series)
a list                                         [420, 10]
a float                                       126.666667
a list of strings    [first word, Second Word, 3rd w0rd]
dtype: object

We can access an element within the list labeled "a list of strings" by using its label followed by the desired index

my_Series["a list of strings"][1]
'Second Word'

Warning

When using pandas, it’s a good idea to try and avoid for loops or iterative solutions; pandas usually has a faster solution than iterating through its elements.

6.3 DataFrame#

A DataFrame is a data structure used to represent two-dimensional data. A common example of data organized in two dimensions are tables or two-dimensional arrays.

Usually, DataFrames are made out of Series. However, its constructor accepts other types.

First, we define a few series to represent individual rows with data from here.

row_1 = pd.Series(['Garnet', 7.0,   'Fracture', 3.9  ])
row_2 = pd.Series(['Graphite', 1.5, 'One', 2.3 ])
row_3 = pd.Series(['Kyanite', 6,    'One', 4.01   ])
row_4 = pd.Series(['test', 'bad', '@#$%^', False, "asdf"])

By default, the DataFrame constructor creates rows with the elements of each Series

Create DataFrame using multiple Series as rows. The name “df” is often used as a name for a dataframe object.

df2 = pd.DataFrame(data=[row_1, row_2, row_3, row_4]) 
df2
0 1 2 3 4
0 Garnet 7.0 Fracture 3.9 NaN
1 Graphite 1.5 One 2.3 NaN
2 Kyanite 6 One 4.01 NaN
3 test bad @#$%^ False asdf

We can also create a DataFrame using Series as columns:

Define series representing columns and convert each series to a dataframe.

name = pd.Series(['Amphibole', 'Biotite', 'Calcite', 'Dolomite', 'Feldspars'])
hardness = pd.Series([5.5, 2.75, 3, 3, 6])
specific_gravity = pd.Series([2.8, 3.0, 2.72, 2.85, 2.645])
test = pd.Series(['A', 12j, True, 9, 11.00000])

col_1 = name.to_frame(name='name') 
col_2 = hardness.to_frame(name='hardness')
col_3 = specific_gravity.to_frame(name='sp. gr.')
col_4 = test.to_frame(name='test')

After creating each column, we can concatenate them together into one DataFrame table using the pd.concat() function. There are two axes in a DataFrame: ‘0’ (or ‘index’) for the rows, and ‘1’ (or ‘columns’) for the columns. We want to concatenate the series along their columns, so use axis=1 (or axis=’columns’).

Concatenate the series into one dataframe:

df1 = pd.concat([col_1, col_2, col_3, col_4], axis=1)
df1
name hardness sp. gr. test
0 Amphibole 5.50 2.800 A
1 Biotite 2.75 3.000 12j
2 Calcite 3.00 2.720 True
3 Dolomite 3.00 2.850 9
4 Feldspars 6.00 2.645 11.0

Alternatively, you could use the np.transpose() or .T function to generate the DataFrame using a Series as columns and then label the columns.

df_T = pd.DataFrame(data=[name,hardness,specific_gravity,test]).T
df_T.columns=['name', 'hardness', 'sp. gr.', 'test']

df_T
name hardness sp. gr. test
0 Amphibole 5.5 2.8 A
1 Biotite 2.75 3.0 12j
2 Calcite 3.0 2.72 True
3 Dolomite 3.0 2.85 9
4 Feldspars 6.0 2.645 11.0

To create a new column, simply do:

df1['new_column'] = np.nan
df1
name hardness sp. gr. test new_column
0 Amphibole 5.50 2.800 A NaN
1 Biotite 2.75 3.000 12j NaN
2 Calcite 3.00 2.720 True NaN
3 Dolomite 3.00 2.850 9 NaN
4 Feldspars 6.00 2.645 11.0 NaN

Labeling DataFrames#

You can also rename your columns’ and rows’ index names by changing your DataFrame’s columns and index attributes, respectively. Recall our df2:

df2
0 1 2 3 4
0 Garnet 7.0 Fracture 3.9 NaN
1 Graphite 1.5 One 2.3 NaN
2 Kyanite 6 One 4.01 NaN
3 test bad @#$%^ False asdf

Now let’s rename its columns’ and rows’ index names.

df2.columns = ['name', 'hardness',  'cleavage', 'sp. gr.','test']
df2.index = ['row0','row1','row2','row3']

df2
name hardness cleavage sp. gr. test
row0 Garnet 7.0 Fracture 3.9 NaN
row1 Graphite 1.5 One 2.3 NaN
row2 Kyanite 6 One 4.01 NaN
row3 test bad @#$%^ False asdf

Concatenating DataFrames#

We can also concatenate DataFrames to each other, even if they are of different sizes.

df3 = pd.concat([df1, df2])
print(df3)
           name hardness sp. gr.  test  new_column  cleavage
0     Amphibole      5.5     2.8     A         NaN       NaN
1       Biotite     2.75     3.0   12j         NaN       NaN
2       Calcite      3.0    2.72  True         NaN       NaN
3      Dolomite      3.0    2.85     9         NaN       NaN
4     Feldspars      6.0   2.645  11.0         NaN       NaN
row0     Garnet      7.0     3.9   NaN         NaN  Fracture
row1   Graphite      1.5     2.3   NaN         NaN       One
row2    Kyanite        6    4.01   NaN         NaN       One
row3       test      bad   False  asdf         NaN     @#$%^

In the resulting table, we notice that pandas automatically added NaN (“Not a Number”) values to the missing entries of either table. Very convenient!

Removing rows and columns#

Our table looks a bit messy though… let’s make it better by removing the ’test’ row and column. We can drop them by using the pd.drop() function. By default, the pd.drop() function outputs a copy of the inserted DataFrame without the dropped items. The original data will then be preserved if you give a new name to the dataframe (i.e., the copy) with dropped items.

Example of how you could use pd.drop() to remove a row or column. Note that we used that row’s index name and column’s index name.

new_df3 = df3.drop('row3', axis=0) 
new_df3 = new_df3.drop('test', axis=1) 
print(new_df3)
           name hardness sp. gr.  new_column  cleavage
0     Amphibole      5.5     2.8         NaN       NaN
1       Biotite     2.75     3.0         NaN       NaN
2       Calcite      3.0    2.72         NaN       NaN
3      Dolomite      3.0    2.85         NaN       NaN
4     Feldspars      6.0   2.645         NaN       NaN
row0     Garnet      7.0     3.9         NaN  Fracture
row1   Graphite      1.5     2.3         NaN       One
row2    Kyanite        6    4.01         NaN       One

Recall that df3 is unchanged

df3
name hardness sp. gr. test new_column cleavage
0 Amphibole 5.5 2.8 A NaN NaN
1 Biotite 2.75 3.0 12j NaN NaN
2 Calcite 3.0 2.72 True NaN NaN
3 Dolomite 3.0 2.85 9 NaN NaN
4 Feldspars 6.0 2.645 11.0 NaN NaN
row0 Garnet 7.0 3.9 NaN NaN Fracture
row1 Graphite 1.5 2.3 NaN NaN One
row2 Kyanite 6 4.01 NaN NaN One
row3 test bad False asdf NaN @#$%^

In case you would like to edit the original df3; you would apply the drop-operation ’inplace’. For this, we use the inplace=True option. To avoid dropping rows and columns with the same index name, you could use the pd.reset_index() function. With drop=True, the new DataFrame will have dropped the old indexes; keeping the new reset ones.

Reset the index to the default integer index and drop the old indexes.

df3_cleaned = df3.reset_index(drop=True) 
df3_cleaned
name hardness sp. gr. test new_column cleavage
0 Amphibole 5.5 2.8 A NaN NaN
1 Biotite 2.75 3.0 12j NaN NaN
2 Calcite 3.0 2.72 True NaN NaN
3 Dolomite 3.0 2.85 9 NaN NaN
4 Feldspars 6.0 2.645 11.0 NaN NaN
5 Garnet 7.0 3.9 NaN NaN Fracture
6 Graphite 1.5 2.3 NaN NaN One
7 Kyanite 6 4.01 NaN NaN One
8 test bad False asdf NaN @#$%^

Now let’s drop row 8 and column ‘test’ (remember to add axis=1 when dropping a column).

df3_cleaned.drop(8,inplace=True)
df3_cleaned.drop('test',inplace=True, axis=1)
print(df3_cleaned)
        name hardness sp. gr.  new_column  cleavage
0  Amphibole      5.5     2.8         NaN       NaN
1    Biotite     2.75     3.0         NaN       NaN
2    Calcite      3.0    2.72         NaN       NaN
3   Dolomite      3.0    2.85         NaN       NaN
4  Feldspars      6.0   2.645         NaN       NaN
5     Garnet      7.0     3.9         NaN  Fracture
6   Graphite      1.5     2.3         NaN       One
7    Kyanite        6    4.01         NaN       One

Note that if you try to re-run the above cell without re-running the previous one, you will get an error. This happens because after running the above cell once, you drop row ‘8’ and the ‘test’ column from df3_cleaned, trying to run again will attempt to re-drop something that doesn’t exist. Re-running the previous cell ‘fixes’ this issue since you re-assign df3_cleaned as df3.reset_index(drop=True).

Warning

If one or more rows (or columns) have the same index name, pd.drop() will drop all of them.

Accessing and modifying DataFrame values#

Now that we created our table, we want to be able to access and modify the individual values within it. This way we could add the missing values to the ’cleavage’ column. There are many ways to do this because numpy and standard Python functions are often still applicable on pandas data structures. However, there is a difference in processing speed. Therefore, when accessing or modifying data, try to use pandas functions, such as: .iat(), .at(), .iloc(), and .loc() instead of using the common [] square bracket syntax, as they might raise some warnings:

Note that the value still changed, but avoid doing so.

df3_cleaned.cleavage[1]='One' 
df3_cleaned 
C:\Users\gui-win10\AppData\Local\Temp\ipykernel_24848\54782356.py:1: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df3_cleaned.cleavage[1]='One'
C:\Users\gui-win10\AppData\Local\Temp\ipykernel_24848\54782356.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3_cleaned.cleavage[1]='One'
name hardness sp. gr. new_column cleavage
0 Amphibole 5.5 2.8 NaN NaN
1 Biotite 2.75 3.0 NaN One
2 Calcite 3.0 2.72 NaN NaN
3 Dolomite 3.0 2.85 NaN NaN
4 Feldspars 6.0 2.645 NaN NaN
5 Garnet 7.0 3.9 NaN Fracture
6 Graphite 1.5 2.3 NaN One
7 Kyanite 6 4.01 NaN One

.iat() vs .at() vs .iloc() vs .loc()#

In this discussion in stackoverflow, they point out some differences between these methods. To keep in mind, .iat() and .at() are the fastest ones, but they only return scalars (one element), while .loc() and .iloc() can access several elements at the same time. Lastly, .iat() and .iloc use indexes (numbers), while .at() and .loc() use labels. For more information, check the stackoverflow discussion.

Acessing values using arrays as index with .iloc(), only method that allows arrays in both rows and cols.

indx_array_row = np.array([0,1,2])
indx_array_col = np.array([0,1,2,3])

df3_cleaned.iloc[indx_array_row[:2],indx_array_col[:3]]
name hardness sp. gr.
0 Amphibole 5.5 2.8
1 Biotite 2.75 3.0

Accessing values with .loc(), only rows are allowed to be arrays

df3_cleaned.loc[indx_array_row[:2],'hardness']
0     5.5
1    2.75
Name: hardness, dtype: object

Or multiple columns with .loc():

df3_cleaned.loc[indx_array_row[:2],['hardness', 'cleavage']]
hardness cleavage
0 5.5 NaN
1 2.75 One

Accessing values with .at(), no arrays allowed, only labels. Note that using 0 will work since we do not have labels for the rows so their labels are their index.

df3_cleaned.at[0, 'hardness']
5.5

If we were to change the index of this df to their names and delete column ‘name’…

df3_cleaned.index = df3_cleaned.name
df3_cleaned.drop('name',axis=1,inplace=True)
df3_cleaned
hardness sp. gr. new_column cleavage
name
Amphibole 5.5 2.8 NaN NaN
Biotite 2.75 3.0 NaN One
Calcite 3.0 2.72 NaN NaN
Dolomite 3.0 2.85 NaN NaN
Feldspars 6.0 2.645 NaN NaN
Garnet 7.0 3.9 NaN Fracture
Graphite 1.5 2.3 NaN One
Kyanite 6 4.01 NaN One

Now using df.at[0,'hardness'] doesn’t work since there’s no row with label 0.

df3_cleaned.at[0, 'hardness']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[58], line 1
----> 1 df3_cleaned.at[0, 'hardness']

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexing.py:2575, in _AtIndexer.__getitem__(self, key)
   2572         raise ValueError("Invalid call for scalar access (getting)!")
   2573     return self.obj.loc[key]
-> 2575 return super().__getitem__(key)

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexing.py:2527, in _ScalarAccessIndexer.__getitem__(self, key)
   2524         raise ValueError("Invalid call for scalar access (getting)!")
   2526 key = self._convert_key(key)
-> 2527 return self.obj._get_value(*key, takeable=self._takeable)

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\frame.py:4209, in DataFrame._get_value(self, index, col, takeable)
   4203 engine = self.index._engine
   4205 if not isinstance(self.index, MultiIndex):
   4206     # CategoricalIndex: Trying to use the engine fastpath may give incorrect
   4207     #  results if our categories are integers that dont match our codes
   4208     # IntervalIndex: IntervalTree has no get_loc
-> 4209     row = self.index.get_loc(index)
   4210     return series._values[row]
   4212 # For MultiIndex going through engine effectively restricts us to
   4213 #  same-length tuples; see test_get_set_value_no_partial_indexing

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 0

Accessing values with .iat(), no arrays allowed, no labels allowed only indexes.

df3_cleaned.iat[1, 2]
nan

In case you want to know the index number of a column or row:

print(df3_cleaned.columns.get_loc('new_column')) 
print(df3_cleaned.index.get_loc('Dolomite')) 
2
3

Since the above lines return the index of a column and row you can use them directly with .iat().

df3_cleaned.iat[df3_cleaned.index.get_loc('Dolomite'), df3_cleaned.columns.get_loc('new_column')]
nan

Selecting multiple columns:

df3_cleaned[['hardness','new_column']]
hardness new_column
name
Amphibole 5.5 NaN
Biotite 2.75 NaN
Calcite 3.0 NaN
Dolomite 3.0 NaN
Feldspars 6.0 NaN
Garnet 7.0 NaN
Graphite 1.5 NaN
Kyanite 6 NaN

Finally, removing a column by making a copy and re-assigning the original variable as its copy. (Alternative to inplace=True)

df3_cleaned = df3_cleaned.drop('new_column', axis=1)
df3_cleaned
hardness sp. gr. cleavage
name
Amphibole 5.5 2.8 NaN
Biotite 2.75 3.0 One
Calcite 3.0 2.72 NaN
Dolomite 3.0 2.85 NaN
Feldspars 6.0 2.645 NaN
Garnet 7.0 3.9 Fracture
Graphite 1.5 2.3 One
Kyanite 6 4.01 One

6.4 Importing data into DataFrames and exploring its attributes#

pandas provides many functions to import data into dataframes, such as read_csv() to read delimited text files, or read_excel() for Excel or OpenDocument spreadsheets. read_csv() provides options that allow you to filter the data, such as specifying the separator/delimiter, the lines that form the headers, which rows to skip, etc. Let’s analyze the mineral_properties.txt. Below a screenshot of it:

rectangle

below we import the .txt:

  • we indicate that the separator is the comma "sep=','"

  • we indicate the header (what should be the columns names) is in the second line "header=[1]"

  • we indicate to not skip any rows "skiprows=None"

  • we indicate the first column should be the index of the rows "index_col=0"

file_location = ("mineral_properties.txt")
df4 = pd.read_csv(file_location, sep=',', header=[1],
                  skiprows=None, index_col=0)
df4
hardness sp. gr. cleavage
name
Amphibole 5.50 2.800 Two
Biotite 2.75 3.000 One
Calcite 3.00 2.720 Three
Dolomite 3.00 2.850 Three
Feldspars 6.00 2.645 Two
Garnet 7.00 3.900 Fracture
Graphite 1.50 2.300 One
Kyanite 6.00 4.010 One
Muscovite 2.25 2.930 One
Pyroxene 5.50 3.325 Two
Quartz 7.00 2.650 Fracture
Sillimanite 6.50 3.230 One

Note that if we try to call any of the columns from df4 we will get an error.

df4['hardness']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'hardness'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[67], line 1
----> 1 df4['hardness']

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\frame.py:4090, in DataFrame.__getitem__(self, key)
   4088 if self.columns.nlevels > 1:
   4089     return self._getitem_multilevel(key)
-> 4090 indexer = self.columns.get_loc(key)
   4091 if is_integer(indexer):
   4092     indexer = [indexer]

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'hardness'

Do you know why?

Answer

In case you were not able to answer the above question, let’s look into df4.columns

df4.columns
Index([' hardness', ' sp. gr.', ' cleavage'], dtype='object')

You see there are spaces at the beginning of each column name… this happens because that’s how people usually type, with commas followed by a space. We could use the skipinitialspace = True from the pd.read_csv() function to avoid this. Let’s try it out:

df4 = pd.read_csv(file_location + 'mineral_properties.txt',sep=',',header=[1], 
                  skiprows=None, index_col=0, skipinitialspace=True)
print(df4.columns)
Index(['hardness', 'sp. gr.', 'cleavage'], dtype='object')

Ok, much better!

6.5 Statistics with pandas#

Recall some functions such as np.mean() and np.max(); these functions can be used to calculate a row’s or column’s statistics. Say you want to know what’s the average hardness of the different minerals:

df4['hardness'].mean()
4.666666666666667

Often we don’t know much about the data, and printing all the values is inconvenient. In that case, it’s wise to take a look at some of its attributes first.

See the labels of the columns and rows.

print(df4.columns)
print('----------------------')
print(df4.index)
Index(['hardness', 'sp. gr.', 'cleavage'], dtype='object')
----------------------
Index(['Amphibole', 'Biotite', 'Calcite', 'Dolomite', 'Feldspars', 'Garnet',
       'Graphite', 'Kyanite', 'Muscovite', 'Pyroxene', 'Quartz',
       'Sillimanite'],
      dtype='object', name='name')

df4.info is similar to print(df4.info).

df4.info
<bound method DataFrame.info of              hardness  sp. gr.  cleavage
name                                    
Amphibole        5.50    2.800       Two
Biotite          2.75    3.000       One
Calcite          3.00    2.720     Three
Dolomite         3.00    2.850     Three
Feldspars        6.00    2.645       Two
Garnet           7.00    3.900  Fracture
Graphite         1.50    2.300       One
Kyanite          6.00    4.010       One
Muscovite        2.25    2.930       One
Pyroxene         5.50    3.325       Two
Quartz           7.00    2.650  Fracture
Sillimanite      6.50    3.230       One>

Deep copying a DataFrame#

As you have seen in Notebook 4, shallow copies can be troublesome if you’re not aware of it. In pandas, it’s the same story.

To make a deep copy use the DataFrame.copy(deep=True) function.

df_deep = df4.copy(deep=True)

Now, altering df_deep will not alter df4; and vice-versa.

Additional study material:#

After this Notebook you should be able to:

  • understand Series and DataFrames

  • concatenate DataFrames

  • work with different labels of a DataFrame

  • drop unwanted rows and columns

  • access and modify values within your DataFrame

  • import data into a pandas DataFrame

  • manipulate a DataFrame in several important ways