2.2. DataFrame#

A DataFrame is a data structure used to represent two-dimensional data. Common examples of data organized in two dimensions are tables and two-dimensional arrays.

Usually, DataFrames are built from Series. However, the constructor accepts other types as well.
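For instance, a dictionary of lists also works, with each key becoming a column label. A minimal sketch (the values below are illustrative and not part of this section's dataset):

import pandas as pd

# Dict of column name -> list of values; the keys become the column labels
df_from_dict = pd.DataFrame({'name': ['Quartz', 'Talc'],
                             'hardness': [7.0, 1.0]})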

First, we define a few Series to represent individual rows of mineral data.

import numpy as np
import pandas as pd

row_1 = pd.Series(['Garnet',   7.0, 'Fracture', 3.9])
row_2 = pd.Series(['Graphite', 1.5, 'One',      2.3])
row_3 = pd.Series(['Kyanite',  6,   'One',      4.01])
row_4 = pd.Series(['test', 'bad', '@#$%^', False, "asdf"])

By default, the DataFrame constructor turns each Series into a row.

Create a DataFrame using multiple Series as rows. The name df is commonly used for DataFrame objects.

df2 = pd.DataFrame(data=[row_1, row_2, row_3, row_4]) 
df2
          0    1         2      3     4
0    Garnet  7.0  Fracture    3.9   NaN
1  Graphite  1.5       One    2.3   NaN
2   Kyanite    6       One   4.01   NaN
3      test  bad     @#$%^  False  asdf
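Because row_4 has five elements while the other rows have four, pandas pads the shorter rows with NaN. A quick sanity check of the resulting shape (a sketch):

print(df2.shape)   # (4, 5): four rows, five columns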

We can also create a DataFrame using Series as columns:

Define Series representing columns and convert each Series to a DataFrame.

name = pd.Series(['Amphibole', 'Biotite', 'Calcite', 'Dolomite', 'Feldspars'])
hardness = pd.Series([5.5, 2.75, 3, 3, 6])
specific_gravity = pd.Series([2.8, 3.0, 2.72, 2.85, 2.645])
test = pd.Series(['A', 12j, True, 9, 11.00000])

col_1 = name.to_frame(name='name') 
col_2 = hardness.to_frame(name='hardness')
col_3 = specific_gravity.to_frame(name='sp. gr.')
col_4 = test.to_frame(name='test')

After creating each column, we can concatenate them together into one DataFrame using the pd.concat() function. There are two axes in a DataFrame: 0 (or 'index') for the rows, and 1 (or 'columns') for the columns. We want to concatenate the Series along their columns, so we use axis=1 (or axis='columns').

Concatenate the series into one dataframe:

df1 = pd.concat([col_1, col_2, col_3, col_4], axis=1)
df1
        name  hardness  sp. gr.  test
0  Amphibole      5.50    2.800     A
1    Biotite      2.75    3.000   12j
2    Calcite      3.00    2.720  True
3   Dolomite      3.00    2.850     9
4  Feldspars      6.00    2.645  11.0
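For contrast, concatenating two of these single-column frames along axis=0 stacks them vertically instead, and the mismatched columns are padded with NaN (a sketch):

# Vertical stacking: 5 + 5 rows, union of the 'name' and 'hardness' columns
stacked = pd.concat([col_1, col_2], axis=0)
print(stacked.shape)   # (10, 2)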

Alternatively, you could use np.transpose() or the .T attribute: build the DataFrame with the Series as rows, transpose it so they become columns, and then label the columns.

df_T = pd.DataFrame(data=[name,hardness,specific_gravity,test]).T
df_T.columns=['name', 'hardness', 'sp. gr.', 'test']

df_T
        name hardness sp. gr.  test
0  Amphibole      5.5     2.8     A
1    Biotite     2.75     3.0   12j
2    Calcite      3.0    2.72  True
3   Dolomite      3.0    2.85     9
4  Feldspars      6.0   2.645  11.0

To create a new column, simply do:

df1['new_column'] = np.nan
df1
        name  hardness  sp. gr.  test  new_column
0  Amphibole      5.50    2.800     A         NaN
1    Biotite      2.75    3.000   12j         NaN
2    Calcite      3.00    2.720  True         NaN
3   Dolomite      3.00    2.850     9         NaN
4  Feldspars      6.00    2.645  11.0         NaN

2.2.1. Labeling DataFrames#

You can also rename your column and row index labels by changing your DataFrame's columns and index attributes, respectively. Recall our df2:

df2
          0    1         2      3     4
0    Garnet  7.0  Fracture    3.9   NaN
1  Graphite  1.5       One    2.3   NaN
2   Kyanite    6       One   4.01   NaN
3      test  bad     @#$%^  False  asdf

Now let’s rename its column and row index labels.

df2.columns = ['name', 'hardness',  'cleavage', 'sp. gr.','test']
df2.index = ['row0','row1','row2','row3']

df2
          name hardness  cleavage sp. gr.  test
row0    Garnet      7.0  Fracture     3.9   NaN
row1  Graphite      1.5       One     2.3   NaN
row2   Kyanite        6       One    4.01   NaN
row3      test      bad     @#$%^   False  asdf
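If you only want to rename some labels rather than replace the whole list, the .rename() method takes mappings and returns a copy by default (a sketch; the new labels are illustrative):

# Rename selected column and row labels; df2 itself is left unchanged
df2_renamed = df2.rename(columns={'sp. gr.': 'specific gravity'},
                         index={'row0': 'r0'})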

2.2.2. Concatenating DataFrames#

We can also concatenate DataFrames to each other, even if they are of different sizes.

df3 = pd.concat([df1, df2])
print(df3)
           name hardness sp. gr.  test  new_column  cleavage
0     Amphibole      5.5     2.8     A         NaN       NaN
1       Biotite     2.75     3.0   12j         NaN       NaN
2       Calcite      3.0    2.72  True         NaN       NaN
3      Dolomite      3.0    2.85     9         NaN       NaN
4     Feldspars      6.0   2.645  11.0         NaN       NaN
row0     Garnet      7.0     3.9   NaN         NaN  Fracture
row1   Graphite      1.5     2.3   NaN         NaN       One
row2    Kyanite        6    4.01   NaN         NaN       One
row3       test      bad   False  asdf         NaN     @#$%^

In the resulting table, we notice that pandas automatically added NaN (“Not a Number”) values to the missing entries of either table. Very convenient!
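As a quick check, you can count how many NaN entries the concatenation introduced in each column (a sketch using .isna()):

# Number of missing values per column of df3
print(df3.isna().sum())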

2.2.3. Removing rows and columns#

Our table looks a bit messy though… let’s tidy it up by removing the 'test' row and column. We can drop them using the .drop() method. By default, .drop() returns a copy of the DataFrame without the dropped items, so the original data is preserved if you assign the result to a new name.

Here is how you can use .drop() to remove a row or a column. Note that we use the row’s index label and the column’s name.

new_df3 = df3.drop('row3', axis=0)       # drop a row by its index label
new_df3 = new_df3.drop('test', axis=1)   # drop a column by its name
print(new_df3)
           name hardness sp. gr.  new_column  cleavage
0     Amphibole      5.5     2.8         NaN       NaN
1       Biotite     2.75     3.0         NaN       NaN
2       Calcite      3.0    2.72         NaN       NaN
3      Dolomite      3.0    2.85         NaN       NaN
4     Feldspars      6.0   2.645         NaN       NaN
row0     Garnet      7.0     3.9         NaN  Fracture
row1   Graphite      1.5     2.3         NaN       One
row2    Kyanite        6    4.01         NaN       One

Recall that df3 is unchanged:

df3
           name hardness sp. gr.  test  new_column  cleavage
0     Amphibole      5.5     2.8     A         NaN       NaN
1       Biotite     2.75     3.0   12j         NaN       NaN
2       Calcite      3.0    2.72  True         NaN       NaN
3      Dolomite      3.0    2.85     9         NaN       NaN
4     Feldspars      6.0   2.645  11.0         NaN       NaN
row0     Garnet      7.0     3.9   NaN         NaN  Fracture
row1   Graphite      1.5     2.3   NaN         NaN       One
row2    Kyanite        6    4.01   NaN         NaN       One
row3       test      bad   False  asdf         NaN     @#$%^

If you would like to edit the original df3, apply the drop operation in place with the inplace=True option. To avoid accidentally dropping multiple rows or columns that share the same index label, you can first use the .reset_index() method. With drop=True, the new DataFrame discards the old index and keeps the new, reset one.

Reset the index to the default integer index and drop the old indexes.

df3_cleaned = df3.reset_index(drop=True) 
df3_cleaned
        name hardness sp. gr.  test  new_column  cleavage
0  Amphibole      5.5     2.8     A         NaN       NaN
1    Biotite     2.75     3.0   12j         NaN       NaN
2    Calcite      3.0    2.72  True         NaN       NaN
3   Dolomite      3.0    2.85     9         NaN       NaN
4  Feldspars      6.0   2.645  11.0         NaN       NaN
5     Garnet      7.0     3.9   NaN         NaN  Fracture
6   Graphite      1.5     2.3   NaN         NaN       One
7    Kyanite        6    4.01   NaN         NaN       One
8       test      bad   False  asdf         NaN     @#$%^

Now let’s drop row 8 and column ‘test’ (remember to add axis=1 when dropping a column).

df3_cleaned.drop(8,inplace=True)
df3_cleaned.drop('test',inplace=True, axis=1)
print(df3_cleaned)
        name hardness sp. gr.  new_column  cleavage
0  Amphibole      5.5     2.8         NaN       NaN
1    Biotite     2.75     3.0         NaN       NaN
2    Calcite      3.0    2.72         NaN       NaN
3   Dolomite      3.0    2.85         NaN       NaN
4  Feldspars      6.0   2.645         NaN       NaN
5     Garnet      7.0     3.9         NaN  Fracture
6   Graphite      1.5     2.3         NaN       One
7    Kyanite        6    4.01         NaN       One

Note that if you try to re-run the above cell without re-running the previous one, you will get an error: the first run already dropped row 8 and the ‘test’ column from df3_cleaned, so a second run tries to drop labels that no longer exist. Re-running the previous cell ‘fixes’ this because it re-assigns df3_cleaned from df3.reset_index(drop=True).
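If you want a drop that is safe to re-run, .drop() also accepts errors='ignore', which silently skips labels that are already gone (a sketch):

# Missing labels are skipped instead of raising a KeyError
df3_cleaned = df3_cleaned.drop(8, axis=0, errors='ignore')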

Warning

If one or more rows (or columns) share the same index label, .drop() will drop all of them.
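A small sketch of that behaviour with a deliberately duplicated row label:

# A single .drop('x') removes both rows labelled 'x'
dup = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'x', 'y'])
print(dup.drop('x'))   # only the 'y' row remains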

2.2.4. Accessing and modifying DataFrame values#

Now that we have created our table, we want to be able to access and modify the individual values within it; this way we can add the missing values to the ‘cleavage’ column. There are many ways to do this, because numpy and standard Python operations often still work on pandas data structures, but processing speed differs. When accessing or modifying data, prefer the pandas accessors .iat[], .at[], .iloc[], and .loc[] over plain [] square-bracket (chained) indexing, which may raise warnings:

Note that the value is still changed, but avoid modifying data this way.

df3_cleaned.cleavage[1]='One' 
df3_cleaned 
C:\Users\gui-win10\AppData\Local\Temp\ipykernel_24848\54782356.py:1: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df3_cleaned.cleavage[1]='One'
C:\Users\gui-win10\AppData\Local\Temp\ipykernel_24848\54782356.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3_cleaned.cleavage[1]='One'
        name hardness sp. gr.  new_column  cleavage
0  Amphibole      5.5     2.8         NaN       NaN
1    Biotite     2.75     3.0         NaN       One
2    Calcite      3.0    2.72         NaN       NaN
3   Dolomite      3.0    2.85         NaN       NaN
4  Feldspars      6.0   2.645         NaN       NaN
5     Garnet      7.0     3.9         NaN  Fracture
6   Graphite      1.5     2.3         NaN       One
7    Kyanite        6    4.01         NaN       One
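The warning-free way to make the same assignment is a single .loc[] step, as the warning message itself suggests (a sketch):

# Label-based assignment in one step; no chained indexing
df3_cleaned.loc[1, 'cleavage'] = 'One'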

2.2.5. .iat[] vs .at[] vs .iloc[] vs .loc[]#

In this discussion on Stack Overflow, some differences between these accessors are pointed out. Keep in mind that .iat[] and .at[] are the fastest, but they only return scalars (a single element), while .loc[] and .iloc[] can access several elements at once. Lastly, .iat[] and .iloc[] use integer positions, while .at[] and .loc[] use labels. For more information, check the Stack Overflow discussion.
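A small side-by-side sketch of the four accessors on df3_cleaned; all four lines below return the same scalar:

print(df3_cleaned.iloc[0, 1])            # position-based, accepts slices/arrays
print(df3_cleaned.iat[0, 1])             # position-based, scalar only (fastest)
print(df3_cleaned.loc[0, 'hardness'])    # label-based, accepts slices/arrays
print(df3_cleaned.at[0, 'hardness'])     # label-based, scalar only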

Accessing values using arrays as indexers with .iloc[]; this is the only one of the four that allows arrays for both rows and columns.

indx_array_row = np.array([0,1,2])
indx_array_col = np.array([0,1,2,3])

df3_cleaned.iloc[indx_array_row[:2],indx_array_col[:3]]
        name hardness sp. gr.
0  Amphibole      5.5     2.8
1    Biotite     2.75     3.0

Accessing values with .loc[]; here the integer array can only be used for the rows, since the columns are selected by label:

df3_cleaned.loc[indx_array_row[:2],'hardness']
0     5.5
1    2.75
Name: hardness, dtype: object

Or multiple columns with .loc[]:

df3_cleaned.loc[indx_array_row[:2],['hardness', 'cleavage']]
  hardness cleavage
0      5.5      NaN
1     2.75      One

Accessing values with .at[]: no arrays allowed, only labels. Note that using 0 works here because the rows still have the default integer index, so their labels are integers.

df3_cleaned.at[0, 'hardness']
5.5

If we were to change the index of this DataFrame to the mineral names and delete the ‘name’ column…

df3_cleaned.index = df3_cleaned.name
df3_cleaned.drop('name',axis=1,inplace=True)
df3_cleaned
          hardness sp. gr.  new_column  cleavage
name
Amphibole      5.5     2.8         NaN       NaN
Biotite       2.75     3.0         NaN       One
Calcite        3.0    2.72         NaN       NaN
Dolomite       3.0    2.85         NaN       NaN
Feldspars      6.0   2.645         NaN       NaN
Garnet         7.0     3.9         NaN  Fracture
Graphite       1.5     2.3         NaN       One
Kyanite          6    4.01         NaN       One

Now using df3_cleaned.at[0, 'hardness'] doesn’t work, since there is no longer a row with the label 0.

df3_cleaned.at[0, 'hardness']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[58], line 1
----> 1 df3_cleaned.at[0, 'hardness']

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexing.py:2575, in _AtIndexer.__getitem__(self, key)
   2572         raise ValueError("Invalid call for scalar access (getting)!")
   2573     return self.obj.loc[key]
-> 2575 return super().__getitem__(key)

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexing.py:2527, in _ScalarAccessIndexer.__getitem__(self, key)
   2524         raise ValueError("Invalid call for scalar access (getting)!")
   2526 key = self._convert_key(key)
-> 2527 return self.obj._get_value(*key, takeable=self._takeable)

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\frame.py:4209, in DataFrame._get_value(self, index, col, takeable)
   4203 engine = self.index._engine
   4205 if not isinstance(self.index, MultiIndex):
   4206     # CategoricalIndex: Trying to use the engine fastpath may give incorrect
   4207     #  results if our categories are integers that dont match our codes
   4208     # IntervalIndex: IntervalTree has no get_loc
-> 4209     row = self.index.get_loc(index)
   4210     return series._values[row]
   4212 # For MultiIndex going through engine effectively restricts us to
   4213 #  same-length tuples; see test_get_set_value_no_partial_indexing

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 0
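With the new index, the label-based lookup works again if we use the row’s name instead of 0 (a sketch):

# 'Amphibole' is now a row label, so .at[] can find it
print(df3_cleaned.at['Amphibole', 'hardness'])   # 5.5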

Accessing values with .iat[]: no arrays and no labels allowed, only integer positions.

df3_cleaned.iat[1, 2]
nan

In case you want to know the index number of a column or row:

print(df3_cleaned.columns.get_loc('new_column')) 
print(df3_cleaned.index.get_loc('Dolomite')) 
2
3

Since the lines above return the integer positions of a column and a row, you can use them directly with .iat[].

df3_cleaned.iat[df3_cleaned.index.get_loc('Dolomite'), df3_cleaned.columns.get_loc('new_column')]
nan

Selecting multiple columns:

df3_cleaned[['hardness','new_column']]
          hardness  new_column
name
Amphibole      5.5         NaN
Biotite       2.75         NaN
Calcite        3.0         NaN
Dolomite       3.0         NaN
Feldspars      6.0         NaN
Garnet         7.0         NaN
Graphite       1.5         NaN
Kyanite          6         NaN
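The same selection can be written with .loc[] (a sketch): the colon selects all rows and the list picks the columns.

df3_cleaned.loc[:, ['hardness', 'new_column']]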

Finally, we remove a column by re-assigning the variable to a copy without that column (an alternative to inplace=True).

df3_cleaned = df3_cleaned.drop('new_column', axis=1)
df3_cleaned
          hardness sp. gr.  cleavage
name
Amphibole      5.5     2.8       NaN
Biotite       2.75     3.0       One
Calcite        3.0    2.72       NaN
Dolomite       3.0    2.85       NaN
Feldspars      6.0   2.645       NaN
Garnet         7.0     3.9  Fracture
Graphite       1.5     2.3       One
Kyanite          6    4.01       One