2.2. DataFrame#
A DataFrame is a data structure used to represent two-dimensional data. Common examples of data organized in two dimensions are tables and two-dimensional arrays. Usually, DataFrames are made out of Series; however, the constructor also accepts other types.
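For instance, a plain dictionary mapping column names to lists of values also works; a minimal sketch with made-up data:
import pandas as pd

# The constructor also accepts a dict of column-name -> values
df_from_dict = pd.DataFrame({'mineral': ['Quartz', 'Talc'], 'hardness': [7.0, 1.0]})
df_from_dict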
First, we define a few series to represent individual rows with data from here.
import pandas as pd
import numpy as np

row_1 = pd.Series(['Garnet', 7.0, 'Fracture', 3.9])
row_2 = pd.Series(['Graphite', 1.5, 'One', 2.3])
row_3 = pd.Series(['Kyanite', 6, 'One', 4.01])
row_4 = pd.Series(['test', 'bad', '@#$%^', False, "asdf"])
By default, the DataFrame constructor creates rows with the elements of each Series.
Create a DataFrame using multiple Series as rows. The name “df” is often used for a DataFrame object.
df2 = pd.DataFrame(data=[row_1, row_2, row_3, row_4])
df2
 | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | Garnet | 7.0 | Fracture | 3.9 | NaN |
1 | Graphite | 1.5 | One | 2.3 | NaN |
2 | Kyanite | 6 | One | 4.01 | NaN |
3 | test | bad | @#$%^ | False | asdf |
We can also create a DataFrame using Series as columns:
Define Series representing columns and convert each Series to a DataFrame.
name = pd.Series(['Amphibole', 'Biotite', 'Calcite', 'Dolomite', 'Feldspars'])
hardness = pd.Series([5.5, 2.75, 3, 3, 6])
specific_gravity = pd.Series([2.8, 3.0, 2.72, 2.85, 2.645])
test = pd.Series(['A', 12j, True, 9, 11.00000])
col_1 = name.to_frame(name='name')
col_2 = hardness.to_frame(name='hardness')
col_3 = specific_gravity.to_frame(name='sp. gr.')
col_4 = test.to_frame(name='test')
After creating each column, we can concatenate them together into one DataFrame using the pd.concat() function. There are two axes in a DataFrame: 0 (or 'index') for the rows, and 1 (or 'columns') for the columns. We want to concatenate the series along their columns, so we use axis=1 (or axis='columns').
Concatenate the series into one dataframe:
df1 = pd.concat([col_1, col_2, col_3, col_4], axis=1)
df1
 | name | hardness | sp. gr. | test |
---|---|---|---|---|
0 | Amphibole | 5.50 | 2.800 | A |
1 | Biotite | 2.75 | 3.000 | 12j |
2 | Calcite | 3.00 | 2.720 | True |
3 | Dolomite | 3.00 | 2.850 | 9 |
4 | Feldspars | 6.00 | 2.645 | 11.0 |
Alternatively, you could use the np.transpose() function or the .T attribute to generate the DataFrame using the Series as columns, and then label the columns.
df_T = pd.DataFrame(data=[name,hardness,specific_gravity,test]).T
df_T.columns=['name', 'hardness', 'sp. gr.', 'test']
df_T
 | name | hardness | sp. gr. | test |
---|---|---|---|---|
0 | Amphibole | 5.5 | 2.8 | A |
1 | Biotite | 2.75 | 3.0 | 12j |
2 | Calcite | 3.0 | 2.72 | True |
3 | Dolomite | 3.0 | 2.85 | 9 |
4 | Feldspars | 6.0 | 2.645 | 11.0 |
To create a new column, simply do:
df1['new_column'] = np.nan
df1
 | name | hardness | sp. gr. | test | new_column |
---|---|---|---|---|---|
0 | Amphibole | 5.50 | 2.800 | A | NaN |
1 | Biotite | 2.75 | 3.000 | 12j | NaN |
2 | Calcite | 3.00 | 2.720 | True | NaN |
3 | Dolomite | 3.00 | 2.850 | 9 | NaN |
4 | Feldspars | 6.00 | 2.645 | 11.0 | NaN |
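A new column can also be derived from existing ones. A minimal sketch on a copy, so df1 stays as shown above (the column name 'rel. gr.' is made up for illustration):
df1_extra = df1.copy()
# Hypothetical derived column: specific gravity relative to the largest value
df1_extra['rel. gr.'] = df1_extra['sp. gr.'] / df1_extra['sp. gr.'].max()
df1_extra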
2.2.1. Labeling DataFrames#
You can also rename your columns’ and rows’ index names by changing your DataFrame’s columns and index attributes, respectively. Recall our df2:
df2
 | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | Garnet | 7.0 | Fracture | 3.9 | NaN |
1 | Graphite | 1.5 | One | 2.3 | NaN |
2 | Kyanite | 6 | One | 4.01 | NaN |
3 | test | bad | @#$%^ | False | asdf |
Now let’s rename its columns’ and rows’ index names.
df2.columns = ['name', 'hardness', 'cleavage', 'sp. gr.','test']
df2.index = ['row0','row1','row2','row3']
df2
 | name | hardness | cleavage | sp. gr. | test |
---|---|---|---|---|---|
row0 | Garnet | 7.0 | Fracture | 3.9 | NaN |
row1 | Graphite | 1.5 | One | 2.3 | NaN |
row2 | Kyanite | 6 | One | 4.01 | NaN |
row3 | test | bad | @#$%^ | False | asdf |
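Alternatively, the .rename() method relabels only the entries you pass in a mapping and returns a copy by default; a minimal sketch with made-up new labels:
# Relabel selected columns/rows via mappings; df2 itself is unchanged
df2_renamed = df2.rename(columns={'sp. gr.': 'specific gravity'},
                         index={'row0': 'garnet'})
df2_renamed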
2.2.2. Concatenating DataFrames#
We can also concatenate DataFrames to each other, even if they are of different sizes.
df3 = pd.concat([df1, df2])
print(df3)
name hardness sp. gr. test new_column cleavage
0 Amphibole 5.5 2.8 A NaN NaN
1 Biotite 2.75 3.0 12j NaN NaN
2 Calcite 3.0 2.72 True NaN NaN
3 Dolomite 3.0 2.85 9 NaN NaN
4 Feldspars 6.0 2.645 11.0 NaN NaN
row0 Garnet 7.0 3.9 NaN NaN Fracture
row1 Graphite 1.5 2.3 NaN NaN One
row2 Kyanite 6 4.01 NaN NaN One
row3 test bad False asdf NaN @#$%^
In the resulting table, we notice that pandas automatically added NaN (“Not a Number”) values to the missing entries of either table. Very convenient!
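If you would rather keep only the columns that both tables share, pd.concat() also accepts a join parameter; a minimal sketch (the name df3_shared is our own):
# join='inner' keeps only the columns present in both df1 and df2
df3_shared = pd.concat([df1, df2], join='inner')
df3_shared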
2.2.3. Removing rows and columns#
Our table looks a bit messy though… let’s make it better by removing the ’test’ row and column. We can drop them using the .drop() method. By default, .drop() returns a copy of the DataFrame without the dropped items, so the original data is preserved if you assign the result (i.e., the copy) to a new name.
Example of how you could use .drop() to remove a row or column. Note that we use the row’s index name and the column’s index name.
new_df3 = df3.drop('row3', axis=0)
new_df3 = new_df3.drop('test', axis=1)
print(new_df3)
name hardness sp. gr. new_column cleavage
0 Amphibole 5.5 2.8 NaN NaN
1 Biotite 2.75 3.0 NaN NaN
2 Calcite 3.0 2.72 NaN NaN
3 Dolomite 3.0 2.85 NaN NaN
4 Feldspars 6.0 2.645 NaN NaN
row0 Garnet 7.0 3.9 NaN Fracture
row1 Graphite 1.5 2.3 NaN One
row2 Kyanite 6 4.01 NaN One
Recall that df3 is unchanged:
df3
 | name | hardness | sp. gr. | test | new_column | cleavage |
---|---|---|---|---|---|---|
0 | Amphibole | 5.5 | 2.8 | A | NaN | NaN |
1 | Biotite | 2.75 | 3.0 | 12j | NaN | NaN |
2 | Calcite | 3.0 | 2.72 | True | NaN | NaN |
3 | Dolomite | 3.0 | 2.85 | 9 | NaN | NaN |
4 | Feldspars | 6.0 | 2.645 | 11.0 | NaN | NaN |
row0 | Garnet | 7.0 | 3.9 | NaN | NaN | Fracture |
row1 | Graphite | 1.5 | 2.3 | NaN | NaN | One |
row2 | Kyanite | 6 | 4.01 | NaN | NaN | One |
row3 | test | bad | False | asdf | NaN | @#$%^ |
In case you would like to edit the original df3, you can apply the drop operation in place with the inplace=True option. To avoid dropping rows and columns that share the same index name, you could first use the .reset_index() method. With drop=True, the new DataFrame discards the old indexes and keeps the new, reset ones.
Reset the index to the default integer index and drop the old indexes.
df3_cleaned = df3.reset_index(drop=True)
df3_cleaned
 | name | hardness | sp. gr. | test | new_column | cleavage |
---|---|---|---|---|---|---|
0 | Amphibole | 5.5 | 2.8 | A | NaN | NaN |
1 | Biotite | 2.75 | 3.0 | 12j | NaN | NaN |
2 | Calcite | 3.0 | 2.72 | True | NaN | NaN |
3 | Dolomite | 3.0 | 2.85 | 9 | NaN | NaN |
4 | Feldspars | 6.0 | 2.645 | 11.0 | NaN | NaN |
5 | Garnet | 7.0 | 3.9 | NaN | NaN | Fracture |
6 | Graphite | 1.5 | 2.3 | NaN | NaN | One |
7 | Kyanite | 6 | 4.01 | NaN | NaN | One |
8 | test | bad | False | asdf | NaN | @#$%^ |
Now let’s drop row 8 and column ‘test’ (remember to add axis=1 when dropping a column).
df3_cleaned.drop(8,inplace=True)
df3_cleaned.drop('test',inplace=True, axis=1)
print(df3_cleaned)
name hardness sp. gr. new_column cleavage
0 Amphibole 5.5 2.8 NaN NaN
1 Biotite 2.75 3.0 NaN NaN
2 Calcite 3.0 2.72 NaN NaN
3 Dolomite 3.0 2.85 NaN NaN
4 Feldspars 6.0 2.645 NaN NaN
5 Garnet 7.0 3.9 NaN Fracture
6 Graphite 1.5 2.3 NaN One
7 Kyanite 6 4.01 NaN One
Note that if you try to re-run the above cell without re-running the previous one, you will get an error. After running the cell once, row 8 and the ‘test’ column are gone from df3_cleaned, so running it again attempts to re-drop something that doesn’t exist. Re-running the previous cell ‘fixes’ this issue, since it re-assigns df3_cleaned as df3.reset_index(drop=True).
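One way to make such a cell safe to re-run is the errors='ignore' option of .drop(), which skips labels that are already gone instead of raising an error; a minimal sketch:
# errors='ignore' silently skips labels that no longer exist
df3_cleaned.drop(8, errors='ignore', inplace=True)
df3_cleaned.drop('test', axis=1, errors='ignore', inplace=True)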
Warning
If one or more rows (or columns) have the same index name, .drop() will drop all of them.
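A minimal sketch of that behaviour, using a small throwaway DataFrame of our own:
# Two rows share the label 'dup'; dropping 'dup' removes both at once
demo = pd.DataFrame({'x': [1, 2, 3]}, index=['dup', 'dup', 'keep'])
print(demo.drop('dup'))  # only the 'keep' row remains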
2.2.4. Accessing and modifying DataFrame values#
Now that we have created our table, we want to be able to access and modify the individual values within it. This way we can add the missing values to the ’cleavage’ column. There are many ways to do this, because numpy and standard Python functions are often still applicable to pandas data structures. However, there is a difference in processing speed. Therefore, when accessing or modifying data, try to use the pandas indexers .iat[], .at[], .iloc[], and .loc[] instead of the common [] square-bracket syntax, which might raise some warnings:
Note that the value still changes below, but avoid doing it this way.
df3_cleaned.cleavage[1]='One'
df3_cleaned
C:\Users\gui-win10\AppData\Local\Temp\ipykernel_24848\54782356.py:1: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:
df["col"][row_indexer] = value
Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df3_cleaned.cleavage[1]='One'
C:\Users\gui-win10\AppData\Local\Temp\ipykernel_24848\54782356.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df3_cleaned.cleavage[1]='One'
 | name | hardness | sp. gr. | new_column | cleavage |
---|---|---|---|---|---|
0 | Amphibole | 5.5 | 2.8 | NaN | NaN |
1 | Biotite | 2.75 | 3.0 | NaN | One |
2 | Calcite | 3.0 | 2.72 | NaN | NaN |
3 | Dolomite | 3.0 | 2.85 | NaN | NaN |
4 | Feldspars | 6.0 | 2.645 | NaN | NaN |
5 | Garnet | 7.0 | 3.9 | NaN | Fracture |
6 | Graphite | 1.5 | 2.3 | NaN | One |
7 | Kyanite | 6 | 4.01 | NaN | One |
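The FutureWarning above already shows the preferred pattern: perform the assignment in a single .loc step. For example, the same edit without warnings:
# Single-step assignment via .loc avoids chained assignment
df3_cleaned.loc[1, 'cleavage'] = 'One'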
2.2.5. .iat[] vs .at[] vs .iloc[] vs .loc[]#
In this discussion on Stack Overflow, some differences between these methods are pointed out. Keep in mind that .iat[] and .at[] are the fastest, but they only return scalars (one element), while .loc[] and .iloc[] can access several elements at the same time. Lastly, .iat[] and .iloc[] use integer positions (numbers), while .at[] and .loc[] use labels. For more information, check the Stack Overflow discussion.
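To see the speed difference for yourself, here is a minimal sketch using the standard-library timeit module (exact numbers vary per machine):
import timeit

# Rough comparison of scalar access speed; .at is typically faster than .loc
t_at = timeit.timeit(lambda: df3_cleaned.at[0, 'hardness'], number=10_000)
t_loc = timeit.timeit(lambda: df3_cleaned.loc[0, 'hardness'], number=10_000)
print(f".at: {t_at:.3f} s, .loc: {t_loc:.3f} s")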
Accessing values using arrays as indexers with .iloc[], the only method that allows arrays for both rows and columns:
indx_array_row = np.array([0,1,2])
indx_array_col = np.array([0,1,2,3])
df3_cleaned.iloc[indx_array_row[:2],indx_array_col[:3]]
 | name | hardness | sp. gr. |
---|---|---|---|
0 | Amphibole | 5.5 | 2.8 |
1 | Biotite | 2.75 | 3.0 |
Accessing values with .loc[]; only the rows are given as an index array here (this works because the row labels are integers), while columns are selected by label:
df3_cleaned.loc[indx_array_row[:2],'hardness']
0 5.5
1 2.75
Name: hardness, dtype: object
Or multiple columns with .loc[]:
df3_cleaned.loc[indx_array_row[:2],['hardness', 'cleavage']]
 | hardness | cleavage |
---|---|---|
0 | 5.5 | NaN |
1 | 2.75 | One |
Accessing values with .at[]: no arrays allowed, only labels. Note that using 0 works here since the rows have no custom labels, so their labels are their integer positions.
df3_cleaned.at[0, 'hardness']
5.5
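.at[] can also write a single value, not just read one. A minimal sketch on a copy, so the tables below stay as shown ('Two' is a made-up value):
# .at sets a scalar by label as fast as it gets one; work on a copy here
tmp = df3_cleaned.copy()
tmp.at[0, 'cleavage'] = 'Two'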
If we were to change the index of this DataFrame to the mineral names and delete the ‘name’ column…
df3_cleaned.index = df3_cleaned.name
df3_cleaned.drop('name',axis=1,inplace=True)
df3_cleaned
name | hardness | sp. gr. | new_column | cleavage
---|---|---|---|---
Amphibole | 5.5 | 2.8 | NaN | NaN |
Biotite | 2.75 | 3.0 | NaN | One |
Calcite | 3.0 | 2.72 | NaN | NaN |
Dolomite | 3.0 | 2.85 | NaN | NaN |
Feldspars | 6.0 | 2.645 | NaN | NaN |
Garnet | 7.0 | 3.9 | NaN | Fracture |
Graphite | 1.5 | 2.3 | NaN | One |
Kyanite | 6 | 4.01 | NaN | One |
Now using df.at[0,'hardness'] doesn’t work, since there’s no row with label 0.
df3_cleaned.at[0, 'hardness']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[58], line 1
----> 1 df3_cleaned.at[0, 'hardness']
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexing.py:2575, in _AtIndexer.__getitem__(self, key)
2572 raise ValueError("Invalid call for scalar access (getting)!")
2573 return self.obj.loc[key]
-> 2575 return super().__getitem__(key)
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexing.py:2527, in _ScalarAccessIndexer.__getitem__(self, key)
2524 raise ValueError("Invalid call for scalar access (getting)!")
2526 key = self._convert_key(key)
-> 2527 return self.obj._get_value(*key, takeable=self._takeable)
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\frame.py:4209, in DataFrame._get_value(self, index, col, takeable)
4203 engine = self.index._engine
4205 if not isinstance(self.index, MultiIndex):
4206 # CategoricalIndex: Trying to use the engine fastpath may give incorrect
4207 # results if our categories are integers that dont match our codes
4208 # IntervalIndex: IntervalTree has no get_loc
-> 4209 row = self.index.get_loc(index)
4210 return series._values[row]
4212 # For MultiIndex going through engine effectively restricts us to
4213 # same-length tuples; see test_get_set_value_no_partial_indexing
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3812, in Index.get_loc(self, key)
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
3817 self._check_indexing_error(key)
KeyError: 0
Accessing values with .iat[]: no arrays and no labels allowed, only integer positions (indexes).
df3_cleaned.iat[1, 2]
nan
In case you want to know the index number of a column or row:
print(df3_cleaned.columns.get_loc('new_column'))
print(df3_cleaned.index.get_loc('Dolomite'))
2
3
Since the above lines return the integer positions of a column and a row, you can use them directly with .iat[].
df3_cleaned.iat[df3_cleaned.index.get_loc('Dolomite'), df3_cleaned.columns.get_loc('new_column')]
nan
Selecting multiple columns:
df3_cleaned[['hardness','new_column']]
name | hardness | new_column
---|---|---
Amphibole | 5.5 | NaN |
Biotite | 2.75 | NaN |
Calcite | 3.0 | NaN |
Dolomite | 3.0 | NaN |
Feldspars | 6.0 | NaN |
Garnet | 7.0 | NaN |
Graphite | 1.5 | NaN |
Kyanite | 6 | NaN |
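Row selection by condition combines naturally with this. A minimal sketch using a boolean mask (standard pandas, though not used elsewhere in this section; it works here because the remaining hardness values are all numeric):
# Boolean mask: keep only rows whose hardness exceeds 5
df3_cleaned[df3_cleaned['hardness'] > 5]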
Finally, we remove a column by making a copy without it and re-assigning the original variable to that copy (an alternative to inplace=True).
df3_cleaned = df3_cleaned.drop('new_column', axis=1)
df3_cleaned
name | hardness | sp. gr. | cleavage
---|---|---|---
Amphibole | 5.5 | 2.8 | NaN |
Biotite | 2.75 | 3.0 | One |
Calcite | 3.0 | 2.72 | NaN |
Dolomite | 3.0 | 2.85 | NaN |
Feldspars | 6.0 | 2.645 | NaN |
Garnet | 7.0 | 3.9 | Fracture |
Graphite | 1.5 | 2.3 | One |
Kyanite | 6 | 4.01 | One |