2.3. Importing data into DataFrames and exploring its attributes
pandas provides many functions to import data into DataFrames, such as read_csv() to read delimited text files, or read_excel() for Excel or OpenDocument spreadsheets. read_csv() provides options that control how the file is parsed, such as the separator/delimiter, which line holds the header, which rows to skip, etc. Let’s analyze the file mineral_properties.txt. Below is a screenshot of it:

[Screenshot of mineral_properties.txt]

Below we import the .txt file with the following options:
sep=',' indicates that the separator is a comma
header=[1] indicates that the header (the column names) is on the second line of the file, since line numbering starts at 0
skiprows=None indicates that no rows are skipped
index_col=0 indicates that the first column is used as the row index
file_location = "mineral_properties.txt"
df4 = pd.read_csv(file_location, sep=',', header=[1],
                  skiprows=None, index_col=0)
df4
name | hardness | sp. gr. | cleavage |
---|---|---|---|
Amphibole | 5.50 | 2.800 | Two |
Biotite | 2.75 | 3.000 | One |
Calcite | 3.00 | 2.720 | Three |
Dolomite | 3.00 | 2.850 | Three |
Feldspars | 6.00 | 2.645 | Two |
Garnet | 7.00 | 3.900 | Fracture |
Graphite | 1.50 | 2.300 | One |
Kyanite | 6.00 | 4.010 | One |
Muscovite | 2.25 | 2.930 | One |
Pyroxene | 5.50 | 3.325 | Two |
Quartz | 7.00 | 2.650 | Fracture |
Sillimanite | 6.50 | 3.230 | One |
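To experiment with header and index_col without the file at hand, here is a minimal sketch that parses a small in-memory sample instead. The sample text, its title line, and the name demo are assumptions made for illustration; the real mineral_properties.txt may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical stand-in for mineral_properties.txt: a title line, then the
# header line, then comma-separated rows (each comma followed by a space).
sample = io.StringIO(
    "Mineral properties (illustrative sample)\n"
    "name, hardness, sp. gr., cleavage\n"
    "Calcite, 3.00, 2.720, Three\n"
    "Quartz, 7.00, 2.650, Fracture\n"
)

# header=1 takes the second line (counting from 0) as the column names,
# so the title line above it is ignored.
# index_col=0 uses the first column ("name") as the row labels.
demo = pd.read_csv(sample, sep=",", header=1, index_col=0)

print(demo.shape)           # (2, 3)
print(demo.index.tolist())  # ['Calcite', 'Quartz']
```

The shape is (2, 3) rather than (2, 4) because the first column became the index and is no longer counted as a data column.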
Note that if we try to access any of the columns of df4 by name, we get an error.
df4['hardness']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'hardness'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[67], line 1
----> 1 df4['hardness']
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\frame.py:4090, in DataFrame.__getitem__(self, key)
4088 if self.columns.nlevels > 1:
4089 return self._getitem_multilevel(key)
-> 4090 indexer = self.columns.get_loc(key)
4091 if is_integer(indexer):
4092 indexer = [indexer]
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3812, in Index.get_loc(self, key)
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
3817 self._check_indexing_error(key)
KeyError: 'hardness'
Do you know why?
In case you were not able to answer the above question, let’s look at df4.columns:
df4.columns
Index([' hardness', ' sp. gr.', ' cleavage'], dtype='object')
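Notice the space at the beginning of each column name. This happens because the file was typed the way people usually type, with each comma followed by a space, and read_csv() keeps that space as part of the field. The column is therefore named ' hardness', not 'hardness', which explains the KeyError. We can pass skipinitialspace=True to pd.read_csv() to skip the spaces that follow each delimiter. Let’s try it out: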
df4 = pd.read_csv(file_location, sep=',', header=[1],
                  skiprows=None, index_col=0, skipinitialspace=True)
print(df4.columns)
Index(['hardness', 'sp. gr.', 'cleavage'], dtype='object')
Ok, much better!
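As a point of comparison, the sketch below contrasts skipinitialspace=True with stripping the column labels after reading. It runs on a small hypothetical in-memory sample (the text and the names clean and raw are assumptions, not part of the original file), and it suggests one difference worth knowing: stripping the labels afterwards does not touch the values in the cells.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the "comma followed by a space" layout
text = (
    "name, hardness, cleavage\n"
    "Calcite, 3.00, Three\n"
    "Quartz, 7.00, Fracture\n"
)

# Option 1: let the parser skip the spaces that follow each delimiter
clean = pd.read_csv(io.StringIO(text), index_col=0, skipinitialspace=True)

# Option 2: read the text as-is, then strip whitespace from the labels
raw = pd.read_csv(io.StringIO(text), index_col=0)
raw.columns = raw.columns.str.strip()

print(clean.columns.tolist())  # ['hardness', 'cleavage']
print(raw.columns.tolist())    # ['hardness', 'cleavage']

# The text values differ: only option 1 removed the leading space
print(repr(clean.loc["Calcite", "cleavage"]))  # 'Three'
print(repr(raw.loc["Calcite", "cleavage"]))    # ' Three'
```

With either option, selecting a column by its plain name, for example clean['hardness'], no longer raises a KeyError.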