Importing data into DataFrames and exploring its attributes

2.3. Importing data into DataFrames and exploring its attributes#

pandas provides many functions to import data into dataframes, such as read_csv() to read delimited text files, or read_excel() for Excel or OpenDocument spreadsheets. read_csv() provides options that allow you to filter the data, such as specifying the separator/delimiter, the lines that form the headers, which rows to skip, etc. Let’s analyze the mineral_properties.txt. Below a screenshot of it:

rectangle

below we import the .txt:

  • we indicate that the separator is the comma "sep=','"

  • we indicate the header (what should be the columns names) is in the second line "header=[1]"

  • we indicate to not skip any rows "skiprows=None"

  • we indicate the first column should be the index of the rows "index_col=0"

file_location = ("mineral_properties.txt")
df4 = pd.read_csv(file_location, sep=',', header=[1],
                  skiprows=None, index_col=0)
df4
hardness sp. gr. cleavage
name
Amphibole 5.50 2.800 Two
Biotite 2.75 3.000 One
Calcite 3.00 2.720 Three
Dolomite 3.00 2.850 Three
Feldspars 6.00 2.645 Two
Garnet 7.00 3.900 Fracture
Graphite 1.50 2.300 One
Kyanite 6.00 4.010 One
Muscovite 2.25 2.930 One
Pyroxene 5.50 3.325 Two
Quartz 7.00 2.650 Fracture
Sillimanite 6.50 3.230 One

Note that if we try to call any of the columns from df4 we will get an error.

df4['hardness']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'hardness'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[67], line 1
----> 1 df4['hardness']

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\frame.py:4090, in DataFrame.__getitem__(self, key)
   4088 if self.columns.nlevels > 1:
   4089     return self._getitem_multilevel(key)
-> 4090 indexer = self.columns.get_loc(key)
   4091 if is_integer(indexer):
   4092     indexer = [indexer]

File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'hardness'

Do you know why?

Answer

In case you were not able to answer the above question, let’s look into df4.columns

df4.columns
Index([' hardness', ' sp. gr.', ' cleavage'], dtype='object')

You see there are spaces at the beginning of each column name… this happens because that’s how people usually type, with commas followed by a space. We could use the skipinitialspace = True from the pd.read_csv() function to avoid this. Let’s try it out:

df4 = pd.read_csv(file_location + 'mineral_properties.txt',sep=',',header=[1], 
                  skiprows=None, index_col=0, skipinitialspace=True)
print(df4.columns)
Index(['hardness', 'sp. gr.', 'cleavage'], dtype='object')

Ok, much better!