The pandas Python library provides data structures and data manipulation tools that greatly simplify data analysis and data cleaning. There are many excellent books on using pandas for data analysis, as well as numerous websites dedicated to the topic. This page provides information on how to load Enigma Public data into pandas, with a focus on the challenges posed by large datasets.

Getting started

This article assumes a basic knowledge of Python. The code samples are shown as Jupyter Notebook cells. If you haven’t used Jupyter before, see the Jupyter Notebook Quickstart. You can download the notebook used in this article here: pandas_tutorial.ipynb

All of the examples use the pandas library. Several also use the Requests library. Begin by importing the required libraries into your Python project.

In  [1]:  
import pandas as pd
import requests

If you haven’t already installed these libraries into your Python virtual environment, do this now.
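
For example, from a terminal with your virtual environment active:

pip install pandas requests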

Then add two variables: one with your API key and the other with the Enigma Public base URL:

In  [2]:
headers = {'authorization': 'Bearer <YOUR_API_KEY>'}
base_url = "https://public.enigma.com/api/"

To obtain your API key, sign into Enigma Public, click your initials at the top right of the screen, and choose Account Settings. Your API key is at the top right of the screen.

Downloading and reading a CSV file

The easiest way to get the dataset you want to analyze into pandas is to first download the dataset and then read it using the pandas read_csv() function.

You can also read the dataset directly into pandas using the Enigma Public API’s /export/{id} endpoint. For details, see Using the API to read data into pandas.

If you want to follow along with the examples in this article:

  1. Open the NYC Restaurant Inspections dataset.
  2. Click the Export button and download the full dataset to disk as a CSV file.
  3. Copy the file into your Jupyter Notebook directory and rename it restaurants.csv.

You can now read the CSV file into a pandas DataFrame (df) using read_csv():

In  [3]:  
df = pd.read_csv('restaurants.csv')
df.head()

You may see a warning message like the one below. You can ignore it for now. We’ll address it later when we use the dtype option to apply a schema.

/Users/enigma/.virtualenvs/jupyter3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2728: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.

Dataset size considerations

Some of the Enigma Public datasets are too large to load into memory. As a general rule, you’ll need 5-10 times as much RAM as the size of the dataset you want to analyze (see https://wesmckinney.com/blog/apache-arrow-pandas-internals/). You’ll be able to do the initial read with less RAM, but will likely run into problems when attempting any kind of analysis. To get the approximate size of a dataset:

  1. Open the dataset in Enigma Public and click the API button.
  2. Remove the default query parameters (?&row_limit=200&row_offset=0) from the URL in the address bar.
  3. Press Enter to re-execute the query.
  4. Scroll to the bottom of the page and read the size attribute. This gives the approximate size on disk in bytes (see the sketch after these steps for a programmatic alternative).
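
If you’d rather check the size programmatically, a request to the dataset endpoint should include it. The exact key name and location of the size attribute within the response JSON is an assumption here, so inspect the response if the lookup below prints None:

dataset_id = 'fa7ab996-fb43-4e86-80e7-f8e82ccba15f'   # NYC Restaurant Inspections
response = requests.get(base_url + 'datasets/{}'.format(dataset_id), headers=headers)
dataset_json = response.json()
# 'size' is assumed to live at the top level or under current_snapshot
print(dataset_json.get('size') or dataset_json.get('current_snapshot', {}).get('size'))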

The table below includes examples of a few Enigma Public datasets, showing the value of the size attribute as well as the actual size when downloaded to disk as a CSV file.

Dataset name                            # of rows   # of cols   Size attribute   CSV size
NYC Restaurant Inspections                 388,620          18      189,800,448   156.8 MB
US Import Records Cargo Summary 2016     5,253,980           9    1,518,231,552    1.06 GB
US Import Records Cargo Summary 2017    26,937,376           9    7,388,463,104    5.38 GB
Occupational Injuries and Illnesses     49,406,071          53   15,976,603,648   13.38 GB

Enigma updates datasets continuously, so the actual numbers for these datasets may be different from the values shown here. Additionally, as you’ll see shortly, a dataset’s in-memory footprint is typically much larger than its size on disk.

If the dataset you want to use is too large, there are a couple of easy options you might consider:

  • Download a filtered subset of the rows (see Downloading a filtered subset).
  • Read only the columns you need into pandas (see Reading selected columns into pandas).

If you need the entire dataset, there are a few more advanced techniques you can use to read the dataset more efficiently and optimize its memory footprint (see Reducing a dataset’s memory footprint).

Downloading a filtered subset

If you’re only interested in a subset of the data, you can filter the dataset in Enigma Public and download just the filtered rows (see Downloading data).

For example, the NYC Restaurant Inspections dataset covers the boroughs of Manhattan, Brooklyn, Queens, Bronx, and Staten Island. If you’re only interested in the data for Manhattan, you can filter by Borough and export using the Applied Filters option.

Although you can download selected rows, you can’t download selected columns. The “hide” option in the Enigma Public data viewer only hides the column from view – it doesn’t affect the download. You’ll need to filter out the columns you don’t want when you read the file into pandas (see the next section), or remove them from the DataFrame after you’ve read it into pandas.
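
If you go the second route, dropping columns after the read is a one-liner. Here’s a minimal sketch (the column names are just examples):

df = pd.read_csv('restaurants.csv')
df = df.drop(['phone', 'record_date'], axis=1)   # discard columns you don't need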

Reading selected columns into pandas

The pandas read_csv function has a usecols argument that lets you specify the columns you want to read (the default is all columns). The value for usecols is a list of column names. These are different from the “display names” shown in the Enigma Public data viewer. Since the column names are in the first row of the CSV, here’s a quick way to get them in list format from the downloaded file:

In  [4]:  
print(pd.read_csv('restaurants.csv', nrows=0).columns.tolist())

['camis', 'dba', 'boro', 'building', 'street', 'zipcode', 'phone', 
 'cuisine_description', 'inspection_date', 'action', 'violation_code', 
 'violation_description', 'critical_flag', 'score', 'grade', 
 'grade_date', 'record_date', 'inspection_type']

You can then copy the list, remove the names of the columns you don’t want, and call read_csv() using the list as the argument value. This example reads a subset of the columns using an abbreviated list:

In  [5]:  
cols = ['dba', 'boro', 'building', 'street', 'zipcode', 'cuisine_description', 'inspection_date', 'grade']
df = pd.read_csv('restaurants.csv', usecols=cols)
df.head()

Out [5]:  

In the case of the NYC Restaurant Inspections dataset, you should have no difficulty reading the entire dataset into memory, but this technique can be helpful for larger datasets.

Using the API to read data into pandas

While downloading the dataset and reading it into pandas using read_csv() is the easiest approach, it’s not a fully programmatic approach. If you want to read data directly into pandas, you’ll need to use the Enigma Public API.

Reading from the export URL is fine for small datasets but is generally not recommended for large datasets, since you’ll be downloading the snapshot file every time you run the script, and this can take a while. Also, keep in mind that when Enigma updates a dataset, it creates a new snapshot with a new snapshot ID (see How data is organized). To make sure you’re downloading the latest data (assuming that’s what you want), reference the snapshot using the parent dataset ID, as shown in the example below.

These examples use the Python Requests library, so make sure you’ve installed it into your Python virtual environment and imported it into your Python project.

The Enigma Public API’s GET /export/{id} endpoint lets you download a dataset given its snapshot ID, so first you’ll need to get the dataset’s most recent snapshot ID as shown below (for details on how this code works, see Finding the most recent snapshot ID).

In  [6]:  
dataset_id = 'fa7ab996-fb43-4e86-80e7-f8e82ccba15f'
url = base_url + 'datasets/{}'.format(dataset_id)
response = requests.get(url, headers=headers)
json = response.json()
snapshot_id = json['current_snapshot']['id']
snapshot_id

Out [6]:
'bdeed7df-d6c1-4991-87b4-245d917c31b0'

Once you have the current snapshot ID, you can read the snapshot into pandas using the read_csv() function.

In  [7]:
import io
url = base_url + 'export/{}'.format(snapshot_id)
response = requests.get(url, headers=headers).content
df = pd.read_csv(io.StringIO(response.decode('utf-8')))

Although read_csv() accepts URLs, it doesn’t support request headers, like the authorization header specified earlier for your API key, so you can’t read from the export URL directly. Instead, make a GET request with the Requests library and pass the decoded response content to read_csv() via an io.StringIO buffer.

The API’s GET /export/{id} endpoint also supports row filtering. If you want to read only rows that contain certain strings, use the endpoint’s query parameter.

In  [8]:  
url = base_url + 'export/{}?query=manhattan'.format(snapshot_id)
response = requests.get(url, headers=headers).content
df = pd.read_csv(io.StringIO(response.decode('utf-8')))
df.head()

Out [8]:

Additionally, you can read only selected columns from the response using the technique shown earlier (see Reading selected columns into pandas):

In  [9]: 
cols = ['dba', 'boro', 'building', 'street', 'zipcode', 'cuisine_description', 'inspection_date', 'grade']
url = base_url + 'export/{}'.format(snapshot_id)
response = requests.get(url, headers=headers).content
df = pd.read_csv(io.StringIO(response.decode('utf-8')), usecols=cols)
df.head()

Out [9]:  

Schema considerations

read_csv() infers the data type for each column. This generally yields columns of type int64 or float64 for numeric fields, and object for everything else. A DataFrame’s dtypes attribute lets you see what data type is used for each column. The example below shows the inferred column types for the NYC Restaurant Inspections dataset.

In [10]:  
df = pd.read_csv('restaurants.csv')
df.dtypes

Out [10]:  
camis                      int64
dba                       object
boro                      object
building                  object
street                    object
zipcode                  float64
phone                     object
cuisine_description       object
inspection_date           object
action                    object
violation_code            object
violation_description     object
critical_flag             object
score                    float64
grade                     object
grade_date                object
record_date               object
inspection_type           object

In most cases, the inferred schema is all you need, but there are situations where you may need to apply a schema yourself. These include:

  • Integer columns that contain empty fields, which read_csv() otherwise reads as float64 (see Integer columns with empty fields (NAs)).
  • Datetime columns, which read_csv() otherwise reads as object (see Handling datetimes).

Additionally, the techniques described later for reading a dataset more efficiently and optimizing its memory footprint require a schema (see Reducing a dataset’s memory footprint). Before moving on, let’s look at the dataset’s memory footprint using the inferred schema so we can make comparisons later.

In  [11]:  
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388620 entries, 0 to 388619
Data columns (total 18 columns):
camis                    388620 non-null int64
dba                      388227 non-null object
boro                     388620 non-null object
building                 388487 non-null object
street                   388616 non-null object
zipcode                  388616 non-null float64
phone                    388603 non-null object
cuisine_description      388620 non-null object
inspection_date          388620 non-null object
action                   387478 non-null object
violation_code           382499 non-null object
violation_description    381789 non-null object
critical_flag            388620 non-null object
score                    366798 non-null float64
grade                    191939 non-null object
grade_date               189927 non-null object
record_date              388620 non-null object
inspection_type          387478 non-null object
dtypes: float64(2), int64(1), object(15)
memory usage: 453.3 MB

The total memory footprint is 453.3 MB, as shown in the last line of the listing (memory usage: 453.3 MB).

Integer columns with empty fields (NAs)

pandas uses NumPy, and since NumPy doesn’t support empty fields in integer arrays, read_csv() assigns float64 if it detects empty fields within a numeric column. (This is one of the reasons for the “mixed types” warning message shown earlier.)

One side effect of using floats is that all values within the “integer” column are displayed with a trailing “.0”. Here’s an example using the same NYC Restaurant Inspections dataset.

In [12]:  
df = pd.read_csv('restaurants.csv')
df[['zipcode', 'score']]

Out[12]:
  zipcode score
0 10462.0 7.0
1 10462.0 7.0
2 10462.0 10.0
3 10462.0 10.0
4 10462.0 6.0
5 11225.0 10.0
6 11225.0 10.0
7 11225.0 7.0
8 11225.0 11.0
9 11225.0 11.0
...

If this is a problem for a field like a zip code, you can apply a schema that reads the column as a string (stored as type object). This way the field is treated as text and you won’t see “.0” at the end of each value. For details, see Applying a schema below.

Don’t convert fields like score to object, as you won’t be able to do numerical analysis on object fields.
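
For example, here’s a minimal sketch that reads zipcode as a string, so the values display as text such as 10462 rather than 10462.0:

df = pd.read_csv('restaurants.csv', dtype={'zipcode': 'str'})
df['zipcode'].head()   # values are now strings, with no trailing .0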

Handling datetimes

read_csv() does not detect datetime fields automatically. The NYC Restaurant Inspections dataset includes three datetime fields, but you can see in the example above that they are read as type object.

If datetimes are important, you can use the parse_dates option to specify a list of fields that read_csv() should parse as datetimes, for example:

In [13]:  
date_columns = ['inspection_date', 'grade_date', 'record_date']
df = pd.read_csv('restaurants.csv', parse_dates=date_columns)
df.dtypes

Out[13]:  
camis                             int64
dba                              object
boro                             object
building                         object
street                           object
zipcode                         float64
phone                            object
cuisine_description              object
inspection_date          datetime64[ns]
action                           object
violation_code                   object
violation_description            object
critical_flag                    object
score                           float64
grade                            object
grade_date               datetime64[ns]
record_date              datetime64[ns]
inspection_type                  object

Applying a schema

As noted earlier, the schema that’s inferred automatically by read_csv() may be fine for your purposes. If it isn’t, you can specify a schema using the dtype argument. This argument takes a dictionary of column name: data type pairs as its value. To demonstrate how it works, the example below forces the camis field (inferred previously as int64) to be read as a string and therefore stored (inefficiently) as an object.

In [14]:  
schema = {'camis': 'str'}
df = pd.read_csv('restaurants.csv', dtype=schema)
df.dtypes

Out[14]:  
camis                            object
dba                              object
boro                             object
...

When we ran df.info on the NYC Restaurant Inspections dataset using the inferred schema, the memory footprint was 453.3 MB. Below you can see the effect of defining camis as a string.

In [15]:  
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388620 entries, 0 to 388619
Data columns (total 18 columns):
camis                    388620 non-null object
dba                      388227 non-null object
boro                     388620 non-null object
...
dtypes: float64(2), object(16)
memory usage: 474.4 MB

The change from int to str on one column increased the dataset’s memory footprint from 453.3 MB to 474.4 MB. We’ll examine how to apply a schema to reduce the memory footprint later. First, though, let’s look at how to create a schema for a given dataset.

Obtaining the schema from Enigma Public

The dtype argument takes a dictionary of column: type pairs as its value. Creating such a dictionary by hand can be tedious, especially if the dataset has many columns. This section demonstrates how to create a custom schema quickly using the Enigma Public API.

The API’s GET /datasets/{id} endpoint returns information about each of the current snapshot’s fields, including its type. You can use this as a starting point for your schema. Here’s a portion of the fields JSON object for the NYC Restaurant Inspections dataset:

"fields": [
  {'data_type': 'integer',
  'description': 'Total score for inspection; updated based on adjudication results.  ',
  'display_name': 'Inspection Score',
  'name': 'score',
  'visible_by_default': True},
  {'data_type': 'string',
  'description': 'This field represents the grade associated with this inspection...',
  'display_name': 'Restaurant Grade',
  'name': 'grade',
  'visible_by_default': True},
  {'data_type': 'datetime',
  'description': 'The date when the grade was issued to the entity (restaurant).',
  'display_name': 'Grade Date',
  'name': 'grade_date',
  'visible_by_default': True},
  ...
  ]

Within the fields object, the keys you need to build the schema are name and data_type. The example below first extracts the fields object from the API response, then uses a dictionary comprehension to build the schema dictionary in the required format:

In  [16]:  
dataset_id = 'fa7ab996-fb43-4e86-80e7-f8e82ccba15f'
url = base_url + 'datasets/{}'.format(dataset_id)
response = requests.get(url, headers=headers)
json = response.json()
fields_json = json['current_snapshot']['fields']

In  [17]:  
schema = {field['name']:field['data_type'] for field in fields_json}
schema

Out [17]:  
{'action': 'string',
 'boro': 'string',
 'building': 'string',
 'camis': 'string',
 'critical_flag': 'string',
 'cuisine_description': 'string',
 'dba': 'string',
 'grade': 'string',
 'grade_date': 'datetime',
 'inspection_date': 'datetime',
 'inspection_type': 'string',
 'phone': 'string',
 'record_date': 'datetime',
 'score': 'integer',
 'street': 'string',
 'violation_code': 'string',
 'violation_description': 'string',
 'zipcode': 'string'}

Modifying the default schema

Although this provides the dataset’s schema in the required format, there are a few problems we’ll need to resolve:

  • The Enigma schema uses decimal and integer rather than the required float and int.
  • The Enigma schema uses string and boolean rather than the required str and bool.
  • The Enigma schema specifies datetime, but datetime columns must be read as strings and parsed using the parse_dates option (see Handling datetimes).

Additionally, since the NYC Restaurant Inspections dataset includes empty fields in its integer columns, we’ll map Enigma’s integer type to float rather than int (reading a column with missing values as int would fail).

You can edit the schema manually or programmatically. The example below makes the required conversions. (boolean and decimal aren’t used in this dataset, but are included for completeness.)

In  [18]:  
replacements = {'string':'str', 'integer':'float', 'datetime':'str', 'decimal':'float', 'boolean':'bool'}
for name, data_type in schema.items():    # avoid shadowing the built-in type()
    if data_type in replacements:
        schema[name] = replacements[data_type]
schema

Out [18]:  
{'action': 'str',
 'boro': 'str',
 'building': 'str',
 'camis': 'str',
 'critical_flag': 'str',
 'cuisine_description': 'str',
 'dba': 'str',
 'grade': 'str',
 'grade_date': 'str',
 'inspection_date': 'str',
 'inspection_type': 'str',
 'phone': 'str',
 'record_date': 'str',
 'score': 'float',
 'street': 'str',
 'violation_code': 'str',
 'violation_description': 'str',
 'zipcode': 'str'}

You can then call read_csv(), specifying the schema and date columns as shown below.

In  [19]:  
date_cols = ['inspection_date', 'grade_date', 'record_date']
df = pd.read_csv('restaurants.csv', parse_dates=date_cols, dtype=schema)
df.head()

Reducing a dataset’s memory footprint

As noted earlier, the inferred schema generally yields columns of type int64 or float64 for numeric fields, and object for everything else. Although pandas uses highly efficient NumPy arrays to store numeric columns in contiguous blocks of memory, object values are stored as standard Python objects. If you examine the dataset’s memory footprint, you’ll see it’s around three times the size of the CSV file on disk. This is due in large part to Python’s inefficient string handling (see Josh Devlin’s excellent article, Using pandas with large data).

In  [20]:  
df = pd.read_csv('restaurants.csv')
df.info(memory_usage='deep')

Out [20]:  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388620 entries, 0 to 388619
Data columns (total 18 columns):
camis                    388620 non-null int64
dba                      388227 non-null object
boro                     388620 non-null object
...
dtypes: float64(2), int64(1), object(15)
memory usage: 453.3 MB

You saw earlier how reading a single numeric column as a string instead of an integer increased the dataset’s memory footprint from 453.3 MB to 474.4 MB. Given the inefficiency of strings, they are always the first target for any memory optimization.

Using categoricals

read_csv() supports the schema type category (see Categorical Data). In a nutshell, if a column contains repeated values, pandas stores each unique value once and references it with a small integer code (see Using pandas with large data for a helpful diagram and discussion). This can significantly reduce the dataset’s memory footprint.
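
To see what that looks like under the hood, here’s a minimal sketch using a handful of borough values (the data is made up for illustration):

s = pd.Series(['MANHATTAN', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'MANHATTAN'])
cat = s.astype('category')
cat.cat.categories   # each unique value is stored once: ['BROOKLYN', 'MANHATTAN', 'QUEENS']
cat.cat.codes        # the column itself becomes small integer codes: 1, 0, 1, 2, 1

For a column with hundreds of thousands of rows and only a few distinct values, storing integer codes instead of repeated Python strings is where the memory savings come from.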

Don’t use categoricals for numeric fields on which you plan to perform any numerical analysis. If you try to perform calculations on a category column (for example, to compute the mean), you’ll get the error “TypeError: Categorical cannot perform the operation mean.”

This example returns again to the NYC Restaurant Inspections dataset. Enigma Public provides an easy way to see which columns might benefit from categorization. Open the dataset and click the Stats button.

From the column statistics, it looks as though all of the columns shown here could benefit from categorization – even the “Building” column. Although the building number is almost always an integer value, there are a few non-numeric values that caused read_csv() to set the field type to object. If you scroll horizontally in Enigma Public, you can see similar repetition in most of the other columns.

pandas itself provides a way to get detailed information about repetition within columns. Even if the unoptimized dataset is too large to read into memory, reading the first “chunk” (for example, 10,000 rows) using the nrows argument will show you which columns are good candidates for categoricals.

In  [21]:  
df = pd.read_csv('restaurants.csv', nrows=10000)
df.select_dtypes(include=['object']).describe()

Out [21]:

To use categorization, all you need to do is specify the column as type category in the schema; read_csv() takes care of the actual categorization automatically. The example below sets every column to category except the numeric camis and score columns and the three date columns (which are read as strings for now).

In  [22]:  
schema={'action': 'category',
'boro': 'category',
'building': 'category',
'camis': 'int',
'critical_flag': 'category',
'cuisine_description': 'category',
'dba': 'category',
'grade': 'category',
'grade_date': 'str',
'inspection_date': 'str',
'inspection_type': 'category',
'phone': 'category',
'record_date': 'str',
'score': 'float',
'street': 'category',
'violation_code': 'category',
'violation_description': 'category',
'zipcode': 'category'}

You can then call read_csv() with the dtype option to read the dataset into a DataFrame and examine its memory footprint.

In  [23]:  
df = pd.read_csv('restaurants.csv', dtype=schema)
df.info(memory_usage='deep')

Out [23]:  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388620 entries, 0 to 388619
Data columns (total 18 columns):
camis                    388620 non-null int64
dba                      388227 non-null category
boro                     388620 non-null category
building                 388487 non-null category
street                   388616 non-null category
zipcode                  388616 non-null category
phone                    388603 non-null category
cuisine_description      388620 non-null category
inspection_date          388620 non-null object
action                   387478 non-null category
violation_code           382499 non-null category
violation_description    381789 non-null category
critical_flag            388620 non-null category
score                    366798 non-null float64
grade                    191939 non-null category
grade_date               189927 non-null object
record_date              388620 non-null object
inspection_type          387478 non-null category
dtypes: category(13), float64(1), int64(1), object(3)
memory usage: 100.5 MB

Although categorization slows down the read process, the results can be dramatic. In this example, the dataset’s memory footprint went from 453.3 MB down to 100.5 MB.

As a final optimization, we’ll use the parse_dates option to store the date columns as memory-efficient NumPy arrays instead of Python objects.

In  [24]:  
date_cols = ['inspection_date', 'grade_date', 'record_date']
df = pd.read_csv('restaurants.csv', dtype=schema, parse_dates=date_cols)
df.info(memory_usage='deep')

Out [24]:  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388620 entries, 0 to 388619
Data columns (total 18 columns):
camis                    388620 non-null int64
dba                      388227 non-null category
boro                     388620 non-null category
building                 388487 non-null category
street                   388616 non-null category
zipcode                  388616 non-null category
phone                    388603 non-null category
cuisine_description      388620 non-null category
inspection_date          388620 non-null datetime64[ns]
action                   387478 non-null category
violation_code           382499 non-null category
violation_description    381789 non-null category
critical_flag            388620 non-null category
score                    366798 non-null float64
grade                    191939 non-null category
grade_date               189927 non-null datetime64[ns]
record_date              388620 non-null datetime64[ns]
inspection_type          387478 non-null category
dtypes: category(13), datetime64[ns](3), float64(1), int64(1)
memory usage: 27.7 MB

Using a combination of categorization and datetime parsing, the dataset’s memory footprint has gone from 453.3 MB to 27.7 MB, or to 6.1% of its original size.

Reading the data in chunks

For the largest Enigma Public datasets, the task of reading and categorizing a 5-15 GB CSV file overwhelms most personal computers. If the dataset you want to analyze is too large to read in one pass, you may be able to read and optimize it in pieces, and then recombine the optimized pieces into the full DataFrame.

While the technique shown here works, if you’re having to go to such lengths to get a large dataset into memory, it’s probably time to consider a non-pandas option. Even if you get the dataset into memory, analysis will be slow, or perhaps not even possible.

Although read_csv() has a chunksize option, this reads horizontal chunks (blocks of rows), which don’t recombine well when using categoricals, since each chunk ends up with its own category mappings. A better approach is to read the dataset in vertical slices, so categorization can operate on full columns. To save memory during this process, write each optimized slice back to disk and then recombine the slices when done.

When writing slices to disk, you’ll need to use a binary format that stores the data, the schema, and all of the mappings that support the categoricals. Pickling is well suited for this.
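
A pickle round-trip preserves the DataFrame’s dtypes, including the category mappings, whereas writing a slice back to CSV would discard them and force re-categorization on every read. Here’s a minimal sketch (the file name is just an example):

slice_df = pd.DataFrame({'boro': pd.Categorical(['MANHATTAN', 'BROOKLYN', 'MANHATTAN'])})
slice_df.to_pickle('slice_example.pkl')      # binary format: data plus dtypes and category mappings
pd.read_pickle('slice_example.pkl').dtypes   # boro is still category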

To demonstrate the vertical slicing technique, we’ll use one of Enigma Public’s largest datasets: the Bureau of Labor Statistics Occupational Injuries and Illnesses dataset. This dataset has 49 million rows and the CSV is 13.38 GB on disk – too large to read directly into pandas.

Begin by looking at the column statistics using the Stats button.

Repeating values in the thousands or even millions in some columns indicate that categoricals will work well for almost all columns in this dataset. The Series Id column is probably not worth categorizing, but we’ll do it anyway.

Next, download the dataset:

  1. Use the Export button to download the full dataset to disk as a CSV file.
  2. Copy the file into your Jupyter Notebook directory and rename it injuries.csv.
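
If you’d like to spot-check the repetition yourself before committing to the full read, the nrows technique shown earlier works here too, since reading only the first rows of the large file is quick:

sample = pd.read_csv('injuries.csv', nrows=10000)
sample.select_dtypes(include=['object']).describe()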

Now fetch the default schema using the Enigma Public API, as shown earlier.

In  [25]:  
url = base_url + 'datasets/{}'.format('2632836a-4680-4f9c-aae1-82067445bc72')
response = requests.get(url, headers=headers)
json = response.json()
fields_json = json['current_snapshot']['fields']
schema = {field['name']:field['data_type'] for field in fields_json}
schema

Out [25]:  
 {'footnote_codes': 'string',
  'footnote_text': 'string',
  'period': 'string',
  'period_name': 'string',
  'series_age_code': 'string',
  'series_age_text': 'string',
  'series_begin_period': 'string',
  'series_begin_period_text': 'string',
  'series_begin_year': 'string',
  'series_case_code': 'string',
  'series_case_text': 'string',
  'series_category_code': 'string',
  'series_category_text': 'string',
  'series_datatype_code': 'string',
  'series_datatype_text': 'string',
  'series_end_period': 'string',
  'series_end_period_text': 'string',
  'series_end_year': 'string',
  'series_event_code': 'string',
  'series_event_text': 'string',
  'series_footnote_codes': 'string',
  'series_footnote_text': 'string',
  'series_gender_code': 'string',
  'series_gender_text': 'string',
  'series_hour_code': 'string',
  'series_hour_text': 'string',
  'series_id': 'string',
  'series_industry_code': 'string',
  'series_industry_text': 'string',
  'series_los_code': 'string',
  'series_los_text': 'string',
  'series_nature_code': 'string',
  'series_nature_text': 'string',
  'series_occupation_code': 'string',
  'series_occupation_text': 'string',
  'series_ownership_code': 'string',
  'series_ownership_text': 'string',
  'series_pob_code': 'string',
  'series_pob_text': 'string',
  'series_race_code': 'string',
  'series_race_text': 'string',
  'series_source_code': 'string',
  'series_source_text': 'string',
  'series_special_code': 'string',
  'series_special_text': 'string',
  'series_state_code': 'string',
  'series_state_name': 'string',
  'series_time_code': 'string',
  'series_time_text': 'string',
  'series_weekday_code': 'string',
  'series_weekday_text': 'string',
  'value': 'string',
  'year': 'string'}

Then change the column types to categoricals using a modified version of the code from earlier.

In this dataset, all fields are defined as strings in the default schema. Since we won’t be doing any calculations on any of the numeric fields, we’ll convert them all to category, but in your own analysis, you may need to read certain fields as int or float.

In  [26]:  
replacements = {'string':'category', 'integer':'float', 'datetime':'str', 'decimal':'float', 'boolean':'bool'}
for name, data_type in schema.items():
    if data_type in replacements:
        schema[name] = replacements[data_type]
schema

Out [26]:  
{'footnote_codes': 'category',
 'footnote_text': 'category',
 'period': 'category',
 'period_name': 'category',
 'series_age_code': 'category',
 'series_age_text': 'category',
 'series_begin_period': 'category',
 'series_begin_period_text': 'category',
 'series_begin_year': 'category',
 'series_case_code': 'category',
 'series_case_text': 'category',
 'series_category_code': 'category',
 'series_category_text': 'category',
 'series_datatype_code': 'category',
 'series_datatype_text': 'category',
 'series_end_period': 'category',
 'series_end_period_text': 'category',
 'series_end_year': 'category',
 'series_event_code': 'category',
 'series_event_text': 'category',
 'series_footnote_codes': 'category',
 'series_footnote_text': 'category',
 'series_gender_code': 'category',
 'series_gender_text': 'category',
 'series_hour_code': 'category',
 'series_hour_text': 'category',
 'series_id': 'category',
 'series_industry_code': 'category',
 'series_industry_text': 'category',
 'series_los_code': 'category',
 'series_los_text': 'category',
 'series_nature_code': 'category',
 'series_nature_text': 'category',
 'series_occupation_code': 'category',
 'series_occupation_text': 'category',
 'series_ownership_code': 'category',
 'series_ownership_text': 'category',
 'series_pob_code': 'category',
 'series_pob_text': 'category',
 'series_race_code': 'category',
 'series_race_text': 'category',
 'series_source_code': 'category',
 'series_source_text': 'category',
 'series_special_code': 'category',
 'series_special_text': 'category',
 'series_state_code': 'category',
 'series_state_name': 'category',
 'series_time_code': 'category',
 'series_time_text': 'category',
 'series_weekday_code': 'category',
 'series_weekday_text': 'category',
 'value': 'category',
 'year': 'category'}

When you read a dataset in vertical slices, you must decide how many columns to process in one slice. For each slice you must read the entire CSV, so you want to do this as few times as possible. On the other hand, you want to reserve as much memory as possible for the categorization process, so you want to minimize the number of columns per slice. A lot depends on the dataset and the machine you’re using, but here we’ll process three columns per slice.

In  [27]:  
import os
os.makedirs('pickles', exist_ok=True)   # directory to hold the pickled slices

col_names = [field['name'] for field in fields_json]
for i in range(0, len(col_names), 3):
    print(col_names[i:i+3])
    # read three columns at a time, categorize them, and pickle the slice to disk
    df = pd.read_csv('injuries.csv', dtype=schema, usecols=col_names[i:i+3])
    df.to_pickle('pickles/' + str(i) + '.pkl')

['series_id', 'year', 'period']
['value', 'footnote_codes', 'period_name']
['footnote_text', 'series_pob_text', 'series_occupation_code']
['series_pob_code', 'series_gender_code', 'series_ownership_code']
['series_nature_code', 'series_category_text', 'series_datatype_code']
['series_race_code', 'series_footnote_text', 'series_hour_code']
['series_special_text', 'series_industry_code', 'series_datatype_text']
['series_event_code', 'series_time_code', 'series_state_name']
['series_los_code', 'series_los_text', 'series_case_text']
['series_state_code', 'series_gender_text', 'series_nature_text']
['series_source_text', 'series_end_period', 'series_begin_period']
['series_industry_text', 'series_age_code', 'series_time_text']
['series_begin_year', 'series_end_period_text', 'series_ownership_text']
['series_hour_text', 'series_race_text', 'series_category_code']
['series_age_text', 'series_occupation_text', 'series_event_text']
['series_end_year', 'series_weekday_text', 'series_begin_period_text']
['series_weekday_code', 'series_footnote_codes', 'series_source_code']
['series_special_code', 'series_case_code']

For reference, on a 16 GB MacBook Pro with a 3.1 GHz Intel i7 processor it took about 50 minutes to complete the read process. Once done, you can read and recombine the optimized slices as follows:

In  [28]:  
dfs = []
for i in range(0, len(col_names), 3):
    print(col_names[i:i+3])
    df = pd.read_pickle('pickles/' + str(i) + '.pkl')
    dfs.append(df)
df = pd.concat(dfs, axis=1)

Finally, check the DataFrame’s memory footprint:

In  [29]:  
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49406071 entries, 0 to 49406070
Data columns (total 53 columns):
series_id                   object
year                        category
period                      category
value                       category
...
dtypes: category(52), object(1)
memory usage: 6.4 GB

In terms of percentage reduction, the result is not as dramatic as what we saw for the NYC Restaurant Inspections dataset, but we were able to load the dataset into memory. Before you switch off your computer, be sure to save the resulting DataFrame to disk as a pickle so you can reload it quickly.
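
For example (the file name is just a suggestion):

df.to_pickle('injuries_optimized.pkl')          # save the fully optimized DataFrame
df = pd.read_pickle('injuries_optimized.pkl')   # reload it later without repeating the slicing process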

Summary

pandas is a great tool for data analysis, but it can sometimes be challenging to get your data into pandas, especially if the dataset is large. Enigma Public provides a vast selection of data to explore and analyze, but some of the datasets are multiple gigabytes in size and contain tens of millions of rows. Prefiltering or memory optimizations are essential if you want to perform any kind of analysis on these datasets. Hopefully this article has provided some helpful tips on how to get the most out of Enigma Public.