The examples on this page demonstrate how to access Enigma Public data from Python through the Enigma Public API. All of the examples use the Requests module. For information on obtaining the IDs for entities of interest, see UUIDs.

Getting started

To use the Requests module, you’ll need to import it into your Python project. Then add two global variables:

  • One with your API key (to find your key, click your initials at the top right of the Enigma Public screen and choose Account Settings)
  • The other with the Enigma Public base URL
import requests

headers = {'authorization': 'Bearer <YOUR_API_KEY>'}
base_url = "https://public.enigma.com/api/"

Getting collection information

This example uses the GET /collections/ endpoint and prints the name, short description, and ID of each top-level collection.

def print_top_level_collections():
    url = base_url + "collections/"
    r = requests.get(url, headers=headers)
    collections = r.json()
    for collection in collections:
        print(collection['display_name'])
        print(collection['description_short'])
        print(collection['id'] + "\n")

The output looks like this:

Companies
Data sources that come from companies or commercial data.
5f8faa60-e6c3-4dc0-8eea-ade8c81d1265

Curated Collections
Datasets curated by Enigma.
52dfb31c-f22e-49fb-bc05-8f5d8a5e7cab

...

How it works

GET /collections/ returns a JSON array, which maps to a list of dictionaries in Python. Each dictionary represents a collection and contains key:value pairs representing the collection’s attributes. A portion of the JSON returned by GET /collections/ is shown below.

[
  {
    "ancestors": [], 
    "created_at": "2017-06-10T21:05:33.780625+00:00", 
    "description": "Data sources that come from companies or commercial data.", 
    "description_short": "Data sources that come from companies or commercial data.", 
    "display_name": "Companies", 
    "editable": false, 
    "id": "5f8faa60-e6c3-4dc0-8eea-ade8c81d1265", 
    "modified_at": "2017-06-27T14:06:32.352415+00:00", 
    "parent_collection": {
      "id": null
    }, 
    "published": true
  }, 
  {
    "ancestors": [], 
    "created_at": "2017-06-10T21:03:06.174454+00:00", 
    "description": "Datasets curated by Enigma.", 
    "description_short": "Datasets curated by Enigma.", 
    "display_name": "Curated Collections", 
    "editable": false, 
    "id": "52dfb31c-f22e-49fb-bc05-8f5d8a5e7cab", 
    "modified_at": "2017-06-10T21:03:06.174459+00:00", 
    "parent_collection": {
      "id": null
    }, 
    "published": true
  },
  ... 
]

The Python function iterates over the list of collection objects. Within the for loop, collection['display_name'] says, within this collection object, find the value for the key display_name. The same construct is used for the other attributes.
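
The same key lookups work for any of the attributes shown in the JSON above. As a variation, here is a minimal sketch (the function name is illustrative) that builds a lookup table mapping each collection’s display name to its ID:

def collection_ids_by_name():
    url = base_url + "collections/"
    r = requests.get(url, headers=headers)
    # Map each collection's display name to its UUID
    return {c['display_name']: c['id'] for c in r.json()}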

Finding the most recent snapshot ID

This example uses GET /datasets/{id} to return the ID of the most recent snapshot (the “current snapshot”) within the specified dataset.

def find_current_snapshot_id(dataset_id):
    url = base_url + "datasets/" + dataset_id
    r = requests.get(url, headers=headers)
    dataset = r.json()
    return dataset['current_snapshot']['id']
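
For example, passing in the ID of the H-1B Visa Applications dataset shown in the JSON below returns the ID of its current snapshot:

snapshot_id = find_current_snapshot_id("62daf463-8094-4e88-8d46-b1715065dcf1")
print(snapshot_id)  # d1ce71cb-5f90-4e0f-9cd1-d9edd79d9cf7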

How it works

GET /datasets/{id} returns a JSON object representing the dataset. A portion of the JSON returned by GET /datasets/{id} is shown below.

{
  "ancestors": [
    ...
  ], 
  "citation": "http://www.foreignlaborcert.doleta.gov/performancedata.cfm", 
  "created_at": "2017-02-07T22:28:26.483897+00:00", 
  "current_snapshot": {                            <-- ['current_snapshot']
    "fields": [
      ...
    ], 
    "id": "d1ce71cb-5f90-4e0f-9cd1-d9edd79d9cf7",  <-- ['id']
    "row_count": 647852
  }, 
  "data_updated_at": null, 
  "description": "The H-1B is a non-immigrant visa...", 
  "description_short": "The H-1B is a non-immigrant visa...", 
  "display_name": "H-1B Visa Applications - 2016", 
  "editable": false, 
  "id": "62daf463-8094-4e88-8d46-b1715065dcf1", 
  "modified_at": "2017-06-29T03:29:10.806290+00:00", 
  "parent_collection": {
    "id": "d582dfbd-4329-4b5e-b0c9-39149f5dd546"
  }, 
  "published": true, 
  "schema_updated_at": null
}

In the Python function, dataset['current_snapshot']['id'] says, within the dataset object, find the value for the key current_snapshot and within that get the value for the key id.
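
Note that current_snapshot can be null for some datasets (the next example checks for this), in which case the lookup raises a TypeError. A defensive variant might look like this (the _safe name is illustrative):

def find_current_snapshot_id_safe(dataset_id):
    url = base_url + "datasets/" + dataset_id
    r = requests.get(url, headers=headers)
    snapshot = r.json()['current_snapshot']
    # Return None rather than raising when the dataset has no current snapshot
    return snapshot['id'] if snapshot is not None else None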

Finding all snapshots within a collection

This example uses GET /datasets/ with the parent_collection_id query parameter. It returns a list of IDs for all current snapshots within the specified collection.

def snapshots_in_collection(collection_id):
    url = base_url + "datasets/?parent_collection_id=" + collection_id
    r = requests.get(url, headers=headers)
    datasets = r.json()
    return [
        dataset['current_snapshot']['id']
        for dataset in datasets
        if dataset['current_snapshot'] is not None
    ]

How it works

GET /datasets/ returns a list of dataset objects. The function iterates over the list and uses dataset['current_snapshot']['id'] as described in the previous example to get the ID for each current snapshot.
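
For example, using the Curated Collections ID from the earlier output (note that without a Range header, GET /datasets/ returns only the first 20 datasets; see Using a Range header below):

snapshot_ids = snapshots_in_collection("52dfb31c-f22e-49fb-bc05-8f5d8a5e7cab")
print(len(snapshot_ids))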

Finding matching rows from a dataset

This example uses GET /snapshots/{id} with the query parameter to get the rows within the specified snapshot that contain the specified search string. row_limit is required because its default value is zero (i.e., no rows are returned).

def get_matching_rows(snapshot_id, search_string):
    url = base_url + "snapshots/" + snapshot_id + "?row_limit=1000&query=" + search_string
    r = requests.get(url, headers=headers)
    snapshot = r.json()
    return snapshot['table_rows']['rows']

The return object might look something like this (here we queried the snapshot for the NASDAQ Company Listings dataset searching for the word “micro”):

[['SMCI', 'Super Micro Computer, Inc.', '24.2', '1168760683.2', 'n/a', 'n/a', 'Technology', 'Computer Manufacturing', 'http://www.nasdaq.com/symbol/smci'], ['AMD', 'Advanced Micro Devices, Inc.', '10.19', '9592853089.27', 'n/a', 'n/a', 'Technology', 'Semiconductors', 'http://www.nasdaq.com/symbol/amd'], ['SMSI', 'Smith Micro Software, Inc.', '1.03', '12479640.68', 'n/a', 'n/a', 'Technology', 'Computer Software: Prepackaged Software', 'http://www.nasdaq.com/symbol/smsi']]
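
Note that concatenating the search string into the URL works only when it needs no URL encoding. If the search string might contain spaces or special characters, you can pass the query parameters as a dictionary and let Requests encode them; a sketch of the same function:

def get_matching_rows(snapshot_id, search_string):
    url = base_url + "snapshots/" + snapshot_id
    params = {'row_limit': 1000, 'query': search_string}
    # Requests URL-encodes the parameter values automatically
    r = requests.get(url, headers=headers, params=params)
    return r.json()['table_rows']['rows']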

How it works

A portion of the JSON returned by GET /snapshots/{id} for the NASDAQ dataset with query=micro is shown below.

{
  "created_at": "2017-05-07T07:55:45.184629+00:00", 
  "dataset": {
    "created_at": "2017-05-07T07:55:45.184629+00:00", 
    "description": "Companies listed on the NASDAQ Stock Exchange.", 
    "description_short": "Companies listed on the NASDAQ Stock Exchange.", 
    "display_name": "Company Listings - NASDAQ Stock Exchange", 
    "id": "ebb5e1c4-3780-4524-9d12-6f2a9f6b83b6", 
    "modified_at": "2017-08-23T15:22:32.634973+00:00", 
    "published": true
  }, 
  "fields": [
    {
      "data_type": "string", 
      "description": "Company Ticker Symbol.", 
      "display_name": "Company Symbol", 
      "name": "symbol", 
      "visible_by_default": true
    }, 
    {
      "data_type": "string", 
      "description": "Company Name", 
      "display_name": "Company Name", 
      "name": "name", 
      "visible_by_default": true
    }, 
    {
      "data_type": "string", 
      "description": "Last Sale Price - Delayed data, current as of \"Last Update Date\"", 
      "display_name": "Last Sale Price", 
      "name": "lastsale", 
      "visible_by_default": true
    },
    etc.
  ], 
  "highlights": [], 
  "id": "8508e202-7b9c-4419-84c8-63276c23626c", 
  "parent_snapshot": null, 
  "row_count": 3209, 
  "size": 655360, 
  "table_rows": {
    "count": 3, 
    "fields": [
      "symbol", 
      "name", 
      "lastsale", 
      "marketcap", 
      "adr_tso", 
      "ipoyear", 
      "sector", 
      "industry", 
      "summary_quote"
    ], 
    "rows": [
      [
        "SMCI", 
        "Super Micro Computer, Inc.", 
        "24.2", 
        "1168760683.2", 
        "n/a", 
        "n/a", 
        "Technology", 
        "Computer Manufacturing", 
        "http://www.nasdaq.com/symbol/smci"
      ], 
      [
        "AMD", 
        "Advanced Micro Devices, Inc.", 
        "10.19", 
        "9592853089.27", 
        "n/a", 
        "n/a", 
        "Technology", 
        "Semiconductors", 
        "http://www.nasdaq.com/symbol/amd"
      ], 
      [
        "SMSI", 
        "Smith Micro Software, Inc.", 
        "1.03", 
        "12479640.68", 
        "n/a", 
        "n/a", 
        "Technology", 
        "Computer Software: Prepackaged Software", 
        "http://www.nasdaq.com/symbol/smsi"
      ]
    ]
  }
}

In the Python function, snapshot['table_rows']['rows'] says, within the snapshot object, find the value for the key table_rows and within that get the value (a list in this case) for the key rows.
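
Because table_rows also includes the fields list, you can zip the field names with each row to get one dictionary per row, which is often easier to work with than bare lists. A minimal sketch (the function name is illustrative), assuming the same response shape:

def get_matching_rows_as_dicts(snapshot_id, search_string):
    url = base_url + "snapshots/" + snapshot_id + "?row_limit=1000&query=" + search_string
    r = requests.get(url, headers=headers)
    table = r.json()['table_rows']
    # Pair each row's values with the corresponding field names
    return [dict(zip(table['fields'], row)) for row in table['rows']]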

Finding rows that match on a specific column

This example is like the previous one, except it returns only rows where the query string is found within a specific column (rather than anywhere within the row). The function includes an additional parameter to pass in the 0-based index of the field you want to search.

def get_matching_rows(snapshot_id, search_string, field_index):
    url = base_url + "snapshots/" + snapshot_id + "?row_limit=1000&query=" + search_string
    r = requests.get(url, headers=headers)
    snapshot = r.json()
    rows = snapshot['table_rows']['rows']
    return [
        row for row in rows
        if search_string in row[field_index].lower()
    ]

How it works

The list comprehension in this example filters the matching rows returned by the query, keeping only those rows where the search string appears within the specified field. Since in (substring match) is case sensitive, the field value is converted to lowercase first; pass the search string in lowercase as well.
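
For example, to restrict the NASDAQ search to the Company Name column (index 1 in the fields list shown earlier):

rows = get_matching_rows("8508e202-7b9c-4419-84c8-63276c23626c", "micro", 1)
# Matches 'Super Micro Computer, Inc.', 'Advanced Micro Devices, Inc.', and
# 'Smith Micro Software, Inc.', but not rows where 'micro' appears only in
# another column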

Using a Range header

The API may limit the amount of data returned in response to a request. For example, GET /datasets/ returns data for only the first 20 datasets. If you want more, you must include a Range header in the request, specifying the range of datasets you want. This example shows how to specify a Range header.

def get_dataset_ids_in_collection(collection_id, max_datasets):
    url = base_url + "datasets/?parent_collection_id=" + collection_id
    # Copy the global headers so the Range header doesn't persist across calls
    request_headers = dict(headers, Range='resources=%d-%d' % (0, max_datasets - 1))
    r = requests.get(url, headers=request_headers)
    datasets = r.json()
    return [
        dataset['id']
        for dataset in datasets
    ]

How it works

For endpoints that return a list of items, the Range header (for example, Range: resources=0-9) lets you specify which items to return (here, items 0 through 9). The Python function adds a Range header to a copy of the global headers dictionary (copying avoids leaking the Range header into later requests), specifying the range start (0) and range end (max_datasets - 1), and then iterates over the list of datasets that’s returned.
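
You can also set a Range header on a one-off request without defining a function; for example, to fetch only the first 10 datasets:

r = requests.get(base_url + "datasets/", headers=dict(headers, Range='resources=0-9'))
print(len(r.json()))  # at most 10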

Querying the API for the content range

Frequently, you need to know in advance how many items the API will return. You can then request the information in chunks, rather than in one huge request that may time out or deliver more data than you need. This example shows how to use the HTTP HEAD method to query the API server. The HEAD method is like the GET method, except that the server doesn’t return the response body – only the response headers. The example then uses this information to request datasets from the API server in chunks of 10.

def get_all_datasets_in_collection(collection_id):
    url = base_url + "datasets/?parent_collection_id=" + collection_id
    r = requests.head(url, headers=headers)
    num_datasets = int(r.headers.get('content-range').split("/")[1])
    dataset_ids = []
    for start in range(0, num_datasets, 10):
        # Copy the global headers so the Range header doesn't persist across calls
        request_headers = dict(headers, Range='resources=%d-%d' % (start, start + 9))
        r = requests.get(url, headers=request_headers)
        datasets = r.json()
        for dataset in datasets:
            dataset_ids.append(dataset['id'])
    return dataset_ids

How it works

The Python function makes a HEAD request to the GET /datasets/ endpoint. It uses the response object’s headers attribute to retrieve the Content-Range response header, which looks something like this:

resources 0-19/94

From this it splits off the portion after the /, which specifies the total number of datasets available. The for loop then makes multiple GET requests, using the Range header (described in the previous example) to request the datasets in chunks of 10.
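
The same pattern generalizes to any endpoint that returns a list of items. A sketch of a generic pager (the function name and chunk size are illustrative), assuming the server reports the total in a Content-Range header of the form resources start-end/total:

def get_all(url, chunk_size=10):
    # Ask the server how many items are available without fetching any
    r = requests.head(url, headers=headers)
    total = int(r.headers['content-range'].split("/")[1])
    items = []
    for start in range(0, total, chunk_size):
        chunk_headers = dict(headers, Range='resources=%d-%d' % (start, start + chunk_size - 1))
        items.extend(requests.get(url, headers=chunk_headers).json())
    return items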