Enigma Public includes tens of thousands of datasets, each of which may contain useful information, but often you need to combine datasets to gain insights that aren’t otherwise apparent. For example, the United States Senate issues annual reports of lobbying activities, but these are compiled in a way that makes it hard to see which lobbyists and clients tried to influence specific issues, and which client spent the most money. In this article, we’ll show you how to combine two datasets to answer those questions.

About the datasets

All the Senate lobbying activity reports are available on Enigma Public in the U.S. Senate - Lobbying Reports collection:

https://public.enigma.com/browse/u-s-senate-lobbying-reports/3b169a69-cbf4-4137-92d4-0ca268816153

The two datasets we’ll use in this example are:

The Primary Report lists registrant and client transactions, but it doesn’t provide information about the specific issues lobbyists worked on for their clients.

The Issues dataset provides information about the issues worked on by lobbyists, but doesn’t provide information about the lobbyists or their clients, or how much was spent.

Fortunately, the two tables use a common “Filing ID” (shown in the “ID” column). By downloading and joining the two tables using a SQL query, you can find out what issues the lobbyists and their clients worked on.

There are many ways you can do this kind of a join. In this article, we’ll show you how to do this using a free app called DB Browser for SQLite, which is available for PCs and Macs.

Downloading the datasets

Primary Report 2017: https://public.enigma.com/datasets/lobbying-disclosures-primary-report-2017/bbb5f44a-f9c4-4c01-b112-03d576984d67

Issues 2017: https://public.enigma.com/datasets/lobbying-disclosures-issues-2017/fc655a9d-7a69-4a0e-8440-47e36712cb4c

  1. Open the first dataset in the Enigma Public Data Viewer.
  2. Click the Export button to download the dataset to your computer.
  3. Repeat for the second dataset.

You should now have the two CSV files you need.

Importing the files into DB Browser

If you don’t already have DB Browser for SQLite, download it from http://sqlitebrowser.org/ and install it on your computer. Then follow the steps below to import the datasets.

  1. In DB Browser, click File > New Database.
  2. Enter a file name and click Save. Then click Cancel to close the “Edit table definition” window.
  3. In DB Browser, click File > Import > Table from CSV file.
  4. Select the first CSV file and click Open.
  5. In the Table name field, change the table name to primary2017.
  6. Make sure the Column names in first line option is selected.
  7. Click OK.

Repeat for the second CSV file, setting the table name to issues2017. After importing both files, you’ll see the tables listed in the Database Structure view and the DB Schema view.

Joining the two tables

The SQL command to combine rows from two tables is the JOIN command (specifically an “inner join,” which is the default join type). In this case, you want to combine rows with the same ID. The query to do this is:

SELECT * FROM issues2017 JOIN primary2017 ON issues2017.id = primary2017.id

The * says to include all columns from both tables. To include only specific columns, specify the columns you want like this:

SELECT issues2017.*, primary2017.registrant_name, primary2017.client_name, primary2017.amount 
FROM issues2017 JOIN primary2017 ON issues2017.id = primary2017.id

This query says to include all columns from the issues2017 table, plus the registrant name, client name, and amount from the primary2017 table.

To perform the join:

  1. In DB Browser for SQLite, click the Execute SQL tab.
  2. Copy the SQL query above and paste it into the command query window.
  3. Click the Execute button. You’ll see the query results showing all of the columns you specified for each matching ID.

Locating specific information

In the screen above, there’s a note below the table indicating that the query returned 35,045 rows. This means there are 35,045 rows in the issues2017 table with an ID that matches a row in the primary2017 table. Next, we’ll narrow this down to answer the question, which lobbyists and clients tried to influence the Senate on health issues relating to prescription drugs?

  1. In DB Browser, add an AND statement to the query as shown below. This narrows the results to include only rows where the code field is HEALTH ISSUES (this is one of the standard issue categories).

    SELECT issues2017.*, primary2017.registrant_name, primary2017.client_name, primary2017.amount 
    FROM issues2017 JOIN primary2017 
    ON issues2017.id = primary2017.id 
    AND issues2017.code="HEALTH ISSUES"
    
  2. Click the Execute button to view the matching rows.
  3. Add two more AND statements to the query to narrow the results down to include only rows with the words “prescription” and “drugs” in the specific_issue column.

    SELECT issues2017.*, primary2017.registrant_name, primary2017.client_name, primary2017.amount 
    FROM issues2017 JOIN primary2017 
    ON issues2017.id = primary2017.id 
    AND issues2017.code="HEALTH ISSUES" 
    AND issues2017.specific_issue LIKE "%prescription%" 
    AND issues2017.specific_issue LIKE "%drugs%"
    
  4. Click the Execute button to view the matching rows.
  5. To sort the rows by client name, add an ORDER BY statement to the query.

    SELECT issues2017.*, primary2017.registrant_name, primary2017.client_name, primary2017.amount 
    FROM issues2017 JOIN primary2017 
    ON issues2017.id = primary2017.id 
    AND issues2017.code="HEALTH ISSUES" 
    AND issues2017.specific_issue LIKE "%prescription%" 
    AND issues2017.specific_issue LIKE "%drugs%" 
    ORDER BY primary2017.client_name
    
  6. Click the Execute button. This time the query returns 61 matching rows showing which clients and lobbyists worked on issues relating to prescription drugs.

Aggregating amounts

The second question we wanted to answer was, of the companies lobbying on health issues relating to prescription drugs, which spent the most money? Since each “filing ID” in the primary report may map to multiple issues in the issues report, we can’t know how much money was targeted on a specific issue type, but we can sort the results by total amount spent. To do this, you’ll need to perform some calculations on the data – specifically to aggregate the amount spent by each client company.

When you import data from a CSV file, all columns are defined as text fields. Before you can perform calculations, you’ll need to define the amount column as as a numeric field. To do this:

  1. In DB Browser, click the Database Structure tab.
  2. Select the primary2017 table and click Modify Table.
  3. Set the Type field for amount to NUMERIC, as shown above.
  4. Click OK.

With the amount column specified as a numeric field, you can now perform the calculations:

  1. Click the Execute SQL tab and paste in the query below.

    SELECT registrant_name, client_name, SUM(amount) AS total_amount
    FROM issues2017 JOIN primary2017 ON issues2017.id = primary2017.id 
    AND issues2017.code="HEALTH ISSUES" 
    AND issues2017.specific_issue LIKE "%prescription%" 
    AND issues2017.specific_issue LIKE "%drugs%" 
    GROUP BY client_name
    ORDER BY total_amount DESC
    
  2. Click the Execute button.

In this modified version of the query, the SUM command works with the GROUP BY command to compute the total amount for each client. It then orders the results using the computed total amount and displays the results in descending (DESC) order.

When reporting lobbying activities, companies can upload amendments to originally reported amounts. Each upload is given a separate filing ID and appears as a separate row in the dataset. This means a simple analysis like the one here may count the same transaction multiple times. As with all data analysis, it is important to understand the dataset you are using and account for possible anomalies.

Summary

Sometimes you can’t find the information you need from a single dataset. In this example, you combined two datasets containing a common key using a SQL query. You were then able to quickly narrow down over 35,000 rows to just 61 rows with the information you were seeking.