Enigma Public includes tens of thousands of datasets, each of which may contain useful information, but often you need to combine datasets to gain insights that aren’t otherwise apparent. For example, the United States Senate issues annual reports of lobbying activities, but these are compiled in a way that makes it hard to see which lobbyists and clients tried to influence specific issues, and which client spent the most money. In this article, we’ll show you how to combine two datasets to answer those questions.
About the datasets
All the Senate lobbying activity reports are available on Enigma Public in the U.S. Senate - Lobbying Reports collection:
The two datasets we’ll use in this example are:
- Lobbying Disclosures - Primary Report 2017: https://public.enigma.com/datasets/lobbying-disclosures-primary-report-2017/bbb5f44a-f9c4-4c01-b112-03d576984d67
- Lobbying Disclosures - Issues 2017: https://public.enigma.com/datasets/lobbying-disclosures-issues-2017/fc655a9d-7a69-4a0e-8440-47e36712cb4c
The Primary Report lists registrant and client transactions, but it doesn’t provide information about the specific issues lobbyists worked on for their clients.
The Issues dataset provides information about the issues worked on by lobbyists, but doesn’t provide information about the lobbyists or their clients, or how much was spent.
Fortunately, the two tables use a common “Filing ID” (shown in the “ID” column). By downloading and joining the two tables using a SQL query, you can find out what issues the lobbyists and their clients worked on.
There are many ways you can do this kind of a join. In this article, we’ll show you how to do this using a free app called DB Browser for SQLite, which is available for PCs and Macs.
Downloading the datasets
- Open the first dataset in the Enigma Public Data Viewer.
- Click the Export button to download the dataset to your computer.
- Repeat for the second dataset.
You should now have the two CSV files you need.
Importing the files into DB Browser
If you don’t already have DB Browser for SQLite, download it from http://sqlitebrowser.org/ and install it on your computer. Then follow the steps below to import the datasets.
- In DB Browser, click File > New Database.
- Enter a file name and click Save. Then click Cancel to close the “Edit table definition” window.
- In DB Browser, click File > Import > Table from CSV file.
- Select the first CSV file and click Open.
- In the Table name field, change the table name to primary2017.
- Make sure the Column names in first line option is selected.
- Click OK.
Repeat for the second CSV file, setting the table name to issues2017. After importing both files, you’ll see the tables listed in the Database Structure view and the DB Schema view.
Joining the two tables
The SQL command to combine rows from two tables is the
JOIN command (specifically an “inner join,” which is the default join type). In this case, you want to combine rows with the same ID. The query to do this is:
SELECT * FROM issues2017 JOIN primary2017 ON issues2017.id = primary2017.id
* says to include all columns from both tables. To include only specific columns, specify the columns you want like this:
SELECT issues2017.*, primary2017.registrant_name, primary2017.client_name, primary2017.amount FROM issues2017 JOIN primary2017 ON issues2017.id = primary2017.id
This query says to include all columns from the issues2017 table, plus the registrant name, client name, and amount from the primary2017 table.
To perform the join:
- In DB Browser for SQLite, click the Execute SQL tab.
- Copy the SQL query above and paste it into the command query window.
- Click the Execute button. You’ll see the query results showing all of the columns you specified for each matching ID.
Locating specific information
In the screen above, there’s a note below the table indicating that the query returned 35,045 rows. This means there are 35,045 rows in the issues2017 table with an ID that matches a row in the primary2017 table. Next, we’ll narrow this down to answer the question, which lobbyists and clients tried to influence the Senate on health issues relating to prescription drugs?
In DB Browser, add an
ANDstatement to the query as shown below. This narrows the results to include only rows where the code field is HEALTH ISSUES (this is one of the standard issue categories).
SELECT issues2017.*, primary2017.registrant_name, primary2017.client_name, primary2017.amount FROM issues2017 JOIN primary2017 ON issues2017.id = primary2017.id AND issues2017.code="HEALTH ISSUES"
- Click the Execute button to view the matching rows.
Add two more
ANDstatements to the query to narrow the results down to include only rows with the words “prescription” and “drugs” in the specific_issue column.
SELECT issues2017.*, primary2017.registrant_name, primary2017.client_name, primary2017.amount FROM issues2017 JOIN primary2017 ON issues2017.id = primary2017.id AND issues2017.code="HEALTH ISSUES" AND issues2017.specific_issue LIKE "%prescription%" AND issues2017.specific_issue LIKE "%drugs%"
- Click the Execute button to view the matching rows.
To sort the rows by client name, add an
ORDER BYstatement to the query.
SELECT issues2017.*, primary2017.registrant_name, primary2017.client_name, primary2017.amount FROM issues2017 JOIN primary2017 ON issues2017.id = primary2017.id AND issues2017.code="HEALTH ISSUES" AND issues2017.specific_issue LIKE "%prescription%" AND issues2017.specific_issue LIKE "%drugs%" ORDER BY primary2017.client_name
- Click the Execute button. This time the query returns 61 matching rows showing which clients and lobbyists worked on issues relating to prescription drugs.
The second question we wanted to answer was, of the companies lobbying on health issues relating to prescription drugs, which spent the most money? Since each “filing ID” in the primary report may map to multiple issues in the issues report, we can’t know how much money was targeted on a specific issue type, but we can sort the results by total amount spent. To do this, you’ll need to perform some calculations on the data – specifically to aggregate the amount spent by each client company.
When you import data from a CSV file, all columns are defined as text fields. Before you can perform calculations, you’ll need to define the amount column as as a numeric field. To do this:
- In DB Browser, click the Database Structure tab.
- Select the primary2017 table and click Modify Table.
- Set the Type field for amount to NUMERIC, as shown above.
- Click OK.
With the amount column specified as a numeric field, you can now perform the calculations:
Click the Execute SQL tab and paste in the query below.
SELECT registrant_name, client_name, SUM(amount) AS total_amount FROM issues2017 JOIN primary2017 ON issues2017.id = primary2017.id AND issues2017.code="HEALTH ISSUES" AND issues2017.specific_issue LIKE "%prescription%" AND issues2017.specific_issue LIKE "%drugs%" GROUP BY client_name ORDER BY total_amount DESC
Click the Execute button.
In this modified version of the query, the
SUM command works with the
GROUP BY command to compute the total amount for each client. It then orders the results using the computed total amount and displays the results in descending (
When reporting lobbying activities, companies can upload amendments to originally reported amounts. Each upload is given a separate filing ID and appears as a separate row in the dataset. This means a simple analysis like the one here may count the same transaction multiple times. As with all data analysis, it is important to understand the dataset you are using and account for possible anomalies.
Sometimes you can’t find the information you need from a single dataset. In this example, you combined two datasets containing a common key using a SQL query. You were then able to quickly narrow down over 35,000 rows to just 61 rows with the information you were seeking.