by Tashay Green, data scientist

What to Expect

The Enigma Public Data Viewer makes it easy to view a large data set, sort data columns and perform simple summary statistics, but if you’re interested in understanding the data a little deeper, this tutorial will walk you through the how to ask and answer a question from a Public dataset, using the Python programming language.

Prior experience with Python and pandas will be helpful for this guide. You will learn to explore the data by following the code step by step. You will also learn to pose a research question, manipulate a data set, visualize and draw conclusions about the data.

Installing Python and Jupyter Notebook

To run the code included in this tutorial, you will need to download Python version 2.7 onto your local machine. For those unfamiliar, Python is an object-oriented programming language that has become very popular for the sake of data analysis.

This tutorial won’t walk you through all of Python or pandas basics - that would make it very long - but the Hitchhiker’s Guide to Python will get you start with downloading Python for your operating system.

Jupyter Notebook

Jupyter notebook is a web application that allows you to interactively engage with your code, take notes, and create visualizations in one place.

Installing Jupyter notebook: https://jupyter.readthedocs.io/en/latest/install.html

Library Requirements

Python language comes with built-in modules that are useful for executing functions within your code. For this tutorial, it will be helpful to install additional libraries (collections of modules) designed specifically for data manipulation and visualization.

The necessary libraries can be installed using the command line prompts:

  • pip install pandas
  • pip install matplotlib
  • pip install geopandas
  • pip install seaborn

More info about the Python libraries used in the project can be found here:

All dependencies are listed in the requirements.txt document.

This guide will not explicitly teach the ins and outs of the packages, but if you enjoy doing the analysis here, you should practice them further. The links above are a great place to start.

How to Run the Code

You may either download this .ipynb file directly and execute it using Jupyter Notebook, or copy and paste the code into a new notebook.

Data Exploration

In [1]:

# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from __future__ import division
from geopandas import GeoDataFrame

# Set max column display width to 55
pd.set_option("display.max_columns", 55)

Background Information

In a city as busy as New York, there’s a constant presence of cranes and construction crews. Between the new businesses, residences, and attractions that are constructed in the city each year one can only wonder just how much construction is taking place. A great way to find out is by researching the number of building permits the New York City Department of Buildings issues each year, since permits are required before beginning any construction in the city.

Research Question

How many building permits are issued in NYC each year? What are the most common types of permits issued? Where are the most building permits issued?

Finding the Data

The data for this project was sourced from the Enigma Public database. A search for “building permit” yielded a page of several hundred results, which was narrowed down by clicking on “Collection Results”. From here you can find the “New York City Building Permits Issued” data set. Source links have been provided below.

Read in the Data

To get started with the analysis, we will load in a data set sourced from Enigma Public. We’ll need to do this in order to manipulate the data using the pandas library.

Viewing the data in Enigma Public helped us get a sense of what the data contained but we’ll take things a bit further by creating graphs and maps below.

New York City Department of Buildings Permits Data

Data source (Enigma Public Data Viewer): https://public.enigma.com/datasets/new-york-city-building-permits-issued/862f9e4a-88f3-4231-ad43-c823833ba4fd

Note: to export the data you need to create an Enigma Public account. Luckily, it’s free! Just click the button in the top right to create an account and then again, turn to the top right to download a CSV of the data.

In [2]:

directory = '../../Downloads/' #set the directory for where the data can be found on your machine

In [3]:

#create pandas dataframe from the csv data file
df = pd.read_csv(directory + 'f57081d3-e14a-4c45-bf99-52ddf9836a36.csv',low_memory=False)

At this point, you want to preview the data to make sure it was read in properly.

In [4]:

df.head(5) #preview the first 5 rows of the data

Out[4]:

How many building permits are issued in NYC each year?

Subset data for ‘issued’ permit status since we aren’t interested in any other permit status, then count the number of permits by year.

In [5]:

df = df[df['permit_status'] == 'ISSUED'] #return rows with 'issued' permit status

In [6]:

df['issuance_date'] = pd.to_datetime(df['issuance_date'], format='%Y-%m-%d') #format date to perform datetime functions

In [7]:

df['issuance_date'].dt.year.value_counts() #counting values by year using datetime function

Out[7]:

2015    187702
2016    184188
2014    169107
2013    111918
2017     98557
Name: issuance_date, dtype: int64

Calculate the percent change of permits issued over years

In [8]:

df['issuance_date'].dt.year.value_counts().sort_index().pct_change() #sorting the years then calculating percent change

Out[8]:

2013         NaN
2014    0.510990
2015    0.109960
2016   -0.018721
2017   -0.464911
Name: issuance_date, dtype: float64

In [9]:

df['issuance_date'].dt.year.value_counts().sort_index().plot(kind='bar') #sorting index so data is in order by year

#Plotting the data
plt.title('Building Permits Issued per Year', size=14)
plt.xlabel('Year', size=12)
plt.ylabel('Count', size=12)
plt.ylim(0,200000)
plt.xticks(rotation='horizontal')
plt.show()
Note: The data for 2017 includes only 6 months of the year. Since the sample is so small, we should probably omit it.

In [10]:

df = df[df['issuance_date'].dt.year != 2017] #return all data where the year does not equal 2017

Which types of permits are often issued?

In [11]:

df['permit_type'].value_counts().sort_values(ascending=False)

Out[11]:

EW    308513
PL    128218
EQ    107216
AL     46121
NB     24506
FO     19033
SG     11938
DM      7370
Name: permit_type, dtype: int64

What exactly do these codes mean? We can use information posted by the Department of Buildings to find out.

New York City Department of Buildings Permit Types

https://www1.nyc.gov/site/buildings/industry/permit-type-and-job-status-codes.page

Changing the names to make them more readable.

In [12]:

df['permit_type'] = df['permit_type'].map({'EW': 'Equipment Work', 'PL': 'Plumbing',
                                            'EQ': 'Construction Equipment', 'AL': 'Alteration',
                                            'NB': 'New Building', 'FO': 'Foundation',
                                            'SG': 'Sign', 'DM': 'Demolition'}) #use map function to set new values

In [13]:

#Plotting the data
fig = plt.figure(figsize=(14,7))

df['permit_type'].value_counts().plot(kind='bar')

plt.title('Permits Issued by Type', size=14)
plt.ylabel('Count', size=12)
plt.xticks(rotation='horizontal')
plt.show()

Where were most building permits issued between 2013-2016?

In [14]:

df['borough'].value_counts() #counting the # of permits by borough

Out[14]:

MANHATTAN        284013
BROOKLYN         166532
QUEENS           123363
BRONX             48776
STATEN ISLAND     30231
Name: borough, dtype: int64

In [15]:

#Plotting the data
df['borough'].value_counts().plot(kind='bar')

plt.title('Building Permits Issued by Borough', size=14)
plt.ylabel('Count', size=12)
plt.xticks(rotation='horizontal')
plt.show()

What percentage of the total permits have been issued for Manhattan? Brooklyn?

In [16]:

len(df[df['borough'] == 'MANHATTAN']) / len(df) #dividing the number of Manhattan records by the entire data set

Out[16]:

0.4349923037455105

In [17]:

len(df[df['borough'] == 'BROOKLYN']) / len(df) #dividing the number of Brooklyn records by the entire data set

Out[17]:

0.2550592343566927

What percentage of borough permits are for residential projects?

In [18]:

boroughs = ['BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND', 'BRONX'] #list of borough names to pass to for loop

for i in boroughs:
    """
    this for loop returns the percentage of residential permits
    within each borough

    count the number of residential permits then divide by the
    total number of borough permits
    """
    print '\n' + i
    print df[df['borough'] == i]['residential'].value_counts() / len(df[df['borough'] == i])
BROOKLYN
YES    0.640003
Name: residential, dtype: float64

MANHATTAN
YES    0.456634
Name: residential, dtype: float64

QUEENS
YES    0.600561
Name: residential, dtype: float64

STATEN ISLAND
YES    0.642089
Name: residential, dtype: float64

BRONX
YES    0.548918
Name: residential, dtype: float64

Looks like Manhattan is the only borough to issue more permits for commercial projects than residential!

How many permits have been issued by zip code?

In [19]:

df['zip_code'].value_counts().sort_index().head(5)  #counting the # of permits by zipcode then sorting them

Out[19]:

0.0            8
950.0          3
4452.0         3
10000.0       16
10001.0    12509
Name: zip_code, dtype: int64

The first three listed zipcodes look like errors, so we should get rid of them. Zipcodes for NYC begin at 10000.

In [20]:

df = df[df['zip_code'] >= 10000] #returns only zip codes that are > or = to 10000

In [21]:

df['zip_code'].value_counts().sort_values(ascending=False).head(5) #five zip codes with most permits issued

Out[21]:

10022.0    16126
10019.0    14114
10011.0    13510
10013.0    13249
10001.0    12509
Name: zip_code, dtype: int64

It is helpful to know the 5 zip codes with the most permits issues but it doesn’t tell me where they are in the city. Let’s map the data to get an idea of where these are located, through visual inspection.

Visualize permits by zip code

In order to visualize the permits on a heatmap, we must merge permit count data with a geometric shape file for NYC zip code boundaries. Next we need to download the zip code boundary data from New York’s open data portal.

New York City Zip Code Boundaries Shapefile

Data Source: https://data.cityofnewyork.us/Business/Zip-Code-Boundaries/i8iw-xf4u/data

In [22]:

zip_codes = GeoDataFrame.from_file(directory + 'ZIP_CODE_040114/ZIP_CODE_040114.shp') #read in shape file

In order to merge two dataframes, there must be a shared column. The shared column must have the same data type.

Ensure zip code column data types are the same for merging.

In [23]:

zip_codes['zip_code'] = zip_codes['ZIPCODE'].astype(int) #converting zipcode column to integer data type
df['zip_code'] = df['zip_code'].astype(int)

Count the number of occurrences for each zip code in the data frame, then convert the data series to a data frame for merging.

In [24]:

counts = df['zip_code'].value_counts()
counts = counts.to_frame(name='count')
counts = counts.reset_index()

Merge the number of occurences for each zip code, with the corresponding zip code polygon

In [25]:

counts = GeoDataFrame(counts.merge(zip_codes, how='left', left_on='index', right_on='zip_code'))

Initially, when attempting to plot the counts, the code returned an error. Turns out there were more bad zip codes in the data, so we’ll need to drop all NaNs in the geometry column.

In [26]:

counts = counts.dropna() #drop null values

In [27]:

#Plotting the data
fig, ax = plt.subplots(figsize = (8,8))

counts.plot(column='count', cmap='Blues',alpha=1,linewidth=0.1, ax=ax)

plt.title('Building Permits by Zipcode', size=20)
plt.axis('off')
plt.show()

The heatmap is darker in areas with more permits issued, and lighter in areas with fewer permits issued.

Conclusion

Between the years 2013-2016, the New York City Department of Buildings issued over 600,000 building permits, where 44% of them were for work in the borough of Manhattan. There was a 51% spike in the number of building permits issued from the year 2013 to 2014, which seems to be pretty significant. In further exploration, one might dig a bit deeper to find out why there was such an increase between those years.

The New York City neighborhood Midtown East (zip code 10022) had the greatest number of permits issued between 2013- 2016. Midtown East is the home of the famous 5th Avenue shopping district, St. Patrick’s Cathedral, and a host of other attractions. Considering the area generates a ton of traffic from tourists, shoppers, and workers alike, it is unsurprising that a high volume construction work is done to maintain the area and continue to drive the NYC economy. Such construction could have also been in anticipation of the new MTA yellow line train, the Q train, which runs along 2nd avenue on the east side and opened in January 2017.

Visualizing the building permits issued by zipcode provides better context for where permits have been issued. Merging the building data with a shape file allows you to display the density of building permits issued and locate the hotspots. For someone unfamiliar with zip codes in New York, the visual representation provides clues to spatial positioning. For example, one can see that the darkest spot is right beneath Central Park, which is more informative than simply identifying the darkest spot as zip code 10022.

Future Exploration

This exploratory analysis can be used as a foundation for more sophisticated research. Here are a few potential areas of interest for future exploration:

  • Trends for specific permit types over time
  • Contractors that receive the most permits
  • Trends for specific zip codes over time & reasons for the trend
  • Distribution of initial vs renewal permits issued

Other cities to explore using Enigma Public: