Exploring Yelp Reviews in Philadelphia

In this assignment, we’ll explore restaurant review data available through the Yelp Dataset Challenge. The dataset includes Yelp data for user reviews and business information for many metropolitan areas. I’ve already downloaded this dataset (8 GB total!) and extracted out the data files for reviews and restaurants in Philadelphia. I’ve placed these data files into the data directory in this repository.

This assignment is broken into two parts:

Part 1: Analyzing correlations between restaurant reviews and census data

We’ll explore the relationship between restaurant reviews and the income levels of the restaurant’s surrounding area.

Part 2: Exploring the impact of fast food restaurants

We’ll run a sentiment analysis on reviews of fast food restaurants and estimate income levels in neighborhoods with fast food restaurants. We’ll test how well our sentiment analysis works by comparing the number of stars to the sentiment of reviews.

Background readings - Does sentiment analysis work? - The Geography of Taste: Using Yelp to Study Urban Culture

1. Correlating restaurant ratings and income levels

In this part, we’ll use the census API to download household income data and explore how it correlates with restaurant review data.

1.1 Query the Census API

Use the cenpy package to download median household income in the past 12 months by census tract from the 2021 ACS 5-year data set for your county of interest.

You have two options to find the correct variable names:

Search through: https://api.census.gov/data/2021/acs/acs5/variables.html Initialize an API connection and use the .varslike() function to search for the proper keywords At the end of this step, you should have a pandas DataFrame holding the income data for all census tracts within the county being analyzed. Feel free to rename your variable from the ACS so it has a more meaningful name!

::: {.callout-caution} Some census tracts won’t have any value because there are not enough households in that tract. The census will use a negative number as a default value for those tracts. You can safely remove those tracts from the analysis! :::

import cenpy
import pandas as pd
import geopandas as gpd
import pygris
import requests as rq
import numpy as np
from matplotlib import pyplot as plt
import holoviews as hv
import hvplot.pandas
from shapely.geometry import Point
pd.options.display.max_colwidth = None
C:\Users\Owner\miniforge3\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:39: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def nb_dist(x, y):
C:\Users\Owner\miniforge3\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:165: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def get_faces(triangle):
C:\Users\Owner\miniforge3\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:199: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def build_faces(faces, triangles_is, num_triangles, num_faces_single):
C:\Users\Owner\miniforge3\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:261: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def nb_mask_faces(mask, faces):
available = cenpy.explorer.available()
acs = available.filter(regex="^ACS", axis=0)
counties = cenpy.explorer.fips_table("COUNTY")
#counties.loc[ counties[3].str.contains("Nassau") &  counties[0].str.contains("NY") ]
#in case i wanted to find any other counties, only for experimenting
counties.loc[ counties[3].str.contains("Philadelphia") ]
0 1 2 3 4
2294 PA 42 101 Philadelphia County H6
philly_county_code = "101"
pa_state_code = "42"
acs = cenpy.remote.APIConnection("ACSDT5Y2021")
variable_id = acs.varslike(
    pattern='MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS',
    by="concept",  # searches along concept column
).sort_index()
#variable_id = B19013_001E
medianHHinc = ['B19013_001E']
philly_demo_data = acs.query(
    cols=medianHHinc,
    geo_unit="tract:*",
    geo_filter={"state": pa_state_code, "county": philly_county_code},
)


philly_demo_data.head()

B19013_001E state county tract
0 104052 42 101 000101
1 91944 42 101 000102
2 91067 42 101 000200
3 86782 42 101 000300
4 67188 42 101 000401

1.2 Download census tracts from the Census and merge the data from part 1.1

  • Download census tracts for the desired geography using the pygris package
  • Merge the downloaded census tracts with the household income DataFrame
philly_tracts = pygris.tracts(
    state=pa_state_code, county=philly_county_code, year=2021
)
philly_merge = pd.merge(philly_tracts, philly_demo_data, left_on=['TRACTCE'], right_on=['tract'], how='left')
#philly_merge.head(3)

1.3 Load the restaurants data

The Yelp dataset includes data for 7,350 restaurants across the city. Load the data from the data/ folder and use the latitude and longitude columns to create a GeoDataFrame after loading the JSON data. Be sure to set the right CRS on when initializing the GeoDataFrame!

Notes

The JSON data is in a “records” format. To load it, you’ll need to pass the following keywords:

  • orient='records'
  • lines=True
restaurants = pd.read_json('Data/restaurants_philly.json', orient='records', lines=True)
resturantsGpd = gpd.GeoDataFrame(restaurants, geometry=gpd.points_from_xy(restaurants["longitude"], restaurants["latitude"]))
resturantsGpd.crs = 'EPSG:4326'
restaurantproj = resturantsGpd.to_crs("EPSG:4269")
restaurantproj.head(3)
business_id latitude longitude name review_count stars categories geometry
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries POINT (-75.15557 39.95550)
1 MUTTqe8uqyMdBl186RmNeA 39.953949 -75.143226 Tuna Bar 245 4.0 Sushi Bars, Restaurants, Japanese POINT (-75.14323 39.95395)
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322)

1.4 Add tract info for each restaurant

Do a spatial join to identify which census tract each restaurant is within. Make sure each dataframe has the same CRS!

At the end of this step, you should have a new dataframe with a column identifying the tract number for each restaurant.

restauranttract = gpd.sjoin(restaurantproj, philly_merge, how='left', predicate='intersects')
restauranttract.head(3)
business_id latitude longitude name review_count stars categories geometry index_right STATEFP ... MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON B19013_001E state county tract
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries POINT (-75.15557 39.95550) 113.0 42 ... G5020 S 386232.0 0.0 +39.9554162 -075.1569255 91067 42 101 000200
1 MUTTqe8uqyMdBl186RmNeA 39.953949 -75.143226 Tuna Bar 245 4.0 Sushi Bars, Restaurants, Japanese POINT (-75.14323 39.95395) 311.0 42 ... G5020 S 444406.0 0.0 +39.9547160 -075.1465255 91944 42 101 000102
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322) 67.0 42 ... G5020 S 239383.0 0.0 +39.9419037 -075.1591158 93203 42 101 001500

3 rows × 25 columns

1.5 Add income data to your restaurant data

Add the income data to your dataframe from the previous step, merging the census data based on the tract that each restaurant is within.

restauranttract.rename(columns={'B19013_001E' : 'MedHHInc'}, inplace=True)
restauranttract.head(2)
business_id latitude longitude name review_count stars categories geometry index_right STATEFP ... MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON MedHHInc state county tract
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries POINT (-75.15557 39.95550) 113.0 42 ... G5020 S 386232.0 0.0 +39.9554162 -075.1569255 91067 42 101 000200
1 MUTTqe8uqyMdBl186RmNeA 39.953949 -75.143226 Tuna Bar 245 4.0 Sushi Bars, Restaurants, Japanese POINT (-75.14323 39.95395) 311.0 42 ... G5020 S 444406.0 0.0 +39.9547160 -075.1465255 91944 42 101 000102

2 rows × 25 columns

1.6 Make a plot of median household income vs. Yelp stars

Our dataset has the number of stars for each restaurant, rounded to the nearest 0.5 star. In this step, create a line plot that shows the average income value for each stars category (e.g., all restaurants with 1 star, 1.5 stars, 2 stars, etc.)

While their are multiple ways to do this, the seaborn.lineplot() is a great option. This can show the average value in each category as well as 95% uncertainty intervals. Use this function to plot the stars (“x”) vs. average income (“y”) for all of our restaurants, using the dataframe from last step. Be sure to format your figure to make it look nice!

Question: Is there a correlation between a restaurant’s ratings and the income levels of its surrounding neighborhood?

import seaborn as sns
#restauranttract['MedHHInc'] = pd.to_numeric(restauranttract['MedHHInc'])

sns.lineplot(data=restauranttract, x="stars", y="MedHHInc")
plt.ylabel("Median Household Income")
plt.ybins=5





# There no real correlation between income levels and ratings, any restaurant can be