In this assignment, we’ll explore restaurant review data available through the Yelp Dataset Challenge. The dataset includes Yelp data for user reviews and business information for many metropolitan areas. I’ve already downloaded this dataset (8 GB total!) and extracted out the data files for reviews and restaurants in Philadelphia. I’ve placed these data files into the data directory in this repository.
This assignment is broken into two parts:
Part 1: Analyzing correlations between restaurant reviews and census data
We’ll explore the relationship between restaurant reviews and the income levels of the restaurant’s surrounding area.
Part 2: Exploring the impact of fast food restaurants
We’ll run a sentiment analysis on reviews of fast food restaurants and estimate income levels in neighborhoods with fast food restaurants. We’ll test how well our sentiment analysis works by comparing the number of stars to the sentiment of reviews.
1. Correlating restaurant ratings and income levels
In this part, we’ll use the census API to download household income data and explore how it correlates with restaurant review data.
1.1 Query the Census API
Use the cenpy package to download median household income in the past 12 months by census tract from the 2021 ACS 5-year data set for your county of interest.
You have two options to find the correct variable names:
1. Search through: https://api.census.gov/data/2021/acs/acs5/variables.html
2. Initialize an API connection and use the .varslike() function to search for the proper keywords

At the end of this step, you should have a pandas DataFrame holding the income data for all census tracts within the county being analyzed. Feel free to rename your variable from the ACS so it has a more meaningful name!
::: {.callout-caution} Some census tracts won’t have any value because there are not enough households in that tract. The census will use a negative number as a default value for those tracts. You can safely remove those tracts from the analysis! :::
import cenpy
import pandas as pd
import geopandas as gpd
import pygris
import requests as rq
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns  # used for the line plots and bar charts below
import holoviews as hv
import hvplot.pandas
from shapely.geometry import Point

pd.options.display.max_colwidth = None
available = cenpy.explorer.available()
acs = available.filter(regex="^ACS", axis=0)
counties = cenpy.explorer.fips_table("COUNTY")

# counties.loc[counties[3].str.contains("Nassau") & counties[0].str.contains("NY")]
# in case I wanted to find any other counties, only for experimenting
counties.loc[counties[3].str.contains("Philadelphia")]
|      | 0  | 1  | 2   | 3                   | 4  |
|------|----|----|-----|---------------------|----|
| 2294 | PA | 42 | 101 | Philadelphia County | H6 |
philly_county_code = "101"
pa_state_code = "42"
acs = cenpy.remote.APIConnection("ACSDT5Y2021")
variable_id = acs.varslike(
    pattern="MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS",
    by="concept",  # searches along the concept column
).sort_index()
# variable_id = B19013_001E
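The cell that actually runs the query isn't shown above; here's a minimal sketch of what it could look like, assuming the B19013_001E variable identified above and the state/county codes defined earlier, and following the callout's advice to drop tracts with negative sentinel values:

```python
# Sketch: query median household income by tract for Philadelphia County
philly_income = acs.query(
    cols=["NAME", "B19013_001E"],
    geo_unit="tract:*",
    geo_filter={"state": pa_state_code, "county": philly_county_code},
)

# The API returns strings; convert to numbers and give the column a friendlier name
philly_income["B19013_001E"] = pd.to_numeric(philly_income["B19013_001E"])
philly_income = philly_income.rename(columns={"B19013_001E": "MedHHInc"})

# Drop tracts where the census reports a negative default value
philly_income = philly_income.loc[philly_income["MedHHInc"] > 0]
```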
The Yelp dataset includes data for 7,350 restaurants across the city. Load the data from the data/ folder and use the latitude and longitude columns to create a GeoDataFrame after loading the JSON data. Be sure to set the right CRS when initializing the GeoDataFrame!
Notes
The JSON data is in a “records” format. To load it, you’ll need to pass the following keywords:

orient='records'

lines=True
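The loading cell isn't shown here; a minimal sketch, assuming the restaurants file is named Data/restaurants_philly.json (check the actual filename in the repository's data folder) and that the coordinates are plain lat/lng (EPSG:4326):

```python
# Sketch: load the restaurants JSON and build a GeoDataFrame from lat/lng
restaurants = pd.read_json("Data/restaurants_philly.json", orient="records", lines=True)

resturantsGpd = gpd.GeoDataFrame(
    restaurants,
    geometry=gpd.points_from_xy(restaurants["longitude"], restaurants["latitude"]),
    crs="EPSG:4326",  # Yelp coordinates are WGS84 lat/lng
)
```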
1.6 Make a plot of median household income vs. Yelp stars
Our dataset has the number of stars for each restaurant, rounded to the nearest 0.5 star. In this step, create a line plot that shows the average income value for each stars category (e.g., all restaurants with 1 star, 1.5 stars, 2 stars, etc.).
While there are multiple ways to do this, seaborn.lineplot() is a great option. It can show the average value in each category as well as 95% uncertainty intervals. Use this function to plot the stars (“x”) vs. average income (“y”) for all of our restaurants, using the dataframe from the last step. Be sure to format your figure to make it look nice!
Question: Is there a correlation between a restaurant’s ratings and the income levels of its surrounding neighborhood?
sns.lineplot(data=restauranttract, x="stars", y="MedHHInc")
plt.ylabel("Median Household Income")
plt.locator_params(axis="y", nbins=5)  # limit the y-axis to roughly 5 ticks

# There is no real correlation between income levels and ratings; any restaurant
# can be rated well or poorly regardless of its neighborhood's income.
2. Fast food trends in Philadelphia
At the end of part 1, you should have seen a strong trend where higher income tracts generally had restaurants with better reviews. In this section, we’ll explore the impact of fast food restaurants and how they might be impacting this trend.
Hypothesis
Fast food restaurants are predominantly located in areas with lower median income levels.
Fast food restaurants have worse reviews compared to typical restaurants.
If true, these two hypotheses could help to explain the trend we found in part 1. Let’s dive in and test our hypotheses!
2.1 Identify fast food restaurants
The “categories” column in our dataset contains multiple classifications for each restaurant. One such category is “Fast Food”. In this step, add a new column called “is_fast_food” that is True if the “categories” column contains the term “Fast Food” and False otherwise.
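The cell for this step isn't shown; a minimal sketch, assuming the joined restaurants/tracts frame from part 1 is named restauranttract:

```python
# Sketch: flag fast food restaurants; na=False treats missing categories as False
restauranttract["is_fast_food"] = restauranttract["categories"].str.contains(
    "Fast Food", na=False
)
```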
2.2 Calculate the median income for fast food and otherwise
Group by the “is_fast_food” column and calculate the median income for restaurants that are and are not fast food. You should find that income levels are lower in tracts with fast food.
Note: this is just an estimate, since we are calculating a median of median income values.
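A minimal sketch of this calculation, assuming the restauranttract frame carries the MedHHInc column from part 1 and the is_fast_food flag from part 2.1:

```python
# Sketch: median of tract-level median incomes, split by fast food status
restauranttract.groupby("is_fast_food")["MedHHInc"].median()
```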
[Output: the restaurant GeoDataFrame joined to census tracts — 7,350 rows × 26 columns, with each restaurant's name, review count, stars, categories, and point geometry alongside the attributes of its surrounding tract (e.g., Kitchen Gia, 3.0 stars; Adelita Taqueria & Restaurant, 4.5 stars).]
2.3 Load fast food review data
In the rest of part 2, we’re going to run a sentiment analysis on the reviews for fast food restaurants. The review data for all fast food restaurants identified in part 2.1 is already stored in the data/ folder. The data is stored as a JSON file and you can use pandas.read_json to load it.
Notes
The JSON data is in a “records” format. To load it, you’ll need to pass the following keywords:
orient='records'
lines=True
reviews = pd.read_json("Data/reviews_philly_fast_food.json", orient="records", lines=True)
restreviews = reviews.merge(resturantsGpd, on="business_id", how="inner")
restreviews = restreviews.rename(columns={"stars_x": "stars_given", "stars_y": "overall_rating"})
restreviews.head(1)
| column         | row 0 |
|----------------|-------|
| business_id    | kgMEBZG6rjkGeFzPaIM4MQ |
| review_id      | E-yGr1OhsUBxNeUVLDVouA |
| stars_given    | 1 |
| text           | I know I shouldn't expect much but everything I asked for that was on the drive thru menu was not available. I was actually afraid of what I was going to get once I did get it. I saw the movie "waiting". Word of advice stay clear of this arch. Just so you know I was only trying to order a beverage how pathetic is that. |
| latitude       | 39.93944 |
| longitude      | -75.166805 |
| name           | McDonald's |
| review_count   | 55 |
| overall_rating | 2.0 |
| categories     | Fast Food, Food, Restaurants, Coffee & Tea, Burgers |
| geometry       | POINT (-75.16680 39.93944) |
2.4 Trim to the most popular fast food restaurants
There are too many reviews to run a sentiment analysis on all of them in a reasonable time. Let’s trim our reviews dataset to the most popular fast food restaurants, using the list provided below.
You will need to get the “business_id” values for each of these restaurants from the restaurants data loaded in part 1.3. Then, trim the reviews data to include reviews only for those business IDs.
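The trimming cell isn't shown; a minimal sketch, where popular_names stands in for the assignment's list of popular chains (not reproduced here):

```python
# Sketch: keep only reviews of the most popular fast food chains
popular_names = ["McDonald's"]  # hypothetical placeholder for the provided list

popular_ids = resturantsGpd.loc[
    resturantsGpd["name"].isin(popular_names), "business_id"
]
restfocus = restreviews.loc[
    restreviews["business_id"].isin(popular_ids)
].reset_index(drop=True)
```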
2.5 Run the emotions classifier on fast food reviews
Run a sentiment analysis on the reviews data from the previous step. Use the DistilBERT model that can predict emotion labels (anger, fear, sadness, joy, love, and surprise). Transform the result from the classifier into a DataFrame so that you have a column for each of the emotion labels.
['I know I shouldn\'t expect much but everything I asked for that was on the drive thru menu was not available. I was actually afraid of what I was going to get once I did get it. I saw the movie "waiting". Word of advice stay clear of this arch. Just so you know I was only trying to order a beverage how pathetic is that.',
'Dirty bathrooms and very slow service, but I was pleased because they had a TV on with subtitles and the volume on, and it was turned to the news! A good place to pass some time with a tasty Mc-snack and a hot coffee while in Philly during a chilly day! We stopped here on the way to a football game and found it a very pleasant and relaxing place to hang out for a while.',
"That's a shame! \nThis place is full of junkies customers \nThe staff and the service is fast \nIt's just too much like homeless or you can tell the addict people all coming here and soliciting"]
example_string = "This is an Example"

descriptions_words = [desc.split() for desc in descriptions]

descriptions_words_flat = []
for list_of_words in descriptions_words:
    for word in list_of_words:
        descriptions_words_flat.append(word)

descriptions_words_lower = [word.lower() for word in descriptions_words_flat]

import nltk
nltk.download("stopwords");
stop_words = list(set(nltk.corpus.stopwords.words("english")))

descriptions_no_stop = []
for word in descriptions_words_lower:
    if word not in stop_words:
        descriptions_no_stop.append(word)

descriptions_no_stop = [
    word for word in descriptions_words_lower if word not in stop_words
]

import string
punctuation = list(string.punctuation)

descriptions_final = []
# Loop over all words
for word in descriptions_no_stop:
    # Remove any punctuation from the words
    for p in punctuation:
        word = word.replace(p, "")
    # Save it if the string is not empty
    if word != "":
        descriptions_final.append(word)
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\Owner\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
words = pd.DataFrame({"words": descriptions_final})
N = (
    words.groupby("words", as_index=False)
    .size()
    .sort_values("size", ascending=False, ignore_index=True)
)

top20 = N.head(20)
top20
|    | words     | size |
|----|-----------|------|
| 0  | food      | 1840 |
| 1  | order     | 1430 |
| 2  | get       | 1012 |
| 3  | one       | 980  |
| 4  | like      | 922  |
| 5  | service   | 901  |
| 6  | place     | 882  |
| 7  | time      | 873  |
| 8  | chicken   | 848  |
| 9  | mcdonalds | 810  |
| 10 | go        | 748  |
| 11 | location  | 677  |
| 12 | even      | 637  |
| 13 | drive     | 601  |
| 14 | got       | 585  |
| 15 | fries     | 570  |
| 16 | never     | 568  |
| 17 | ordered   | 563  |
| 18 | back      | 554  |
| 19 | good      | 550  |
fig, ax = plt.subplots(figsize=(8, 8))

# Plot horizontal bar graph
sns.barplot(
    y="words",
    x="size",
    data=top20,
    ax=ax,
    color="crimson",
    saturation=1.0,
)
ax.set_title("Most Common Words Found in Fast Food reviews")

# Remove spines (the box)
for spine in ax.spines.values():
    spine.set_visible(False)

# Remove x and y ticks
ax.tick_params(bottom=False, left=False)
2.6 Identify the predicted emotion for each text
Use the pandas idxmax() to identify the predicted emotion for each review, and add this value to a new column called “prediction”
The predicted emotion is the label with the highest confidence score across all emotion labels for a given review.
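The classifier cell that produces emotion_scores isn't shown above; a minimal sketch, assuming the widely used bhadresh-savani/distilbert-base-uncased-emotion checkpoint and Hugging Face's transformers pipeline:

```python
# Sketch: score every fast food review against the six emotion labels
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="bhadresh-savani/distilbert-base-uncased-emotion",
    top_k=None,  # return scores for all six labels, not just the top one
)

# descriptions is the list of review texts shown in part 2.5;
# truncation keeps long reviews within the model's 512-token limit
emotion_scores = classifier(descriptions, truncation=True)
```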
emotion = pd.DataFrame(
    [{d["label"]: d["score"] for d in dd} for dd in emotion_scores]
).assign(text=descriptions)
emotion.head(3)
|   | sadness  | fear     | anger    | joy      | surprise | love     | text |
|---|----------|----------|----------|----------|----------|----------|------|
| 0 | 0.733869 | 0.250677 | 0.011039 | 0.002758 | 0.001015 | 0.000643 | I know I shouldn't expect much but everything I asked for that was on the drive thru menu was not available. I was actually afraid of what I was going to get once I did get it. I saw the movie "waiting". Word of advice stay clear of this arch. Just so you know I was only trying to order a beverage how pathetic is that. |
| 1 | 0.000216 | 0.000088 | 0.000153 | 0.998563 | 0.000161 | 0.000819 | Dirty bathrooms and very slow service, but I was pleased because they had a TV on with subtitles and the volume on, and it was turned to the news! A good place to pass some time with a tasty Mc-snack and a hot coffee while in Philly during a chilly day! We stopped here on the way to a football game and found it a very pleasant and relaxing place to hang out for a while. |
| 2 | 0.268274 | 0.039276 | 0.658548 | 0.030782 | 0.001894 | 0.001226 | That's a shame! \nThis place is full of junkies customers \nThe staff and the service is fast \nIt's just too much like homeless or you can tell the addict people all coming here and soliciting |
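The cell that adds the “prediction” column isn't shown; a minimal sketch using idxmax() over the six emotion columns above:

```python
# Sketch: the predicted emotion is the column with the highest score per review
emotion_labels = ["sadness", "fear", "anger", "joy", "surprise", "love"]
emotion["prediction"] = emotion[emotion_labels].idxmax(axis=1)
```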
emotion_count = emotion.groupby("prediction").size().sort_values()

# Plotting
ax = emotion_count.plot(kind="barh", color="darkolivegreen")

# Add title
ax.set_title("Emotion Predictions for Yelp Fast Food Reviews")

# Remove spines (the box)
for spine in ax.spines.values():
    spine.set_visible(False)

# Remove x and y ticks
ax.tick_params(bottom=False, left=False)
2.7 Combine the ratings and sentiment data
Combine the data from part 2.4 (reviews data) and part 2.6 (emotion data). Use the pd.concat() function and combine along the column axis.
Note: You’ll need to reset the index of your reviews data frame so it matches the emotion data index (it should run from 0 to the length of the data - 1).
review_sentiment = emotion.merge(restfocus, on="text", how="left")
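The cell above joins the two frames on the review text instead; for reference, a concat-based sketch following the note in 2.7 (assuming restfocus is the trimmed reviews frame from part 2.4) might look like this:

```python
# Sketch: combine along the column axis; both frames need a matching 0..n-1 index
review_sentiment = pd.concat(
    [restfocus.reset_index(drop=True), emotion.reset_index(drop=True)],
    axis=1,
)
```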
2.8 Plot sentiment vs. stars
We now have a dataframe with the predicted primary emotion for each review and the associated number of stars for each review. Let’s explore two questions:
Does sentiment analysis work? Do reviews with fewer stars have negative emotions?
For our fast food restaurants, are reviews generally positive or negative?
Use seaborn’s histplot() to make a stacked bar chart showing the breakdown of each emotion for each stars category (1 star, 2 stars, etc.). A few notes:
To stack multiple emotion labels in one bar, use the multiple="stack" keyword
The discrete=True keyword can be helpful to tell seaborn our stars values are discrete categories
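The plotting cell below uses a long-format frame called review_sentiment_long that isn't created in the cells shown; a minimal sketch of one way to build it with pandas.melt, assuming the wide review_sentiment frame from part 2.7:

```python
# Sketch: reshape from one column per emotion to (stars_given, emotion, score) rows
review_sentiment_long = review_sentiment.melt(
    id_vars=["stars_given"],
    value_vars=["sadness", "fear", "anger", "joy", "surprise", "love"],
    var_name="emotion",
    value_name="score",
)
```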
palette = sns.color_palette("cubehelix")

plt.figure(figsize=(10, 6))
sns.histplot(
    data=review_sentiment_long,
    x="stars_given",
    weights="score",
    hue="emotion",
    multiple="stack",
    discrete=True,
    palette=palette,
)
plt.title("Stacked Histogram of Review Sentiment Scores by Star Rating")
plt.xlabel("User's Rating")
plt.ylabel("Score")

ax = plt.gca()
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.tick_params(bottom=False, left=False)
plt.show()
Question: What does your chart indicate for the effectiveness of our sentiment analysis? Does our original hypothesis about fast food restaurants seem plausible?
The chart indicates that certain emotions are more prominent at certain ratings. Restaurants receiving 4 or 5 stars have mostly joyful reviews, with a scattering of other emotions that may have been misclassified by the model. Lower-rated places see mostly anger or sadness, which makes sense given the user’s attitude toward the restaurant.