TDM 20200: Project 9 — 2024
Motivation: Spark uses a distributed computing model to process data, which means that data is processed in parallel across a cluster of machines. PySpark is a Spark API that allows you to interact with Spark through the Python shell, or in Jupyter Lab, or in DataBricks, etc. PySpark provides a way to access Spark’s framework using Python. It combines the simplicity of Python with the power of Apache Spark.
Context: This is the second project in which we will continue to understand components of Spark’s ecosystem that PySpark can use
Scope: Python, Spark SQL, Spark Streaming
Dataset(s)
The following questions will use the following dataset:
/anvil/projects/tdm/data/amazon/amazon_fine_food_reviews.csv
Readings and Resources
Questions
Question 1 (2 points)
-
Create a PySpark session, and then load the dataset using PySpark.
-
Calculate the average
Score
for the reviews, grouped byProductId
. (There are 74258ProductId
values, so you do not need to display them all. If youshow()
the results, only 20 of the 74258ProductId
values and their averageScore
values will appear. That is OK for the purposes of this question.) -
Save the output for all 74258
ProductId
values and their averageScore
values to a file namedaverageScores.csv
.
You may import the following modules:
While reading the csv file into a data frame, you may need to specify the option that tells PySpark that there are headers. Otherwise, the header will be treated as part of the data itself.
You may use the following option to make the column names accessible as DataFrame attributes.
After all the operations are complete, you may need to close the SparkSession.
A PySpark DataFrame’s
|
It is not necessary to submit the file with the project solutions. |
Question 2 (2 points)
-
Use PySpark SQL to calculate the average helpfulness ratio (HelpfulnessNumerator/HelpfulnessDenominator) for each product.
-
Save the output for all 74258
ProductId
values and their average helpfulness ratio values to a file namedaverageHelpfulness.csv
.
The
A few more notes:
|
It is not necessary to submit the file with the project solutions. |
Question 3 (2 points)
In questions 1 and 2, we used the batch processing mode to do the data processing. In other words, the dataset is processed in one go. Alternatively, we can use Spark Streaming concepts. This technique would allow us to even work on a data set in which the data is being provided in a real-time data stream. (In this project, we are just working with a fixed data set, but we still want students to know how to work with streaming data.)
-
Please count the number of reviews for each
ProductId
, in a streaming style (simulating a real-time data monitoring and analytics). -
Display the results from 20 rows of the output.
You may use a |
|
Question 4 (2 points)
Use a streaming session like you did in Question 3.
-
Display the
ProductId
values andScore
values for the first 20 rows in which theScore
is strictly larger than 3. Output these values to the screen as the new data arrives in the streaming session.
Filtering streaming data for reviews with a score strictly greater than 3 is a straightforward operation. You may use a filter condition on the streaming DataFrame, for instance, like this
It is also necessary to remove the |
Question 5 (2 points)
-
Please state your understanding of PySpark streaming concepts in 2 or more sentences.
Project 09 Assignment Checklist
-
Jupyter Lab notebook with your code, comments and outputs for the assignment
-
firstname-lastname-project09.ipynb
-
-
Python file with code and comments for the assignment
-
firstname-lastname-project09.py
-
-
Submit files through Gradescope
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |