Top PySpark Coding Questions for Data Engineering Roles

Solve the most common PySpark coding questions asked in interviews for Data Engineering, Data Analyst, and Data Science roles!
You can also check the most popular conceptual questions here.

1.

Load and Transform Data

Popular

Practice loading a CSV file and applying basic transformations such as selecting, filtering, and dropping columns.

2.

Handling Null Values

Clean the dataset by filtering out or replacing null values in various columns.

3.

Total Purchases by Customer

Group data by customer and compute the total purchase amount per user.

4.

Discounts on Products

Add a new column calculating discounted prices for products using arithmetic operations.

5.

Load and Transform a JSON File

Popular

Read a nested JSON file and flatten it using explode and array-handling techniques.

6.

Employee Earnings

Accenture

Use window functions to find employees whose salary is higher than the department average.

7.

Remove Duplicates From Dataset

Popular

Identify and remove duplicate records based on custom logic using window functions.

8.

Word Count Program in PySpark

Popular

Implement a word count logic using PySpark RDD transformations on a text file.

9.

Group By and Aggregate List

Tiger Analytics

Group records and aggregate values into lists using grouping and array functions.

10.

Monthly Transaction Summary

Summarize transactions month-wise by grouping and using date functions to extract months.

11.

Top Players Summary

ITC Infotech

Generate a summary of top players using joins, aggregations, and string operations.

12.

Daily Total Sales

Walmart

Calculate total sales for each store on a daily basis using grouping and aggregation.

13.

Top 5 Products by Sales

Walmart

Find the top 5 products with the highest total sales across all stores for a given day.

14.

Products with Increasing Sales

Deloitte

Given two years of product sales data, identify products whose total sales revenue has increased every year.

15.

Remove Outliers from Trip Data

VISA

Given a dataset of trip costs and customer ratings, remove rows that contain outliers.

Stay tuned - we're adding more interview questions!