🧹 Handling Nulls & Missing Data
Working with missing values is one of the most common tasks in data engineering. PySpark provides several useful functions to clean, replace, or drop null values.
1. na.fill()
Fill missing values with a specific value.
Example
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, None, "A"),
    (2, 20, None),
    (3, None, "C")
], ["id", "age", "category"])

df_filled = df.na.fill({
    "age": 0,
    "category": "Unknown"
})
df_filled.show()
Output
+---+---+--------+
| id|age|category|
+---+---+--------+
|  1|  0|       A|
|  2| 20| Unknown|
|  3|  0|       C|
+---+---+--------+
2. dropna()
Drop rows containing missing values.
Example
df_dropped = df.dropna()  # drops rows with ANY null
df_dropped.show()
Output
Every row in the sample contains at least one null, so the result is empty:
+---+---+--------+
| id|age|category|
+---+---+--------+
+---+---+--------+
Drop only if ALL values are null
df.dropna(how="all")
Drop if specific columns contain null
df.dropna(subset=["age"])
3. na.replace()
Replace specific values (not only nulls).
Example
df_replace = df.na.replace({
"A": "Category-A",
"C": "Category-C"
})
df_replace.show()
Output
+---+----+-----------+
| id| age|   category|
+---+----+-----------+
|  1|null| Category-A|
|  2|  20|       null|
|  3|null| Category-C|
+---+----+-----------+
4. where() / filter()
Filter rows based on conditions, including null and non-null values.
where() and filter() are functionally identical in PySpark.
Filter rows where column is NOT null
df_filtered = df.where(df.age.isNotNull())
df_filtered.show()
Output
+---+---+--------+
| id|age|category|
+---+---+--------+
|  2| 20|    null|
+---+---+--------+
Filter rows where column IS null
df.where(df.category.isNull()).show()
Filter using SQL-style condition
df.filter("age IS NOT NULL AND category IS NOT NULL").show()
Filter after filling nulls
df.na.fill({"age": 0}).where("age > 0").show()
Summary
| Function | Purpose |
|---|---|
| na.fill() | Fill missing values with constants |
| dropna() | Remove rows with null values |
| na.replace() | Replace specific values in the DataFrame |
| where() / filter() | Filter rows using conditions (null-safe) |
Best Practices
- Use dropna() carefully → you may lose important data.
- Use na.fill() for numeric columns → fill with 0 or the mean.
- Use na.fill() for string columns → fill with "Unknown".
- Use na.replace() for value corrections (not only nulls).
- Use where() / filter() when you need conditional control instead of blindly dropping data.