🧹 Handling Nulls & Missing Data

Working with missing values is one of the most common tasks in data engineering. PySpark provides several useful functions to clean, replace, or drop null values.

Practice Question

Read the tutorial below, then try solving the practice problem for hands-on experience.


1. na.fill()

Fill missing values with a specific value.

Example

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, None, "A"),
    (2, 20, None),
    (3, None, "C")
], ["id", "age", "category"])

df_filled = df.na.fill({
    "age": 0,
    "category": "Unknown"
})
df_filled.show()

Output

+---+---+--------+
| id|age|category|
+---+---+--------+
|  1|  0|       A|
|  2| 20| Unknown|
|  3|  0|       C|
+---+---+--------+

2. dropna()

Drop rows containing missing values.

Example

df_dropped = df.dropna()   # drops rows with ANY null
df_dropped.show()

Output

Every row in this df contains at least one null, so dropna() removes them all:

+---+---+--------+
| id|age|category|
+---+---+--------+
+---+---+--------+

Drop only if ALL values are null

df.dropna(how="all")

Drop if specific columns contain null

df.dropna(subset=["age"])

3. na.replace()

Replace specific values (not only nulls).

Example

df_replace = df.na.replace({
    "A": "Category-A",
    "C": "Category-C"
})
df_replace.show()

Output

+---+----+-----------+
| id| age|   category|
+---+----+-----------+
|  1|null| Category-A|
|  2|  20|       null|
|  3|null| Category-C|
+---+----+-----------+

4. where() / filter()

Filter rows based on conditions, including null and non-null values.

where() and filter() are functionally identical in PySpark.

Filter rows where column is NOT null

df_filtered = df.where(df.age.isNotNull())
df_filtered.show()

Output

+---+---+--------+
| id|age|category|
+---+---+--------+
|  2| 20|    null|
+---+---+--------+

Filter rows where column IS null

df.where(df.category.isNull()).show()

Filter using SQL-style condition

df.filter("age IS NOT NULL AND category IS NOT NULL").show()

Filter after filling nulls

df.na.fill({"age": 0}).where("age > 0").show()

Summary

| Function           | Purpose                                   |
|--------------------|-------------------------------------------|
| na.fill()          | Fill missing values with constants        |
| dropna()           | Remove rows with null values              |
| na.replace()       | Replace specific values in the DataFrame  |
| where() / filter() | Filter rows using conditions (null-safe)  |

Best Practices

  • Use dropna() carefully → you may lose important data.
  • Use na.fill() for numeric columns → fill with 0 or the mean.
  • Use na.fill() for string columns → fill with "Unknown".
  • Use na.replace() for value corrections (not only nulls).
  • Use where() / filter() when you need conditional control instead of blindly dropping data.