📅 Date & Time Functions

When working with dates and times in PySpark, the pyspark.sql.functions module provides a range of functions for parsing, formatting, manipulating, and querying date and time values.


Commonly Used Date Functions in PySpark

Here are some important date functions and their usage:
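
The snippets in this section assume a DataFrame named df that already contains the columns they reference (date_string, timestamp_string, date, timestamp, start_date, end_date). These names and values are purely illustrative; one possible setup sketch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, to_timestamp

# Create a local Spark session for experimentation
spark = SparkSession.builder.master("local").appName("Date Function Examples").getOrCreate()

# Illustrative sample data; any strings in these formats would work
df = spark.createDataFrame(
    [("2024-3-5", "2024-3-5 13:45:30", "2024-03-01", "2024-03-31")],
    ["date_string", "timestamp_string", "start_date", "end_date"]
)

# Derive proper date and timestamp columns from the string columns
df = df.withColumn("date", to_date(col("start_date"), "yyyy-MM-dd"))
df = df.withColumn("timestamp", to_timestamp(col("timestamp_string"), "yyyy-M-d HH:mm:ss"))

Each select(...) in the examples below returns a new DataFrame; chain .show() onto it to display the results.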

1. Parsing and Converting Dates

  • to_date(column, format): Converts a string to a date using the specified format.
  • to_timestamp(column, format): Converts a string to a timestamp.

Example:

from pyspark.sql.functions import to_date, to_timestamp
df.select(
    to_date(df["date_string"], "yyyy-M-d").alias("parsed_date"),
    to_timestamp(df["timestamp_string"], "yyyy-M-d HH:mm:ss").alias("parsed_timestamp")
)

2. Formatting Dates

  • date_format(date, format): Formats a date into a string with the specified pattern (e.g., "yyyy-M-d", "d/M/yyyy").

Example:

from pyspark.sql.functions import date_format
df.select(
    date_format(df["date"], "d-M-yyyy").alias("formatted_date")
)

3. Current Date and Time

  • current_date(): Returns the current date.
  • current_timestamp(): Returns the current timestamp.

Example:

from pyspark.sql.functions import current_date, current_timestamp
df.select(
    current_date().alias("today"),
    current_timestamp().alias("now")
)

4. Extracting Parts of Dates

  • year(): Extracts the year from a date.
  • month(): Extracts the month.
  • dayofmonth(): Extracts the day of the month.
  • dayofweek(): Returns the day of the week as an integer (1 = Sunday, 7 = Saturday).
  • weekofyear(): Extracts the week number of the year.

Example:

from pyspark.sql.functions import year, month, dayofmonth, dayofweek, weekofyear
df.select(
    year(df["date"]).alias("year"),
    month(df["date"]).alias("month"),
    dayofmonth(df["date"]).alias("day"),
    dayofweek(df["date"]).alias("weekday"),
    weekofyear(df["date"]).alias("week_number")
)

5. Date Arithmetic

  • datediff(end, start): Returns the number of days between two dates.
  • add_months(date, numMonths): Adds or subtracts months from a date.
  • date_add(date, days): Adds a specific number of days to a date.
  • date_sub(date, days): Subtracts a specific number of days from a date.

Example:

from pyspark.sql.functions import datediff, add_months, date_add, date_sub
df.select(
    datediff(df["end_date"], df["start_date"]).alias("days_diff"),
    add_months(df["date"], 3).alias("plus_3_months"),
    date_add(df["date"], 7).alias("plus_7_days"),
    date_sub(df["date"], 7).alias("minus_7_days")
)
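
Related to these, months_between(end, start), which also appears in the sample code at the end of this page, returns the number of months between two dates, including a fractional part for partial months. A small sketch, assuming the same df as above:

from pyspark.sql.functions import months_between
df.select(
    months_between(df["end_date"], df["start_date"]).alias("months_diff")
)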

6. Truncating Dates

  • trunc(date, format): Truncates a date to the specified unit ("year", "month", etc.); for example, truncating 2024-03-15 to "month" gives 2024-03-01.
  • date_trunc(format, timestamp): Truncates a timestamp to the specified unit ("hour", "day", "month", etc.), zeroing out all smaller fields.

Example:

from pyspark.sql.functions import trunc, date_trunc
df.select(
    trunc(df["date"], "month").alias("start_of_month"),
    date_trunc("hour", df["timestamp"]).alias("start_of_hour")
)

Sample Code

The following code demonstrates the usage of these functions and can be executed directly:

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, to_date, date_add, date_sub, datediff, months_between,
    current_date, current_timestamp, year, month, dayofmonth, date_format
)

# Create a Spark session
spark = SparkSession.builder.master("local").appName("Date Functions").getOrCreate()

# Sample data
data = [
    (1, "2024-11-10", "2024-11-20"),
    (2, "2023-10-05", "2023-10-15"),
    (3, "2022-05-01", "2022-05-10")
]

# Create a DataFrame
df = spark.createDataFrame(data, ["id", "order_date", "delivery_date"])

# Convert strings to date format
df = df.withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
df = df.withColumn("delivery_date", to_date(col("delivery_date"), "yyyy-MM-dd"))

# Add current date and timestamp cols
df = df.withColumn("current_date", current_date())
df = df.withColumn("current_timestamp", current_timestamp())

# Calculate the difference in days between delivery and order dates
df = df.withColumn("days_to_delivery", datediff(col("delivery_date"), col("order_date")))

# Add 5 days to the delivery date
df = df.withColumn("delivery_plus_5", date_add(col("delivery_date"), 5))

# Subtract 5 days from the delivery date
df = df.withColumn("delivery_minus_5", date_sub(col("delivery_date"), 5))

# Calculate the months between order and delivery dates
df = df.withColumn("months_between_order_delivery", months_between(col("delivery_date"), col("order_date")))

# Calculate the number of days elapsed between the delivery date and today
df = df.withColumn("days_since_delivery", datediff(current_date(), col("delivery_date")))

# Extract year, month, and day
df = df.withColumn("year", year(col("order_date")))
df = df.withColumn("month", month(col("order_date")))
df = df.withColumn("day", dayofmonth(col("order_date")))

# Format date as string
df = df.withColumn("formatted_date", date_format(col("order_date"), "MMMM dd, yyyy"))

# Show results
df.show(truncate=False)
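
# Optionally stop the Spark session when done
spark.stop()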

This code demonstrates how to apply various date functions in PySpark. Copy the code and try it out in our PySpark Online Compiler!