➕ Math Functions

PySpark provides a range of functions to perform arithmetic and mathematical operations, making it easier to manipulate numerical data. These functions are part of the pyspark.sql.functions module and can be applied to DataFrame columns.

Here we will go through the most commonly used functions. You can refer to the official PySpark documentation for the full list of functions.
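
One practical note before we start: several of these functions (abs, round, pow) share names with Python built-ins, so importing them by name shadows the built-in in your module. Importing the module under an alias avoids the ambiguity; a minimal sketch:

from pyspark.sql import functions as F

# Using the F alias keeps Spark's abs/round/pow distinct from Python's built-ins,
# e.g. df.withColumn("abs_val", F.abs(F.col("a")))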


Arithmetic Operations in PySpark

Arithmetic operations are straightforward and can be performed directly on DataFrame columns using Python operators.

| Operation | Example Syntax | Description |
| --- | --- | --- |
| Addition | df.withColumn("sum", df["a"] + df["b"]) | Adds values in columns a and b. |
| Subtraction | df.withColumn("diff", df["a"] - df["b"]) | Subtracts b from a. |
| Multiplication | df.withColumn("product", df["a"] * df["b"]) | Multiplies a and b. |
| Division | df.withColumn("quotient", df["a"] / df["b"]) | Divides a by b. |
| Modulo | df.withColumn("remainder", df["a"] % df["b"]) | Computes the remainder of a / b. |
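
These operators also work with plain Python literals, which Spark promotes to column expressions automatically. A minimal sketch, assuming a DataFrame df with numeric columns a and b:

from pyspark.sql.functions import lit

# Literals are promoted to Column expressions; lit() makes this explicit.
df = df.withColumn("a_plus_one", df["a"] + 1) \
       .withColumn("b_scaled", df["b"] * lit(2.5))

Note that under Spark's default (non-ANSI) settings, dividing by zero yields null rather than an error; with ANSI mode enabled, it raises one instead.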

Math Functions in PySpark

Basic Functions

| Function | Syntax Example | Description |
| --- | --- | --- |
| abs | df.withColumn("abs_val", abs(col("a"))) | Absolute value of a column. |
| round | df.withColumn("rounded", round(col("a"), 2)) | Rounds to 2 decimal places. |
| ceil | df.withColumn("ceil_val", ceil(col("a"))) | Rounds up to the nearest integer. |
| floor | df.withColumn("floor_val", floor(col("a"))) | Rounds down to the nearest integer. |
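
One detail worth calling out: ceil rounds toward positive infinity and floor toward negative infinity, which matters for negative values. A small illustrative sketch (the DataFrame and column name x are made up for this demo):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, ceil, floor

spark = SparkSession.builder.master("local").appName("Rounding Demo").getOrCreate()

df_neg = spark.createDataFrame([(-2.5,), (2.5,)], ["x"])

# ceil(-2.5) -> -2 (toward +inf); floor(-2.5) -> -3 (toward -inf)
df_neg.withColumn("ceil_x", ceil(col("x"))) \
      .withColumn("floor_x", floor(col("x"))) \
      .show()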

Exponential and Logarithmic Functions

| Function | Syntax Example | Description |
| --- | --- | --- |
| exp | df.withColumn("exp_val", exp(col("a"))) | Exponential value of a column. |
| log | df.withColumn("log_val", log(col("a"))) | Natural logarithm of a column. |
| log10 | df.withColumn("log10_val", log10(col("a"))) | Base-10 logarithm of a column. |
| pow | df.withColumn("power_val", pow(col("a"), 3)) | Raises column a to the power of 3. |
| sqrt | df.withColumn("sqrt_val", sqrt(col("a"))) | Square root of a column. |
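
The log function deserves a note: with one argument it returns the natural logarithm, and with two arguments it treats the first as the base. A brief sketch, assuming the same df and column a as in the table above:

from pyspark.sql.functions import col, exp, log

# One-argument form: natural log. Two-argument form: log(base, column).
df = df.withColumn("ln_a", log(col("a"))) \
       .withColumn("log2_a", log(2.0, col("a"))) \
       .withColumn("a_again", exp(log(col("a"))))  # exp inverts the natural log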

Trigonometric Functions

| Function | Syntax Example | Description |
| --- | --- | --- |
| sin | df.withColumn("sin_val", sin(col("a"))) | Sine of a column (in radians). |
| cos | df.withColumn("cos_val", cos(col("a"))) | Cosine of a column (in radians). |
| tan | df.withColumn("tan_val", tan(col("a"))) | Tangent of a column (in radians). |
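
Because these functions expect radians, angle values in degrees need converting first, which radians() handles. A minimal sketch (the DataFrame and column deg are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, radians, sin, cos

spark = SparkSession.builder.master("local").appName("Trig Demo").getOrCreate()

# Convert degrees to radians before applying the trig functions.
degrees_df = spark.createDataFrame([(0.0,), (90.0,), (180.0,)], ["deg"])
degrees_df.withColumn("sin_val", sin(radians(col("deg")))) \
          .withColumn("cos_val", cos(radians(col("deg")))) \
          .show()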

Examples

Example 1: Using Arithmetic Operations

Copy the code and try it out in our PySpark Online Compiler!

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local").appName("Arithmetic Example").getOrCreate()

# Sample data
data = [(1, 10, 3), (2, 20, 5), (3, 15, 4)]
df = spark.createDataFrame(data, ["id", "value1", "value2"])

# Perform arithmetic operations
df = df.withColumn("sum", col("value1") + col("value2")) \
       .withColumn("difference", col("value1") - col("value2")) \
       .withColumn("product", col("value1") * col("value2")) \
       .withColumn("quotient", col("value1") / col("value2")) \
       .withColumn("remainder", col("value1") % col("value2"))

df.show()

Output:

+---+------+------+---+----------+-------+------------------+---------+
| id|value1|value2|sum|difference|product|          quotient|remainder|
+---+------+------+---+----------+-------+------------------+---------+
|  1|    10|     3| 13|         7|     30|3.3333333333333335|        1|
|  2|    20|     5| 25|        15|    100|               4.0|        0|
|  3|    15|     4| 19|        11|     60|              3.75|        3|
+---+------+------+---+----------+-------+------------------+---------+

Example 2: Using Math Functions

from pyspark.sql.functions import abs, round, ceil, sqrt

# Apply math functions
df = df.withColumn("absolute_value1", abs(col("value1"))) \
       .withColumn("rounded_value2", round(col("value2"), 1)) \
       .withColumn("ceil_value1", ceil(col("value1"))) \
       .withColumn("sqrt_value2", sqrt(col("value2")))

# Select only the original and new columns for a readable display
df.select("id", "value1", "value2", "absolute_value1",
          "rounded_value2", "ceil_value1", "sqrt_value2").show()

Output:

+---+------+------+---------------+--------------+-----------+------------------+
| id|value1|value2|absolute_value1|rounded_value2|ceil_value1|       sqrt_value2|
+---+------+------+---------------+--------------+-----------+------------------+
|  1|    10|     3|             10|             3|         10|1.7320508075688772|
|  2|    20|     5|             20|             5|         20|  2.23606797749979|
|  3|    15|     4|             15|             4|         15|               2.0|
+---+------+------+---------------+--------------+-----------+------------------+

Example 3: Combining Arithmetic and Math Functions

from pyspark.sql.functions import pow, log10

# Combining functions
df = df.withColumn("power_value1", pow(col("value1"), 2)) \
       .withColumn("log10_value2", log10(col("value2")))

# Select only the relevant columns for a readable display
df.select("id", "value1", "value2", "power_value1", "log10_value2").show()

Output:

+---+------+------+------------+-------------------+
| id|value1|value2|power_value1|       log10_value2|
+---+------+------+------------+-------------------+
|  1|    10|     3|       100.0|0.47712125471966244|
|  2|    20|     5|       400.0| 0.6989700043360189|
|  3|    15|     4|       225.0| 0.6020599913279624|
+---+------+------+------------+-------------------+

Conclusion

This tutorial demonstrated how to use arithmetic and math functions in PySpark for data manipulation. By combining these functions, you can perform a variety of mathematical operations efficiently.

Copy the code and try it out in our PySpark Online Compiler to explore further!