🔤 String Functions

String functions in PySpark let you manipulate and process textual data. They are particularly useful for cleaning data, extracting information, and transforming text columns. If you know SQL, many of these functions will feel familiar, but PySpark provides a Pythonic interface through the pyspark.sql.functions module.

Below, we cover some of the most commonly used string functions in PySpark, with examples that show how to apply them using the withColumn method.

You can refer to the official pyspark.sql.functions documentation for the full list of functions.
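
The examples in the following sections operate on a DataFrame named df. Here is a minimal setup sketch (the app name and sample values are assumptions chosen to match the snippets; each section also assumes any additional columns it references, such as username or account_id, already exist):

from pyspark.sql import SparkSession

# Minimal session and sample DataFrame assumed by the examples below
spark = SparkSession.builder.appName("StringFunctionsExamples").getOrCreate()
df = spark.createDataFrame(
    [("John", "Doe", "john.doe@example.com")],
    ["first_name", "last_name", "email"],
)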


Commonly Used String Functions in PySpark

1. Basic Operations

  • concat(*cols): Concatenates multiple columns or strings into one.
  • concat_ws(delimiter, *cols): Concatenates columns or strings with a specified delimiter.

Example:

from pyspark.sql.functions import concat, concat_ws

# Concatenate first name and last name (no separator)
df = df.withColumn("full_name", concat(df.first_name, df.last_name))

# Concatenate first name and last name with a space separator
df = df.withColumn("full_name_with_space", concat_ws(" ", df.first_name, df.last_name))

2. Substring and Extraction

  • substring(col, pos, length): Extracts a substring from a column.
  • substr(col, pos, length): Alias for substring (added to pyspark.sql.functions in Spark 3.5; also available as the Column.substr method).
  • regexp_extract(col, pattern, groupIdx): Extracts a match from a string using a regex pattern.

Example:

from pyspark.sql.functions import substring, regexp_extract

# Extract the first 5 characters of the full name
df = df.withColumn("name_substring", substring(df.full_name, 1, 5))

# Extract domain from email using regex
df = df.withColumn("email_domain", regexp_extract(df.email, r"@(\w+)", 1))

3. Search and Replace

  • instr(col, substring): Returns the 1-based position of the first occurrence of a substring (0 if not found).
  • regexp_replace(col, pattern, replacement): Replaces substrings matching a regex pattern.

Example:

from pyspark.sql.functions import instr, regexp_replace

# Find the position of the first occurrence of 'error'
df = df.withColumn("error_position", instr(df.description, "error"))

# Replace all digits in the text column with empty string
df = df.withColumn("text_without_digits", regexp_replace(df.text, r"[0-9]", ""))

4. Case Conversion

  • upper(col): Converts text to uppercase.
  • lower(col): Converts text to lowercase.
  • initcap(col): Capitalizes the first letter of each word.

Example:

from pyspark.sql.functions import upper, lower, initcap

# Convert first name to uppercase
df = df.withColumn("first_name_upper", upper(df.first_name))

# Convert first name to lowercase
df = df.withColumn("first_name_lower", lower(df.first_name))

# Capitalize the first letter of each word in the full name
df = df.withColumn("name_initcap", initcap(df.full_name))

5. Trimming and Padding

  • trim(col): Removes leading and trailing spaces.
  • ltrim(col): Removes leading spaces.
  • rtrim(col): Removes trailing spaces.
  • lpad(col, length, pad): Pads a string with characters on the left.
  • rpad(col, length, pad): Pads a string with characters on the right.

Example:

from pyspark.sql.functions import trim, lpad, rpad

# Remove leading and trailing spaces from the username
df = df.withColumn("trimmed_username", trim(df.username))

# Pad account ID to the left with zeros to make length 10
df = df.withColumn("padded_account_id", lpad(df.account_id, 10, "0"))

# Pad the code to the right with "#" to make length 8
df = df.withColumn("padded_code", rpad(df.code, 8, "#"))

6. Length and Comparison

  • length(col): Returns the length of a string.
  • levenshtein(col1, col2): Calculates the Levenshtein distance (edit distance) between two strings.

Example:

from pyspark.sql.functions import length, levenshtein

# Get the length of the password column
df = df.withColumn("password_length", length(df.password))

# Compare two strings and compute their Levenshtein distance
df = df.withColumn("levenshtein_distance", levenshtein(df.string1, df.string2))

Sample Code for PySpark Online Compiler

Below is a sample code snippet that can be executed in our PySpark Online Compiler:

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, concat, concat_ws, substring, upper, lower, initcap, 
    trim, ltrim, rtrim, regexp_replace, regexp_extract, length, 
    instr, lpad, rpad
)

# Sample data
data = [
    ("John", "Doe", "john.doe@example.com", "   active   ", "12345"),
    ("Jane", "Smith", "jane.smith@work.org", "inactive", "67890"),
    ("Sam", "Brown", "sam.brown@data.net", "active   ", "111213")
]

# Create DataFrame
spark = SparkSession.builder.appName("StringFunctionsWithColumn").getOrCreate()
df = spark.createDataFrame(data, ["first_name", "last_name", "email", "status", "account_id"])

# Applying string functions using withColumn
df_transformed = df \
    .withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name"))) \
    .withColumn("email_uppercase", upper(col("email"))) \
    .withColumn("email_domain", regexp_extract(col("email"), r"@(\w+)", 1)) \
    .withColumn("trimmed_status", trim(col("status"))) \
    .withColumn("padded_account_id", lpad(col("account_id"), 10, "0")) \
    .withColumn("email_prefix", substring(col("email"), 1, 5)) \
    .withColumn("cleaned_email", regexp_replace(col("email"), r"[.@]", "-")) \
    .withColumn("email_length", length(col("email"))) \
    .withColumn("first_name_initcap", initcap(col("first_name"))) \
    .withColumn("description", concat_ws(" | ", col("full_name"), col("trimmed_status"), col("email_domain")))

# Show the resulting DataFrame
df_transformed.show(truncate=False)

This code demonstrates various string functions and their practical applications in data processing. You can run this sample code directly in our PySpark online compiler for hands-on practice.