Convert the Character Set/Encoding of a String field in a PySpark DataFrame on Databricks

TL;DR

When defining your PySpark DataFrame using spark.read, use the .withColumn() function to override the contents of the affected column. Use the encode function from the pyspark.sql.functions library to change the character set encoding of the column.
from pyspark.sql import functions as F

dataFrame = (
    spark.read.json(varFilePath)
    .withColumn("affectedColumnName", F.encode("affectedColumnName", "utf-8"))
)

Scenario

The scenario where this would be needed is quite simple:
  • You have a source dataset whose string data is stored in a specific character set encoding, e.g. UTF-16.
  • You need to read that data using PySpark in a Databricks Notebook, and write it to a target system.
  • The target system requires the string data imported into it to conform to a different character set encoding, e.g. UTF-8.
  • During the data wrangling operations in the notebook, you need to ensure the data is converted so that it imports successfully into the target system (a sketch of this flow follows below).
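
As a rough sketch of that end-to-end flow (the paths, column name, and reader options below are assumptions for illustration, and will depend on how your source files were produced):

from pyspark.sql import functions as F

# Read the UTF-16 encoded source; the "encoding" option tells the JSON reader
# how to interpret the raw bytes (UTF-16 sources typically also need multiLine
# or an explicit lineSep, depending on how the files were written)
sourceDf = (
    spark.read
    .option("multiLine", True)
    .option("encoding", "UTF-16")
    .json("/mnt/source/customers.json")   # hypothetical path
)

# Convert the affected string column to UTF-8 bytes during wrangling
targetDf = sourceDf.withColumn(
    "affectedColumnName",                 # hypothetical column name
    F.encode("affectedColumnName", "utf-8")
)

# Write the converted data out for the target system to pick up
targetDf.write.mode("overwrite").parquet("/mnt/target/customers")   # hypothetical path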

What is a Character Set and what is a Character Encoding?

Character Set

A Character Set is a collection of characters that can be grouped together for various purposes such as forming sentences in a book.
A single character is the smallest unit of text that has semantic value on its own.
For example, the letter A is a character, the digit 5 is a character, and the symbol @ is a character.
However, the number 10 is a combination of the 1 and 0 characters.
An emoji is also classified as a single character. 
 
The same set of characters can be shared by many languages. For example, the Latin character set is used by many modern languages such as Italian, Spanish, Swedish and Norwegian.
The Cyrillic character set is used by Slavic languages such as Russian and Ukrainian.
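
As a small illustration (plain Python, using the standard unicodedata module), each character has its own identity within a character set, and "10" really is two characters:

import unicodedata

print(len("10"))              # 2 – the string "10" is two separate characters
print(unicodedata.name("A"))  # LATIN CAPITAL LETTER A
print(unicodedata.name("Я"))  # CYRILLIC CAPITAL LETTER YA
print(unicodedata.name("5"))  # DIGIT FIVE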

Character Encoding

Character Encoding is the mapping between the bytes that the computer system reads and understands, and the characters in the character set that we humans read and understand.
 
The encoding describes how each character's numeric code point is broken down into bytes so the computer can store and transmit it. The table below lists a few Unicode characters, their code points, and their UTF-8 representations; a short worked example follows the table.

| Character | Code point | Binary code point | Binary UTF-8 | Hex UTF-8 |
| --- | --- | --- | --- | --- |
| $ | U+0024 | 010 0100 | 00100100 | 24 |
| ¢ | U+00A2 | 000 1010 0010 | 11000010 10100010 | C2 A2 |
| ह | U+0939 | 0000 1001 0011 1001 | 11100000 10100100 10111001 | E0 A4 B9 |
| € | U+20AC | 0010 0000 1010 1100 | 11100010 10000010 10101100 | E2 82 AC |
| 한 | U+D55C | 1101 0101 0101 1100 | 11101101 10010101 10011100 | ED 95 9C |
| 𐍈 | U+10348 | 0 0001 0000 0011 0100 1000 | 11110000 10010000 10001101 10001000 | F0 90 8D 88 |
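
A quick way to see this mapping for yourself is plain Python's built-in string encoding, which reproduces a few rows of the table above:

# Print the code point and UTF-8 byte sequence of a few characters
for ch in ["$", "¢", "€", "한"]:
    code_point = ord(ch)              # the character's Unicode code point
    utf8_bytes = ch.encode("utf-8")   # the same character as UTF-8 bytes
    print(f"{ch}  U+{code_point:04X}  UTF-8: {utf8_bytes.hex(' ').upper()}")

# Output:
# $  U+0024  UTF-8: 24
# ¢  U+00A2  UTF-8: C2 A2
# €  U+20AC  UTF-8: E2 82 AC
# 한  U+D55C  UTF-8: ED 95 9C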

In general conversation, the terms character set and character encoding are often used interchangeably.

Analogy

If I, an Afrikaans-speaking South African, send a text message to a friend who only speaks and reads English, they would use an Afrikaans-to-English dictionary to look up each word I sent and find the corresponding English word in order to understand what I texted them.
 
In this scenario, we are using the same character set: the Latin character set. However, our encodings are different. I am on the Afrikaans encoding and they are on the English encoding, so they need the dictionary to convert from the Afrikaans encoding to the English encoding to understand what I am saying to them.

.withColumn() explained

The PySpark .withColumn() function is a transformation applied to a DataFrame. It is used to change the contents or values of an existing column, change a column's data type, create a new column, and so on.
 
Since this is a function we call on an existing DataFrame, the column you specify must already exist in that DataFrame in order to be overridden. If it does not exist, it will be added as a new column to the resulting DataFrame (both behaviours are shown in the example after the parameter list below).
 
Parameters of the function
  1. columnName (string)
    This is the string representation of the column you wish to operate on.
  2. columnExpression
    This is a PySpark Column expression that produces the resulting scalar value for each record in the DataFrame.
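
Here is a small illustration of both behaviours (the column names and sample data are hypothetical): overriding an existing column and adding a brand-new one use the same call.

from pyspark.sql import functions as F

df = spark.createDataFrame([("hello",), ("world",)], ["greeting"])

# "greeting" already exists, so its contents are overridden
df = df.withColumn("greeting", F.upper(F.col("greeting")))

# "greeting_length" does not exist yet, so it is added as a new column
df = df.withColumn("greeting_length", F.length(F.col("greeting")))

df.show()
# +--------+---------------+
# |greeting|greeting_length|
# +--------+---------------+
# |   HELLO|              5|
# |   WORLD|              5|
# +--------+---------------+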

Performance Note

  • This method introduces a projection internally. 
  • Calling it multiple times, for instance inside a loop to add multiple columns, generates large execution plans.
  • This can cause performance bottlenecks and sometimes a StackOverflowException.
  • To avoid this, the PySpark documentation recommends using select() with multiple columns at once, as sketched below.
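
A sketch of the difference, assuming a DataFrame df and some hypothetical column names: the loop stacks one projection per call, while the single select() builds one projection for all the columns.

from pyspark.sql import functions as F

string_columns = ["first_name", "last_name", "city"]   # hypothetical column names

# Slower pattern: each withColumn call adds another projection to the plan
looped = df
for c in string_columns:
    looped = looped.withColumn(c, F.encode(c, "utf-8"))

# Recommended pattern: one select() that converts all affected columns at once
selected = df.select(
    *[F.encode(c, "utf-8").alias(c) if c in string_columns else F.col(c)
      for c in df.columns]
)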

pyspark.sql.functions.encode

The pyspark.sql.functions.encode function computes a binary value from a string column using the provided character set (such as 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE' or 'UTF-16').
 
In simple terms, it takes a string value and converts it to the bytes of the specified character set encoding, so the resulting column is of binary type.
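
A minimal sketch of encode in isolation (the column name and sample data are hypothetical); note that the encoded column comes back as binary, and that pyspark.sql.functions.decode performs the reverse conversion:

from pyspark.sql import functions as F

df = spark.createDataFrame([("café",), ("naïve",)], ["word"])

encoded = df.withColumn("word_utf8", F.encode(F.col("word"), "utf-8"))
encoded.printSchema()
# root
#  |-- word: string (nullable = true)
#  |-- word_utf8: binary (nullable = true)

# decode() interprets the bytes as UTF-8 and turns them back into a string
roundtrip = encoded.withColumn("word_again", F.decode(F.col("word_utf8"), "utf-8"))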

How do these functions resolve the problem?

from pyspark.sql import functions as F

dataFrame = (
    spark.read.json(varFilePath)
    .withColumn("affectedColumnName", F.encode("affectedColumnName", "utf-8"))
)
In this example, the "affectedColumnName" column has the wrong character set encoding for our target system.
 
We are using the .withColumn() function on the DataFrame to overwrite the contents of the column.
 
The new contents are the output of the F.encode() function.
 
In that function, we are converting the data to the UTF-8 character set encoding.
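
Before handing the data to the target system, you can sanity-check that the conversion took effect (varFilePath and the column name come from the example above):

# The affected column should now be of binary type, holding UTF-8 encoded bytes
dataFrame.printSchema()

# Peek at a few converted values
dataFrame.select("affectedColumnName").show(5, truncate=False)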

T-SQL Variant

As I come from a SQL background, I feel obliged to show the comparable method one would use in T-SQL.
SELECT AffectedColumnName COLLATE SQL_Latin1_General_CP1_CI_AS
FROM TableName
In this code example, we are reading data from the TableName table. 
 
AffectedColumnName is being converted to a new character set encoding using the COLLATE clause.
 
In SQL Server, a collation is a set of configuration options that plays a similar role to a character set encoding.
 
Options it controls
  • Sorting Rules
  • Uppercase and lowercase preferences
  • Code pages to be used
  • Case Sensitivity
  • Accent Sensitivity
In this example, the SQL_Latin1_General_CP1_CI_AS collation has three segments
  • SQL_Latin1_General_CP1 – this is the base collation name, which defines the character set encoding (code page) that is used
  • CI – Case Insensitive
  • AS – Accent Sensitive

If you are not that strict on explicitly stating the target collation, an easier method is to simply use the database_default convention. This tells SQL Server that we don't care what the correct target collation needs to be; it knows, so it should just convert the incoming text to that collation and we will be happy.

SELECT AffectedColumnName COLLATE database_default
FROM TableName

If you like what I do, please consider supporting me on Ko-Fi.
