
TL;DR

```python
import pyspark.sql.functions as F

dataFrame = (
    spark.read.json(varFilePath)
    .withColumn("affectedColumnName", F.encode("affectedColumnName", "utf-8"))
)
```
Scenario
- You have a source dataset which is configured to store its string data in a specific Character Set Encoding, e.g. utf-16.
- You need to read that data using PySpark in a Databricks notebook and write it to a target system.
- The target system requires the string data imported into it to conform to a different Character Set Encoding, e.g. utf-8.
- During the data wrangling operations in the notebook, you need to ensure the data is converted so that it imports successfully into the target system.
What is a Character Set and what is a Character Encoding?
Character Set

A character set is the collection of characters a system can represent, with each character assigned a numeric code point; Unicode, for example, assigns the Euro sign € the code point U+20AC.

Character Encoding

A character encoding is the scheme that maps those code points to the actual bytes that are stored or transmitted; utf-8 and utf-16 are two different encodings of the same Unicode character set, as the table below illustrates for utf-8.
| Character | Binary code point | Binary UTF-8 | Hex UTF-8 |
| --- | --- | --- | --- |
| $ - U+0024 | 010 0100 | 00100100 | 24 |
| ¢ - U+00A2 | 000 1010 0010 | 11000010 10100010 | C2 A2 |
| ह - U+0939 | 0000 1001 0011 1001 | 11100000 10100100 10111001 | E0 A4 B9 |
| € - U+20AC | 0010 0000 1010 1100 | 11100010 10000010 10101100 | E2 82 AC |
| 한 - U+D55C | 1101 0101 0101 1100 | 11101101 10010101 10011100 | ED 95 9C |
| 𐍈 - U+10348 | 0 0001 0000 0011 0100 1000 | 11110000 10010000 10001101 10001000 | F0 90 8D 88 |
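
As a quick sanity check, the same byte sequences can be reproduced with plain Python; this is a minimal sketch for illustration only and is not part of the original pipeline.

```python
# Reproduce the "Hex UTF-8" column of the table above.
for char in ["$", "¢", "ह", "€", "한", "𐍈"]:
    utf8_bytes = char.encode("utf-8")  # encode the character to its UTF-8 byte sequence
    print(f"{char} U+{ord(char):04X} -> {utf8_bytes.hex(' ').upper()}")
# Prints, among others: € U+20AC -> E2 82 AC
```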
Analogy
.withColumn() explained
- columnName (string)
  This is the string representation of the column you wish to operate on.
- columnExpression
  This is a PySpark-compatible column expression that returns scalar data as the resulting value per record in the dataframe (see the sketch below).
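
For illustration, a minimal sketch of the call shape; the dataframe and column names here are placeholders rather than anything from the original dataset.

```python
from pyspark.sql import functions as F

# Hypothetical example: add an upper-cased copy of an existing string column.
df_with_upper = df.withColumn(
    "customerNameUpper",                # columnName: the new (or replaced) column
    F.upper(F.col("customerName"))      # columnExpression: one scalar value per record
)
```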
Performance Note
- This method introduces a projection internally.
- Therefore, calling it multiple times, for instance inside a loop to add multiple columns, will generate large execution plans.
- This may cause performance bottlenecks and sometimes a StackOverflowException.
- To avoid this, the PySpark documentation recommends using select() with multiple columns at once, as sketched below.
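
A sketch of that recommendation; the dataframe df and the column list are placeholders for whatever columns are affected in your dataset.

```python
from pyspark.sql import functions as F

columns_to_encode = ["firstName", "lastName", "address"]  # hypothetical affected columns

# Avoid: one withColumn call per loop pass keeps growing the execution plan.
# for c in columns_to_encode:
#     df = df.withColumn(c, F.encode(F.col(c), "utf-8"))

# Prefer: a single select() that projects every column once.
df_encoded = df.select(
    *[c for c in df.columns if c not in columns_to_encode],
    *[F.encode(F.col(c), "utf-8").alias(c) for c in columns_to_encode],
)
```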
pyspark.sql.functions.encode

encode(col, charset) computes a binary value from a string column using the provided character set; the supported charsets are 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE' and 'UTF-16'.
How do these functions resolve the problem?

By applying encode to the affected column while the data is still in the dataframe, the values are converted to the target character set before they are written out.

```python
import pyspark.sql.functions as F

dataFrame = (
    spark.read.json(varFilePath)
    .withColumn("affectedColumnName", F.encode("affectedColumnName", "utf-8"))
)
```
T-SQL Variant
```sql
SELECT AffectedColumnName COLLATE SQL_Latin1_General_CP1_CI_AS
FROM TableName
```
In SQL Server, a collation controls:
- Sorting Rules
- Uppercase and lowercase preferences
- Code pages to be used
- Case Sensitivity
- Accent Sensitivity
Breaking down the collation name SQL_Latin1_General_CP1_CI_AS:
- SQL_Latin1_General_CP1 – the base collation name, which defines the character set encoding (code page) that is used
- CI – Case Insensitive
- AS – Accent Sensitive
If you are not that strict on explicitly stating the target collation, an easier method is to simply use the database_default convention. This tells SQL Server that you do not need to know the correct target collation yourself; the database already does, so the incoming text is converted to the database's default collation.
```sql
SELECT AffectedColumnName COLLATE database_default
FROM TableName
```
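
To tie the two approaches together, the re-encoded dataframe can be written to the SQL Server target, where the column's collation handles any remaining conversion. The connection details, secret scope and table name below are placeholders, not values from this post.

```python
# Rough sketch: write the converted dataframe to the SQL Server target over JDBC.
(
    dataFrame.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<server>:1433;databaseName=<database>")
    .option("dbtable", "dbo.TableName")
    .option("user", dbutils.secrets.get("my-scope", "sql-user"))        # hypothetical secret scope
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))
    .mode("append")
    .save()
)
```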
If you like what I do please consider supporting me on Ko-Fi