Parameterized queries with PySpark

PySpark has always provided wonderful SQL and Python APIs for querying data. As of Databricks Runtime 12.1 and Apache Spark 3.4, parameterized queries support safe and expressive ways to query data with SQL using Pythonic programming paradigms.

This post explains how to construct parameterized queries with PySpark and when this is a good design pattern for your code.

Parameters are helpful for making your Spark code easier to reuse and test. They also encourage good coding practices. This post demonstrates the two different ways to parameterize PySpark queries:

  1. PySpark custom string formatting
  2. Parameter markers

Let's look at how to use both types of PySpark parameterized queries and explore why the built-in functionality is better than other alternatives.

Benefits of parameterized queries

Parameterized queries encourage the "don't repeat yourself" (DRY) pattern, make unit testing easier, and make SQL easier to reuse. They also prevent SQL injection attacks, which can pose security vulnerabilities.

It can be tempting to copy and paste large chunks of SQL when writing similar queries. Parameterized queries encourage abstracting patterns and writing code with the DRY pattern.

Parameterized queries are also easier to test. You can parameterize a query so it is easy to run on both production and test datasets.

In contrast, manually parameterizing SQL queries with Python f-strings is a poor alternative. Consider the following disadvantages:

  1. Python f-strings do not protect against SQL injection attacks.
  2. Python f-strings do not understand Python native objects such as DataFrames, columns, and special characters.

Let's look at how to parameterize queries with parameter markers, which protect your code from SQL injection vulnerabilities and support automatic type conversion of common PySpark instances in string format.

Parameterized queries with PySpark custom string formatting

Suppose you have the following data table called h20_1e9 with nine columns:

+-----+-----+------------+---+---+-----+---+---+---------+
|  id1|  id2|         id3|id4|id5|  id6| v1| v2|       v3|
+-----+-----+------------+---+---+-----+---+---+---------+
|id008|id052|id0000073659| 84| 89|82005|  5| 11|64.785802|
|id079|id037|id0000041462|  4| 35|28153|  1|  1|28.732545|
|id098|id031|id0000027269| 27| 38|13508|  5|  2|59.867875|
+-----+-----+------------+---+---+-----+---+---+---------+
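If you want to follow along, here's a minimal sketch, not part of the original post, that registers a tiny temporary view with the same schema using only the three sample rows shown above. The aggregate outputs below come from the full one-billion-row table, so they won't match results on this small sample.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for the h20_1e9 table, using only the sample rows shown above.
sample_rows = [
    ("id008", "id052", "id0000073659", 84, 89, 82005, 5, 11, 64.785802),
    ("id079", "id037", "id0000041462", 4, 35, 28153, 1, 1, 28.732545),
    ("id098", "id031", "id0000027269", 27, 38, 13508, 5, 2, 59.867875),
]
columns = ["id1", "id2", "id3", "id4", "id5", "id6", "v1", "v2", "v3"]

spark.createDataFrame(sample_rows, columns).createOrReplaceTempView("h20_1e9")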

You want to parameterize the following SQL query:

SELECT id1, SUM(v1) AS v1 
FROM h20_1e9 
WHERE id1 = "id089"
GROUP BY id1

You'd like to make it easy to run this query with different values of id1. Here's how to parameterize and run the query with different id1 values.

question = """SELECT id1, SUM(v1) AS v1 
FROM h20_1e9 
WHERE id1 = {id1_val} 
GROUP BY id1"""

spark.sql(question, id1_val="id016").present()

+-----+------+
|  id1|    v1|
+-----+------+
|id016|298268|
+-----+------+

Now rerun the query with another argument:

spark.sql(question, id1_val="id018").present()

+-----+------+
|  id1|    v1|
+-----+------+
|id018|300446|
+-----+------+

The PySpark string formatter also lets you execute SQL queries directly on a DataFrame without explicitly defining temporary views.

Suppose you have the following DataFrame called person_df:

+---------+--------+
|firstname| country|
+---------+--------+
|    frank|     usa|
|   sourav|   india|
|    rahul|   india|
|      sim|bulgaria|
+---------+--------+

Here's how to query the DataFrame with SQL.

spark.sql(
    "select country, count(*) as num_ppl from {person_df} group by country",
    person_df=person_df,
).show()

+--------+-------+
| country|num_ppl|
+--------+-------+
|     usa|      1|
|   india|      2|
|bulgaria|      1|
+--------+-------+

Running queries on a DataFrame using SQL syntax without having to manually register a temporary view is very nice!
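The formatter also accepts several arguments in one call. Here's a rough sketch (the keyword names people and c are assumptions for illustration) that references the DataFrame and a plain Python string together; the DataFrame is registered behind the scenes and the string is substituted as a quoted SQL literal:

# Combine a DataFrame reference and a string value in one formatted query.
spark.sql(
    "select firstname from {people} where country = {c}",
    people=person_df,
    c="india",
).show()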

Let's now see how to parameterize queries with arguments in parameter markers.

Parameterized queries with parameter markers

You can also use a dictionary of arguments to formulate a parameterized SQL query with parameter markers.

Suppose you have the following view named some_purchases:

+-------+------+-------------+
|   item|amount|purchase_date|
+-------+------+-------------+
|  socks|  7.55|   2022-05-15|
|handbag| 49.99|   2022-05-16|
| shorts|  25.0|   2023-01-05|
+-------+------+-------------+
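To run the next queries yourself, you could build the view with a sketch like this, an assumption using only the three rows shown above (the totals below imply additional purchases that aren't shown):

import datetime

# Register a small some_purchases view from the sample rows above.
purchases = [
    ("socks", 7.55, datetime.date(2022, 5, 15)),
    ("handbag", 49.99, datetime.date(2022, 5, 16)),
    ("shorts", 25.0, datetime.date(2023, 1, 5)),
]

spark.createDataFrame(purchases, ["item", "amount", "purchase_date"]).createOrReplaceTempView(
    "some_purchases"
)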

Here's how to make a parameterized query with named parameter markers to calculate the total amount spent on a given item.

question = "SELECT merchandise, sum(quantity) from some_purchases group by merchandise having merchandise = :merchandise"	

Compute the total amount spent on socks.

spark.sql(
    query,
    args={"item": "socks"},
).show()

+-----+-----------+
| item|sum(amount)|
+-----+-----------+
|socks|      32.55|
+-----+-----------+

You can also parameterize queries with unnamed parameter markers; see the documentation for more information.
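As a rough sketch, unnamed parameter markers use ? placeholders and take their values positionally from a list passed to args. This form assumes a Spark version where args accepts a list (Spark 3.5 and later):

# Same aggregation as above, but with a positional (unnamed) parameter marker.
query = "SELECT item, sum(amount) from some_purchases group by item having item = ?"

spark.sql(query, args=["socks"]).show()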

Apache Spark sanitizes parameter markers, so this parameterization approach also protects you from SQL injection attacks.

How PySpark sanitizes parameterized queries

Here's a high-level description of how Spark sanitizes named parameterized queries:

  • The SQL query arrives with an optional key/value parameters list.
  • Apache Spark parses the SQL query and replaces the parameter references with corresponding parse tree nodes.
  • During analysis, a Catalyst rule runs to replace these references with the provided parameter values.
  • This approach protects against SQL injection attacks because it only supports literal values. Regular string interpolation applies substitution on the SQL string, which can be vulnerable to attacks if the string contains SQL syntax other than the intended literal values. The sketch after this list illustrates the point.
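For example, a value that smuggles in SQL syntax is still bound as one opaque string literal. Here's a minimal sketch against the some_purchases view from above (the suspicious value is made up for illustration):

# The whole value is bound as a single string literal, so it cannot change the
# structure of the query; it simply matches no item and returns no rows.
suspicious_value = "socks' OR '1'='1"

spark.sql(
    "SELECT item, sum(amount) from some_purchases group by item having item = :item",
    args={"item": suspicious_value},
).show()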

As previously mentioned, there are two types of parameterized queries supported in PySpark:

The {} syntax does a string substitution on the SQL query on the client side for ease of use and better programmability. However, it does not protect against SQL injection attacks since the query text is substituted before being sent to the Spark server.

Parameterization uses the args argument of the sql() API and passes the SQL text and parameters separately to the server. The SQL text gets parsed with the parameter placeholders, and the values specified in args are substituted into the analyzed query tree.

There are two flavors of server-side parameterized queries: named parameter markers and unnamed parameter markers. Named parameter markers use the :<param_name> syntax for placeholders. See the documentation for more information on how to use unnamed parameter markers.

Parameterized queries vs. string interpolation

You can also use regular Python string interpolation to parameterize queries, but it's not as convenient.

Here's how we would have to parameterize our earlier query with Python f-strings:

some_df.createOrReplaceTempView("no matter")
the_date = "2021-01-01"
min_value = "4.0"
table_name = "no matter"

question = f"""SELECT * from {table_name}
WHERE the_date > '{the_date}' AND quantity > {min_value}"""
spark.sql(question).present()

This isn't as nice for the following reasons:

  • It requires creating a temporary view.
  • We need to represent the date as a string, not a Python date.
  • We need to wrap the date in single quotes in the query to format the SQL string properly.
  • This doesn't protect against SQL injection attacks. For a parameterized alternative, see the sketch after this list.
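For contrast, here's a rough equivalent using the DataFrame reference plus named parameter markers, a sketch built only from the names introduced above (some_df, the_date, amount) and assuming the_date is a DATE column. There is no temporary view, the date stays a real datetime.date, and the server binds both values as literals:

import datetime

# Reference the DataFrame directly and bind typed literals on the server side.
spark.sql(
    "SELECT * from {some_df} WHERE the_date > :the_date AND amount > :min_value",
    some_df=some_df,
    args={"the_date": datetime.date(2021, 1, 1), "min_value": 4.0},
).show()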

In sum, the built-in query parameterization capabilities are safer and more effective than string interpolation.

Conclusion

PySpark parameterized queries give you new capabilities to write clean code with familiar SQL syntax. They're convenient when you want to query a Spark DataFrame with SQL. They let you use common Python data types like floating point values, strings, dates, and datetimes, which automatically convert to SQL values under the hood. In this way, you can leverage common Python idioms and write beautiful code.

Start leveraging PySpark parameterized queries today, and you will immediately enjoy the benefits of a higher-quality codebase.
