This page contains a collection of Spark pipeline transformation methods that we can use for different problems. Use this as a quick cheat sheet on how to do a particular operation on a Spark DataFrame or in PySpark.
Spark allows you to speed up analytic applications by up to 100 times compared to other technologies on the market today. Interfacing Spark with Python is easy with PySpark: this Spark Python API exposes the Spark programming model to Python. The PySpark Basics cheat sheet already showed you how to work with the most basic building blocks, RDDs. Having a good cheat sheet at hand can significantly speed up the development process, and one of the best cheat sheets I have come across is sparklyr's. (That sheet also covers connecting to a cluster over Livy, which is still experimental: install RStudio Server or RStudio Pro on one of the existing nodes, preferably an edge node, and locate the path to the cluster's Spark home directory, normally /usr/lib/spark.) For my work, I'm using Spark's DataFrame API in Scala to create data transformation pipelines, and these are some functions and design patterns that I've found to be extremely useful.

This PySpark SQL cheat sheet covers the basics of working with Apache Spark DataFrames in Python: from initializing the SparkSession to creating DataFrames, inspecting the data, handling duplicate values, querying, adding, updating or removing columns, grouping, filtering or sorting data. You'll also see how to run SQL queries programmatically, how to save your data to Parquet and JSON files, and how to stop your SparkSession.
These code snippets are tested on Spark 2.4.x and mostly work on Spark 2.3.x as well, but I'm not sure about older versions.
Read the partitioned json files from disk
This is applicable to all the supported file types.
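A minimal PySpark sketch; the path and partition column are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading the top-level directory picks up all partition sub-directories
# (e.g. /data/events/date=2019-01-01/part-*.json); 'date' becomes a column.
df = spark.read.json("/data/events/")

# The same pattern works for other supported formats, e.g.:
# spark.read.parquet("/data/events_parquet/")
```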
Save partitioned files into a single file.
Here we are merging all the partitions into one file and dumping it onto the disk. This happens at the driver node, so be careful with the size of the data set you are dealing with; otherwise the driver node may run out of memory.
Use the coalesce method to adjust the number of partitions based on our needs.
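A sketch of the idea, assuming the `df` from above (the output path is hypothetical):

```python
# coalesce(1) pulls everything into a single partition, so a single output
# file is written; the whole dataset is funnelled through one task, so be
# careful with very large data.
(
    df.coalesce(1)
      .write.mode("overwrite")
      .json("/data/events_single/")
)

# repartition(n) can be used instead when you want to increase (not just
# reduce) the number of partitions, at the cost of a full shuffle.
```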
Filter rows which meet particular criteria
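For example, a PySpark sketch with hypothetical column names:

```python
from pyspark.sql import functions as F

adults = df.filter(F.col("age") >= 21)
active = df.filter((F.col("status") == "active") & F.col("country").isin("DE", "FR"))

# SQL-style expressions work as well:
recent = df.filter("event_date >= '2019-01-01'")
```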
Map with case class
Use a case class if you want to map over multiple columns with a complex data structure.
Or use the Row class.
Use selectExpr to access inner attributes
This provides easy access to nested data structures like JSON, and lets you filter them using any existing UDFs, or write your own UDF for more flexibility.
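A PySpark sketch with hypothetical nested fields:

```python
flat = df.selectExpr(
    "payload.user.id as user_id",               # dotted paths reach into structs
    "payload.items[0].sku as first_sku",        # array indexing inside the expression
    "upper(payload.user.country) as country",   # any SQL function or registered UDF
)
```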
How to access RDD methods from pyspark side
Using standard RDD operations via the PySpark API isn't straightforward; to get them, we need to invoke .rdd to convert the DataFrame and access these features.
For example, here we are converting a sparse vector to a dense one and summing it column-wise.
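A sketch, assuming a DataFrame `df` with a vector column named `features`:

```python
# .rdd exposes the underlying RDD of Row objects
dense_sum = (
    df.select("features").rdd
      .map(lambda row: row["features"].toArray())  # SparseVector/DenseVector -> numpy array
      .reduce(lambda a, b: a + b)                   # element-wise, i.e. column-wise, sum
)
print(dense_sum)
```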
Pyspark Map on multiple columns
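A sketch of one way to do it, with hypothetical columns `first_name`, `last_name` and `age`:

```python
from pyspark.sql import Row

people = df.rdd.map(
    lambda r: Row(full_name=r["first_name"] + " " + r["last_name"], age=r["age"])
).toDF()
```

For simple cases like string concatenation, withColumn with built-in functions is usually cheaper than dropping down to the RDD API.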
Filtering a DataFrame column of type Seq[String]
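The Scala column type Seq[String] shows up as array<string> in PySpark; a sketch with a hypothetical `tokens` column:

```python
from pyspark.sql import functions as F

with_spark = df.filter(F.array_contains(F.col("tokens"), "spark"))
non_empty = df.filter(F.size(F.col("tokens")) > 0)
```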
Filter a column with custom regex and udf
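A sketch using a hypothetical `token` column and regex:

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

pattern = re.compile(r"^[a-z][a-z0-9_]*$")  # hypothetical pattern
matches = F.udf(lambda s: bool(pattern.match(s)) if s is not None else False, BooleanType())

clean = df.filter(matches(F.col("token")))

# For simple patterns, the built-in rlike is cheaper than a Python UDF:
# df.filter(F.col("token").rlike("^[a-z][a-z0-9_]*$"))
```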
Sum a column elements
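For a hypothetical `amount` column:

```python
from pyspark.sql import functions as F

total = df.agg(F.sum("amount").alias("total")).collect()[0]["total"]
```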
Remove Unicode characters from tokens

Sometimes we only need to work with ASCII text, so it's better to clean out the other characters.
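One way to do it with a Python UDF (the column name is hypothetical):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Drop everything outside the ASCII range, keeping the rest of the string intact
to_ascii = F.udf(
    lambda s: s.encode("ascii", "ignore").decode("ascii") if s is not None else None,
    StringType(),
)

df = df.withColumn("token_ascii", to_ascii(F.col("token")))
```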
Connecting to jdbc with partition by integer column
When using Spark to read data from a SQL database and then run the rest of the pipeline on it, it's recommended to partition the data according to the natural segments in the data, or at least on an integer column. That way Spark can fire multiple SQL queries to read the data from the SQL server and operate on it separately, with the results going into separate Spark partitions.
The commands below are in PySpark, but the APIs are the same for the Scala version as well.
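For example (the connection details and bounds are hypothetical):

```python
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")  # hypothetical URL
    .option("dbtable", "public.orders")
    .option("user", "report").option("password", "secret")
    .option("partitionColumn", "order_id")  # must be a numeric/date/timestamp column
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")           # 8 parallel queries, 8 Spark partitions
    .load()
)
```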
Parse nested json data
This is very helpful when working with PySpark and you want to pass deeply nested JSON data between the JVM and Python processes. Lately the Spark community relies on the Apache Arrow project to avoid multiple serialization/deserialization costs when sending data from Java memory to Python memory, or vice versa.
So to process the inner objects, you can make use of the getItem method to filter out the required parts of the object and pass them over to Python memory via Arrow. In the future, Arrow might support arbitrarily nested data, but right now it doesn't support complex nested formats. The general recommendation is to go without nesting.
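A sketch with hypothetical nested fields and path:

```python
from pyspark.sql import functions as F

nested = spark.read.json("/data/nested.json")  # hypothetical path

flat = nested.select(
    F.col("payload").getItem("user").getItem("id").alias("user_id"),  # struct fields
    F.col("payload.items").getItem(0).alias("first_item"),            # array element
)
```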
'string ⇒ array<string>' conversion
The type annotation .as[String] avoids relying on an assumed implicit conversion.
A crazy string collection and groupby
This is a stream of operations on a column of type Array[String] that collects the tokens and counts the n-gram distribution over all the tokens.
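The original is a Scala pipeline; a PySpark sketch of the same idea, assuming a hypothetical `tokens` column of array<string>, could look like this:

```python
from pyspark.ml.feature import NGram
from pyspark.sql import functions as F

# Build bigrams from the token arrays, then count their global distribution
bigrams = NGram(n=2, inputCol="tokens", outputCol="bigrams").transform(df)

ngram_dist = (
    bigrams.select(F.explode("bigrams").alias("bigram"))
           .groupBy("bigram").count()
           .orderBy(F.desc("count"))
)
```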
How to access AWS s3 on spark-shell or pyspark
Most of the time we need a cloud storage provider like S3 / GCS to read and write the data for processing. Very few keep an in-house HDFS to handle the data themselves; for the majority, I think cloud storage is easier to start with and you don't need to bother about size limitations.
Supply the aws credentials via environment variable
Supply the credentials via default aws ~/.aws/config file
Recent versions of the awscli expect their configuration to be kept under the ~/.aws/credentials file, but older versions look at the ~/.aws/config path. Spark 2.4.x looks at the ~/.aws/config location, since it ships with default Hadoop jars of version 2.7.x.
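A hedged sketch of wiring the keys through from PySpark via Spark's spark.hadoop.* configuration (bucket and values are placeholders; it assumes the hadoop-aws and AWS SDK jars are on the classpath):

```python
import os
from pyspark.sql import SparkSession

# Assumes AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY were exported in the shell
# before launching pyspark / spark-shell; spark.hadoop.* entries are passed
# straight into the Hadoop configuration used by the s3a connector.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

df = spark.read.json("s3a://my-bucket/events/")  # hypothetical bucket
```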
Set spark scratch space or tmp directory correctly
This might be required when working with a huge dataset that your machine can't hold entirely in memory for the given pipeline steps; in those cases the data will be spilled over to disk and saved in the tmp directory.
Set the properties below to ensure you have enough space in the tmp location.
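One place to set it, as a sketch (the path is hypothetical; in many deployments spark.local.dir is set in spark-defaults.conf or via the SPARK_LOCAL_DIRS environment variable instead):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.local.dir", "/mnt/large-disk/spark-tmp")  # hypothetical scratch path
    .getOrCreate()
)
```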
Pyspark doesn’t support all the data types.
When using Arrow to transport data between JVM and Python memory, Arrow may throw the error below if the types aren't compatible with the existing converters. Fixes may come in the future in the Arrow project. I'm keeping this here to document how PySpark gets data from the JVM and what can go wrong in that process.
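For context, a minimal sketch of the Arrow path that can hit this (the property name is the Spark 2.4 one, and the DataFrame is assumed to exist):

```python
# Enable Arrow-based transfer for toPandas() and pandas UDFs (Spark 2.4 property name)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Columns whose types have no Arrow converter (e.g. MapType in Spark 2.4) make
# this call raise an unsupported-type error or fall back to the slow path.
pdf = df.toPandas()
```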
Work with spark standalone cluster manager
Start the Spark cluster in standalone mode
Once you have downloaded the same version of the Spark binary across the machines, you can start the Spark master and slave processes to form the standalone Spark cluster. Or you could run both of these services on the same machine as well.
In standalone mode:
- A worker can have multiple executors.
- A worker is like a node manager in YARN.
- We can set the worker's max core and memory usage settings.
- When defining the Spark application via spark-shell or similar, define the executor memory and cores.
When submitting the job to get 10 executors with 1 CPU and 2 GB RAM each, size it as sketched below.
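A PySpark sketch of that sizing against a standalone master (the master URL is hypothetical; with spark-submit you would pass the matching --executor-cores, --executor-memory and --total-executor-cores flags):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master-host:7077")     # hypothetical standalone master URL
    .appName("standalone-sizing-example")
    .config("spark.executor.cores", "1")    # 1 CPU per executor
    .config("spark.executor.memory", "2g")  # 2 GB per executor
    .config("spark.cores.max", "10")        # 10 cores total -> up to 10 one-core executors
    .getOrCreate()
)
```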
This page will be updated as and when I see some reusable snippets of code for Spark operations.
The digital age has given enterprises a wealth of new options to measure their operations. Gone are the days of manually updating database tables: today’s digital systems are able to capture the minutiae of user interactions with apps, connected devices and digital systems.
With digital data being generated at a tremendous pace, developers and analysts have a broad range of possibilities when it comes to operationalizing data and preparing it for analytics and machine learning.
One of the fundamental questions you need to ask when planning out your data architecture is the question of batch vs stream processing: do you process data as it arrives, in real time or near-real time, or do you wait for data to accumulate before running your ETL job? This guide is meant to give you a high-level overview of the considerations you should take into account when making that decision. We’ve also included some additional resources if you want to dive deeper.
Batch Processing
What is batch processing?
In batch processing, we wait for a certain amount of raw data to “pile up” before running an ETL job. Typically this means data is between an hour to a few days old before it is made available for analysis. Batch ETL jobs will typically be run on a set schedule (e.g. every 24 hours), or in some cases once the amount of data reaches a certain threshold.
When to use batch processing?
By definition, batch processing entails latencies between the time data appears in the storage layer and the time it is available in analytics or reporting tools. However, this is not necessarily a major issue, and we might choose to accept these latencies because we prefer working with batch processing frameworks.
For example, if we’re trying to analyze the correlation between SaaS license renewals and customer support tickets, we might want to join a table from our CRM with one from our ticketing system. If that join happens once a day rather than the second a ticket is resolved, it probably won’t make much of a difference.
To generalize, you should lean towards batch processing when:
- Data freshness is not a mission-critical issue
- You are working with large datasets and are running a complex algorithm that requires access to the entire batch – e.g., sorting the entire dataset
- You get access to the data in batches rather than in streams
- You are joining tables in relational databases
Batch processing tools and frameworks
- Open-source Hadoop-ecosystem frameworks such as Spark and MapReduce are a popular choice for big data processing
- For smaller datasets and application data, you might use batch ETL tools such as Informatica and Alteryx
- Relational databases such as Amazon Redshift and Google BigQuery
Need inspiration on how to build your big data architecture? Check out these data lake examples. Want to simplify ETL pipelines with a single self-service platform for batch and stream processing? Try Upsolver today.
Stream Processing
What is stream processing?
In stream processing, we process data as soon as it arrives in the storage layer – which would often also be very close to the time it was generated (although this would not always be the case). This would typically be in sub-second timeframes, so that for the end user the processing happens in real-time. These operations would typically not be stateful, or would only be able to store a ‘small’ state, so would usually involve a relatively simple transformation or calculation.
When to use stream processing
While stream processing and real-time processing are not necessarily synonymous, we would use stream processing when we need to analyze or serve data as close as possible to when we get hold of it.
Examples of scenarios where data freshness is super-important could include real-time advertising, online inference in machine learning, or fraud detection. In these cases we have data-driven systems that need to make a split-second decision: which ad to serve? Do we approve this transaction? We would use stream processing to quickly access the data, perform our calculations and reach a result.
Indications that stream processing is the right approach:
- Data is being generated in a continuous stream and arriving at high velocity
- Sub-second latency is crucial
Stream processing tools and frameworks
Stream processing and micro-batch processing are often used synonymously, and frameworks such as Spark Streaming would actually process data in micro-batches. However, there are some pure-play stream processing tools such as Confluent’s KSQL, which processes data directly in a Kafka stream, as well as Apache Flink and Apache Flume.
Micro-batch Processing
What is micro-batch processing?
In micro-batch processing, we run batch processes on much smaller accumulations of data – typically less than a minute’s worth of data. This means data is available in near real-time. In practice, there is little difference between micro-batching and stream processing, and the terms would often be used interchangeably in data architecture descriptions and software platform descriptions.
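To make the micro-batch idea concrete, here is a minimal Spark Structured Streaming sketch (paths, schema and sink are hypothetical) that processes roughly a minute of accumulated data per trigger:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-example").getOrCreate()

# Watch a directory for newly arriving JSON files
events = (
    spark.readStream
    .schema("user STRING, action STRING, ts TIMESTAMP")  # hypothetical schema
    .json("/data/incoming/")
)

# Each trigger runs a small batch job over the data that arrived since the last one
query = (
    events.groupBy("action").count()
    .writeStream
    .outputMode("complete")
    .trigger(processingTime="1 minute")  # micro-batch roughly every minute
    .format("memory")                    # in-memory sink, for demonstration only
    .queryName("action_counts")
    .start()
)
```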
When to use micro-batch processing
Micro-batch processing is useful when we need very fresh data, but not necessarily real-time – meaning we can't wait an hour or a day for a batch process to run, but we also don't need to know what happened in the last few seconds.
Example scenarios could include web analytics (clickstream) or user behavior. If a large ecommerce site makes a major change to its user interface, analysts would want to know how this affected purchasing behavior almost immediately because a drop in conversion rates could translate into significant revenue losses. However, while a day’s delay is definitely too long in this case, a minute’s delay should not be an issue – making micro-batch processing a good choice.
Micro-batch processing tools and frameworks
- Apache Spark Streaming is the most popular open-source framework for micro-batch processing.
- Vertica offers support for microbatches.
Additional resources and further reading
The above are general guidelines for determining when to use batch vs stream processing. However, each of these topics warrants much further research in its own right. To delve deeper into data processing techniques, you might want to check out a few of the following resources:

Upsolver: Streaming-first ETL Platform
One of the major challenges when working with big data streams is the need to orchestrate multiple systems for batch and stream processing, which often leads to complex tech stacks that are difficult to maintain and manage.
Whether you’re just building out your big data architecture or are looking to optimize ETL flows, Upsolver provides a comprehensive self-service platform that combines batch, micro-batch and stream processing and enables developers and analysts to easily combine streaming and historical big data. To learn how it works, check out our technical white paper.
