How to Start a New PySpark Job
Dec. 4, 2019
Read time: 34 minutes
I’ve been to Spark and back. But I did leave some of my soul.
According to Apache, Spark was developed to “write applications quickly in Java, Scala, Python, R, and SQL.”
And I’m sure it’s true. Or at least I’m sure their intentions were noble.
I’m not talking about Scala yet, or Java; those are whole other languages. I’m talking about Spark with Python. Or PySpark, as the Ogilvy-inspired geniuses at Apache marketing call it.
The learning curve is not easy, my pretties, but luckily for you, I’ve managed to sort out some of the basic ecosystem and how it all operates. Brevity is my goal.
This doesn’t include MLlib, or GraphX, or streaming, just the basics.
Import some data
train = sqlContext.read.option("header", "true")\
    .option("inferSchema", "true")\
    .format("csv")\
    .load("train_V2.csv")\
    .limit(20000)

Show the head of a dataframe
train.head(5)

List the columns and their value types
train.printSchema()

Show a number of rows in a better format
train.show(2, truncate=True)

Count the number of rows
train.count()

List column names
train.columns

Show count, mean, standard deviation, min, and max
train.describe().show()

Show the same summary for just one column
train.describe('kills').show()

Show only certain columns
train.select('kills', 'headshotKills').show(5)

Count the distinct values of a column, or get them as a dataframe
train.select('boosts').distinct().count()
train.select('boosts').distinct()

That's it for now...