python - Pyspark: show histogram of a data frame column


Unfortunately, I don't think there's a clean plot() or hist() function in the PySpark DataFrames API, but I'm hoping that things will eventually go in that direction.


For the time being, you can compute the histogram in Spark and plot the result as a bar chart. Example:

import pandas as pd
import pyspark.sql as sparksql

# Let's use UCLA's college admission dataset
file_name = "https://stats.idre.ucla.edu/stat/data/binary.csv"

# Create a pandas dataframe from the sample data
df_pd = pd.read_csv(file_name)

sql_context = sparksql.SQLContext(sc)

# Create a Spark DataFrame from the pandas dataframe
df_spark = sql_context.createDataFrame(df_pd)

df_spark.show(5)


This is what the data looks like:

Out[]: +-----+---+----+----+
       |admit|gre| gpa|rank|
       +-----+---+----+----+
       |    0|380|3.61|   3|
       |    1|660|3.67|   3|
       |    1|800| 4.0|   1|
       |    1|640|3.19|   4|
       |    0|520|2.93|   4|
       +-----+---+----+----+
       only showing top 5 rows

# This is what we want
df_pd.hist('gre');

Histogram when plotted using df_pd.hist()

# Doing the heavy lifting in Spark. We can leverage the `histogram`
# function from the RDD API.
gre_histogram = df_spark.select('gre').rdd.flatMap(lambda x: x).histogram(11)

# Loading the computed histogram into a pandas DataFrame for plotting
pd.DataFrame(
    list(zip(*gre_histogram)),
    columns=['bin', 'frequency']
).set_index(
    'bin'
).plot(kind='bar');
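
As a side note on why zip(*gre_histogram) works: RDD.histogram(n) returns a pair of lists, n + 1 bucket boundaries and n counts, so zipping them pairs each bucket's left edge with its count and silently drops the trailing boundary. A quick illustration on toy data (assuming a live SparkContext sc, as above):

buckets, counts = sc.parallelize([1, 2, 3]).histogram(2)
# buckets -> roughly [1.0, 2.0, 3]  (n + 1 boundaries)
# counts  -> [1, 2]                 (one count per bucket; the last bucket is right-inclusive)
list(zip(buckets, counts))          # -> [(1.0, 1), (2.0, 2)]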

Histogram computed using RDD.histogram()

You can now use the pyspark_dist_explore package to leverage the matplotlib hist function for Spark DataFrames:

from pyspark_dist_explore import hist
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
hist(ax, my_df.select('field_1'), bins=20, color=['red'])
plt.show()

This library uses the RDD histogram function to calculate the bin values.
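
For reference, the same bins can be computed by hand with the RDD API directly, a sketch reusing the hypothetical my_df and field_1 names from the snippet above:

# Manually compute the 20 bins the library derives via RDD.histogram()
edges, counts = my_df.select('field_1').rdd.flatMap(lambda x: x).histogram(20)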


Another solution, without the need for extra packages, which should also be efficient. First, use a window partition:

import pyspark.sql.functions as F
import pyspark.sql as SQL

win = SQL.Window.partitionBy('column_of_values')

Then all you need is to use the count aggregation partitioned by the window:

df.select(F.count('column_of_values').over(win).alias('histogram'))

The aggregation happens on each partition of the cluster and does not require an extra round-trip to the driver.
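
Note that this yields one row per input row, each annotated with the frequency of its value, so you would typically deduplicate before plotting. A minimal sketch of one way to turn it into a plottable frequency table, continuing the snippet above (F, win, and the hypothetical df and column_of_values are as defined there):

# Keep each value next to its window count, then collapse duplicates
freq = df.select(
    'column_of_values',
    F.count('column_of_values').over(win).alias('frequency'),
).distinct()

# The distinct frequency table is small, so bring it to the driver and plot
freq.toPandas().set_index('column_of_values').plot(kind='bar')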
