Wednesday, March 17, 2021

Spark calculation of average time users active in Java

Here we are going to start a series of posts on using Apache Spark with Java to perform some calculations and get some analytics.

In this first one we are going to calculate the average time a user stays subscribed to some service. This is quite an easy task to perform when using a relational database, but it may become complicated or quite time consuming if the dataset is very large or if the data is stored in another type of data source.

To start with an easy example we are going to get the data from a MySQL table. The point is, the data could be gathered from wherever we want: a large file in CSV format, Apache Cassandra...
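As an illustration (this is not the repository code; the table name, columns and connection details below are assumptions), such a table could be loaded through Spark's JDBC data source and turned into the tuple RDD used in the next steps:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple3;

import java.sql.Timestamp;
import java.util.Properties;

public class UsersLoader {

    // Hypothetical connection details and table layout, not taken from the repository
    private static final String MYSQL_URL = "jdbc:mysql://localhost:3306/demo";

    public static JavaRDD<Tuple3<Integer, Timestamp, Timestamp>> loadUsers(SparkSession spark) {
        Properties props = new Properties();
        props.put("user", "demo");
        props.put("password", "demo");

        // Read the whole "users" table through the JDBC data source and keep only the
        // three columns needed: user ID, join timestamp and (nullable) leave timestamp
        return spark.read()
                .jdbc(MYSQL_URL, "users", props)
                .javaRDD()
                .map(r -> new Tuple3<>(r.getInt(r.fieldIndex("user_id")),
                        r.getTimestamp(r.fieldIndex("joined_at")),
                        r.getTimestamp(r.fieldIndex("left_at"))));
    }
}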

As we are using a relational database, the query for getting the desired information would be a simple GROUP BY statement, but let's see how it would look using Spark.

In the following repository is the source code of the example:

https://github.com/rafareyeslopez/spark-demo


What is done to achieve the result is, first, to run a map function to get the data in the required format: we build a Pair RDD where the key is the user ID and the value is the difference between the timestamp when the user left the service and the timestamp when the user joined it (converted into days).

// Key: user ID. Value: days between joining and leaving the service; if the user has
// not left yet (the leave timestamp is null) the current time is used instead.
final JavaPairRDD mapToPair = usersMap.mapToPair(row -> new Tuple2<>(row._1(),
        TimeUnit.MILLISECONDS.toDays(row._3() == null
                ? Calendar.getInstance().getTimeInMillis() - row._2().getTime()
                : row._3().getTime() - row._2().getTime())));


Next the number of users is counted, which is an easy one:

final long count = mapToPair.count();


And lastly the reduce is done; all that is left is to calculate the average.

// Get the sum of days in service
final Long reduce = mapToPair.values().reduce((value1, value2) -> value1 + value2);

// Print out the average (cast to double so the decimal part is not truncated)
System.out.println("Average days subscribed " + ((double) reduce / count));


Just to note, Spark has a function to perform groupBy, BUT it may cause some issues: it is a well-known caveat that grouping shuffles all the values for each key across the cluster, which can be expensive or even run out of memory on large datasets, so it needs to be taken into account.
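Just as an illustration of that caveat (hypothetical data, not part of the repository): if a user could have several subscription periods, summing the days per user with reduceByKey combines values on each partition before the shuffle, whereas groupByKey would move every single value across the network:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class ReduceByKeyExample {

    public static void main(String[] args) {
        try (JavaSparkContext sc = new JavaSparkContext("local[*]", "reduce-by-key-example")) {
            // Hypothetical data: (user ID, days of one subscription period)
            JavaPairRDD<Integer, Long> periodsPerUser = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(1, 30L),
                    new Tuple2<>(1, 15L),
                    new Tuple2<>(2, 7L)));

            // reduceByKey merges values per key locally before shuffling,
            // avoiding the memory pressure groupByKey can cause on large datasets
            JavaPairRDD<Integer, Long> daysPerUser = periodsPerUser.reduceByKey(Long::sum);

            daysPerUser.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}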

Moreover, this calculation has been done using map / reduce; in the future an example using Spark SQL will be shown.

The full class that runs the Spark job to perform the calculation is in the repository linked above.
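As a rough sketch of how the pieces above fit together (the class name, the local master setting and the hypothetical UsersLoader from the earlier sketch are assumptions, not the code from the repository):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;
import scala.Tuple3;

import java.sql.Timestamp;
import java.util.Calendar;
import java.util.concurrent.TimeUnit;

public class AverageDaysSubscribedJob {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("average-days-subscribed")
                .master("local[*]")
                .getOrCreate();

        // Load (user ID, joined, left) tuples, reusing the hypothetical loader sketched earlier
        JavaRDD<Tuple3<Integer, Timestamp, Timestamp>> usersMap = UsersLoader.loadUsers(spark);

        // Pair RDD: key = user ID, value = days between joining and leaving (or now, if still active)
        JavaPairRDD<Integer, Long> mapToPair = usersMap.mapToPair(row -> new Tuple2<>(row._1(),
                TimeUnit.MILLISECONDS.toDays(row._3() == null
                        ? Calendar.getInstance().getTimeInMillis() - row._2().getTime()
                        : row._3().getTime() - row._2().getTime())));

        // Number of users and total days in service
        long count = mapToPair.count();
        long totalDays = mapToPair.values().reduce(Long::sum);

        System.out.println("Average days subscribed " + (double) totalDays / count);

        spark.stop();
    }
}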
