Finding the most spawned Pokemon in Pokemon GO using Spark & visualizing the Data
My colleague Justin Risch recently obtained some data from the popular game Pokemon GO. He cleansed the data into a much more usable CSV format and I decided to use this to do some practice in Apache Spark.
It was a fairly simple Spark class written in Scala using the Eclipse Scala IDE.
Here is the code:
object MostSpawnedPokemon {
  def loadNames() : Map[Int, String] = {
    Source.fromFile("../Data/pokemon.csv")
      .getLines()
      .map(_.split(','))
      .filter(_.length > 1)
      .map(fields => fields(0).toInt -> fields(1))
      .toMap
  }
  def main(args: Array[String]) {
    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)
    // Create a SparkContext using every core of the local machine
    val sc = new SparkContext("local[*]", "PopularMoviesNicer")
    // Create a broadcast variable of our ID -> Pokemon Name map
    var nameDict = sc.broadcast(loadNames)
    // Read in each Spawn line
    val lines = sc.textFile("../Data/AllData.csv")
    // Map to (spawnedPokemonId, 1) tuples
    val spawned = lines.map(x => (x.split(",")(2).toInt, 1))
    // Count up all the 1's for each Pokemon
    val spawnedCounts = spawned.reduceByKey( (x, y) => x + y )
    // Flip (spawnedPokemonId, count) to (count, spawnedPokemonId)
    val flipped = spawnedCounts.map( x => (x._2, x._1) )
    // Sort
    val sortedCount = flipped.sortByKey()
    // Fold in the Pokemon names from the broadcast variable
    val sortedCountWithNames = sortedCount.map( x  => (nameDict.value(x._2), x._1) )
    // Collect and print results
    val results = sortedCountWithNames.collect()
    results.foreach(println)
  }
}
I go through each step of my code in the comments. This ended up being a simple yet solid example of using Broadcast Variables in Spark. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. The output from this was the Pokemon's name and the number of spawn occurrences.
Here is a snippet from the output showing the most uncommon Pokemon:
- (machamp,8)
- (kabutops,14)
- (charizard,14)
- (farfetchd,14)
- (muk,15)
- (gyarados,20)
- (raichu,23)
- (omastar,25)
- (alakazam,27)
- (ninetales,27)
This is a fun dataset to work with and I am going to continue using it as I begin learning more advanced Spark programming. This simple bit of information regarding what were the most uncommon Pokemon ended up helping Justin in some work he did visualizing the data.