Wednesday, April 25, 2012

Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part II


In my last blog, Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps, I analyzed the historical airline performance data set using R and SAP HANA and put the aggregated analysis on Google Maps.  Undoubtedly, a map is an exciting canvas for viewing and analyzing big data sets: one can draw shapes (circles, polygons) on the map beneath a marker pin to provide pin-point information, and display aggregated information in an info-window when a marker is clicked.  I enjoyed doing all of that, but I was still craving some old-fashioned bubble charts and other chart types for comparing information across big data sets.  Ultimately, all big data sets get aggregated into smaller analytical sets for viewing, sharing and reporting, and an old-fashioned chart is the best way to tell that visual story!

A bubble chart can display four dimensions of data for comparative analysis: the x-axis, the y-axis, the bubble size and the bubble color each carry one measure.  For this analysis I used the same data set of 200M data points and dug deeper into finer slices of the information, leveraging D3, R and SAP HANA.  Here I am publishing some of this work:
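For readers who want to reproduce the four-dimension encoding without D3, here is a minimal R sketch using ggplot2 (my own illustration, not the code behind the charts in this post); only the Southwest row reflects the figures quoted below, the other rows are placeholder values purely for illustration.

library(ggplot2)

# hypothetical summary table -- only the Southwest (WN) row reflects figures cited in the text
airline.summary <- data.frame(
  Airline           = c("WN", "AA", "CO", "OO"),
  TotalFlights      = c(1200000, 600000, 300000, 450000),
  AvgDepDelay       = c(10, 14, 13, 12),
  TotalDestinations = c(64, 70, 60, 110)
)

# four dimensions on one chart: x = flights, y = delay, size = destinations, color = airline
ggplot(airline.summary,
       aes(x = TotalFlights, y = AvgDepDelay, size = TotalDestinations, colour = Airline)) +
  geom_point(alpha = 0.7) +
  scale_size_area(max_size = 20) +
  labs(x = "Total flights (2008)", y = "Average departure delay (minutes)")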

In the first graphic, the performance of the top airlines is compared for 2008.  As expected, Southwest, the largest airline (using total number of flights as a proxy for size), performed well for its size: 1.2M flights and 64 destinations, with an average delay of ~10 minutes.  Some of the other airlines, like American and Continental, were among the worst performers, along with SkyWest.  Note that I didn't remove outliers from this analysis.  Click here to interact with this example (view the source to get the D3 code).


In the second analysis, I replaced the airlines dimension with the airports dimension but kept all the other dimensions the same.  To my disbelief, Newark is the worst-performing airport when it comes to departure delays, followed by Chicago O'Hare, SFO and JFK.  Atlanta is the largest airport, yet it has the best performance.  What are they doing differently at ATL?  Click here to interact with this example (view the source to get the D3 code).


It was a hell of a lot of fun playing with D3, R and HANA, and good intellectual stimulation if nothing else!  Happy analyzing, and remember: the possibilities are endless!

As always, my R modules are fairly simple and straightforward:
###########################################################################################
# ETL - Read the airport information, extract the major-airport subset and upload this
# transformed data set into HANA
###########################################################################################
library(data.table)

major.airports <- data.table(read.csv("MajorAirports.csv", header=TRUE, sep=",", stringsAsFactors=FALSE))
setkey(major.airports, iata)

all.airports <- data.table(read.csv("AllAirports.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(all.airports, iata)

airports.2008.hp <- data.table(read.csv("2008.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(airports.2008.hp, Origin, UniqueCarrier)

# Merge the two data sets: join the airport attributes onto the flights (iata = Origin)
airports.2008.hp <- major.airports[airports.2008.hp,]


###########################################################################################
# Get airport statistics for all airports
###########################################################################################
airports.2008.hp.summary <- airports.2008.hp[major.airports,     
    list(AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
    TotalMiles=prettyNum(sum(Distance, na.rm=TRUE), big.mark=","),
    TotalFlights=length(Month),
    TotalDestinations=length(unique(Dest)),
    URL=paste("http://www.fly", Origin, ".com",sep="")), 
                    by=list(Origin)][order(-TotalFlights)]
setkey(airports.2008.hp.summary, Origin)
#merge two data tables
airports.2008.hp.summary <- major.airports[airports.2008.hp.summary, 
                                                     list(Airport=airport, 
                                                          AvgDepDelay, TotalMiles, TotalFlights, TotalDestinations, 
                                                          Address=paste(airport, city, state, sep=", "), 
                                                          Lat=lat, Lng=long, URL)][order(-TotalFlights)]

airports.2008.hp.summary.json <- getRowWiseJson(airports.2008.hp.summary)
writeLines(airports.2008.hp.summary.json, "airports.2008.hp.summary.json")                 
write.csv(airports.2008.hp.summary, "airports.2008.hp.summary.csv", row.names=FALSE)
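The getRowWiseJson() helper used above comes from one of my earlier posts and converts a data.table/data.frame into row-wise JSON for D3.  In case you don't have it handy, a minimal stand-in (my own sketch using the rjson package, not necessarily the original implementation) could look like this:

library(rjson)

# turn each row into a named list, then serialize the list of rows as a JSON array of objects
getRowWiseJson <- function(dt) {
  df <- as.data.frame(dt)
  rows <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop=FALSE]))
  toJSON(rows)
}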

Thursday, April 19, 2012

Getting Historical Weather Data in R and SAP HANA

For many of my recent data blogs, I have needed historical weather data to perform data mash-ups and pin-point root causes.  For example, in my continued exploration of the historical airline/airport data using SAP HANA and R, I wanted to find out whether weather was behind the extreme delays out of a particular airport on a particular day or hour.  For that, I needed to mash up the weather data with the airline data.

I looked around but could not find a better way to get the weather data, so I turned to R.  To get historical weather data, I use Weather Underground's REST API, and I put together a simple R program that returns the weather data in a data.frame.  This R module is called from SAP HANA, which inserts a new table into HANA with the relevant weather information.  Once the data is in HANA, I perform the mash-ups there and I am off on my intellectual pursuit.
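The actual mash-up happens inside HANA, but for readers who want to stay entirely in R, here is a minimal data.table sketch of the same join (my own illustration, not the HANA code).  It assumes the flights table from the earlier airline posts (airports.2008.hp, with the standard Year/Month/DayofMonth/Origin/DepDelay columns from 2008.csv) and the hdwd weather frame built by the program below.

library(data.table)

# flights: the 2008 airline data.table loaded in the earlier posts
flights <- copy(airports.2008.hp)
flights[, FlightDate := as.Date(paste(Year, Month, DayofMonth, sep="-"))]

# weather: the data frame produced by the loop further down in this post
weather <- data.table(hdwd)
weather[, Date := as.Date(Date)]

setkey(flights, Origin, FlightDate)
setkey(weather, Airport, Date)

# left join: keep every flight and attach the day's weather where we have it
flights.weather <- weather[flights]

# e.g. average departure delay at SFO on rainy vs. dry days
flights.weather[Airport == "SFO",
                list(AvgDepDelay = round(mean(DepDelay, na.rm=TRUE), 2)),
                by=Rain]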

Weather Underground returns the data in both XML and JSON formats.  The program logic is very simple (once you have spent hours cracking it, the end product always looks simple :-)), and there are comments in the code below for self-learning.

I want to mention that you are not limited to the historical view of the weather data.  You can also pull the weather forecast for the next 10 days, run your analysis and predict the future!
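A hedged sketch of what such a forecast call could look like, mirroring the getHistoricalWeather() function below; the 'forecast10day' feature name is my assumption here, so double-check it against the API documentation:

# sketch only -- same call pattern as getHistoricalWeather() below;
# 'forecast10day' is assumed to be the feature name, verify against the API docs
getForecast <- function(airport.code="SFO")
{
  base.url  <- 'http://api.wunderground.com/api/{your key here}/'
  final.url <- paste(base.url, 'forecast10day/q/', airport.code, '.json', sep='')
  conn      <- url(final.url)
  raw.data  <- readLines(conn, n=-1L, ok=TRUE)
  close(conn)
  fromJSON(paste(raw.data, collapse=""))
}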

Make sure to register with Weather Underground (API documentation link), comply with their rules and get your own key to access their APIs.
############################################################################
# Note: the original post does not show its imports -- fromJSON() is available from
# the rjson (or RJSONIO) package, and ldply(), used below, from plyr.
library(rjson)
library(plyr)

getHistoricalWeather <- function(airport.code="SFO", date=format(Sys.Date(), "%Y%m%d"))
{
  base.url <- 'http://api.wunderground.com/api/{your key here}/'
  # compose the final URL for the 'history' feature of the API (date in YYYYMMDD form)
  final.url <- paste(base.url, 'history_', date, '/q/', airport.code, '.json', sep='')


  # read the raw response lines from the web service
  conn <- url(final.url)
  raw.data <- readLines(conn, n=-1L, ok=TRUE)
  close(conn)
  # parse the JSON response into an R list
  weather.data <- fromJSON(paste(raw.data, collapse=""))
  return(weather.data)
}



# get data for 10 days - restriction by Weather Underground for free usage
date.range <- seq.Date(from=as.Date('2006-1-01'), to=as.Date('2006-1-10'), by='1 day')


# Initialize a data frame
hdwd <- data.frame()



# loop over dates, and fetch weather data
for(i in seq_along(date.range)) {
  weather.data <- getHistoricalWeather('SFO', format(date.range[i], "%Y%m%d"))
  # keep the airport code and date as characters so rbind() does not coerce the Date to a number
  hdwd <- rbind(hdwd, ldply(weather.data$history$dailysummary,
      function(x) c('SFO', format(date.range[i]), x$fog, x$rain, x$snow,
                    x$meantempi, x$meanvism, x$maxtempi, x$mintempi)))
}
colnames(hdwd) <- c("Airport", "Date", "Fog", "Rain", "Snow", "AvgTemp", "AvgVisibility", "MaxTemp", "MinTemp")


# save the results to a compressed CSV
write.csv(hdwd, file=gzfile('SFO-Jan2006.csv.gz'), row.names=FALSE)

############################################################################
Results (temperatures are in °F) -

Airport Date       Fog Rain Snow AvgTemp AvgVisibility MaxTemp MinTemp
SFO     2006-01-01 0   1    0    55      14            62      47
SFO     2006-01-02 0   1    0    53      11            55      50
SFO     2006-01-03 0   1    0    51      14            56      46
SFO     2006-01-04 0   0    0    56      16            62      50
SFO     2006-01-05 0   0    0    54      14            60      48
SFO     2006-01-06 0   1    0    52      14            59      45
SFO     2006-01-07 0   1    0    56      14            61      50
SFO     2006-01-08 0   0    0    51      16            57      45
SFO     2006-01-09 0   0    0    49      16            56      41
SFO     2006-01-10 0   0    0    54      10            61      46


Happy Analyzing!

Wednesday, April 11, 2012

Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps

Technologies: SAP HANA, R, HTML5, D3, Google Maps, jQuery and JSON
For this fun exercise, I analyzed more than 200 million data points using SAP HANA and R, and then brought the aggregated results into HTML5 using D3, JSON and the Google Maps APIs.  The 2008 airline data set comes from the Data Expo, and I have been using the entire data set (123 million rows and 29 columns) for quite some time; see my other blogs.

The results look beautiful:

Each airport icon is clickable; when clicked, it displays an info-window with the key stats for the selected airport:
I then used D3 to display the aggregated result set in a modal window (lightbox):
D3 made it look ridiculously simple to generate a table from a JSON file.
Unfortunately, I can't provide a live example due to the restrictions imposed by the Google Maps APIs, as I am approaching my free API limits.

Fun fact: Atlanta was the largest airport in 2008 on many dimensions: total flights departed, total miles flown and total destinations.  It also experienced a lower average departure delay in 2008 than Chicago O'Hare.  I had always thought Chicago O'Hare was the largest US airport.

As always, I needed just 6 lines of R code, including two lines to write the data out to JSON and CSV files:
################################################################################
airports.2008.hp.summary <- airports.2008.hp[major.airports,     
    list(AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
    TotalMiles=prettyNum(sum(Distance, na.rm=TRUE), big.mark=","),
    TotalFlights=length(Month),
    TotalDestinations=length(unique(Dest)),
    URL=paste("http://www.fly", Origin, ".com",sep="")), 
                    by=list(Origin)][order(-TotalFlights)]
setkey(airports.2008.hp.summary, Origin)
#merge the two data tables
airports.2008.hp.summary <- major.airports[airports.2008.hp.summary, 
                                                     list(Airport=airport, 
                                                          AvgDepDelay, TotalMiles, TotalFlights, TotalDestinations, 
                                                          Address=paste(airport, city, state, sep=", "), 
                                                          Lat=lat, Lng=long, URL)][order(-TotalFlights)]

airports.2008.hp.summary.json <- getRowWiseJson(airports.2008.hp.summary)
writeLines(airports.2008.hp.summary.json, "airports.2008.hp.summary.json")                 
write.csv(airports.2008.hp.summary, "airports.2008.hp.summary.csv", row.names=FALSE)
##############################################################################

Happy Coding and remember the possibilities are endless!