Posts

Introduction to BigQuery ML

My blog is now on Medium! A few months ago Google announced a new Google BigQuery feature called BigQuery ML, which is currently in Beta. It consists of a set of extensions to the SQL language that allow you to create machine learning models, evaluate their predictive performance and make predictions for new data directly in BigQuery. Source: https://twitter.com/sfeir/status/1039135212633042945 One of the advantages of BigQuery ML (BQML) is that you only need to know standard SQL to use it (there is no need to use R or Python to train models), which makes machine learning more accessible. It even handles data transformation, training/test set splitting, etc. In addition, it reduces model training time because it works directly where the data is stored (BigQuery), so it is not necessary to export the data to other tools. But it is not all advantages. First of all, the available models are currently limited (although we will see that...
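Since BQML models are created and used entirely through SQL, a small sketch may help to picture it. The snippet below is my own illustration (not taken from the post): it runs a hypothetical CREATE MODEL statement from R via the bigrquery package, with placeholder project, dataset and table names.

# A minimal sketch, assuming a GCP project and a BigQuery table that already
# exist; all names below are placeholders.
library(bigrquery)

sql <- "
  CREATE OR REPLACE MODEL `my_dataset.sample_model`
  OPTIONS(model_type = 'logistic_reg') AS
  SELECT label, feature_1, feature_2
  FROM `my_dataset.training_table`
"

# The training job runs inside BigQuery; nothing is exported to R.
bq_project_query("my-gcp-project", sql)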

Understanding how to explain predictions with "explanation vectors"

My blog is now on Medium! In a recent post I introduced three existing approaches to explain individual predictions of any machine learning model. After the posts focused on LIME and Shapley values, now it's the turn of Explanation vectors, a method presented by David Baehrens, Timon Schroeter and Stefan Harmeling in 2010. As we saw in those posts, explaining a decision of a black-box model means understanding which input features made the model give its prediction for the observation being explained. Intuitively, a feature has a lot of influence on the model's decision if small variations in its value cause large variations in the model's output, while a feature has little influence on the prediction if big changes in that variable barely affect the model's output. Since a model is a scalar function, its gradient points in the direction of the greatest rate of increase of the model's output, so it can be used as a measure of the features' influence...
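To make the gradient idea concrete, here is a rough numerical sketch (my own toy illustration under simplifying assumptions, not the authors' code): it estimates, by finite differences, how a fitted model's predicted probability changes when each feature of a single observation is nudged.

# Finite-difference estimate of the "explanation vector" of a simple model
# at one observation (binary subset of the iris data; my own toy example).
df <- iris[iris$Species != "setosa", ]
df$y <- as.numeric(df$Species == "virginica")
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
model <- glm(reformulate(features, "y"), data = df, family = binomial)

x0 <- df[1, features]
eps <- 1e-4

gradient <- sapply(features, function(f) {
  x_eps <- x0
  x_eps[[f]] <- x_eps[[f]] + eps
  (predict(model, x_eps, type = "response") -
     predict(model, x0, type = "response")) / eps
})
gradient  # one entry per feature: its local influence on the prediction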

Understanding how IME (Shapley Values) explains predictions

My blog is now on Medium! In a recent post I introduced three existing approaches to explain individual predictions of any machine learning model. After the post focused on LIME, now it's the turn of IME (Interactions-based Method for Explanation), a method presented by Erik Strumbelj and Igor Kononenko in 2010. Recently, this method has also been called Shapley Values. Intuition behind IME: when a model gives a prediction for an observation, not all features play the same role: some of them may have a lot of influence on the model's prediction, while others may be irrelevant. Consequently, one might think that the effect of each feature can be measured by checking what the prediction would have been if that feature were absent; the bigger the change in the model's output, the more important the feature should be. However, observing only a single feature at a time means that dependencies between features are not taken into account, which could produce inaccura...
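The sampling flavour of this idea can be sketched in a few lines of R. The function below is a simplified illustration under my own assumptions (it is not the authors' implementation): for one feature, it averages the change in prediction when that feature's value comes from the explained instance versus from a randomly drawn instance, over random feature orderings.

# Simplified sampling sketch of a Shapley-value contribution for one feature.
# `data` holds only the feature columns, `x` is a one-row data frame with the
# observation to explain, and `model` is assumed to work with predict().
shapley_contribution <- function(model, data, x, feature, n_samples = 200) {
  features <- colnames(data)
  diffs <- replicate(n_samples, {
    perm <- sample(features)               # random feature ordering
    z <- data[sample(nrow(data), 1), ]     # random background instance
    pos <- which(perm == feature)
    preceding <- perm[seq_len(pos - 1)]    # features before `feature`
    with_i <- z
    without_i <- z
    with_i[c(preceding, feature)] <- x[c(preceding, feature)]
    if (length(preceding) > 0) without_i[preceding] <- x[preceding]
    predict(model, with_i) - predict(model, without_i)
  })
  mean(diffs)  # estimated contribution of `feature` to the prediction for x
}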

Understanding how LIME explains predictions

My blog is now on Medium! In a recent post I introduced three existing approaches to explain individual predictions of any machine learning model. In this post I will focus on one of them: Local Interpretable Model-agnostic Explanations (LIME), a method presented by Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin in 2016. In my opinion, its name perfectly summarizes the three basic ideas behind this explanation method: Model-agnosticism. In other words, model independence: LIME doesn't make any assumptions about the model whose prediction is explained. It treats the model as a black box, so the only way it has to understand its behavior is to perturb the input and see how the predictions change. Interpretability. Above all, explanations have to be easy for users to understand, which is not necessarily true of the feature space used by the model, because it may use too many input variables (even a linear model can be difficult to interpret if it...
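The perturb-and-fit idea can be sketched for tabular data in a few lines of R. This is my own bare-bones simplification (not the lime package): it perturbs the explained instance, weights the perturbed points by proximity, and fits a weighted linear surrogate whose coefficients act as the local explanation.

# Bare-bones LIME-style sketch for a numeric, glm-like model; `x` is a one-row
# data frame with the observation to explain. The noise level and kernel are
# arbitrary simplifications.
lime_sketch <- function(model, x, n = 500, kernel_width = 1) {
  # Perturb the instance with Gaussian noise around its feature values.
  perturbed <- as.data.frame(lapply(x, function(v)
    rnorm(n, mean = v, sd = abs(v) * 0.1 + 1e-6)))
  preds <- predict(model, perturbed, type = "response")
  # Weight perturbed points by proximity to x (RBF kernel on distance).
  dists <- sqrt(rowSums(sweep(as.matrix(perturbed), 2, as.numeric(x))^2))
  weights <- exp(-(dists^2) / kernel_width^2)
  # Fit an interpretable (weighted linear) surrogate around x; its
  # coefficients are the local explanation.
  coef(lm(preds ~ ., data = perturbed, weights = weights))
}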

Download your Tweet Activity report using R

My blog is now on Medium! Recently I wanted to analyze my tweets' performance. As you may know, Twitter Analytics provides detailed data about a Twitter account's tweets: number of impressions, engagement rate, number of retweets, link clicks, etc. What I didn't know was that the website restricts the maximum date range you can see (and download) to a 91-day window, but I wanted to analyze more data. Instead of downloading reports 91 days at a time, I wrote an R script that does the dirty work. It downloads the tweet activity data from Twitter Analytics for any given period. If the selected date range is longer than 91 days, the download is split into several downloads. Installation: to install the script you will need the devtools package and the RSelenium package:

install.packages("devtools")
library(devtools)
install_github("ropensci/RSelenium")

Then you can install the script directly from GitHub by running: ...
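For illustration, this is roughly how a long date range could be broken into the 91-day chunks that Twitter Analytics accepts (a sketch of the idea, not the script's actual code):

# Sketch: split a date range into consecutive windows of at most 91 days.
split_date_range <- function(from, to, max_days = 91) {
  from <- as.Date(from)
  to <- as.Date(to)
  starts <- seq(from, to, by = max_days)
  data.frame(start = starts, end = pmin(starts + max_days - 1, to))
}

split_date_range("2018-01-01", "2018-12-31")
# each row corresponds to one report to request from Twitter Analytics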