Data Mining Demo: Modelling and forecasting with R and EXASOL

Friday, 13 May 2016

This post walks through a simple data mining demo that shows modelling and forecasting with R in combination with Exasol. If you have not yet installed and configured the necessary components, you can find instructions here.

# Load the required packages
library(RODBC)
library(exasol)
library(rredis)
library(magrittr)
library(stringi)
library(rpart)
library(partykit)

# Connect to Exasol and Redis
con <- odbcConnect("exasol_vm")
redisConnect("172.20.248.13")

# Create a vector of random numbers for drawing the sample
rnd <- rnorm(nrow(iris))

# Add the group variable (training/validation)
iris$groups <- factor(NA, levels = c("Train", "Valid"))

# Random sampling stratified by species: 70% training, 30% validation
for (i in unique(iris$Species)) {
  logVec <- iris$Species == i
  iris$groups[logVec] <- ifelse(test = rnd[logVec] > quantile(rnd[logVec],
                                                              probs = 0.3),
                                yes = "Train",
                                no = "Valid")
}

# Check the sample
table(iris$groups, iris$Species)

# Clean up the workspace
rm(rnd, logVec, i)

The iris data are now transferred to Exasol. On first use, a database schema and an empty table are created; the iris data are then written into the table. Once uploaded, the data can be reused as often as needed.

# Create a database schema named my_schema
odbcQuery(con, "create schema my_schema")

# Create the empty table named irisdb
odbcQuery(con, "create or replace table my_schema.irisdb(
          SepalLength DOUBLE,
          SepalWidth DOUBLE,
          PetalLength DOUBLE,
          PetalWidth DOUBLE,
          Species CHAR(20),
          Groups CHAR(20));")

# Upload the iris data to Exasol
exa.writeData(con, data = iris, tableName = "my_schema.irisdb")

In the first step of the analysis, a decision tree is fitted locally. This tree is then used to predict the validation data.

# Create a training and a validation data set
train <- subset(iris, subset = groups == "Train", select = -groups)
valid <- subset(iris, subset = groups == "Valid", select = -groups)

# Fit the decision tree on the training data
localTree <- rpart(Species ~ ., data = train)

# Visualise the tree
plot(as.party(localTree))

# Predict the validation data with the tree
pred <- predict(localTree, type = "class", newdata = valid)

# Check the prediction
table(pred, valid$Species, dnn = c("Predicted", "Actual")) %>%
  addmargins()
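The confusion table can also be condensed into a few numbers. The following is a minimal sketch, not part of the original demo: it rebuilds a stratified 70/30 split on its own (with a fixed seed and `sample()`, both assumptions made so the block runs stand-alone) and derives overall accuracy and per-species recall from the table.

```r
library(rpart)

# Assumption: fixed seed and a simple per-species split, so the sketch is reproducible
set.seed(42)
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                     function(rows) sample(rows, size = round(0.7 * length(rows)))))
train <- iris[idx, ]
valid <- iris[-idx, ]

# Fit the tree on the training rows and predict the validation rows
tree <- rpart(Species ~ ., data = train)
pred <- predict(tree, newdata = valid, type = "class")

# Confusion matrix: rows are predictions, columns are the true species
conf_mat <- table(Predicted = pred, Actual = valid$Species)

# Overall accuracy: share of validation rows on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-species recall: correct predictions divided by the column totals
recall <- diag(conf_mat) / colSums(conf_mat)

print(conf_mat)
print(round(accuracy, 3))
print(round(recall, 3))
```

The exact values depend on the random split, so they will differ from the table produced by the demo code.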

The same procedure is now carried out in Exasol. The exa.createScript function creates an R script on the Exasol server, and calling the function it returns runs that script on the Exasol cluster. The analysis therefore no longer runs in the local R session but on the R instances in the Exasol cluster. Packages used on those R instances must be installed there. Have a look at this blog entry for more information.

The model built in the Exasol cluster will be stored in Redis, a key-value database. With Redis you can distribute models, functions and other R objects in the cluster and load them from the cluster.
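Storing a model in a key-value database works because a fitted R object can be turned into a byte vector and restored later. To my understanding, rredis serializes any non-raw value before sending it to Redis. The sketch below imitates that round trip locally with base R's `serialize()` and `unserialize()`, so it needs no Redis server; the commented calls at the end show what the equivalent rredis usage might look like, with a placeholder host name.

```r
library(rpart)

# Fit a small tree locally; on the cluster this would happen inside the R script
model <- rpart(Species ~ ., data = iris)

# Local stand-in for the Redis round trip: object -> raw bytes -> object
payload  <- serialize(model, connection = NULL)  # raw vector, the form that gets stored
restored <- unserialize(payload)                 # what a reader of the key gets back

# The restored object is a working rpart model that yields the same predictions
same <- identical(predict(model, iris, type = "class"),
                  predict(restored, iris, type = "class"))
print(class(payload))
print(same)

# With a reachable Redis server, the corresponding rredis calls would be
# ("redis-host" is a placeholder, "exa_rf" is the key used in this post):
# redisConnect("redis-host")
# redisSet("exa_rf", model)
# restored <- redisGet("exa_rf")
```

Because the whole model object travels through Redis, every R instance that reads the key needs the rpart package installed to make predictions with it.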


The tree model can then be used for forecasting in a separate step on Exasol. The first part of the script is largely identical to the training script described above.

exa_predict_rf <- exa.createScript(
  con,
  "my_schema.exa_pred",
  function(data) {

    require(rpart)
    require(rredis)

    redisConnect("172.20.248.13", port = 6379)

    # Read all input rows at once
    data$next_row(NA)

    df <- data.frame(v1 = data$SepalLength,
                     v2 = data$SepalWidth,
                     v3 = data$PetalLength,
                     v4 = data$PetalWidth,
                     species = data$Species)

    # Load the tree model from Redis
    rf <- redisGet("exa_rf")

    # Create the prediction
    pred <- predict(rf, newdata = df, type = "class")

    # Return the prediction together with the true class
    data$emit(pred, df$species)
  },
  inArgs = c("SepalLength DOUBLE",
             "SepalWidth DOUBLE",
             "PetalLength DOUBLE",
             "PetalWidth DOUBLE",
             "Species CHAR(20)"),
  outArgs = c("Prognose CHAR(20)",
              "Realwerte CHAR(20)"))

# Call the function created above. The result is stored in an object.
exa_pred <- exa_predict_rf("SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species",
                           table = "my_schema.irisdb",
                           where = "groups = 'Valid'")

# The table function shows how well the prediction performs.
table(exa_pred$PROGNOSE, exa_pred$REALWERTE, dnn = c("Predicted", "Actual")) %>%
  addmargins()
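One detail of the prediction script is worth checking locally: it builds its data frame with the column names v1 to v4, and `predict()` looks up predictors by name. The model stored under `exa_rf` therefore has to be fitted on columns with those same names. The sketch below illustrates this with local data only; `to_script_frame` is a hypothetical helper introduced for illustration, not part of the exasol package or of the original demo.

```r
library(rpart)

# Hypothetical helper: same column naming as the data frame in the prediction script
to_script_frame <- function(x) {
  data.frame(v1 = x$Sepal.Length,
             v2 = x$Sepal.Width,
             v3 = x$Petal.Length,
             v4 = x$Petal.Width,
             species = x$Species)
}

df <- to_script_frame(iris)

# A model fitted on v1..v4 can score a frame that uses the same names
fit_ok  <- rpart(species ~ v1 + v2 + v3 + v4, data = df)
pred_ok <- predict(fit_ok, newdata = df, type = "class")

# A model fitted on the original iris names cannot: predict() raises an error
# because Sepal.Length and the other predictors are missing from the frame
fit_mismatch <- rpart(Species ~ ., data = iris)
res <- tryCatch(predict(fit_mismatch, newdata = df, type = "class"),
                error = function(e) "error")
mismatch_fails <- identical(res, "error")

print(length(pred_ok))
print(mismatch_fails)
```

Keeping the frame construction in one place for both training and prediction is one way to avoid this kind of mismatch.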
