Image credit: By Bayreuth2009 GFDL; CC BY 3.0, from Wikimedia Commons

Oktoberfest Analysis Part I

Everybody knows the Oktoberfest which takes place in Munich every year. In this blog post series we are going to look into a public available data set and try to gain some insights about the Oktoberfest.

In the first part we load and describe the data. Furthermore, we will analyse the price and consumption of beer and hendl (chicken) over the years.

In the second part we are going to have a closer look at the Bavarian Central Agricultural Festival - also known as “ZLF” - and its influence on beer- and hendl consumption, as well as on visitor count. Further, we are going to look at the influence of the 9/11 terror attacks.

Aim

Since I am currently diving into the field of data analysis and machine learning, I decided to start my first public analysis in order to use the tools I have been learning so far. The aim of this exploratory data analysis was to create some insights about the Oktoberfest using the public available Oktoberfest data set from the Munich Open Data side.

Further, I wanted to explore the Munich Open Data API to export the data from the server. My biggest aim, though, is to improve my analytic skills by getting feedback from the community.
That is why I would really appreciate your feedback. Feel free to comment on this post and to contact me.

Methods and Material

For the analysis I used the public available Oktoberfest data set which can be found here. Additional information which can not be found in the data description online, has been provided by the city of Munich via email contact.

Set up Environment

We will start this analysis by loading some required packages:

suppressMessages({
require(httr, quietly = T)
require(tidyverse, quietly = T, warn.conflicts = F)
require(gridExtra, quietly = T, warn.conflicts = F)
require(grid, quietly = T, warn.conflicts = F)
})

Importing Data from API

Having the environment set up, we will use the Munich Open Data API and the httr library to load our data from the server, and store it as a data.frame within R.
The resource-id (e0f664cf-6dd9-4743-bd2b-81a8b18bd1d2), which we are going to use to get our data via the API, can be found under additional information (German: “zusätzliche Informationen”) at the bottom of the data set side.

url <- "https://www.opengov-muenchen.de"
path <- "api/action/datastore_search?resource_id=e0f664cf-6dd9-4743-bd2b-81a8b18bd1d2"

# get raw results
raw.result <- GET(url = url, path = path)

# convert raw results to list
result <- content(raw.result)

# convert results to table
dt <- bind_rows(result$result$records)

Let’s take a quick view at our data to see, if the import went well:

head(dt)
## # A tibble: 6 x 9
##   hendl_preis jahr  besucher_gesamt hendl~ dauer bier_~ bier_~ `_id` besu~
##   <chr>       <chr> <chr>           <chr>  <chr> <chr>  <chr>  <int> <chr>
## 1 4.77        1985  7.1             629520 16    54541  3.20       1 444  
## 2 3.92        1986  6.7             698137 16    53807  3.30       2 419  
## 3 3.98        1987  6.5             732859 16    51842  3.37       3 406  
## 4 4.19        1988  5.7             720139 16    50951  3.45       4 356  
## 5 4.22        1989  6.2             775674 16    51241  3.60       5 388  
## 6 4.47        1990  6.7             750947 16    54300  3.77       6 419

Yes, the API data import worked!

As we see, the columns have German names. We change this for all English speaking readers. By doing so, we also perform some class- and metric corrections.

dt <- dt %>% 
  transmute(id = `_id`,
            # class corrections
            year = as.integer(jahr),
            duration = as.integer(dauer),
            hendl_cons = as.integer(hendl_konsum),
            beer_price = as.numeric(bier_preis),
            hendl_price = as.numeric(hendl_preis),
            # class- and additional metric corrections
            beer_cons = as.integer(bier_konsum)*100, # change to L
            visitors_total = as.numeric(besucher_gesamt) * 1000000, # change to nr. of people
            visitors_day = as.integer(besucher_tag)*1000) # change to nr. of people

Data Description

Our data have 9 columns and 33 rows. The data set contains yearly data on beer- and hendl (chicken) consumption from 1985 to 2017. It also provides information about the price of both as well as total visitors, mean daily visitors, and the duration of the Oktoberfest in each year.

The table below gives a quick overview on variable names and their metrics.

Variable Metric
beer consumption L
beer price €/Liter
hendl consumption Nr. of Chicken
hendl price €/half chicken
total visitors Nr.of People
daily visitors Nr. of People

Data Munging

Now we are going to start to get our hands dirty and work with the Oktoberfest data set. We add some information like the years when the Bavarian Central Agricultural Festival (ZLF) took place and other variables which we are going to use either in part 1 or 2 of our analysis.

# generate vector with years in which zlf festival took place 
# (every three years up to 1996; every 4 years from 2000 on)
zlf_years <- c(seq(1810,1996, 3), seq(2000, max(dt$year),4))

dt <- dt %>% 
  mutate(zlf = factor(ifelse(year %in% zlf_years, 1, 0)),
         hendl_cons_per_visitor = hendl_cons / visitors_total,
         beer_cons_per_visitor = beer_cons / visitors_total) %>% 
  select(-id) # remove id column since we don't use it in the analysis

Since there is no missing data in our data set, we can start our analysis.


Data Analysis

In our data analysis we are going to look at the following topics:

Part I:

  1. Beer price and its consumption
  2. Hendl price and its consumption

Part II:

  1. Influence of the ZLF on some variables in our data
  2. Mean daily visitor count before and after 9/11

We are going to use visualizations and simple modeling techniques to describe the data. Further, we are going to perform statistical tests to compare means in groups.

Beer Price and Consumption

Every year one of the most discussed topics around the Oktoberfest is the increased beer price. People are always complaining that the beer is too expensive. Since we have the historical data on beer prices, our first question on the data set is:

How did the beer price develope from 1985 to 2017?

Modeling Beer Price Development Over the Years

Note: The currency used in the data is €. All prices before the year 2002 have been transformed from DM (German currency before 2002) to Euro.

In fact, the price for 1L beer increased almost linear over the years. We will try to use a simple linear regression model to describe the data:

fit_beer <- lm(beer_price ~ year, data = dt)
summary(fit_beer)
## 
## Call:
## lm(formula = beer_price ~ year, data = dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4185 -0.1280  0.0050  0.1286  0.5254 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.830e+02  8.603e+00  -56.15   <2e-16 ***
## year         2.447e-01  4.299e-03   56.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2352 on 31 degrees of freedom
## Multiple R-squared:  0.9905, Adjusted R-squared:  0.9902 
## F-statistic:  3240 on 1 and 31 DF,  p-value: < 2.2e-16

Our model suggests that the beer price increased by 0.24 Cents per year.

Fitting models to the years with different currencies separately, we see a small difference in estimated price increase per year:

fit_dm <- lm(beer_price ~ year, data = subset(dt, year < 2002))
summary(fit_dm)
## 
## Call:
## lm(formula = beer_price ~ year, data = subset(dt, year < 2002))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.236176 -0.112059 -0.009412  0.077647  0.260000 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.203e+02  1.399e+01  -30.04 8.17e-15 ***
## year         2.132e-01  7.022e-03   30.37 6.94e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1418 on 15 degrees of freedom
## Multiple R-squared:  0.984,  Adjusted R-squared:  0.9829 
## F-statistic: 922.2 on 1 and 15 DF,  p-value: 6.944e-15
fit_euro <- lm(beer_price ~ year, data = subset(dt, year >= 2002))
summary(fit_euro)
## 
## Call:
## lm(formula = beer_price ~ year, data = subset(dt, year >= 2002))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.206721 -0.013706  0.003794  0.020860  0.298456 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.817e+02  1.244e+01  -46.75   <2e-16 ***
## year         2.938e-01  6.192e-03   47.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1142 on 14 degrees of freedom
## Multiple R-squared:  0.9938, Adjusted R-squared:  0.9934 
## F-statistic:  2251 on 1 and 14 DF,  p-value: < 2.2e-16
  • In years with DM currency (1985 to 2001) the estimated yearly price increase was around 0.21 Cents per year.
  • In the years after the Euro has been introduced (>2001), the estimated yearly price increase was 0.29 Cents per year.

Influence of Beer Price on Consumption

Now that we confirmed that the price increased over the years, the next question which arises is:

Did this increase have a negative influence on mean beer consumption per visitor?

In order to answer this question we are going to start with a visualization of our data:

We do not see a decline in mean beer consumption per visitor over the years. In fact, since around year 1999 it seems that it even has been increasing steadily.

To see how the data are correlated we perform a correlation test using the Pearson correlation coefficient:

cor.test(dt$beer_cons_per_visitor, dt$beer_price)
## 
##  Pearson's product-moment correlation
## 
## data:  dt$beer_cons_per_visitor and dt$beer_price
## t = 16.204, df = 31, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8920491 0.9730963
## sample estimates:
##       cor 
## 0.9457298

The test suggests a true correlation with a Pearson correlation coefficient of 0.95. The p-value is way below 0.05 and we have a narrow confidence interval. If we plot the data, all points seem to be around one line:

This result would suggest that beer consumption increases the higher the beer price gets. Nevertheless, in my opinion, this is an example that correlation does not imply causation. I think that the increased consumption is a separate phenomena and not caused by increased prices. One reason you can think of might be:

  • Oktoberfest gets more and more popular outside of Munich. More and more people from all over the world go there in order to drink a “Maß” beer. For people coming from Australia, Sweden, etc. the beer is still cheap compared to their home. That is why they do not consume less beer with increasing prices since it is still cheap for them.

In conclusion the data set suggests that the price does not seem to have a negative influence on mean beer consumption per visitor.

Hendl Price and Consumption

On the Oktoberfest it is not just about drinking. A lot of people also like to have something to eat besides their beer. So what about the hendl consumption and price development?

Modeling Hendl Price Development

As before we are going to have a look at the price only first:

Important Notes:

  1. The currency used in the data is €. All prices before the year 2002 have been transformed from DM (German currency before 2002) to Euro.
  2. Before the year 2000 the data are mean prices at the selling places. Since 2000 the data are only mean prices inside the tents.

It seems that there was a huge price increase in year 2000. But we need to be careful: As mentioned in the notes above, the data collection method changed in 2000. So we need to compare both periods separately. In the year 2013 we have some kind of price outlier. Here the price was raised a huge amount compared to the steady increase the years before. After that, the price dropped to a level which seems to lie on a line of steady price increase. We will try to model the price increase from year 2000 on using a linear regression model.

fit_hendl <- lm(hendl_price ~ year, data = subset(dt, year >= 2000))
summary(fit_hendl)
## 
## Call:
## lm(formula = hendl_price ~ year, data = subset(dt, year >= 2000))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32939 -0.16118 -0.06931  0.10417  0.75069 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -426.73552   25.49174  -16.74 1.46e-11 ***
## year           0.21713    0.01269   17.11 1.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2794 on 16 degrees of freedom
## Multiple R-squared:  0.9482, Adjusted R-squared:  0.9449 
## F-statistic: 292.7 on 1 and 16 DF,  p-value: 1.048e-11

The model suggests that the hendl price increased by 0.22 Cents per year since 2000.

When we add the model to our plot it looks like this:

Influence of Hendl Price on Consumption

Above we saw that the beer price did not have a negative influence on the beer consumption. We are going to ask the same question for the hendl price and consumption. To get a first overview, we plot the data.

We see that the hendl consumption dropped in 2001 after slowly decreasing from 1991 on. After the drop, the consumption stayed at a constant level, whereas the price went up.
For me it looks like the hendl consumption being correlated with price until year 2000. After that the consumption reached a minimum bound. Maybe at this point price sensitivity of the remaining hendl consumers changes.

We can check the correlation by performing a correlation test using the Pearson correlation coefficient:

with(subset(dt, year < 2000),cor.test(hendl_cons_per_visitor*100, hendl_price))
## 
##  Pearson's product-moment correlation
## 
## data:  hendl_cons_per_visitor * 100 and hendl_price
## t = -2.1586, df = 13, p-value = 0.05016
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.812214934 -0.001917789
## sample estimates:
##       cor 
## -0.513676
with(subset(dt, year >= 2000),cor.test(hendl_cons_per_visitor*100, hendl_price))
## 
##  Pearson's product-moment correlation
## 
## data:  hendl_cons_per_visitor * 100 and hendl_price
## t = -0.98985, df = 16, p-value = 0.337
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6357835  0.2552837
## sample estimates:
##        cor 
## -0.2402164

The correlation coefficients suggest a negative correlation for both periods. Nevertheless, the tests confirm the null hypothesis, which says that there is no true correlation between the variables.

For the period before 2000 there is at least a strong tendency (p = 0.0501576) for a true negative correlation. Despite that, the confidence interval for the correlation coefficient is very big, which gives us a strong uncertainty.

In conclusion we can say that the price seems to have an influence on the hendl consumption up to a specific point. Maybe we could explain the constant level of consumption after reaching that specific price margin with the following hypothesis:

Normally people would buy a hendl up to their specific price boundary. That would mean a steady decrease in consumption with increasing price.

Nevertheless, there are a lot of companies, which rent a table in a tent and invite their employees to come. Most of the time these companies give a free amount of hendl and beer consumption to their employees. I think that a big company’s price sensitivity concerning hendl and beer is not as high as the price sensitivity of a normal Oktoberfest visitor. Thus, even with increasing prices the basic level of consumption does not change a lot.

Conclusion

In the first part of our quick analysis we showed that…

  • The beer price increased over the years. We modeled a increase of 0.24 Cent per year with differences before and after Euro introduction

  • The increase in beer price did not have a negative influence on beer consumption. The consumption even went up over the years

  • Hendl price also increased steadily. For the years since 2000 our model estimated a price increase of 0.22 Cent per year

  • The hendl price actually seem to have a influence on hendl consumption up to a specific point. We showed a tendency for negative correlation before year 2000.

In addition, I learned that the Munich Open Data API is not that difficult to use.

In the next part of our analysis we will have a closer look on the influence of the ZLF and the 9/11 terror attacks.


Acknowledgement

For this part I would like to say thank you to the people who helped me with the ressources for that analysis.I would like to thank Frank Börger and the team from Munich Open Data for answering my questions on the data and providing additional resources. Another great thank you goes to my friend Pat forproofreading.

Next