1 About Uncertainty
Well, we live in a world of uncertainty. Whether it is tomorrow’s weather, the next World Cup champion or simply the outcome of the next roll of a die – some things we will only ever be able to predict up to some level of confidence. Of course, that level of confidence varies with the situation: most people probably consider tomorrow’s weather more predictable than the outcome of a die.
In addition to varying degrees of uncertainty, there also exist varying perceptions of its source. Sometimes we think of uncertainty as arising from randomness; at other times, as mere incomplete knowledge of a deterministic process. Take weather predictions: some people are quite good at forecasting, even where non-experts fail most of the time. The level of uncertainty thus varies across persons, which usually leads to the conclusion that the uncertainty involved is due to incomplete information. In other words, tomorrow’s weather is already completely determined today – we simply do not yet have enough information to know it. This, at least, seems a very reasonable explanation of why some people forecast better than others: they just have more information about the underlying deterministic process. In contrast, take the classic example of a die. As there are currently no publicly celebrated experts on die prediction, testing their powers in nationally broadcast competitions, dice have become a standard example of unpredictability and, hence, randomness. But will the notion of randomness for dice still hold up once we find a way to successfully incorporate air temperature, velocity and throwing angle into the prediction? And if so, what does that tell us about the world? Is it random, or deterministic? To paraphrase Pearl S. Buck: sometimes uncertainty only appears to be true randomness, and even true randomness may only be so as of now.
In the end, of course, I cannot answer any of these questions either. However, I think that in most situations it is not that important to distinguish between incomplete information and randomness. We have to treat both the same way anyway: in a framework of probabilities and stochastics.
Let me give you an example of what I mean. Think about the following two situations: first, you plan to throw a die. Second, a friend of yours has already rolled the die, but has not told you the result yet. While in the first case the realization of the die is still subject to randomness, in the second case we only have incomplete knowledge about an already determined realization. Does it make any difference to us, though? Either way, the probability of, for example, a “3” will be 1/6.
Of course, this is not to say that there is no difference between the two situations at all. When we strive for additional information that allows us to update our knowledge, there surely is: forecasting true randomness is impossible – we would not want to spend too much time trying.
For the moment, let’s just focus on the case of incomplete information, and think about ways to incorporate additional information with regard to some given quantity of interest. Generally speaking, you first need to find a second quantity that exhibits some form of dependence on the original quantity you want to predict. Whenever you gain additional information about this second quantity, you simultaneously gain knowledge about the first quantity through their dependence structure. This sounds quite abstract, so let me make it a bit more concrete with an example. Assume we want to predict the body height of some randomly chosen person. Since we do not know which person in the world has been chosen, our first guess for that person’s height is just the overall distribution of body heights in the world population. However, we can easily get more precise once we learn some other characteristic of the person. Say we are additionally told the person’s gender. Since there is a dependence structure between gender and body height (men tend to be taller than women), we thereby simultaneously obtain additional information about the person’s height, too. Hence, we usually do not get additional information directly, but only through the detour of some other quantity. It is then all about drawing the right conclusions – in other words, transforming related information into terms that convey information about our original variable of interest. Or, as I call it: quantifying information.
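The height example can be made tangible with a small simulation – a minimal sketch, with all population parameters (means, standard deviations, the 50/50 gender split) being illustrative assumptions, not real-world estimates:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed, illustrative population parameters (heights in cm)
n = 100_000
is_male = rng.random(n) < 0.5
heights = np.where(is_male,
                   rng.normal(178, 7, n),   # assumed male distribution
                   rng.normal(165, 6, n))   # assumed female distribution

# Marginal guess: we know nothing about the person
print(f"marginal:   mean={heights.mean():.1f}, sd={heights.std():.1f}")

# Conditional guess: we are additionally told the person is male
male_heights = heights[is_male]
print(f"given male: mean={male_heights.mean():.1f}, sd={male_heights.std():.1f}")
```

The conditional standard deviation comes out noticeably smaller than the marginal one: learning the gender shifts our best guess and, more importantly, shrinks our uncertainty about the height.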
2 About Contents
2.1 Quantifying Information
The process of quantifying information will be the very core of the blog, and my approach to it is described more elaborately on the page QI philosophy. In short, it consists of five pillars:
- quantification: eliminating subjectivity and bias
- data: making the best use of data
- models: balancing the trade-off between assumptions and data
- updating: drawing inferences from given information
- visualization: displaying results intuitively
Let’s try to narrow down these generic terms a bit, so that you get a better feeling for what to expect thematically.
In terms of data, my research will usually revolve around financial and economic topics. Let me give a few words of explanation on this before you prematurely put the label “greedy” on me. No, for me it is not about playing the largest casino on the planet. Rather, I firmly believe in the large impact that the global economy has on people’s well-being these days. From New York to Munich to Sub-Saharan Africa – developments in the global economy and financial markets are perceptible in every part of the world. We might as well try to use this power for good. Just think how people would profit if phenomena like recessions or high unemployment could be abolished. In line with this reasoning, financial economics, as a major component of the global economy, also has the potential to improve economic conditions in the world. Financial markets allow the circulation of money – something that does not only play into the hands of greedy speculators. For example, all around the world, at any point in time, there will always be people saving money that they prefer to have at their disposal at some future time. On the other hand, there will also always be people in need of financing, for example to realize an idea for a start-up. If we, as a society, want to exploit our opportunities to the fullest, we need to efficiently allocate spare money to the places where it is used best. For that we need banks, equity markets, bond markets and the like. Hence, it is important, as a society, to make sure that these financial intermediaries rely on the highest possible sophistication in asset management and risk management – something that should be done, in my opinion, on the basis of sound data analysis.
Nevertheless, this does not mean that all of the data tackled in this blog will exclusively deal with economics or financial economics. Whenever I have some time to follow additional interests and hobbies, I will definitely try to include other topics as well.
In terms of models, I will probably often rely on approaches involving copula theory – something my current research mostly revolves around. As will hopefully become clear in some of my posts, copula functions provide some really nice properties when it comes to modeling financial data. Whenever differently structured data is involved, however, we will definitely have the opportunity to switch to other model types as well.
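To give a first taste of why copulas are convenient: they separate the dependence structure from the marginal distributions. Below is a minimal sketch of sampling from a bivariate Gaussian copula – the correlation value and the exponential marginals are purely illustrative assumptions:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# 1) Draw correlated standard normals (the Gaussian copula's dependence engine)
rho = 0.8
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=20_000)

# 2) Map each margin to Uniform(0, 1) via the standard normal CDF;
#    the uniforms keep the dependence but forget the normal shape
norm_cdf = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
u = norm_cdf(z)

# 3) Impose arbitrary marginals through inverse CDFs,
#    here Exp(1) "losses" via the analytic quantile function
x = -np.log(1.0 - u[:, 0])
y = -np.log(1.0 - u[:, 1])

# Rank (Spearman) correlation survives the monotone marginal transforms
def rank(a):
    return a.argsort().argsort()

spearman = np.corrcoef(rank(x), rank(y))[0, 1]
print(f"Spearman correlation: {spearman:.2f}")
```

The point of the exercise: the marginals changed completely (from normal to exponential), yet the rank correlation induced by the copula is untouched – exactly the kind of separation that makes copulas attractive for financial data with heavy-tailed margins.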
Working for a university, I naturally feel strongly committed to science. Amongst other things, I’d like to see the following features put into practice:
- public and fast access: results should be published as soon as possible, and be publicly available without any access restrictions, in order to avoid re-inventing the wheel
- reproducible research: results should be reproducible, so that we can have full trust in the results presented, and subsequent research has an optimal starting point
- didactic presentation of results: results should be presented with maximal clarity, enabling third parties to follow the explanations as easily as possible. Whenever supplementary graphics, derivations or computer code help comprehension, they should not be banished.
In light of these features, let me emphasize the importance of code a bit further at this point. As any kind of data analysis these days quite naturally involves some statistical software coding, the results of any paper directly depend on the correctness of the underlying code. Why should we leave something out of the publication that contributes so much to the overall results? How would we know if some mistakes in the code remained unnoticed? I’m sure we all know how easily small mistakes can make it into code – especially once we admit to ourselves that a large fraction of researchers in statistics never got any professional training in programming, but had to learn it on their own. In addition, people could also profit from published code, since they would have the chance to adopt best practices and nifty tricks from peers with better programming skills.
Furthermore, I also think that in some situations code is the most efficient and comprehensible way to convey information. For example, I’d rather see the implementation of some bootstrapped, rolling-window, multi-step-ahead forecast than get it described in words. In many situations, statistical code is a kind of lingua franca understood by every statistician or data analyst – especially if it is clean and commented code that we are talking about.
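To illustrate the point, here is what such an implementation might look like – a deliberately minimal sketch using an AR(1) model on made-up toy data, with all window sizes and parameters chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def rolling_forecasts(y, window=100, horizon=5, n_boot=200):
    """Rolling-window, multi-step-ahead AR(1) point forecasts with a
    residual bootstrap (a minimal sketch, not production code)."""
    forecasts = []
    for t in range(window, len(y) - horizon + 1):
        w = y[t - window:t]
        # OLS fit of AR(1): y_t = a + b * y_{t-1} + e_t
        b, a = np.polyfit(w[:-1], w[1:], 1)
        resid = w[1:] - (a + b * w[:-1])
        # Bootstrap: iterate the fitted recursion with resampled residuals
        paths = np.full(n_boot, w[-1])
        for _ in range(horizon):
            paths = a + b * paths + rng.choice(resid, size=n_boot)
        forecasts.append(paths.mean())  # point forecast for time t + horizon
    return np.array(forecasts)

# Toy data: a persistent AR(1) series
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.9 * y[t - 1] + rng.normal()

f = rolling_forecasts(y)
print(len(f), f[:3].round(2))
```

A dozen lines of code, and every detail – window length, forecast horizon, how the bootstrap propagates through the recursion – is unambiguous in a way a verbal description rarely is.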
For these reasons, I believe that code should be part of any research in data analysis just as much as text and graphics are. And the best way to achieve this is literate programming: a technique that intermixes code and text. This way, code can appear right where it is referred to in the text, and it can be annotated and made fairly understandable through accompanying remarks.
Taking all these points together, I think that blogging has the potential to be a highly beneficial way to spread ideas and publish research results these days. Additional features like sharing on other platforms, following people and re-posting outstanding work by other bloggers extend its appeal even further. Combined with Google’s search capabilities, blogging could be an ideal way to build a publicly available knowledge base of scientific content.
You can now judge for yourself which of the aforementioned features you consider desirable, and whether they are also present in our current publication system…
2.3 Software Tools
As a statistician these days, you already spend most of your time at your computer. Nearly every regular task either must or can be done digitally: data analysis, preparation of presentations and creation of graphics inevitably involve a computer, while reading papers or taking notes digitally is only gradually catching on. Still, the major role that computers play in research is beyond doubt.
Hence, in order to perform any task at the computer as efficiently as possible, I think it is crucial to rely on good software. Thus, I also want to share tips about software tools that facilitate common tasks in data analysis. Whether it is software for the statistical analysis itself (R, Matlab, Julia), implementations of best practices in coding (unit testing, version control), literate programming (Sweave, org-babel), IDEs (Emacs ESS, RStudio), or simply good ways to publish results (LaTeX, LyX, Inkscape, Emacs) – there is a whole bunch of tools out there, and in my opinion it is always helpful to get an idea of what other people use.
However, at this point I should probably warn you upfront about what to expect from me in this regard. Over the last years I have become increasingly fond of open-source software, so that my operating system now is Linux (Ubuntu), and almost everything else I do inside the “one true editor”: Emacs!
3 About Me
As you might already have guessed, some of my personal interests revolve around data analysis, statistics, or just science and research in general. Accordingly, I am currently working at the Ludwig-Maximilians-University Munich, at the chair of financial econometrics of the statistics department. Besides working on my thesis toward my doctorate, I am also in charge of the exercises for the lecture Risk Management and of a software course called Matlab for Finance.
Besides that, I spend my money on traveling, my spare time on football (of course: the European kind) and my love on my family, my girlfriend and my friends.