The Mathematics Of Information

This is a site about data, databases, and database programming languages. It is also a site about the very practical application of category theory as a scalable framework for their integration, not just at a conceptual level but at the level of their implementation layers as well. Drawing on an easy to state, hard to solve problem as a starting point, this inaugural post illustrates a few of the many desirable things that are possible as it begins to outline that framework.

A Simple Hard Problem

One of the common data mining tasks of a mutual fund analysis system is the need to rank funds based on the overlap of their holdings with other funds. Sample code that implements the basic structure of that task appears to the right.

Other than the problem it solves, there is nothing particularly unique or domain specific about this code. The problems and modeling issues it exposes are nevertheless far more universal than they might appear. Superficially, the Smalltalk-like variant in which it is written implements a number of advanced language constructs, many of which this application code exploits successfully. These include closures, list comprehensions and instance specialization, among others. Most, if not all, of these constructs are or soon will be commonplace in many modern programming languages. If that were the only issue, there might not be much more to say. Of course, this is a data mining application and there is a great deal more to say.

Syntax highlighting notwithstanding, perhaps the most notable feature of this example are the number of collection operations that appear as if they would be quite at home in SQL. Their prevalence and centrality raises questions about where the boundary between database query language and application programming language lies. It also raises questions about where this application ought to run. Should it be run as a program that simply goes to a database for the information it needs or should it be run inside the database so that its collection operations can be optimized for the physical organization of the data? This is more than a philosophical question. Although it may not be obvious, starting with a single Account object, this code ultimately performs a cross-sectional analysis examining all accounts holding any of the securities held by the initial account. Depending on the size of that initial account, this product space search can easily touch most of a large database. The problem gets even more interesting if this application must start with a collection of accounts:

AllAccounts do: [^self getHoldingsOverlap first: 5 . do: […] ];

Issues of scale certainly argue for processing an application such as this with an awareness of the physical organization of the data. While query optimizers have the ability to do this for restricted languages, the language of this query goes well beyond traditional database languages and models. For example, groupedBy: [account] returns the groupList constituents for each group it generates. While groupedBy is a powerful construct in SQL, the relational model does not have the conceptual vocabulary required to externalize that collection of constituents, even though the information is available. In light of the current NoSQL movement, it’s tempting to dismiss this problem by simply noting that other languages are not similarly constrained. That is clearly true and sidesteps the fundamental issues. The challenge remains for those languages to articulate their conceptual and mathematical frameworks if their implementations are to leverage something more than abundant cleverness.

Beyond issues of scale, this application relies in very significant ways on complexity encapsulated in the underlying fabric of relationships in the data. Of particular importance is time although all manor of provenance related issues enter here as well. Unlike the encapsulation provided by abstract data types, the complexity encapsulated here often results in context sensitive answers. For example, the answer to:

MyAccount getHoldingsOverlap

should be very different from the answer to:

1 yearAgo evaluate: [MyAccount getHoldingsOverlap].

This presentation I gave to a VLDB workshop on the subject discusses some of the issues.

I will say a great deal more about this area as I describe the mathematical framework that gives answers to these issues a home.

The Mathematics Of Information

Hello world. You look different today…

A Simple Hard Problem

Recent Posts

Recent Comments

Archives

Categories

Meta