Monday, January 28, 2008

Role of StreamSQL in Massive Event Processing

Event processing is emerging as the technology of choice for processing a variety of real-time data feeds, whether they come from internet applications, portable devices, or traditional applications requiring operational reporting. What differentiates event processing from traditional data processing and business intelligence is that analysis and reporting on incoming data happen continuously and are integrated seamlessly into the application, rather than as a separate offline cycle of data collection, assimilation, cleansing, analysis, and reporting. While traditional business intelligence will continue to grow in importance and investment as an irreplaceable technology for deeper study of past behavior and for predicting the future from historical data, society at large seems eager to find out what is happening now! As one might expect, traditional business intelligence and event processing, though they started at opposite ends of a spectrum, are slowly converging toward the middle.

There are multiple facets to event processing. In simpler cases, one may want to observe data flowing on the wire, perform operations such as filtering, splitting, routing, enrichment, and simple aggregations (maintaining counters and min/max values, for instance), and raise alerts when the observed behavior warrants action, as the sketch below illustrates. In more complex cases, the event processing system must observe much larger volumes of data, both because the incoming feeds themselves can be high-speed and high-volume (stock market program trades or process control systems in manufacturing, for example) and because the incoming data needs to be correlated with master/reference data so that information is processed in context. These more complex scenarios require somewhat different handling and a different technology, since they need to store incoming data in data repositories and correlate it with other stored data. Owing to the high speed and high volume of the feeds and the extent of the correlation requirements, this aspect of event processing is called stream processing. We also refer to it as massive event processing.
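To make the simpler case concrete, here is a sketch in a hypothetical StreamSQL-style dialect. The Trades stream, the [RANGE ...] window syntax, and the thresholds are all illustrative assumptions, since actual dialects vary:

    -- Hypothetical StreamSQL-style dialect; stream names, window syntax,
    -- and thresholds are illustrative, not from any particular product.
    -- Continuously filter a trade feed as events flow by.
    CREATE STREAM HighValueTrades AS
      SELECT symbol, price, volume
      FROM   Trades
      WHERE  price * volume > 1000000;

    -- Maintain running counters and a max per symbol over a sliding
    -- one-minute window, and emit an alert row when activity spikes.
    SELECT   symbol,
             COUNT(*)   AS trade_count,
             MAX(price) AS max_price
    FROM     Trades [RANGE 1 MINUTE]
    GROUP BY symbol
    HAVING   COUNT(*) > 500;

Once registered, queries like these run indefinitely; the engine evaluates them incrementally as each event arrives, rather than rescanning the stream on every request.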

When we need to store and process large amounts of data, whether transactional data, warehoused data, or event stream data, the time-tested approach of SQL has an advantage over other event processing techniques in both expressive power and scalable performance, because an enormous amount of R&D has gone into SQL optimization over the past 35 years. However, traditional SQL is based on a request-response model: the application issues a SQL statement, receives the results, sends another query when it needs more results, and so on. In contrast, massive event processing, or stream processing, needs a continuous query capability, where the application issues a set of queries to the stream processing system along with additional information such as the conditions that should trigger querying and the windows over streams that querying should target. The stream processing system continuously receives event streams from multiple producers, continuously runs multiple queries concurrently, and continuously sends query results to observers (e.g., dashboards). A continuous query engine built with these capabilities can serve as a building block in a stream processing network or grid. To enable this type of processing, we need StreamSQL extensions to standard SQL that provide a comprehensive, optimized platform for massive event processing. While there are different approaches to StreamSQL itself, the most promising one seems to be to build upon the SQL support available in existing relational and in-memory databases rather than to develop a new StreamSQL language and engine from scratch.
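As a hedged illustration of the contrast with request-response SQL, the following sketch registers a continuous query that correlates a stream with a stored reference table. The REGISTER QUERY form, the Trades stream, and the RefSymbols table are assumptions loosely modeled on CQL-style research proposals, not the syntax of any particular engine:

    -- Illustrative continuous query, loosely modeled on CQL-style
    -- StreamSQL proposals; all names here are assumed for the example.
    -- Registered once; the engine then pushes results to observers as
    -- events arrive, instead of waiting for the application to poll.
    REGISTER QUERY SectorExposure AS
    SELECT   r.sector,
             SUM(t.price * t.volume) AS notional
    FROM     Trades [RANGE 5 MINUTE SLIDE 10 SECOND] AS t  -- window over the stream
    JOIN     RefSymbols AS r                               -- stored master/reference table
    ON       t.symbol = r.symbol                           -- correlate events in context
    GROUP BY r.sector;

The key design point is the inversion of control: results are pushed to observers as the window contents change, so a dashboard stays current without issuing repeated requests.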
