HadoopBI Architecture
Faced with exponentially growing volumes of data, enterprises and web companies around the world have been moving away from SQL databases to Google's MapReduce model for their big data challenges. Using tools such as Apache Hadoop, companies can now routinely process petabytes of data. But now there's a major new problem in the world of big data. The MapReduce/Hadoop model is SLOW and it's BATCH. Google, the inventor of the model, was the first to recognize this huge problem. To get the required realtime performance, it recently replaced MapReduce in its Google Instant search engine.
Led by Google, the big data space is now moving beyond batch processing to an era in which "continuous realtime intelligence" is essential. Companies are realizing that they need a big data architecture that is not just scalable, but also FAST, INCREMENTAL and REALTIME. The MapReduce model on its own just doesn't cut it for today's business and web apps, where a delay of even seconds can mean the difference between success and failure. Cloudscale was founded to solve this major problem for customers. We offer HadoopBI - the world's fastest big data solution.
Compute Like Google
With HadoopBI you can "compute like Google". The HadoopBI architecture gives you the three compute engines needed for a complete Big Data Business Intelligence solution:
- Apache Hadoop (MapReduce) for offline batch analytics at very large scale
- Apache HBase for scalable database queries
- Cloudscale's patented HRule in-memory architecture for scalable realtime big data analytics and realtime business automation
With HadoopBI you can handle the volume, variety and velocity of today's big data deluge. Scale from gigabytes to terabytes and petabytes. Handle all kinds of data - realtime and historical, structured and unstructured. Continuously analyze millions of new events and scenarios every second.
Throughput and Latency
The HRule big data engine can analyze a live stream in realtime with sub-second latency, and with throughput of more than 150MB/sec, on just three commodity 8-core Cloud Cluster instances. That corresponds to processing a SINGLE STREAM in parallel at a rate of TWO MILLION ROWS PER SECOND, or well over ONE TRILLION EVENTS per week. To give some idea of how fast this is, the nationwide call log systems of even the biggest US telcos only generate about 50,000 rows/sec, even at peak. For processing multiple streams, the solution scales linearly.
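The arithmetic behind these figures can be checked with a short sketch. It assumes that 1MB means 10^6 bytes and treats the quoted numbers (150MB/sec, two million rows/sec, 50,000 rows/sec) as exact, although the text gives them as rounded bounds. The implied average row size is derived from those two figures and is not stated in the text.

```python
# Back-of-envelope check of the quoted throughput figures.
# Assumption: 1 MB = 10**6 bytes; the inputs are the rounded numbers from the text.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds

rows_per_sec = 2_000_000                 # quoted single-stream rate
throughput_bytes_per_sec = 150 * 10**6   # quoted throughput (lower bound)
telco_peak_rows_per_sec = 50_000         # quoted telco call-log peak

rows_per_week = rows_per_sec * SECONDS_PER_WEEK
implied_row_bytes = throughput_bytes_per_sec / rows_per_sec
headroom_vs_telco = rows_per_sec / telco_peak_rows_per_sec

print(f"rows per week:      {rows_per_week:,}")
print(f"implied bytes/row:  {implied_row_bytes}")
print(f"multiple of telco:  {headroom_vs_telco}x")
```

Under these assumptions the weekly total is about 1.2 trillion rows, the implied average row is roughly 75 bytes, and the quoted rate is about 40 times the telco peak.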

The HRule engine can also be used to analyze massive historical data sets at lightning speed.

HRule is more than 125x faster than Yahoo's recently released S4 (Realtime MapReduce) system on the same hardware - about the difference in speed between walking from San Francisco to New York (4mph) and taking a plane (500mph). The three-node app dynamically distributes the rows of the stream in realtime as TEN MILLION different Facebook_User_ID substreams (PEs in Yahoo S4 terminology), one substream per Facebook_User entity, for concurrent processing at that phenomenal rate. The HRule engine is implemented in C++ and MPI, with smart compression for super-fast, node-to-node, bulk synchronous communications.
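The idea of splitting one stream into per-user substreams spread over three nodes can be illustrated with a minimal sketch. This is not the HRule implementation (which is C++ and MPI and is not shown here); the function names, the CRC32 hash and the in-memory lists are all assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict

NUM_NODES = 3  # matches the three-node setup described in the text


def node_for(user_id: str, num_nodes: int = NUM_NODES) -> int:
    """Pick a node for a user ID with a stable hash (hypothetical scheme)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_nodes


def route(rows, num_nodes: int = NUM_NODES):
    """Distribute (user_id, event) rows into per-node, per-user substreams."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for user_id, event in rows:
        # Every row for a given user lands in the same substream on the same node.
        nodes[node_for(user_id, num_nodes)][user_id].append(event)
    return nodes


rows = [("u1", "login"), ("u2", "post"), ("u1", "like"), ("u3", "login"), ("u2", "logout")]
nodes = route(rows)
for i, substreams in enumerate(nodes):
    print(i, dict(substreams))
```

Because the hash is deterministic, each user's events stay together and in arrival order, so per-user state can be kept locally on one node without coordinating with the others.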

For analytics on historical data, the HRule engine can also exploit time-based MapReduce-style parallelism to achieve even higher levels of scalability and performance.
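As a rough illustration of time-based MapReduce-style parallelism, the sketch below splits historical rows into hourly buckets, counts events per key in each bucket independently, and merges the partial counts. The bucket size, function names and row layout are assumptions; the text does not describe how HRule actually partitions by time.

```python
from collections import Counter, defaultdict

BUCKET_SECONDS = 3600  # hypothetical bucket width: one hour


def split_by_time(rows, bucket_seconds: int = BUCKET_SECONDS):
    """Group (timestamp, key) rows into time buckets."""
    buckets = defaultdict(list)
    for timestamp, key in rows:
        buckets[timestamp // bucket_seconds].append(key)
    return buckets


def map_bucket(keys):
    """Map step: count events per key within one time bucket."""
    return Counter(keys)


def reduce_counts(partials):
    """Reduce step: merge the per-bucket counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_events(rows):
    buckets = split_by_time(rows)
    # Buckets are independent, so this map could be spread across
    # processes or machines; it runs sequentially here for simplicity.
    partials = map(map_bucket, buckets.values())
    return reduce_counts(partials)


history = [(10, "a"), (20, "b"), (3700, "a"), (7300, "a")]
print(count_events(history))
```

The point of the split is that no bucket needs data from another, which is what lets historical analysis scale out beyond a single live stream.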

All kinds of advanced realtime analytics algorithms can be developed and run at these lightning speeds - algorithms that alert, clean, connect, encode, filter, group, match, merge, partition, rank, reduce, reorder, sample, sort, transform, validate, and window. The volume and velocity that HadoopBI can handle mean that it’s now possible, for the first time, for businesses to have continuous and deep insight into all kinds of critical information - realtime trends, statistics, patterns, correlations, opportunities, threats - as soon as the data becomes available. Sub-second latency means massive competitive advantage!
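To make one of these operations concrete, here is a minimal sketch of a "window" algorithm: an incremental count of events seen in the most recent N seconds. The class name and interface are hypothetical and are not part of HadoopBI; the sketch assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events whose timestamp falls in the last `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp: float) -> int:
        """Record one event and return the current in-window count."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Evict events that have aged out of the window (at or before the cutoff).
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


counter = SlidingWindowCounter(window_seconds=10)
counts = [counter.add(t) for t in (0, 5, 10, 21)]
print(counts)
```

Each event is appended once and evicted once, so the count is updated incrementally as data arrives rather than recomputed in a batch.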
