August 28, 2019

Vertica Analytics DB

Evolution of the Internet, financial reasons, regulatory compliance and other business pressures over the past decade have resulted in stronger competition and the need to accumulate and analyze huge volumes of data. In this period, data warehouses grew in size considerably, and analysts expect that most data warehouses would continue to grow by 10 times or more in coming years.

However, in recent years, as the response to the smaller databases failing to keep up the pace, certain innovative database management systems, so called DBMS have been developed. Execution of ad-hoc queries with very large data volumes is not natively possible in DBMS systems designed to process records as rows of data. Such systems are more efficient for write-intensive transactions, failing to keep up with the workloads for read-intensive analytics processes. The reason for such low efficiency is, inter alia, the fact that every action requires to process an entire record consisting of dozens or hundreds of fields – even if the query only refers to a single field or to a few of them. Customers have failed to obtain better performance from such DBMS systems and have been spending millions of dollars annually in order to bridge this gap. Huge investments have been made in adding DBA resources, code enhancement, maintaining OLAP cubes or replacing their DBMS with costly proprietary systems.

Research pioneer, Professor Michael Stonebraker, understood that the time was ripe for significant innovation in the database industry and has spearheaded the invention of a new architecture. This technology provides ground-breaking high-speed performance for databases consisting of a few Terabytes to thousands of Terabytes where end-user requirements change rapidly.

Performance testing has shown that Vertica is faster, by a factor of 10 times to 1,000 times, compared to traditional relational databases (RDBMS) and proprietary data warehouse hardware.  Thanks to its unique pricing and the fact that Vertica uses standard blade servers, large databases with a heavy load of intensive queries may be deployed at a fraction of the cost of other solutions.

Vertica uses standard hardware using a Shared Nothing technology and the Linux operating system, with a pricing model which allows for creating and expanding a highly affordable grid network. Furthermore, Vertica licensing is based on the volume of data stored, without any additional payment for hardware upgrades. The cluster may be expanded online with no interruption to users. Vertica has been designed to provide for efficient management capabilities. For example: built-in processes ensure minimal need for manual maintenance. Product availability and recovery capacity eliminate the need to make professional service calls. Schema design and physical optimization are carried out automatically.

From the viewpoint of DB developers, the Vertica database works like a standard product, with support for standard SQL, standard interfaces through ODBC and/or JDBC which works with all popular ETL products, including report generators for BI. A look ‘under the hood’, however, reveals a different picture. The Vertica database was designed to aggressively reduce I/O and was developed from the ground-up to support grid-based clusters.

This innovative architecture allows data to be stored in columns for read-intensive applications with near-immediate response times. In a relational database, data records are stored in rows. In order to satisfy the query conditions, a DBMS must sequentially read all columns and rows - including columns which are not part of the query. Some databases use Bitmap indices, Data cubes, (1)Materialized views to fulfill demands for performance - but for many databases the management of data structures  required to optimize performance for each and every query becomes a highly complex management task, which requires vast storage space and offers poor update performance.
Vertica stores the values for each column individually and efficiently, which allows it to read only the columns, referenced by the query. This approach significantly improves query processing performance, reduces I/O and costly disk space.

CPUs are becoming faster and their evolution is faster than that of disk bandwidth, which is why Vertica has replaced slow, expensive disk I/O with fast CPU performance for more compact data encoding, which, in turn, speeds up the query processing. Vertica uses compressed data directly, yet requires less CPU resources, thanks to its efficient compressed table processing workflow. Data is decompressed only for the presentation stage.

In physical terms, Vertica automatically stores data in a format called Projections. These are sets of sorted and compressed columns (2) stored in the grid computer network without any sharing of resources or data  (shared-nothing). These projections may contain overlapping column data with different sort orders in order to ensure availability, improve performance and provide high redundancy. This improved performance is made possible because the system accesses projections with the appropriate columns and the most appropriate sort order for the query.

Another contributor to the maximum performance and high availability is the automatic database design feature. Based on SQL schema and query specifications provided by a DBA, Vertica automatically determines which projections should be set up and where they should be stored in order to ensure optimal query performance and high availability of the database.

Vertica satisfies even the highest availability requirements in the world of Big Data and implements modern methods to ensure strong performance. Instead of maintaining a separate, unused backup copy of the database (3) , Vertica leverages the intrinsic redundancy of the projections. Selecting the most appropriate projection to be used for the query does not only serve user requests - but also uses data copies on a separate server or site. The Database Designer creates projections as specified by the number of failures allowed by the DBA (in terms of the number and configuration of the projections)  , which could occur without compromising system availability. The Vertica approach to immediate restore does not come at the expense of database performance and does not require costly management of "logging" or "two-phase commit" transactions.

Often, databases in  data warehouse configuration execute queries by day and large load operations by night. The problem is that ever more data needs to be loaded nightly, while users demand to access real-time data more and more often. Vertica provides a hybrid architecture which allows for concurrent query execution and loading on multiple projections. Each Vertica site consists of storage in a Write Optimized Store  (WOS) for writing inserts, updates and deletions, as well as a Read Optimized Store (ROS) for handling queries. The content in the WOS is constantly moved to the ROS. Effective transaction management allows to avoid conflicts between reads and writes to the database, allowing queries to use the ROS,  WOS or both.

To summarize, Vertica significantly improves the price / performance ratio for databases. Its continuity, availability and ease of management allow customers to provide large-scale data to more users for more uses - at lower cost and with less effort. Vertica is an ideal database for businesses of all types and sizes, for such applications as:
Data Marts, Business Intelligence, Data Warehousing, Click Stream Analysis, Fraud Detection, Compliance Reporting, Call Detail Analytics, Tick-Store Query Apps, Buying Pattern Analysis, Basel II Compliance, Sales Dashboards and more.