Data warehouse appliance

                                     

In computing, the term data warehouse appliance (DWA) was coined by Foster Hinshaw for a computer architecture for data warehouses (DW) that is marketed specifically for big data analysis and discovery, is simple to use, and delivers high performance for that workload. A DWA comprises an integrated set of servers, storage, operating systems, and databases.

In marketing, the term evolved to include pre-installed and pre-optimized hardware and software as well as similar software-only systems promoted as easy to install on specific recommended hardware configurations or preconfigured as a complete system. These are marketing uses of the term and do not reflect the technical definition.

A DWA is designed specifically for high-performance big data analytics and is delivered as an easy-to-use packaged system. DW appliances are marketed for data volumes in the terabyte to petabyte range.

                                     

1. Technology

The data warehouse appliance (DWA) has several characteristics that differentiate its architecture from similar machines in a data center, such as an enterprise data warehouse (EDW).

  • A DWA has very tight integration of its internal components, which are optimized for "data-centric" operations in contrast to "compute-centric" operations. The latter tend to emphasize the number of CPUs, cores and network bandwidth.
  • A DWA is trivial to use and install. In contrast to a "pre-configuration" of components, a DWA has very few configuration switches or options. Eliminating such options significantly reduces configuration error, the number one cause of failure in large systems.
  • A DWA is optimized for analytics on big data. In contrast, preceding architectures, including parallel ones, treated the "enterprise data warehouse" as a general-purpose repository for data and supported analytics as an ancillary task.

Most DW appliances use massively parallel processing (MPP) architectures to provide high query performance and platform scalability. MPP architectures consist of independent processors or servers executing in parallel. Most MPP architectures implement a "shared-nothing architecture" in which each server operates self-sufficiently and controls its own memory and disk. DW appliances distribute data onto dedicated disk storage units connected to each server in the appliance. This distribution allows a DW appliance to resolve a relational query by scanning the data on each server in parallel. This divide-and-conquer approach delivers high performance and scales linearly as new servers are added to the architecture.
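The shared-nothing scan described above can be illustrated with a minimal Python sketch. It is not any vendor's implementation: the names Node, Appliance, load and query_sum are hypothetical, the in-memory list stands in for each server's dedicated disk, and hash distribution on a single key is an assumed placement policy. Threads are used only to show the scatter-gather shape; a real appliance runs each scan on a separate server.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Node:
    """One shared-nothing server: it owns its rows and shares nothing with peers."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.rows = []  # stands in for the node's dedicated disk storage

    def scan_sum(self, predicate, group_key, value_key):
        """Scan only the local rows and return a partial SUM ... GROUP BY result."""
        partial = defaultdict(float)
        for row in self.rows:
            if predicate(row):
                partial[row[group_key]] += row[value_key]
        return dict(partial)


class Appliance:
    """Hypothetical coordinator that distributes data and merges partial results."""

    def __init__(self, num_nodes):
        self.nodes = [Node(i) for i in range(num_nodes)]

    def load(self, rows, distribution_key):
        # Hash-distribute each row to exactly one node (assumed placement policy).
        for row in rows:
            h = zlib.crc32(str(row[distribution_key]).encode("utf-8"))
            self.nodes[h % len(self.nodes)].rows.append(row)

    def query_sum(self, predicate, group_key, value_key):
        # Scatter: every node scans its own partition concurrently.
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            partials = list(
                pool.map(
                    lambda node: node.scan_sum(predicate, group_key, value_key),
                    self.nodes,
                )
            )
        # Gather: the coordinator merges the small partial aggregates.
        merged = defaultdict(float)
        for partial in partials:
            for key, value in partial.items():
                merged[key] += value
        return dict(merged)


# Usage: 1000 orders spread over 4 nodes, then total amount per region.
rows = [
    {"order_id": i, "region": "east" if i % 2 else "west", "amount": 10.0}
    for i in range(1000)
]
dwa = Appliance(num_nodes=4)
dwa.load(rows, distribution_key="order_id")
result = dwa.query_sum(lambda r: r["amount"] > 0, "region", "amount")
print(result)
```

In this sketch each node touches only about a quarter of the rows, and only the small per-node aggregates travel to the coordinator, which is the property that lets scan time fall as servers are added.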

                                     

2. History

"Data warehouse appliance" is a term coined by Foster Hinshaw, the founder of Netezza. In creating the first data warehouse appliance, Hinshaw and Netezza built on the foundations laid by Model 204, Teradata, and others to pioneer a new category: a modular, scalable, easy-to-manage and cost-effective database system for addressing consumer analytics efficiently.

MPP database architectures have a long pedigree. Some consider Teradata's initial product, or Britton Lee's, to be the first DW appliance. Teradata acquired Britton Lee (by then renamed ShareBase) in June 1990. Others disagree, considering appliances a "disruptive technology" for Teradata.

Additional vendors, including Tandem Computers and Sequent Computer Systems, also offered MPP architectures in the 1980s. Open source and commodity computing components aided a re-emergence of MPP data warehouse appliances. Advances in technology reduced costs and improved performance in storage devices, multi-core CPUs and networking components. Open-source RDBMS products, such as Ingres and PostgreSQL, reduce software-license costs and allow DW-appliance vendors to focus on optimization rather than on providing basic database functionality. Open-source Linux became a common operating system for DW appliances.

Other DW appliance vendors use specialized hardware and advanced software instead of MPP architectures. Netezza announced a "data appliance" in 2003 that used specialized field-programmable gate array (FPGA) hardware. Kickfire followed in 2008 with what it called a dataflow "SQL chip".

In 2009 more DW appliances emerged. IBM integrated its InfoSphere Warehouse (formerly DB2 Warehouse) with its own servers and storage to create the IBM InfoSphere Balanced Warehouse. Netezza introduced its TwinFin platform based on commodity IBM hardware. Other DW appliance vendors have also partnered with major hardware vendors. DATAllegro, prior to its acquisition by Microsoft, partnered with EMC Corporation and Dell and implemented open-source Ingres on Linux. Greenplum had a partnership with Sun Microsystems and implemented Greenplum Database, which is based on PostgreSQL, on Solaris using the ZFS file system. HP Neoview uses HP NonStop SQL.

The market has also seen the emergence of data-warehouse bundles, in which vendors combine their hardware and database software as a data warehouse platform. The Oracle Optimized Warehouse Initiative combines the Oracle Database with hardware from various computer manufacturers. Oracle's Optimized Warehouses offer pre-validated configurations, and the database software comes pre-installed. In September 2008 Oracle began offering a more classic appliance, the HP Oracle Database Machine, a jointly developed and co-branded platform that Oracle sold and supported and that HP built in configurations specifically for Oracle. In September 2009, Oracle released a second-generation Exadata system, based on hardware from its Sun Microsystems acquisition.