Daniel Lemire's blog

, 1 min read

Building the Open Warehouse

Here’s a link to slides from a talk by Roger Magoulas, (O’Reilly Media, Inc.) about building the open warehouse. The talk was presented at O’Reilly Open Source Convention 2004.

Commodity hardware, faster disks, and open source software now make building a data warehouse more of a resource and design issue than a cost issue for many organizations. Now a robust analysis infrastructure can be built on an open source platform with no performance or functional compromises.

This talk will cover a proven analysis architecture, the open source tool options for each architecture component, the basics of dimensional modeling, and a few tricks of the trade.

Why open source? Aside from the cost savings, open source lets you leverage what your staff already knows — tools like Perl, SQL and Apache — rather than having to procure and staff for the proprietary tools that dominate the commercial space.

Data Warehouse Architecture: – Consolidated Data Store (CDS)
– Process to condition, correlate and transform data
– Multi-topic data marts
– dimensional models
– Multi-channel data access

Open Source Components
Database: MySQL
– fast, effective
Data Movement: Perl/DBI/SQL
– flexible data access
Data Access: Perl/Apache/SQL
– template toolkit for ad hoc SQL
– Perl hash for crosstabs/pivot
– Perl for reports

Dimensional Model
– organizes data for queries and navigation from detail to summary
– normalized fact table for quantitative data
– denormalized dimensions with descriptive data
– conforming dimensions available to multiple facts

Performance Considerations
– configuration
– indexing
– SQL-92 joins
– aggregate tables and aggregate navigation

The presentation should provide you with the basic architecture, toolkit, design principles, and strategy for building an effective open source data warehouse.