Entries in Data Warehousing (6)

Tuesday, Jan 31, 2012

New Data Mart: Create a New SSAS Cube or a Perspective in an Existing Cube?

Overview: Some things to consider when deciding if you want to integrate new data within an existing cube (and, optionally, use a perspective) or create a new separate cube.

New Data Mart

Recently we created a new data mart (subject area) within an existing data warehouse.  The new data mart had several new facts and several new dimensions, as well as relationships to existing (conformed) dimensions such as Date & Customer.  The new dimensions outnumbered the conformed dimensions we reused.

Using the existing SSAS database project wasn’t up for discussion – we needed to reuse the conformed dimensions, after all.  I also used the same Data Source View (DSV).  What wasn’t immediately clear was whether to create a new cube for the subject area, or use the existing cube.  A perspective in the existing cube could certainly help simplify things, but we actually chose to create a new cube.  Here’s why…

Reasons We Chose to Create a New SSAS Cube

This isn’t an exhaustive list, but it’s the things that were meaningful for our project.  Be sure to check out Chris Webb’s blog entry for more things to consider (like security).

  • We were dealing with a separate data mart – the new subject area had no overlap between measure groups & no expected need to do cross-analysis of existing measure groups with the new measure groups.  Put another way, we expect the reporting & analysis to be independent.  (Although linked measure groups are not ideal, that is in our back pocket if the need should come up in the future.)
  • Shielding the existing cube from any data quality issues / cube processing errors we might experience during the early going (in our situation, the data mart was based upon a brand new Oracle module which was heavily customized & not yet in Production itself).
  • Less regression testing upon deployment (we were on a very short timeline).
  • The existing cube was already a bit large.
  • Potential for improved query performance (we expected a high volume of data).
  • Facilitates multiple developers in the SSAS BIDS environment (this is a big deal in a team environment).
  • Allows for a separate Agent job to process the dimensions and the measure groups (although we do still have a dependency on the conformed dimensions).  This flexibility is a good thing in our situation, since it’s a different subject area which can be run faster on its own, and on an independent schedule.
  • Simplification for end users, without the need for a perspective.

Any additional thoughts or differences of opinion?  Leave me a comment!

Finding More Information

Chris Webb’s blog – One Cube vs Multiple Cubes

 

Sunday, Jan 22, 2012

Data Modeling Tip when Using Many-To-Many Bridge Tables in SSAS

Overview:  A quick tip re: getting the underlying data model correct when a bridge table is involved, to avoid errors in SQL Server Analysis Services.

Level:  Assumes familiarity with bridge tables, as well as SSAS Dimension Usage relationships.

SSAS Error Received

When you have a complex set of data being modeled for addition into the data warehouse, sometimes the star schema joins are not as simple as we’d like them to be.  My first choice is always to join facts to dimensions at the surrogate key level (i.e., the lowest level of granularity).  However, recently we had a challenging set of data where we needed to model the data a little differently.  What we had done in the relational tables (i.e., the underlying data warehouse) was rejected by SSAS.

[Screenshot: the SSAS error dialog]

The SSAS deployment error:

Errors in the metadata manager.  The ‘Dimension 2’ many-to-many dimension in the ‘Fact 1’ measure group requires that the granularity of the ‘Dimension 1’ dimension is lower than that of the ‘Many-To-Many Bridge’ measure group.

The Underlying Data Model

To avoid the error shown above, Join A needs to be at a lower level of granularity than Join B:

[Diagram: star schema showing Join A and Join B between the fact, bridge, and dimension tables]

If Join A is at the surrogate key level (i.e., the level of granularity the Dimension 1 records are stored at), you shouldn’t run into this issue. 

Why wouldn’t Join A be at the surrogate key level, you ask?  SSAS does let you create a relationship using a non-key granularity attribute (i.e., using a join that’s higher than the lowest level of detail).  A common example of this is joining on the non-key attribute of Quarter when your Date dimension is stored at the day level, because the facts aren’t available at the day level. 
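Since the diagram isn’t reproduced here, below is a minimal T-SQL sketch of the underlying tables (all table and column names are illustrative, not the actual model).  The safe pattern is the one described above: join at the dimension’s surrogate key wherever possible, and reserve non-key granularity attributes (like QuarterCode here) for the cases that truly need them.

-- Illustrative only: Dim1 stored at its lowest grain (surrogate key),
-- with a higher-level attribute available for non-key joins.
CREATE TABLE dbo.Dim1 (
    Dim1Key     int         NOT NULL PRIMARY KEY,  -- surrogate key (lowest grain)
    QuarterCode varchar(10) NOT NULL               -- higher-level, non-key attribute
);

CREATE TABLE dbo.Dim2 (
    Dim2Key int NOT NULL PRIMARY KEY
);

-- The bridge resolves the many-to-many between Dim1 and Dim2.  Joining it
-- to Dim1 on the surrogate key keeps it at the lowest possible granularity.
CREATE TABLE dbo.BridgeDim1Dim2 (
    Dim1Key int NOT NULL REFERENCES dbo.Dim1 (Dim1Key),
    Dim2Key int NOT NULL REFERENCES dbo.Dim2 (Dim2Key),
    PRIMARY KEY (Dim1Key, Dim2Key)
);

-- The fact also relates to Dim1.  If its join to Dim1 used a non-key
-- granularity attribute (e.g., QuarterCode) in a way that violates the
-- rule above, SSAS raises the granularity error quoted earlier.
CREATE TABLE dbo.Fact1 (
    Dim1Key int   NOT NULL REFERENCES dbo.Dim1 (Dim1Key),
    Amount  money NOT NULL
);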

SSAS Dimension Usage

The star schema shown above translates to the following relationships in the SSAS Dimension Usage:

[Screenshot: SSAS Dimension Usage relationships for the star schema above]

Friday, Sep 30, 2011

Sorting Descending in Analysis Services for a Date Dimension: Don't Do It!

Why To Not Sort Date Dimension Descending

Now that I am wiser, my recommendation is to leave your date dimension in ascending order.  Within a day of implementing this technique to sort descending, we found a couple of time-oriented issues in the development environment where it had been deployed.  Here are some reasons not to sort your date dimension descending:

  • Unintended consequences on time calculations. 
  • Effect on named sets (although the order could certainly be modified).
  • Appearance on charts & reports.  Although users would like to see drop-down lists in descending order, they probably don't always want to see their data displayed that way.

Live and learn.  Thanks for reading.

- Melissa  10/3/2011

-----------------------------------

The Original Blog Entry:

Overview:  This entry describes how I fulfilled a customer requirement to present the Date attributes & hierarchies in descending order, rather than ascending.  This technique utilizes individual sort fields for each attribute.  Each sort field contains the same content as the key, but is multiplied by -1 to reverse the order.

Level:  201 (this assumes some SSAS knowledge, as all SSAS steps are not discussed in detail).

Analysis Services Order By Options

Following is a screen shot of how my customer wanted to see their Calendar Date Hierarchy displayed:

[Screenshot: Calendar Date Hierarchy displayed in descending order]

The good news?  Analysis Services attributes have an OrderBy property, which can sort by the Key or Name field of the attribute, or, alternatively, the Key or Name field of a related attribute.

The bad news?  This OrderBy property works in ascending order only.  There is no descending option, so we need to provide SSAS with an explicit sort field.

Steps to Deliver Descending Sort Functionality

The steps I used include the following:

  1. Relational DB:  Alter the Date dimension in the data warehouse structure to add the integer sort fields.
  2. Relational DB:  Alter the stored procedure which builds the Date dimension so that it populates the new sort fields.
  3. SSAS:  Refresh the DSV to detect the new relational fields.
  4. SSAS:  Add the new sort fields to the Date dimension (hidden to end users).
  5. SSAS:  Update the attribute relationships in the Date dimension.
  6. SSAS:  Update the OrderBy property for each attribute.
  7. SSAS:  Deploy and test.

The rest of this entry elaborates on some of these steps.

Sort Fields in the Data Warehouse Date Dimension

In order to be able to sort every attribute (and every hierarchy) descending, we need a sort field that goes with every attribute.

Using the screen shot below, let’s focus on the “Quarter of Year” attribute.  It contains:

  • Calendar_quarter_year_nr:  Key property in SSAS
  • Calendar_quarter_year_cd:  Name property in SSAS
  • Calendar_quarter_year_sort:  OrderBy property in SSAS

Note the sort field is nothing more than a reversal of the number (_nr) field.   Sample from Dim_Date:

[Screenshot: sample rows from Dim_Date]

Stored Procedure which Builds the Date Dimension

In order to populate Dim_Date, we use a stored procedure that uses a variety of date functions & variables.  The stored proc is where the * -1 logic is used to populate each sort field.  We end up with the same contents as each Number (_nr) field, but the sort field is negative (reversed).

Sample from the Build_Dim_Date stored proc:

[Screenshot: sample from the Build_Dim_Date stored procedure]
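The screenshot aside, the population logic amounts to something like the following minimal sketch (the month- and year-level column names are hypothetical, following the naming pattern above):

-- Inside the Build_Dim_Date proc: each sort column is simply the
-- corresponding number (_nr) column multiplied by -1.
UPDATE dbo.Dim_Date
SET  Calendar_quarter_year_sort = Calendar_quarter_year_nr * -1
    ,Calendar_month_year_sort   = Calendar_month_year_nr * -1   -- hypothetical column
    ,Calendar_year_sort         = Calendar_year_nr * -1;        -- hypothetical column
-- e.g., if Calendar_quarter_year_nr were 20121, its sort value becomes
-- -20121, so SSAS's ascending-only OrderBy yields a descending display.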

Attribute Relationships in SSAS

After you have refreshed the SSAS DSV for the new Dim_Date sort fields, and after you have added each sort field (hidden) to the SSAS Date dimension, you need to update the attribute relationships.  The attribute relationships must be in place before the OrderBy property can be updated.

Sample attribute relationships:

[Screenshot: attribute relationships in the SSAS Date dimension]

As you can see in the above screen shot, having a sort field for each individual attribute expands the attribute relationships pane considerably.  However, the relationships are necessary to associate each individual sort field to the field that’s being displayed.  Only related fields (or the field itself) can be used in the OrderBy property.

OrderBy Property in SSAS

The last thing to do is update the OrderBy property for each attribute.  The following screen shot shows our “Quarter of Year” attribute.

[Screenshot: OrderBy property for the Quarter of Year attribute]

Note that the OrderBy is done on the AttributeKey – this means “don’t use the key of my Calendar_quarter_year_nr field; instead, use the key of this related field.”  In my case, the way the stored proc is written, the integer number fields are all set up to sort nicely, so we can use Key – that’s not always true.  In many other dimensions, sorting by the key won’t work at all & you need to use Name instead.

A Few Final Words

I typically try to implement as many fields in the underlying data warehouse structure as possible; however, if you only have one or two fields to be sorted descending, another viable alternative is to create a named calculation in the SSAS DSV.
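For example, a named calculation is just a SQL expression the DSV evaluates against the source table, so a single descending sort field could be added along these lines (illustrative), without touching the relational schema:

-- Named calculation "Calendar_quarter_year_sort" on the Dim_Date DSV table:
Calendar_quarter_year_nr * -1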

The beauty of this technique is that it’s just so darn simple and it’s quick to implement.  The * -1 technique worked for me with the Date dimension because I had integer fields for each of my Number (_nr) fields. 

Sunday, Jul 31, 2011

Delivering Alternate Hierarchies for Reporting

Overview:  This post discusses the approach we used in delivering alternate hierarchies for financial reporting within a SQL Server 2008 R2 environment.  Part 1:  the solution, which is a valid alternative if you don’t have unary operators.  Part 2:  workaround we implemented to deal with a unary operator issue.

To Give Credit Where Credit is Due:  Leo Furlong of Intellinet was the architect of this design.  I helped implement it, but he was the brains.

Part 1:  DW and SSAS Solution for Alternate Hierarchies

Business Requirements

The requirements presented to us included: 

1.  For financial reporting, each account may reside in one or more hierarchies for reporting.  Example:  the Sales Returns & Allowances account participates in the GAAP Reporting, Tax Reporting, and Internal Reporting Hierarchies; it does not participate in the Segment Reporting Hierarchy.

2.  For the same account, a unary operator (+/-) may differ between hierarchies.  Example:  in the Internal Reporting Hierarchy, the Discounts account is subtracted from its parent, whereas in the Segment Reporting hierarchy it is added to its parent.

3.  The rollup levels are ragged (unbalanced).  Example:  The GAAP Reporting hierarchy has 6 levels, whereas the Internal Reporting Hierarchy has 11 levels.

4.  Sorting of the members within each level of the hierarchies needs to be customized (i.e., alphabetical is not sufficient).

A sample of the results one hierarchy is intended to produce:

[Screenshot: sample results for one reporting hierarchy]

The Solution

Relational DW

From a relational data warehouse perspective, we implemented a many-to-many bridge table to handle the relationship between accounts and reporting hierarchies. 

Dim Account:  Grain is one row per distinct account.  This table has the direct relationship to the fact tables.

Dim Account Reporting Hierarchies:  Grain is one row per account + each hierarchy the account belongs to. 

  • Each row has a recursive relationship to its parent row.
  • Contains all account attribute fields, including the unary operator (+/-) field. 
  • Each field has a sort value (retrieved from MDS – discussed next).

Following is a simplified version of the data model:

[Diagram: simplified data model]
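Since the diagram isn’t reproduced here, the following is a rough T-SQL sketch of the model (table and column names are illustrative, not the actual DDL):

-- Grain: one row per distinct account; joins directly to the fact tables.
CREATE TABLE dbo.DimAccount (
    AccountKey    int          NOT NULL PRIMARY KEY,  -- surrogate key
    AccountNumber varchar(20)  NOT NULL,
    AccountName   varchar(100) NOT NULL
);

-- Grain: one row per account + each hierarchy the account belongs to.
CREATE TABLE dbo.DimAccountReportingHierarchies (
    AccountHierarchyKey       int         NOT NULL PRIMARY KEY,
    AccountKey                int         NOT NULL REFERENCES dbo.DimAccount (AccountKey),
    HierarchyName             varchar(50) NOT NULL,  -- e.g., GAAP, Tax, Internal, Segment
    ParentAccountHierarchyKey int         NULL       -- recursive parent relationship
        REFERENCES dbo.DimAccountReportingHierarchies (AccountHierarchyKey),
    UnaryOperator             char(1)     NOT NULL,  -- '+' or '-'
    SortOrder                 int         NOT NULL   -- sort value sourced from MDS
);

-- Bridge resolving the many-to-many between accounts and hierarchies.
CREATE TABLE dbo.BridgeAccountHierarchy (
    AccountKey          int NOT NULL REFERENCES dbo.DimAccount (AccountKey),
    AccountHierarchyKey int NOT NULL REFERENCES dbo.DimAccountReportingHierarchies (AccountHierarchyKey),
    PRIMARY KEY (AccountKey, AccountHierarchyKey)
);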

Master Data Services

Both Account dimensions are populated (via SSIS ETL) from a Master Data Services model. 

Each reporting hierarchy was set up in MDS as an Explicit Hierarchy, which allowed for the hierarchies to be ragged.  An Explicit Hierarchy also permits drag & drop ordering of the members within each level (whereas a Derived Hierarchy does not).

Analysis Services

Dim Account: Displays just a few attributes (such as Account Type, Account Name & Number, etc).  Because it’s directly related to the fact table, it has a “regular” relationship to the measure group.

Dim Account Reporting Hierarchies:  The only visible attribute is the parent-child hierarchy which serves up the reporting levels.  The unary operator was delivered using the functionality built into a parent/child dimension in SSAS.  There are a few other attributes, set to be hidden, which are exposed through member properties.  This dimension has a “many to many” relationship to the measure group, through the bridge measure group, as shown here:

[Screenshot: SSAS Dimension Usage showing the many-to-many relationship through the bridge measure group]

Initially it seemed like this solution would work really well. It would have worked beautifully if we had no unary operator issues to contend with.

The remainder of this discussion deals with a workaround we had to implement due to a unary operator issue.  If you don’t have unary operator data, the above solution is a valid option to consider.

 

Part 2: Dealing with Unary Operator Issues

The Unary Operator Problem We Ran Into

The unary operator did not function properly when “subtraction” accounts were involved.  What we found consistently was that the immediate parent of any subtraction accounts aggregated properly (ex: Level 2-C below); however, the parent of that parent (ex: Level 1 below) did not.  The difference was always 2x the total of the subtraction account children.  Aggregations where the children were all addition worked just fine.

Example:

[Screenshot: example aggregation showing the incorrect Level 1 total]

After doing some research, we narrowed the problem down to the many-to-many relationship.  It’s currently documented in this Connect issue.  Since only 10 people have logged that they can reproduce the bug as of mid-2011, I’m not betting Microsoft finds it a huge priority to get it fixed.  Hence, a workaround was born…

Workaround to Handle Unary Operators

The immediate thing we wanted to do was eliminate the many-to-many relationship and convert it to a regular relationship.  The thinking was that if we could eliminate the many-to-many relationship, all other aspects of our solution would likely work as designed.  Therefore, we did the following:

  1. Left the existing fact table as-is (i.e., it is still the “primary” fact table in the DW), and left the existing dimensions as-is.
  2. Implemented a secondary fact table (ex: Fact Finance Hierarchies in the screen shot below) which flattened out each fact record with the hierarchy it belongs to.  Caution:  this does create duplicate fact records.  This secondary fact table is populated at the end of the ETL process.  These records are created physically in the ETL process at the client’s request; a very valid alternative would be to create this flattened table using a view.
  3. In the SSAS Data Source View, removed the existing fact table and replaced it with the secondary fact table, which has one record for every hierarchy record.
  4. Changed the relationship to Regular and removed the bridge measure group.

Following is a simplified version of the data model after the workaround, from the perspective of the SSAS data source view:

[Diagram: simplified data model after the workaround, from the perspective of the SSAS data source view]
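Reusing the illustrative names from the earlier sketch, the flattened fact table could be populated at the end of the ETL along these lines (the actual implementation differs):

-- One output row per original fact row per hierarchy it participates in.
-- Caution (as noted above): this deliberately duplicates fact rows.
INSERT INTO dbo.FactFinanceHierarchies (AccountHierarchyKey, DateKey, Amount)
SELECT
     b.AccountHierarchyKey  -- hierarchy-level key replaces the plain account key
    ,f.DateKey
    ,f.Amount
FROM dbo.FactFinance f
    INNER JOIN dbo.BridgeAccountHierarchy b
        ON f.AccountKey = b.AccountKey;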

 

After this implementation, the unary operator aggregates worked properly when we browsed the cube.  Although there are a number of ways to handle it, we found this to be the most straightforward & easy to maintain for our client.

Comments & alternative ideas are welcomed!

 

Monday, May 16, 2011

Usage of “Unknown Member” Rows in a Data Warehouse

Overview:  Ideas for usage of Unknown Member rows (also referred to as “predefined members” by some folks) beyond just the default row.

Purpose for the “Unknown Member” Row

A bit of background first:  When we load facts to a Kimball-style data warehouse, each fact row is associated with the appropriate dimension row.  For instance, a FactStudentAttendance fact table may relate to DimStudent, DimClass, and DimDate – these relationships make the magic happen for slicing & dicing capabilities.

However … what happens when we have a fact that cannot be associated with a valid dimension row?  It usually is mapped to an “unknown member” row in the dimension.  A surrogate key of -1 is often used for this purpose, but it certainly could be any number.  (Although, for unknown members, I do like to specify the values rather than let the database generate them.  I always use negative numbers for this purpose, for clarity.)

Why bother with the unknown member row?  Because it allows the measurement data to still be populated in the fact table – meaning your totals & counts will still be accurate, even if the details cannot all be fully described by the dimensional data.  It also reduces the risk of dropping data when inner joins are utilized in queries.  Basically, this process gives orphan fact rows a temporary home (hint hint: temporary, meaning that a process should exist to clean up the data quality issues).
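As a concrete sketch of that mapping step (table and column names are illustrative), the fact load can left-join to the dimension and substitute -1 when no match is found:

-- Orphan source rows get the -1 unknown member instead of being
-- dropped or carrying a NULL foreign key.
INSERT INTO DW.FactStudentAttendance (StudentID, DateID, DaysAttended)
SELECT
     ISNULL(ds.StudentID, -1) AS StudentID
    ,src.DateID
    ,src.DaysAttended
FROM Staging.StudentAttendance src
    LEFT JOIN DW.DimStudent ds
        ON src.StudentNumber = ds.NaturalKey;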

The Default “Unknown Member” Row

I typically use –1 as my default “unknown member” row in all dimensions.  I know that any facts associated to –1 represent an issue that needs to get resolved.  However, you’re not restricted to just one unknown row in a dimension.  Sometimes it offers value to have several “flavors” of unknown member rows.  

Using an Additional “Unknown Member” Row for a N/A Situation

Besides the default row, you could have a row to map to for data which simply doesn’t exist for that fact – a situation that isn’t a data quality problem.

Let’s say we have a FactClassAttendance table which has two relationships back to DimTeacher:  one for the primary instructor, and another for the secondary instructor.  In situations where we don’t have a secondary instructor, I wouldn’t set that fact record to –1; rather, I’d set it to a –2 row which is set to mean “No Instructor.”  Personally, I prefer to have a fixed dimensional row to explicitly state this circumstance, versus allowing the foreign key in the fact to be null.

This also works well for a parent/child relationship when some rows have “No Parent.”

In this situation, just make sure you’re not messing with the granularity of your fact table.

Using Additional “Unknown Member” Rows When We Know Something about the Data

Another situation which warrants more than just the default row is when we have some data elements, but not everything we need to identify the dimension row.  Let’s say we have a FactSurvey table.  Maybe the student taking the survey was agreeable to providing their identity, but maybe not.  Perhaps they wanted to be anonymous, but would fill in their gender.  In this case, we could utilize DimStudent rows such as:

      -1:  Unknown Student  (i.e., this is a data quality issue to be fixed)

      -2:  No Student (i.e., the survey was taken by someone other than a student, so an N/A status is okay)

      -3:  Anonymous Student  (i.e., differentiated from the –1 so we know it’s not a problem)

      -4:  Male Student

      -5:  Female Student
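A sketch of how these predefined rows might be seeded (columns are illustrative; note the keys are specified explicitly rather than generated, per the earlier comment):

SET IDENTITY_INSERT DW.DimStudent ON;  -- only needed if StudentID is an identity column
INSERT INTO DW.DimStudent (StudentID, StudentName, Gender)
VALUES  (-1, 'Unknown Student',   'Unknown')
       ,(-2, 'No Student',        'N/A')
       ,(-3, 'Anonymous Student', 'Unknown')
       ,(-4, 'Male Student',      'Male')
       ,(-5, 'Female Student',    'Female');
SET IDENTITY_INSERT DW.DimStudent OFF;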

For me, the main benefits to having something other than the default (-1) row are that: 

  • We can still do some amount of analysis on the data (such as slicing the responses from male versus female students) if there’s requirements to do so, and
  • We can differentiate between what is a data quality issue versus a “no data” issue which doesn’t need further follow up.

 

Wednesday, Sep 22, 2010

Using Dual Join to Dimension to Retrieve All Historical Fact Table Rows

I am a big proponent of the Kimball method of data warehousing.  The denormalized table structure, Type 1/2/3 dimensions, and surrogate keys provide a lot of benefits for querying and reporting.  However, there is one situation that I seem to run into every once in a while, at which time I have that “Oh yeah” moment.

The situation:  Records for a student have changed over time within a Type 2 dimension table.  We have two goals: 

Goal (a):  Retrieve all historical fact records (regardless of how the surrogate key value may or may not have changed at any point in time).

Goal (b):  Retrieve the current dimension data in the report.

First, let’s review the dimensional data for Danny Zuko.  (And before we go any farther, let’s just agree that Grease is one of the finest movies ever made, shall we?)

Sample Dimension Records:

[Table: sample DimStudent rows for Danny Zuko, showing three surrogate keys (Student IDs 1, 2, and 3) as his address and major changed over time]

In the above table, the Student ID is our surrogate key in the data warehouse.

Note that Student ID #3 is the current record, showing Danny’s current address and current major, both of which have changed over time.

Next let’s look at some fact records. 

Sample Fact Records:

[Table: sample FactStudentAttendance rows referencing Student IDs 1, 2, and 3]

Let’s say we have a requirement to produce a report of attendance history by individual student.  If we based the query on the current StudentID (3), we’d inadvertently drop fact table records associated with Student IDs 1 and 2.  Following is a solution we routinely put in place in our reporting queries to ensure all history is captured.

Sample query:

SELECT
    DSt.StudentName
   ,DSt.Major
   ,DSt.AddressFull
   ,FSA.FromDateID
   ,FSA.ToDateID
   ,FSA.DaysAttended
   ,FSA.DaysAbsent
FROM DW.DimStudent DSt
   INNER JOIN DW.DimStudent DStAll
        ON DSt.NaturalKey = DStAll.NaturalKey
   INNER JOIN DW.FactStudentAttendance FSA
        ON DStAll.StudentID = FSA.StudentID
WHERE DSt.CurrentRecord = 'Yes'

How our two goals are satisfied with this query:

Goal (a):  Retrieve all historical fact records (regardless of how the surrogate key value may or may not have changed at any point in time).  The above query joins DimStudent back to itself a second time, using the natural key.  The join to the fact table is based on the 2nd DimStudent table (i.e., the DStAll alias), so we are assured of retrieving all fact records.

Goal (b):  Retrieve the current dimension data in the report.  The WHERE statement restrictions are based on the first DimStudent table (i.e., the DSt alias).

This approach assumes that the natural key is retained within the dimension, but not within the fact.  It also assumes that the natural key is reliable data. 

Any comments, thoughts or alternative suggestions would be welcomed.