Change Data Capture – Part II

June 17, 2009

How to Process Change Data Capture (CDC) in SQL Server Integration Services (SSIS) 2008

Assume that we have a customer table and we want to store all changes to it in a customer_audit table.  We'll write an SSIS package that queries the CDC data and copies it to the customer_audit table.  We would like the option of running the SSIS package on demand and/or on a schedule (e.g. as a SQL Agent job).  Each time we run the SSIS package we want to pick up whatever changed since the last time we ran it.

Customer Tables

We will use the following customer table:

create table dbo.customer (

customer_id int identity primary key not null

, name nvarchar(50) not null

, sales_rep nvarchar(50) not null

, region nvarchar(50) not null

, credit_limit int not null

)

We will use the following customer_audit table to store changes to the customer table:

create table dbo.customer_audit (

customer_audit_id int identity primary key not null

, customer_id int not null

, name nvarchar(50) not null

, sales_rep nvarchar(50) not null

, region nvarchar(50) not null

, credit_limit int not null

, effective_date datetime not null

, __$start_lsn binary(10) not null

, __$seqval binary(10) not null

, __$operation int not null

, __$update_mask varbinary(128) not null

)

Enabling CDC

CDC must already be enabled on the database and on the customer table; we covered how to do this in Part I.

Logging

In order to allow our SSIS package to pick up just the changes since the last time it was run, we'll populate a log table with the data we need.  We'll use the following log table:

create table dbo.cdc_capture_log (

cdc_capture_log_id int identity not null

, capture_instance nvarchar(50) not null

, start_time datetime not null

, min_lsn binary(10) not null

, max_lsn binary(10) not null

, end_time datetime null

, insert_count int not null default 0

, update_count int not null default 0

, delete_count int not null default 0

, status_code int not null default 0

)

We’ll create two stored procedures to maintain the log:

  • init_cdc_capture_log will create a new row.
  • end_cdc_capture_log will update the row.

The init_cdc_capture_log is called at the beginning of our SSIS package.  It is shown below:

create procedure dbo.init_cdc_capture_log

@capture_instance nvarchar(50)

as

begin

set nocount on;

declare

@begin_lsn binary(10)

, @end_lsn binary(10)

, @prev_max_lsn binary(10)

-- get the max LSN for the capture instance from

-- the last extract

select @prev_max_lsn = max(max_lsn)

from dbo.cdc_capture_log

where capture_instance = @capture_instance

-- if no row found in cdc_capture_log, get the min LSN

-- for the capture instance

if @prev_max_lsn is null

set @begin_lsn = sys.fn_cdc_get_min_lsn(@capture_instance)

else

set @begin_lsn = sys.fn_cdc_increment_lsn(@prev_max_lsn)

-- get the max LSN

set @end_lsn = sys.fn_cdc_get_max_lsn()

insert into dbo.cdc_capture_log

(capture_instance,start_time,min_lsn,max_lsn)

values

(@capture_instance,getdate(),@begin_lsn,@end_lsn)

select cast(scope_identity() as int) cdc_capture_log_id

end

The end_cdc_capture_log stored procedure updates the row created  by the init_cdc_capture_log stored procedure.  It is shown below:

create procedure dbo.end_cdc_capture_log

@cdc_capture_log_id int

, @insert_count int

, @update_count int

, @delete_count int

as

begin

set nocount on;

update dbo.cdc_capture_log set

end_time = getdate()

, insert_count = @insert_count

, update_count = @update_count

, delete_count = @delete_count

, status_code = 1

where cdc_capture_log_id = @cdc_capture_log_id

end

Creating the SSIS Package

We will create an SSIS package that has the following control flow:

clip_image001

The main points about the above control flow are:

  • Init Log is an Execute SQL task; it calls the init_cdc_capture_log stored procedure (described above) and saves the identity value of the created cdc_capture_log row in a package variable.
  • Process Changes is a Data Flow task that retrieves the latest changes from the CDC table and copies them to our audit table.
  • End Log is an Execute SQL task that calls the end_cdc_capture_log stored procedure (described above) to update the cdc_capture_log row.

The Process Changes Data Flow task is implemented as shown below:

clip_image002

The main points about the above data flow are:

  • Extract Customer Changes is an OLE DB Source that executes the stored procedure extract_customer_capture_log to retrieve the customer changes since the last run.
  • Count Inserts Updates and Deletes is a Script Component Transform task that just counts the number of inserts, updates and deletes in the change data.
  • Save Customer Changes to Custom Audit Table is an OLE DB Destination used to insert each change row into the customer_audit table.

The extract_customer_capture_log stored procedure is shown below:

create procedure dbo.extract_customer_capture_log

@cdc_capture_log_id int

as

begin

set nocount on;

declare

@begin_lsn binary(10)

, @end_lsn binary(10)

-- get the LSN range to process

select

@begin_lsn = min_lsn

, @end_lsn = max_lsn

from dbo.cdc_capture_log

where cdc_capture_log_id = @cdc_capture_log_id

-- extract and return the changes

select m.tran_end_time modified_ts, x.*

from cdc.fn_cdc_get_all_changes_customer_all(

@begin_lsn, @end_lsn, 'all'

) x

join cdc.lsn_time_mapping m

on m.start_lsn = x.__$start_lsn ;

end

The main points about the above stored procedure are:

  • The cdc_capture_log_id parameter value is the value returned from the call to the init_cdc_capture_log stored procedure (described above in the Logging section).
  • Retrieve the LSN range from the cdc_capture_log table row.  The LSN range represents all changes that have occurred since the last run of the SSIS package.
  • The cdc.fn_cdc_get_all_changes_customer_all function is generated when you enable CDC.  The function name includes the capture instance.  The function returns the changes that occurred in the LSN range.
  • The cdc.lsn_time_mapping table is populated by CDC with the mapping of transaction times to LSNs.  The join retrieves the transaction time.  This alleviates the need to manually track this in the source table.

Testing the SSIS Package

Before running the SSIS package, we need to execute a script that performs some inserts, updates and deletes.  We’ll use the following script:

use SqlHero

go

insert into dbo.customer

(name,sales_rep,region,credit_limit)

values

(N'BGM Systems', N'Cane', N'South', 2500)

update dbo.customer

set sales_rep = N'Smith'

where [name] = N'BGM Systems'

update dbo.customer

set credit_limit = 3000

where [name] = N'BGM Systems'

delete dbo.customer

where [name] = N'BGM Systems'

After running the above script, execute the SSIS package, then check the customer_audit table to see that there are four rows; one for each change made in the script.  The partial contents of the customer_audit table are shown below:

clip_image003

The main points about the above table are:

  • effective_date is the date and time of the transaction, as retrieved from the cdc.lsn_time_mapping table.
  • Row 1 shows the insert; __$operation=2 for an insert.
  • Row 2 shows the update of the sales_rep; __$operation=4 for the update showing the values after the update.
  • Row 3 shows the update of the credit_limit.
  • Row 4 shows the delete; __$operation=1 for a delete.

The effective_date column provides the ability to query the customer_audit table and see the customer values at any point in time by filtering on the maximum effective_date that is less than or equal to some value.
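As a sketch of such a point-in-time query (the customer_id and @as_of values here are illustrative, not from the example above):

```sql
-- Illustrative point-in-time lookup against the customer_audit table above.
-- @as_of and customer_id = 1 are example values.
declare @as_of datetime = '2009-06-17 12:00:00';

select a.*
from dbo.customer_audit a
where a.customer_id = 1
  and a.effective_date = (
        select max(effective_date)
        from dbo.customer_audit
        where customer_id = a.customer_id
          and effective_date <= @as_of
      )
  and a.__$operation <> 1; -- a delete row (operation 1) means the customer no longer existed
```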


How to Create a Time Dimension Table Using a T-SQL Query Between Two Years

June 16, 2009

How to create a time dimension table using a T-SQL query, from year 2000 to 2029

My Issue

  • I was creating a new OLAP solution using Microsoft SQL Server 2008 and needed a time dimension table modeled on the one in AdventureWorks, but that table only covers the years 2000 to 2007 and we are now in 2009.

Solution

 

  • I wrote a simple query to create and populate a time dimension table between two years.

The Query

 

CREATE TABLE dimDate

(

dateID int IDENTITY(1,1) PRIMARY KEY NOT NULL,

fullDate datetime NOT NULL,

dayNumOfWeek tinyint NOT NULL,

dayNameOfWeek char(9) NOT NULL, -- dayName is a reserved word

dayNumOfMonth tinyint NOT NULL,

dayNumOfYear smallint NOT NULL,

monthNumOfYear tinyint NOT NULL,

monthNameOfYear char(12) NOT NULL,

monthOfQuarter tinyint NOT NULL,

calendarQuarter tinyint NOT NULL,

calendarSemester tinyint NOT NULL,

calendarYear char(4) NOT NULL

)

--*/

DECLARE @maxDateAlreadyLoaded datetime

DECLARE @dayNumOfWeek tinyint

DECLARE @dayNameOfWeek char(9) -- dayName is a reserved word

DECLARE @dayNumOfMonth tinyint

DECLARE @dayNumOfYear smallint

DECLARE @monthNumOfYear tinyint

DECLARE @monthNameOfYear char(12)

DECLARE @monthOfQuarter tinyint

DECLARE @calendarQuarter tinyint

declare @calendarSemester tinyint

DECLARE @calendarYear char(4)

DECLARE @endDate datetime

-- Preload the date by checking the table for the highest date

-- already loaded

IF EXISTS(SELECT fullDate FROM dimDate)

SET @maxDateAlreadyLoaded = (SELECT TOP 1 DATEADD(day,1,fullDate) FROM

dimDate ORDER BY dateID DESC)

ELSE

-- Initialize the start date of the time table here

SET @maxDateAlreadyLoaded = '1/1/2000'

-- Initialize the end date: 30 years beyond the start value, so a

-- start of 1/1/2000 loads dates through 12/31/2029

SET @endDate = DATEADD(year, 30, @maxDateAlreadyLoaded)

-- Loop to load one day per iteration until the end date is reached

WHILE (@maxDateAlreadyLoaded < @endDate)

BEGIN

-- Calculate each of the fields (simplifies the INSERT statement)

SET @dayNumOfWeek = DATEPART(weekDay, @maxDateAlreadyLoaded)

SET @dayNameOfWeek = (SELECT CASE

WHEN DATEPART(weekDay, @maxDateAlreadyLoaded)= 1 THEN 'Sunday'

WHEN DATEPART(weekDay, @maxDateAlreadyLoaded)= 2 THEN 'Monday'

WHEN DATEPART(weekDay, @maxDateAlreadyLoaded)= 3 THEN 'Tuesday'

WHEN DATEPART(weekDay, @maxDateAlreadyLoaded)= 4 THEN 'Wednesday'

WHEN DATEPART(weekDay, @maxDateAlreadyLoaded)= 5 THEN 'Thursday'

WHEN DATEPART(weekDay, @maxDateAlreadyLoaded)= 6 THEN 'Friday'

WHEN DATEPART(weekDay, @maxDateAlreadyLoaded)= 7 THEN 'Saturday'

END)

SET @dayNumOfMonth = DAY(@maxDateAlreadyLoaded)

SET @dayNumOfYear = DATEPART(dayOfYear, @maxDateAlreadyLoaded)

SET @monthNumOfYear = MONTH(@maxDateAlreadyLoaded)

SET @monthNameOfYear = (SELECT CASE

WHEN MONTH(@maxDateAlreadyLoaded) = 1 THEN 'January'

WHEN MONTH(@maxDateAlreadyLoaded) = 2 THEN 'February'

WHEN MONTH(@maxDateAlreadyLoaded) = 3 THEN 'March'

WHEN MONTH(@maxDateAlreadyLoaded) = 4 THEN 'April'

WHEN MONTH(@maxDateAlreadyLoaded) = 5 THEN 'May'

WHEN MONTH(@maxDateAlreadyLoaded) = 6 THEN 'June'

WHEN MONTH(@maxDateAlreadyLoaded) = 7 THEN 'July'

WHEN MONTH(@maxDateAlreadyLoaded) = 8 THEN 'August'

WHEN MONTH(@maxDateAlreadyLoaded) = 9 THEN 'September'

WHEN MONTH(@maxDateAlreadyLoaded) = 10 THEN 'October'

WHEN MONTH(@maxDateAlreadyLoaded) = 11 THEN 'November'

WHEN MONTH(@maxDateAlreadyLoaded) = 12 THEN 'December'

END)

SET @monthOfQuarter = (SELECT CASE

WHEN MONTH(@maxDateAlreadyLoaded) IN(1,4,7,10) THEN 1

WHEN MONTH(@maxDateAlreadyLoaded) IN(2,5,8,11) THEN 2

WHEN MONTH(@maxDateAlreadyLoaded) IN(3,6,9,12) THEN 3

END)

SET @calendarQuarter = DATEPART(quarter, @maxDateAlreadyLoaded)

SET @calendarSemester = (SELECT CASE

WHEN MONTH(@maxDateAlreadyLoaded) IN(1,2,3,4,5,6) THEN 1

WHEN MONTH(@maxDateAlreadyLoaded) IN(7,8,9,10,11,12) THEN 2

end)

SET @calendarYear = YEAR(@maxDateAlreadyLoaded)

--/*

-- Use these PRINTs to test values while developing the script

PRINT '----------- New Row -----------'

PRINT '@maxDateAlreadyLoaded ' + CAST(@maxDateAlreadyLoaded as varchar(20))

PRINT '@dayNumOfWeek ' + CAST(@dayNumOfWeek as varchar(20))

PRINT '@dayNameOfWeek ' + CAST(@dayNameOfWeek as varchar(20))

PRINT '@dayNumOfMonth ' + CAST(@dayNumOfMonth as varchar(20))

PRINT '@dayNumOfYear ' + CAST(@dayNumOfYear as varchar(20))

PRINT '@monthNumOfYear ' + CAST(@monthNumOfYear as varchar(20))

PRINT '@monthNameOfYear ' + CAST(@monthNameOfYear as varchar(20))

PRINT '@monthOfQuarter ' + CAST(@monthOfQuarter as varchar(20))

PRINT '@calendarQuarter ' + CAST(@calendarQuarter as varchar(20))

PRINT '@calendarYear ' + CAST(@calendarYear as varchar(20))

PRINT '@endDate ' + CAST(@endDate as varchar(20))

PRINT ''

--*/

-- Load the row into the database

INSERT INTO dimDate

(fullDate, dayNumOfWeek, dayNameOfWeek, dayNumOfMonth, dayNumOfYear,

monthNumOfYear, monthNameOfYear, monthOfQuarter, calendarQuarter,

calendarSemester,

calendarYear)

VALUES

(@maxDateAlreadyLoaded, @dayNumOfWeek, @dayNameOfWeek, @dayNumOfMonth,

@dayNumOfYear,

@monthNumOfYear, @monthNameOfYear, @monthOfQuarter, @calendarQuarter,

@calendarSemester,

@calendarYear)

-- Advance the date to load by one day

SET @maxDateAlreadyLoaded = DATEADD(day, 1, @maxDateAlreadyLoaded)

END
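Once the loop finishes, a quick sanity check (a sketch against the dimDate table created above) is to count the rows loaded per year; every fully loaded year should show 365 or 366 rows:

```sql
SELECT calendarYear, COUNT(*) AS dayCount
FROM dimDate
GROUP BY calendarYear
ORDER BY calendarYear
```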


Partition Table in SQL Server 2008

June 13, 2009

1. What is a Partition Table?

In a partitioned table, data is partitioned horizontally by range.

Partition Table

2. How to Create Partition Table?

You can create a partition table in two ways:

  1. programmatically (T-SQL)
  2. using the new wizard in SQL Server 2008

But you must understand two things before creating partition tables:

Partition Function: a partition function defines the partition boundaries; each boundary value can be assigned to the LEFT or the RIGHT partition.

clip_image004
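To illustrate the LEFT/RIGHT distinction, here is a small sketch (pf_Right and pf_Left are hypothetical functions, not part of the script later in this article):

```sql
-- RANGE RIGHT: each boundary value belongs to the partition on its right.
CREATE PARTITION FUNCTION pf_Right (int)
AS RANGE RIGHT FOR VALUES (10, 20)
-- Partitions: x < 10 | 10 <= x < 20 | x >= 20

-- RANGE LEFT: each boundary value belongs to the partition on its left.
CREATE PARTITION FUNCTION pf_Left (int)
AS RANGE LEFT FOR VALUES (10, 20)
-- Partitions: x <= 10 | 10 < x <= 20 | x > 20
```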

Partition Scheme: a partition scheme assigns the partitions to filegroups.

clip_image006

3. How to Create Partition Table by wizard in SQL Server 2008?

Right-click the table, then go to Storage and select Create Partition, as in the figure below

1

Select the column you want to partition on, as in the figure below

2

Create the partition function, as in the figure below

3

Create the partition scheme, as in the figure below

4

Determine the ranges, as in the figure below

5

Create the partition by selecting Run immediately, as in the figure below

6    

Congratulations, your table is now partitioned.

4. How to Create a Partition Table Programmatically?

/***************************************************

1)Create database TestDb

****************************************************/

CREATE DATABASE [TestDb] ON PRIMARY

( NAME = N'TestDb', FILENAME = N'C:\db\TestDb.mdf' , SIZE = 3072KB , FILEGROWTH = 1024KB ),

FILEGROUP [fg1]

( NAME = N'f1', FILENAME = N'C:\db\f1.ndf' , SIZE = 3072KB , FILEGROWTH = 1024KB ),

FILEGROUP [fg2]

( NAME = N'f2', FILENAME = N'C:\db\f2.ndf' , SIZE = 3072KB , FILEGROWTH = 1024KB ),

FILEGROUP [fg3]

( NAME = N'f3', FILENAME = N'C:\db\f3.ndf' , SIZE = 3072KB , FILEGROWTH = 1024KB ),

FILEGROUP [fg4]

( NAME = N'f4', FILENAME = N'C:\db\f4.ndf' , SIZE = 3072KB , FILEGROWTH = 1024KB )

LOG ON

( NAME = N'TestDb_log', FILENAME = N'C:\db\TestDb_log.ldf' , SIZE = 1024KB , FILEGROWTH = 10%)

GO

USE [TestDb]

GO

/***************************************************

2) Create Partition Function

****************************************************/

Create Partition Function pf_Date(Date)

as range right

For values('01/01/2006','01/01/2007','01/01/2008')

go

/***************************************************

3) Create Partition Scheme

****************************************************/

Create Partition Scheme ps_Date

AS Partition pf_Date

to(fg1,fg2,fg3,fg4)

go

/***************************************************

4) Create Partitioned Contact Table

****************************************************/

CREATE TABLE [dbo].[Contact](

[ContactId] [int] IDENTITY(1,1) NOT NULL,

[FullName] [nvarchar](50) NULL,

[CreationDate] [date] NULL

)on ps_Date(CreationDate)

GO

/***************************************************

5) Insert rows into the table

****************************************************/

insert into dbo.Contact values('amosa','01/01/2006')

insert into dbo.Contact values('mmosa','01/01/2007')

insert into dbo.Contact values('zemo','01/01/2008')

insert into dbo.Contact values('test1','05/01/2005')

insert into dbo.Contact values('test2','05/01/2006')

insert into dbo.Contact values('test3','05/01/2007')

insert into dbo.Contact values('test4','05/01/2008')

insert into dbo.Contact values('test5','05/01/2009')

go

/***************************************************

6) Query the table to see where each row was saved

****************************************************/

Select *, $partition.pf_Date(CreationDate) AS FileGroupNo From dbo.Contact
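Another way to verify the distribution, sketched here with the catalog views, is to count the rows in each partition and map each partition to its filegroup:

```sql
-- Row counts per partition of dbo.Contact, with the backing filegroup.
SELECT p.partition_number, p.rows, fg.name AS filegroup_name
FROM sys.partitions p
JOIN sys.partition_schemes ps
  ON ps.name = 'ps_Date'
JOIN sys.destination_data_spaces dds
  ON dds.partition_scheme_id = ps.data_space_id
 AND dds.destination_id = p.partition_number
JOIN sys.filegroups fg
  ON fg.data_space_id = dds.data_space_id
WHERE p.object_id = OBJECT_ID('dbo.Contact')
  AND p.index_id IN (0, 1)
ORDER BY p.partition_number
```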


Change Data Capture – Part I

May 31, 2009

With the arrival of SQL Server 2008 we gained access to a new feature called Change Data Capture (CDC). CDC enables you to constantly stream change data to a consuming process, most notably Extract, Transform, and Load (ETL) applications. In the SQL Server environment the ETL application used is of course SQL Server Integration Services (SSIS). 

The first obvious constraint of SQL Server 2008's CDC feature is that it is only available to SSIS applications that source their data from an OLTP system built on SQL Server 2008. In the field of implementing Microsoft BI there are many relational database systems we have to source data from that are not SQL Server based (and if they are SQL Server based, there is a good chance they are not running the 2008 version). All major relational database systems use some form of transaction log to record the history of their actions in case of rollback or hardware failure. The actual contents of each vendor's log presumably differ, but they all share the same basic requirement to temporarily persist database actions.

This article shows how to enable Change Data Capture on a database and on a table, and how to keep track of Data Definition Language (DDL) changes on a table.

Step 1

Let’s create a database named SqlHero as shown below.

USE [master]
GO

/*** Object:  Database [SqlHero]   ***/
IF  EXISTS (SELECT name FROM sys.databases WHERE name = N'SqlHero')
DROP DATABASE [SqlHero]
GO
USE [master]
GO

/*** Object:  Database [SqlHero]   ***/
CREATE DATABASE [SqlHero]
GO

Step 2

Now let’s create a table named MyTable on the SqlHero database, as shown Below.

USE [SqlHero]
GO

/*** Object:  Table [dbo].[MyTable]   ***/
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[MyTable]')
	AND type in (N'U'))
DROP TABLE [dbo].[MyTable]
GO
USE [SqlHero]
GO

/*** Object:  Table [dbo].[MyTable]   ***/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[MyTable](
	[ID] [int] NOT NULL,
	[Name] [varchar](100) NULL,
 CONSTRAINT [MyTable_PK] PRIMARY KEY CLUSTERED
(
	[ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
	IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON,
	ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

Step 3

In order to track changes on the table, we need to enable the Change Data Capture feature on the database. We can enable the Change Data Capture feature using the following Transact SQL command, as shown below.

Until this point, the only schema that exists on the database is dbo. Once we enable Change Data Capture, a new schema with a bunch of objects will be created.

USE [SqlHero]
GO
EXEC sys.sp_cdc_enable_db
GO

The following CDC tables are created under the CDC schema, as shown below

cdc.captured_columns
cdc.change_tables
cdc.ddl_history
cdc.index_columns
cdc.lsn_time_mapping

If you query these tables now, each of them returns zero rows.

select * from cdc.captured_columns
select * from cdc.change_tables
select * from cdc.ddl_history
select * from cdc.index_columns
select * from cdc.lsn_time_mapping

Result

(0 row(s) affected)
 (0 row(s) affected)
(0 row(s) affected)
 (0 row(s) affected)
 (0 row(s) affected)

 

Step 4

In order to track changes on the table, we need to enable the Change Data Capture feature on the table as well. Let’s enable the Change Data Capture feature using the following Transact SQL command, as shown below.

USE [SqlHero]
GO
EXEC sys.sp_cdc_enable_table
@source_schema = 'dbo',
@source_name = 'MyTable',
@role_name = 'cdc_MyTable'
GO

Result

Job 'cdc.SqlHero_capture' started successfully.
Job 'cdc.SqlHero_cleanup' started successfully.

 

From this result we can easily see that SQL Server Agent is required to perform the capture and cleanup. These jobs are created as SQL Server scheduled jobs. [Refer Fig 1.6, 1.7]
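To confirm the jobs, you can query msdb (a sketch; the job names follow the pattern cdc.&lt;database&gt;_capture and cdc.&lt;database&gt;_cleanup):

```sql
SELECT name, enabled
FROM msdb.dbo.sysjobs
WHERE name LIKE 'cdc.%'
```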

Now let’s execute the following Transact SQL to see if any data has been inserted on any of the CDC tables.

select * from cdc.captured_columns
select * from cdc.change_tables
select * from cdc.index_columns

Result

object_id, column_name, column_id, column_type, column_ordinal, is_computed
389576426, ID, 1, int, 1, 0
389576426, Name, 2, varchar, 2, 0

(2 row(s) affected)

object_id, version, source_object_id, capture_instance, start_lsn, end_lsn, supports_net_changes, has_drop_pending, role_name, index_name, filegroup_name, create_date
389576426, 0, 53575229, dbo_MyTable, NULL, NULL, 0, NULL, cdc_MyTable, MyTable_PK, NULL,

(1 row(s) affected)

object_id, column_name, index_ordinal, column_id
389576426, ID, 1, 1

(1 row(s) affected)

You can see that the CDC schema stores metadata about which tables and columns are being tracked by Change Data Capture. It also stores information about the indexes on each tracked table.

Step 5

We can check whether Change Data Capture is enabled on a database by using the following Transact-SQL statement.

SELECT is_cdc_enabled FROM sys.databases WHERE name = 'SqlHero'

Result

is_cdc_enabled
--------------
1

(1 row(s) affected)

Step 6

We can check whether Change Data Capture is enabled on a table by using the following Transact-SQL statement.

SELECT is_tracked_by_cdc FROM sys.tables WHERE name = 'MyTable'

Result

is_tracked_by_cdc
-----------------
1

(1 row(s) affected)

Step 7

Now let's make some changes to the table structure and see whether Change Data Capture captures them. Execute the following query as shown below.

USE [SqlHero]

GO

Alter Table MyTable add Address varchar(500)

GO

Alter Table MyTable add Salary money

GO

Alter Table MyTable add Bonus money

GO

 

Query the CDC table ddl_history as shown below.

select * from cdc.ddl_history

Result

source_object_id, object_id, required_column_update, ddl_command, ddl_lsn, ddl_time
53575229, 389576426, 0, Alter Table MyTable add Address varchar(500), 0x0000001C000001350001
53575229, 389576426, 0, Alter Table MyTable add Salary money, 0x0000001C000001370018
53575229, 389576426, 0, Alter Table MyTable add Bonus money, 0x0000001C0000013D0018

(3 row(s) affected)


Ranking Functions

May 31, 2009

Introduction

Ranking functions return a ranking value for each row in a partition. Depending on the function that is used, some rows might receive the same value as other rows. Ranking functions are nondeterministic.

Transact-SQL provides the following ranking functions:

RANK: The rank of a row is one plus the number of ranks that come before the row in question; tied rows receive the same rank.

NTILE: Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered starting at one; for each row, NTILE returns the number of the group to which the row belongs.

DENSE_RANK: The rank of a row is one plus the number of distinct ranks that come before the row in question.

ROW_NUMBER: Returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.

Example

 

USE AdventureWorks;

GO

SELECT c.FirstName, c.LastName

      --Each row takes a row number based on postal code

    ,ROW_NUMBER() OVER (ORDER BY a.PostalCode) AS 'Row Number'

    --Rows with the same postal code share a rank; the next postal code's rank jumps by the count of tied rows

    ,RANK() OVER (ORDER BY a.PostalCode) AS 'Rank'

    --Rows with the same postal code share a rank; the next postal code takes rank+1

    ,DENSE_RANK() OVER (ORDER BY a.PostalCode) AS 'Dense Rank'

    --Divide the total result set into 4 groups

    ,NTILE(4) OVER (ORDER BY a.PostalCode) AS 'Quartile'

    ,s.SalesYTD, a.PostalCode

FROM Sales.SalesPerson s

    INNER JOIN Person.Contact c

        ON s.SalesPersonID = c.ContactID

    INNER JOIN Person.Address a

        ON a.AddressID = c.ContactID

WHERE TerritoryID IS NOT NULL

    AND SalesYTD <> 0;

The result will be:

FirstName  LastName    Row Number  Rank  Dense Rank  Quartile  SalesYTD  PostalCode
Maciej     Dusza       1           1     1           1         4557045   98027
Shelley    Dyck        2           1     1           1         5200475   98027
Linda      Ecoffey     3           1     1           1         3857164   98027
Carla      Eldridge    4           1     1           1         1764939   98027
Carol      Elliott     5           1     1           2         2811013   98027
Jauna      Elson       6           6     2           2         3018725   98055
Michael    Emanuel     7           6     2           2         3189356   98055
Terry      Eminhizer   8           6     2           3         3587378   98055
Gail       Erickson    9           6     2           3         5015682   98055
Mark       Erickson    10          6     2           3         3827950   98055
Martha     Espinoza    11          6     2           4         1931620   98055
Janeth     Esteves     12          6     2           4         2241204   98055
Twanna     Evans       13          6     2           4         1758386   98055


Recursive using Common Table Expressions (CTE)

May 29, 2009

Introduction

A common table expression (CTE) can be thought of as a temporary result set that is defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. A CTE is similar to a derived table in that it is not stored as an object and lasts only for the duration of the query. Unlike a derived table, a CTE can be self-referencing and can be referenced multiple times in the same query.

A CTE can be used to

  • Create a recursive query (Current Example)
  • Substitute for a view when the general use of a view is not required; that is, you do not have to store the definition in metadata.
  • Enable grouping by a column that is derived from a scalar subselect, or a function that is either not deterministic or has external access.
  • Reference the resulting table multiple times in the same statement.

    Syntax of CTE

    with CTE_Name(Col1,Col2,Col3..)
    as (
    --Anchor query
    Union All
    --Recursive member, defined referencing CTE_Name
    )
    Select * From CTE_Name

    Example

    use AdventureWorks

    go

     

    with CTE_EmployeeLevel(EmployeeId,FullName,ManagerId,ManagerName,EmpLevel)

    as

    (

    select

    EmployeeID,

    (Select FirstName +' '+LastName From Person.Contact where Person.Contact.ContactID= EmployeeID)FullName,

    ManagerID,

    (Select FirstName +' '+LastName From Person.Contact where Person.Contact.ContactID= ManagerID)ManagerName,

    1

    From HumanResources.Employee

    where ManagerId Is Null

     

    union all

     

    Select

    tbEmp.EmployeeID,

    (Select FirstName +' '+LastName From Person.Contact where Person.Contact.ContactID= tbEmp.EmployeeID)FullName,

    tbEmp.ManagerID,

    (Select FirstName +' '+LastName From Person.Contact where Person.Contact.ContactID= tbEmp.ManagerID)ManagerName,

    EmpLevel+1

    From HumanResources.Employee tbEmp inner join CTE_EmployeeLevel CTEEmp

    on tbEmp.ManagerID=CTEEmp.EmployeeId

    )

     

    Select * From CTE_EmployeeLevel

    Example Code Walkthrough

    When the anchor query runs, the result set will be:

    EmployeeId  EmployeeName      ManagerId  ManagerName  EmpLevel
    109         Stephanie Bourne  NULL       NULL         1

    We then take the EmployeeId from the result set above and join it against the ManagerId column of the Employee table.

    The result will be:

    EmployeeId  EmployeeName   ManagerId  ManagerName       EmpLevel
    6           Frances Adams  109        Stephanie Bourne  2

    We repeat the previous step: take the EmployeeIds from the result set above and join them against the ManagerId column of the Employee table.

    The result will be:

    EmployeeId  EmployeeName        ManagerId  ManagerName    EmpLevel
    2           Catherine Abel      6          Frances Adams  3
    46          Stephen Ayers       6          Frances Adams  3
    106         Corinna Bolender    6          Frances Adams  3
    119         Christopher Bright  6          Frances Adams  3

    We repeat the step once more; when the join produces no further rows, the recursion stops.

    The CTE then unions all the intermediate result sets into the final result:

    EmployeeId  EmployeeName        ManagerId  ManagerName       EmpLevel
    109         Stephanie Bourne    NULL       NULL              1
    6           Frances Adams       109        Stephanie Bourne  2
    2           Catherine Abel      6          Frances Adams     3
    46          Stephen Ayers       6          Frances Adams     3
    106         Corinna Bolender    6          Frances Adams     3
    119         Christopher Bright  6          Frances Adams     3
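One practical note on the recursion above: by default SQL Server stops a recursive CTE after 100 levels of recursion. For deeper hierarchies you can raise (or remove) the limit with the MAXRECURSION query hint, for example:

```sql
Select * From CTE_EmployeeLevel
option (maxrecursion 0) -- 0 removes the limit; prefer a concrete cap in production
```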


Microsoft SQL Server 2008 Data Compression

May 28, 2009

    We’re all feeling the financial crunch these days, and an increasing number of companies are trying to figure out how to get the most out of their IT dollars. One of the cost-saving ideas being floated around IT departments is to extend the replacement cycle of hardware. This can be done by extending your hardware manufacture warranty for one or more years. But extending the hardware lifecycle can only do so much. What can you do when you are running low on hardware resources like disk space?

    SQL Server 2008 introduces some new features to help deal with disk capacity issues without making IT go out and buy more disk drives. Some of SQL Server’s new compression options can help you reclaim some of your valuable disk space and postpone the purchase of additional disk capacity for a while, maybe even years. In this article, I will discuss table and backup compression options available in SQL Server 2008 to help you reduce your database disk space usage.

    Data compression was first introduced in SQL Server 2005 Service Pack 2, where Microsoft added a new storage format called vardecimal to store decimal and numeric data types as variable-length columns. In SQL Server 2008 this concept extends even to fixed-length data types such as integer, char and float. The main advantage of data compression is that it reduces storage cost and improves query performance by reducing I/O. Database administrators need to be careful before enabling compression, however, because it is a CPU-intensive activity. If you are not facing a CPU crunch and your workload is I/O bound, data compression is a good candidate and can give you better query performance.

    Compressing Tables and Indexes
    SQL Server 2008 allows you to compress both tables and indexes. Compression happens at the row or page level. No application changes are required to implement table or index compression. Data compression is supported for the following table objects:

    • Tables stored as a heap
    • Tables stored as a clustered index
    • Non-clustered indexes
    • Indexed views
    • Each partition of a partitioned table

    When a clustered index is compressed, its non-clustered indexes are not automatically compressed. You need to set up compression on non-clustered indexes when they are created or altered. System tables cannot be compressed.

    Row and Page Compression
    What is the difference between row and page level compression? Row compression happens on each individual row, whereas page compression considers all rows in a page and compresses the page based on common strings of data throughout the page.

    Row compression takes all fixed-length fields and compresses them by making them variable length. By fixed-length fields I don't mean only character data, but also numeric data. Numeric fields are evaluated and compressed into the smallest possible amount of space. For instance, a BIGINT column normally takes 8 bytes, but if it only contains, say, the value "1" it would be compressed to use only 1 byte instead of 8. In addition to compressing these fixed-length fields, row compression reduces the space required for metadata in each row.

    Page-level compression looks at an entire page when it compresses, finding common strings of data. This is done by using row compression plus prefix and dictionary compression techniques. Prefix compression is the process of removing commonly occurring prefixes from column values. These prefixes are placed in a Compression Information (CI) structure in the header of the page, and the column values are then changed so they reference the CI information in the header. Dictionary compression is done after prefix compression; it is the process of finding repeating occurrences of values anywhere in the page, placing those values in the CI structure, and replacing the values with an appropriate index into the CI structure.

    The diagrams below, taken from SQL Server 2008 Books Online, visually represent how prefix and dictionary compression are done at the page level.

    [Figure: a page before compression, after prefix compression, and after dictionary compression]

    When the first row is added to a page, page compression is not performed; only row compression is done. As subsequent rows are added, page compression is attempted. If the SQL Server engine determines that the space saving from page compression is not significant, page compression is not performed.

    By using Row and Page compression SQL Server can save considerable amount of disk space. Let’s look at examples of how to set up compression.

    Considerations for Compression
    There are a few things you need to know about compression. First, compression is only available in the Enterprise and Developer editions of SQL Server 2008.

    Compressing data allows more rows to fit on a page, so each I/O performed by the database engine reads or writes more rows of data. Keep in mind that even though you are compressing rows, the maximum row size remains the same. SQL Server will not enable compression on a table if the maximum row size plus the compression overhead space exceeds 8,060 bytes. This is because SQL Server needs to ensure that a row can still be stored on a single page even if every column is filled to its maximum capacity.

    If your table has multiple partitions you can enable data compression partition by partition; not all partitions need to be compressed. Indexes are not automatically compressed just because you compressed the table; you need to explicitly compress indexes if you want them compressed. There is one exception to this rule: if you define your table as a compressed heap, then a clustered index you later create on that table inherits the compression setting from the heap. If you change the compression setting of a heap, all non-clustered indexes on that heap need to be rebuilt, so that the row pointers in the non-clustered indexes can point to the new row locations in the heap. Finally, large-object data stored outside the row is not compressed.
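    Because compression settings can end up mixed across tables, indexes, and partitions, it helps to be able to see what is actually compressed. A minimal sketch querying the catalog (sys.partitions exposes data_compression_desc, which reports NONE, ROW, or PAGE):

    ```sql
    -- List the compression setting of every user object, index, and partition
    SELECT o.name AS object_name
         , p.index_id
         , p.partition_number
         , p.data_compression_desc
    FROM sys.partitions AS p
    JOIN sys.objects AS o ON o.object_id = p.object_id
    WHERE o.is_ms_shipped = 0
    ORDER BY o.name, p.index_id, p.partition_number;
    ```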

    Compressing data does have some impact on the internal workings of SQL Server. When a page is written to disk it is compressed and when it is read back it is uncompressed. This means additional CPU overhead is used to do this compression, which might be a concern if your machine does not have extra CPU capacity to burn.
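    One way to judge whether that CPU overhead is buying you anything is to compare how often SQL Server attempts page compression with how often the attempt saves enough space to keep. A sketch using the sys.dm_db_index_operational_stats DMV, whose page_compression_attempt_count and page_compression_success_count columns were added in SQL Server 2008:

    ```sql
    -- Page-compression attempts vs. successes for the current database
    SELECT object_name(s.object_id) AS object_name
         , s.index_id
         , s.partition_number
         , s.page_compression_attempt_count
         , s.page_compression_success_count
    FROM sys.dm_db_index_operational_stats(db_id(), NULL, NULL, NULL) AS s
    WHERE s.page_compression_attempt_count > 0;
    ```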

    How to Estimate Compression Savings
    Microsoft has provided a stored procedure (SP) named “sp_estimate_data_compression_savings.” This SP helps you identify how much disk space will be saved if you implement compression on a table. With this SP you identify the object you want to compress and the type of compression you want to perform. The SP takes your object, creates a sample of it in tempdb, and compresses it there to determine how much space the compressed object will take. The current used space of the object and its expected compressed size are output from this SP. By using this SP you can determine which tables and compression types are best for your environment. Here is an example execution of this SP:

    use AdventureWorks2008;
    go

    EXEC sp_estimate_data_compression_savings
        @schema_name = 'Production'
      , @object_name = 'TransactionHistory'
      , @index_id = null
      , @partition_number = null
      , @data_compression = 'ROW';

    In this example I’m checking how much space I can save by compressing the AdventureWorks2008.Production.TransactionHistory table and all of its indexes using ROW compression. Here is the output I get from this SP:

    [Screenshot: sp_estimate_data_compression_savings output for ROW compression]

    From this output we can see that the AdventureWorks2008.Production.TransactionHistory clustered index currently uses 6,352 KB. The “sp_estimate_data_compression_savings” SP has estimated that with ROW compression the clustered index will shrink to 4,176 KB, a space savings of almost 35 percent. As you can see, the savings from ROW compression on the table’s non-clustered indexes are not as significant.

    I can also check how much disk space would be saved by using PAGE compression on the AdventureWorks2008.Production.TransactionHistory table by running the following command:

    use AdventureWorks2008;
    go

    EXEC sp_estimate_data_compression_savings
        @schema_name = 'Production'
      , @object_name = 'TransactionHistory'
      , @index_id = null
      , @partition_number = null
      , @data_compression = 'PAGE';

    When I run this command I get the following output:

    [Screenshot: sp_estimate_data_compression_savings output for PAGE compression]

    Here you can see that the space savings from PAGE compression are considerably greater than from ROW compression for each index in the AdventureWorks2008.Production.TransactionHistory table. With PAGE-level compression, index_id = 1 (the clustered index) saves around 65 percent of the disk space currently used by that index. The two non-clustered indexes do not show a huge space savings, but some space is still saved by using PAGE compression.

    How to Set Up Table Compression
    There are a number of different ways to compress a table. Let me show you a few of these methods.

     

    SQL Server 2008 includes a built-in Data Compression Wizard, which lets you estimate the savings in storage space when compression is enabled for a particular table or index and then apply the compression.
    To launch it, connect to the SQL Server 2008 instance, expand the Databases node | AdventureWorks2008 database | Tables, right-click the Production.TransactionHistory table, select Storage, and then select Manage Compression, as shown in the snippet below.
    [Screenshot: launching Manage Compression from the table’s Storage menu]
    This will launch the Data Compression Wizard welcome screen. Click Next to continue.

    [Screenshot: Data Compression Wizard welcome screen]

    Click the “Next” button, which will bring up this screen:

    [Screenshot: compression type selection screen]

    Here you can check the “Use same compression type for all partitions” option, which makes the grayed-out drop-down box next to it available so you can choose the “None,” “Row,” or “Page” compression option. This is useful when your table has multiple partitions. In my case the AdventureWorks2008.Production.TransactionHistory table resides on a single partition, so I can use the drop-down under the “Compression type” column to select my compression type. I’ve decided to use “Page” compression. Once you have selected the compression type you can click the “Calculate” button to determine the amount of space that will be used once the table is compressed. The screenshot below shows what the wizard has determined will be my space savings in the “Requested compressed space” column:

    [Screenshot: requested compressed space after selecting Page compression]

    Different Compression Type Options
    If you select the Row compression type and click the Calculate button, the wizard calculates and shows the amount of storage space required when row compression is used.

    [Screenshot: storage space required when Row compression is used]

    If you select the Page compression type and click the Calculate button, the wizard calculates and shows the amount of storage space required when page compression is used.

    [Screenshot: storage space required when Page compression is used]

    After reviewing the compressed space needed I can click on the “Next” button, which brings up this window:

    [Screenshot: wizard output option selection window]

    Here I have multiple options for saving the output of the wizard: I can create a script that can be used to compress my table, run the compression immediately, or schedule it to run later. Clicking “Next” takes you to the Data Compression Wizard summary screen, where clicking “Finish” completes whichever output option you selected. In my case I chose the “Run immediately” option.

    You can also use T-SQL to compress an existing table. To do this you use the “ALTER TABLE” statement. Here is the script I would use to compress my AdventureWorks2008.Production.TransactionHistory table with the same compression options I selected using the wizard:

    USE [AdventureWorks2008];
    GO

    ALTER TABLE [Production].[TransactionHistory]
    REBUILD PARTITION = ALL
    WITH (DATA_COMPRESSION = PAGE);
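    Remember that this ALTER TABLE only rebuilds the heap or clustered index. To bring the non-clustered indexes to the same compression level, they can all be rebuilt in one statement; a minimal sketch:

    ```sql
    -- Rebuild every index on the table with PAGE compression;
    -- compressing the table alone does not compress its indexes
    ALTER INDEX ALL ON Production.TransactionHistory
    REBUILD WITH (DATA_COMPRESSION = PAGE);
    ```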

    You don’t have to first create a table and then use the “ALTER TABLE” statement to compress it. You can also define that you want a table compressed as part of the CREATE TABLE statement. Here I have created an AuditLog table and requested that it use PAGE compression:

    CREATE TABLE AuditLog (
        ID int identity(1,1),
        LoginName varchar(128),
        TransactionDescription varchar(100),
        ApplicationName varchar(100),
        TransactionDate date,
        TransactionTime time
    )
    WITH (DATA_COMPRESSION = PAGE);

    If your table has multiple partitions, each partition can have a different compression type. Here is an example of how to create a table where each partition has a different compression type:

    CREATE TABLE OrderTracking (
        ID int identity,
        OrderID int,
        OrderDate date,
        OrderTime time,
        DateSent date,
        TimeSent time
    )
    ON OrderDateRange (OrderDate)
    WITH (
        DATA_COMPRESSION = NONE ON PARTITIONS (1),
        DATA_COMPRESSION = ROW ON PARTITIONS (2),
        DATA_COMPRESSION = PAGE ON PARTITIONS (3)
    );

    In this code, partition 1 is not compressed, partition 2 is using ROW compression, and partition 3 uses PAGE compression.
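    Partition-level settings can also be changed after the fact without touching the other partitions. A sketch, continuing the OrderTracking example above:

    ```sql
    -- Switch only partition 2 to PAGE compression;
    -- partitions 1 and 3 keep their existing settings
    ALTER TABLE OrderTracking
    REBUILD PARTITION = 2
    WITH (DATA_COMPRESSION = PAGE);
    ```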

    Cost Savings
    The cost and time savings from SQL Server 2008’s compression features should allow companies to obtain some immediate return on their investment in upgrading to SQL Server 2008. If you are running up against space constraints on your server, then compressing tables and/or backups might allow you to postpone or eliminate an expensive disk acquisition. If you are already using a third-party tool to compress your backups, you might also be able to eliminate that tool. If your nightly backup window is not long enough to accomplish all your backups and you have CPU to spare, then compressing your database backups should reduce the time it takes to back up your SQL Server 2008 databases. These disk cost savings and shorter backup times need to be factored in when you consider acquiring SQL Server 2008.
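    Backup compression mentioned above is a separate feature from row/page data compression, and is enabled per backup with the COMPRESSION option (or server-wide via the “backup compression default” sp_configure setting). A minimal sketch, with an illustrative backup path:

    ```sql
    -- Compress the backup itself, independently of any table/index compression;
    -- the disk path is an assumption for illustration
    BACKUP DATABASE AdventureWorks2008
    TO DISK = N'C:\Backups\AdventureWorks2008.bak'
    WITH COMPRESSION, INIT;
    ```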