Tuesday 30 April 2013

SCD Type 6: Add Type 1 Attributes to Type 2 Dimension


Type 6 builds on the type 2 technique by also embedding current attributes in the dimension so that fact rows can be filtered or grouped by either the type 2 value in effect when the measurement occurred or the attribute’s current value. The type 6 moniker was suggested by an HP engineer in 2000 because it’s a type 2 row with a type 3 column that’s overwritten as a type 1; both 2 + 3 + 1 and 2 x 3 x 1 equal 6. With this approach, the current attributes are updated on all prior type 2 rows associated with a particular durable key, as illustrated by the following sample rows:
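For instance, a hypothetical Product dimension (assumed column names and values), where Department is tracked as type 2 and Current Department is the type 1 column overwritten across every row of the durable key, might look like this:

Product Key   Durable Key   Product Name    Department   Current Department   Row Effective Date   Row Expiration Date   Current Row Flag
10711         1071          Learning Kit    Education    Strategy             2012-01-01           2013-01-31            N
10856         1071          Learning Kit    Strategy     Strategy             2013-02-01           9999-12-31            Y

When the product moved from the Education department to Strategy, a new type 2 row was added and the Current Department column was overwritten to Strategy on both rows, so facts can be analysed by either the department in effect when the measurement occurred or the current department.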



 

The implementation part will come in a later post.

Hope this helps.

Feel free to leave your comments.

Friday 26 April 2013

Working with Reject Files in Informatica

When we run a session, the Integration Service may create a reject file for each target instance in the mapping to store the rejected target records. With the help of the session log and the reject file we can identify the cause of data rejection in the session; eliminating that cause will lead to rejection-free loads in subsequent session runs. If the Informatica writer or the target database rejects data for any valid reason, the Integration Service logs the rejected records into the reject file. Every time we run the session, the Integration Service appends the rejected records to the reject file.

Working with Informatica Bad Files or Reject Files

By default, the Integration Service creates the reject files (bad files) in the directory pointed to by the $PMBadFileDir process variable. It writes the entire rejected row to the bad file, even though the problem may lie in just one of the columns. Reject files follow the default naming convention [target_instance_name].bad. If we open a reject file in an editor we will see comma-separated values containing indicator tags along with the data values. There are two types of indicators in the reject file: the row indicator and the column indicator.

The easiest way to read a bad file is to copy its contents and save them as a CSV (Comma Separated Values) file; opening the CSV file gives a spreadsheet-like look and feel. The first column in the reject file is the row indicator, which tells whether the row was destined for insert, update, delete or reject. It is basically a flag that reflects the update strategy for the data row. When the commit type of the session is configured as user-defined, the row indicator also indicates whether the transaction was rolled back due to a non-fatal error, or whether the committed transaction was in a failed target connection group.

If our source file is like this:
 7,John,5000.375,76.12345,,BrickLand Road Singapore,Malasia

 The data in the reject/Bad file looks like:
0,D,7,D,John,D,5000.375,O,,N,BrickLand Road Singapore,T

Now let us understand what these extra characters mean.

List of Values of Row Indicators:

Row Indicator   Indicator Significance   Rejected By
0               Insert                   Writer or target
1               Update                   Writer or target
2               Delete                   Writer or target
3               Reject                   Writer
4               Rolled-back insert       Writer
5               Rolled-back update       Writer
6               Rolled-back delete       Writer
7               Committed insert         Writer
8               Committed update         Writer
9               Committed delete         Writer
If a row indicator is 3, an update strategy expression marked the row for reject.
After the row indicator come the column data values, each followed by its column indicator, which describes the data quality of the corresponding column.

List of Values of Column Indicators:

Column Indicator   Type of Data              Writer Treats As
D                  Valid data (good data)    The writer passes it to the target database. The target accepts it unless a database error occurs, such as finding a duplicate key while inserting.
O                  Overflowed numeric data   Numeric data exceeded the specified precision or scale for the column. Bad data, if you configured the mapping target to reject overflow or truncated data.
N                  Null value                The column contains a null value. Good data; the writer passes it to the target, which rejects it if the target database does not accept null values.
T                  Truncated string data     String data exceeded the specified precision for the column, so the Integration Service truncated it. Bad data, if you configured the mapping target to reject overflow or truncated data.
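
Decoding the sample reject row shown earlier against these indicators:

0                           -> row indicator: the row was destined for insert and was rejected by the writer or target
7,D                         -> valid data
John,D                      -> valid data
5000.375,O                  -> overflowed numeric data
,N                          -> null value
BrickLand Road Singapore,T  -> truncated string data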

 The reject file and session log contain information that helps you determine the cause of the reject. You can correct reject files and load them into relational targets using the Informatica reject loader utility. The reject loader also creates another reject file for the data that the writer or target rejects during the reject loading.
Complete the following tasks to load reject data into the target:

    Locate the reject file.
    Correct bad data.
    Run the reject loader utility.

NOTE:
1. You cannot load rejected data into a flat file target.
2. If you enable row error logging in the session properties, the Integration Service does not create a reject file. It writes the rejected rows to the row error tables or file.


Hope this helps.

Sunday 21 April 2013

Importance of Staging and Audit Tables in a Data Warehouse

In data warehousing it is quite common to use staging tables and audit tables.
One might wonder what their role is:
Why do we need staging tables?
Why do we need audit tables?

 Staging Area/Tables

 The staging area is primarily designed to serve as an intermediate resting place for data before it is processed and integrated into the target data warehouse.

This staging area serves many purposes above and beyond that primary function:

1. The data is most consistent with the source; it is devoid of any transformation or has only minor format changes.
2. A staging area in a relational database can be read, scanned and queried using SQL without needing to log into the source system or read files (text/xml/binary).
3. It is a prime location for validating data quality from the source and for auditing and tracking down data issues.
4. The staging area acts as a repository for historical data if it is not truncated.

Auditing Tables

Auditing is the monitoring and recording of selected user database actions.
Auditing is normally used to:
1. Investigate suspicious activity. For example, if an unauthorized user is deleting data from tables, the security administrator might decide to audit all connections to the database and all successful and unsuccessful deletions of rows from all tables in the database.
2. Monitor and gather data about specific database activities. For example, the database administrator can gather statistics about which tables are being updated, how many logical I/Os are performed, or how many concurrent users connect at peak times.
3. Track the lineage, quality and currency of data elements. A design tip from Ralph Kimball details an audit dimension that can be used to tag data. You might also want to audit attributes of the ETL jobs themselves.
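
A minimal sketch of an ETL job audit table, with assumed column names that would be adapted to the project:

WORKFLOW_NAME | SESSION_NAME | SOURCE_ROW_COUNT | TARGET_ROW_COUNT | REJECT_ROW_COUNT | LOAD_START_TIME | LOAD_END_TIME | LOAD_STATUS

Each session run inserts one row into such a table, which makes it easy to reconcile row counts between source and target and to spot failed or partial loads.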

The use of audit tables depends on the project requirements.


Hope this helps.

Currently Processed Flat File Name

Effective with PowerCenter 8.5 there is an option called Add Currently Processed Flat File Name Port.
If this flat file source option is selected, a file name port is added to the ports of the source.
This is particularly helpful when loading files with the indirect method and we need to pass the currently processed file name to the target.
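
For context, with the indirect method the session's source filetype is set to Indirect and the source file is a file list, i.e. a plain text file containing one source file path per line. A hypothetical file list might look like:

/data/incoming/customers_region1.dat
/data/incoming/customers_region2.dat
/data/incoming/customers_region3.dat

With the CurrentlyProcessedFileName port connected to the target, each loaded row then carries the name of the file it came from.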

To add the CurrentlyProcessedFileName port:

  •     Open the flat file source definition in the Source Analyzer.
  •     Click the Properties tab.
  •     Select Add Currently Processed Flat File Name Port .
  •     The Designer adds the CurrentlyProcessedFileName port as the last column on the Columns tab.
  •     The CurrentlyProcessedFileName port is a string port with default precision of 256 characters.
  •     Click the Columns tab to see your changes.
  •     You may change the precision of the CurrentlyProcessedFileName port if you wish.
  To remove the CurrentlyProcessedFileName port, click the Properties tab and clear the Add Currently Processed Flat File Name Port check box.

Note: While processing XML files, you can connect this port to only one target, because in an XML source each hierarchy works as an active source.


Hope this helps.

Friday 19 April 2013

Saving Logs For More Than One Run


By default, when you configure a workflow or session to create log files, the Integration Service creates one log file for the workflow or session. The Integration Service overwrites the log file when you run the workflow again. To create a log file for more than one workflow or session run, configure the workflow or session to archive logs in the following ways:

  • By run
  • By time stamp
Configuring the Session:
Go to Config Object --> Log Options --> Save session log by

Configuring the Workflow:
Go to Edit Workflow --> Properties --> Save workflow log by

In the drop-down there are two options, viz. Session runs and Session timestamp.


Archiving Logs by Run:

If you archive log files by run, specify the number of text log files you want the Integration Service to create, i.e., set Save session log for these runs to the number of logs you want. The Integration Service creates the number of historical log files you specify, plus the most recent log file. If you specify 5 runs, the Integration Service creates the most recent workflow log plus historical logs zero to four, for a total of six logs. You can specify up to 2,147,483,647 historical logs. If you specify zero logs, the Integration Service creates only the most recent workflow log file.

The Integration Service uses the following naming convention to create historical logs:
<session or workflow name>.n
where n=0 for the first historical log. The variable increments by one for each workflow or session run.
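
For example, with a hypothetical session log named s_m_load_customer.log and two historical runs kept, the log directory would contain:

s_m_load_customer.log      (most recent run)
s_m_load_customer.log.0    (first archived run)
s_m_load_customer.log.1    (second archived run)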


Archiving Log Files by Time Stamp:

When you archive logs by time stamp, the Integration Service creates an unlimited number of binary and text log files (see the Working with Log Files post below for how to create text log files). You will notice that Save session log for these runs is disabled. The Integration Service adds a time stamp to the text and binary log file names: it appends the year, month, day, hour and minute of the workflow or session completion to the log file. The resulting log file name is <session or workflow log name>.yyyymmddhhmi.
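
Continuing the hypothetical example, a session log archived by time stamp for a run that completed at 17:55 on 19 April 2013 would be named:

s_m_load_customer.log.201304191755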

Note:

  • When you run concurrent workflows with the same instance name, the Integration Service appends a timestamp to the log file name, even if you configure the workflow to archive logs by run.
  • To prevent filling the log directory, periodically purge or back up log files when using the time stamp option.



Hope this helps.


Thursday 18 April 2013

Holding Previously Processed Row Values

We all know that in PowerCenter, the transformations in a mapping read, process and load one record at a time. There might be situations where you want to compare the values of the currently processing row with those of the previously processed row, or use them in calculations.
What will you do? What is the trick that works here?

Variable ports do the trick, yes.
We can make PowerCenter hold the previously processed record by making use of variable ports.
If you want to hold customer_id, make the following changes in the Expression transformation:
1. Take two variable ports above customer_id, say v_custid and prev_cust_id.
The ports in the Expression transformation will be:

PORT            TYPE      EXPRESSION
prev_cust_id      v           v_custid
v_custid             v           customer_id
customer_id       i/o         customer_id

You are done! Yes, prev_cust_id now holds the previously processed customer_id.
How does it work? Let's see.

If customer_id has the values 1, 2, 3, ...

When the first record reaches the expression transformation the values will be
prev_cust_id      null
v_custid             1
customer_id       1

When the second record reaches the expression transformation the values will be

prev_cust_id      1
v_custid             2
customer_id       2

You can observe this by running the Debugger Instance.

You can do the same for any column for which you need the previously processed value.
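
A variation on this (a sketch, assuming a numeric SALARY input port) uses the held value in a calculation, here the difference from the previous row's salary:

PORT              TYPE      EXPRESSION
prev_salary        v        v_salary
v_salary           v        SALARY
SALARY            i/o       SALARY
o_salary_diff      o        IIF(ISNULL(prev_salary), 0, SALARY - prev_salary)

Because variable ports are evaluated top to bottom before the output ports, prev_salary still holds the previous row's salary when o_salary_diff is calculated.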

Note:
The order of the ports defined above must be maintained, since variable ports are evaluated from top to bottom.




Hope this helps.

Tuesday 16 April 2013

Working with Log Files

The Integration Service writes the workflow and session logs as binary files on the node where the service process runs. It adds a .bin extension to the log file name you configure in the session and workflow properties.

You can view log events for workflows with the Log Events window in the Workflow Monitor. When you try to access the log events from outside the Workflow Monitor, you will be restricted with an error message.


But what will you do when you come across a situation where you can only access the log from outside the Workflow Monitor?

There is a way to do this!

Configure a workflow or session to write log events to log files in the workflow or session properties. The
Integration Service writes information about the workflow or session run to a text file in addition to writing log events to a binary file. If you configure workflow or session properties to create log files, you can open the text files with any text editor or import the binary files to view logs in the Log Events window.


You can configure the following information for a workflow or session log:

  • Write Backward Compatible Log File. Select this option to create a text file for workflow or session logs. If you do not select the option, the Integration Service creates the binary log only.
  • Log File Directory. The directory where you want the log file created. By default, the Integration Service writes the workflow log file in the directory specified in the service process variable $PMWorkflowLogDir.
  • Name. The name of the log file.
Note:

  • If you enter a directory name that the Integration Service cannot access, the workflow or session fails.
  • You must configure a name for the log file or the workflow or session is invalid. You can use a workflow or worklet variable for the log file name.
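
For instance, a hypothetical session could be configured with:

Write Backward Compatible Log File : checked
Log File Directory                 : $PMSessionLogDir
Name                               : s_m_load_customer.log

The Integration Service would then write the binary log s_m_load_customer.log.bin and the text log s_m_load_customer.log to that directory after each run.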

Hope this helps.


Tuesday 9 April 2013

User Defined Functions in Informatica

The Informatica PowerCenter transformation language provides a long list of functions that you can use in the Designer and Workflow Manager. However, PowerCenter also allows us to define functions of our own, referred to as user-defined functions (UDFs).

User-defined functions extend the PowerCenter transformation language. You can create and manage user-defined functions with the PowerCenter transformation language in the Designer. You can add them to expressions in the Designer or Workflow Manager to reuse expression logic and build complex expressions. User-defined functions are available to other users in a repository.

User-defined functions are of two types:

  • Public UDF: Public UDFs can be used by all users in the repository. They can be used in transformations, workflows, link conditions and Command tasks.
  • Private UDF: A private UDF can only be used inside a public UDF.
Creating a User Defined Function
User-defined functions are used to create complex functions from the existing built-in functions provided by the Informatica ETL tool. Follow the steps below to create a UDF of your own.

  • After logging in to the Designer and opening the folder, right-click the User Defined Functions node in the Navigator and click New. A pop-up window will open as shown below.
  • Enter the UDF name in the Name field. Select the UDF type as public or private. Add the required number of arguments (name, datatype, precision and scale).
  • Click the Launch Editor button. This opens an expression editor window. Here, define the functionality of the UDF using the existing built-in functions, other user-defined functions and the arguments.
  • Click OK and save the UDF.

The image below shows the creation of a UDF named RemoveAdditionalSpaces using the expression
REG_REPLACE(LTRIM(RTRIM(INPUT)),'\s+',' ')
which removes multiple spaces between words and extra spaces before and after the string.


After creating a UDF, you can edit or delete it in the future by right-clicking it in the Navigator.
You can now reference this function directly wherever expressions are allowed, by prefixing :UDF. before the UDF name,
like
:UDF.RemoveAdditionalSpaces(Portname)
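
For instance (a hypothetical literal input), assuming the '\s+' pattern shown above:

:UDF.RemoveAdditionalSpaces('   New    York   ')   -- returns 'New York'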

Important:

  • Public UDFs cannot be changed to private UDFs, whereas private UDFs can be changed to public UDFs.
  • You cannot nest a UDF within itself.


Hope this helps.



Removing Non Printable and Control Characters from Source Data

In some cases source data may contain unwanted characters such as non-printable characters.
Non-printable and special characters in clinical trial data create potential problems in producing quality deliverables.

There could be major issues such as incorrect statistics or counts in the deliverables, or minor ones such as incorrect line breaks, page breaks or the appearance of strange symbols in the reports. Identifying and deleting these characters can pose challenges. When faced with this issue in the pharmaceutical and biotech industries, it is imperative to clean them up. We need to understand the underlying cause and use various techniques to identify and handle them.

For the purpose of our topic, we can broadly classify the characters into 3 groups:
1. 33 non-printable special characters: the first 32 characters (decimal values 0 to 31) and the DEL character (decimal value 127).
2. 94 standard printable characters (decimal values 33 to 126), which represent letters, digits, punctuation marks and a few miscellaneous symbols.
3. 128 special characters (Extended ASCII or ISO-8859-1, decimal values 128 to 255). Decimal values 128 to 159 in the Extended ASCII set are non-printing control characters.

PROBLEMS CAUSED BY NON-PRINTABLE & SPECIAL CHARACTERS (NPSC):

In clinical trials data we do not expect any characters outside the decimal value range 32 to 127, because of the problems mentioned below.

The following are some of the issues that might be caused by NPSC:

1. The line/page alignment of the generated output is disrupted when some of these characters are present. The most common problem is that, even though there is plenty of space available in a line or page, the data spills over to the next line or page without using all of it.

2. Depending on their presence in critical variables, one might get wrong statistics or counts in the outputs.

3. Unexpected conditional statement results and/or an incorrect number of records selected during subsetting.

4. Some characters (Extended ASCII characters) are not the same across operating systems, applications and fonts. When such characters are present in a SAS dataset, the character might have had a different form or meaning in the source application compared to the final destination, which is the SAS dataset.

SOLUTION:

Here we will see how to resolve this problem in Informatica.
We all know the power of regular expressions in manipulating such data sets, so we use a RegEx to resolve this.

We can replace the NPSC with null using REG_REPLACE (input_port below is a placeholder for the column being cleaned):

REG_REPLACE(input_port, '[^[:print:]]', NULL)

 Here,
[:print:] denotes all printable characters in the ASCII character set.
^ used inside POSIX brackets works as a negation operator, so
[^[:print:]] matches all non-printable characters.
Using this with REG_REPLACE we can replace all the NPSC with characters of our choice.
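
As a quick illustration (hypothetical port and value): if input_port contains 'New' followed by a tab character and then 'York', the pattern [^[:print:]] matches the tab but not the letters or ordinary spaces, so only the tab is stripped and the result is 'NewYork'.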


Hope this helps.



Dynamically Create Output Directories

There might be situations where we want to place our output files in directories created at run time.

Business Requirement:

Suppose we have a business requirement to group the output files based on the date they were created. How can you configure your workflow to create output directories at run time for your session output files?

Solution:

We can achieve this by configuring the target attributes in the session.
While this example creates directories using the date, it can be adapted to your own business requirement.

Observe the screenshot below.



1. We should check the Create Target Directory check box.
2. We should provide the Output file directory location.
Here the Output file directory value '$PMTargetFileDir\OutputDir' tells the Integration Service to create a sub-directory named 'OutputDir' inside $PMTargetFileDir, which is the default target file directory.

When you run the workflow you will see your target file created in OutputDir.

Well, this is a static way of creating the output directory.

To make this dynamic, we use a user-defined workflow variable, say $$OutputDir, and assign a value to it using an Assignment task.

Create an Assignment task before the session and name it Asgn_OPut_Dir_Name. Define a user-defined workflow variable named $$OutputDir (Workflow > Edit > Variables) and, in the Assignment task's expressions, assign it the expression below.

SUBSTR(TO_CHAR(SYSDATE),1,2)||'-'||SUBSTR(TO_CHAR(SYSDATE),4,2)||'-'||SUBSTR(TO_CHAR(SYSDATE),7,4)
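
Assuming the default PowerCenter date display format (MM/DD/YYYY HH24:MI:SS), the same value can also be produced more compactly with:

TO_CHAR(SYSDATE,'MM-DD-YYYY')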






Finally, the session's Output file directory should reference this value, for example '$PMTargetFileDir\$$OutputDir' with Create Target Directory still checked (depending on your setup, the workflow variable may need to be exposed to the session through a parameter file or a pre-session variable assignment). When you run the workflow, the target files will then be written to a new folder created each day.


Hope this helps.