Tuesday, 9 April 2013

Removing Non Printable and Control Characters from Source Data

In some cases source data may contain unwanted characters such as non printable characters. 
Non printable & special characters in clinical trial data create potential problems in producing quality deliverables.

There could be major issues such as incorrect statistics / counts in the deliverables, or minor ones such as incorrect
line breaks, page breaks or appearance of strange symbols in the reports. Identifying and deleting these issues could
pose challenges. When faced with this issue in the Pharmaceutical & Biotech industries, it is imperative to clean them
up. We need to understand the underlying cause and use various techniques to identify and handle them
 

For the purpose of our topic, we can broadly classify the characters into 3 groups: 
1. 33 non printable special charactersThe first 32 characters (decimal value from 0 to 31) and the DEL char(decimal value 127). 
2. 94 standard printable characters (decimal value range from 33 to 126) which represent letters, digits, punctuation
marks, and a few miscellaneous symbols. 
3. 128 special characters (Extended ASCII or ISO-8859-1. Decimal values range from 128 to 255). Decimal values
from 128 to 159 in the Extended ASCII set are non printing control characters. 

PROBLEMS CAUSED BY NON PRINTABLE & SPECIAL CHARACTERS (NPSC):

In Clinical trials data, we do not expect to have any characters outside the decimal values range from 32 to 127
because of the problems mentioned below.

Following are some of the issues that might be caused by NPSC.

1. The line / page alignment in the output generated will be disrupted when some of these characters are present in the output. Most common problem is, even though there is plenty of space available in a line / page, with out
using all of it, the data will spill over to the next line / page.

2. Depending on their presence in the critical variables, one might get wrong statistics or counts in the outputs.

3. Unexpected conditional statement results and/or incorrect number  of records get selected during subset.

4. Some characters (Extended ASCII characters) are not same across operating system s / applications/ fonts.
When such characters are present in a SAS dataset, it is possible that the character might have had a different
form or meaning in the source application compared to the final destination which is SAS dataset.

SOLUTION:

Here we will see how to resolve this problem in Informatica.
As we all know the power of  Regular Epressions in manipulating typical data sets. Here we use RegEx to resolve this.

We can replace the NPSC with null using REG_REPLCAE .

REG_REPLACE('[^[:print:]]',NULL)

 Here, 
[:print:] denotes all printable characters in ASCII character set.
^ when used with POSIX Brackets works like a negation operator.So
[^[:print:]] will return all non printable characters.
Using this with REG_REPLACE we can replace all the npsc with charactres of our choice.


Hope this helps.



5 comments:

  1. Hi Naresh,

    This is very useful information.
    thanks for the info.
    keep posting.

    ReplyDelete
  2. Thanks Naresh,

    Very good information, keep it up

    ReplyDelete
  3. Hi Naresh,

    Thanks a lot for such a important and usefull information, It resoloved high priority issue.

    Regards
    Smital

    ReplyDelete
  4. This is working perfectly for me..
    My requirement was lie, input file in UNIX machine has some spaces(by seeing). when i check with ":set list" command, i am able to see ^Z. But when i load this into table, i am seeing symbol as arrow(-->).
    I tried in so many ways with no use. Finally your REG_REPLACE('[^[:print:]]',NULL) as gave me the expected results.

    Thank you.

    ReplyDelete