Warehouse Warriors: Removing Non Printable and Control Characters from Source Data

Tuesday, 9 April 2013

Removing Non Printable and Control Characters from Source Data

In some cases source data may contain unwanted characters such as non printable characters.

Non printable & special characters in clinical trial data create potential problems in producing quality deliverables.

There could be major issues such as incorrect statistics / counts in the deliverables, or minor ones such as incorrect

line breaks, page breaks or appearance of strange symbols in the reports. Identifying and deleting these issues could

pose challenges. When faced with this issue in the Pharmaceutical & Biotech industries, it is imperative to clean them

up. We need to understand the underlying cause and use various techniques to identify and handle them

For the purpose of our topic, we can broadly classify the characters into 3 groups:

1. 33 non printable special charactersThe first 32 characters (decimal value from 0 to 31) and the DEL char(decimal value 127).

2. 94 standard printable characters (decimal value range from 33 to 126) which represent letters, digits, punctuation

marks, and a few miscellaneous symbols.

3. 128 special characters (Extended ASCII or ISO-8859-1. Decimal values range from 128 to 255). Decimal values

from 128 to 159 in the Extended ASCII set are non printing control characters.

PROBLEMS CAUSED BY NON PRINTABLE & SPECIAL CHARACTERS (NPSC):

In Clinical trials data, we do not expect to have any characters outside the decimal values range from 32 to 127

because of the problems mentioned below.

Following are some of the issues that might be caused by NPSC.

1. The line / page alignment in the output generated will be disrupted when some of these characters are present in the output. Most common problem is, even though there is plenty of space available in a line / page, with out

using all of it, the data will spill over to the next line / page.

2. Depending on their presence in the critical variables, one might get wrong statistics or counts in the outputs.

3. Unexpected conditional statement results and/or incorrect number of records get selected during subset.

4. Some characters (Extended ASCII characters) are not same across operating system s / applications/ fonts.

When such characters are present in a SAS dataset, it is possible that the character might have had a different

form or meaning in the source application compared to the final destination which is SAS dataset.

SOLUTION:

Here we will see how to resolve this problem in Informatica.

As we all know the power of Regular Epressions in manipulating typical data sets. Here we use RegEx to resolve this.

We can replace the NPSC with null using REG_REPLCAE .

REG_REPLACE('[^[:print:]]',NULL)

Here,
[:print:] denotes all printable characters in ASCII character set.

^ when used with POSIX Brackets works like a negation operator.So
[^[:print:]] will return all non printable characters.

Using this with REG_REPLACE we can replace all the npsc with charactres of our choice.

Hope this helps.

5 comments:

Anonymous10 April 2013 at 02:39
Hi Naresh,

This is very useful information.
thanks for the info.
keep posting.
ReplyDelete
Replies
siva14 May 2013 at 02:23
Thanks Naresh,

Very good information, keep it up
ReplyDelete
Replies
Devang20 December 2013 at 08:51
Very helpful information
ReplyDelete
Replies
Unknown27 November 2014 at 21:29
Hi Naresh,

Thanks a lot for such a important and usefull information, It resoloved high priority issue.

Regards
Smital
ReplyDelete
Replies
Unknown26 September 2019 at 04:35
This is working perfectly for me..
My requirement was lie, input file in UNIX machine has some spaces(by seeing). when i check with ":set list" command, i am able to see ^Z. But when i load this into table, i am seeing symbol as arrow(-->).
I tried in so many ways with no use. Finally your REG_REPLACE('[^[:print:]]',NULL) as gave me the expected results.

Thank you.
ReplyDelete
Replies

Add comment