In some cases source data may contain unwanted characters such as non printable characters.
Non printable & special
characters in clinical trial data create potential problems in producing
quality deliverables.
There could be major issues such as
incorrect statistics / counts in the deliverables, or minor ones such as
incorrect
line breaks, page breaks or
appearance of strange symbols in the reports. Identifying and deleting these
issues could
pose challenges. When faced with
this issue in the Pharmaceutical & Biotech industries, it is imperative to
clean them
up. We need to understand the
underlying cause and use various techniques to identify and handle them
For the purpose of our topic, we can broadly classify the characters into 3 groups:
1. 33 non printable special charactersThe first 32 characters (decimal value from 0 to 31) and the DEL char(decimal value 127).
2. 94 standard printable characters (decimal value range from 33 to 126) which represent letters, digits, punctuation
marks, and a few miscellaneous symbols.
3. 128 special characters (Extended ASCII or ISO-8859-1. Decimal values range from 128 to 255). Decimal values
from 128 to 159 in the Extended ASCII set are non printing control characters.
PROBLEMS CAUSED BY NON PRINTABLE & SPECIAL CHARACTERS (NPSC):
In Clinical trials data, we do not
expect to have any characters outside the decimal values range from 32 to 127
because of the problems mentioned
below.
Following are some of the issues that
might be caused by NPSC.
1. The line / page alignment in the
output generated will be disrupted when some of these characters are present in
the output. Most common problem is, even though there is plenty of space
available in a line / page, with out
using all of it, the data will spill
over to the next line / page.
2. Depending on their presence in
the critical variables, one might get wrong statistics or counts in the
outputs.
3. Unexpected conditional statement
results and/or incorrect number of
records get selected during subset.
4. Some characters (Extended ASCII
characters) are not same across operating system s / applications/ fonts.
When such characters are present in
a SAS dataset, it is possible that the character might have had a different
form or meaning in the source application
compared to the final destination which is SAS dataset.
SOLUTION:
Here we will see how to resolve this problem in Informatica.
As we all know the power of Regular Epressions in manipulating typical data sets. Here we use RegEx to resolve this.
We can replace the NPSC with null using REG_REPLCAE .
REG_REPLACE('[^[:print:]]',NULL)
Here,
[:print:] denotes all printable characters in ASCII character set.
^ when used with POSIX Brackets works like a negation operator.So
[^[:print:]] will return all non printable characters.
Using this with REG_REPLACE we can replace all the npsc with charactres of our choice.
Hope this helps.
Hi Naresh,
ReplyDeleteThis is very useful information.
thanks for the info.
keep posting.
Thanks Naresh,
ReplyDeleteVery good information, keep it up
Very helpful information
ReplyDeleteHi Naresh,
ReplyDeleteThanks a lot for such a important and usefull information, It resoloved high priority issue.
Regards
Smital
This is working perfectly for me..
ReplyDeleteMy requirement was lie, input file in UNIX machine has some spaces(by seeing). when i check with ":set list" command, i am able to see ^Z. But when i load this into table, i am seeing symbol as arrow(-->).
I tried in so many ways with no use. Finally your REG_REPLACE('[^[:print:]]',NULL) as gave me the expected results.
Thank you.