Saturday, June 16, 2012

Understanding the hash lookup in SAS

Hash is a one of the very interesting data structure that was every discovered.

Historically it was introduced in laguage named "AWK" and then was popularize by perl language. The concept has been picked by numerous other languages. SAS was one of the last to introduce hash objects based on numerous client and user requests I guess  :P

Lets first understand what a hash is and how it works.

Hash object is nothing but a set of key and value pairs.  There is an algorithm internally that will calculate address of the "value" based on the values of the "key".

So the first step is to build a hash with in the available memory space.

What happens is hash algorithm intelligently assigns an address for the value based on the value of the key. Once all the input key and value pairs are placed into the memory (RAM), then the second step starts.

In this we provide a key and ask hash to fetch us the value. 

The algorithm is such that it will give correct address location with a 99.99999 (or so) probability. For all practical purpose we can take it as 100 % hits.

Efficiency in hash comes because calculation operation happen fraction of the time it takes to iterate the list using any other available search algorithms.

The other greatest advantage is that key need not be numeric !! It can be any alphanumeric string :)  unlike normal arrays which need us to specify the array position in number to give us the value stored in it.


Developing hash algorithms is a topic and a course in it self so we will not touch on it, but will wish that like a magic wand the algorithm works which it really does practically !!

So now coming to SAS implementation.

SAS is not an object oriented programming language like java or C++. During the entire existence of the SAS, Mr Goodnight never felt a need to switch to object oriented languages or to introduce other beautiful data structures like arrays, list objects etc.

Because he felt that everything that he needed can be done using the manipulations of the data step created in C.

So entire SAS language is written in C.

Hash is an exception to this rule. Hash implementation of SAS is in an object oriented fashion. Exact reason I am not sure I guess it might because of the efficiency that SAS has chosen to implement hash in an object fashion.

 All being said, the point is - "hash is an object in SAS."
(one of the very first ever )

so we have to create this object before we can use it in any data step. Because then the object gets spawned in the memory.

Here is the way we do it -

declare hash h(dataset:'participants');
h.defineKey('name');
h.defineData('gender', 'treatment');
h.defineDone();

'h' is the name of the hash object that will be created in the memory.

If you are familiar with any object oriented languages you will know that objects will have methods which are defined in the class definition.

So here also we have some basic methods like defineKey which is used to define the name of the key named 'name'. Similarly we have defineData method which defines the value for the given key. Here we see that there are two values for a key. When defining the hash object has completed we use defineDone().

This will initialize the hash object. Note in the above code snippet we are providing/passing another dataset.

" dataset:'participants' "

So that we can directly build hash after it is defined dynamically.

So what SAS does is that -

1) It will create the object in the memory
2) It will try to fetch the key and value pairs from the dataset option specified\
3) It will then populate the object with the required data [By placing the values at the location in memory that is determined by the algorithm using the input key value ]


Sometimes we can plan to add values in a hard coded fashion with in the data step instead of building the hash from already existing dataset or there may be case where we have to add additional values in a hard coded manner to the existing hash object.

In those cases we can make of method 'add'.


---

name='Nageswara';
gender='M';
treatment='Hash headache';

h.add();


or

h.add(Key: 'Nageswara', Value: 'M' , Value: 'Hash headache');

 ----


Finally once we have the hash object and data populated into the hash we will always want to lookup data, which is aim of our entire exercise. 


For that purpose SAS has provided us with function named "find()".

So if in the source dataset there already the column named with key we will automatically get the value in the column named that of value as defined in hash if not we need to pass the values as required. Here is code snippet -

/* suppose in the input dataset 'in' has variables as user_name, user_gender and user_treatment.
The problem is to dertermine the gender and treatment based on user name
*/



data out;
set in;
/* As we will want to build hash object once and use it multiple times */
if _n_ = 1 then do;

/* declaring the hash object on the dataset named participants */
declare hash h(dataset:'participants');
h.defineKey('name');
h.defineData('gender', 'treatment');
h.defineDone();

end;

name=user_name;

 h.find();

user_gender=gender;
user_treatment=treatment;

drop gender name treatment;

run;


Note we can also find the value for any key present in the hash if can provide it to the hash as follows

find( key: 'Nageswara' );


----


There are lot of other useful functions on hash which we can use. There is also hash iterator object which we have to build to get the data in sequence. (More like linked list )


Please go through support.sas.com ( http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003143739.htm )

Saturday, March 31, 2012

SQL Transform and its limitations

SQL transform is one of the key transform that we use in DI Jobs. In SAS DI Studio there is no transform for data step specifically for merging various data sets that the SAS developer are so used to. So for all the merge/join operations we use the standard SQL transform that is available in SAS DI.

SQL transform generates at the back end proc sql code. There are lot of advantages when we use the sql transform instead of the user written which should be avoided to the extent possible.

Some of the advantages that I can think right over my head are.
1) The metadata adaptation of the transform is good. So even in case of the any change in the underlying data set the SQL transform automatically adjusts.

2) There is option in SQL transform to generate code for explicit sql pass through.

3) Sub queries and the functionality of the case statement is good so we can maintain the code in a better fashion.

4) Specification of the types of joins and the filter conditions conditions is good.

5) And finally as with any visual tools the automatic code generation by the transform save times lot of precious coding and maintenance time for the developers.


As the great person has said that perfection of the tool only comes when the user knows all the limitations that it keeps on him.

With my experience I have faced the following limitations which lead me to use user written code even though I wanted to avoid it.

1) Problem with explicit sql pass through when DBMS column size is more than 32 chars.

As known column width in the data sets is limited in SAS to 32 chars. But the column width can be more than 32 chars in DBMS such as  SQL Server, so when we want to do explicit sql pass through in Such a case, but interestingly the implicit sql pass through does not fail. Which means that the SQL has cognisance of the full name of the column but some how the transform developer never envisioned such a problem will come :( . Hope this will be fixed in future versions of SAS DI.

2) Problem with readbuff option.

SAS provides with various performance improvement options. One of such option is readbuff. The readbuff options drastically improves the I/O performance by increasing the buffer that is available by reading more than 1 (default) row into buffer.

By default if we dont set the readbuff option on the individual datasets SAS generates the code with does the implicit sql pass through of the join. But when I keep the readbuff option the SAS tries to pull both the data sets used in the join to the SAS layer using the readbuff option and then do the join at the SAS layer. There is no option for use to specify in the SQL transform to apply readbuff directly onto the SQL join output i.e., to apply readbuff option on the result set of the implicit sql pass through.

The readbuff option can also be applied at the library level but I have not tried it as I didnt have the rights to modified the library registration.

I think that irrespective of where the option applied on the library level helps or not it is my opinion that DI must have a place where I could have provided the option on the result set of implicit sql pass through.

3) Problem with Subquery in CASE statement.

SQL transform has limitation that it cannot have a subquery with in case statement, which is possible in the proc sql.

4) Whenever a new change happens to the dataflow of the DI jobs all the columns are erroneously propagated from the source to the target.

It is very irritating for the developer to correct the column mapping again and again whenever there is a change in the job. Though this problem is not specific to the SQL transform but I wanted to specify this because the tool is over intelligent to do what we are not saying also. This may result is erroneous data flow big time pain for the developers as they will need to check and correct the data mapping for each and every transform.


5) Vertical string concatenation is not possible in proc sql.

Suppose there is a data set
A
x   y
1  a
1  b
1  c
2 d
2  e
3  a

data set B
x  y_str
1  a,b,c
2  d,e
3  a

This operation is not possible in proc sql, so we have to write user written code ( data step ) to do the job.

---

My First Techno Tip

I have decided that I share my experiences in technology.

This will help me on two fronts.

1) It will keep track of my experiments and learning in technology.
2) Share idea and thoughts in technical arena with my everyone out there.

My current technical competencies and interest area are -

1) SAS - BASE, ADVANCED, BI
2) Predictive Analytics
3) Perl
4) phython
4) Shell Scripting
5) Java, C
6) R
7) Unix

Want to learn new technologies and adapt myself with know how in every spear.

I am passionate about open source technologies. I use fedora Linux as my desktop operating system and encourage every one to try their hands on Linux.

For initial post I am adding a link to post on why technical people should blog :)