Saturday, June 16, 2012

Understanding the hash lookup in SAS

The hash is one of the most interesting data structures ever devised.

As a built-in language feature (the associative array) it was introduced in a language named "AWK" and was then popularized by Perl. The concept has since been picked up by numerous other languages. SAS was one of the last to introduce hash objects, based on numerous client and user requests I guess  :P

Let's first understand what a hash is and how it works.

A hash object is nothing but a set of key and value pairs. Internally there is an algorithm that calculates the address of the "value" from the contents of the "key".

So the first step is to build the hash within the available memory space.

The hash algorithm assigns an address to each value based on its key. Once all the input key and value pairs have been placed in memory (RAM), the second step starts.

In this step we provide a key and ask the hash to fetch us the value.

Two different keys can occasionally be assigned the same address (a collision), but the hash object handles this internally, so a lookup either returns the value stored for exactly that key or reports that the key is not there. For all practical purposes we can take it as 100 % correct.

The efficiency of a hash comes from the fact that calculating the address takes a fraction of the time needed to walk through a list looking for the key, and that time hardly grows as more entries are added.

The other great advantage is that the key need not be numeric !! It can be any alphanumeric string :)  unlike normal arrays, where we have to specify the position as a number to get the value stored there.


Developing hash algorithms is a topic and a course in itself, so we will not touch on it, but will simply trust that the algorithm works like a magic wand, which in practice it really does !!

Now, coming to the SAS implementation.

SAS is not an object-oriented programming language like Java or C++. During the entire existence of SAS, Mr. Goodnight never felt a need to switch to an object-oriented language or to introduce other beautiful data structures like list objects etc.

Because he felt that everything he needed could be done using the manipulations of the data step, which is implemented in C.

So, as far as I know, the SAS system itself is largely written in C.

Hash is an exception to this rule. The hash implementation in SAS is done in an object-oriented fashion. I am not sure of the exact reason; I guess SAS chose to implement hash as an object because of efficiency.

All being said, the point is - "hash is an object in SAS."
(one of the very first ever )

So we have to create this object inside the data step before we can use it, because that is when the object gets created in memory. It lives only as long as that data step runs.

Here is the way we do it -

declare hash h(dataset:'participants');
h.defineKey('name');
h.defineData('gender', 'treatment');
h.defineDone();

'h' is the name of the hash object that will be created in memory.

If you are familiar with any object-oriented language you will know that objects have methods, which are defined in the class definition.

Here too we have some basic methods. defineKey names the variable used as the key (here 'name'). Similarly, defineData names the variables whose values are stored for each key. Here there are two values per key: gender and treatment. When the definition is complete we call defineDone().

This initializes the hash object. The variables name, gender and treatment must already be known to the data step (for example through a length statement), otherwise SAS reports an error. Note that in the above code snippet we are also passing a dataset:

" dataset:'participants' "

so that the hash is filled from that dataset as soon as the definition is done.

So what SAS does is this -

1) It will create the object in memory
2) It will fetch the key and value pairs from the dataset specified in the dataset option
3) It will then populate the object with the required data [by placing the values at the location in memory that the algorithm determines from the input key value]
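
The three steps above can be sketched as a small self-contained program. The rows in participants are made up purely for illustration. The "if 0 then set" line is a common idiom rather than anything specific to this post: it never reads a row, but it makes name, gender and treatment known to the data step, which the hash object needs. The step should report three entries in the log.

```sas
/* Hypothetical sample data, made up for this sketch */
data participants;
   length name $ 20 gender $ 1 treatment $ 20;
   input name $ gender $ treatment $;
   datalines;
Anita F Placebo
Ravi M DrugA
Meena F DrugB
;
run;

data _null_;
   /* Never executes, but defines name, gender and treatment at compile time */
   if 0 then set participants;

   /* Steps 1-3: create the object, read the dataset, load the key/value pairs */
   declare hash h(dataset: 'participants');
   h.defineKey('name');
   h.defineData('gender', 'treatment');
   h.defineDone();

   /* num_items is an attribute holding the number of entries in the hash */
   n = h.num_items;
   put 'Entries loaded: ' n;
run;
```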


Sometimes we may want to add values in a hard-coded fashion within the data step instead of building the hash from an existing dataset, or we may have to add a few extra hard-coded values to an existing hash object.

In those cases we can make use of the method 'add'.


---

name='Nageswara';
gender='M';
treatment='Hash headache';

h.add();


or

h.add(key: 'Nageswara', data: 'M', data: 'Hash headache');

 ----
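
Here is a minimal sketch that runs both forms of add in one data step. It assumes an empty hash with the same key and data variables as before; the argument tags documented by SAS for add are key: and data:. Capturing the return code in rc (a variable name chosen for this sketch) shows that add refuses a key that is already present, while replace overwrites it.

```sas
data _null_;
   /* The hash needs these variables to exist in the data step */
   length name $ 20 gender $ 1 treatment $ 20;

   declare hash h();
   h.defineKey('name');
   h.defineData('gender', 'treatment');
   h.defineDone();

   /* Form 1: set the variables, then add */
   name = 'Nageswara';
   gender = 'M';
   treatment = 'Hash headache';
   rc = h.add();
   put 'First add: ' rc=;

   /* Form 2: pass key and data inline; the key already exists, so rc is non-zero */
   rc = h.add(key: 'Nageswara', data: 'M', data: 'Something else');
   put 'Duplicate add: ' rc=;

   /* replace() overwrites the data stored for an existing key */
   rc = h.replace(key: 'Nageswara', data: 'M', data: 'Something else');
   put 'Replace: ' rc=;
run;
```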


Finally, once we have the hash object and the data is loaded into it, we will want to look up data, which is the aim of our entire exercise.


For that purpose SAS has provided us with a method named "find()".

find() takes the current value of the key variable (name) and, if that key is in the hash, copies the stored values into the data variables (gender and treatment). So if the source dataset already has a column with the same name as the key, the lookup works automatically; if not, we need to assign or pass the key ourselves. Here is a code snippet -

/* suppose the input dataset 'in' has the variable user_name, and we want
user_gender and user_treatment in the output.
The problem is to determine the gender and treatment based on the user name
*/



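data out;
set in;
/* As we will want to build hash object once and use it multiple times */
if _n_ = 1 then do;

/* make name, gender and treatment known to the data step without reading any rows */
if 0 then set participants;

/* declaring the hash object on the dataset named participants */
declare hash h(dataset:'participants');
h.defineKey('name');
h.defineData('gender', 'treatment');
h.defineDone();

end;

name=user_name;

/* find() returns 0 when the key is present */
rc = h.find();

/* no match: clear the values left over from the previous row */
if rc ne 0 then call missing(gender, treatment);

user_gender=gender;
user_treatment=treatment;

drop gender name treatment rc;

run;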


Note that we can also look up any key directly by passing it to find, without assigning the key variable first:

rc = h.find( key: 'Nageswara' );
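
What find does when the key is missing is easiest to see from its return code. The sketch below assumes a one-entry hash built with add; rc is just a variable name chosen here. It also shows check(), which only tests whether a key exists and leaves the data variables alone.

```sas
data _null_;
   length name $ 20 gender $ 1 treatment $ 20;
   call missing(name, gender, treatment);

   declare hash h();
   h.defineKey('name');
   h.defineData('gender', 'treatment');
   h.defineDone();
   rc = h.add(key: 'Nageswara', data: 'M', data: 'Hash headache');

   /* Key present: rc is 0 and gender/treatment are filled in */
   rc = h.find(key: 'Nageswara');
   put rc= gender= treatment=;

   /* Key absent: rc is non-zero and gender/treatment keep their old values */
   rc = h.find(key: 'Nobody');
   put rc= gender= treatment=;

   /* check() only tests for the key and never touches the data variables */
   rc = h.check(key: 'Nobody');
   put rc=;
run;
```

Because a failed find leaves the old values in place, it is worth testing the return code before using gender and treatment.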


----


There are a lot of other useful methods on the hash object. There is also a hash iterator object, which we have to create to read the entries in sequence (more like walking a linked list).
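
As a rough sketch of the iterator, assuming a small hash filled by hand with made-up names: the iterator is declared with "declare hiter" and attached to the hash by name, and first() and next() return 0 as long as there is an entry to read. Note that name is listed in defineData as well, because the iterator only fills the data variables.

```sas
data _null_;
   length name $ 20 gender $ 1;
   call missing(name, gender);

   /* ordered: 'a' keeps the entries sorted by key in ascending order */
   declare hash h(ordered: 'a');
   h.defineKey('name');
   h.defineData('name', 'gender');
   h.defineDone();

   rc = h.add(key: 'Ravi',  data: 'Ravi',  data: 'M');
   rc = h.add(key: 'Anita', data: 'Anita', data: 'F');

   /* The iterator is attached to the hash object by its name */
   declare hiter it('h');

   /* Walk the entries from first to last */
   rc = it.first();
   do while (rc = 0);
      put name= gender=;
      rc = it.next();
   end;
run;
```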


For the full list, please go through the documentation on support.sas.com ( http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003143739.htm )
