Mar 29

Automated synchronization for journalled objects.


One of the challenges for all High Availability products is what to do when a journalled object goes out of sync. It should not happen too often, but once it has you need to be able to re-sync the object between the systems as effectively as possible. When RAP first appeared we relied on manual intervention for the re-sync process, because we needed to ensure the re-sync fell during a period with no user activity, and a receiver change had to occur at the right time so that any entries logged before the save were not applied after the restore of the new object.

One of our customers found a way of carrying this out during low activity by forcing a lock on the object while he carried out the save and restore process, changing the receivers at the appropriate point. We took that a step further: by using a journal entry we allowed the sync process to take place without any consideration for the receiver changes. We still locked the object for the entire replication time, but at least it could be automated (you could submit the request through a job schedule entry at any time of the day). This satisfied most customers, as they could schedule the request overnight and come back in the morning to confirm the process had worked as required, but a couple of customers requested additional functionality: they wanted to be able to submit the request from the target system based on the status screens, which meant we had to add the ability to submit a job from the target back to the source, and they wanted to control when the synchronization would occur. The solution, while simple, is pretty elegant. We created a process that uses a specific job queue on the source system and job schedule entries to hold and release that job queue at specific times of the day (see the sketch below). Now the customer can select all of the objects in error and they are submitted to the job queue on the source. Objects which can be processed while the job queue is released will sync and restore at the correct place in the journal apply process. If a request is running it will run to completion, so we had to ensure the time periods took this into account, but the chance of lock contention between the sync process and user applications which require locks on the objects is significantly reduced.
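As an illustration only, the hold and release windows could be driven entirely by job schedule entries. The sketch below is not the HA4i implementation: the job queue SYNCJOBQ and library HA4ILIB are hypothetical names, and the CL commands are simply run through the ILE C system() function, which executes CL on the IBM i.

#include <stdio.h>
#include <stdlib.h>

/* Minimal sketch: register two job schedule entries that release a
 * (hypothetical) sync job queue at 22:00 and hold it again at 05:00,
 * so queued sync requests only run overnight. */
int main(void)
{
    int rc = 0;

    rc |= system("ADDJOBSCDE JOB(RLSSYNC) "
                 "CMD(RLSJOBQ JOBQ(HA4ILIB/SYNCJOBQ)) "
                 "FRQ(*WEEKLY) SCDDAY(*ALL) SCDTIME(220000)");

    rc |= system("ADDJOBSCDE JOB(HLDSYNC) "
                 "CMD(HLDJOBQ JOBQ(HA4ILIB/SYNCJOBQ)) "
                 "FRQ(*WEEKLY) SCDDAY(*ALL) SCDTIME(050000)");

    if (rc != 0)
        printf("One or more ADDJOBSCDE commands failed - check the joblog\n");

    return rc;
}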

As usual this was not enough. One of our customers had a problem with the speed of the link between the source and target systems: sync'ing a 16GB file over a 2MB link would interfere with the application due to the time it took to pass the object between the systems, and they needed something which would reduce the lock time while still ensuring the sync-point could be correctly managed by the apply process. The solution was to move the sync-point management from the sending process to the apply process; we needed to make sure the apply would not process a receiver containing journal sync points until the required objects existed on the target system. This resulted in a new process we have called the SYNCMGR, which handles the whole sync process between the systems while ensuring the data is in the right place at the right time to allow timely recovery. We now only lock the object for the time it takes to make the save; after that the object is open and available for application updates, and HA4i manages the sync-point processing between the systems. New interfaces and commands are provided to automate the release and hold of the sync process in the same manner as the old job queue process, only this time it's a lot more effective.

HA4i continues to grow and we are adding new functionality all the time; a new PTF which we hope to release in the next few weeks will bring all of this and other new features to the product. The next big release is also on the cards, with a new apply process that replaces APYJRNCHG with our own. While we have been happy with the functionality of the APYJRNCHG command, there are times when IBM's reluctance to make changes can stifle progress. Adding our own apply process will allow the customer to decide on the best approach for their environment, either the APYJRNCHG process or ours. We think that will make us more flexible than our competitors.

The product is proving itself in many challenging environments, and we make changes quickly to resolve any perceived or real roadblocks to providing a true High Availability solution for our clients. Our pricing structure and low overhead allow us to compete very effectively with our competitors, and we are seeing a number of replacement opportunities cropping up. If you are fed up with paying too much or want to investigate what your options are, give us a call and we will be happy to discuss the product and what it can offer your company.

Chris…

Mar 17

Canadian Banking system is one to base your own on! Not according to the Royal Bank..

I had to laugh at this. After all the press and puffing and blowing of smoke about just how good the Canadian banking system is, I got this response from the Royal Bank after requesting, three days ago, that a wire transfer made in error be returned to the sender.

“Unfortunately it can take a minimum of 20 days to return the funds.”

So the reason our banking system is one to model your own on is because we refuse to send your money back :-) 20 days to electronically transfer funds between banks has got to be as bad as (and probably worse than) a Third World bank! The sender sent the wire and one day later it appeared in our account; somehow it takes us 20 days (or more!) to do the same in reverse. Being a computer literate person I find this mind boggling and absurd. Even the mail (and Canada Post is not a model to follow either) gets there quicker than that. I could have drawn the funds, created a new wire transfer and it would have been there a lot quicker, and probably cost the sender less overall. It's shameful! Royal Bank, you should be ashamed…

Chris…

Mar 17

Restore does not restore all attributes of the saved object.


If, like us, you are dealing with multiple systems and use the save and restore process to keep things in sync, you may have a shock when you look at some objects. The SAVRST process does not restore all object attributes if the object already exists; even allowing object differences with ALWOBJDIF(*ALL) does not affect every attribute! We have found a number of important attributes which we feel should be restored when you carry out a restore but are not. After all, the main reason you restore anything is to bring it back to its saved state?

Here are some important ones we have come across.

Audit settings.
When you save an object which has system auditing set and restore it, you would expect the audit setting to be preserved on the restored object; it's not! Obviously the audit setting cannot be set if the audit journal has not been set up, but when it has, the restore should certainly preserve the setting.

Journal information.
This is an important attribute for those running HA, but if the object exists and is not journalled, the restore will leave journalling off; it won't even restore the last journalled information.

Journal ID.
This is another important attribute. Even if you rely on a save and restore process to keep a warm copy, if the object already exists the journal ID will not be set by the restore process. This matters when you use the APYJRNCHG command, because the journal ID is the link the OS uses to apply changes to the object; if it is not the same as the one in the journal entry, the object is not updated. Even worse, the system does not let you know it did not find the object to update.

Object attributes.
This is particularly important for IFS objects, as these are the security settings for the object (drwxrwxrwx etc.). This is an obvious exposure, because you could change the attributes on the source system and expect those changes to be reflected on the target when you restore the object.

There is a way around this though: if you delete the object before you do the restore, the attributes of the saved object are correctly restored at the same time (a minimal sketch of the idea follows below). We have adjusted our replication processes to take this into account when we replicate an object through the save and restore process.
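Here is a minimal sketch of the delete-before-restore idea for an IFS stream file. It is not our replication code; the save file MYSAVF in QGPL and the path are hypothetical, and the RST command is run through the ILE C system() function.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: remove the existing copy so RST recreates the object with the
 * saved audit value, journal ID and mode bits instead of merging onto
 * the existing object.  MYSAVF and the path are illustrative only. */
int restore_fresh(const char *path)
{
    char cmd[512];

    if (unlink(path) != 0)                /* it may simply not exist   */
        perror("unlink");

    snprintf(cmd, sizeof(cmd),
             "RST DEV('/QSYS.LIB/QGPL.LIB/MYSAVF.FILE') OBJ(('%s'))",
             path);
    return system(cmd);                   /* system() runs CL commands */
}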

Chris…

Mar 16

Setting remote attributes from local IFS object


As part of the IFS auditing process we check the attributes of the IFS object on each system and compare them. The comparison was pretty easy, but we then have to be able to rectify any errors we find. Setting the attributes was necessary, and thankfully IBM provides an API which can be used to set the attributes on the target, chmod(), which takes an integer value describing the attributes to be set. Not wanting to check each individual bit of the remote object's attributes, we thought we could simply take the mode_t value from the source system and apply it to the target object. Unfortunately that does not work, because the mode_t value contains the directory bit, which cannot be set using the chmod() API.

The answer is to create a new mode_t value on the target, check the bit settings of the mode_t value received from the source, and then set the relevant bits in the new value before passing it to chmod().

Here is some sample code which shows the process. Path and Old_Mode are passed in.


#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Copy the permission bits from Old_Mode onto the object at Path,
 * leaving the file-type bits (such as the directory bit) out of the
 * mode we pass to chmod(). */
void set_target_mode(const char *Path, mode_t Old_Mode)
{
    mode_t newmode = 0;

    /* owner permissions */
    if (Old_Mode & S_IRUSR)
        newmode |= S_IRUSR;
    if (Old_Mode & S_IWUSR)
        newmode |= S_IWUSR;
    if (Old_Mode & S_IXUSR)
        newmode |= S_IXUSR;

    /* group permissions */
    if (Old_Mode & S_IRGRP)
        newmode |= S_IRGRP;
    if (Old_Mode & S_IWGRP)
        newmode |= S_IWGRP;
    if (Old_Mode & S_IXGRP)
        newmode |= S_IXGRP;

    /* other (public) permissions */
    if (Old_Mode & S_IROTH)
        newmode |= S_IROTH;
    if (Old_Mode & S_IWOTH)
        newmode |= S_IWOTH;
    if (Old_Mode & S_IXOTH)
        newmode |= S_IXOTH;

    if (chmod(Path, newmode) != 0)
        perror("chmod");
}
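As a possible shortcut, the same result could be had by masking the permission bits in one step, chmod(Path, Old_Mode & (S_IRWXU | S_IRWXG | S_IRWXO)), since those three masks cover exactly the nine rwx bits; the bit-by-bit version above just makes it easier to include or exclude individual bits later on.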

Hope that helps others looking to set the attributes of an IFS object using a source object.

Chris…

Mar 14

More functionality for HA4i


We have been busy yet again with improvements to the HA4i product. Customers have the uncanny ability to find ways to configure applications that we would never have thought of, which usually means we get a call asking if we can fix a problem or provide a workaround for some replication issue.

One of the most requested features is the ability to filter objects out of the object replication function. These are usually objects which are constantly locked or which get created and deleted with amazing regularity! (Why the developers don't use QTEMP for such objects is beyond us; after all, that's what it was designed for.) Our original mandate has always been that if it's in the library we will try to replicate it, which is fine where you have fairly static objects that only change occasionally or don't have locks applied to them constantly. The problem is that the error messages can build up significantly in environments where objects are created and locked constantly and the only time the lock is released is when the object is deleted. So after some deliberation we have admitted defeat and brought a level of object filtering to the product. It's not as complex as some of our competitors' filtering, but it should provide relief for many of our customers who are unable to fix locking issues in any other way. All you have to do is configure the object you want to omit and we do the rest.

Another regular request is to allow the synchronization of files at the member level through the AUTOSYNC process; it could be that only a single member has gone out of sync with the source, but previously we could only sync the entire file. For customers with hefty multi-member files that have millions of records in each member, this can be a daunting task. So yet again we have listened to our customers and provided a mechanism to synchronize each member individually. The default is still to synchronize the whole file, but the command now supports entering an individual member. A side effect of this is that the recovery option provided from the target side now automatically synchronizes every file object at the member level. Many of our customers who run 24×7 shops said this would be a major benefit, as the time to AUTOSYNC a particular object can be reduced significantly where multi-member files are used.

Finally, the IFS, one of the most difficult file structures to manage on the IBM i (I like to call it the wicked step child for lots of reasons), which we support with both user journalling and object level replication, now has an auditing process. This is the last audit process we had slated to release in the current version, and I have to admit I had been avoiding it for a long time. Yet again it was a customer request which pushed us into a corner and forced us to provide something which could check the IFS between the systems. They have over 90,000 objects in one directory, and while they are using the IBM apply journal process they felt they needed some checks and balances to ensure the IFS was really being updated effectively. You will have noticed (if you read the blog regularly) that we had developed a process using the Qp0lProcessSubtree() API to create CRC values for every object in a directory and all of its subdirectories. This was the base of the auditing process, and all we had to add was the ability to send the data between the systems for comparison and return any out-of-sync conditions back to the source. One of the really amazing results we found was just how quickly the process would read through an entire directory structure; our entire IFS /home structure took seconds to audit.

There are a number of other fixes and updates which have been developed and tested that will form the next PTF, which should be available once we have tested the functionality a little more.

Chris…

Mar 08

Ever wanted to know how many objects are in a directory tree and how much space they take up?


I was having a discussion with a business partner today about the new IFS auditing process when we started to look at what options to offer for each of the errors it finds. Originally I was going to simply save and restore the object to the target system, but what about a directory? The problem is we have no idea what sits below a sub-directory, so we would have no idea what size the actual save of that directory would be (we would have to save all the sub-directories because nothing below it could exist on the target system!) or how many objects it consisted of. One option would be to simply create just the top directory, because each of the objects reported in the audit that did not exist would have its own individual record in the error listing, with a command we could build from the data from the source system.

Eventually we looked at the APIs IBM provides for the IFS to see if there was anything which would allow us to find out the number of objects in a directory tree as well as the total size. There is nothing, but we then thought about the Qp0lProcessSubtree() API; we are already using it to do the audit and it seemed very capable of providing what we needed.
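To give a feel for what the tool does, here is a minimal sketch that walks a directory tree and tallies the object count, directory count and total bytes. It is not the RTVDIRSZ code and it does not use Qp0lProcessSubtree(); it uses the standard POSIX opendir()/readdir()/lstat() calls, which are also available on the IBM i, so treat it as an approximation of the idea.

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

/* Running totals, roughly what RTVDIRSZ reports. */
struct totals { unsigned long objs, dirs; unsigned long long bytes; };

static void walk(const char *path, struct totals *t)
{
    DIR *dir = opendir(path);
    struct dirent *de;
    char child[4096];
    struct stat st;

    if (dir == NULL)
        return;                               /* no authority, etc.    */

    while ((de = readdir(dir)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
        if (lstat(child, &st) != 0)
            continue;
        if (S_ISDIR(st.st_mode)) {
            t->dirs++;
            walk(child, t);                   /* recurse into sub-dir  */
        } else {
            t->objs++;
            t->bytes += (unsigned long long)st.st_size;
        }
    }
    closedir(dir);
}

int main(int argc, char *argv[])
{
    struct totals t = {0, 0, 0};

    if (argc < 2) {
        printf("Usage: dirsz <directory>\n");
        return 1;
    }
    walk(argv[1], &t);
    printf("Size = %lluMB Objects = %lu Directories = %lu\n",
           t.bytes / (1024ULL * 1024ULL), t.objs, t.dirs);
    return 0;
}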

The program we wrote is available for download on the download page if you're interested in giving it a spin. Here is some sample data we pulled back when we ran it against our IFS structure.

This is for the /QIBM directory.

Directory Entered = /QIBM
Successfully collected data
Size = 3.5GB Objects = 39997 Directories = 4601
Took 7 seconds to run

This is for the /home directory

Directory Entered = /home
Successfully collected data
Size = 187.4MB Objects = 984 Directories = 373
Took 0 seconds to run

So based on the above we can see it took 7 seconds to collect the data for the /QIBM directory and all of its sub-directories. The total space taken up by those objects is 3.5GB (this is actual data space; it does not take into account any header data or padding). The interesting thing for me was the number of sub-directories (*DIR), 4,601, plus 39,997 objects (*STMF). Another interesting point is the data for the /home directory: you can see it reported 373 (*DIR) objects and 984 (*STMF) objects, yet a quick look at the structure using the WRKLNK command suggested these numbers were vastly incorrect! We could not find that many objects, so we decided to do a save to see how many actually got saved. It saved 1,357 objects, which is exactly what our tool reported (373 + 984 = 1,357), so where are the hidden objects?

If you look in the /home directory you will find a sub-directory called QIBMHELP; if you look in that directory it appears to be empty. But when we run the test against that directory we find the following.

Directory Entered = /home/QIBMHELP
Successfully collected data
Size = 17.3MB Objects = 263 Directories = 332
Took 1 seconds to run

So where are these objects? Doing a save automatically saves them all, and if you start journalling against the directory you will see all of the objects listed! IBM must have set some hidden attribute on the objects, because you cannot see them when you use WRKLNK but they definitely exist and do get actioned by the various APIs. One thought is that the objects all sit in a directory called .eclipse, so maybe the fact that it starts with a '.' is keeping them hidden? If you look at the directory using OpsNavigator or a mapped drive they show up…

If you would like to give the tools a test run, just download the save file from the downloads page, restore the objects into your favorite library and give them a run. There is a command RTVDIRSZ and a program of the same name. We have saved the objects back to V5R4 and they were saved from CHLIB.

If you have any questions let us know.

Chris…

Mar 06

Audit IFS links using Qp0lProcessSubtree.


One of the jobs we had been putting off was adding the ability to audit the attributes of each object in the IFS. The main reason was the complexity we thought would be required to ensure we checked each and every object within the directory structure.

After some thought we decided to go with the Qp0lProcessSubtree() API, as it seemed to be just what we needed. As usual we started off by creating a sample program; this would simply read through the IFS directory we passed in and create a CRC value for the attributes of each object. The documentation provided for the API is very comprehensive but not very clear; the nice thing is that IBM provides a sample program at the bottom showing how to use it. After some fighting with the parameters we finally got the API working, and it provided a very simple and effective way to walk through the entire sub-directory structure and touch each and every object we needed.

The next challenge was to create a CRC for each of the objects so we could compare each object between the source and target systems. We did not want to compare the data of each object, as this would be quite a jump up from where we are and very CPU intensive to do for each and every object. The CRC was an easy fix; we simply used the zlib adler32 code we have been using for our other CRC generating processes. This also helped us reduce the traffic, because we only had to send the object ID and the resulting CRC from the source.
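For anyone curious what checksumming the attributes rather than the data looks like, here is a minimal sketch. It assumes the attributes of interest can be gathered with lstat(); the real audit uses the IBM i attribute APIs, but the adler32() call from zlib is the same idea.

#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <zlib.h>

/* Build a small attribute record for the object and run zlib's adler32
 * over it.  Only this checksum (plus the object ID) needs to cross the
 * wire; the target builds the same record locally and compares.  Both
 * sides must build the record identically (same fields, order, sizes). */
unsigned long attr_crc(const char *path)
{
    struct stat st;
    unsigned char rec[3 * sizeof(unsigned long)];
    unsigned long mode, uid, size;

    if (lstat(path, &st) != 0)
        return 0;                              /* treat as "no object" */

    mode = (unsigned long)(st.st_mode & 0777); /* permission bits only */
    uid  = (unsigned long)st.st_uid;           /* owner                */
    size = (unsigned long)st.st_size;          /* data size            */

    memcpy(rec, &mode, sizeof(mode));
    memcpy(rec + sizeof(mode), &uid, sizeof(uid));
    memcpy(rec + 2 * sizeof(mode), &size, sizeof(size));

    return adler32(adler32(0L, Z_NULL, 0), rec, sizeof(rec));
}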

Once we had this all running we added the ability to store the data in a physical file so we could see what data was generated for each object and how the CRC values looked (we could not verify the CRCs at this stage, other than confirming the same CRC was generated for the same object every time it ran). This is where we needed to understand the call flow between the API and the user function we had written to handle each object the API passes. The documentation describes the control block which is passed to the exit program simply as a pointer to data which is not processed by the API. It is in fact a very important piece of the puzzle, because it allowed us to significantly reduce the file manager overhead and to pass a socket to the user function instead of it having to be opened and closed all the time. Initially we just passed in some sample data, but it became important for storing the data, as we passed in a file pointer that was used to write the records.
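To make the control block idea concrete, here is a rough sketch of the sort of structure and user exit we are describing. It is illustrative only: the field names are invented, the qlg.h header name and the exit prototype are written from memory of the Qp0lProcessSubtree() documentation, so check Qp0lstdi.h for the exact parameter list before relying on it.

#include <stdio.h>
#include <qlg.h>        /* Qlg_Path_Name_T (assumed header name)       */

/* Everything the exit needs travels in the control block: the API hands
 * the same pointer back to us on every call, untouched. */
typedef struct {
    FILE *log;          /* physical/stream file used to write records  */
    int   sock;         /* socket already connected to the target      */
    long  objects;      /* running count of objects visited            */
} CtlBlk_t;

/* User exit called by Qp0lProcessSubtree() for each object it finds.
 * Prototype written from memory of the API documentation - verify it. */
int audit_exit(unsigned int Number_Of_Path_Names,
               Qlg_Path_Name_T *Path_Name_List_ptr,
               void *Function_CtlBlk_ptr)
{
    CtlBlk_t *ctl = (CtlBlk_t *)Function_CtlBlk_ptr;

    /* The NLS path name(s) follow the Qlg_Path_Name_T header; extract
     * the path, build the attribute CRC (see attr_crc() above) and
     * write or send one record using ctl->log and ctl->sock.          */
    ctl->objects++;

    return 0;           /* assumed: non-zero would stop the walk       */
}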

Once all of the testing on a single system was done, we then had to write the part which takes the results from the source system and checks them against the objects on the target. After some messing around with the structure passed from the API and working out how to keep the two systems in a continuous conversation, we finally got the process working. What really surprised us was the speed it would run through the directories. We only have a few directories and a few objects on our systems, but the process took mere seconds to complete against the entire /home directory, and it correctly identified any objects which did not exist on the target plus each and every object that had a difference in its attributes! The file was correctly populated with the errors, so all we have to do now is build a couple of interfaces over the data to allow the user to view the issues and take actions to resolve them.

The IFS is not our favorite technology on the IBM i; we feel it is overly complex and the tools provided to manage it have a number of very significant shortcomings, but the Qp0lProcessSubtree() API is one thing we have to agree is a very nice piece of technology.

We hope to include the technology we built in the next PTF for HA4i, so that users can ensure the IFS is correctly synchronized between the systems! We may also create a tool from the test program which will allow anyone to create CRCs from a directory listing and store them in a file; a relatively simple process could then be used to check the IFS between two systems.

If you have any questions about the technology or the HA4i product please contact us and we will be happy to discuss.

Chris…

Mar 01

HA4i gains new functionality again!

We have just released the latest PTF for the HA4i product and it is packed with new functionality. Most of this has come from the experiences we have been through with customers as they have implemented the product in varying application environments and found simple but effective solutions to particular problems. It always amazes us just how diverse the application environments can be and what types of problems they will throw at any replication software. As the install base grows we find that we have to adapt the software to meet challenges that we had never considered possible.

Here are the highlights of this PTF.

Monitoring
Customers do not want to babysit the product all day; they like to be told when they have an issue to address. We have provided an alerting function on the target system for some time with our Email Manager, which emails selected accounts with relevant messages. This works fine, but what if Email Manager is down? This PTF brings a new feature, run via the IBM job scheduler, which wakes up every 1 to 1440 minutes and checks that all jobs which should be running actually are; if not, it sends a message to the QSYSOPR message queue detailing the problem. This can then be picked up by a monitoring solution and distributed as required. Eventually this will be picked up by the Email Manager as well.
Additional functionality added in PTF01, such as spool file replication, was not included in the PHP interface, so this PTF adds the ability to monitor and control the spool file replication processes and errors from the PHP interface. It also adds new information, such as how many objects are waiting to be processed for each of the replication processes.
New functions have been added to help with license key management, so the user can now see the license key information both in the product menus and in the PHP interface.
The APYJRNCHG command can generate large numbers of so-called errors which have no meaning (0x00, unidentified error) when certain problems arise. As part of the replication process we now remove any 0x00 errors from the output files and delete any empty members, which significantly reduces the clean-up work for the users. We have also added a new error management interface which shows each object in error only once, instead of every time the object was reported in error by the APYJRNCHG process.

Autonomics
The product will try its best to ensure any change to the data and objects is successfully replicated to the target system, but as we all know, sometimes that cannot be done due to things such as object locks or the lifetime of the object itself (it is deleted as soon as it is created). Under certain circumstances this can result in a large number of errors being logged by the product; these are stored in database files that are viewed by the users, and cleaning them up can be a time-consuming task. In this PTF we have released a couple of commands that significantly help with this task by going through the data and checking whether the object exists on the source system. If it does, the command replicates the object to the target system and removes the existing errors; if it does not, it removes the error from the file and creates a spool file showing all the entries it removed. We have also made a number of changes to the programs that will reduce the number of errors logged as a consequence of previous replication errors. Eventually we will implement a new process which attempts to fix certain errors as it goes, which should significantly reduce the error reporting seen today.

Processing
The filtering capabilities of the object replication have been improved to ensure multiple requests for the same object are not acted on individually. This reduces the number of times a constantly changing object is replicated to the target system, saving not only system resources but also network bandwidth. We have also improved the APYJRNCHG processing to ensure the object locking carried out by the APYJRNCHG command is reduced to a minimum.
Auditing files for large companies can be a long-running task. Previously we had implemented a process which skipped a fixed number of records between the records to be audited; this has now been replaced with an option to audit a percentage of the records, which improves both scalability and the randomness of the records being audited (a simple sketch of the idea follows below).
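As a rough illustration of percentage-based sampling (not the HA4i code), each record can simply be given an independent random chance of being picked, which avoids the fixed stride of the old skip-count approach:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Return 1 if this record should be audited, given a target percentage.
 * Each record gets an independent random chance, so the sample has no
 * fixed stride through the file. */
int audit_this_record(int percent)
{
    return (rand() % 100) < percent;
}

int main(void)
{
    long audited = 0, record;

    srand((unsigned)time(NULL));
    for (record = 0; record < 1000000; record++)
        if (audit_this_record(10))    /* audit roughly 10% of records */
            audited++;

    printf("Audited %ld of 1000000 records\n", audited);
    return 0;
}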
The restriction on the number of libraries which can be configured for object replication has been removed; the theoretical limit is now 559,000 libraries (I don't think anyone has that many, do they?). As part of this change we also took the opportunity to add a level of artificial intelligence to the object processing which should improve throughput in highly volatile environments.

The above reflects some of the highlights but is not an exhaustive list of the changes we have made. Being a small company means we listen to our customers and usually react to their concerns or requests very quickly; many of the above improvements were driven by our customers and often ran in the customer's environment before we added them to the PTF list. We pride ourselves on being responsive to customers and providing a service level not often seen in larger companies, where service seems to be an annoyance even when it is paid for…

The PTF is available for download from the web site and installs just as any IBM PTF does, with the LODPTF and APYPTF commands. The PTF cover letter describes all of the fixes and new features in some detail.

If you are considering HA or would like to look at the value proposition we can provide in replacing an existing product, let us know; we are here to serve…

Chris…