Feb 25

Update to Remote Journal Problems

We have removed the remote journals and now have a stable system again. The logs are back to normal and we are no longer seeing message queues wrapping every couple of hours.

The problem was due to the way we had set up the remote journals. When you set the journal receivers to System Managed and DELETE *YES, the system runs a process that repeatedly tries to delete the receivers from the LOCAL journal (every XXX minutes depending on what you set up; 10 minutes is the default). Because we had a Remote Journal attached to the LOCAL journal, the receivers have to be replicated to the remote system before they can be deleted. Once a receiver has been identified as ready for deletion there is no way to set it back to do-not-delete, so when we took down the Remote System for maintenance (it’s been down for a week due to a lack of time to install some new features) the replication could not occur. That meant the delete was not allowed for those receivers, and every time we created a new journal receiver the newly detached receiver also became eligible for deletion and joined the retry cycle.
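To give a feel for how quickly the failed deletes can add up, here is a rough back-of-the-envelope sketch in Python. It is a toy model only, not a description of how the operating system schedules the work: it assumes each detached receiver gets one failed delete attempt per retry interval while the remote system is down, and the function name and the figures in the example are made up for illustration.

```python
def failed_delete_attempts(detached_receivers, retry_minutes, outage_minutes):
    """Estimate failed delete attempts while the remote system is unreachable.

    Assumption (not confirmed by IBM): every detached receiver is retried
    once per retry interval, and every retry fails because the receiver has
    not yet been replicated to the remote journal.
    """
    if retry_minutes <= 0:
        raise ValueError("retry_minutes must be positive")
    retries_per_receiver = outage_minutes // retry_minutes
    return detached_receivers * retries_per_receiver


if __name__ == "__main__":
    # Hypothetical figures: 20 detached receivers, the default 10 minute
    # retry interval, and a one-week outage of the target system.
    week_in_minutes = 7 * 24 * 60
    print(failed_delete_attempts(20, 10, week_in_minutes))
```

If each failed attempt also writes messages to QSYSOPR and the history log, it is easy to see how the queues could wrap as often as they did.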

We did not realize this was occurring until our system slowed to a crawl and we started to look at what was happening! We had over 1,500 QHST log files, and any message queue attached to the journal, plus QSYSOPR, was being wrapped every hour or so. DASD utilization was not high, but the constant process of sending messages, wrapping message queues, filling and changing QHST logs and trying to connect to the remote system must have clogged the system up.

IBM’s suggested solution of removing the Remote Journal created an additional headache because it then allowed all of the journal receivers to be deleted. So we now have to copy the databases to the target once we start the system back up, and then set up all of the remote journalling again.

If you are going to take down your target system for any period of time that could put you in the situation we had, ENSURE you change the Journal definition to DELETE *NO and do a journal receiver change BEFORE you bring the system down. This will ensure you don’t get into the loop we did. If you forget to do it BEFORE you bring the target down, do it as soon as you can afterwards to minimize the pain. Just inactivating the RJ link will not help… If your system goes down due to a failure, make sure you change the DELETE option ASAP.
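A minimal sketch of that change, assuming it is made with the CHGJRN command using DLTRCV(*NO) to stop the automatic deletes and JRNRCV(*GEN) to attach a new receiver. The Python below only builds the CL command string and does not run anything on the system; the function name and the library and journal names (MYLIB, MYJRN) are placeholders, so check the parameters against your own setup first.

```python
def chgjrn_before_outage(library, journal):
    """Build a CHGJRN command string that turns off automatic receiver
    deletion and attaches a newly generated receiver.

    The library and journal names are supplied by the caller; nothing here
    is executed against the system.
    """
    return f"CHGJRN JRN({library}/{journal}) JRNRCV(*GEN) DLTRCV(*NO)"


if __name__ == "__main__":
    # MYLIB and MYJRN are hypothetical names used only for illustration.
    print(chgjrn_before_outage("MYLIB", "MYJRN"))
```

Generating the command as text makes it easy to review before it is run, or to produce one line per journal if several journals replicate to the same target.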

I have asked IBM to look at providing a fix that allows the delete cycle to be suspended or ended should this situation occur again. If they decide it’s worth fixing I will post the PTF details.

Chris…

Feb 25

New Sugar CRM & Gentoo

The other day we needed to install the Community Edition of Sugar CRM for a client who is setting up a LAMPS server, and we were very impressed with the install process for the latest version.

In the past we have had to change sections of the PHP config files to allow it to be installed on the Gentoo system; this time everything just worked! So here is a big well done to the Sugar CRM developers, we really liked the new install process. Maybe one day a Gentoo package will be available to download using ‘emerge’ instead of having to work outside of the Portage system?

Chris….

Feb 24

Removed RJ

OK, I had to give up. The queues and logs filled up so fast I could not delete things quickly enough!

Problem is, the receivers have been deleted by the system, so I have to rebuild the target DB using save/restore. At least I am not wasting hours trying to stop the disks from filling!

I hope IBM can fix this. My DB is relatively small, so a re-sync isn’t a major issue. If you have a replication tool that is keeping terabytes of data in sync and this happens, it may not be so simple to resolve!

Chris…

Feb 24

Remote Journal issue

I was developing and compiling code yesterday when I noticed a definite slowdown in the system’s response times. I thought it was the usual Java programs which kick off occasionally to call home to IBM and report on any issues the system is having, but not this time.

I looked into the QSYSOPR message queue and found it had tried to extend 1652 times and failed! So I cleared the message queue, and it cleared OK, only to start filling rapidly with messages about a journal receiver not being available. Looking into the problem, there is a job, QJORETRY, trying to replicate the RJ to a remote system which is down at the moment. The RJ is showing as inactive, which is what it should be, but the job doesn’t seem to mind that. The reason this started is that the development and test process uses programs which generate thousands of records in files attached to the journal which is causing the problem. Even though I have stopped generating more data, it is still cycling through the receiver chain trying to replicate it…

I sent a SEV 1 message to IBM at 8 pm last night and cleared up the message queues before retiring for the night at 11 pm. I got a call at 8:30 am this morning from support and we have exchanged details on the problem. Seems IBM doesn’t respond as quickly to problem submissions as they did, or is this normal now?

After more investigation this morning I found that the QHST log is filling up very fast as well. I had over 700 generated overnight alone (in the time it took me to write this note it has generated another 9!). The QSYSOPR message queue, the DB message queues and the journal message queue all overflowed and could not be extended. So I spent some time deleting all of the history logs and clearing all of the message queues in the hope that this would improve things. All I am doing now is refreshing the lists and clearing out the logs…

I could remove the Remote Journals from the local journals in the hope that it would stop the flood, but we are not sure it will. I am hoping IBM can fix this so that when an RJ is inactive the system knows not to try to send the journal receivers to the other system.

I will post the results when I get them from IBM. In the meantime I have thousands of messages and logs to clean up again.

Chris…

Feb 18

Still around but very busy on various projects

We have not posted in some time due to overwhelming activity on the various projects we have on the go. We are picking up many new ideas and skills as we progress through the projects, so when we get the time we should have lots of information to share.

Here is a quick update on the projects we are running.

RAP V4 is now in final testing, with some additional work being done on the switch code to take into consideration the new object replication features we have built in. We had hoped to have an announcement before now, but we feel it’s better to make sure we do thorough testing before we move the product into the customer base. The current version is seeing some activity on the sales front and we have a number of trials underway.

We have nearly finished the Website Project, which has some neat features built in. This has taken up more time than we anticipated due to incompatibilities between browsers. We spent days building and testing using the Mozilla Browser only to find that IE7 just didn’t work with the code! After a lot of digging around we found that neither browser is totally compliant with the specs, but Mozilla is more forgiving of silly errors than IE7. We had missed a > in the code, which Mozilla ignored and IE7 didn’t, yet IE7 didn’t report any errors either! It simply ignored large parts of the code.

We have started a new project with a local organization to implement a LAN with a central server running LAMPS to manage all of their internal activities and external customer contacts. The project is in the early stages, with the LAN wiring being installed and two Internet Gateways already in place. Eventually they will have a solution which will allow them to better manage their business and provide a solid IT infrastructure.

Hope to get back to posting soon…
Chris..