PowerHA for AIX is the new name for HACMP (High Availability Cluster Multiprocessing).

History: IBM's HACMP has existed for almost 15 years. It did not start out as an IBM product; IBM bought it from CLAM Associates, which was later renamed Availant.

HACMP basics: an HACMP cluster can be managed through 1) SMIT or 2) WebSMIT. The configuration tasks are to configure the cluster topology first, and then the HACMP resources.
Introduction to PowerHA
HACMP supports several cluster configurations, including multiple applications running on the same nodes with shared or concurrent access to the data.
A high availability solution based on HACMP provides automated failure detection, diagnosis, application recovery, and node reintegration. With an appropriate application, HACMP can also provide concurrent access to the data for parallel processing applications, thus offering excellent horizontal scalability.
What needs to be protected? Ultimately, the goal of any IT solution in a critical environment is to provide continuous service and data protection. High availability is just one building block in achieving the continuous operation goal. High availability rests on the availability of the hardware, the software (the OS and its components), the application, and the network components.
Cluster Components
Here are the recommended practices for important cluster components.
While it is possible to have all nodes in the cluster running applications (a configuration referred to as "mutual takeover"), the most reliable and available clusters have at least one standby node: one node that is normally not running any applications, but is available to take them over in the event of a failure on an active node.
Additionally, it is important to pay attention to environmental considerations. Nodes should not have a common power supply – which may happen if they are placed in a single rack.
Similarly, building a cluster of nodes that are actually logical partitions (LPARs) within a single footprint is useful as a test cluster, but should not be considered for availability of production applications. Each node should have enough I/O slots to hold redundant network and disk adapters: that is, twice as many slots as would be required for single-node operation. This naturally suggests that processors with small numbers of slots should be avoided. Use of nodes without redundant adapters should not be considered best practice.
Blades are an outstanding example of this. And, just as every cluster resource should have a backup, the root volume group in each node should be mirrored, or be on a RAID device. Note that the takeover node should be sized to accommodate all possible workloads; these resources must actually be available, or acquirable through Capacity Upgrade on Demand.
The worst-case situation (for example, all applications running on a single node) must be understood and planned for.
HACMP networks not only provide client access to the applications but are used to detect and diagnose node, network and adapter failures. By gathering heartbeat information on multiple nodes, HACMP can determine what type of failure has occurred and initiate the appropriate recovery action. Being able to distinguish between certain failures, for example the failure of a network and the failure of a node, requires a second network!
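The heartbeat principle described above can be sketched as a toy model (illustration only, not HACMP code; all names and values here are invented): a peer is declared down once a fixed number of consecutive heartbeat intervals pass without hearing from it.

```shell
# Toy model of heartbeat-based failure detection (not HACMP code;
# names and values are invented for the sketch). A peer is declared
# down once MISSED consecutive heartbeat intervals elapse unheard.
INTERVAL=1      # seconds between heartbeats
MISSED=3        # missed intervals before declaring failure
now=0           # simulated clock
last_seen=0     # time the last heartbeat arrived
state=up

receive_heartbeat() {
    last_seen=$now
}

tick() {
    # Advance the simulated clock by one interval and re-evaluate.
    now=$((now + INTERVAL))
    if [ $((now - last_seen)) -ge $((INTERVAL * MISSED)) ]; then
        state=down
    fi
}

tick; tick            # two intervals without a heartbeat: still "up"
echo "$state"         # prints: up
tick                  # third missed interval crosses the threshold
echo "$state"         # prints: down
```

Real HACMP heartbeating runs on multiple networks at once precisely so that "no heartbeat on one path" can be distinguished from "node actually down".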
Therefore, in addition there should be at least one, and ideally two, non-IP networks. Failure to implement a non-IP network can potentially lead to a partitioned cluster, sometimes referred to as "split brain" syndrome. This situation can occur if the IP network(s) between nodes become severed or, in some cases, congested. Since each node is, in fact, still very much alive, HACMP would conclude the other nodes are down and initiate a takeover.
After takeover has occurred, the application(s) could potentially be running simultaneously on both nodes. If the shared disks are also online to both nodes, then the result could be data divergence and massive data corruption.
This is a situation which must be avoided at all costs. The most convenient way of configuring non-IP networks is to use disk heartbeating, as it removes the distance limitations of RS232 serial networks. Disk heartbeat networks only require a small disk or LUN.
Be careful not to put application data on these disks. Although it is possible to do so, you don't want any conflict with the disk heartbeat mechanism! While it is possible to build a cluster with fewer than two adapters per network per node, the reaction to adapter failures is more severe. AIX provides support for EtherChannel, a facility that can be used to aggregate adapters to increase bandwidth and provide network resilience.
When done properly, this provides the highest level of availability against adapter failure. Refer to the IBM Techdocs website for details. Many System p servers contain built-in Ethernet adapters. If the nodes are physically close together, it is possible to use the built-in Ethernet adapters on two nodes and a "cross-over" Ethernet cable (sometimes referred to as a "data transfer" cable) to build an inexpensive Ethernet network between two nodes for heartbeating. Note that this is not a substitute for a non-IP network.
Some adapters provide multiple ports. One port on such an adapter should not be used to back up another port on the same adapter, since the adapter card itself is a common point of failure. The same thing is true of the built-in Ethernet adapters in most System p servers and currently available blades. When the built-in Ethernet adapter is used, best practice is to provide an additional adapter in the node, with the two backing up each other.
Be aware of the network failure detection settings for the cluster and consider tuning these values. There are four settings per network type which can be used: slow, normal, fast, and custom. With the default setting of normal for a standard Ethernet network, the network failure detection time is approximately 20 seconds.
With today's switched network technology this is a long time. Be careful, however, when using custom settings, as setting these values too low can cause false takeovers to occur.
These settings can be viewed using a variety of techniques.

Applications
The most important part of making an application run well in an HACMP cluster is understanding the application's requirements. This is particularly important when designing the resource group policy behavior and dependencies. For high availability to be achieved, the application must have the ability to stop and start cleanly and must not explicitly prompt for interactive input.
Some applications tend to bond to a particular OS characteristic such as a uname, serial number, or IP address. In most situations, these problems can be overcome.

Application Data Location
Where should application binaries and configuration data reside? There are many sides to this discussion. Generally, keep all the application binaries and data on the shared disk where possible, as it is easy to forget to update them on all cluster nodes when they change.
Forgetting to do so can prevent the application from starting or working correctly when it is run on a backup node. However, there is no fixed correct answer. Many application vendors have suggestions on how to set up their applications in a cluster, but these are only recommendations. Just when it seems clear-cut how to implement an application, someone thinks of a new set of circumstances.
Here are some rules of thumb: if the application is packaged in LPP format, it is usually installed on the local file systems in rootvg. Running the installation in preview mode first will show the install paths; symbolic links can then be created prior to install which point to the shared storage area. If the application is to be used on multiple nodes with different data or configurations, then the application binaries and configuration data would probably reside on local disks, with the data sets on shared disk and application scripts altering the configuration files during fallover.
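The symbolic-link trick can be sketched as follows. Every path here is an invented stand-in: /tmp/demo_shared represents the shared-disk storage area and /tmp/demo_local represents the default local install path in rootvg.

```shell
# Illustration of the symbolic-link trick (all paths are invented for
# the demo; /tmp/demo_shared stands in for shared-disk storage and
# /tmp/demo_local for the default local install path in rootvg).
SHARED=/tmp/demo_shared/app
LOCAL=/tmp/demo_local/app

mkdir -p "$SHARED" "$(dirname "$LOCAL")"

# Create the link before installing, so the installer's writes to the
# local path actually land on the shared storage.
ln -s "$SHARED" "$LOCAL"

# Simulate the installer writing a file under the local path.
touch "$LOCAL/config.ini"

ls "$SHARED"          # prints: config.ini
```

Because the link is created before the install, nothing in the application itself needs to know that its files really live on the shared disk.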
This is particularly useful for applications which are installed locally. Application start and stop scripts should be robust: intelligent programming should correct any irregular conditions that may occur. The cluster manager spawns these scripts off as separate background jobs and carries on processing.
Some things a start script should do are: first, check that the application is not currently running! This is especially crucial for v5. Using the default startup options, HACMP will rerun the application start script, which may cause problems if the application is actually running. A simple and effective solution is to check the state of the application on startup.
If the application is found to be running, simply end the start script with exit 0. Are all the disks, file systems, and IP labels available? Check the state of the data: does it require recovery? Always assume the data is in an unknown state, since the conditions that caused the takeover cannot be assumed.
Are there prerequisite services that must be running? Is it feasible to start all prerequisite services from within the start script?
Is there an inter-resource group dependency or resource group sequencing that can guarantee the previous resource group has started correctly? Finally, when the environment looks right, start the application. If the environment is not correct and error recovery procedures cannot fix the problem, ensure there are adequate alerts (email, SMS, SNMP traps, etc.) sent out via the network to the appropriate support administrators.
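Putting the checklist above together, a start script skeleton might look like the following. This is a hedged sketch, not HACMP-supplied code: the application name app_server and the mount point /shared/app are invented placeholders, and the actual launch command is left as a comment.

```shell
# Hypothetical start-script skeleton following the checklist above.
# "app_server" and /shared/app are invented placeholders.

is_running() {
    # Return 0 if the application process already exists.
    ps -eo comm= | grep -q '^app_server$'
}

start_app() {
    if is_running; then
        # Already active (e.g. HACMP re-ran the start script):
        # exit 0 so cluster event processing continues cleanly.
        echo "already running"
        return 0
    fi

    # Verify the shared filesystem is available before starting.
    if [ ! -d /shared/app ]; then
        echo "shared storage unavailable" >&2
        return 1        # a real script would also alert administrators
    fi

    # Data state should be checked/recovered here before launch,
    # since the state at takeover time cannot be assumed.
    echo "starting app_server"
    # /shared/app/bin/app_server &
    return 0
}
```

The key design point is the early "already running" exit with status 0: it makes the script safe to re-run, which is exactly the situation the default startup options can create.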
Stop scripts are different from start scripts in that most applications have a documented start-up routine but not necessarily a stop routine. The assumption is: once the application is started, why stop it? Relying on the failure of a node to stop an application will be effective, but to use some of the more advanced features of HACMP, the application must be stopped cleanly.
Some of the issues to avoid are: Be sure to terminate any child or spawned processes that may be using the disk resources. Consider implementing child resource groups. Verify that the application is stopped to the point that the file system is free to be unmounted.
The fuser command may be used to verify that the file system is free. Clearly the goal is to return the machine to the state it was in before the application start script was run. Remember to exit the stop script with a zero return code, as failing to do so will halt cluster processing.
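A stop script along the lines described might be sketched as below. Again app_server and /shared/app are invented placeholders, and the grace period would need tuning for a real application.

```shell
# Hypothetical stop-script skeleton; "app_server" and /shared/app are
# invented placeholders and the grace period would need tuning.

stop_app() {
    # Ask the application (and any children it spawned) to exit.
    pkill -TERM app_server 2>/dev/null || true
    sleep 1                                  # grace period
    pkill -KILL app_server 2>/dev/null || true

    # Confirm nothing still holds the shared filesystem open, so that
    # it is free to be unmounted during fallover (cf. fuser above).
    if command -v fuser >/dev/null 2>&1 &&
       fuser -c /shared/app >/dev/null 2>&1; then
        echo "warning: /shared/app still busy" >&2
    fi

    # Always exit 0: a nonzero return code halts cluster processing.
    return 0
}

stop_app
echo "rc=$?"          # prints: rc=0
```

Note that the script reports a busy filesystem but still returns 0, following the rule above that a nonzero return code from the stop script halts cluster processing.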