  1. #1
    Join Date
    May 2004
    Location
    Dominican Republic
    Posts
    721

    Unanswered: Problems with Oracle RAC 10gR2 on Solaris 9

Hi all. Recently I have run into problems with RAC. I have posted this on OTN (see the thread here), but I thought I would post it here as well, to see if any of you could shed some light. For those who don't have an OTN account yet, I'll repeat it here.

Hi all. Recently we've set up a Sun Cluster consisting of two V490 boxes that are scheduled to go into production soon. We have installed Oracle RAC 10gR2 (10.2.0.2.0) on this cluster and everything went fine; the application and everything else are working flawlessly. However, we've now detected some issues with it. We recently had a communication problem that caused the network segment that the VIP and pip (the RAC public addresses) sit on to fail, and both instances went down. I never expected this behaviour. When communication came back, I had to manually start up both instances of the cluster. This has happened twice already, and I am now worried, since it means that if for some reason the VIP and pip network fails again, I will have to bring both instances (and possibly N instances, since we're thinking of adding more in the future) back up after communication is re-established. I have talked with the OS guys and they say everything is OK on their side. I have checked Oracle and the alert logs; here's an excerpt of what the alert log says on one of the nodes:

    Code:
    Reconfiguration started (old inc 16, new inc 18)
    List of nodes:
     0
     Global Resource Directory frozen
     * dead instance detected - domain 0 invalid = TRUE 
     Communication channels reestablished
     Master broadcasted resource hash value bitmaps
     Non-local Process blocks cleaned out
    Fri Jul  7 20:30:48 2006
     LMS 1: 0 GCS shadows cancelled, 0 closed
    Fri Jul  7 20:30:48 2006
     LMS 0: 0 GCS shadows cancelled, 0 closed
     Set master node info 
     Submitted all remote-enqueue requests
     Dwn-cvts replayed, VALBLKs dubious
     All grantable enqueues granted
     Post SMON to start 1st pass IR
    Fri Jul  7 20:30:48 2006
     LMS 0: 18198 GCS shadows traversed, 0 replayed
    Fri Jul  7 20:30:48 2006
     LMS 1: 18641 GCS shadows traversed, 0 replayed
    Fri Jul  7 20:30:48 2006
     Submitted all GCS remote-cache requests
     Post SMON to start 1st pass IR
     Fix write in gcs resources
    Fri Jul  7 20:30:48 2006
    Instance recovery: looking for dead threads
    Fri Jul  7 20:30:48 2006
    Beginning instance recovery of 1 thReconfiguration complete
    Fri Jul  7 20:30:49 2006
     parallel recovery started with 3 processes
    Fri Jul  7 20:30:49 2006
    Started redo scan
    Fri Jul  7 20:30:50 2006
    Completed redo scan
     152 redo blocks read, 46 data blocks need recovery
    Fri Jul  7 20:30:50 2006
    Started redo application at
     Thread 2: logseq 89, block 68271
    Fri Jul  7 20:30:50 2006
    Recovery of Online Redo Log: Thread 2 Group 3 Seq 89 Reading mem 0
      Mem# 0 errs 0: /dev/vx/rdsk/oraclerac/atm_redo2a
    Fri Jul  7 20:30:50 2006
    Completed redo application
    Fri Jul  7 20:30:50 2006
    Completed instance recovery at
     Thread 2: logseq 89, block 68423, scn 6728493
     44 data blocks read, 46 data blocks written, 152 redo blocks read
    Switch log for thread 2 to sequence 90
    Fri Jul  7 20:31:24 2006
    Shutting down instance (abort)
    License high water mark = 29
    Instance terminated by USER, pid = 12953
It looks as if Oracle commanded both instances to shut down. This doesn't look quite normal. If any of you have any insight into why this is happening and how to correct it, it would be extremely helpful to me.
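
In the meantime, here is roughly what I do by hand after each outage to bring things back (the database and instance names below are just placeholders, adjust them to your own setup):

Code:
# run as the oracle/crs software owner on one of the nodes

# see which CRS resources (VIPs, listeners, instances) are OFFLINE
crs_stat -t

# check what the clusterware thinks about the database instances
srvctl status database -d atm

# restart the instances that stayed down after the network came back
srvctl start instance -d atm -i atm1
srvctl start instance -d atm -i atm2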

    Thank you all!

  2. #2
    Join Date
    Aug 2004
    Location
    France
    Posts
    754
I experienced the same issue with 10gR1 on Red Hat Linux and a 2-node RAC. I think this is simply the expected behaviour: as the two instances cannot synchronize any more, they are shut down by CRS.

Obviously one of the two should be shut down, but I don't see why both should be. I don't know the behaviour of RAC with more than 2 nodes, but AFAIK you then have to use a switch (we have a direct interconnect for now, no switch), so I hope that if only one network cable fails between one machine and the switch, the whole cluster will not go down.

    Clearly and AFAIK, with 2 nodes, the cluster interconnect is a single point of failure.

I can only hope we both missed something.

    Regards,

    rbaraer
    ORA-000TK : No bind variable detected... Shared Pool Alert code 5 - Nuclear query ready .

  3. #3
    Join Date
    May 2004
    Location
    Dominican Republic
    Posts
    721
I pray to God we did miss something. It just doesn't fit with the whole idea of "high availability", in my opinion. Suppose one node's interconnect fails (hardware failure); would that cause the other nodes to shut down? No, it should not. They cannot synchronize, granted... but is that enough reason to shut down all the instances across all nodes? I believe I will have to deal with support on this.

  4. #4
    Join Date
    Jul 2006
    Posts
    3
I too have two V490s with Solaris 9. What type of storage hardware did you use in your cluster? The documentation on exactly what type of hardware to use is vague at best.

  5. #5
    Join Date
    May 2004
    Location
    Dominican Republic
    Posts
    721
We're using a SAN, specifically EMC Symmetrix. At a minimum you need shared storage between the nodes if you're going RAC. I have already filed a TAR with support on this issue and they're working on it. I will post a follow-up once we're finished.

  6. #6
    Join Date
    Jul 2006
    Posts
    13
I would like to know what Oracle said about this problem...

    As I am also supporting a similar database....

    Thanks

  7. #7
    Join Date
    Jul 2006
    Posts
    3
Basically, they said not to trust the Cluster Verify utility. Even though it told me it was failing, I set up my disks and the install went through. Make sure not to write over the first 1 MB of the disk with your partitions, and size them exactly as Oracle specifies: 120 MB and 20 MB. Then you can size the other partitions as you want. I took about 4 pages of notes while on the OWC with Oracle; once I was finally able to get someone to help, it went very smoothly. One other big thing: make sure your node names are lowercase. I'm used to lowercase, but the primary SA was not. He is now.
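
To give an idea of what that ends up looking like on disk, here is a sketch of the slice layout I aimed for (the device name, slice numbers and exact sector boundaries are made up and depend on your own disk geometry; slices normally start on a cylinder boundary, so your numbers won't be this round):

Code:
# verify the layout on each node against the shared raw device
prtvtoc /dev/rdsk/c2t0d0s2

* /dev/rdsk/c2t0d0s2 partition map
* ...
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector
       4      0    00       2048    245760    247807   <- ~120 MB, starts past the first 1 MB (2048 x 512-byte sectors)
       5      0    00     247808     40960    288767   <- ~20 MB

The point is simply that the First Sector of the slices you hand to Oracle is never 0, so the Solaris disk label at the start of the device does not get clobbered.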

  8. #8
    Join Date
    Jul 2006
    Posts
    3
Going to the Oracle lab next week to set up our db and app on Oracle-supplied RAC. I can supply more information then.

  9. #9
    Join Date
    May 2004
    Location
    Dominican Republic
    Posts
    721
Sorry for the delay, but here's what Oracle support said about the problem. It turned out to be, as RBARAER said, the _expected behaviour_ in 10gR2. Here's the excerpt from their response to my SR:
We had a discussion with the expert team. If the public network goes down then all VIPs will fail. The instance is dependent on the VIP, so if the VIP goes down its dependents (the instance) will automatically go down. In our case both VIPs are failing because of the public network failure, so the dependent instances are also getting shut down, and this is the expected behavior. As of 10.2.0.2 there is a dependency between the VIPs and the instances. As of 10.2.0.3, the dependency from instance to VIP can be dropped, if you prefer that the instance does not stop.
However, they said this _dependency_ can be dropped in patch set #3. Hope they release it soon.
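
For the curious, the dependency itself should be visible in the CRS profile of the instance resource (the resource names below are from our setup, yours will differ):

Code:
# 10gR2: dump the CRS profile of the instance resource and check its
# required resources -- the node's VIP should be listed there
crs_stat -p ora.atm.atm1.inst | grep REQUIRED_RESOURCES
# e.g.  REQUIRED_RESOURCES=ora.node1.vip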

  10. #10
    Join Date
    Aug 2004
    Location
    France
    Posts
    754
    Quote Originally Posted by JMartinez
Sorry for the delay, but here's what Oracle support said about the problem. It turned out to be, as RBARAER said, the _expected behaviour_ in 10gR2.
    Thanks for the info.

I still wonder whether, when using a switch (not a direct interconnect), the failure of only one cable brings the whole cluster down. From what Oracle said, if the switch (i.e. the "public network", right?) goes down, then the cluster goes down, which is understandable, but what if only one or two network cables don't work properly? Will only the instances running on the disconnected machines go down, which seems to me to be the "expected behaviour"?

It's nice to be able to drop the dependency, but how will it work? Only one instance could keep working properly if the VIP goes down (i.e. no more cluster), and then how would all connections reach that particular instance? If the cluster connection is properly configured in the tnsnames.ora file, the client will loop through all the cluster addresses until it finds the one that works, won't it?
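
For what it's worth, here is the kind of client-side entry I have in mind (the host names and service name are invented, just to illustrate the point):

Code:
ATMPROD =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = on)
      (FAILOVER = on)
      (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = atmprod)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC))
    )
  )

With (FAILOVER = on) the client tries each address in turn until one answers, which is what I meant by "looping through the instances".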

    Regards,

    rbaraer

PS: Maybe I am mixing up the Cluster Interconnect and the VIP... The question I asked above remains, but the answer you got from Oracle relates to the "public network", i.e. the network used to connect to the cluster from outside, not the cluster interconnect network... so in fact there are two single points of failure?
    Last edited by RBARAER; 08-18-06 at 04:58.
    ORA-000TK : No bind variable detected... Shared Pool Alert code 5 - Nuclear query ready .

  11. #11
    Join Date
    May 2004
    Location
    Dominican Republic
    Posts
    721
Yes, I believe only the instances on the machines whose VIP has failed will go down; the others will stay up and ready. The VIP and the interconnect are different -- as you said, one carries the cluster's private messaging traffic and the other makes the cluster available to the public network.
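
You can check which network the clusterware treats as public and which as the interconnect with oifcfg (the interface names and subnets below are only an example, not our actual values):

Code:
# run as the clusterware owner on any node
oifcfg getif

# example output: one public network (where the VIPs live)
# and one private cluster interconnect
ce0  10.1.1.0      global  public
ce3  192.168.10.0  global  cluster_interconnect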
