Troubleshoot QNX system hang

Postby lullaby » Tue Feb 26, 2013 5:42 am

Hi all,

I have been testing my QNX application, which continuously writes raw data to disk from a start sector to an end sector, in overnight runs. The application GUI is Photon-based, and I have enabled sloginfo to capture my error prints. When I check in the morning, the application has hung and so has the whole system; even the system time has not advanced since the hang. I restarted the system and analysed the log file, but I couldn't see any error prints. The test PC is quad-core. The application runs multiple threads, and it also calculates the speed of each write call and prints it to a log file. ThreadCtl(), SYSPAGE_ENTRY() and ClockCycles() are called in every loop iteration. I have tested the same application on a single-core PC and couldn't reproduce the hang. Could someone give me some hints for troubleshooting this? I am stuck, as I have no clue how to proceed.

Thanks,
Lullaby
lullaby
Active Member
 
Posts: 54
Joined: Fri Jun 15, 2012 6:19 am

Re: Troubleshoot QNX system hang

Postby maschoen » Tue Feb 26, 2013 8:17 pm

Well the first thing to think about if an application works on a single cpu but not a multi-cpu is whether there is a race condition somewhere in the code.

It is possible in QNX for a hung application to hang the GUI, but for that to happen, you have to have pushed the application priorities up high.

Lastly, by sheer coincidence I ran into what might be a related problem yesterday. A dual core system was loaded with QNX 6.5 using Photon, an archive was loaded and make was run. And it was hanging randomly. The problem went away when the video driver was changed from VGA to VESA or the native hardware driver. You might want to check which video driver you are using.
maschoen
QNX Master
 
Posts: 2439
Joined: Wed Jun 25, 2003 5:18 pm

Re: Troubleshoot QNX system hang

Postby lullaby » Thu Feb 28, 2013 2:48 am

Hi maschoen,

Thank you for your reply. But is there any way to identify a race condition (other than a source code walk-through and enabling debug prints)? I find it difficult because the QNX PC itself hangs, so I can't even analyse the state of the threads in my application.
Even the system time hangs. The system hang does not appear if I test the same Photon-based application overnight on a single-core or a dual-core PC.

You mentioned this, right?
maschoen wrote:Well the first thing to think about if an application works on a single cpu but not a multi-cpu is whether there is a race condition somewhere in the code.


So do you mean that if the application works on a single CPU and the system hangs, it can be due to a race condition? Is my understanding correct? So what happens when the application runs on a quad-core (multi-core) machine? Are you saying that a race condition may not be the reason on a multi-core PC? Could you please share your thoughts on my queries?

Thanks,
Lullaby
lullaby
Active Member
 
Posts: 54
Joined: Fri Jun 15, 2012 6:19 am

Re: Troubleshoot QNX system hang

Postby maschoen » Thu Feb 28, 2013 6:13 am

lullaby wrote:Hi maschoen,

Thank you for your reply. But is there any way to identify a race condition (other than a source code walk-through and enabling debug prints)?


There's no magic.

lullaby wrote:I find it difficult because the QNX PC itself hangs, so I can't even analyse the state of the threads in my application.

Here's something extreme you can do.
In any permanent loops, put a mutex lock at the beginning and an unlock at the end.
Then for anything not in a loop, like a callback, put in a mutex lock when the routine is entered and unlock when it leaves.
This will force your code to be single threaded.
If the problem goes away, it was a race condition. If not, well something else.
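
A minimal sketch of this "one big lock" experiment, assuming POSIX threads. The names big_lock, shared_counter, worker and run_serialized are hypothetical and not taken from the application discussed here; the counter increment stands in for the real loop body.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One process-wide lock: while a thread holds it, no other worker runs. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical shared state that every worker updates. */
static long shared_counter = 0;

/* Hypothetical worker: the whole loop body sits inside big_lock. */
static void *worker(void *arg)
{
    long iterations = *(long *)arg;
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&big_lock);    /* lock at the top of the loop */
        shared_counter++;                 /* stand-in for the real work */
        pthread_mutex_unlock(&big_lock);  /* unlock at the bottom */
    }
    return NULL;
}

/* Starts nthreads workers (at most 16), waits for them, returns the counter. */
long run_serialized(int nthreads, long iterations)
{
    pthread_t tid[16];

    assert(nthreads > 0 && nthreads <= 16);
    shared_counter = 0;
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &iterations);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return shared_counter;
}
```

With the lock in place, run_serialized(4, 10000) should always return 40000. Callbacks would take the same lock on entry and release it on exit. This is only a diagnostic: it throws away the parallelism, so it is not meant to stay in the code.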

lullaby wrote:Even the system time hangs. The system hang does not appear if I test the same Photon-based application overnight on a single-core or a dual-core PC.

You might want to look into the partition scheduler (QNX adaptive partitioning). It would allow you to reserve some CPU time for a shell that should always work, unless the OS is hosed.

lullaby wrote:So do you mean that if the application works on a single CPU and the system hangs, it can be due to a race condition? Is my understanding correct? So what happens when the application runs on a quad-core (multi-core) machine? Are you saying that a race condition may not be the reason on a multi-core PC? Could you please share your thoughts on my queries?


You can have a race condition whether single or multi-core. It becomes easier to trigger a race condition on a multi-core because you can have two threads in your process running at the same time.

This reminds me of one more thing you can try. If you ThreadCtl() your process onto one cpu and use the FIFO scheduling, you should eliminate any race conditions. I suggest this not as a good way to run, but rather as a way to find out if the problem is a race condition.
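
A rough sketch of that experiment, assuming QNX Neutrino's ThreadCtl() with _NTO_TCTL_RUNMASK plus the POSIX scheduling calls. The helper names single_cpu_runmask and pin_self_fifo are made up for illustration. The affinity call is compiled only on QNX; on other systems the sketch just changes the scheduling policy. As far as I know the runmask applies to the calling thread, so every thread in the process would have to call this itself.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#ifdef __QNXNTO__
#include <sys/neutrino.h>
#endif

/* Runmask with only the bit for `cpu` set (bit 0 is the first core). */
uintptr_t single_cpu_runmask(unsigned cpu)
{
    return (uintptr_t)1 << cpu;
}

/*
 * Hypothetical helper: restrict the calling thread to `cpu` and switch it
 * to SCHED_FIFO at `priority`. Returns 0 on success, an errno value on
 * failure. Changing policy or priority may need suitable privileges.
 */
int pin_self_fifo(unsigned cpu, int priority)
{
    struct sched_param sp;
    int policy;
    int rc;

#ifdef __QNXNTO__
    /* QNX only: limit this thread to a single core. */
    if (ThreadCtl(_NTO_TCTL_RUNMASK, (void *)single_cpu_runmask(cpu)) == -1)
        return errno;
#else
    (void)cpu; /* affinity step skipped off-target */
#endif

    /* Read the current parameters first so other fields stay intact. */
    rc = pthread_getschedparam(pthread_self(), &policy, &sp);
    if (rc != 0)
        return rc;
    sp.sched_priority = priority;
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}
```

Each thread would call something like pin_self_fifo(0, 10) as its first action. If the hang disappears with every thread confined to core 0, a race between threads becomes the main suspect.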
maschoen
QNX Master
 
Posts: 2439
Joined: Wed Jun 25, 2003 5:18 pm

Re: Troubleshoot QNX system hang

Postby lullaby » Thu Feb 28, 2013 6:26 am

Thank you a lot for your suggestions, Maschoen.
I will try it out and let you know the results.

Thanks,
Lullaby
lullaby
Active Member
 
Posts: 54
Joined: Fri Jun 15, 2012 6:19 am

Re: Troubleshoot QNX system hang

Postby Tim » Thu Feb 28, 2013 5:40 pm

There is something else you can do.

Before running your app you can spawn one terminal (since you are running in Photon) at a REALLY high priority (30+). That way if your app is merely consuming 100% of the CPU as opposed to hanging the O/S you'll still be able to run commands like hogs/sin/pidin in this high priority terminal to see what's going on.

The 'renice' command is handy for increasing the priority of an existing terminal spawned at normal priority.

Tim
Tim
Senior Member
 
Posts: 1153
Joined: Wed Mar 10, 2004 12:28 am

Re: Troubleshoot QNX system hang

Postby mario » Thu Feb 28, 2013 8:41 pm

Can you give more details about "even system time hangs?"
mario
QNX Master
 
Posts: 4107
Joined: Sun Sep 01, 2002 1:04 am

Re: Troubleshoot QNX system hang

Postby mario » Thu Feb 28, 2013 11:01 pm

Your usage of ThreadCtl/ClockCycles is creating some level of "chaos" in the system. Your interpretation of the documentation isn't bad, but it is not quite right.

When you intend to use ClockCycles() in a thread, you should "lock" that thread to a core ahead of time, when the thread is started, and leave it there for the whole life of the thread. Doing it just before ClockCycles() creates all sorts of disruptive behavior. First, if the thread is running on core 3 but you assign it to core 0, you are forcing a thread migration; that isn't good, and it will affect performance. Whatever number you get will not reflect a "real case scenario". If another thread is doing the same thing and is also being assigned to core 0, then you have contention: only one of the threads can run, whereas before each of them was happily running on its own core.
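
A minimal sketch of "pin once, then measure", assuming QNX's ClockCycles() and SYSPAGE_ENTRY(qtime)->cycles_per_sec. The helper names (read_ticks, ticks_per_second, bytes_per_second, timed_loop) and the do_write callback are hypothetical. Off QNX the sketch falls back to clock_gettime() in nanoseconds so it still compiles.

```c
#ifndef __QNXNTO__
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>
#ifdef __QNXNTO__
#include <sys/neutrino.h>
#include <sys/syspage.h>
#endif

/* Free-running counter: ClockCycles() on QNX, nanoseconds elsewhere. */
static uint64_t read_ticks(void)
{
#ifdef __QNXNTO__
    return ClockCycles();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* How many ticks of read_ticks() make up one second. */
static uint64_t ticks_per_second(void)
{
#ifdef __QNXNTO__
    return SYSPAGE_ENTRY(qtime)->cycles_per_sec;
#else
    return 1000000000u;
#endif
}

/* Throughput in bytes/s for `bytes` moved between two tick readings. */
double bytes_per_second(uint64_t bytes, uint64_t start, uint64_t end,
                        uint64_t tps)
{
    uint64_t elapsed = end - start;
    if (elapsed == 0)
        return 0.0;
    return (double)bytes * (double)tps / (double)elapsed;
}

/* Hypothetical thread loop: affinity is set once, before any timing. */
double timed_loop(unsigned cpu, int iterations, uint64_t bytes_per_iter,
                  void (*do_write)(void))
{
    double last = 0.0;
    uint64_t tps;

#ifdef __QNXNTO__
    /* Pin for the life of the thread, not before every reading. */
    ThreadCtl(_NTO_TCTL_RUNMASK, (void *)((uintptr_t)1 << cpu));
#else
    (void)cpu;
#endif
    tps = ticks_per_second();
    for (int i = 0; i < iterations; i++) {
        uint64_t t0 = read_ticks();
        do_write();                     /* the operation being timed */
        uint64_t t1 = read_ticks();
        last = bytes_per_second(bytes_per_iter, t0, t1, tps);
    }
    return last;
}
```

Giving each writer thread its own core number avoids stacking every thread on core 0, which is the contention described above.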

Check the model of CPU you are using; most modern x86 CPUs have their RDTSC synchronised and unaffected by things like SpeedStep. They even have a mode that ensures RDTSC can be used to measure time and not clock cycles (which is what it was designed for in the beginning). We have been using the Xeon family for quite a while and never had to worry about that.

Another solution, if you don't do lots of readings per second, is to write a "ClockCycles Server". You set that server's affinity to one core and have any process that wants the time send a message to it.

Playing with thread affinity is in general a very bad idea unless one possesses great knowledge of CPU and OS architectures.
mario
QNX Master
 
Posts: 4107
Joined: Sun Sep 01, 2002 1:04 am

Re: Troubleshoot QNX system hang

Postby lullaby » Sat Mar 02, 2013 5:42 am

Hi all,

A little update. I think I forgot to mention one thing. My multi-threaded application continuously writes to disk sectors, calculates the speed, writes the log to a file, and also continuously appends the log to a PtMultiText. One non-Photon thread does the log file writes and the multitext updates. I am using the PtMultiTextModifyText() function to add/delete entries in the multitext. That is, when the log entries reach some limit in the PtMultiText widget, I need to delete some old entries from it, and this happens periodically. I do the updates to the PtMultiText widget within a PtEnter()...PtLeave() block.
This is my pseudo code:-
PtEnter(0)
if deletion condition is satisfied
{
    for loop to get the length of 100 lines
        PtMultiTextInfo()

    PtMultiTextModifyText - to delete those 100 lines
}
PtMultiTextModifyText - to add the new line
PtSetResource(vertical scroll pos)
PtLeave()


Even now, if my application is left running continuously, it hangs after some time on an Intel Xeon E5540 machine. :cry: There are no issues when testing on a single-core/dual-core machine. I couldn't find any possible race condition in my application from code analysis. Now I suspect there is some issue with the continuous updates to the PtMultiText. Reading through the QNX help pages, I saw functions like PtHold(), PtRelease(), PtContainerHold(), PtContainerRelease() etc. The help says these are used to force/delay display updates. I was not aware of these functions and I haven't used any of them in my application.

My question is: could it be that my non-Photon thread is constantly updating the PtMultiText when the display is suddenly updated, and this causes the system hang? What is the purpose of PtHold(), PtRelease(), PtContainerHold(), PtContainerRelease() etc.? I haven't understood it clearly. Is the hang in my application caused by the absence of these functions?

Could you please share your thoughts on this issue?

Thanks,
Lullaby
lullaby
Active Member
 
Posts: 54
Joined: Fri Jun 15, 2012 6:19 am

Re: Troubleshoot QNX system hang

Postby maschoen » Sat Mar 02, 2013 6:59 am

Just something to try.

I never understood exactly why, but I think you should code the PtEnter(), PtLeave() this way.

int rc;

rc = PtEnter(0);

...


PtLeave(rc);

It usually works without this, but try it and see if your problem changes any.
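
A small sketch of that pattern wrapped in a helper, so the value returned by PtEnter() always reaches PtLeave(). The helper with_photon_lock and the callback count_update are hypothetical. The treatment of a negative return as failure is my reading of the Photon docs, so check it against the PtEnter() reference. Outside QNX the sketch uses stand-in stubs only so that it compiles; they are not the real Photon library.

```c
#include <assert.h>
#ifdef __QNXNTO__
#include <Pt.h>
#else
/* Off-target stand-ins ONLY; they do not model real Photon behaviour. */
static int last_leave_arg = -1;
static int PtEnter(int flags) { (void)flags; return 1; }
static int PtLeave(int flags) { last_leave_arg = flags; return 0; }
#endif

/*
 * Hypothetical helper: run update(ctx) while holding the Photon library
 * lock. Returns 0 on success, or the negative PtEnter() result
 * (assumption: a negative value means the lock was not taken).
 */
int with_photon_lock(void (*update)(void *), void *ctx)
{
    int flags = PtEnter(0);

    if (flags < 0)
        return flags;
    update(ctx);        /* widget calls from a non-Photon thread go here */
    PtLeave(flags);     /* pass back what PtEnter() returned, not 0 */
    return 0;
}

/* Hypothetical update; real code would modify the PtMultiText here. */
void count_update(void *ctx)
{
    ++*(int *)ctx;
}
```

Funnelling every widget update through one helper like this makes it harder to repeat the mismatch between PtEnter(0) and the argument given to PtLeave().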
maschoen
QNX Master
 
Posts: 2439
Joined: Wed Jun 25, 2003 5:18 pm

Re: Troubleshoot QNX system hang

Postby lullaby » Mon Mar 04, 2013 6:36 am

Hi Maschoen,

This is an excellent clue, and thank you a lot for pointing out my mistake. I have updated the source code and started a continuous run on the quad-core machine. I will get back to you with the results tomorrow.

Thanks,
Lullaby
lullaby
Active Member
 
Posts: 54
Joined: Fri Jun 15, 2012 6:19 am

Re: Troubleshoot QNX system hang

Postby lullaby » Tue Mar 05, 2013 6:27 am

Hi all,

Our QNX system hang issue is finally solved. The problem was the argument passed to PtLeave(): I had used zero instead of the return value of the PtEnter(0) call. I modified the PtLeave() call as Maschoen suggested in the earlier post, and the application has now run continuously on the quad-core machine for 25 hours.
Thank you a lot !!!

Thanks,
Lullaby
lullaby
Active Member
 
Posts: 54
Joined: Fri Jun 15, 2012 6:19 am

