ONLamp.com



Linux Compatibility on BSD for the PPC Platform: Part 4

Inconsistent signal delivery

While we worked on the Java tests, new problems appeared. The most obvious was that no test in which a Java program launched a native program could pass. The native program was launched, but the Java program was never notified of its child's death. The problem was not Java-specific, but we were only able to reproduce it with a Java program.



// exec_test.java -- run a native program and wait for it to exit
import java.io.IOException;

class exec_test
{
    static Process pid;
    static String cmdstring = "/bin/ps";

    // Launch argv and block until it exits; pname is only used in messages.
    public static boolean execIt(String argv, String pname)
    {
        try {
            System.out.println(" Will call " + pname);
            pid = Runtime.getRuntime().exec(argv);
        } catch (IOException e) {
            System.out.println("Failed to execute " + pname);
            return false;
        }
        System.out.println("Waiting for " + pname + " to die");
        try {
            // waitFor() returns only once the JVM learns of the child's death.
            pid.waitFor();
        } catch (InterruptedException e) {
            return false;
        }
        System.out.println("end of " + pname);
        return true;
    }

    public static void main(String args[])
    {
        System.out.println("In exec_test");
        execIt(cmdstring, "Testing /bin/ps");
    }
}

The program simply launches /bin/ps and waits for it to exit. Sometimes it works; sometimes it fails. Success correlates loosely with the load average, but not completely. The bug only showed up under a particular race condition between different signals, which made it extremely difficult to spot.

Hendricks finally found what was wrong by using the logging feature of the JDK, which is enabled by running the Java program with java_g instead of java. Note that this feature has been disabled in JDK-1.3.0. We used JDK-1.1.8 for the tests.

Using java_g -green -l6 exec_test, we got a lot of output, including a line complaining about an unexpected signal 20 where a SIGCHLD was expected. A quick look at NetBSD's sys/sys/signal.h shows that signal 20 is NetBSD's SIGCHLD. In Linux, SIGCHLD is signal 17. The trace also complained about a bad signal 23 instead of a SIGIO. For NetBSD, signal 23 is SIGIO; for Linux, SIGIO is signal 29.
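To make the expectation concrete, here is a minimal C sketch (not taken from the JDK) of a parent that relies on SIGCHLD to learn that its child has exited. The names record_signal and run_child_and_report_signal are hypothetical, and the child simply exits instead of running /bin/ps. On a correctly behaving system, the number handed to the handler is the platform's own SIGCHLD; under the emulation bug, a Linux binary doing the same thing was handed NetBSD's number instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Last signal number passed to the handler (hypothetical helper state). */
static volatile sig_atomic_t last_signal_seen = 0;

static void record_signal(int signo)
{
    last_signal_seen = signo;
}

/*
 * Fork a child that exits at once, wait for SIGCHLD, reap the child,
 * and return the signal number the handler received (-1 on error).
 */
int run_child_and_report_signal(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    pid_t pid;
    int status, seen;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = record_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, &old_sa) == -1)
        return -1;

    /* Block SIGCHLD so it cannot arrive before we are ready to wait. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &old_mask) == -1)
        return -1;

    last_signal_seen = 0;
    pid = fork();
    if (pid == 0)
        _exit(0);                     /* child: stand-in for /bin/ps */
    if (pid == -1) {
        sigprocmask(SIG_SETMASK, &old_mask, NULL);
        sigaction(SIGCHLD, &old_sa, NULL);
        return -1;
    }

    /* Atomically unblock SIGCHLD and sleep until the handler has run. */
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGCHLD);
    while (last_signal_seen == 0)
        sigsuspend(&wait_mask);

    /* Reap the child, retrying if another signal interrupts the wait. */
    while (waitpid(pid, &status, 0) == -1 && errno == EINTR)
        ;

    seen = last_signal_seen;
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGCHLD, &old_sa, NULL);
    return seen;
}
```

A program that compares the received number against its own SIGCHLD constant, as the JDK trace suggests the green-threads runtime does, will ignore or reject the notification if the kernel passes a different number.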

Obviously, the signal numbers were not being correctly translated between NetBSD and Linux. My first idea was to carefully check the native_to_linux_sig[] array in sys/compat/linux/common/linux_signal.c in case signal numbers had been mixed up. This was not the case.
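The kind of lookup such an array performs can be sketched as follows. This is an illustration only, not the contents of linux_signal.c: the table name sketch_native_to_linux_sig, its size, and the function sketch_translate are hypothetical, and only the two mappings quoted in the article are filled in.

```c
/* Signal numbers quoted in the article. */
enum { SKETCH_NETBSD_SIGCHLD = 20, SKETCH_NETBSD_SIGIO = 23 };
enum { SKETCH_LINUX_SIGCHLD = 17, SKETCH_LINUX_SIGIO = 29 };

#define SKETCH_NSIG 32   /* assumed table size for this sketch */

/*
 * Hypothetical, partial translation table indexed by native (NetBSD)
 * signal number. Entries not listed are 0, meaning "no mapping in this
 * sketch"; the real table covers every signal.
 */
static const int sketch_native_to_linux_sig[SKETCH_NSIG] = {
    [SKETCH_NETBSD_SIGCHLD] = SKETCH_LINUX_SIGCHLD,
    [SKETCH_NETBSD_SIGIO]   = SKETCH_LINUX_SIGIO,
};

/* Translate a native signal number, returning 0 when out of range. */
int sketch_translate(int native_sig)
{
    if (native_sig < 0 || native_sig >= SKETCH_NSIG)
        return 0;
    return sketch_native_to_linux_sig[native_sig];
}
```

With this table, sketch_translate(20) yields 17 and sketch_translate(23) yields 29, which are the numbers the Linux JDK expected to see.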

The next step was to check the linux_sendsig() function in sys/compat/linux/arch/powerpc/linux_machdep.c, which is responsible for sending signals to Linux processes. This function takes a sig parameter, which is the signal number. The sig parameter is used twice in linux_sendsig().

First, it is used when building the Linux struct sigcontext on the process's stack. This structure has a field to hold the signal number, and the signal number is copied there with the appropriate translation:

/*
 * Prepare a sigcontext for later.
 */
sc.lsignal = (int)native_to_linux_sig[sig];
sc.lhandler = (unsigned long)catcher;
native_to_linux_old_sigset(mask, &sc.lmask);
sc.lregs = (struct linux_pt_regs*)fp;

Second, it is used when setting up the trap frame prior to transferring control to the signal trampoline. Here the appropriate translation was missing:

/*
 * Set the registers according to how the Linux 
 * process expects them
 */
tf->fixreg[1] = (int)fp;
tf->lr = (int)catcher;
tf->fixreg[3] = (int)sig;
tf->fixreg[4] = (int)&fp->lgp_regs;
tf->srr0 = (int)p->p_sigctx.ps_sigcode;

Once the problem was found, it was quite easy to fix by changing the third line in the above code fragment:

tf->fixreg[3] = (int)native_to_linux_sig[sig];

With this fix, Java programs forking native programs are able to work without suffering random failures. It also has the side effect of fixing the mail and news components of Netscape Communicator, which were previously broken.
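Why one untranslated register was enough to break things can be shown with a small user-space model. On PowerPC the first integer argument of a function is passed in register r3, which is what fixreg[3] holds, so the value stored there is the signal number the Linux handler receives as its argument. The struct and function names below are hypothetical stand-ins, not kernel code.

```c
#include <stdbool.h>

enum { MODEL_NETBSD_SIGCHLD = 20, MODEL_LINUX_SIGCHLD = 17 };

/* Hypothetical stand-in for the trap frame: only r3 is modelled. */
struct model_trapframe {
    int r3;   /* first argument register, i.e. the handler's signo */
};

/* Minimal translation covering only the signal used in this model. */
static int model_native_to_linux(int native_sig)
{
    return native_sig == MODEL_NETBSD_SIGCHLD ? MODEL_LINUX_SIGCHLD
                                              : native_sig;
}

/* Before the fix: the native number is handed to the handler as-is. */
void model_setup_buggy(struct model_trapframe *tf, int sig)
{
    tf->r3 = sig;
}

/* After the fix: the number is translated before it reaches r3. */
void model_setup_fixed(struct model_trapframe *tf, int sig)
{
    tf->r3 = model_native_to_linux(sig);
}

/* What a Linux handler does: compare its argument with Linux's SIGCHLD. */
bool model_handler_sees_sigchld(const struct model_trapframe *tf)
{
    return tf->r3 == MODEL_LINUX_SIGCHLD;
}
```

In this model, a frame prepared by model_setup_buggy with signal 20 is not recognized as SIGCHLD by the handler, while the same frame prepared by model_setup_fixed is. This also fits the observation that the sigcontext on the stack was already correct: a handler that reads only its argument never looks at that field.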

Non-standard behavior of asynchronous I/O

The previous fix helps the JDK a lot, but there are still some rare hangs. One can be observed when building the Apache foundation's Jakarta-Ant, the make(1)-like build utility for Java. Another occurs when attempting to run Jakarta-Tomcat, the Apache foundation's JSP server. In this section, we will focus on the problem with Jakarta-Ant.

The offending program here was javac, the Java compiler. The problem was obviously emulation-related because it was possible to successfully build Jakarta-Ant using a native build of Jikes, the Java compiler written in C++.

The JDK-1.2.2 logging feature was again very useful. For the Java compiler, it can be enabled by invoking javac_g -J-Xl6 (no space after the J) instead of just javac. This is worth mentioning because the -J flag is not documented except in the JDK sources.

Note that anyone can get the JDK sources; the only requirement is to make an agreement with Sun. But be aware that reading the JDK sources will make you unable to contribute to any open-source Java implementation such as Kaffe.
