From Serge.Dussud at Sun.COM Fri Feb 6 17:29:51 2009
From: Serge.Dussud at Sun.COM (serge)
Date: Fri, 06 Feb 2009 16:29:51 +0100
Subject: tsch cores if nscd disabled on Solaris LDAP sasl/gssapi client
In-Reply-To: <4947BC69.9020806@Sun.COM>
References: <20081203154911.67F1A5654E@rebar.astron.com>
 <4947BC69.9020806@Sun.COM>
Message-ID: <498C576F.4000408@sun.com>

Following-up ...

Serge Dussud wrote:
>
>
> On 12/ 3/08 04:49 PM, Christos Zoulas wrote:
>> On Dec 2, 5:04pm, Serge.Dussud at Sun.COM (Serge Dussud) wrote:
>> -- Subject: tsch cores if nscd disabled on Solaris LDAP sasl/gssapi
>> client
>>
>> |
>> | hello tcsh-bugs,
>> |
>> | we've seen bash and tcsh core dumping under certain conditions on
>> | solaris 10 (and newer), because of multiple malloc() routines issues
>> | with certain ld(1) options. For gory details, please refer the
>> | discussion on bash-bugs at gnu.org:
>> |
>> | http://thread.gmane.org/gmane.comp.shells.bash.bugs/11855
>> |
>> | and to the bug referred in [1].
>> |
>> | The preferred solution to this issue seems to be to compile tcsh with
>> | -DSYSMALLOC on Solaris. so my question to you at this point is the same
>> | I posted earlier today to bash-bugs: are there any real benefits these
>> | days of using tcsh's own malloc function rather than the one from the
>> | system ? or, otherwise said, what do tcsh lose when it's compiled with
>> | option -DSYSMALLOC ?
>> |
>> | TIA !
>> | Serge
>>
>> Well, the preferred solution should be to fix RTLD_GROUP semantics
>
> it's indeed work in progress.
>

... and it's now completed. The linker issue [1] was the root cause of
all this, and was fixed in recent OpenSolaris build (snv_106). We're
working on fixing it to previous version of Solaris, when/if applicable.
Bottom line, we're not applying any patch to the way we build our tcsh
delivery bundled to [Open]Solaris :)

Thanks for the help/chat on this one,
serge

[1] http://bugs.opensolaris.org/view_bug.do?bug_id=6778453

From leonardo.lists at gmail.com Thu Feb 19 21:27:32 2009
From: leonardo.lists at gmail.com (Leonardo Chiquitto)
Date: Thu, 19 Feb 2009 16:27:32 -0300
Subject: builtin suspend can trigger infinite loop
Message-ID:

Hello,

This is similar to the problem already fixed in 6.12.02 (9. Don't go
into an infinite loop when tcgetpgrp() returns an error.), but here
tcgetpgrp() won't return an error. How to reproduce:

first # tcsh --version
tcsh 6.15.00 (Astron) 2007-03-03 (i586-suse-linux) options wide,
nls,lf,dl,al,kan,sm,color,filec

first # tcsh -f
# tcsh -f
# suspend

Suspended
# exit
first #

At this point, the suspended tcsh process will be looping in
the "retry" block of dosuspend():

--- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
rt_sigaction(SIGTTIN, {SIG_IGN}, NULL, 8) = 0
ioctl(15, TIOCGPGRP, [5305]) = 0
rt_sigaction(SIGTTIN, NULL, {SIG_IGN}, 8) = 0
rt_sigaction(SIGTTIN, {SIG_DFL}, {SIG_IGN}, 8) = 0
kill(0, SIGTTIN) = 0
--- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
rt_sigaction(SIGTTIN, {SIG_IGN}, NULL, 8) = 0
ioctl(15, TIOCGPGRP, [5305]) = 0
rt_sigaction(SIGTTIN, NULL, {SIG_IGN}, 8) = 0
rt_sigaction(SIGTTIN, {SIG_DFL}, {SIG_IGN}, 8) = 0
kill(0, SIGTTIN) = 0

Closing the last shell will make tcgetpgrp() return -1 and the
process will finally rest:

--- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
rt_sigaction(SIGTTIN, {SIG_IGN}, NULL, 8) = 0
ioctl(15, TIOCGPGRP, [0]) = -1 EIO (Input/output error)
exit_group(1) = ?
Process 5327 detached

I imagined a trivial fix (adding a retry counter to identify that
we entered a loop), but I'm failing to understand the reasons behind
the pgrp and SIGTTIN handling in dosuspend()'s last block. I'd really
appreciate if someone could explain the purpose of triggering SIGTTIN
there.

Thanks,
Leonardo

From christos at zoulas.com Sat Feb 21 17:36:36 2009
From: christos at zoulas.com (Christos Zoulas)
Date: Sat, 21 Feb 2009 10:36:36 -0500
Subject: builtin suspend can trigger infinite loop
In-Reply-To: from Leonardo Chiquitto (Feb 19, 4:27pm)
Message-ID: <20090221153636.76CD05654E@rebar.astron.com>

On Feb 19, 4:27pm, leonardo.lists at gmail.com (Leonardo Chiquitto) wrote:
-- Subject: builtin suspend can trigger infinite loop

| Hello,
|
| This is similar to the problem already fixed in 6.12.02 (9. Don't go
| into an infinite loop when tcgetpgrp() returns an error.), but here
| tcgetpgrp() won't return an error. How to reproduce:
|
| first # tcsh --version
| tcsh 6.15.00 (Astron) 2007-03-03 (i586-suse-linux) options wide,
| nls,lf,dl,al,kan,sm,color,filec
|
| first # tcsh -f
| # tcsh -f
| # suspend
|
| Suspended
| # exit
| first #
|
| At this point, the suspended tcsh process will be looping in
| the "retry" block of dosuspend():

Mine does not look like this, it looks like:

[10:33am] 2505>tcsh -f
> set prompt=first%
first%tcsh -f
> suspend

Suspended
first%exit
There are suspended jobs.
first%exit
exit
[10:33am] 2506>

And now the internal tcsh loops killing itself.

christos

From leonardo.lists at gmail.com Tue Feb 24 18:07:16 2009
From: leonardo.lists at gmail.com (Leonardo Chiquitto)
Date: Tue, 24 Feb 2009 13:07:16 -0300
Subject: builtin suspend can trigger infinite loop
In-Reply-To: <20090221153636.76CD05654E@rebar.astron.com>
References: <20090221153636.76CD05654E@rebar.astron.com>
Message-ID:

> Mine does not look like this, it looks like:
> [10:33am] 2505>tcsh -f
>> set prompt=first%
> first%tcsh -f
>> suspend
>
> Suspended
> first%exit
> There are suspended jobs.
> first%exit
> exit
> [10:33am] 2506>
>
> And now the internal tcsh loops killing itself.

Strange.. I thought it could be something Linux specific, but I
just reproduced the same problem with OpenBSD 4.4. What
exit path are you seeing? Is tcgetpgrp() that fails and causes
the abort?

Thanks,
Leonardo

From christos at zoulas.com Tue Feb 24 21:04:21 2009
From: christos at zoulas.com (Christos Zoulas)
Date: Tue, 24 Feb 2009 14:04:21 -0500
Subject: builtin suspend can trigger infinite loop
In-Reply-To: from Leonardo Chiquitto (Feb 24, 1:07pm)
Message-ID: <20090224190421.B1CBA5654E@rebar.astron.com>

On Feb 24, 1:07pm, leonardo.lists at gmail.com (Leonardo Chiquitto) wrote:
-- Subject: Re: builtin suspend can trigger infinite loop

| > Mine does not look like this, it looks like:
| > [10:33am] 2505>tcsh -f
| >> set prompt=first%
| > first%tcsh -f
| >> suspend
| >
| > Suspended
| > first%exit
| > There are suspended jobs.
| > first%exit
| > exit
| > [10:33am] 2506>
| >
| > And now the internal tcsh loops killing itself.
|
| Strange.. I thought it could be something Linux specific, but I
| just reproduced the same problem with OpenBSD 4.4. What
| exit path are you seeing? Is tcgetpgrp() that fails and causes
| the abort?

ktrace shows a loop of kill(0, SIGTTIN) IIRC. This is NetBSD current.

christos

From leonardo.lists at gmail.com Wed Feb 25 15:45:15 2009
From: leonardo.lists at gmail.com (Leonardo Chiquitto)
Date: Wed, 25 Feb 2009 10:45:15 -0300
Subject: builtin suspend can trigger infinite loop
In-Reply-To: <20090224190421.B1CBA5654E@rebar.astron.com>
References: <20090224190421.B1CBA5654E@rebar.astron.com>
Message-ID:

> | > Mine does not look like this, it looks like:
> | > [10:33am] 2505>tcsh -f
> | >> set prompt=first%
> | > first%tcsh -f
> | >> suspend
> | >
> | > Suspended
> | > first%exit
> | > There are suspended jobs.
> | > first%exit
> | > exit
> | > [10:33am] 2506>
> | >
> | > And now the internal tcsh loops killing itself.
> |
> | Strange.. I thought it could be something Linux specific, but I
> | just reproduced the same problem with OpenBSD 4.4. What
> | exit path are you seeing? Is tcgetpgrp() that fails and causes
> | the abort?
>
> ktrace shows a loop of kill(0, SIGTTIN) IIRC. This is NetBSD current.

Now I'm confused, are you seeing the infinite loop too or does the
tcsh process correctly exits? AFAIR, SIGTTIN won't terminate the
process, right?

Thanks,
Leonardo

From christos at zoulas.com Wed Feb 25 16:15:38 2009
From: christos at zoulas.com (Christos Zoulas)
Date: Wed, 25 Feb 2009 09:15:38 -0500
Subject: builtin suspend can trigger infinite loop
In-Reply-To: from Leonardo Chiquitto (Feb 25, 10:45am)
Message-ID: <20090225141538.BC8AE5654E@rebar.astron.com>

On Feb 25, 10:45am, leonardo.lists at gmail.com (Leonardo Chiquitto) wrote:
-- Subject: Re: builtin suspend can trigger infinite loop

| > | > Mine does not look like this, it looks like:
| > | > [10:33am] 2505>tcsh -f
| > | >> set prompt=first%
| > | > first%tcsh -f
| > | >> suspend
| > | >
| > | > Suspended
| > | > first%exit
| > | > There are suspended jobs.
| > | > first%exit
| > | > exit
| > | > [10:33am] 2506>
| > | >
| > | > And now the internal tcsh loops killing itself.
| > |
| > | Strange.. I thought it could be something Linux specific, but I
| > | just reproduced the same problem with OpenBSD 4.4. What
| > | exit path are you seeing? Is tcgetpgrp() that fails and causes
| > | the abort?
| >
| > ktrace shows a loop of kill(0, SIGTTIN) IIRC. This is NetBSD current.
|
| Now I'm confused, are you seeing the infinite loop too or does the
| tcsh process correctly exits? AFAIR, SIGTTIN won't terminate the
| process, right?

I see the infinite loop too.

christos

From christos at zoulas.com Thu Feb 26 00:54:30 2009
From: christos at zoulas.com (Christos Zoulas)
Date: Wed, 25 Feb 2009 17:54:30 -0500
Subject: builtin suspend can trigger infinite loop
In-Reply-To: <18853.28349.561083.837686@gromit.timing.com>
 from John Hein (Feb 25, 9:15am)
Message-ID: <20090225225430.943F95654F@rebar.astron.com>

On Feb 25, 9:15am, jhein at timing.com (John Hein) wrote:
-- Subject: Re: builtin suspend can trigger infinite loop

| Attaching ktrace while it's spinning gives:
|
| 34503 tcsh RET ioctl 0
| 34503 tcsh CALL sigaction(0x15,0,0xbfbfe260)
| 34503 tcsh RET sigaction 0
| 34503 tcsh CALL sigaction(0x15,0xbfbfe210,0xbfbfe1f0)
| 34503 tcsh RET sigaction 0
| 34503 tcsh CALL kill(0,0x15)
| 34503 tcsh RET kill 0
| 34503 tcsh CALL sigaction(0x15,0xbfbfe260,0)
| 34503 tcsh RET sigaction 0
| 34503 tcsh CALL ioctl(0xf,TIOCGPGRP,0xbfbfe240)
| 34503 tcsh RET ioctl 0
| 34503 tcsh CALL sigaction(0x15,0,0xbfbfe260)
| 34503 tcsh RET sigaction 0
| 34503 tcsh CALL sigaction(0x15,0xbfbfe210,0xbfbfe1f0)
| 34503 tcsh RET sigaction 0
| 34503 tcsh CALL kill(0,0x15)
| 34503 tcsh RET kill 0
| 34503 tcsh CALL sigaction(0x15,0xbfbfe260,0)
| 34503 tcsh RET sigaction 0
| 34503 tcsh CALL ioctl(0xf,TIOCGPGRP,0xbfbfe240)
| 34503 tcsh RET ioctl 0
| 34503 tcsh CALL sigaction(0x15,0,0xbfbfe260)

How about this patch then?

christos

Index: sh.c
===================================================================
RCS file: /p/tcsh/cvsroot/tcsh/sh.c,v
retrieving revision 3.141
diff -u -u -r3.141 sh.c
--- sh.c 15 Oct 2008 16:42:00 -0000 3.141
+++ sh.c 25 Feb 2009 22:53:47 -0000
@@ -1103,17 +1103,7 @@
     }
 #endif /* NeXT */
 #ifdef BSDJOBS /* if we have tty job control */
-    retry:
-    if ((tpgrp = tcgetpgrp(f)) != -1) {
-        if (tpgrp != shpgrp) {
-            struct sigaction old;
-
-            sigaction(SIGTTIN, NULL, &old);
-            signal(SIGTTIN, SIG_DFL);
-            (void) kill(0, SIGTTIN);
-            sigaction(SIGTTIN, &old, NULL);
-            goto retry;
-        }
+    if (grabpgrp(f, shpgrp) != -1) {
         /*
          * Thanks to Matt Day for the POSIX references, and to
          * Paul Close for the SGI clarification.
@@ -2356,3 +2346,28 @@
         rechist(NULL, adrof(STRsavehist) != NULL);
     }
 }
+
+/*
+ * Grab the tty repeatedly, and give up if we are not in the correct
+ * tty process group.
+ */
+int
+grabpgrp(int fd, pid_t desired)
+{
+    struct sigaction old;
+    pid_t pgrp;
+    size_t i;
+
+    for (i = 0; i < 100; i++) {
+        if ((pgrp = tcgetpgrp(fd)) == -1)
+            return -1;
+        if (pgrp == desired)
+            return 0;
+        (void)sigaction(SIGTTIN, NULL, &old);
+        (void)signal(SIGTTIN, SIG_DFL);
+        (void)kill(0, SIGTTIN);
+        (void)sigaction(SIGTTIN, &old, NULL);
+    }
+    errno = EPERM;
+    return -1;
+}
Index: sh.decls.h
===================================================================
RCS file: /p/tcsh/cvsroot/tcsh/sh.decls.h,v
retrieving revision 3.54
diff -u -u -r3.54 sh.decls.h
--- sh.decls.h 11 Mar 2007 06:21:05 -0000 3.54
+++ sh.decls.h 25 Feb 2009 22:53:47 -0000
@@ -52,6 +52,7 @@
 #else
 extern void xexit (int);
 #endif
+extern int grabpgrp (int, pid_t);
 
 /*
  * sh.dir.c
Index: sh.func.c
===================================================================
RCS file: /p/tcsh/cvsroot/tcsh/sh.func.c,v
retrieving revision 3.149
diff -u -u -r3.149 sh.func.c
--- sh.func.c 16 Nov 2008 15:44:24 -0000 3.149
+++ sh.func.c 25 Feb 2009 22:53:47 -0000
@@ -2272,10 +2272,9 @@
 dosuspend(Char **v, struct command *c)
 {
 #ifdef BSDJOBS
-    int ctpgrp;
     struct sigaction old;
 #endif /* BSDJOBS */
-    
+
     USE(c);
     USE(v);
 
@@ -2295,17 +2294,8 @@
 
 #ifdef BSDJOBS
     if (tpgrp != -1) {
-retry:
-        ctpgrp = tcgetpgrp(FSHTTY);
-        if (ctpgrp == -1)
+        if (grabpgrp(FSHTTY, opgrp) == -1)
             stderror(ERR_SYSTEM, "tcgetpgrp", strerror(errno));
-        if (ctpgrp != opgrp) {
-            sigaction(SIGTTIN, NULL, &old);
-            signal(SIGTTIN, SIG_DFL);
-            (void) kill(0, SIGTTIN);
-            sigaction(SIGTTIN, &old, NULL);
-            goto retry;
-        }
         (void) setpgid(0, shpgrp);
         (void) tcsetpgrp(FSHTTY, shpgrp);
     }

From leonardo.lists at gmail.com Thu Feb 26 19:14:02 2009
From: leonardo.lists at gmail.com (Leonardo Chiquitto)
Date: Thu, 26 Feb 2009 14:14:02 -0300
Subject: builtin suspend can trigger infinite loop
In-Reply-To: <18854.49585.267774.214781@gromit.timing.com>
References: <18853.28349.561083.837686@gromit.timing.com>
 <20090225225430.943F95654F@rebar.astron.com>
 <18854.49585.267774.214781@gromit.timing.com>
Message-ID:

On Thu, Feb 26, 2009 at 1:22 PM, John Hein wrote:
> Christos Zoulas wrote at 17:54 -0500 on Feb 25, 2009:
> > How about this patch then?
>
> That fixed it on FreeBSD 7.

Works for me too on Linux. Thanks Christos!

Leonardo

From christos at zoulas.com Thu Feb 26 19:34:54 2009
From: christos at zoulas.com (Christos Zoulas)
Date: Thu, 26 Feb 2009 12:34:54 -0500
Subject: builtin suspend can trigger infinite loop
In-Reply-To: <18854.49585.267774.214781@gromit.timing.com>
 from John Hein (Feb 26, 9:22am)
Message-ID: <20090226173454.84D4556550@rebar.astron.com>

On Feb 26, 9:22am, jhein at timing.com (John Hein) wrote:
-- Subject: Re: builtin suspend can trigger infinite loop

| Christos Zoulas wrote at 17:54 -0500 on Feb 25, 2009:
| > How about this patch then?
|
| That fixed it on FreeBSD 7.
| Just curious - why does NetBSD not see this behavior?

It does, but it manifests differently as an endless stream of SIGTTIN's
delivered to the process. I will need to investigate further. Looks like
NetBSD is different/broken.

christos
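
For readers following the thread, the bounded-retry idea behind grabpgrp() can be sketched outside tcsh as below. This is only an illustration, not the code from the patch: the names grab_pgrp_bounded, pgrp_query_fn, pgrp_wait_fn and the fake_* helpers are made up for this sketch, and the query and wait steps are passed in as callbacks (an assumption made here so the loop can be exercised without a controlling tty), whereas the patch calls tcgetpgrp() and kill(0, SIGTTIN) directly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback types, introduced only for this sketch. */
typedef pid_t (*pgrp_query_fn)(int fd); /* stands in for tcgetpgrp() */
typedef void (*pgrp_wait_fn)(void);     /* stands in for the SIGTTIN dance */

#define GRAB_MAX_TRIES 100 /* same bound the patch uses */

/*
 * Ask who owns the tty up to GRAB_MAX_TRIES times.
 * Returns 0 once the owning group equals `desired`; returns -1 if the
 * query fails (errno left as set by the query) or if the bound is hit
 * (errno set to EPERM, as in the patch).
 */
int
grab_pgrp_bounded(int fd, pid_t desired, pgrp_query_fn query,
    pgrp_wait_fn wait_turn)
{
    size_t i;

    assert(query != NULL && wait_turn != NULL);
    for (i = 0; i < GRAB_MAX_TRIES; i++) {
        pid_t pgrp = query(fd);

        if (pgrp == (pid_t)-1)
            return -1;
        if (pgrp == desired)
            return 0;
        wait_turn(); /* give the current owner a chance to let go */
    }
    errno = EPERM; /* never became the tty's process group: give up */
    return -1;
}

/* Stand-ins for a real tty, for illustration only. */
int fake_calls;

/* Reports group 1 twice, then group 7 from the third call on. */
pid_t
fake_query_eventually(int fd)
{
    (void)fd;
    return ++fake_calls >= 3 ? 7 : 1;
}

/* Always reports group 1: the case that used to spin forever. */
pid_t
fake_query_never(int fd)
{
    (void)fd;
    return 1;
}

void
fake_wait(void)
{
}
```

With the stand-ins above, asking for group 7 through fake_query_eventually succeeds on the third query, while fake_query_never makes the function give up after 100 tries with errno set to EPERM instead of looping. In a real shell the query callback would wrap tcgetpgrp() and the wait callback would do the sigaction()/kill(0, SIGTTIN) sequence shown in the patch.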