Juniper J-series upgrade to 8.4R2.3

28 09 2007

We’ve just been in the lab trying to get a router upgraded from 8.2 to the latest 8.4R2.3, and have been having some issues… 

JunOS 8.4R2.3 is just out, and we need it for a customer who is having repeated crashes of the fwdd daemon under 8.3R2.8. Here is an excerpt from the messages log file that shows fwdd crashing:

Sep 26 04:07:40.109 juniper fwdd[4567]: smpmutex_lock() called from unix context (ra = 0x808e0ef)
Sep 26 04:07:40.129 juniper fwdd[4567]: smpmutex_unlock() called from unix context (ra = 0x808e124)
Sep 26 04:07:40.129 juniper fwdd[4567]: smpmutex_lock() called from unix context (ra = 0x808e0ef)
Sep 26 04:07:40.129 juniper fwdd[4567]: smpmutex_unlock() called from unix context (ra = 0x808e124)
Sep 26 04:07:40.130 juniper fwdd[4567]: smpmutex_lock() called from unix context (ra = 0x808e0ef)
Sep 26 04:07:40.130 juniper fwdd[4567]: smpmutex_unlock() called from unix context (ra = 0x808e124)
Sep 26 04:08:15.153 juniper /kernel: psdd 1 (0xc1ff4000): exception 0xe in fwdd (pid 4567) at 0x818beea (0x818beea); killing
Sep 26 04:08:15.153 juniper /kernel: CPU=0 eip=0818beea eflags=00010206
Sep 26 04:08:15.153 juniper /kernel: eax: 00000000 ebx: 00000002 ecx: 0000003d edx: 4cbd0084
Sep 26 04:08:15.153 juniper /kernel: esi: 0000003a edi: 00000002 ebp: 4a449324 esp: 4a4492ac
Sep 26 04:08:15.153 juniper /kernel: ds: 4a44002f es: 4a44002f ss: 002f cs: 001f
Sep 26 04:08:15.153 juniper /kernel: Start of stack for thread 0xc1ff4000:
Sep 26 04:08:15.153 juniper /kernel: Frame 0: sp = 0x4a449324, pc = 0x818beea
Sep 26 04:08:15.154 juniper /kernel: Frame 1: sp = 0x4a449344, pc = 0x819974c
Sep 26 04:08:15.154 juniper /kernel: Frame 2: sp = 0x4a449384, pc = 0x81aa357
Sep 26 04:08:15.154 juniper /kernel: Frame 3: sp = 0x4a449624, pc = 0x80d0eea
Sep 26 04:08:15.154 juniper /kernel: Frame 4: sp = 0x4a449664, pc = 0x80b4584
<snip>
Sep 26 04:08:15.154 juniper /kernel: End of stack
Sep 26 04:08:15.153 juniper fwdd[4567]: --------------------------------------
Sep 26 04:08:15.153 juniper fwdd[4567]: Bus error!
Sep 26 04:08:15.153 juniper fwdd[4567]: Registers:
<snip>
Sep 26 04:08:17.490 juniper /kernel: pfe_send_failed(index 0, type 10), err=32
Sep 26 04:08:17.490 juniper /kernel: pfe_listener_disconnect: conn dropped: listener idx=0, tnpaddr=0x1, reason: socket error
Sep 26 04:08:17.548 juniper chassisd[2918]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 0 offline: Error
Sep 26 04:08:17.548 juniper chassisd[2918]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach(0)
Sep 26 04:08:17.568 juniper chassisd[2918]: CHASSISD_IPC_WRITE_ERR_NULL_ARGS: FRU has no connection arguments fru_send_msg FWDD
Sep 26 04:08:17.631 juniper init: forwarding (PID 4567) terminated by signal number 10. Core dumped!
Sep 26 04:08:17.632 juniper init: forwarding (PID 4637) started

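Not very nice. You can see the kernel catching an exception in fwdd and killing the process. It dumps the registers and stack to the log, and fwdd reports a bus error. The PFE connection then drops, chassisd takes FPC 0 offline, and init restarts the forwarding process under a new PID (the old one died on signal 10 and dumped core). While this is happening you lose your telnet session.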
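If you want to know how often this is happening on a box, the easiest line to look for is the one init writes when the forwarding process dies. Here is a rough Python sketch that does that. It assumes the log lines look exactly like the ones pasted above; the function name and the sample list are just mine for illustration, and other JunOS releases may word the message differently.

```python
import re

# Matches the line init writes when the forwarding process (fwdd) dies, e.g.
# "Sep 26 04:08:17.631 juniper init: forwarding (PID 4567) terminated by
#  signal number 10. Core dumped!"
# The pattern is based only on the log excerpt in this post.
CRASH_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+[\d:.]+)\s+(?P<host>\S+)\s+init:\s+"
    r"forwarding \(PID (?P<pid>\d+)\) terminated by signal number (?P<signal>\d+)\."
    r"(?P<core> Core dumped!)?"
)

# Three lines copied from the excerpt above, used as sample input.
SAMPLE_LINES = [
    "Sep 26 04:08:17.548 juniper chassisd[2918]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach(0)",
    "Sep 26 04:08:17.631 juniper init: forwarding (PID 4567) terminated by signal number 10. Core dumped!",
    "Sep 26 04:08:17.632 juniper init: forwarding (PID 4637) started",
]


def find_fwdd_crashes(lines):
    """Return one dict per forwarding-process termination found in the lines."""
    crashes = []
    for line in lines:
        match = CRASH_RE.match(line.strip())
        if match:
            crashes.append({
                "when": match.group("when"),
                "host": match.group("host"),
                "pid": int(match.group("pid")),
                "signal": int(match.group("signal")),
                "core_dumped": match.group("core") is not None,
            })
    return crashes


if __name__ == "__main__":
    for crash in find_fwdd_crashes(SAMPLE_LINES):
        print(crash)
```

To run it against a real file, pass the open file object instead of the sample list, since the function only needs something it can loop over line by line.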

So… Juniper kindly brought out 8.4R2.3 with a fix for this in it. (The previous version, 8.4R1, didn’t have it.) But we ran into another issue. Even a unit with a mostly blank config on it (just an IP address and root password) would not take the software. It complained that there was a configuration incompatibility and that the check-out failed for the Chassis control process.

Here’s the output found in /cf/var/log/install:

Checking compatibility with configuration
Initializing...
Verified manifest signed by PackageProduction_8_2_0
Using /var/tmp/junos-jseries-8.4R2.3-export.tgz
Checking junos requirements on /
Available space: 133919 require: 51215
Verified manifest signed by PackageProduction_8_4_0
mtree: line 57: unknown user ext
mtree: line 57: unknown user ext
Hardware Database regeneration succeeded
Validating against /config/juniper.conf.gz
<xnm:warning xmlns="http://xml.juniper.net/xnm/1.1/xnm" xmlns:xnm="http://xml.juniper.net/xnm/1.1/xnm">
<message>
Couldn't open /packages/mnt
</message>
</xnm:warning>
mgd: error: Check-out failed for Chassis control process (/usr/sbin/chassisd) without details
mgd: error: configuration check-out failed
Validation failed
WARNING: Current configuration not compatible with /var/tmp/junos-jseries-8.4R2.3-export.tgz
</output>
<package-result>1</package-result>
root@% cli
root>

Turns out this is an additional bug, documented under PR 237369. The upgrade is failing in the validation step. There’s no software fix for this at the moment, so you have to do the upgrade from the CLI and add the ‘no-validate’ option to the command:

juniper> request system software add /var/tmp/junos-jseries-8.4R2.3-domestic.tgz no-validate
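Before reaching for no-validate, it is worth confirming that the install really did fail in validation and not somewhere else. The sketch below is a small Python helper that pulls the verdict and the mgd errors out of the install log text. It only knows about the strings that appear in the output pasted above (“Validation failed” and lines starting “mgd: error:”); the function name is made up for this example, and I have no idea whether other releases word things the same way.

```python
# A few lines copied from the install log shown earlier in this post.
SAMPLE_INSTALL_LOG = """\
Hardware Database regeneration succeeded
Validating against /config/juniper.conf.gz
mgd: error: Check-out failed for Chassis control process (/usr/sbin/chassisd) without details
mgd: error: configuration check-out failed
Validation failed
WARNING: Current configuration not compatible with /var/tmp/junos-jseries-8.4R2.3-export.tgz
"""


def summarize_install_log(text):
    """Report whether validation failed and collect any mgd error messages."""
    prefix = "mgd: error:"
    errors = []
    validation_failed = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(prefix):
            # Keep only the message part after the "mgd: error:" prefix.
            errors.append(line[len(prefix):].strip())
        elif line == "Validation failed":
            validation_failed = True
    return {"validation_failed": validation_failed, "mgd_errors": errors}


if __name__ == "__main__":
    print(summarize_install_log(SAMPLE_INSTALL_LOG))
```

If validation_failed comes back true and the only errors are check-out failures like the ones here, you are probably looking at the same problem we hit.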

While JTAC were coming up with this answer, however, my colleague managed to upgrade to 8.3, and from there to 8.4.  It moaned about the config not being compatible, but worked anyway.  We loaded the customer’s config on using ‘load override’ and it complained about there being an incompatibility in one line.  We committed it, and then did a diff on the before and after configs – the only line that was different was the encrypted password for the root user.   I wonder if they changed the encryption algorithm in this new version or something… 

I’ll have to ask if these bugs affect the M and T series of routers.  Something for tomorrow morning I think.

One response

17 10 2007
DataPlumber

I’ve tried to get an answer out of JTAC about whether the validate problem affects M and T-series routers, but the guy I am speaking to doesn’t seem able to answer a straight question.

I tried to be as specific as I could be, but either he is incapable of understanding, or he’s not allowed to answer the question!

I can only assume that this affects M and T-series, so if you get it, stick no-validate on the end of the “request system software add” command.

It is probably worth checking the MD5 hash before you do this – I don’t know exactly what the validation part does, so knowing you’ve got a good package (because the MD5 hash matches) is a good thing.

If you don’t know how to do this, try this:

1. Note down the MD5 sum for the package on the Juniper website (in notepad for example)

2. Get your installation package onto a unix machine

3. Type “md5 ” followed by the package name and compare the string it spits out with what you have noted down from the Juniper website.

If your unix machine can’t find the program ‘md5’, try doing ‘/usr/sbin/md5’ instead. On Linux the equivalent command is usually ‘md5sum’.

If you don’t have a unix machine, try doing a ‘start shell’ on the Juniper router you’ve put the file on – that’ll give you a unix prompt.
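If you would rather not compare the two strings by eye, the check above can be sketched in a few lines of Python using the standard hashlib module. The function names here are just mine, and the published value is whatever you copied from the Juniper website; I have not put a real hash in because I don’t want to invent one.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large package is never held in memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published):
    """Compare a file's MD5 with the value noted down from the vendor's site."""
    # Strip whitespace and ignore case so a sloppy copy and paste still matches.
    return md5_of_file(path) == published.strip().lower()
```

You would call matches_published_md5 with the path to the package and the string from the website, and only go ahead with the no-validate install if it returns True.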
