Rigol DS2072 Lock-Ups from Network Interface

#25 Reply
Posted by motocoder on 21 Nov, 2014 16:29
Quote from: Murray on 21 Nov, 2014 15:50
You might want to check the DHCP lease time in your router. If there is traffic during the last half of the lease, the lease will be renewed. If your lease time is quite low, and you left the scope on and connected but did no network activity for half the lease time, your DHCP lease might expire, and would have to be re-negotiated. If the scope does not handle the re-negotiation properly, this might be enough, like the other network disturbances that have been tested, to fubar the scope. If you increase your lease time to more than twice the longest time you expect to leave the scope on and connected to the net with no network activity, it MIGHT reduce the lockups.

Interesting theory. I think my next step will be to set up a static IP and see if it locks up. If so, I am going to unplug everything else from the switch and see if it still happens. I also want to try another switch. If it really is switch related, I will just send this one to Rigol.

#26 Reply
Posted by motocoder on 21 Nov, 2014 21:12
The Rigol reps in Oregon have shared this thread with Rigol engineering, so I'm going to continue to record my findings here. Everyone else feel free to add any detail you might have.

This morning I swapped out the network cable and verified that the unit still locks up.
I then configured the scope with a static IP address, and turned off "Auto IP" (whatever that is). The scope still locks up.

I'm going to try a few more things over the weekend, but I think we can rule cable issues and DHCP out.

#27 Reply
Posted by Tyrian on 21 Nov, 2014 21:37
I recently got an MSO2072A scope from TEquipment a few days ago and began experiencing the same screen freezing bug. However, my scope is not connected to a LAN. We're discussing it in another thread here: https://www.eevblog.com/forum/testgear/rigol-mso2072a-problem-display-stops-updating/

I've also contacted Rigol tech support about this and have been talking to Jason (out of Ohio, I believe). I've been posting about my experience in the above thread. Have you tried resetting the FRAM during boot? The theory in the other thread is that there may be a software bug that corrupts a section of the FRAM, resulting in the screen freeze, and bumping a control knob makes the screen update again.

#28 Reply
Posted by motocoder on 21 Nov, 2014 23:05
Quote from: Tyrian on 21 Nov, 2014 21:37
I recently got an MSO2072A scope from TEquipment a few days ago and began experiencing the same screen freezing bug. However, my scope is not connected to a LAN. We're discussing it in another thread here: https://www.eevblog.com/forum/testgear/rigol-mso2072a-problem-display-stops-updating/

I've also contacted Rigol tech support about this and have been talking to Jason (out of Ohio, I believe). I've been posting about my experience in the above thread. Have you tried resetting the FRAM during boot? The theory in the other thread is that there may be a software bug that corrupts a section of the FRAM, resulting in the screen freeze, and bumping a control knob makes the screen update again.

Yes, resetting the FRAM is SOP after doing a BIOS update - I always do it. And when I got the demo unit from Rigol, this is one of the first things I did before starting testing. The scope does not lock up if it's not connected to anything.

However, note that one of the other posters on this thread mentioned that he has seen some issues with USB, so if your scope is connected via USB, you might want to try disconnecting that as well.

Thanks for the link to the other thread - I'll take a look and see if there are any additional clues there.

#29 Reply
Posted by motocoder on 22 Nov, 2014 01:20
Locks up with a static IP. I have now connected it to a 10 Mbps hub (by itself). If it locks up on the hub, the hub will be useful for Wireshark, as every device on a hub receives every packet (not so with a switch).

#30 Reply
Posted by motocoder on 22 Nov, 2014 18:01
Ok, so last night I ran through several test scenarios. I reduced the things connected to my network switch, starting with the scope by itself, and gradually adding other computers.

What I found is that the lock-up only seems to occur when my primary desktop computer (which runs Windows 8.1) is connected to the switch along with the scope. I can remove everything except the scope and that computer from the switch, and see the lock-up, but if I remove the Windows computer and add the other machines, I don't see the lock-up.

I have an old hub that someone at work loaned me. I am trying to repro the lock-up with the scope and my computer connected to that. The hub will let me use Wireshark and see all the packets coming from the scope, so I'd like to go that route if possible. If not, I can run Wireshark on my desktop and at least capture broadcast packets and packets between the desktop and the scope.

Update: no lock-up with the hub. Swapping out the switch.

#31 Reply
Posted by motocoder on 23 Nov, 2014 19:59
So yesterday I tried a number of things, and thought I had discovered what was triggering the bug. I still may have, as the scope is no longer locking up, but it's not conclusive. Here are the details:

I swapped out the switch - no change, i.e. the scope still locks up. I also determined that the scope would lock up when connected to a switch along with my Windows desktop computer and nothing else, but would not lock up even if every computer in my house other than my Windows desktop were also connected. So, clearly whatever the issue is, it was being triggered by some traffic originating from my Windows desktop.

At that point, I decided to set up Wireshark and get some network captures. Unfortunately, my switch does not support port mirroring, and I was not able to reproduce the lock-up with the scope connected to a 10 Mbps hub (the only hub that I have). So I decided to just put the scope and my desktop computer on the switch, and run Wireshark from my desktop. I captured a number of lock-ups this way. The attached capture shows one that was especially interesting. In this capture, the scope booted and locked up almost immediately; the whole process from first network packet from the scope until lock-up took less than 30 seconds.

Now note that this capture is pretty brief, and does not include one type of network packet that the scope may have been receiving, namely broadcast packets. So I did some more captures with broadcast packets included. I don't want to post those here, as the likelihood that there may be some security-related or personal info goes up and I don't have the time to review every packet to make sure that's not an issue. However, I can tell you that there were ARP broadcast packets, some broadcasts for that same scanner software, some NetBIOS broadcast packets, and some Dropbox local sync broadcast packets. I also see an HTTP exchange between the scope and the PC; specifically, it is an LXI identification query. I don't think this is causing any issues, because I also see these when the scope is stable with no lock-ups for 12+ hours. Also note that when the scope locks up, it is always right after it sends 3 ARP queries to the desktop, each asking what MAC address owns its IP address.

So, looking at this capture, and the others, I decided the only thing that looked unique was these strange scanner software packets. In WireShark they are labelled as protocol "BJNP", but they are actually UDP packets to some particular port. The UDP point-to-point packets are sent to port 8612, and the UDP broadcast packets are sent to port 8612. The BJNP broadcast packets seem to be some mechanism for the Canon printer/scanner drivers to identify Canon printers and scanners on the network. I have no idea why the driver also sends such packets directly to specific IP addresses. But in any case, I no longer own any Canon printers, so I uninstalled this driver software. And then after rebooting everything, including the scope, wonder of wonders, the lock-ups stopped. This is a plausible explanation, as this computer is the only computer on the network that still had the Canon software installed on it, and would explain why the scope only locks up when this computer is on the same LAN with it.

However, that's not the end of the story. I decided to write a small "UdpPing" program to allow Rigol to recreate this scenario without having to install old Canon printer drivers. I did that, but so far I have not been able to recreate the lock-up. It's possible that there is some network traffic that I didn't capture (multicast packets maybe?), or perhaps my program isn't faithfully recreating the exact packets that I did capture. I'm going to poke around with it a little bit later today, and will post back here if I find out more.

Trace_2__Fast_Lockup_Scope_and_Windows_PC_on_Switch.zip

#32 Reply
Posted by tautech on 23 Nov, 2014 20:19
motocoder, from now on you shall be known as Sherlock Holmes.
Great detective work.

#33 Reply
Posted by marmad on 23 Nov, 2014 20:29
Hey motocoder - you never responded to my question in the other thread, so I'll just ask it again here:

Based on the behavior of my RUU utility, there were some major changes to the LAN code in the DS2000 FW, beginning with, I believe, v.3 (although maybe it was v.2). Have you downgraded to FW 02.01.00.03 or FW 01.01.00.02 to see if the lock-ups persist?

Not that it solves the problem, but if the bug was introduced with the changes to the LAN in that specific FW#, it might help Rigol locate it more quickly.

#34 Reply
Posted by motocoder on 23 Nov, 2014 20:56
Quote from: marmad on 23 Nov, 2014 20:29
Hey motocoder - you never responded to my question in the other thread, so I'll just ask it again here:

Based on the behavior of my RUU utility, there were some major changes to the LAN code in the DS2000 FW, beginning with, I believe, v.3 (although maybe it was v.2). Have you downgraded to FW 02.01.00.03 or FW 01.01.00.02 to see if the lock-ups persist?

Not that it solves the problem, but if the bug was introduced with the changes to the LAN in that specific FW#, it might help Rigol locate it more quickly.

Hi marmad - sorry, I haven't been monitoring that thread. You raise a good point. I was running firmware 02.01.00.03, and only upgraded to 03.01.00.04 in troubleshooting this (before contacting Rigol). I haven't tried 01.01.00.02, although I think that may be what was on the scope when I got it. So it is definitely possible that the bug was introduced in the 02.* firmware.

At the moment, I'm not able to recreate the lock-up anymore, though. I just modified my UdpPing program to allow you to specify the payload and the TTL value, so that I can hopefully recreate exactly the packets that the Canon software was sending.
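
For readers who want to experiment along the same lines, here is a minimal Python sketch of a sender with a configurable payload and TTL. It is not the UdpPing program itself (which was not posted); the `udp_ping` name, its parameters, and any addresses used with it are assumptions for illustration. Only port 8612 comes from the captures described earlier in the thread.

```python
import socket

BJNP_PORT = 8612  # port seen in the captures; everything else here is assumed


def udp_ping(dest_ip, payload, ttl=64, port=BJNP_PORT, broadcast=False):
    """Send one UDP datagram with the given payload and IP TTL.

    Returns the number of payload bytes handed to the network stack.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Set the IP time-to-live for outgoing unicast datagrams.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        if broadcast:
            # Required before sending to a broadcast address.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        return sock.sendto(payload, (dest_ip, port))
    finally:
        sock.close()
```

For example, `udp_ping("192.168.1.9", b"test", ttl=1)` would send a single unicast datagram, and `udp_ping("192.168.1.255", b"test", broadcast=True)` a subnet broadcast. Both addresses are placeholders, and the payload would have to be replaced with bytes copied from a real capture to imitate the Canon traffic.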

#35 Reply
Posted by Dave Turner on 23 Nov, 2014 21:42
I've probably missed something - but if you want to use a DHCP server then just tell it that the scope's MAC address is always given the same IP - i.e. reserved. There are other solutions, but on a network this is usually a good solution.

#36 Reply
Posted by motocoder on 23 Nov, 2014 22:13
Quote from: Dave Turner on 23 Nov, 2014 21:42
I've probably missed something - but if you want to use a DHCP server then just tell it that the scope's MAC address is always given the same IP - i.e. reserved. There are other solutions, but on a network this is usually a good solution.

I'm not sure what problem it is you are trying to solve. The IP address of the scope isn't an issue, and I am aware of ways to ensure the scope always gets the same IP, with DHCP or without.

#37 Reply
Posted by motocoder on 23 Nov, 2014 22:16
Ok, I recreated the exact sequence of Canon printer driver broadcast and unicast (to the scope) packets from one of the captures I had, and this isn't recreating the lock-up. Also, I have tried to reinstall the Canon drivers, but I'm not able to get that set up correctly without actually having the Canon printer. So it seems I won't easily be able to recreate the lock-up.

At this point, I think I've donated several thousand dollars worth of engineering time to help Rigol troubleshoot this - several times the cost of the scope for sure, so I am going to just ask them again to send me my scope back.

#38 Reply
Posted by Rasz on 24 Nov, 2014 11:08
Quote from: motocoder on 23 Nov, 2014 19:59
causing any issues, because I also see these when the scope is stable with no lock-ups for 12+ hours. Also note that when the scope locks up, it is always right after it sends 3 ARP queries to the desktop, each asking what MAC address owns its IP address.

So, looking at this capture, and the others, I decided the only thing that looked unique was these strange scanner software packets. In WireShark they are labelled as protocol "BJNP", but they are actually UDP packets to some particular port. The UDP point-to-point packets are sent to port 8612, and the UDP broadcast packets are sent to port 8612. The BJNP broadcast packets seem to be some mechanism for the Canon printer/scanner drivers to identify Canon printers and scanners on the network. I have no idea why the driver also sends such packets directly to specific IP addresses. But in any case, I no longer own any Canon printers, so I uninstalled this driver software. And then after rebooting everything, including the scope, wonder of wonders, the lock-ups stopped.

This is a plausible explanation, as this computer is the only computer on the network that still had the Canon software installed on it, and would explain why the scope only locks up when this computer is on the same LAN with it.

However, that's not the end of the story. I decided to write a small "UdpPing" program to allow Rigol to recreate this scenario without having to install old Canon printer drivers. I did that, but so far I have not been able to recreate the lock-up. It's possible that there is some network traffic that I didn't capture (multicast packets maybe?), or perhaps my program isn't faithfully recreating the exact packets that I did capture. I'm going to poke around with it a little bit later today, and will post back here if I find out more.

This is why I pinpointed arp broadcast in the first place:

https://news.ycombinator.com/item?id=8565161

look at your captures again and you will find this shitty Canon software spamming mDNS crap all over your network

It's not only the type of packet, but also the amount per second.
I read a similar story on one of the bofh forums some time ago where one computer was able to start an arp flood in a 30 computer office. It started with one, but 100ms later every computer in the office (with mdns support) was spamming that shit, a 100Mbit flood on every port in the building.

edit: btw to not give Rigol stupid ideas (like charging you for shipping) - this is TOTALLY their fault, they have shit in low level network stack.

#39 Reply
Posted by Kjelt on 24 Nov, 2014 12:07
Probably just me, but looking at the Wireshark capture, I don't get it.
AFAICT the Rigol scope is on an IPv6 ...........:01:8a address and your computer on an IPv6 .........:30:8f address.
So who or what is on the 192.168.1.120 and 192.168.1.9 IPv4 addresses? One of them is probably still your computer, but what is the other address they all want to discover?

#40 Reply
Posted by motocoder on 24 Nov, 2014 14:54
Quote from: Rasz on 24 Nov, 2014 11:08
Quote from: motocoder on 23 Nov, 2014 19:59
causing any issues, because I also see these when the scope is stable with no lock-ups for 12+ hours. Also note that when the scope locks up, it is always right after it sends 3 ARP queries to the desktop, each asking what MAC address owns its IP address.

So, looking at this capture, and the others, I decided the only thing that looked unique was these strange scanner software packets. In WireShark they are labelled as protocol "BJNP", but they are actually UDP packets to some particular port. The UDP point-to-point packets are sent to port 8612, and the UDP broadcast packets are sent to port 8612. The BJNP broadcast packets seem to be some mechanism for the Canon printer/scanner drivers to identify Canon printers and scanners on the network. I have no idea why the driver also sends such packets directly to specific IP addresses. But in any case, I no longer own any Canon printers, so I uninstalled this driver software. And then after rebooting everything, including the scope, wonder of wonders, the lock-ups stopped.

This is a plausible explanation, as this computer is the only computer on the network that still had the Canon software installed on it, and would explain why the scope only locks up when this computer is on the same LAN with it.

However, that's not the end of the story. I decided to write a small "UdpPing" program to allow Rigol to recreate this scenario without having to install old Canon printer drivers. I did that, but so far I have not been able to recreate the lock-up. It's possible that there is some network traffic that I didn't capture (multicast packets maybe?), or perhaps my program isn't faithfully recreating the exact packets that I did capture. I'm going to poke around with it a little bit later today, and will post back here if I find out more.

This is why I pinpointed arp broadcast in the first place:

https://news.ycombinator.com/item?id=8565161

look at your captures again and you will find this shitty Canon software spamming mDNS crap all over your network

It's not only the type of packet, but also the amount per second.
I read a similar story on one of the bofh forums some time ago where one computer was able to start an arp flood in a 30 computer office. It started with one, but 100ms later every computer in the office (with mdns support) was spamming that shit, a 100Mbit flood on every port in the building.

edit: btw to not give Rigol stupid ideas (like charging you for shipping) - this is TOTALLY their fault, they have shit in low level network stack.

Very perceptive, Rasz! I think "arp flood" may be what happened here. I looked at the packet capture again, and actually it is my desktop computer that was sending those last 3 ARP packets to the scope right before or after the lock-up, not the scope sending them. Last night I used some software (bit-twist) to simulate that, but it didn't trigger a lock-up even when I sent hundreds of them.

Also, I didn't have capture of multicast packets enabled - only broadcast and unicast to/from my desktop - so I may have missed whatever the Canon software was doing.

Can you elaborate on any theories you might have as to how the Canon software might be triggering the issue? Now that I know how to use bit-twist to send arbitrary packets, it would be great if I could exactly reproduce the lock-up so that Rigol can start working on a fix.

#41 Reply
Posted by motocoder on 24 Nov, 2014 15:01
Quote from: Kjelt on 24 Nov, 2014 12:07
Probably just me, but looking at the Wireshark capture, I don't get it.
AFAICT the Rigol scope is on an IPv6 ...........:01:8a address and your computer on an IPv6 .........:30:8f address.
So who or what is on the 192.168.1.120 and 192.168.1.9 IPv4 addresses? One of them is probably still your computer, but what is the other address they all want to discover?

Those are MAC addresses, not IPv6 addresses.

#42 Reply
Posted by Kjelt on 24 Nov, 2014 15:31
Quote from: motocoder on 24 Nov, 2014 15:01
Those are MAC addresses, not IPv6 addresses.

OK, so the scope asks who has 192.168.1.9 but wants the reply sent to 0.0.0.0. I've never seen that; how do you explain it?

#43 Reply
Posted by motocoder on 24 Nov, 2014 15:35
Quote from: Kjelt on 24 Nov, 2014 15:31
Quote from: motocoder on 24 Nov, 2014 15:01
Those are MAC addresses, not IPv6 addresses.
OK, so the scope asks who has 192.168.1.9 but wants the reply sent to 0.0.0.0. I've never seen that; how do you explain it?

I think it is how arp works. When the scope starts up, it sends a broadcast arp request to make sure some other device doesn't have the IP address it wants to use. The arp broadcast packets are a normal thing. It is the way a device determines the ethernet mac address to send traffic to for a particular IP.

I think the unicast arp requests (from the Asustek... to RigolTec...) were actually being generated by the Canon software.

#44 Reply
Posted by Kjelt on 24 Nov, 2014 15:43
OK, well, not to rock the boat for nothing, just interested: here is an ARP Wireshark capture from me.
You can see here that the Hewlett device that asks for the IP address from the Cisco already has a valid IP address of its own.
It might well be that 0.0.0.0 is used if the device has no IP address yet.

#45 Reply
Posted by Monkeh on 24 Nov, 2014 15:50
This is a standard method used by most DHCP clients to determine if the address they're about to use is already in use on the network. It's easier to check than to just assume and try and cope with the 'hilarity' of two devices claiming the same IP.

Quote from: RFC2131
The client SHOULD perform a check on the suggested address to ensure that the address is not already in use. For example, if the client is on a network that supports ARP, the client may issue an ARP request for the suggested request. When broadcasting an ARP request for the suggested address, the client must fill in its own hardware address as the sender's hardware address, and 0 as the sender's IP address, to avoid confusing ARP caches in other hosts on the same subnet. If the network address appears to be in use, the client MUST send a DHCPDECLINE message to the server.
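
To make the quoted rule concrete, here is a rough Python sketch that builds the raw bytes of such a probe: a broadcast ARP request carrying the client's own hardware address as sender and 0.0.0.0 as sender IP. The `build_arp_probe` name and the example MAC are made up for illustration, and the sketch only constructs the frame; it does not send anything.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
ARP_REQUEST = 1


def build_arp_probe(sender_mac: bytes, target_ip: str) -> bytes:
    """Build an Ethernet frame holding an ARP probe for target_ip.

    The sender IP is all zeros so other hosts do not update their
    ARP caches with an address the sender does not own yet.
    """
    broadcast_mac = b"\xff" * 6
    eth_header = broadcast_mac + sender_mac + struct.pack("!H", ETHERTYPE_ARP)
    arp_body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # hardware address length
        4,                            # protocol address length
        ARP_REQUEST,
        sender_mac,                   # sender hardware address: our own MAC
        socket.inet_aton("0.0.0.0"),  # sender IP: zero, per the quoted text
        b"\x00" * 6,                  # target hardware address: unknown
        socket.inet_aton(target_ip),  # the address being checked
    )
    return eth_header + arp_body
```

Calling `build_arp_probe(bytes.fromhex("001122334455"), "192.168.1.9")` yields a 42-byte frame (14-byte Ethernet header plus 28-byte ARP body), which is the "who has 192.168.1.9, tell 0.0.0.0" pattern asked about earlier.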

#46 Reply
Posted by motocoder on 24 Nov, 2014 16:01
Quote from: Kjelt on 24 Nov, 2014 15:43
OK, well, not to rock the boat for nothing, just interested: here is an ARP Wireshark capture from me.
You can see here that the Hewlett device that asks for the IP address from the Cisco already has a valid IP address of its own.
It might well be that 0.0.0.0 is used if the device has no IP address yet.

There are both unicast and broadcast ARP exchanges. It looks like the unicast ones are typically between the default gateway/router and some other machine, and are just a way to refresh the cached ARP info. The broadcast messages are sent when the device sending them really has no clue who owns a particular IP address.
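
When going through a capture by hand, the distinction above can be sketched as a small classifier over raw Ethernet frames. This is only an illustration under simple assumptions (untagged Ethernet II frames, IPv4 ARP); the `classify_arp` name and the category labels are invented here and are not Wireshark terminology.

```python
import struct

BROADCAST_MAC = b"\xff" * 6


def classify_arp(frame: bytes) -> str:
    """Roughly label an Ethernet frame by the kind of ARP traffic it carries."""
    # 14-byte Ethernet header + 28-byte ARP body; VLAN tags are not handled.
    if len(frame) < 42 or frame[12:14] != b"\x08\x06":
        return "not-arp"
    dst_mac = frame[0:6]
    (oper,) = struct.unpack("!H", frame[20:22])
    sender_ip = frame[28:32]
    if oper == 2:
        return "reply"
    if oper != 1:
        return "other"
    if sender_ip == b"\x00\x00\x00\x00":
        # Sender has no address yet: the duplicate-address check at startup.
        return "probe"
    if dst_mac == BROADCAST_MAC:
        # Sender does not know who owns the target IP.
        return "broadcast-request"
    # Sender already knows the MAC and is refreshing its cache entry.
    return "unicast-request"
```

Feeding it the frames exported from a capture would give a quick count of how many requests were probes, broadcasts, or unicast refreshes around the time of a lock-up.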

#47 Reply
Posted by Rasz on 25 Nov, 2014 11:31
Quote from: Kjelt on 24 Nov, 2014 15:31
Quote from: motocoder on 24 Nov, 2014 15:01
Those are MAC addresses, not IPv6 addresses.
OK, so the scope asks who has 192.168.1.9 but wants the reply sent to 0.0.0.0. I've never seen that; how do you explain it?

https://tools.ietf.org/html/rfc5227#section-2.1.1

Quote from: motocoder on 24 Nov, 2014 14:54
I think "arp flood" may be what happened here. I looked at the packet capture again, and actually it is my desktop computer that was sending those last 3 ARP packets to the scope right before or after the lock-up, not the scope sending them. Last night I used some software (bit-twist) to simulate that, but it didn't trigger a lock-up even when I sent hundreds of them.

it probably only works if you spam with same IP scope just requested
from your brief screenshot it looks like scope replies like it already owns that particular IP _while it still waits for DHCP request reply_

Quote from: motocoder on 24 Nov, 2014 14:54
Can you elaborate on any theories you might have as to how the Canon software might be triggering the issue? Now that I know how to use bit-twist to send arbitrary packets, it would be great if I could exactly reproduce the lock-up so that Rigol can start working on a fix.

canon loves polluting local networks, quick google shows another garbage packet type
https://ask.wireshark.org/questions/5178/why-gratuitous-arps-for-0000

I never had to deal with it personally, so all I got is guesses, would be easier with a live example - too bad you fixed yours.

#48 Reply
Posted by motocoder on 25 Nov, 2014 18:19
Quote from: Rasz on 25 Nov, 2014 11:31
it probably only works if you spam with same IP scope just requested
from your brief screenshot it looks like scope replies like it already owns that particular IP _while it still waits for DHCP request reply_

Yes, I changed the IP address before replaying the packets.

#49 Reply
Posted by motocoder on 29 Nov, 2014 16:50
Got my scope back from Rigol yesterday. Since they sent no shipping notice or tracking number, I wasn't expecting it, and it sat outside in the inclement weather (fortunately under the overhang of my porch) for some indeterminate amount of time. I brought it indoors and let the condensation dry out overnight, and it seems to be working.

So in summary: Rigol had my scope for 4 weeks. There was actually nothing physically wrong with my scope; the lock-up issue I encountered is caused by a bug in their network implementation that was triggered by some network traffic on my LAN.

Positives
- The Rigol reps I communicated with were generally pleasant to deal with.
- The Rigol reps generally seemed to have an interest in determining the root cause of the bug, or at least gathering as much info as they could to assist the engineers in investigation.
Negatives
- No notification or tracking number on return shipments. My scope sat outside for some time, and could easily have been stolen or damaged by rain.
- I had to pay for shipping, not just once, but twice.
- Rigol does not seem to be set up to communicate with the customer on the status of the repair process. Plan on sending your equipment into an 8 week black hole.
- Rigol does not live up to their written warranty commitments, specifically to provide a loaner or replacement if the repair will take more than 10 days.
