Here are many reasons why I don't like Dell.
Why I hate Dell servers:
Every Dell machine which my clients have purchased and paid big money for has
caused problems. I'm not very happy with my Dell experience overall. Note that
I didn't choose Dell. I recommended against it. The organization footing the
bill chose Dell. I get to install and manage the Dells and get paid for my time.
But I would prefer to get paid for time being productive and not fighting
the hardware.
On the bright side: Future employers: I have LOADS of experience with Dell
hardware and have found workarounds for many of their warts! :)
Now, pardon my rant as I blow off some frustration:
First was the sales process. I don't want to have to haggle for a week to get a
good price. But that's what we did. And the price came down a fair bit.
Probably not as much as the time it cost us though. I don't want to have to pay
extortionate prices for RAM or hard drives either. I hate that a 6 bay hot swap
machine comes with blanks instead of drive trays. If you want more trays you
have to buy them from Dell with marked up Dell drives. I understand wanting to
only support drives known to work but tell me what model number that is so I
can get them wherever I want and give me drive trays with the machine. If the
machine has 6 drive bays it better come with 6 drive trays in those bays. It's
games like this...
We bought a memory upgrade from Dell for our 2970s to bring them up to 32G of
RAM. After I installed the RAM and rebooted, the computer said the memory
configuration was not optimal and prompted me to press F1 to continue. It would
then boot up just fine. But I can't have the servers requiring human
intervention for a reboot. So I had to figure out what the problem was. I
called Dell support and it turned out that the BIOS did not properly support
32G without a BIOS upgrade.
We were told they supported up to 32G when we bought them but it turns out the
BIOS they were shipped with didn't properly support 32G. So...that's broken at
time of purchase in my book.
Every one of our Dell servers has required a BIOS upgrade. At first the 610s
would spontaneously reboot after a couple of months in operation. They all
did it. Then I upgraded the BIOS. Now it has been at least 9 months since that
happened and I hope it is cured. Now standard practice is a BIOS upgrade right
out of the box. I really don't expect to ever have to upgrade BIOS in a server.
If I do that means it was broken when I bought it. Bugs don't appear by
themselves over time, they are there at time of shipment. Not only that but
there is mainboard BIOS firmware, DRAC/BMC firmware, and RAID controller
firmware all in need of updating. That's just too much stuff requiring
post-sale fixing.
As for the process of doing the BIOS upgrade there is room for improvement.
First, I am happy that there are Linux executables for doing this. It used to
be that only DOS binaries were distributed for stuff like this. But the process
for obtaining and executing the upgrade is rather obscure.
The first step is to download the BIOS update. I was given this url by tech
support:
http://support.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&r
T&osl=en&deviceid=11598&devlib=0&typecnt=0&vercnt=11&catid=-1&impid=-1&formatcnt
362396
Wow. That's a mess of a url. I don't like to have to download the BIN file on a
desktop or laptop and then scp the file over to the Linux server as it is
inconvenient. We don't run a web browser or any GUI desktop at all on our
servers as it is a waste of resources and not best practice. But I pretty much
need one to copy and paste that url and navigate the webpage it points to.
It would be nice if Dell provided a simple direct download link. Or at least
didn't wrap the Download button with a javascript function. If I am on my
laptop I like to right click the download link, select "Copy link location",
then paste the url into an ssh terminal on the server and pull the binary
directly down to it. Currently when I right click the download button and copy
the link I get:
javascript:downloadslink('http://ftp.us.dell.com/bios/PE2970_BIOS_LX_4.1.1_1.BIN
verDownloadManager.application?c=us&l=en&fileid=362790&fileloc=ftp://ftp.us.dell
alse','PE2970_BIOS_LX_4.1.1_1.BIN');
Ugly and unusable. However, from this I can see that the actual path to the
file is:
http://ftp.us.dell.com/bios/PE2970_BIOS_LX_4.1.1_1.BIN
So on the server I can do:
# wget http://ftp.us.dell.com/bios/PE2970_BIOS_LX_4.1.1_1.BIN
and download the file directly onto the server.
Much more convenient. I can even type that by hand without copy and paste if I
really have to.
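The hand extraction above can be scripted. A minimal sketch, assuming bash and
a grep that supports -o; extract_bin_url is a name made up for this example,
and the sample link in the usage note is shortened, not the full string the
Dell page produces.

```shell
#!/bin/bash
# extract_bin_url is a hypothetical helper, not a Dell tool.
# It prints the first http(s) URL ending in .BIN found in its argument,
# e.g. the text copied from the javascript: download button.
# It prints nothing if no such URL is present.
extract_bin_url() {
  printf '%s\n' "$1" | grep -oE "https?://[^'?]+\.BIN" | head -n 1
}
```

Usage on the server would look something like:
# url=$(extract_bin_url "javascript:downloadslink('http://ftp.us.dell.com/bios/PE2970_BIOS_LX_4.1.1_1.BIN', ...);")
# wget "$url"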
The firmware upgrade executables never work on CentOS. This is a gratuitous
limitation since it is functionally the same as RHEL. I can usually just change
one line in the shar file and make it work but I shouldn't have to.
When I execute this BIN file it produces an error indicating that it wants
another program called lockfile to be installed on the system. It took me a
while to remember this program. I had seen it before somewhere. Turns out it is
part of the procmail mail filtering program which we do not normally install
onto our servers. Most people shouldn't be installing that unless they need it
as part of a mail server. I had to install it to get the file to run.
Then I find that I also have to install compat-libstdc++-33-3.2.3-47.3.i386.rpm
but at least the BIN file gives me a useful error directing me to install it.
This is only needed for executables compiled against the old C++ library.
Moving to the newer one (why wouldn't they just use straight C for a firmware
installer?) would remove a barrier to getting the firmware update done.
This is pretty sweet:
Continue? Y/N:y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER DELL PRODUCTS WHILE
UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
.../tmp/PE2970_BIOS_LX_4.1.1_1.BIN-6001-9159/./UpdRollBack: error
while loading shared libraries: libxml2.so.2: cannot open shared
object file: No such file or directory
.
The update failed to complete
Oops...looks like it is complaining that it can't find libxml2.so.2 so I guess
there is some XML nuttiness in this firmware somewhere. Installing libxml2 with
yum resolved that.
Then the firmware update installed and I rebooted. Yay.
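Since each missing piece only shows up one failed run at a time, a pre-flight
check saves a few round trips. This is a sketch, assuming bash and a glibc
system with /sbin/ldconfig; the function names are made up, and the library
check does not look at architecture, so a 64-bit libxml2 may still not satisfy
a 32-bit updater.

```shell
#!/bin/bash
# Hypothetical pre-flight helpers; the names are made up for this sketch.

# Print each command from the argument list that is not on PATH.
# Returns 1 if anything is missing, 0 otherwise.
missing_commands() {
  local rc=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing command: $cmd"
      rc=1
    fi
  done
  return "$rc"
}

# Succeed if the dynamic linker cache lists the given library name.
# Assumes ldconfig lives at /sbin/ldconfig; does not check 32 vs 64 bit.
have_library() {
  /sbin/ldconfig -p 2>/dev/null | grep -qF "$1"
}
```

Before running the BIN file:
# missing_commands lockfile
# have_library libxml2.so.2 || echo "libxml2 not found"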
So that covers firmware.
The RAID card management tools leave MUCH to be desired as well. As far as I
can tell, the MegaCli package is the way to manage the PERC from the command
line in Linux. To work with it you have to hunt down the
MegaCli-1.01.39-0.i386.rpm tools since the tools are proprietary to LSI and
don't ship with RHEL.
[omstorage stuff is the right way to do this but that isn't clear at first]
Then you RPM install it and go looking for the software it installed. MegaCli
is rarely used. Only when setting up disks. They didn't call it megacli or
something I might remember. They called it MegaCli64 (case sensitive) which is
installed in /opt/MegaRAID/MegaCli/MegaCli64.
Then you have to figure out how to use it.
# /opt/MegaRAID/MegaCli/MegaCli64
Fatal error - Command Tool invoked with wrong parameters
hmm...ok
# /opt/MegaRAID/MegaCli/MegaCli64 --help
Invalid input at or near token -
hmmm
# /opt/MegaRAID/MegaCli/MegaCli64 -h
Whoa! This gets you a massive amount of cryptic command line options with no
explanation as to their purpose. I have pasted the output here:
http://pastebin.ca/1968565
This is their idea of "help". I'm a command line commando of 20+ years and this
scares even me! It would have been nice if they at least tried to make it work
somewhat like the Linux mdadm command or at least provided some examples of
common use cases etc. Because of the oddity of this command various people out
on the net have compiled "cheat sheets" to help poor souls like me figure out
how to use this thing:
http://tools.rapidsoft.de/perc/perc-cheat-sheet.pdf
Usually I avoid using this command and just reboot the server into the BIOS and
configure the RAID card from there but often it is not a convenient time for a
server reboot. I also avoid it because it is so complicated and one wrong
command can lose all of the data in the server. Yes there are backups which I
would really rather not have to restore.
I needed to add a couple of disks on the fly and did not want to reboot. The
command line I seemed to need and response it gave me was:
# /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0 [32:4] -a0
Adapter 0: Configured the adapter!!
Not a very reassuring response. Configured it how with what? It would be nice
if it said "Added virtual disk number 4 as a RAID 0" since that is what that
command told it to do.
Using the command:
# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL
I was able to verify that it had in fact created virtual disk number 4 as a
RAID 0. However, I didn't have a file to work with in /dev representing the
disk. The operating system simply refused to see the disk so that I could
actually do something with it. I spent some time trying to figure out why but
couldn't come up with a solution. So I called tech support.
Dell tech support people are always friendly and, thankfully, seem to be US
based. That is a big help when the tech support person and I are yelling
instructions at each other over a noisy datacenter on a mobile phone. They
don't always have the solution, though. In this case with the RAID controller
I had added a disk and was trying to make it usable/visible to the OS. The guy
first guessed that I needed to partition the disks. I explained that the disks
were not visible to the OS to be partitioned. Then he guessed at some MegaCli
commands which were not useful. Eventually I had to get off the phone and head
out for an appointment. Later I got an email explaining that he had the
solution: I needed to run partprobe. That command finds partitions. You can't
find partitions on disks which you can't see. Way off the mark. Eventually it
became more convenient to reboot the server. So that is what I did and the
disks appeared. Problem solved, sort of. Although with this hot swap stuff it
really should be possible to add disks on the fly. That's the whole point.
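For reference, the usual way to ask a Linux kernel to look for newly added
SCSI devices without a reboot is to write "- - -" (wildcards for channel,
target and LUN) to each host adapter's scan file in sysfs. Whether that would
have made this particular virtual disk appear is not confirmed by anything
above, so treat this as a sketch to try, not a known fix. rescan_scsi_hosts is
a made-up name, and the optional argument exists only so the function can be
pointed at a test directory.

```shell
#!/bin/bash
# rescan_scsi_hosts is a hypothetical helper name.
# Writing "- - -" to /sys/class/scsi_host/hostN/scan asks the kernel to
# rescan every channel, target and LUN on that host adapter. Needs root.
rescan_scsi_hosts() {
  local root=${1:-/sys/class/scsi_host} scan
  for scan in "$root"/host*/scan; do
    [ -w "$scan" ] || continue   # skip unmatched globs and unwritable files
    echo "- - -" > "$scan"
    echo "rescanned ${scan%/scan}"
  done
}
```

Run as root with no arguments, then check whether a new device shows up:
# rescan_scsi_hosts
# cat /proc/partitions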
Speaking of RAID controllers, we have a pair of identical R410s. And they BOTH
consistently produce these errors:
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
They produce these errors at a rate of around 10 per day throughout the day.
Both machines produce the exact same error. Same hex codes, etc. Identical. I
don't think it is actually a drive failing because the chances of both machines
failing at the same time in exactly the same way are slim. One of these
machines had what looked like a RAID controller crash which lost data and
didn't do our filesystems any good.
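To compare the two machines properly it helps to count the errors by code
instead of eyeballing the log. A small sketch, assuming bash and that the
messages are available as text (from dmesg, or from /var/log/messages if that
is where your syslog puts kernel messages); count_loginfo is a made-up name.

```shell
#!/bin/bash
# count_loginfo is a hypothetical helper name.
# Reads kernel log text on stdin and prints "<count> <LogInfo code>" for each
# distinct mptbase LogInfo code, most frequent first.
count_loginfo() {
  grep 'mptbase:' \
    | grep -oE 'LogInfo\(0x[0-9a-fA-F]+\)' \
    | sort | uniq -c | sort -rn \
    | awk '{print $1, $2}'
}
```

Run it on both machines and diff the results:
# dmesg | count_loginfo
# count_loginfo < /var/log/messages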
Whenever I call Dell tech support I wonder why Dell's phone system always asks
me for the long service code number instead of the service tag, which is just
the base-36 encoding (and therefore much shorter) of the service code.
Sometimes I have one but not the other on hand. They are clearly the same
thing. Lots of people have even put up little webpages (which I have used)
that will convert from one to the other for you:
http://www.google.com/search?q=dell+service+tag+converter
Why would they ever ask for or deal in the long version and make me yell it at
them over a mobile phone in a noisy datacenter?
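The conversion is simple enough to do in the shell without a webpage. A
sketch, assuming bash (it relies on bash's base#number arithmetic) and that
the tag really is a plain base-36 number as described above; tag_to_code and
code_to_tag are made-up names, and ABC1234 in the usage note is not a real
tag.

```shell
#!/bin/bash
# Hypothetical helper names. Assumes the service tag is a plain base-36
# number (digits 0-9, then A-Z), as described in the text.

# Service tag -> service code (decimal), using bash's base#number syntax.
tag_to_code() {
  local tag=$1
  case $tag in
    ''|*[!0-9A-Za-z]*) echo "not a service tag: $tag" >&2; return 1 ;;
  esac
  echo "$((36#$tag))"
}

# Service code (decimal) -> service tag.
# Leading zeros, if a tag ever had any, are not restored.
code_to_tag() {
  local n=$1 digits=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ out=
  case $n in
    ''|*[!0-9]*) echo "not a service code: $n" >&2; return 1 ;;
  esac
  n=$((10#$n))   # force decimal so a leading 0 is not read as octal
  while [ "$n" -gt 0 ]; do
    out=${digits:$((n % 36)):1}$out
    n=$((n / 36))
  done
  echo "${out:-0}"
}
```

For example:
# tag_to_code ABC1234
# code_to_tag "$(tag_to_code ABC1234)"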
Then the next person I talk to wants the service tag again even though I just
told the phone system the service code.
Then the NEXT person wants to confirm the service tag.
At least they tend to understand the ICAO phonetic alphabet so we don't have to
haggle over whether I said b, c, d, e, g, p, t, v, z or 3.
I hate those pointless bezels that come with the machines. I try not to pay the
small amount of extra money for them anymore because they just go in a pile.
These machines sit in a datacenter, not a showroom.
Apparently there are at least two different kinds of DRAC: iDRAC Enterprise and
iDRAC Express. I suspect they are exactly the same hardware, perhaps with
different licensing or firmware.
My machines have iDRAC Express. iDRAC used to be something called BMC. Not sure
why they changed the name. The iDRAC stuff is nice. It took me a while to get
around to learning how to use it but it is worthy. Reminds me of some old
systems I had worked with in the past such as Sun, HP, and even Pyramid which
had service processors. I have long awaited the day that x86 servers got this
feature.
However, it has some weird limitations and is expensive compared to the latest
stuff from Supermicro. For example, it is odd that iDRAC Enterprise supports
public key auth and Express does not. The DRAC is a little processor (MIPS or
ARM on most platforms) running Linux or Busybox. Why not support public key? We
do everything with ssh keys. Without public key auth I have another password to
worry about.
A Java applet for console in the DRAC web interface? With all of the 0-day
exploits for the JVM they want me to have the Java plugin enabled in my
browser? Why can't I just VNC? Or RDP? Tunnel it over ssl or ssh if you must.
The Java app is flaky. The JVM says "Downloading application" ...after a
couple of minutes that window will go away and be replaced by a window which
says "Unable to launch the application." It has "Ok" and "Details" as menu
options. If I click details it says "Error: Malformed reply from SOCKS server"
and a window full of XML. This happens sometimes. Restarting my browser
doesn't help. Hmm...I tunnel all of my web browsing through a SOCKS proxy with
ssh -D. I have an exception for the 10/8 network which doesn't get proxied.
Works for accessing the DRAC web interface itself. But the console Java applet
is apparently somehow trying to use the proxy and failing. If I disable the
proxy in Firefox the console applet works again. I would really rather just
VNC...
Once it is up and running, the arrow keys don't work in the DRAC Java console.
This is a real problem when navigating the BIOS or configuring things on, say,
the ESX console.
Turns out that you have to do some work to get them working:
http://ceph.github.com/sepia/drac/remote-console-keys/
http://www.anchor.com.au/blog/2011/03/evil-hack-to-make-arrow-and-sysreq-keys-work-with-a-dell-idrac-kvm-and-linux-desktop/
This last url says:
The KVM software makes a connection back to the iDRAC on the standard VNC
port (5900) (with the single use credentials that were provided to it in
the .jnlp file).
At this point, you could easily be mistaken into thinking, “Ah, VNC, that’s
got to work well right. Such a simple thing and all“. Unfortunately you
would be mightily wrong :( .
Whilst the iDRAC is using the standard VNC port, it appears that the
implementation has been somewhat customised.
So this is all based on VNC but Dell took standard VNC and fsckd with it! :(
All of the Dell DRAC SSL certs are the same with the same serial number. This
causes Firefox to freak out and not accept them. Have to delete cert8.db from
Firefox (stored cert cache) and restart Firefox as a workaround.
The virtual media functionality in the DRAC doesn't seem to work properly in
Safari. I click virtual media and the little window where I can mount the media
never shows up. Works ok in Firefox/Linux.
Sometimes the DRAC web interface gets confused and all of the menu items become
labeled "undefined". Have to clear cache and try again and it works.
I've been using the Dell DRAC console quite a bit lately to remotely install OS
etc. It has terrible stuck-key/repeat problems. Typing slowly helps but quite
often it is simply impossible to enter a 10 character password. Others have had
this problem:
http://lists.us.dell.com/pipermail/linux-poweredge/2008-January/034515.html
This is mostly a function of network latency and the fact the protocol sends
keydown/keyup messages. So if you get some latency longer than the interval
between your keydown/keyup the keyboard auto-repeat starts.
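The effect is easy to see with a little arithmetic. This is a toy model with
made-up names and numbers, not the DRAC protocol itself: the remote side
treats the key as held from the moment the keydown arrives until the keyup
arrives, so any extra delay on the keyup stretches the apparent hold time.

```shell
#!/bin/bash
# Toy model with hypothetical names and values. All times in milliseconds.
#   hold         how long the key was physically held
#   down_delay   network delay of the keydown message
#   up_delay     network delay of the keyup message
#   repeat_delay how long the remote keyboard waits before auto-repeating
#   repeat_every interval between repeats once auto-repeat has started
# Prints how many characters the remote side ends up typing.
chars_typed() {
  local hold=$1 down_delay=$2 up_delay=$3 repeat_delay=$4 repeat_every=$5
  # Apparent hold time as seen by the remote side.
  local seen=$(( hold + up_delay - down_delay ))
  if [ "$seen" -lt "$repeat_delay" ]; then
    echo 1
  else
    # One original character, one at repeat_delay, then one per interval.
    echo $(( 2 + (seen - repeat_delay) / repeat_every ))
  fi
}
```

With an 80 ms keypress, a 500 ms repeat delay and a 33 ms repeat interval
(assumed values), equal delays on both messages give one character, while a
keyup that arrives 580 ms later than the keydown did gives several:
# chars_typed 80 20 20 500 33
# chars_typed 80 20 600 500 33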
I added a Dell MD1120 disk array to an R415 with redundant external SAS
connections...BIOS complained:
Number of devices exceeded the maximum limit of the devices per quad
Please remove the extra drives and reboot system to avoid losing data
System has halted due to unsupported configuration.
Firmware upgrade fixed it. So again, it was broken when we bought it.
I was asked to evaluate the Dell H700 controller with Self-Encrypting Disks for
an organization with very serious security requirements. You key the controller
and then the controller keeps the key forever until manually cleared. So if
someone steals the server they can boot it right up and get the data. You can't
configure it to lose the key on power-off and require re-entry. I called Dell
and they took a couple of days to find the right people internally to ask and
confirmed that this is the case. Not very useful. This is only useful if the
server and the disks are separated.
I was trying to set up a new Dell R620. Got it racked, started to configure the
DRAC, couldn't ping it. Double checked everything, cabling, switch port, on the
right VLAN, went through all of the DRAC options. Noticed a little message
slyly hidden among the various config options: NIC Selection: Dedicated: A
required license is missing or expired. WTF? Dell sold us hardware (NIC on the
DRAC) which is completely useless without an additional license?
WHY!?!?!
Dell DRACs use Java and have vulnerable OpenSSL implementations, which is
driving our security auditors nuts. And we can't fix it other than to turn off
the web interface because Dell doesn't yet have a fix. Turning off the web
interface is fine as I prefer to use ipmitool and serial over LAN but a lot of
my clients don't want to do that. So they run vulnerable SSL and have Java
enabled in all of their browsers, including those of the most security
sensitive employees with the highest levels of access. "What could go wrong?"
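For anyone who does want to skip the web interface, the ipmitool route looks
roughly like this. A sketch, assuming bash, that IPMI over LAN and serial
over LAN are enabled on the DRAC, and that the password is in the
IPMI_PASSWORD environment variable; drac_ipmi is a made-up wrapper name and
the address and user in the usage note are placeholders.

```shell
#!/bin/bash
# drac_ipmi is a hypothetical wrapper name; host and user are placeholders.
# -I lanplus selects IPMI v2.0 over LAN, -E reads the password from the
# IPMI_PASSWORD environment variable so it stays off the command line.
# Set DRY_RUN=1 to print the command instead of running it.
drac_ipmi() {
  local host=$1 user=$2
  shift 2
  local cmd=(ipmitool -I lanplus -H "$host" -U "$user" -E "$@")
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Checking power state and attaching to the serial console:
# drac_ipmi 10.0.0.50 root chassis power status
# drac_ipmi 10.0.0.50 root sol activate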
My Supermicro gear is SO much simpler. I've never upgraded BIOS or had IPMI or
RAID problems on any of them. It just works.