Friday, October 24, 2014

ESXi 5.5 support for CSR 10.5

I've been testing CSR 10.5 (UCM 10.5, UCCX 10.5) on ESXi 5.5 U2.

I kept running into problems I've never seen before, such as the lovely VMware "pink screen" (Pink Screen of Death) and consistently high CPU usage on UCM and UCCX.

I couldn't find a pattern to the pink screens.  But they happened several times, which I had never seen in the past 8 years with Cisco UC on VMware.

The other issue is high CPU on UCM and UCCX.  CPU usage sat at 99% so consistently that I couldn't even log into the CLI.

Just FYI, the server is an HP DL380 G5 with 32G of RAM and 8x146G SAS drives (a.k.a. Cisco MCS-7845-H2).

I did some research online.  The culprit seems to be the Cisco OVA file.  Somehow the OVA works fine on ESXi 5.0 but not on ESXi 5.5.  The solution is NOT to use the OVA.  Instead of importing the OVA, I manually created the virtual machine using the specifications from the OVA.

CUCM:
1 CPU 1 Core
4G RAM
80G HDD
LSI Logic Parallel
VMXNET 3 NIC

UCCX:
1 CPU 2 Core
8G RAM
146G HDD
LSI Logic Parallel
VMXNET 3 NIC

Things seem to be much better now.  CPU is around 10%.  Maybe it's just me.  But if you're running into the same issue, it's worth trying.  You don't have to reinstall the VM.  Just create a VM from scratch (without using the OVA).  Then point the new VM's hard disk at the existing VM's virtual disk.

This is the CPU utilization with the freshly created OVA.  After the system "warmed up" (about 30 minutes), utilization dropped from 99% to 10%.



Some reference links:
http://ciscocollab.wordpress.com/2014/01/28/esxi-5-5-support-latest-information/

http://docwiki.cisco.com/wiki/Unified_Communications_in_a_Virtualized_Environment

https://communities.vmware.com/thread/459962



Tuesday, September 30, 2014

"Cloud" device in IOU Web

I've been using IOU Web for network emulation.

The "Cloud" device is the bridge between internal devices (such as routers within IOU) and external devices (such as PCs, a real/virtual router outside of IOU, etc.).

I'm not going to get into the details of how to set up VMware network or IOU.  There are plenty of documents online about that.

What I'm going to share is the solution to a weird problem.

I wanted to build a simple lab as shown below.  Two LAN segments are connected via two routers back-to-back.


The NETMAP file and device config are as below.


Pretty straightforward, right?  But the problem is - I cannot turn on device 1 (LAN1).  Notice that device "LAN1" stays red below, which means it's off.

I scratched my head for quite a while.  I tried tweaking the parameters, device ID, naming, IOU host, and VMware Network Editor.  To no avail.

Then I looked at the logs and noticed the following:

Why did it ask me to check the NETMAP file?  I don't see any error there.  What is an "instance"?  Why is it not found?

After a little bit of research, I realized "instance" is the same as "device".  As shown in the diagram above, we have four instances - 1, 2, 3 and 4.

We have a problem with instance 1 (LAN1), which connects to (references) instance 2 (R1).  If the system is complaining about "instance not found", it can only be 1 or 2.

I also noticed that instance 4 (LAN2) always works.  What's the difference between 1 and 4?

It turns out that in the NETMAP file (connection definition), the "cloud" device cannot be listed first on a line.  The "correct" NETMAP should be written like this:

Notice that instead of "1:0/0 2:0/0", I swapped them and made it "2:0/0 1:0/0".  Then try to start the LAN1 device.  There we go:
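The swap above can be applied to a whole NETMAP in one pass.  This is only a minimal sketch in Python, not anything IOU Web provides: it assumes each connection line is two whitespace-separated endpoints in the "instance:slot/port" form shown above, and that you already know which instance IDs are cloud devices.  The function name, the cloud_ids parameter, and the second and third sample lines are made up for illustration.

```python
def put_cloud_last(netmap_text, cloud_ids):
    """Reorder two-endpoint NETMAP lines so a cloud instance is never listed first."""
    fixed = []
    for line in netmap_text.splitlines():
        endpoints = line.split()
        if len(endpoints) == 2:
            # The instance ID is everything before the first colon.
            first_id = endpoints[0].split(":", 1)[0]
            second_id = endpoints[1].split(":", 1)[0]
            if first_id in cloud_ids and second_id not in cloud_ids:
                endpoints.reverse()
            fixed.append(" ".join(endpoints))
        else:
            # Leave anything that is not a simple two-endpoint line untouched.
            fixed.append(line)
    return "\n".join(fixed)


# Instances 1 (LAN1) and 4 (LAN2) are the cloud devices in this lab.
# Only the first line is taken from the post; the others are hypothetical.
netmap = "1:0/0 2:0/0\n2:0/1 3:0/1\n3:0/0 4:0/0"
print(put_cloud_last(netmap, cloud_ids={"1", "4"}))
```

Only the first line changes; the LAN2 line already has the cloud instance second, which matches the observation that instance 4 always worked.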

This seems to be a software bug.  But the point is - a good engineer should be able to recognize the pattern from the symptoms, perform deductive reasoning, and propose a possible solution.  :)

Tuesday, September 23, 2014

UC 10.5, ESXi 5.5U2, DL380 G5

My home lab has been collecting dust for a while.  During the weekend, I wanted to refresh it with the latest and greatest, which means:

1) Upgrade the server (MCS7845-H2 a.k.a. HP DL380 G5) BIOS and firmware.
2) Upgrade VMware ESXi 5.0 to 5.5U2.
3) Upgrade UC 7.0 to UC 10.5.

It turned out that upgrading a system that's been collecting dust is VERY different from upgrading a system that's been up and running.


First of all, the system wouldn't boot.  It just gave me long beeps, and the "Internal Health" and "External Health" LEDs were both red.  Pulling all the memory modules out and reseating them solved the problem.

Then the iLO configuration seemed to be lost due to the low power level of the system battery.  I couldn't log into iLO at all (the 'default password' is system specific with unique numbers).  Setting the "System Maintenance Switch" S1 to "On" bypasses the iLO password.


When trying to upgrade to ESXi 5.5 U2, I got the following error:


I know what it is.  But how could this not be enabled when I had ESXi 5.0 running on it before?  Maybe it's also due to the motherboard battery?  Anyway, go into the BIOS and enable "No-Execute Memory Protection".

After the ESXi upgrade, I noticed that VMware is pushing users to move from the native client (based on C#) to the "Web Client" (based on Adobe Flash).  The initiative is to move from a "fat client" to a "thin client" so all new features can be hosted on the vCenter server.  You may still use the native client, but some features will be missing, including ones as basic as editing the settings of a version 10 VM.




In order to use the "Web Client", you'll have to set up a vCenter server.  Also, to view VM console from web browser, you'll need to install a plug-in, which doesn't work with Internet Explorer (as of today).

Installing UCM 10.5 took extremely long (> 10 hours).  Further investigation revealed that the array controller battery had died.  Without a battery, the array controller disables its cache, which makes it very, very slow on a RAID5 (slower than my laptop).

I have multiple options:

Option 1: Order one from eBay.  It's not expensive (~ $12 a piece).  The problem is - these batteries are obsolete.  Thus the ones on eBay are all used ones, manufactured a couple of years ago.  Who knows how long they'll last.


Option 2: Make my own battery like this: http://opensource.wrenhill.com/?p=63.  Then I can use cheap AA or AAA batteries instead of buying proprietary ones.

Neither of the above options is quick enough for me.  Thus I choose...

Option 3: "Enable Cache Without Battery".

To do this, you'll need ACU (Array Configuration Utility).  You can do it with the ROM-based interface (BIOS).

With VMware ESXi, the easiest way is to download the "offline ACU", which is a CD you boot from.  Then configure the array controller from there.


For a RAID, it's the write operation that takes more time.  Thus you want to make sure the write cache is not zero.


Last but not least, download the HP SPP DVD to update all firmware and the BIOS.

P.S. DHCP doesn't work on UCM 10.5 in case you want to use UCM as a DHCP server.  https://supportforums.cisco.com/discussion/12224526/cucm-105-dhcp-not-working

Monday, August 11, 2014

Network Engineer Should Know A Little Bit Scripting and Excel

I was working on a network migration project for a large enterprise.  They are migrating their Catalyst 6509 network to Nexus (7ks, 5ks, 2ks).

Part of the migration is to move hundreds (if not thousands) of servers from 6509 switches to Nexus 2Ks.

In an ideal world, it would be as easy as copying the interface configuration from the 6509 and pasting it into the N5K (where the N2Ks are homed).  But we don't live in an ideal world.

The challenges we are facing are:

1) There are many locally significant VLANs due to poor network design, which means VLAN 100 on the legacy switch may or may not be the same VLAN 100 on the new switch.  Thus you cannot just blindly copy the "switchport access vlan 100" command from the legacy switch and paste it into the new switch.  We might have to create an L2 trunk from the legacy switch to the new switch.  We might have to create new VLANs and SVIs.

2) Even if the VLANs are perfectly fine, copying and pasting the configuration for hundreds of ports is still tedious work and prone to human error.  Some Catalyst commands need to be translated into NX-OS commands.

3) Port-mapping is another process prone to human error.  The cabling team might tell you the cable from Catalyst-Switch-23 port G3/27 is going to be moved to FEX-Switch-19 port 11.  If the cabling team fat-fingered the FEX port number, the network team could overwrite a FEX port that is currently in use and cause an outage.  Sure, you may review the FEX port before applying the changes.  But again, reviewing hundreds of ports is tedious work.

4) Due to the project schedule, the cabling team has to build the port-mapping even before the FEXes are online on the N5Ks.  Thus they reference each FEX by its grid location (e.g. "AB23") rather than by its "FEX number" on the N5K (e.g. "Ethernet101").  How do we build the configuration script when the mapping table references grid locations?

Solution:

A spreadsheet is a very useful tool because:
  • (Almost) everyone has a spreadsheet application on their computer (Microsoft Excel)
  • A spreadsheet makes it easy to enter and format data, even if the user is not very computer savvy (such as the cable guys)
  • Formulas can be used to validate data and generate desired results

I asked the server team to provide us a spreadsheet with the servers they want to migrate in the first phase.  Each row of the spreadsheet contains the server IP address, subnet mask, default gateway, current switch name, and the switch port the server is connected to.

I wrote a VB script to format the "show run" output from the switches into an Excel spreadsheet with switch name, switch port, and interface configuration.
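The VB script itself isn't shown here.  As a rough illustration of the same idea in Python (not the original script), the sketch below splits "show run" text into one row per interface.  The function name and the sample config are made up for the example, and it assumes interface sub-commands are indented by at least one space, as IOS prints them.

```python
def interfaces_from_show_run(switch_name, show_run_text):
    """Return (switch, port, config) rows, one per interface block."""
    rows = []
    port = None
    config_lines = []
    for line in show_run_text.splitlines():
        if line.startswith("interface "):
            # A new interface starts; flush the previous one if still open.
            if port is not None:
                rows.append((switch_name, port, "\n".join(config_lines)))
            port = line[len("interface "):].strip()
            config_lines = []
        elif port is not None:
            if line.startswith(" "):
                config_lines.append(line.strip())
            else:
                # "!" or any other unindented line ends the interface block.
                rows.append((switch_name, port, "\n".join(config_lines)))
                port = None
    if port is not None:
        rows.append((switch_name, port, "\n".join(config_lines)))
    return rows


# Hypothetical sample output, for illustration only.
sample = """hostname Catalyst-Switch-23
!
interface GigabitEthernet3/27
 description server-a
 switchport access vlan 100
!
interface GigabitEthernet3/28
 shutdown
!
line vty 0 4
"""

for row in interfaces_from_show_run("Catalyst-Switch-23", sample):
    print(row)
```

Each tuple maps onto one spreadsheet row (switch name, switch port, interface configuration); writing the rows out as CSV would let Excel open them directly.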



By cross-referencing the server team's spreadsheet and the "show run" spreadsheet (done by computer of course), I have a new spreadsheet that tells me which VLANs and default GWs are required by the servers.  I review the configuration on the new switches.  If the VLANs or default GWs are not ready, I submit a change request to create them.



This is just the preparation stage.  We haven't got to the FEX script stage yet.

Next is to build a script that translates the Catalyst commands into NX-OS commands in the "show run" spreadsheet.  (You may also do "find/replace".  But IMO, scripting is more flexible.)
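To give a feel for what such a translation script does, here is a small Python sketch (again, not the original VB script).  The rule table is deliberately tiny and illustrative: the portfast-to-"port type edge" rewrite and dropping "switchport trunk encapsulation dot1q" are common Catalyst-to-NX-OS changes, but verify every rule against your own NX-OS release before using it.  The names TRANSLATIONS and translate_interface_config are made up for the example.

```python
# Illustrative rules only; verify each one against your own NX-OS release.
# A value of None means "drop this line on the Nexus side".
TRANSLATIONS = {
    "spanning-tree portfast": "spanning-tree port type edge",
    "spanning-tree portfast trunk": "spanning-tree port type edge trunk",
    "switchport trunk encapsulation dot1q": None,
}


def translate_interface_config(config_lines):
    """Translate Catalyst interface sub-commands using exact-match rules."""
    translated = []
    for line in config_lines:
        command = line.strip()
        if command in TRANSLATIONS:
            replacement = TRANSLATIONS[command]
            if replacement is not None:
                translated.append(replacement)
        else:
            # Commands without a rule are passed through unchanged.
            translated.append(command)
    return translated


catalyst = [
    "switchport trunk encapsulation dot1q",
    "switchport mode trunk",
    "spanning-tree portfast trunk",
]
print(translate_interface_config(catalyst))
```

Compared with find/replace, a rule table like this keeps all the translations in one place and makes it easy to add a rule that deletes a command outright.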

Next is to use a formula to translate the FEX grid locations into N5K FEX numbers (i.e. from "AB23" to "Ethernet101").  Since we have more than one pair of N5Ks, this can't be done by a simple "find/replace".  E.g. "AB23" corresponds to "Ethernet101" on the first pair of N5Ks.  However, "CD45" corresponds to "Ethernet101" on the 2nd pair of N5Ks.  The Excel VLOOKUP function can achieve this.
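The lookup has to be keyed on both the N5K pair and the grid location, since the same FEX number repeats across pairs.  A minimal Python sketch of that idea follows; it is an analogue of the spreadsheet lookup, not the actual workbook.  The pair names "N5K-Pair-1" and "N5K-Pair-2" and the function name are hypothetical, and the "/1/" in the result assumes the usual FEX host-interface form of chassis/slot/port.

```python
# Hypothetical mapping table: (N5K pair, grid location) -> FEX chassis prefix.
FEX_BY_GRID = {
    ("N5K-Pair-1", "AB23"): "Ethernet101",
    ("N5K-Pair-2", "CD45"): "Ethernet101",
}


def fex_interface(n5k_pair, grid, port):
    """Translate a grid location and port into an N5K FEX interface name."""
    prefix = FEX_BY_GRID.get((n5k_pair, grid))
    if prefix is None:
        # Unknown pair/grid combination: flag it instead of guessing.
        return "ERR"
    return f"{prefix}/1/{port}"


print(fex_interface("N5K-Pair-1", "AB23", 11))
print(fex_interface("N5K-Pair-1", "CD45", 11))
```

The second call returns "ERR" because "CD45" belongs to the other pair, which is exactly the kind of mismatch a plain find/replace would miss.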

Next is to use a formula to build the FEX interface configuration.  As we need to look up both switch name and port number, Excel INDEX function is used.

Last but not least, we also need to factor in human error.

1) For each server on the spreadsheet, we should have the old switch name, old port number, new switch name, and new port number.  We cannot migrate the server if any of those is missing.  I built a column to validate this.  If something is missing, the value on the corresponding row will be 'ERR'.  Then I can filter all 'ERR' rows by this column.

2) For each port we're migrating, there should be no existing config on the new switch (FEX).  If there's existing config, we might have a conflict.  I built another column to validate this.  Again, it'll generate 'ERR' if a port is already configured.  Then I can filter all 'ERR' rows by this column.
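The two validation columns can be sketched in Python as two small checks.  This is an analogue of the spreadsheet formulas, not the formulas themselves; the field names, function names, and sample rows are made up for illustration.

```python
REQUIRED_FIELDS = ("old_switch", "old_port", "new_switch", "new_port")


def missing_data_check(row):
    """'ERR' if any of the four required fields is empty or absent."""
    if any(not row.get(field) for field in REQUIRED_FIELDS):
        return "ERR"
    return "OK"


def port_conflict_check(row, configured_ports):
    """'ERR' if the target FEX port already carries configuration."""
    target = (row.get("new_switch"), row.get("new_port"))
    return "ERR" if target in configured_ports else "OK"


# Hypothetical data: FEX-Switch-19 port 11 is already in use.
configured_ports = {("FEX-Switch-19", "11")}
rows = [
    {"old_switch": "Catalyst-Switch-23", "old_port": "G3/27",
     "new_switch": "FEX-Switch-19", "new_port": "11"},
    {"old_switch": "Catalyst-Switch-23", "old_port": "G3/28",
     "new_switch": "FEX-Switch-19", "new_port": ""},
]
for row in rows:
    print(missing_data_check(row), port_conflict_check(row, configured_ports))
```

The first row is complete but collides with an in-use port; the second row is missing its new port number.  Filtering on either column surfaces both problems before any config is pushed.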

In summary, with VB script and spreadsheet formulas, I save 95% of the time and lower the risk of human errors.


Monday, June 9, 2014

Put a text file on router flash without file transfer

Say, you want to put a text-based file on a router's flash memory.  It could be a license file, a config file, or some scripts.

The 'regular' way is to use TFTP/FTP to transfer the file.  But it could be a problem in some circumstances.  For example:

1) You're accessing the router through a terminal server (console port).  There's no network connectivity between your PC and the router.
2) Firewall/security policy prevents TFTP/FTP from happening.

It would be great if Cisco IOS had a 'notepad' (or 'vi') so we could create/edit the file from the IOS CLI.  But it doesn't.

Fortunately, Cisco IOS has tclsh.  You may use tclsh to create a file in flash memory and write some text to it.

Router#tclsh
Router(tcl)#puts [open "flash:script.txt" w+] "Some sample text"
Router(tcl)#tclquit

Router#dir flash:
Directory of flash:/
2 -rwx 2072 Jan 9 2014 10:24:23 -06:00 multiple-fs
3 -rwx 676 Feb 28 1993 18:01:35 -06:00 vlan.dat
4 -rwx 3570 Jan 9 2014 10:24:23 -06:00 private-config.text
5 -rwx 16 Jun 9 2014 09:34:35 -05:00 script.txt
6 drwx 192 Feb 28 1993 18:06:36 -06:00 c2960-lanbasek9-mz.122-55.SE7
562 -rwx 7340 Jan 9 2014 10:24:23 -06:00 config.text

32514048 bytes total (18987520 bytes free)

Router#more flash:script.txt
Some sample text

Router#


What if you want to create a file with multiple lines?  Just use '\n' in place of each line break.  For example:

Router(tcl)#puts [open "flash:script.txt" w+] "Line 1 \n Line 2 \n Line 3"
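For longer files, typing the escaped one-liner by hand gets error-prone.  As a convenience, here is a small Python sketch that runs on your PC and builds the same style of puts command from a list of lines, ready to paste into the tclsh prompt.  The function name is made up, and the escaping of backslash, double quote, dollar sign, and square brackets is my own precaution for Tcl double-quoted strings rather than something from the example above.

```python
def tclsh_puts_command(filename, lines):
    """Build a one-line tclsh 'puts' command that writes lines to flash:filename."""
    def escape(text):
        # Backslash must be escaped first so later escapes are not doubled.
        for ch in ("\\", '"', "$", "[", "]"):
            text = text.replace(ch, "\\" + ch)
        return text

    # Join with a literal backslash-n, which Tcl turns into a newline.
    body = "\\n".join(escape(line) for line in lines)
    return f'puts [open "flash:{filename}" w+] "{body}"'


print(tclsh_puts_command("script.txt", ["Line 1", "Line 2", "Line 3"]))
```

Paste the printed command at the Router(tcl)# prompt.  Unlike the hand-typed example, this version puts no spaces around the line breaks, so the lines in the file don't pick up leading or trailing spaces.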

Hope this helps!


Thursday, March 6, 2014

Build a $30 Wireless Lab

One of my recent projects involves quite a lot of wireless LAN work.  So I felt the urge to build a home lab.

To build a wireless LAN lab, you need at least two things - a WLC (Wireless LAN Controller) and some compatible APs (Access Points).

WLC was easy since you may download the virtual WLC (vWLC) software from cisco.com and throw it on VMware.

It's not that easy when it comes to APs.  There are so many different models from Cisco.  I wanted one that I could test most (if not all) of the features with, without costing me a fortune.  After some research (both on cisco.com and eBay.com), I decided the 1242AG was the one.  This is a not-so-old AP that supports 802.11a/b/g and many enterprise WLAN features (such as FlexConnect).  Most importantly, it's pretty affordable.  I got two for $30 (free shipping) from eBay.  I ordered two in case I need to test the "roaming" feature.

It looks like this:



Two things to be aware of:
1) Make sure to order one with antennas.  Otherwise it'll cost you some extra bucks.
2) They are mostly POE.  So you'll need a POE switch or power adapter.  You may get a cheap POE switch for less than $20.  But those switches won't support VLAN trunking, just FYI.

Luckily I still have my 3750G POE switch sitting around (from my CCIE voice lab).  Now I have to design the network.

In case you don't know, in real-life enterprise WLAN, they usually use DHCP option 43 to deliver the WLC IP address to APs.  I'd like to do the same in my lab.

But my Linksys router doesn't have the capability to configure DHCP options.  Thus I need to set up another DHCP server.  How can I set up a secondary DHCP server without interfering with the primary one?  The answer is to put them into different VLANs/subnets.

Here's my network design:


My Linksys home router connects to 3750 switch VLAN 1.  The two APs connect to 3750 switch VLAN 3.

3750 configuration:
ip dhcp excluded-address 192.168.3.1 192.168.3.10
!
ip dhcp pool Wireless-Lab
   network 192.168.3.0 255.255.255.0
   default-router 192.168.3.1
   option 43 hex f104.c0a8.0216
!
interface Vlan1
 ip address 192.168.2.1 255.255.255.0
!
interface Vlan3
 ip address 192.168.3.1 255.255.255.0
!
ip route 0.0.0.0 0.0.0.0 192.168.2.100
!
interface GigabitEthernet1/0/1
 description Linksys Router
!
interface GigabitEthernet1/0/2
 description AP-1
 switchport access vlan 3
!
interface GigabitEthernet1/0/3
 description AP-2
 switchport access vlan 3

Linksys configuration:

Now you should be able to ping from home PC (VLAN1) to VLAN 3 and vice versa.

On the vWLC virtual machine, I set the NIC to bridged networking so I can configure a static IP in my home network segment (I used 192.168.2.22).

Now you should be able to open a web page to the vWLC management portal.  Also, you should be able to ping from the vWLC (192.168.2.22) to VLAN3 (192.168.3.1) and vice versa.
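The "option 43 hex f104.c0a8.0216" line in the 3750 DHCP pool is not arbitrary.  For lightweight APs the value is a type byte (0xf1), a length byte (4 bytes per controller), and then the controller IP addresses in hex; c0a8.0216 is 192.168.2.22, the vWLC.  Here is a small Python sketch for working out the string for other addresses; the function name is made up for the example.

```python
import ipaddress


def option43_hex(wlc_ips):
    """Build the IOS 'option 43 hex' value for a list of WLC IPv4 addresses."""
    payload = b"".join(ipaddress.IPv4Address(ip).packed for ip in wlc_ips)
    # Type 0xf1, then the payload length in bytes (4 per controller).
    raw = bytes([0xF1, len(payload)]) + payload
    digits = raw.hex()
    # IOS shows hex strings in dotted groups of four digits.
    return ".".join(digits[i:i + 4] for i in range(0, len(digits), 4))


print(option43_hex(["192.168.2.22"]))  # f104.c0a8.0216
```

With a second controller the length byte becomes 08 and the second address is simply appended.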

In theory, when I plug the APs to the switch, they should:
1) Power up
2) Get their IP address and the vWLC's IP address (via option 43 from DHCP)
3) Join the WLC

Well, not surprisingly, they didn't work as desired.  (If they did, there wouldn't be much value in CCIEs.)

As a WLAN newbie, I read the documents, turned on debugs, captured error messages, and posted questions on the Cisco support forum.  After spending quite some time on troubleshooting, I was advised to upgrade the IOS (does that sound familiar?).

There are many different software images, tools, and procedures for AP upgrades:
  • Autonomous vs. Lightweight vs. Recovery
  • TFTP vs. Upgrade Tool
  • etc.

After much trial and error, here are my conclusions:
1) Upgrade to the latest IOS version before you troubleshoot
2) All you need is a TFTP server.  Don't use the "upgrade tool"

High-level recovery (upgrade) process:
1) When the AP boots into recovery mode, it'll set its own IP address to 10.0.0.1 and search for a TFTP server in the range of 10.0.0.2 - 10.0.0.30.
2) If it finds one, it'll try to download the "default" image.  The file name of the "default" image depends on the AP model.  For the 1242AG, the default image file name is "c1240-k9w7-tar.default".
3) If the above file is found on the TFTP server, the AP will download and install it, then reboot with that image.

Now that you have a high-level view, let's talk about the details and gotchas.

1) How to put an AP into recovery mode
Power off the AP.  Hold the "mode" button.  Plug in the power (POE or power adapter).  Now the status LED will be orange.  Keep holding the button for about 30 seconds.  You'll see the status LED turn purple.  That means the AP is in recovery mode.  You may release the button.

2) What TFTP server to use
You need a TFTP server that can customize the timeout threshold.  Cisco recommends 30 seconds timeout.  I set it to 60 just in case.

3) What IP address to configure for the TFTP server
You may use any IP in the range of 10.0.0.2 - 10.0.0.30.  I normally use 10.0.0.2.  If you get an "IP Conflict" message, just pick another one.

4) What IOS image should I download
There are three different IOS images you can download:
Autonomous Image (e.g. c1240-k9w7-tar.124-25d.JA2.tar)
Lightweight Image (e.g. c1240-k9w8-tar.124-25e.JAO3.tar)
Recovery Image (e.g. c1240-rcvk9w8-tar.124-25e.JAO3.tar)

Your ultimate goal is to upgrade to the latest lightweight image (that's the image that works with a WLC).  But you might need to flash the AP with other images first in some situations (e.g. when your AP has very old firmware).

When an AP joins a WLC, it'll compare its IOS version with the one on the WLC.  If there's any discrepancy, it'll download and use the one from the WLC.  This is similar to how IP phones download firmware from CallManager during registration.

Because of that, it's recommended to put the recovery image on the AP in recovery mode.  The recovery image is a small-footprint image that boots up the AP and provides enough network function for the AP to download the latest IOS from the WLC.

5) How do I make the AP take the image I specified?

Remember that AP will only take a "default" image with specific file name in recovery mode.  If you want AP to take the image, you'll need to rename it to the specific file name.  See this link for naming conventions: http://www.cisco.com/c/en/us/td/docs/wireless/access_point/conversion/lwapp/upgrade/guide/lwapnote.html#wp160918

Be aware that Windows normally hides file extensions.  You need to configure Windows Explorer to show file extensions so you can name the file correctly.

For example, say you want to rename c1240-rcvk9w8-tar.124-25e.JAO3.tar to c1240-k9w7-tar.default.  By default, Windows Explorer will display "c1240-rcvk9w8-tar.124-25e.JAO3" as the file name.  If you rename it to "c1240-k9w7-tar.default" in Windows Explorer, the file name actually becomes "c1240-k9w7-tar.default.tar", which is NOT correct.
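One way to sidestep the hidden-extension trap is to let a script do the copy, since a script always works with the full file name.  A minimal Python sketch, assuming the downloaded image sits somewhere on your PC and you know your TFTP server's root folder; the function name is made up, and the demo uses a temporary folder and a placeholder file instead of a real image.

```python
import shutil
import tempfile
from pathlib import Path


def stage_default_image(image_path, tftp_root, default_name="c1240-k9w7-tar.default"):
    """Copy an AP image into the TFTP root under the exact 'default' file name."""
    target = Path(tftp_root) / default_name
    shutil.copyfile(image_path, target)
    return target


# Demo with a placeholder file in a temporary folder standing in for the TFTP root.
with tempfile.TemporaryDirectory() as tftp_root:
    downloaded = Path(tftp_root) / "c1240-rcvk9w8-tar.124-25e.JAO3.tar"
    downloaded.write_bytes(b"placeholder, not a real image")
    staged = stage_default_image(downloaded, tftp_root)
    staged_exists = staged.exists()
    print(staged.name)
```

Copying rather than renaming keeps the original download intact, so you can tell later which image you actually staged.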

If the AP successfully joins a WLC, you'll see something like this:

For troubleshooting, take a look at http://www.cisco.com/c/en/us/support/docs/wireless/4400-series-wireless-lan-controllers/99948-lap-notjoin-wlc-tshoot.html

Enjoy your $30 wireless lab.  :)