Friday, February 27, 2009

Phone reset issue

Phone resets are among the most common and most nasty issues (especially when they're "sporadic" or "intermittent").

Please note the difference between reset and re-register.

If the phone loses its connection with CUCM, it'll try to re-register. "Lost connection" means the phone missed three keepalives from CUCM in a row. By default, those keepalives are sent every 30 seconds, so the phone gives up after roughly 90 seconds. You can verify that from the "Cisco CallManager" trace. If CUCM sent the keepalives but the phone didn't receive them, it's usually a network issue.

Reset usually happens when the phone loses its IP address. In that case, the phone needs to go through a reset process to acquire a new IP address. This is usually a DHCP (server) problem.

When the DHCP client reaches the half-life of its lease, it'll try to renew the lease with the DHCP server. For example, if the DHCP lease is 72 hours, the client will try to renew at 36 hours. Normally, the DHCP server agrees to renew, so the client keeps its IP address.

If the DHCP server explicitly refuses the renewal, the DHCP client has to release the IP. This is unusual and probably points to a problem on the DHCP server.

In the phone console log, you would see something like this:

NOT 08:11:10.854439 DHCP: Restart - delay = 0
NOT 08:11:10.866112 DHCP: Sending Release...
NOT 08:11:10.894059 DHCP: dhcpSendReq: status 0x12300000
NOT 08:11:10.894946 DHCP: Sending Request...
NOT 08:11:10.899614 DHCP: NAK received
NOT 08:11:10.901451 DHCP: clear info - IP = 10.2.16.37, state = 2
NOT 08:11:10.902400 DHCP: Sending Release...

"NAK received" means the DHCP server refused to renew the lease.

Next time you see a phone reset periodically (say, every 36 hours), check the DHCP lease time. If the cycle matches the "half-life", it's most likely a DHCP server issue.
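If you have a saved console log and a rough idea of how often the phone resets, the two checks above can be scripted. This is only a minimal sketch: the function names and the 30-minute tolerance are my own choices, and the regular expressions assume the log lines look exactly like the sample in this post.

```python
import re
from datetime import timedelta

# Sample lines copied from the console log in this post.
SAMPLE_LOG = """\
NOT 08:11:10.894946 DHCP: Sending Request...
NOT 08:11:10.899614 DHCP: NAK received
NOT 08:11:10.901451 DHCP: clear info - IP = 10.2.16.37, state = 2
"""


def find_dhcp_naks(log_text):
    """Return (timestamp, released_ip) for each 'NAK received' line.

    released_ip comes from the next 'clear info' line, or None if there is none.
    """
    results = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        nak = re.match(r"\w+ (\d{2}:\d{2}:\d{2}\.\d+) DHCP: NAK received", line)
        if not nak:
            continue
        released_ip = None
        for later in lines[i + 1:]:
            cleared = re.search(r"DHCP: clear info - IP = ([\d.]+)", later)
            if cleared:
                released_ip = cleared.group(1)
                break
        results.append((nak.group(1), released_ip))
    return results


def matches_half_life(reset_interval, lease_time, tolerance=timedelta(minutes=30)):
    """True if the time between resets is close to half of the DHCP lease."""
    return abs(reset_interval - lease_time / 2) <= tolerance


print(find_dhcp_naks(SAMPLE_LOG))
print(matches_half_life(timedelta(hours=36), timedelta(hours=72)))
```

A reset cycle of 36 hours against a 72-hour lease matches; a 24-hour cycle against the same lease does not, so in that case you'd look elsewhere first.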

Wednesday, February 25, 2009

KISS - Keep It Simple Stupid

Most IT guys know about the KISS rule (Keep It Simple, Stupid). However, not many really understand and use it. Let's take a look at some Cisco Unified Communication products and see how we can apply the KISS rule.

The most frequently seen problem description is "it doesn't work".

"It" could mean lots of things. "It" could involve different products from different vendors (such as CUCM/IPCC/CUPS from Cisco, MOC from Microsoft, PBX from Avaya, T1 trunks from AT&T, etc.).

To simplify the problem, we have to narrow it down quickly.

For example, if a customer said "My call center agents cannot make phone calls", I would ask "Can you make calls from IP phone to IP phone in the same office?". This question could potentially eliminate call center software, voice gateway, PSTN and codec issues. If you didn't ask, you'd have to troubleshoot those items one by one (assuming you know how to troubleshoot those items).

Another example is network issues. All Unified Communication software relies on network connectivity. It won't function properly if the network doesn't. Sometimes, a network issue is not as obvious as you might think. For example:

1) Windows Firewall service was stopped, but traffic wouldn't pass through until you explicitly opened ports on it. (Hard to believe, but it happens.)

2) You're not using VPN, but the VPN client is running as a service and has its firewall option turned on. (Works as designed.)

3) You claimed there was no firewall in the network, but there's an FWSM (Firewall Services Module) on the switch.

4) You opened all ports on the ASA (Firewall and VPN), but CUPC still won't work. That is because an ASA bug prevents large SIP messages from passing through.

...

To troubleshoot network issue, you have to:
a) Have visibility into every component in the network
b) Be very good at all network layers (from physical layer to application layer)
c) Know how to use sniffer (such as Wireshark)

The difficult part is: sometimes you wouldn't think it's the network because it's not that obvious. Hence you wouldn't go down that path at all. You have to use KISS rule to find it out.

Example #1:
Customer: "My CUPC doesn't work."
You: "Doesn't work for all users? Or for some users?"
Customer: "For those users working from home."
You: "If those users were in office, would CUPC work for them?"
Customer: "Yes."

Now you know the problem is outside CUPC. Probably on the network (VPN?)

Example #2:
Customer: "My CUPC doesn't work."
You: "Doesn't work for all users? Or for some users?"
Customer: "It works for John but doesn't work for Mary. And they are both in the same office."
You: "On John's computer, can you log into CUPC with Mary's account? See if it works?"
Customer: "Yes, it works."

Now you know the problem is outside CUPC. Probably on Mary's computer (Firewall?)

Some KISS rules for Cisco UC (Unified Communication) products:

#1 If you don't know if it's case sensitive, assume it is.
This became a problem when Cisco moved from the Windows platform to Linux.

#2 Because of #1, try to use lower case as much as you can.
Some people use upper case for cosmetic purposes. This could potentially cause problems that take weeks to troubleshoot.

#3 Eliminate dependencies as much as possible.

Example A: When you're installing CUCM, you have the option to use DNS, NTP, etc. Do NOT use them. If you use them, the installation might fail if those components weren't configured properly. You'll have a chance to configure them after the install. I can't tell you how many problems are caused by DNS (even after install).

Example B: Don't use the same "service account" for different applications. For example, say you used the same Active Directory account for CUCM LDAP integration, CUPC LDAP search and CUPS Calendar. If the CUPS admin changes the account password (for whatever reason), it breaks CUCM and CUPC.

Example C: Get rid of CUCM subscribers during Windows-to-Linux migration. When you migrate from CCM 3.x/4.x (Windows) to CUCM (Linux), DB replication is always a headache. DB replication would fail if the hostname or IP address was changed during migration (or if something else changed between Pub and Sub). To avoid those headaches, remove the subscribers from the cluster before migration. With a single server (Publisher) in the cluster, your chance of failure is far lower than with a 10-server cluster. After migration, you may add the subscribers back to the cluster one by one.

#4 Be a "minimalist"
Sales people tend to sell all the "bells and whistles" to the customer. Sure, that's the selling point. But as an engineer, if you want to get the job done smoothly, try to start with the minimum.

Example A: Use TCP instead of TLS.
Sure we want the security of TLS. But don't try to run before you can walk. Make sure the product is working before attempting TLS. If it didn't work with TLS, you know where the problem is.

Example B: Use simple passwords.
Sure we want the security of long, complex passwords (how about 1024 characters long?). But for installation and troubleshooting purposes, keep them short and simple (don't use special characters).

Example C: Build a simple test bed.
I've seen some integrators try to deploy their first CUPS/CUPC installation over the VPN (because they are not onsite). This is a bad idea unless VPN is what you want to test. If something doesn't work, you won't know if it's the VPN or CUPC.

Same for the computers. Instead of testing on a computer with a bunch of custom-installed software, you'd better test on a clean, freshly installed computer. Stick with Windows XP. Stay away from Vista, unless you understand UAC, Windows Defender (or offender?), and other security "features".

Thursday, February 12, 2009

Decrypt CUCM version numbers

In an ideal world, version 6.x is better than 5.x, version 7.x is better than 6.x, and so on. However, we're not in an ideal world.

Cisco builds different "trains" in parallel. Currently, the active trains for CUCM are 5.x, 6.x and 7.x.

This "multiple trains" approach is a compromise between market demands and compatibility. In order to support new features, big changes need to be made to the infrastructure (e.g. database schema). Sometimes, the changes are so big that it's impossible to be compatible with previous versions. So they introduce a totally different "train" to lower the risk.

It's really hard to tell which train is the "best". Of course, a newer train has more features. But it also has more requirements. For example, CUCM 6.x is compatible with CUPS 6.x and 7.x. But CUCM 7.x is only compatible with CUPS 7.x.

On each train, there are many "sub-versions". For example, on the 6.x train, you have 6.1.1, 6.1.2 and 6.1.3. Read the release notes carefully. Some versions can't be upgraded to another train. For example, CUCM 6.1.3 can't be upgraded to 7.0.x (because of a different database schema).

On each sub-version, there are also "build-numbers". e.g. 6.1.2.1000, 6.1.2.2000, etc. Build-number is the most confusing part.

Generally speaking, build numbers increase in steps of 1000, such as 6.1.2.1000, 6.1.2.2000, etc.

CUCM is built on a Linux OS. Whenever Cisco releases an OS security patch, they increase the build number by 1000. This is called a PSIRT patch.

Remember CUCM is an application running on Linux. An OS patch does not contain any CUCM bug fixes. Any bug fixes would be in an ES (Engineering Special). ES versions are identified by the last three digits of the build number (e.g. 6.1.2.1112).

The OS team and the CUCM (application) team are two different teams. When the OS team releases OS patches, they don't include any application patches at all, but the build number still increases by 1000.

Quiz: 6.1.2.2000 and 6.1.2.1112, which one is "better"?

Answer: it depends on how you define "better". But most people would think "less buggy" is better. When they say "less buggy", they mean "fewer bugs in CUCM". If that's the case, 6.1.2.1112 is better, because it has an ES number of 112, which means it fixed quite a lot of bugs, while 6.1.2.2000 has no CUCM bug fixes at all (it contains OS patches though).
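To make the quiz concrete, here is a small sketch that splits a build number the way this post describes it: the thousands part as the OS patch level and the last three digits as the ES number. The names `CucmVersion`, `parse_cucm_version` and `fewer_cucm_bugs` are made up for illustration, and the code assumes the version is always four dot-separated integers.

```python
from typing import NamedTuple


class CucmVersion(NamedTuple):
    sub_version: str      # e.g. "6.1.2"
    os_patch_level: int   # thousands part of the build number
    es_number: int        # last three digits of the build number


def parse_cucm_version(version):
    """Split 'major.minor.maintenance.build' per the scheme described above."""
    major, minor, maintenance, build = (int(part) for part in version.split("."))
    return CucmVersion(f"{major}.{minor}.{maintenance}", build // 1000, build % 1000)


def fewer_cucm_bugs(a, b):
    """Return the version string with the higher ES number (same sub-version only)."""
    va, vb = parse_cucm_version(a), parse_cucm_version(b)
    if va.sub_version != vb.sub_version:
        raise ValueError("only comparable within one sub-version")
    return a if va.es_number >= vb.es_number else b


print(parse_cucm_version("6.1.2.1112"))
print(parse_cucm_version("6.1.2.2000"))
print(fewer_cucm_bugs("6.1.2.2000", "6.1.2.1112"))
```

Comparing on the ES number alone is a deliberate simplification of the "less buggy" definition above; it says nothing about which OS patches you are missing.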

Confusing enough? I don't know which genius invented this versioning scheme. But that's the way it is. If you try to "upgrade" 6.1.2.1112 to 6.1.2.2000, it'll fail with some vague error messages. You have to open a TAC case to understand why it failed.

Interesting? Yeah, that's the way to keep TAC engineers' jobs. :)

Sunday, February 8, 2009

NTP - Network Time Protocol

NTP is critical in Cisco voice products. Time synchronization not only provides consistent timestamps in trace files, but is also a mandatory requirement for some components.

Architecture

On a CUCM publisher, you may choose to use the internal clock (computer hardware clock) or an external clock (NTP server, such as a router).

Regardless of your choice, all other servers in the cluster will use NTP to synchronize time with the publisher. In other words, NTP is only configurable on the publisher.

Basic concepts

http://en.wikipedia.org/wiki/Network_Time_Protocol


Tips

1. Before you configure NTP on the publisher, set the local time as accurately as possible. This will shorten the time to synchronize after you configure NTP.

2. Be patient after you configure NTP. It might take hours to synchronize, depending on the time difference between the publisher and the NTP source. This works as designed, to comply with the IETF RFC.

3. If NTP is configured on the publisher, subscribers won't synchronize to the publisher until the publisher is in sync with its NTP source. If you're having problems syncing the publisher to the NTP source, but you want the whole cluster in sync on time, disable NTP on the publisher.

Frequently used commands

utils ntp status

ntpd (pid 3638) is running...

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 127.127.1.0     127.127.1.0     10 l    9   64  377    0.000    0.000   0.008
*171.68.10.80    64.103.34.14     2 u  921 1024  377   38.233    3.336   1.182
+171.68.10.150   10.81.254.202    2 u  988 1024  377   37.044    3.252  12.236


synchronised to NTP server (171.68.10.80) at stratum 3
time correct to within 60 ms
polling server every 1024 s

Current time in UTC is : Sun Feb 8 14:38:36 UTC 2009
Current time in America/Chicago is : Sun Feb 8 08:38:36 CST 2009


The output above tells you:
1. The box is synchronized to 171.68.10.80, a stratum 2 source, so the box itself is at stratum 3.
2. The internal clock is at stratum 10 (the box won't synchronize to any time source with a stratum equal to or greater than 10)
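If you collect this output from many servers, reading it by eye gets old. Below is a rough sketch that pulls the peer table apart and finds the selected peer (the one marked with "*"). The function names are mine, the parser assumes ten whitespace-separated columns as in the output above, and it only recognizes the "*", "+", "-" and "#" markers.

```python
# Peer lines copied from the "utils ntp status" output in this post.
SAMPLE_PEERS = """\
remote refid st t when poll reach delay offset jitter
127.127.1.0 127.127.1.0 10 l 9 64 377 0.000 0.000 0.008
*171.68.10.80 64.103.34.14 2 u 921 1024 377 38.233 3.336 1.182
+171.68.10.150 10.81.254.202 2 u 988 1024 377 37.044 3.252 12.236
"""


def parse_peers(text):
    """Return one dict per peer line; header and separator lines are skipped."""
    peers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10:
            continue
        try:
            stratum = int(fields[2])
        except ValueError:
            continue  # header line: the "st" column is not a number
        remote, marker = fields[0], ""
        if remote[0] in "*+-#":
            marker, remote = remote[0], remote[1:]
        peers.append({"marker": marker, "remote": remote, "stratum": stratum})
    return peers


def selected_peer(peers):
    """Return the peer the box is currently synchronized to, or None."""
    for peer in peers:
        if peer["marker"] == "*":
            return peer
    return None


peers = parse_peers(SAMPLE_PEERS)
source = selected_peer(peers)
# A client sits one stratum below (numerically above) its time source.
print(source["remote"], "-> local stratum", source["stratum"] + 1)
```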

Other commands include:

utils ntp config
utils ntp restart
utils ntp start

Troubleshooting

utils network capture port 123
Executing command with options:
size=128 count=1000 interface=eth0
src= dest= port=123
ip=

08:56:01.125718 cm6-sub.ntp > cm6-pub.ntp: v4 client strat 4 poll 10 prec -18 (DF) [tos 0x10]
08:56:01.125965 cm6-pub.ntp > cm6-sub.ntp: v4 server strat 3 poll 10 prec -17 (DF) [tos 0x10]
08:56:18.270720 cm6-pub.ntp > ntp-sj1.ntp: v4 client strat 3 poll 10 prec -17 (DF) [tos 0x10]
08:56:18.308956 ntp-sj1.ntp > cm6-pub.ntp: v4 server strat 2 poll 10 prec -18
08:57:24.271526 cm6-pub.ntp > ntp-sj2.ntp: v4 client strat 3 poll 10 prec -17 (DF) [tos 0x10]
08:57:24.309282 ntp-sj2.ntp > cm6-pub.ntp: v4 server strat 2 poll 10 prec -16


Port 123 is the NTP port. The output above shows the incoming/outgoing NTP packets on the publisher:
1) cm6-sub is the NTP client on stratum 4
2) cm6-pub is the NTP server on stratum 3 (because the external NTP source is on stratum 2)
3) ntp-sj1 and ntp-sj2 are the external NTP sources on stratum 2
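The same reading of the capture can be done with a few lines of script. This is a sketch only: the regular expression is written against the capture lines shown in this post and may need adjusting for other tcpdump versions, and `strata_by_host` is a name I made up.

```python
import re

# Capture lines copied from the output in this post.
SAMPLE_CAPTURE = """\
08:56:01.125718 cm6-sub.ntp > cm6-pub.ntp: v4 client strat 4 poll 10 prec -18 (DF) [tos 0x10]
08:56:01.125965 cm6-pub.ntp > cm6-sub.ntp: v4 server strat 3 poll 10 prec -17 (DF) [tos 0x10]
08:56:18.308956 ntp-sj1.ntp > cm6-pub.ntp: v4 server strat 2 poll 10 prec -18
"""

LINE_RE = re.compile(
    r"^\S+ (?P<src>\S+)\.ntp > (?P<dst>\S+)\.ntp: "
    r"v\d+ (?P<mode>client|server) strat (?P<stratum>\d+)"
)


def strata_by_host(capture_text):
    """Map each sending host to the stratum it advertised in its own packet."""
    strata = {}
    for line in capture_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            strata[match.group("src")] = int(match.group("stratum"))
    return strata


print(strata_by_host(SAMPLE_CAPTURE))
```

Each host should advertise a stratum one higher than the source it syncs to, which is the chain described in the list above.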

NTP logs

Use RTMT to get "ntp logs".

Troubleshooting time offset on phones

If the time on the CUCM server is correct, but the phones show the wrong time, it's most likely due to misconfiguration.

First of all, we need to understand the difference between UTC time and local time.

There are many different time zones in the world. In the US, we have EST, CST, MST, PST, etc. 8AM EST means 7AM CST. Daylight saving adds more complexity to this. Different countries have different daylight saving cutoff dates.

To provide consistency around the world, the NTP server feeds UTC (GMT) time to clients. How to convert it to "local time" is the client's responsibility.

On CUCM Admin > System > Date/Time Group, you may configure different groups to reflect different time zones. Then you may associate date/time groups with different device pools. Hence, phones in different device pools can have different local times.

One thing to notice is:
The "old" phones (7940/7960) get local time from CUCM server.
The "new" phones (7941/7961 and newer) get UTC time and time zone info from CUCM server. Then they do the math and display the local time.
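The "math" a newer phone does is nothing more than applying an offset to UTC. Here is a tiny sketch using the timestamps from the "utils ntp status" output earlier in this post. It models a time zone as a fixed UTC offset, which is a simplification on my part: real time zone info also carries daylight saving rules.

```python
from datetime import datetime, timedelta, timezone


def to_local_time(utc_time, utc_offset_hours):
    """Convert an aware UTC datetime to a fixed-offset local time."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return utc_time.astimezone(local_zone)


# "Current time in UTC is : Sun Feb 8 14:38:36 UTC 2009"
utc_now = datetime(2009, 2, 8, 14, 38, 36, tzinfo=timezone.utc)

# CST is UTC-6 (no daylight saving in effect in February).
chicago = to_local_time(utc_now, -6)
print(chicago.strftime("%a %b %d %H:%M:%S"))  # Sun Feb 08 08:38:36
```

If a phone shows a time that is off by exactly one or more whole hours, the offset (that is, the Date/Time Group) is the first thing to check.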

Use Windows server as NTP source

Depending on your Windows version, there are some registry settings you need to set:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NTPServer\Enabled
Changing the ‘Enabled’ flag to the value 1 enables the NTP Server.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Parameters\Type
Change the server type to NTP by specifying ‘NTP’ in the ‘Type’ registry entry.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config\AnnounceFlags
Set the ‘Announce Flags’ registry entry to 5, to indicate a reliable time source.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config\LocalClockDispersion
Set 'LocalClockDispersion' to 0

The last one is the most important one.

After changing the registry, you need to restart the "Windows Time" service.
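If you'd rather not click through regedit, the four settings can be turned into `reg add` commands. The sketch below only builds and prints the command lines; it does not touch the registry, so you can review them before pasting into an elevated command prompt. The value types (REG_DWORD / REG_SZ) are what I'd expect for these entries, but treat them as an assumption and verify against Microsoft's documentation for your Windows version.

```python
W32TIME = r"HKLM\SYSTEM\CurrentControlSet\Services\W32Time"

# (subkey, value name, value type, data) for the settings listed above.
SETTINGS = [
    (r"TimeProviders\NTPServer", "Enabled", "REG_DWORD", "1"),
    ("Parameters", "Type", "REG_SZ", "NTP"),
    ("Config", "AnnounceFlags", "REG_DWORD", "5"),
    ("Config", "LocalClockDispersion", "REG_DWORD", "0"),
]


def reg_add_commands():
    """Build one 'reg add' command line per registry setting."""
    return [
        f'reg add "{W32TIME}\\{subkey}" /v {name} /t {value_type} /d {data} /f'
        for subkey, name, value_type, data in SETTINGS
    ]


def restart_commands():
    """Commands to restart the Windows Time service (service name: w32time)."""
    return ["net stop w32time", "net start w32time"]


for command in reg_add_commands() + restart_commands():
    print(command)
```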

P.S. You either have to turn off Windows Firewall or allow UDP port 123, which is used by NTP.

Wednesday, February 4, 2009

Integration between CUPS and MOC/OCS

For now, to integrate MOC (Microsoft Office Communicator) with Cisco IP phone system (CUCM) you need CUPS for phone presence and phone control.

Phone presence

Phone presence info will flow like this: CUCM -> CUPS -> OCS -> MOC.

Phone control

Phone control info will flow like this: MOC -> OCS -> CUPS -> CUCM.

Best Practices

Use TCP instead of TLS on your first deployment. TLS/Certificates are not fun to play with. They are optional for the integration. Per the KISS (Keep It Simple, Stupid) principle, don't mess with TLS unless you have to.

Let's talk about phone control first. Currently, Cisco supports RCC (Remote Call Control) for MOC.

RCC is configured in Active Directory Users and Computers (ADUC) > Communications.


As you can see from the picture above, we need to configure Server URI and Line URI.

6002 is the IP phone DN (Directory Number).

htluo-cups6 is the proxy domain that will process the request. We'll discuss this part further later.

When MOC starts up, it'll send "INVITE 6002@htluo-cups6" to OCS.

When OCS receives the INVITE message, it'll try to route it to the right destination (CUPS in this case).

How OCS routes the message is more complicated than it looks. It could use a static route, or it could use a DNS lookup. For more details see "SIP domain and DNS domain".

Again, per the KISS principle, it's recommended to use a static route on OCS to eliminate any DNS misconfiguration.


As shown above, the static route means "for any SIP message with domain htluo-cups6, forward it to 10.88.229.209, port 5060, with TCP protocol".

In order for CUPS to accept this message, OCS' IP address must be added to CUPS Incoming ACL. (or you may configure an "ALL" incoming ACL)

When the CUPS server receives the message from OCS, the first thing it does is determine whether the message has reached its final destination. CUPS compares its own configuration with the domain portion of the SIP message. If the domain portion matches one of the following, CUPS considers the message to have arrived at its final destination and handles it.

a) SIP domain name (configured under CUPS Admin > System > Service Parameters)
b) CUPS node name (configured under CUPS Admin > System > Server)
c) node name + SIP domain
d) other alias name configured on CUPS
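The check in a) through d) boils down to a set membership test. The sketch below is my own illustration, not CUPS code: I'm assuming "node name + SIP domain" means the two joined with a dot, I'm using an exact (case-sensitive) comparison because I don't know how CUPS compares, and "example.com" is a made-up SIP domain.

```python
def is_final_destination(request_domain, sip_domain, node_name, aliases=()):
    """True if the domain part of the SIP request matches a name CUPS answers to."""
    candidates = {
        sip_domain,                   # a) SIP domain name
        node_name,                    # b) CUPS node name
        f"{node_name}.{sip_domain}",  # c) node name + SIP domain (assumed dot-joined)
        *aliases,                     # d) other alias names
    }
    return request_domain in candidates


# "INVITE 6002@htluo-cups6": the domain part is "htluo-cups6".
print(is_final_destination("htluo-cups6", "example.com", "htluo-cups6"))
print(is_final_destination("somewhere-else", "example.com", "htluo-cups6"))
```

When the result would be False, the names CUPS actually answers to are exactly what you go looking for in the SIP Proxy trace.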

To see a full list of alias names, set "SIP Proxy" trace to detail. Restart SIP Proxy service. SIP Proxy would write a list of alias names to trace files during startup.

If CUPS decided to take care of the INVITE message from OCS, it will do the following:

1) Determine if the MOC user has permission to control the phone
2) If step 1 was ok, open a CTI request to CUCM CTIManager
3) If step 2 was successful, return "200 OK" SIP message to OCS

Determine if the MOC user has permission to control the phone

In the Line URI (tel:6002;phone-context=dialstring;device=SEP001E7A24429A), if a device name is specified (as it is in this case), CUPS will check whether that device is in the "Controlled Devices" list on CUCM Admin > User Management > End User.

If no device name is specified in the Line URI, CUPS will try to find the device by DN. For details, please see: http://www.cisco.com/en/US/docs/voice_ip_comm/cups/6_0_1/install_upgrade/deployment/guide/dgmsint.html#wp1049685

Again, per the KISS principle, you'd better specify the device name in the Line URI on your first deployment.
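A quick way to sanity-check the tel: URI before you paste it into ADUC is to split it into its parts and confirm the device parameter is really there. This is a minimal sketch with a made-up function name; it handles only the simple "number;key=value;key=value" shape shown above.

```python
def parse_tel_uri(uri):
    """Split a tel: URI into (number, {parameter: value})."""
    scheme, _, rest = uri.partition(":")
    if scheme != "tel" or not rest:
        raise ValueError(f"not a tel: URI: {uri!r}")
    number, *params = rest.split(";")
    parsed = dict(param.split("=", 1) for param in params if "=" in param)
    return number, parsed


number, params = parse_tel_uri(
    "tel:6002;phone-context=dialstring;device=SEP001E7A24429A"
)
print(number)                # 6002
print(params.get("device"))  # SEP001E7A24429A
```

If `params.get("device")` comes back as None, the device has to be found by DN instead, which is the less predictable path on a first deployment.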

Open a CTI request to CUCM CTIManager

CUPS will open a CTI request to CUCM CTIManager with the credential configured on CUPS Admin > Application > CTI Gateway > Settings (CUPS 6.x).

Of course, the credential needs to exist on CUCM > User Management > Application User. It needs to be in the "Standard CTI Enabled" and "Standard CTI Allow Control of All Devices" groups.

And of course, the phone device needs to be registered to CUCM.

If all of the above is successful, CUPS will send "200 OK" to OCS as a response to the INVITE.

At this point, CUPS has done its job. But it does not necessarily mean MOC gets phone control.

In order for OCS to accept the "200 OK" message from CUPS, CUPS' IP address must be added to OCS "Host Authorization" (please note, it's IP address, not hostname or FQDN).


Don't forget to restart OCS Front End services after making changes.

The best tool to debug OCS/CUPS integration is on OCS. Right-click on the pool > Logging Tool > New Debug Session. Choose SIP Stack. Optionally, you may filter by the MOC user ID in the filter settings.

Known caveats:

1. CUPS sent "200 OK" for the INVITE, but MOC still doesn't get phone control.

This is because OCS doesn't trust CUPS. The OCS SIP stack log will show "SIPPROXY_E_INVALID_RECORD_ROUTE".

Resolution: Check "Host Authorization" on OCS.

2. Load balancer

If you have a load balancer for OCS, more likely than not you will run into a "one-way phone control" issue. The symptom is: you can make phone calls from MOC, but the call status is not updated on MOC. For example, when the call is connected, MOC still shows "calling".

This problem is caused by misconfiguration of load balancing.

When OCS sends a message to CUPS, it doesn't go through the load balancer (based on your existing configuration).

When CUPS tries to reply to OCS, it looks up DNS, and DNS resolves the pool name to the load balancer's virtual IP. So the traffic goes through the load balancer to get to OCS. When OCS receives the message, the last hop is the load balancer. However, the load balancer didn't add its IP to the SIP header. OCS will reject this message and send "400 Missing correct Via header" to CUPS.

Resolution:
Check your load balancer to see if it's capable of modifying the SIP header. Or contact Microsoft to see if they can turn off the "Via header" check on OCS.