Contact Expert v7.0 for Skype for Business Server
High Availability Options for Contact Expert
Introduction
Contact Expert inherently supports some level of high availability in the form of CE domains. A CE domain refers to the physical architecture of the deployed solution: a single domain represents an isolated operational environment using a dedicated application server. In other words, one CE domain means one CE Core Host server computer. If a domain (that is, the relevant CE Core Host) goes down, then only the resources (agents, campaigns, etc.) assigned to that specific domain are affected. Agents and campaigns in other domains remain unaffected. From a survivability and administration perspective, however, this approach has a number of problems:
- Agents and campaigns must be assigned to domains explicitly
- Agents cannot work and campaigns cannot run until the application server hosting their domain is up and running
- Therefore, when a CE application server fails, all of its resources must be reconfigured to another server in a labour-intensive administration task, during which the system is effectively off
- Even if there is an additional application server, it might not have been scaled to withstand the increased load that two domains' worth of resources might impose.
While the domains approach provides some level of load balancing capability, it is not considered a highly available or fault tolerant solution in itself.
For further information on how CE operates with OnCall IVR in case of failover please read the Behavior of Contact Expert and OnCall IVR in High-Availability cases article.
Automatic Failover
Contact Expert supports having two application servers in a domain: a Primary and a Secondary, providing an "active-passive" highly available service.
These two servers must be identical in terms of hardware and software specifications: both host the same components, including CE licenses, and each should be able to provide the full function set to the whole range of users configured in that domain.
Being "active-passive", only one of the servers is active at any one time, though the CE services must be up and running on both computers for the HA function to operate properly. In a nominal situation, the primary server provides all functions and bears all the load, while the secondary server runs in passive mode. After a failover, the secondary server becomes active and the primary server goes dormant.
The Switchover Process
Normally, when both servers, the network, etc. operate fine, the two servers exchange heartbeat information and both of them register the result of this to the CE system database periodically. The system database plays the witness role in this context. See the diagram for an overview.
The switchover process is initiated by the witness party – the database.
Automatic switchover is initiated only if all of the following are true:
- The previously active server did not register status to the database for a while...
- ...AND the previously passive server registers status to the database...
- ...AND the previously passive server indicates that sending heartbeats to the previously active server fails.
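The three conditions above can be read as a single boolean decision made on the witness side. The following is a minimal sketch of that reading, not CE's actual implementation: the function name, its parameters, and the use of one staleness threshold for both "did not register for a while" and "registers status" are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumption: "for a while" is modelled as one staleness threshold.
# The 3-minute value mirrors the decision threshold quoted in this article.
STALE_AFTER = timedelta(minutes=3)


def should_switch_over(now, active_last_status, passive_last_status,
                       passive_heartbeat_to_active_ok,
                       stale_after=STALE_AFTER):
    """Return True only when all three witness conditions hold."""
    active_is_silent = now - active_last_status > stale_after
    passive_is_reporting = now - passive_last_status <= stale_after
    heartbeat_fails = not passive_heartbeat_to_active_ok
    return active_is_silent and passive_is_reporting and heartbeat_fails


now = datetime(2024, 1, 1, 12, 0, 0)
silent_for_5_min = now - timedelta(minutes=5)
seen_10_s_ago = now - timedelta(seconds=10)

# Active server silent, passive server reporting, heartbeats failing.
print(should_switch_over(now, silent_for_5_min, seen_10_s_ago, False))
# Same, but the passive server can still reach the active one.
print(should_switch_over(now, silent_for_5_min, seen_10_s_ago, True))
```

The second call illustrates why all three conditions are required: if the passive server can still exchange heartbeats with the active one, a missing database registration alone does not trigger a switchover.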
When a failover occurs, the passive server takes up the role of the active server. During a failback it is the other way around. The server that became active at the end of such a switchover process starts running the queues and campaigns assigned to the domain and starts accepting connections from agents.
Automatic Failback
After a successful failover process, all CE services are provided by the newly activated server, which is now the active one. The criteria triggering a failback, and the process itself, are exactly the same as for a failover.
Mean Recovery Time Objective
The mean recovery time objective (RTO) is the following:
- automatic failover or failback (i.e. the active server is lost or recovered): 3 minutes decision threshold + 3 minutes switchover process = 6 minutes
- manual failover or failback: 3 minutes switchover process
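The arithmetic behind these two figures can be written down as a small sketch. The function and constant names are hypothetical. The decision threshold is passed as a parameter because it is a configurable setting, while the switchover time is the fixed nominal value quoted above.

```python
from datetime import timedelta

DECISION_THRESHOLD = timedelta(minutes=3)  # default quoted in this article
SWITCHOVER_PROCESS = timedelta(minutes=3)  # nominal, not a guaranteed maximum


def mean_rto(automatic, decision_threshold=DECISION_THRESHOLD,
             switchover=SWITCHOVER_PROCESS):
    """Mean recovery time: the waiting period applies to automatic switchovers only."""
    waiting = decision_threshold if automatic else timedelta(0)
    return waiting + switchover


print("automatic:", mean_rto(automatic=True))
print("manual:   ", mean_rto(automatic=False))
```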
Automatic Failover and Failback Decision Threshold
This is the time CE will wait from the moment it observes the currently active server is not operational until it actually initiates the switchover to the currently passive server.
Read the Application Servers article for more information on configuring this setting.
Failover and Failback Process Time
This is the time CE needs to perform the switchover of control from one server to the other, either in a failover scenario (primary to secondary) or failback (secondary to primary).
This time period cannot be changed, and it assumes a system environment at or above the minimum requirements (IT equipment performance, network, etc).
The switchover process time is not a guaranteed maximum value as – on top of the IT environment performance – it also depends on the number of business resources configured in CE.
Important Notices
The HA functionality will only perform as described if the following notices and advice are followed:
- HA works in a single datacenter only. There is no built-in HA for multiple datacenters (i.e. there is no built-in disaster recovery).
- Maximum component startup/shutdown time frame is 5 seconds by default. Make sure the physical resources (CPU, IO) dedicated to the primary and secondary application servers guarantee this startup/shutdown requirement. On slower machines you need to increase the maximum component startup/shutdown time frame to a higher value (e.g. to 10 seconds) in ServerAgent.Config.xml.
- CE databases should always be available. Otherwise, CE might not work at all.
- Make sure the CE Storages – specified on the portal – are always available. Unavailable storages might result in CE startup failure.
- Make sure the callback list cache file (CallbackList.xml) is stored in a network share which is always available. Otherwise, callback requests might not be performed at all.
- Make sure the custom presence state file (ace_presence.xml) is stored on a web server which is always available. Otherwise, agents might not be able to sign in.
- In a virtualized environment, never host the primary and secondary application servers on the same physical host. Otherwise, HA functionality might become completely useless.
- Never restart the CE Server Agent Windows service manually: by design, it stops and starts components one by one, which results in service disruption. Use the CE Server Manager tool for this.
- Make sure the startup mode of CE Server Agent is set to Automatic (Delayed).
Always available means that the referred objects are
- stored neither on the primary nor on the secondary application server (so a HW failure of these would not render them inaccessible)
- stored at a location that is highly available on its own (does not pose a single point of failure itself)
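The first of these two criteria can be checked mechanically for a configured location. The sketch below is only an illustration: the helper names are hypothetical, hosts are compared by short name only, and the second criterion (the location being highly available on its own) cannot be verified this way.

```python
from urllib.parse import urlparse


def location_host(location):
    r"""Return the host part of a UNC path (\\host\share) or of a URL."""
    if location.startswith("\\\\"):
        return location[2:].split("\\", 1)[0].lower()
    return (urlparse(location).hostname or "").lower()


def on_application_server(location, app_servers):
    """True if the location lives on one of the CE application servers."""
    if len(location) > 1 and location[1] == ":":
        return True  # local drive path, e.g. C:\Geomant\CE\Backup

    def short(name):
        return name.lower().split(".", 1)[0]

    return short(location_host(location)) in {short(s) for s in app_servers}


servers = ["ce-dev.msvoip.dev", "ce-ins.msvoip.dev"]  # example FQDNs
for location in (
    r"\\filesrv\cefiles",
    "http://ce-dev.msvoip.dev:8080/ClientAccessServer/ace_presence.xml",
    r"C:\Geomant\CE\Backup",
):
    print(location, "->", on_application_server(location, servers))
```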
Supported Failover Scenarios
The failover mechanism provides high availability for the following scenarios:
- Active server is 'HW-down', meaning:
  - Server is powered off
  - Server suffers power outage or power supply failure
  - Server suffers network hardware failure
  - Server suffers any HW problem rendering it out of order or unavailable over the network
- Local networking issue affecting only the active server:
  - Network card(s) failure
  - Router or other network component configuration issue
  - Subnet networking issue
- CE services are completely stopped on the active server, meaning that even the components responsible for the heartbeat synchronization are off.
Scenarios not supported
This HA solution does not address vis major situations such as global networking outages, issues with the database or the unified communications (UC) infrastructure. You can implement database server high availability (e.g. SQL AlwaysOn) and UC level high availability (e.g. SfB server pools) to defend against the latter two.
Switchover will also not occur when some of the server components are affected by internal (software bug) or external issues (network, prerequisite component issues or one of many other potential factors) while the particular software components responsible for the HA connectivity and heartbeat protocol are NOT affected and continue to work. Such situations can result in erroneous overall CE operation on the active server, yet automatic failover still does not occur.
The CE High Availability features do not replace active human supervision by IT administration staff.
Manual Failover and Failback
On top of the automatic process described above, CE also supports initiating both the failover and failback features manually via the administration portal.
Microsoft Skype for Business Specific Details
In a Microsoft SfB environment, both the primary and the secondary server should be provisioned as trusted application servers (New-CsTrustedApplicationComputer) hosting the same applications and the same application endpoints (CE campaigns and recorder). Additionally, the primary and the secondary servers should be organized into a trusted application pool (New-CsTrustedApplicationPool). Since only the active server registers application endpoints, SfB Front End server(s) will route incoming calls to the active server automatically.
Associated User Interfaces
Administrator Perspective
The CE portal provides an overview of the HA state including information on heartbeat connectivity, active-passive server states, etc.
- Navigate to Infrastructure → Application Servers.
- Click Edit of the preferred domain.
- Go to the bottom of the page and click MORE ACTIONS.
- Click High Availability.
- The active application server is indicated in a blue frame.
For further details on how to identify the state of the failover system, read the How to verify the Contact Expert high availability status article.
The portal also provides real-time feedback indicating the expected remaining time to finish manually initiated failover or failback procedures.
Agent Perspective
Identifying the Active Server
Among many other details, the Home → Diagnostics menu of the CE agent application reveals which CE server the client is connected to in a HA environment. See the highlighted parts in the screenshot.
Automatic Recovery
When a failover/failback procedure is initiated, either manually or automatically, the agent application presents a recovery dialog to give the agent feedback on the situation while the system attempts to reconnect to an active CE server component. Provided a CE server is available, this might take a couple of minutes.
If the agent is in the middle of administrative work when the dialog shows up, clicking the Close button will make it disappear. However, it will also stop any further attempts to automatically reconnect with the server components.
If both CE servers in a HA pair fail (or a network glitch blocks access to both of them), or the failover/failback procedure takes considerably more time than configured, then all automatic reconnection attempts will fail and the system will eventually give up trying.
Whether the failover procedure succeeds or fails, the recovery dialog lets the agent know the outcome. The particular reason for a failed recovery attempt is also presented to the agent in a temporary message that disappears after a while.
SfB User Perspective
Any SfB user – even employees not associated with Contact Expert operations – can take a look at the campaign endpoint contact in their SfB client software to reveal which CE server the given campaign endpoint is currently registered from. To do this, they simply need to type in portions of the CE queue or campaign name into the SfB client application contact search field.
Networking, SfB Server and CE Server Configuration by Example
This chapter describes how to configure networking, the Skype for Business Front-end servers and CE application servers for high availability by going through an example system setup. The following configuration is for a single CE domain assuming that CE application servers host both core and recording components.
We are using the following FQDN and IP addresses in our example setup:
Role | FQDN | IPv4 address |
---|---|---|
CE server pool | ce-pool.msvoip.dev | n/a |
Primary CE server | ce-dev.msvoip.dev | 10.168.3.101 |
Secondary CE server | ce-ins.msvoip.dev | 10.168.3.102 |
SfB Front-end pool | sfb-pool.msvoip.dev | n/a |
DNS Server Configuration
Create DNS A records in the forward lookup zone msvoip.dev for the CE application servers:
Name | Type | Mapped to |
---|---|---|
ce-pool | Host (A) | 10.168.3.101 |
ce-pool | Host (A) | 10.168.3.102 |
Execute the following command at least 3 times in a command prompt window on both the SfB Front-end and CE servers. Make sure the DNS server returns the IP addresses belonging to the CE servers and that the order of these IP addresses changes each time the command is issued:
nslookup ce-pool.msvoip.dev
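The same check can be scripted. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the lookup goes through the operating system resolver, whose cache may return the same order on every call even when the DNS server itself rotates the records. The nslookup command remains the reference check. The verification logic is kept separate from the lookup so it can be tried with recorded answers.

```python
import socket


def resolve_ipv4(fqdn):
    """Return the IPv4 addresses for fqdn in the order the resolver gave them."""
    _name, _aliases, addresses = socket.gethostbyname_ex(fqdn)
    return addresses


def check_round_robin(answers, expected):
    """answers: one address list per lookup; expected: the CE server addresses.

    Returns (complete, rotated): whether every answer contains exactly the
    expected addresses, and whether the order differed between lookups.
    """
    complete = all(set(answer) == set(expected) for answer in answers)
    rotated = len({tuple(answer) for answer in answers}) > 1
    return complete, rotated


servers = ["10.168.3.101", "10.168.3.102"]  # example addresses from this article

# Recorded answers of three lookups. In a live check, use instead:
#   answers = [resolve_ipv4("ce-pool.msvoip.dev") for _ in range(3)]
answers = [
    ["10.168.3.101", "10.168.3.102"],
    ["10.168.3.102", "10.168.3.101"],
    ["10.168.3.101", "10.168.3.102"],
]
print(check_round_robin(answers, servers))
```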
Computer Certificates on the CE Servers
Assign a computer certificate with the following attributes to each CE application server (Computer account > Personal folder):
Attribute name | value |
---|---|
Certificate Template Name | WebServer |
Friendly Name | CEPoolCert |
Subject (CN) | ce-pool.msvoip.dev |
Subject Alternate Name (DNS Name) | ce-pool.msvoip.dev, ce-dev.msvoip.dev, ce-ins.msvoip.dev |
Key Usage | Digital Signature, Key Encipherment |
Enhanced Key Usage | Server Authentication |
SfB Server Configuration
Use the SfB Server Management Shell to run the following cmdlets (the required permission is SfB Server CSAdministrator):
Data in these cmdlets are examples – do not use these in your environment!
Create an application pool using the CE pool FQDN and the 1st CE application server FQDN:
New-CsTrustedApplicationPool -Identity ce-pool.msvoip.dev -ComputerFqdn ce-dev.msvoip.dev -Registrar sfb-pool.msvoip.dev -Site 1
Add the 2nd CE application server to the pool:
New-CsTrustedApplicationComputer -Identity ce-ins.msvoip.dev -Pool ce-pool.msvoip.dev
Create a trusted application for CE core:
New-CsTrustedApplication -ApplicationId ACE -Port 9000 -TrustedApplicationPoolFqdn ce-pool.msvoip.dev
Create at least one campaign endpoint:
New-CsTrustedApplicationEndpoint -ApplicationId ACE -TrustedApplicationPoolFqdn ce-pool.msvoip.dev -SipAddress sip:campaign_1@msvoip.dev -DisplayName "CE: Campaign 1" -LineURI tel:+XXXXXXXXX
Create a trusted application for CE recording:
New-CsTrustedApplication -ApplicationId ACE_Recorder -Port 9100 -TrustedApplicationPoolFqdn ce-pool.msvoip.dev
Create at least one recorder endpoint:
New-CsTrustedApplicationEndpoint -ApplicationId ACE_Recorder -TrustedApplicationPoolFqdn ce-pool.msvoip.dev -SipAddress sip:recorder_1@msvoip.dev -DisplayName "CE: Recorder 1"
Publish the SfB topology:
Enable-CsTopology
CE Server Configuration
Data in these cmdlets are examples – do not use these in your environment!
Launch CE PowerShell on the Primary CE Core Host – in our example it is ce-dev.msvoip.dev – and execute the following cmdlet with local Windows Administrator role to configure the primary CE Core Host to work with the designated Trusted Application in Microsoft telephony:
Set-CESfbConnectorProperties -ApplicationName ACE -ApplicationFqdn ce-dev.msvoip.dev -ApplicationPort 9000 -CertificateFriendlyName CEPoolCert -ApplicationGruu [ACE trusted app's ComputerGruu for ce-dev.msvoip.dev]
Execute the following cmdlet to configure the CE recording services – in our example residing on the same computer as the primary CE Core Host – to work with the designated Trusted Application in Microsoft Telephony:
Set-CESfbRecorderProperties -ApplicationName ACE_Recorder -ApplicationFqdn ce-dev.msvoip.dev -ApplicationPort 9100 -CertificateFriendlyName CEPoolCert -ApplicationGruu [ACE_Recorder trusted app's ComputerGruu for ce-dev.msvoip.dev]
Execute the following cmdlet to establish the required firewall rules on the primary CE Core host:
Add-CEFirewallRules
Launch CE PowerShell on the Secondary CE Core Host – in our example it is ce-ins.msvoip.dev – and execute the following cmdlet with local Windows Administrator role to configure the secondary CE Core Host to work with the designated Trusted Application in Microsoft telephony:
Set-CESfbConnectorProperties -ApplicationName ACE -ApplicationFqdn ce-ins.msvoip.dev -ApplicationPort 9000 -CertificateFriendlyName CEPoolCert -ApplicationGruu [ACE trusted app's ComputerGruu for ce-ins.msvoip.dev]
Execute the following cmdlet to configure the CE recording services – in our example residing on the same computer as the secondary CE Core Host – to work with the designated Trusted Application in Microsoft Telephony:
Set-CESfbRecorderProperties -ApplicationName ACE_Recorder -ApplicationFqdn ce-ins.msvoip.dev -ApplicationPort 9100 -CertificateFriendlyName CEPoolCert -ApplicationGruu [ACE_Recorder trusted app's ComputerGruu for ce-ins.msvoip.dev]
Execute the following cmdlet to establish the required firewall rules on the secondary CE Core host:
Add-CEFirewallRules
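The command lines for the primary and the secondary server differ only in the server FQDN and in the GRUU values. A small sketch that renders them from these inputs makes this explicit. The helper is hypothetical, it only produces text using the cmdlets and parameters shown above, and the GRUU strings in the usage lines are placeholders. The GRUU is wrapped in double quotes because it contains semicolons, which PowerShell would otherwise treat as statement separators.

```python
def ce_server_commands(server_fqdn, core_gruu, recorder_gruu,
                       certificate="CEPoolCert"):
    """Render the three CE PowerShell command lines for one CE Core Host."""

    def line(cmdlet, app_name, port, gruu):
        return (f"{cmdlet} -ApplicationName {app_name} "
                f"-ApplicationFqdn {server_fqdn} -ApplicationPort {port} "
                f"-CertificateFriendlyName {certificate} "
                f'-ApplicationGruu "{gruu}"')

    return [
        line("Set-CESfbConnectorProperties", "ACE", 9000, core_gruu),
        line("Set-CESfbRecorderProperties", "ACE_Recorder", 9100, recorder_gruu),
        "Add-CEFirewallRules",
    ]


# Placeholder GRUUs: take the real values from Get-CsTrustedApplication.
for fqdn in ("ce-dev.msvoip.dev", "ce-ins.msvoip.dev"):
    print(f"# Run in CE PowerShell on {fqdn}")
    for command in ce_server_commands(
            fqdn,
            core_gruu=f"<ACE ComputerGruu for {fqdn}>",
            recorder_gruu=f"<ACE_Recorder ComputerGruu for {fqdn}>"):
        print(command)
```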
Callback List Cache File in HA Environment
Callback requests are managed by the RuleServer which maintains a local cache file (CallbackList.xml) for them in order to handle them efficiently. By default this file is stored locally on the application server in the C:\Geomant\CE\Backup directory. In a HA environment this file should be stored on a network share in order to avoid loss of callback requests in case of a server failure.
The RuleServer on both the primary and secondary servers has to be configured to read/write the callback list cache file from/to this directory.
- Create the folder (e.g.: cefiles) on a file server (e.g.: filesrv) where the callback list cache file is to be stored.
- Grant both the primary and secondary server computer accounts (e.g.: cesrv1$, cesrv2$) full permission on this folder. Also make sure these permissions are inherited by all child objects within this folder.
- Share this folder on the network (e.g.: \\filesrv\cefiles).
Please note that '\\filesrv\cefiles' is only an example! Provide the exact path name of the network share where you store the callback list cache file.
- Stop Contact Expert services using the CE Server Manager tool.
- Stop the CE Server Agent service in the Windows Services administration tool.
- Configure the RuleServer on both the primary and secondary server to store the callback list cache file (CallbackList.xml) in the shared folder (-backupdir \\filesrv\cefiles):
  - Navigate to the ServerAgent.Config.xml config file located at C:\Geomant\CE\Services\ServerAgent\ by default.
  - Find the argument specifying the location of the backup directory among the RuleServer parameters:
<Argument name="backupdir" value="C:\Geomant\CE\Backup"></Argument>
  - Replace the default backup path with the location of the network share where the callback list cache file is stored.
- Start CE Server Agent in Windows Services admin tool.
- Start Contact Expert services using the CE Server Manager tool.
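Because the same change must be made on both servers, the edit of the backupdir argument can be scripted. The following is a minimal sketch under assumptions, not a supported tool: only the Argument element with its name and value attributes is taken from this article, the surrounding elements in the sample are hypothetical stand-ins, and the function updates every Argument named backupdir it finds. Back up the real ServerAgent.Config.xml first, since re-serializing XML this way drops comments and original formatting.

```python
import xml.etree.ElementTree as ET


def set_backupdir(xml_text, new_dir):
    """Return the config text with each backupdir Argument pointing at new_dir."""
    root = ET.fromstring(xml_text)
    changed = 0
    for argument in root.iter("Argument"):
        if argument.get("name") == "backupdir":
            argument.set("value", new_dir)
            changed += 1
    if changed == 0:
        raise ValueError("no Argument named backupdir found")
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily simplified stand-in for ServerAgent.Config.xml.
sample = (
    '<Config><Component name="RuleServer">'
    '<Argument name="backupdir" value="C:\\Geomant\\CE\\Backup"></Argument>'
    '</Component></Config>'
)
print(set_backupdir(sample, r"\\filesrv\cefiles"))
```

Run the change only while the CE services and the CE Server Agent are stopped, as described in the steps above.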
Notes on the ApplicationGruu parameter in a HA CE environment
The CE PowerShell cmdlets in this document require ApplicationGruu parameters. Their contents might be a bit counter-intuitive, because the Microsoft objects refer to either a Service Gruu or a Computer Gruu, but not to Application Gruus. So which one is needed here? Follow these steps to acquire the correct data:
Execute the following SfB PowerShell cmdlet to list all the available trusted applications:
Get-CsTrustedApplication
Find the trusted application entry dealing with the CE pool you created earlier (in our example it is ce-pool.msvoip.dev).
Find the Computer Gruu parameter of this trusted application and extract the part that deals with the actual CE Core Host you are currently deploying.
Example ApplicationGruu for a HA CE environment, copied from the Computer Gruu of the trusted application:
sip:ce-dev.msvoip.dev@msvoip.dev;gruu;opaque=srvr:ce-pool:bJ-p03O411CzeiHbxYOOhQAA
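Picking the right entry for a given CE Core Host amounts to matching the host name at the start of the GRUU. The sketch below assumes the GRUU format shown in the example above (sip:host@domain;gruu;opaque=...). The function names are hypothetical, and the second GRUU in the usage lines is a made-up placeholder, not a real value.

```python
def gruu_host(gruu):
    """Return the host FQDN embedded in a 'sip:<host>@<domain>;gruu;...' string."""
    if not gruu.startswith("sip:"):
        raise ValueError(f"not a SIP GRUU: {gruu!r}")
    user_part = gruu[len("sip:"):].split(";", 1)[0]
    return user_part.split("@", 1)[0].lower()


def pick_application_gruu(computer_gruus, host_fqdn):
    """Select the one Computer Gruu that belongs to the given CE Core Host."""
    matches = [g for g in computer_gruus if gruu_host(g) == host_fqdn.lower()]
    if len(matches) != 1:
        raise LookupError(f"expected one GRUU for {host_fqdn}, found {len(matches)}")
    return matches[0]


computer_gruus = [
    # Example value from this article.
    "sip:ce-dev.msvoip.dev@msvoip.dev;gruu;opaque=srvr:ce-pool:bJ-p03O411CzeiHbxYOOhQAA",
    # Placeholder for the secondary server.
    "sip:ce-ins.msvoip.dev@msvoip.dev;gruu;opaque=srvr:ce-pool:PLACEHOLDER",
]
print(pick_application_gruu(computer_gruus, "ce-dev.msvoip.dev"))
```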
Reporting Data Collection During HA Failover
Contact Expert collects statistical data from all aspects of the ongoing operation of the contact centre in real-time. This is accomplished by server components installed on the application server of the first domain.
In a HA deployment where two CE servers form a single domain, and assuming this is the first domain, only the active server of domain 1 collects data. In a failover situation, the passive server takes over the data collection duties automatically.
The reporting data collection service is deployed to domain 1 only; it is not installed on any other domain. In a complex system where more than one domain is deployed, if domain 1 is disconnected or inoperable (or, if it is a HA pair, both its active and passive servers are), then reporting data collection stops for all other servers too. This can be remedied, but requires manual intervention.
Additional Requirements
SQL Server
The high availability solution described here assumes that both the CE system and reporting databases are hosted by a dedicated SQL Server, preferably using some of the offered redundancy features such as a flavour of AlwaysOn High Availability Groups.
The SQL server must be running on a separate host (most probably running on an SQL cluster).
CE databases must not be hosted on a CE application server.
SfB Custom Presence States
SfB custom presence states required by the CE agent application are defined in the following XML file on the CE Core Host: ace_presence.xml
In order to force the SfB client to download and use these custom presence states, the installer of the CE Agent Application sets the Windows registry key CustomStateURL on each agent PC. The location of this registry key depends on the version of the SfB client:
SfB Client version | Registry location |
---|---|
Lync 2013 or SfB 2015 Client | HKLM\SOFTWARE\Policies\Microsoft\Office\15.0\Lync |
SfB 2016 or O365 SfB Client | HKLM\SOFTWARE\Policies\Microsoft\Office\16.0\Lync |
These registry hives contain the CustomStateURL key with the following value:
http://[CustomPresenceHostFQDN]:8080/ClientAccessServer/ace_presence.xml
By default, the XML file is downloaded from one of the CE application servers, which is not a good solution for high availability. Instead, the ace_presence.xml file should be stored on a separate web server and the CustomStateURL Windows registry key should be set accordingly.
You do not need to change the Windows registry key on each agent PC one by one. CustomStateURL can be specified from SfB client policy as well. Use the Set-CsClientPolicy cmdlet in the Skype for Business Server Management Shell.
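The registry locations and the URL template described in this section can be summarised in a short sketch, for example to generate the value to roll out when the file moves to a separate web server. The names below are hypothetical, and presence.msvoip.dev is an invented host. The port and path are taken from the default template and may differ on your own web server.

```python
# Registry location of CustomStateURL per Office version, as listed above.
REGISTRY_LOCATIONS = {
    "15.0": r"HKLM\SOFTWARE\Policies\Microsoft\Office\15.0\Lync",  # Lync 2013 / SfB 2015
    "16.0": r"HKLM\SOFTWARE\Policies\Microsoft\Office\16.0\Lync",  # SfB 2016 / O365
}


def custom_state_url(host_fqdn, port=8080,
                     path="/ClientAccessServer/ace_presence.xml"):
    """Build the CustomStateURL value from the default template."""
    return f"http://{host_fqdn}:{port}{path}"


url = custom_state_url("presence.msvoip.dev")  # hypothetical web server
for office_version, key in REGISTRY_LOCATIONS.items():
    print(f"{key}\\CustomStateURL = {url}")
```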
Network Shares Configured
CE uses a shared folder for playing back recorded audio files via the CE portal, and also for storing exported contact lists, regardless of which server of a HA pair is active. While these functions work just fine in a non-HA environment even if these shares are not configured, it is essential to set them up for HA:
Add-CEContactManagementShare
Add-CEMediaReplayShare
For more information on these and other CE PowerShell cmdlets please read the PowerShell Commands article.
Recorded Conversations Stored Outside of CE
Media files should be stored on external (network) drives. CE recording rules should be configured accordingly.
The following article contains information on how to set up external file shares: CE Recording
Each URL must be set in Agent Policy
The script capability and lookup / customer history functions use web based data served by a web server on the CE Core hosts. If these features are active and in use, then the URL of these should be set in the agent policies.
Each URL is a template. You do not need to change these templates; you just need to check the checkbox in the Set column. These URL templates guarantee that tasks and histories are always downloaded through the active CE application server.