/----------------------------------------------------------------------------\
|                                                                            |
|                             The Greater Scroll                             |
|                                                                            |
|                               -=*=- of -=*=-                               |
|                                                                            |
|                               Dialing Wisdom                               |
|                                                                            |
\----------------------------------------------------------------------------/
                                     *=*

Copyright 1996, 1997 WebTV Networks, Inc.  All rights reserved.
Revised: 1998/01/07
Bugs --> fadden

                         *** WebTV Confidential ***

 --> This document contains a great amount of detail on how our service <--
 --> works. While not strictly "trade secret" information, care should  <--
 --> be taken to keep this within WebTV/Microsoft.                      <--

History in brief:
    1998/01/07  Added IV.P and made numerous small changes.
    1997/11/13  Added some new sections and many corrections.  General
                release.
    1997/11/03  Several corrections and new additions.  First public draft.
    1997/10/30  Complete rewrite; renamed to "Greater Scroll".
    1997/08/11  Touched up a bit for Microsoft folks.
    1996/11/19  Assorted notes on interactions between dialing options,
                "visible dialing", black holes, and Spooky dial Options.
                [ This is the version most people are familiar with. ]
    1996/11/17  Load balancing makes its debut.
    1996/11/11  We now have flat-rate IAPs and associated nastiness.
    1996/09/01  Did something.
    1996/08/30  Added comment about handling "1-800 addicts".
    1996/08/29  Clarified access number handling.
    1996/08/27  First draft of the "Great Scroll".


Contents
========

I. System Overview
    A. Whassup with this
    B. Joe gets wired
    C. Recap
II. Telco Issues
    A. Fancy words and TLAs
    B. Dial patterns
    C. Area code splits
    D. Semi-automatic number identification
    E. The local, the toll, and the ugly
    F. POP, phone line, and network quality issues
III. Service Fundamentals
    A. Dial overrides, Satan, and you
    B. Introduction to tellyscripts
    C. Call ordering and the fallback number
    D. The "clientinfo" command
    E. Vend-A-Telly
    F. Visible Dialing
IV. Dialing Details
    A. PhoneDB details
    B. Intro to POP load balancing and provider rotation
    C. Tellyscript return codes
    D. Dial patterns revisited
    E.
Secret codes, NVRAM, and "have you moved?"
    F. How phone settings work
    G. Radius, access numbers, and PSI
    H. OpenISP
    I. Client upgrades and brain-dead boxes
    J. ComingSoon and friends
    K. Pick-yer-POP
    L. MessageWatch and EPG
    M. Idle timeouts
    N. Adding new providers
    O. VideoAds
    P. Automatic Number Frustration
V. Extra Goodies
    A. OraclePhoneDB and POPtimization
    B. Fallback usage cap
    C. MCI
VI. For Further Reading
    A. On the web
    B. In the service source tree


                                      |
                                      |
                        -=*=- I. System Overview -=*=-
                                      |
                                      |

-= I.A =-  Whassup with this

The WebTV system is a combination of a set-top box and an online service. The set-top box ("WebTV Internet Terminal" or "WebTV Plus Receiver"; henceforth just "box") is connected to a television and a phone line. Once it has successfully dialed into an Internet Service Provider, it connects to the WebTV service, and great things happen.

The simple act of getting a user connected to a local ISP is surprisingly difficult. This document explains the fundamentals of getting connected. The intended audience is Customer Care, SOC, Network Operations, QA, and Engineering. Not all sections are relevant for everyone.

The focus of this document is on the U.S. phone system. International issues, including a description of the Japanese phone system, can be found in the "IntlPhoneNotes" document.

Sections I, II, and III should be generally useful. Sections IV and V are more technical and aren't important for everyone to understand. I've chosen not to include the internal workings of tellyscript and PhoneDB generation in this document, because they're complicated, volatile, and really only necessary for engineering and a few people in SOC and netops.

Recommendations on Customer Care practices (notably with regard to dial overrides) are simply that: recommendations. They may or may not be consistent with current Customer Care policies.


-= I.B =-  Joe gets wired

A brief example should help illustrate the major components of the WebTV system.
When Joe User brings his box home from the store, the first thing he does is try to set it up, usually without reading the instructions. Sometimes this even works.

When the box is powered on, it listens for a dial tone on the phone line. (You can turn this off in the dialing options; if you do, it waits for a few seconds and then dials blindly.) If the phone line hasn't been plugged in or isn't hooked up correctly, the box will complain that it can't hear a dial tone, and offer to try again or let you tweak the dialing options.

Joe gets everything wired up and tries again. This time the box hears a dial tone, so it dials a toll-free 800 number. This number is usually referred to as the "scriptlessd number" or (for historical reasons that we're hoping to obliterate) the "prereg number". Most users can connect to this without trouble once their box is set up properly.

Once connected, Joe's box starts talking to the scriptlessd server. scriptlessd gets the caller's phone number via a feature called ANI (Automatic Number Identification) that is similar to CallerID, except that it works from almost everywhere and can't be blocked. If the service is unable to get the user's number from ANI, scriptlessd will put up a screen asking the user to enter their phone number.

From the user's ANI we know where they are and what the closest POPs are (POP is Point Of Presence, typically a bank of modems connected to an Internet Service Provider, or ISP). The two (someday, three or more) best POPs are assigned to that user, and put into a set of dialing instructions called a "tellyscript" (a bad pun on a product from General Magic). The tellyscript tells the box which numbers to dial and how to dial them.

After getting the tellyscript, the box hangs up and dials the first POP in the list, which is hopefully a local call. If the first number is busy it will hang up and try another.
After trying each of the POPs twice it will give up and call a toll-free 800 "fallback" number, the use of which may be restricted to a few hours per month. Barring network outages or excessive local congestion, most users shouldn't need to use the fallback number. With a little luck, most users will successfully connect to the WebTV Network without further intervention. The box connects to the "headwaiter" server, which tells it where to go. Shortly after connecting the box sends up a "phone log" (sometimes called a "connection log") that shows what numbers were dialed, what failed, and what ultimately succeeded. These logs are used to generate POP failure statistics and to debug problems. When Joe turns his box off with the keyboard or remote, the tellyscript is saved in NVRAM (Non-Volatile Random Access Memory, which isn't what we're actually using but it works the same). The next time the box is powered on, it skips the scriptlessd step and dials directly into the local POP. -= I.C =- Recap The "box" is the thing what sits on your television set. The "service" is what it talks to when it gets dialed in. The service is composed of multiple "servers" that do specific things, like hand out "tellyscripts", show you the home page, or let you read mail. The box knows how to find "scriptlessd", scriptlessd knows how to find the "headwaiter", and headwaiterd knows how to find all the other servers. ANI tells us the caller's phone number. From the ANI data we assign local POPs to the user. The specific dialing instructions are contained in a tellyscript. Boxes with tellyscripts dial into their local POP, and connect to the headwaiter. Boxes without tellyscripts go to "scriptlessd" first to get a tellyscript, and then hang up and redial a local POP. | | -=*=- II. 
Telco Issues -=*=- | |

-= II.A =-  Fancy words and TLAs

While the evolution of the US phone system shows a great deal of careful and occasionally ingenious thought, there are some things about it that just plain suck. Before we can go into detail, there are a few terms and proper nouns that should be defined.

"telco" means "telephone company". It's a generic term, as in "this telco billing stuff drives me nuts." Telco guys like to say telco a lot. Telco.

"CCMI" is Center for Communications Management Information. CCMI sells us a database of call pricing information that is frequently accurate.

"POP" is Point Of Presence, generally a bank of modems with a terminal server that connects to a network. The modems are usually part of a "hunt group", so that you can dial just one number, and if the first line is busy it "hunts" for the next free one. When we try to dial a POP, but get a person instead, we refer to it as dialing a MOM.

You sign up for a "calling plan" when you get telephone service hooked up. In the Bay Area you usually just choose between flat-rate and measured-rate service, but in other places you have a wide range of choices. For example, by adding a higher monthly charge to your phone bill you could get flat-rate local calling to a larger area.

A "dial pattern" tells you how many digits to dial when you're calling a particular number. In the US, these may be 7, 10, or 11 digits long. In the Bay Area you can usually call yourself with a short number like 614-5539 or a long one like 1-650-614-5539, but in other areas the systems are less lenient. In some cases it can be more expensive to dial 11 digits than 7. The customer's choice of calling plan can affect the dial patterns that they have to use.

"Tellyscripts" are a WebTV creation; they're programs sent to the box by the service. They contain instructions that tell the box how to configure the modem, which POPs to call, and how it should dial them.

"LEC" is Local Exchange Carrier.
These are the guys who handle local calls and "local toll". Pacific Bell is our LEC. CLECs are Competitive Local Exchange Carriers, a new kind of carrier made possible by the 1996 Telecommunications Act. You too can run your own phone company. "RBOC" is Regional Bell Operating Company. These are the "baby Bells" that got spun out of AT&T several years ago. Pacific Bell is an RBOC. Sometimes these are just referred to as "BOC"s. "IOC" and "UOC" are CCMI abbreviations for Independent Operating Company and Unknown Operating Company. Contrast with BOC. IOCs tend to be smaller phone companies or CLECs, UOCs are usually phone companies run by rural cooperatives or out of somebody's garage. An IOC that CCMI doesn't know anything about is a UOC. "IXC" (sometimes "IEC") is Inter-eXchange Carrier. This is a fancy term for "long distance company" that telco people like to throw around. AT&T is an IXC. When you make a long-distance call, the IXC pays money to the LEC where the call came from and the LEC where the call went to, so calls that avoid IXCs tend to be cheaper. "LATA" is Local Access Transport Area, a geographical region defined by the phone companies. The way things traditionally worked is that your LEC handles local calls and intra-LATA (in the same LATA) toll calls, while your IXC handles inter-LATA (between LATA) toll calls. So a toll call to a location 20 miles north might be handled by Pacific Bell, while a similar call in the other direction might be handled by AT&T, based on where the LATA boundaries fall. Calls that cross state boundaries follow an even more mysterious set of rules. The "Telecommunications Act of 1996" really screwed everything up. Your IXCs can be LECs, CLECs can provide local service with the LEC's equipment, and generally anybody can do anything. This is why MCI can offer local service now. "PIC" is Primary Interexchange Carrier. This term can be used both as a verb and an adjective. 
Your phone line can be "PICed" to use a specific carrier for your IXC, and more recently you can have an intra-LATA PIC done for local toll calls. A "PIC code" is a sequence of digits that you can enter before dialing a number to choose a different carrier; examples are 10288 (1-0-ATT) or 10321 (Telecom*USA's 10-3-2-1 program). "PIC charges" are the fees that your IXC pays to your LEC when you change your long distance company. The PIC code format is in the process of changing from 10XXX to 101XXXX. "Tariffs" tell you how much a call between two points costs. For long distance calls, the tariffs from the LECs on both ends and the relevant IXC all have to be factored in. "PUC" is Public Utilities Commission. The PUC in each state has a great deal of control over the tariffs that the phone companies use. There are places where a long-distance call handled by AT&T is completely free, because the PUC decided that it should be. "Local calls", in the telco world, are not necessarily free calls. The difference between local and toll is defined by the tariffs, which are filed by the phone companies and monitored by the PUCs. Pacific Bell defines "zone 3" calls, which charge per-minute rates even to subscribers with flat-rate plans, as local. In the WebTV world we try to define "Local" as least-cost and "Expensive Local" (ExpLocal) as any local call that is more expensive than the minimum. We calculate the minimum by figuring out what it would cost for the customer to call himself. Any local call that costs more is labeled ExpLocal. The "rate center" is a geographic point used for billing purposes. "MTS" (Message Toll Service) coordinates are based on the rate center. The cost of a long distance call is based on "major MTS coordinates" for calls over 40 miles, and "minor MTS coordinates" for calls under 40 miles. For local calls the "wire center" coordinates are used. 
Yes, it could be more complicated: the coordinates are specified in "V&H" (Vertical and Horizontal) units, 1670 feet each. "POTS" is Plain Old Telephone Service. The term is used to differentiate standard phone service from things like ISDN or cellular. "C.O." is Central Office. In the typical house or apartment, a pair of copper wires runs from your telephone to the central office. The distance between your phone (or, more importantly, your WebTV box) and the central office, and how well the wires are shielded, can affect the quality of your phone connection and hence your modem connect rate. "NPA/NXX" is the obfuscated term for area code and prefix. If your phone number is 650-614-5539, your NPA is 650 and your NXX is 614. The NPA and NXX are enough to identify where the call is coming from. The last four digits of the phone number are sometimes called the "subscriber number". In some contexts the term "exchange" is synonymous with NPA/NXX. An "Exchange Area" is a collection of NPA/NXXs for which the billing is identical. For example, two calls from anywhere in Palo Alto will have the same cost so long as both callers have the same calling plan and service providers. Exchange areas may include dozens of NPA/NXXs or might only have one. They might overlap geographically (because of paging/cellular exchanges), but each NPA/NXX is part of only one exchange area. "LCA" is short for Local Calling Area. The LCA for Palo Alto is the set of exchange areas that are local calls from the Palo Alto exchange area. Put more simply, if you're a local call for me, then you're in my LCA. LCAs may overlap. LCAs aren't necessarily symmetric; just because you are a local call for me doesn't mean that I am a local call for you. "NANP" is the North American Numbering Plan. The Plan defines all the area codes, how dialing patterns will work in the future, and other dry subjects. 
It's NANP rather than USNP because it applies to Canada, Guam, and places out in the Caribbean, all of which are part of North America if you lean back and squint. It does not cover Mexico. "ISP" and "IAP" are Internet Service Provider and Internet Access Provider. They are essentially the same thing, with a subtle and unimportant difference. We usually refer to them as IAPs. Concentric Networks Corp. (cnc), PSINet, Inc. (psi), and UUNET Technologies, Inc. (uunet) are examples of IAPs. The "backhoe" is a large piece of construction equipment used for digging trenches and cutting through network cables at inopportune moments. The "PhoneDB" is a WebTV creation that combines the CCMI data with a list of POPs from several IAPs, and comes up with POP assignments for every NPA/NXX. (If you understand what I just said, you're ready to graduate.) The POP-O-Rama web page lets you do queries on current and past PhoneDBs. -= II.B =- Dial patterns People who grew up in California were spoiled by Pacific Bell's coherent dialing pattern system. For the most part, you can dial to any point within the same area code by entering a 7-digit number, and you get to numbers in other area codes by entering an 11-digit number. Dialing numbers in the same area code using an 11-digit number is allowed. Other parts of the country aren't as straightforward. There are actually four kinds of calls you can make: HL - Home area code, Local call. Calls within Mountain View are HL. HT - Home area code, Toll call. Sunnyvale (408) calling Santa Cruz (408). FL - Foreign area code, Local call. Mountain View (650) to Sunnyvale (408). FT - Foreign area code, Toll call. Mountain View to New York. Each of the four types can have a different "expected" dialing pattern, as well as a "permitted" dialing pattern. Certain combinations have unpleasant consequences. HL is almost always 7 digits, but some places (like Maryland) require 10-digit dialing for *all* local calls. 
Yes, you have to include the area code to call your neighbor down the street. Enlightened areas like California have 11-digit dialing as a "permitted" HL pattern. HT is generally 7, 11, or both. Places that require 7-digit dialing for home/local calls and require 11-digit dialing for home/toll calls are troublesome, because the number of digits depends on whether the destination is a local call, and the definition of "local" depends on your calling plan. In many cases there is no way for WebTV to know ahead of time how many digits the box should dial. Guessing wrong results in a recording from the phone company. FL is usually 10 or 11, but in some cases is 7. In nasty cases it's 7 and 10/11 aren't allowed at all. It's nasty because we are *required* to dial a 7-digit number into a different area code when the call is local, but would be dialing an 11-digit number if the call were toll. So if we think something is local when it really isn't, we could be dialing a 7-digit number in the *caller's* area code rather than the *callee's* area code, and the WebTV box will be waking up somebody's grandmother. The service takes great pains to avoid this situation. FT is always 11, no exceptions. Using the right pattern can be important. For example, there are places where you are either not allowed to dial 11-digit numbers for local calls, or are charged more than you would for dialing 7 (presumably because the call is routed through the IXC as soon as the leading '1' is seen, instead of being handled by the LEC). The CCMI database has "hints" on dialing patterns, but they are sometimes inaccurate. Because the dialing pattern depends on whether a call is local or toll, it depends on what your calling plan defines as being local. This makes it a bit of a challenge to get the dial pattern right. To work around these issues, the WebTV service takes the best guess it can, and remembers the cases that succeed. 
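
The four call types and the three digit counts can be sketched in a few lines. This is only an illustration of the terminology above, not the service's actual code; the function names and the plain 10-digit string representation are assumptions made for the example.

```python
def classify_call(caller_npa, callee_npa, is_local):
    """Return "HL", "HT", "FL", or "FT" for a call.

    caller_npa / callee_npa are 3-digit area code strings; is_local is
    whatever our (possibly wrong) data says about the call being local.
    """
    home = "H" if caller_npa == callee_npa else "F"
    kind = "L" if is_local else "T"
    return home + kind


def format_number(number, digits):
    """Render a 10-digit number as a 7-, 10-, or 11-digit dial string."""
    if len(number) != 10 or not number.isdigit():
        raise ValueError("expected a 10-digit number")
    npa, nxx, sub = number[:3], number[3:6], number[6:]
    if digits == 7:
        return f"{nxx}-{sub}"
    if digits == 10:
        return f"{npa}-{nxx}-{sub}"
    if digits == 11:
        return f"1-{npa}-{nxx}-{sub}"
    raise ValueError("dial patterns are 7, 10, or 11 digits")


# Mountain View (650) calling a local number in Sunnyvale (408) is FL.
print(classify_call("650", "408", True))
print(format_number("6506145539", 7))
print(format_number("6506145539", 11))
```

Note that the 7-digit form throws away the area code, which is exactly why a wrong local/toll guess on an FL call can ring a stranger's phone in the caller's own area code.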
The service remembers a set of dialing patterns that looks like this (output is from "dpedit", the Dial Pattern EDITor):

    The dial patterns for '01fad82501b002ba' (ANI=004154631671) are:
    S  #  ANI           POP           Mode
    +  0  415-614-5539  415-233-0570  7-digit
    +  1  415-614-5539  415-322-0489  11-digit
    +  2  415-463-1671  415-233-0570  7-digit
    I  3  415-463-1671  415-666-9999  7-digit
    +  4  415-463-1660  415-322-0489  11-digit
    +  5  415-463-1660  415-233-0570  7-digit
    N  6  415-463-1660  510-742-0207  11-digit
    -  7

Each line is one entry in the dial pattern table. It has the person's ANI at the time the call was placed, the POP number that the person was calling, and how many digits were used to dial it. We have to record the ANI, because if they move the box to a different place, or even to a different phone line with a different calling plan, the dial patterns can be different. Same story for area code splits (see next section).

When a user first signs up, or first appears at a new number, we have no information about a person's dial patterns. The tellyscript that gets sent down will first try one pattern, then if that fails, it will try the next. When one succeeds, we add an entry to the table.

Suppose the tellyscript for Palo Alto first tries 7-digit dialing and then tries 11-digit dialing. What happens if the POP happens to be busy on the first attempt, but succeeds on the second? We will end up recording a success with 11-digit dialing, and will use that from then on. This isn't perfect, but it's hard to tell the difference between different kinds of failures ("all circuits are busy" sounds just like "you don't need to dial a 1 in front of that" to the modem). Most of the time it works.

A problem that occasionally surfaces is with customers who turn "audible dialing" on and get excited when the first attempt fails.
If they were to wait for a minute or two until the box timed out and tried the next number, everything would work out fine; but instead they hear the first attempt fail and immediately call Customer Care. The solution is NOT a dial override, but rather to encourage the customer to have more patience. (In one case the user was told to use the 32768 secret code, which clears out all of the settings in NVRAM. This turned off audible dialing. The customer successfully dialed in shortly thereafter.) It is also possible for a customer's dialing patterns to change over time, perhaps because they change local calling plans. This is not handled automatically, because the service can't easily distinguish a dead POP from a bad pattern. Once again, the solution is NOT a dial override. The "dpedit" utility can be used to adjust the dial patterns. Once changed, send the user through the "new number" routine so they go back through scriptlessd and get a script with the updated data. See the dpedit "README" file for details on using it. Sometimes there are exceptions to dial pattern rules within a certain area. For example, there was an InternetMCI POP at 415-482-2900 in Redwood City that was a local call from Palo Alto. Every other call to Redwood City could be dialed with 7 or 11 digits, but not that one. If you didn't use 7-digit dialing, you got a recording chastising you for being so clueless. The moral of the story is that there's no way to know for sure what will work until it's tried. Things can get pretty weird. In the 608-326 exchange in Wisconsin, if you call "873-xxxx", you get a local number in Iowa at 1-319-873-xxxx. If, on the other hand, you dial 1-608-873-xxxx, you make a toll call to another point in Wisconsin. Even though you're in the 608 area code, and there's a 608-873-xxxx, your call to "873-xxxx" goes to a different area code. In this particular case, we're allowed to dial 1-319-873-xxxx, so by using 11-digit dialing there's no ambiguity. 
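
The try-then-remember behavior of the dial pattern table can be sketched as follows. This is a minimal illustration under stated assumptions, not how the service or tellyscripts are actually written: the class and function names are hypothetical, the default order of 7 then 11 digits is borrowed from the Palo Alto example, and try_dial stands in for whatever actually drives the modem.

```python
DEFAULT_ORDER = (7, 11)  # assumed order, taken from the Palo Alto example


class DialPatternTable:
    """Remembers which digit count worked, keyed on (ANI, POP)."""

    def __init__(self):
        self._entries = {}  # (ani, pop) -> digits that succeeded

    def record_success(self, ani, pop, digits):
        self._entries[(ani, pop)] = digits

    def patterns_to_try(self, ani, pop, default_order=DEFAULT_ORDER):
        # Entries recorded under a different ANI never match, so a moved
        # box or an area code split starts over with the default order.
        known = self._entries.get((ani, pop))
        if known is not None:
            return [known]
        return list(default_order)


def dial_pop(table, ani, pop, try_dial):
    """Try each candidate pattern; remember the first one that connects."""
    for digits in table.patterns_to_try(ani, pop):
        if try_dial(pop, digits):
            table.record_success(ani, pop, digits)
            return digits
    return None


# A fake modem that only connects with 11-digit dialing.
table = DialPatternTable()
print(dial_pop(table, "4156145539", "4153220489", lambda pop, d: d == 11))
print(table.patterns_to_try("4156145539", "4153220489"))
```

The sketch also reproduces the weakness described earlier: if the 7-digit attempt fails only because the POP was busy, the 11-digit success is what gets remembered.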
One other note: the list of dial patterns only determines whether the box dials 7, 10, or 11 digits when calling a POP. It does *not* decide which POP a customer will get, or in what order they will be tried.


-= II.C =-  Area code splits

Area code splits come in two varieties, geographical splits and overlays. Geographical splits are done like the 415/510 and 415/650 splits, where a geographic region gets a different area code. With overlays, the same area gets two area codes. Usually one area code is used for voice, while the other is used for FAX machines, pagers, and cellular phones.

For both kinds of splits, the transition is done over a period of a few months. The following chart illustrates the process, assuming that somebody in San Francisco at 415-659-0610 and somebody in Palo Alto at 415-614-5539 (changing to 650-614-5539) are trying to call each other.

(1) Pre-split. The 650 area code does not exist yet.
    From S.F., dialing 614-5539 works.
    From S.F., dialing 1-415-614-5539 works.
    From S.F., dialing 1-650-614-5539 results in a "what the hell area
      code is that?" message.
    From P.A., dialing 659-0610 works.
    From P.A., dialing 1-415-659-0610 works.
    The ANI for the person in Palo Alto is 415-614-5539.

(2) "Permissive" dialing. You are allowed, but not required, to dial 650.
    From S.F., dialing 614-5539 works.
    From S.F., dialing 1-415-614-5539 works.
    From S.F., dialing 1-650-614-5539 works.
    From P.A., dialing 659-0610 works.
    From P.A., dialing 1-415-659-0610 works.
    The ANI for the called person is now 650-614-5539. (Sometimes the
      local phone companies blow this, and do it early or late. It's
      unwise to assume that the ANI will change at the very start of the
      permissive period.)

(3) "Mandatory" dialing (usually starts about 6 months after "permissive").
    From S.F., dialing 614-5539 gets a "you need to dial 650" recording.
    From S.F., dialing 1-415-614-5539 gets a "you need to dial 650"
      recording.
    From S.F., dialing 1-650-614-5539 works.
    From P.A., dialing 659-0610 gets a "you need to dial 415" recording.
    From P.A., dialing 1-415-659-0610 works.

(4) Eventually the no-longer-used numbers get reassigned.
    From S.F., dialing 614-5539 gets a wrong number.
    From S.F., dialing 1-415-614-5539 gets a wrong number.
    From S.F., dialing 1-650-614-5539 works.
    From P.A., dialing 659-0610 gets a wrong number.
    From P.A., dialing 1-415-659-0610 works.

What makes area code splits especially frustrating for us is that the dial pattern can change. Before the split, if you were in Palo Alto and calling a San Francisco POP at 415-659-0610, you could just dial 659-0610. After the split, you would be calling a number in a different area code, and would be required to dial 1-415-659-0610.

Even though you haven't moved, your ANI has changed out from under you. The WebTV service can't fix you if you can't log in, and guess what, you can't log in except through the 800 number. The good news is that if you make your box go back through scriptlessd, it will detect that your ANI has changed, and all of your old dial patterns will be ignored because they were tied to your old ANI.

Ideally we wouldn't have to put the users through this manual step, and would either send them back through scriptlessd automatically or just make the change to their area code directly. But how do we do this?

One solution here is to have an 800 fallback number that also gets your ANI, and compare the current ANI with the ANI on record. If all of your local POPs are failing because we're using the wrong dial pattern, you end up on the fallback number, and once there we can automatically detect that it's because your area code changed.

Also, given sufficiently detailed information about area code splits, we could program the box to dial a different set of numbers depending on whether "today" is pre-split or post-split. The latter solution isn't perfect, because if the box loses power it forgets what day it is, but it's a little cleaner.
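
The ANI comparison on the fallback number might look something like the sketch below. It is a guess at the idea, not a description of anything the service does today: the function name and return labels are made up, and treating "same last seven digits, different area code" as a probable split is an assumption that a real implementation would want to back up with an actual list of known splits.

```python
def split_number(number):
    """Break a 10-digit number into (NPA, NXX, subscriber)."""
    if len(number) != 10 or not number.isdigit():
        raise ValueError("expected a 10-digit number")
    return number[:3], number[3:6], number[6:]


def compare_ani(recorded, current):
    """Classify the difference between the ANI on record and the ANI
    seen on the fallback call.

    Assumption: if only the area code differs, the caller probably went
    through an area code split rather than moving the box.
    """
    if recorded == current:
        return "unchanged"
    old_npa, old_nxx, old_sub = split_number(recorded)
    new_npa, new_nxx, new_sub = split_number(current)
    if (old_nxx, old_sub) == (new_nxx, new_sub) and old_npa != new_npa:
        return "area-code-change"
    return "moved"


# Palo Alto before and after the 415/650 split.
print(compare_ani("4156145539", "6506145539"))
```

Either non-"unchanged" result would be a reason to send the box back through scriptlessd; the distinction mostly matters for diagnosing what happened.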
You might be tempted to think that dialing the full 11-digit number every time would solve this problem. In the San Francisco/Palo Alto example above, the 11-digit pattern worked correctly in every case. Unfortunately, as mentioned in the section on dial patterns, 11-digit calls might either be disallowed or might be more expensive than a 7-digit call to the same number. A particularly troublesome area code split happened in Maryland in the middle of 1997. Not only did the area code split, but all local calls suddenly had to be dialed with 10-digit numbers. This change required that the service "forget" all 7-digit patterns for callers whose ANI showed them to be in Maryland. The service config option IgnoreDialPattern was added to deal with changes like this in the future. -= II.D =- Semi-automatic number identification When we get the caller's phone number via ANI on the 800 scriptlessd number, we get a little more data with it. A typical ANI string looks like "006506145539". The last 10 digits are the phone number. The first two are the OLS (Originating Line Screening) code. This allows us to tell if somebody is calling in from a prison, hotel room, or pay phone rather than a standard phone line. At least, it would, if we were able to get at the OLS code with our systems, which we can't. But I digress. If you're calling in from a point in the United States, Canada, or affiliated areas like Puerto Rico, chances are the ANI number is valid. There are specific regions that don't support ANI, however, and there are times when the ANI just doesn't seem to want to show up. In cases like these, the service will ask the user to enter their own phone number. It doesn't need to be exactly right; it just needs to be in the same "exchange area" as the box. If the person has two phone lines, and puts in the voice number, it will usually work just fine. 
If the service for the lines is provided by different local phone companies, though, the billing can be quite different, so the system works best when the number comes from ANI.

To make it easier to diagnose cases where the user entered the wrong value for their phone number, the service labels "manual ANI" entries by replacing the OLS code with a WebTV-defined value. Some interesting values:

    99 (+ 10 digits) - number was entered on "enter your phone number"
                       screen.
    98 (+0000000000) - special code used; probably an international demo
                       box.
    97 (+0000000000) - special code used; probably an international demo
                       box.
    96 (+ 10 digits) - number changed with dpedit or clientpopedit.
    95 (+0000000000) - service is ignoring ANI values (never on
                       production!)

If somebody is dialing a totally inappropriate set of POPs, and their ANI number starts with "99", chances are they entered the wrong number on the "enter your phone number" screen. WebTV isn't responsible for toll charges incurred by sticky-fingered users, but diagnosing this quickly will leave the customer happier. Sometimes you need to check the "ANI history" to see if they blew it at some point in the past.

What happens if we successfully get the user's ANI but can't recognize the number? This happens when new exchanges are added faster than CCMI can keep up. In cases like this, we give the user the "global default" POP, which is usually an 800 number embedded in the PhoneDB. When we finally put out a PhoneDB that does recognize their ANI, we will automatically send them a new tellyscript with the appropriate POPs when they next visit the headwaiter.

If the PhoneDB "forgets" some numbers, possibly because an old area code split has caused some exchanges to cease to exist, we will simply stop updating their tellyscript until the next time they go through scriptlessd. (The service should actually force them back through scriptlessd once, in case their ANI changed as part of an area code split but we never caught it.
This is currently an open bug.) If we get the ANI, and we recognize it, but it's for an area that we don't yet support (e.g. Puerto Rico), we don't send the user a tellyscript at all. Instead they just get a message saying that WebTV isn't yet supported in their area. What happens if we don't get their ANI, and it's a "Classic" box doing a flash download? Now we're in trouble: we don't have their ANI, and we can't put up a user interface and ask because the "Classic" flash downloader doesn't *have* a user interface. If they're talking to scriptlessd, they must be brain-dead, probably from an earlier failed download. We temporarily send them to an 800 number (the "NoANI" number), until they can finish the download. When the download finishes successfully, the box will automatically go back through scriptlessd. This has the added bonus of giving most users a more stable environment for doing the download, because the POP they're calling is under our control. One of the pitfalls of using ANI is that it only works when the user dials into an 800 number. It's very important that we know where the box is, because if we have the wrong value for their ANI we will be handing out the wrong set of POPs. If one of those POPs is a 7-digit number, we could be dialing a 7-digit number in the wrong area code, and call a MOM instead. On the other hand, 800# calls are expensive, and we have limited capacity on the modem racks, so we can't have the box dial into the 800 number every time the box powers up. The current approach for dealing with this is to assume that the box might have moved whenever it loses power. We display a message the first time the box turns on after losing power that shows their phone number (e.g. "650-614-XXXX"; the last four digits are blanked in case they return the box to the store). If the user has moved the box to a different phone number, they can just hit "Moved", and the box will go back through scriptlessd. 
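
For anyone staring at ANI strings in logs, the layout described in this section (a two-digit code followed by the ten-digit phone number) and the partially blanked number shown in the dialog can be sketched like this. The helper names and the dictionary layout are invented for illustration; only the string format and the code meanings come from the text above.

```python
# WebTV-defined values that replace the OLS code, per the list above.
WEBTV_CODES = {
    "99": 'number was entered on "enter your phone number" screen',
    "98": "special code used; probably an international demo box",
    "97": "special code used; probably an international demo box",
    "96": "number changed with dpedit or clientpopedit",
    "95": "service is ignoring ANI values",
}


def parse_ani(ani):
    """Split a 12-digit ANI string into its code and phone number."""
    if len(ani) != 12 or not ani.isdigit():
        raise ValueError("expected 12 digits: 2-digit code + 10-digit number")
    code, number = ani[:2], ani[2:]
    return {
        "code": code,
        "number": number,
        "webtv_defined": code in WEBTV_CODES,
        "note": WEBTV_CODES.get(code, ""),
    }


def masked_number(number):
    """Format a 10-digit number with the last four digits blanked."""
    return f"{number[:3]}-{number[3:6]}-XXXX"


info = parse_ani("996506145539")
print(info["note"])
print(masked_number(info["number"]))
```

A leading "99" on an account that is dialing strange POPs is the first thing to look for.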
Versions of the box before client 1.2 weren't able to display the ANI number in the dialog. A practical issue that has arisen on a few occasions is when a helpful store salesman runs the box through an initial scriptlessd connection before the customer takes it home. If the customer gets home and asserts that the box hasn't moved, they will end up with a tellyscript for the store's ANI rather than their own ANI. Because most of the units on shelves are client 1.0, they can't display the partial ANI in the "have you moved" dialog. The workaround was to put a test at the start of registration that figures out how long it has been since the box went through scriptlessd. If it has been more than a certain amount of time, the box is thrown out and must come back in through the 800 number. In the usual (non-helpful-salesman) case, the box will proceed to registration within a few minutes of visiting scriptlessd, so with a suitably defined interval -- currently 15 minutes -- we can solve the problem without creating a new one. -= II.E =- The local, the toll, and the ugly Figuring out what's local and what's not is far more difficult than you might expect. The single biggest obstacle is the lack of completely accurate data. What we get from CCMI is fairly accurate, but they're collecting tariff data from dozens of companies on hundreds of calling plans for 25,000 different exchange areas. With that much data, in a system as convoluted as the U.S. phone system, there's bound to be problems, and there's an awful lot of "process" between finding a problem and getting it fixed. We also have trouble with missing data. Some LCAs are entirely unsupported, others are partially supported. A "partially supported" LCA is one where the data is loaded once, when somebody asks for it. It isn't kept up to date, and there is no pricing information associated with the local calls. 
Based on this data the PhoneDB generator can tell that a call is local, or at least *was* local in the recent past, but can't tell how much it costs. This makes it impossible to distinguish between "Local" and "Expensive Local". The myriad filters and fancy footwork we do when generating a PhoneDB are outside the scope of this document. What's important is to understand how far you can trust the data and why it might be wrong, so that you can understand POP-O-Rama output and try to differentiate customer error from CCMI error. Here's an example of output from the "lookuppop" tool, which generates the output for the POP-O-Rama web page: For 561-357-0000 from W PALM BCH, FL (base cost=0): cnc/561-227-0012 in or near "West Palm Beach, FL" (W PALM BCH, FL) LOCAL 0.0mi [wc=7.6mi] cost=0 --> 227-0012 then 1-561-227-0012 uunet/561-681-9557 in or near "West Palm Beach, FL" (W PALM BCH, FL) LOCAL 0.0mi [wc=5.4mi] cost=0 --> 681-9557 then 1-561-681-9557 cnc/561-226-0010 in or near "Boca Raton, FL" (BOCA RATON, FL) ExpLocal 23.7mi [wc=19.0mi] cost=1840 --> 226-0010 then 1-561-226-0010 uunet/561-368-8801 in or near "Boca Raton, FL" (BOCA RATON, FL) ExpLocal 23.7mi [wc=19.0mi] cost=1840 --> 368-8801 then 1-561-368-8801 psi/954-971-5720 in or near "Pompano Beach, FL" (POMPANOBCH, FL) toll* 31.9mi [wc=26.6mi] cost=2927 --> 1-954-971-5720 uunet/954-486-4806 in or near "Fort Lauderdale, FL" (FTLAUDERDL, FL) toll 39.9mi [wc=31.9mi] cost=2927 --> 1-954-486-4806 cnc/954-845-0336 in or near "Ft. Lauderdale, FL" (FTLAUDERDL, FL) toll 39.9mi [wc=36.4mi] cost=2927 --> 1-954-845-0336 cnc/305-651-1819 in or near "Miami, FL" (NORTH DADE, FL) toll 53.5mi [wc=46.8mi] cost=2927 --> 1-305-651-1819 The first line identifies the exchange where the caller is. In this case, I asked for "561-357", and it filled in the last four digits with zeros (remember, you only need the NPA and NXX to identify the location). The location name is "W PALM BCH, FL". 
The names are cryptic because the CCMI database only has space for 10 characters, and they're all upper case. "FL" is the state, in this case Florida. "Base cost" is what we computed it would cost for somebody in the 561-357 NPA/NXX to call themselves, based on a call of a certain duration at a certain time of day. DO NOT tell this cost to a customer! It might be based on a calling plan other than what the customer has, and we don't want to be responsible for giving out cost figures that are based on inappropriate or possibly even inaccurate data. After the first line are eight sets of three lines, with one line for each POP. The first line in each set identifies the POP. "cnc/561-227-0012" means it's a Concentric Networks POP at 561-227-0012. There are two city names, "West Palm Beach" and "W PALM BCH". The latter is supplied by CCMI. The former is sent to us by the IAP, can be edited fairly easily, and is displayed to the customer in the "have you moved" dialog. The names don't always match up; note that the last entry says "Miami" and "NORTH DADE". This is generally because the CCMI entry describes things from the telco perspective. For example, the Pacific Bell phone book describes Cupertino as being in "San Jose 2", and CCMI shows Cupertino numbers as being in "SAN JOSE W". Ditto for Menlo Park, which appears to be in PALO ALTO. In general, the "nice" name is more accurate. If you believe the two are totally out of whack, ask the SOC to look into it. There is no "nice" name on the top line, because (1) we only have "nice" names for places where the POPs are, and (2) the NPA/NXX isn't enough to tell you what city the person lives in. Some NPA/NXXs cover more than one city. The next line tells you about what it costs for a user at the NPA/NXX to call that POP. The first word is one of the following: LOCAL - we believe the call is local, and that the cost of the call is the same as if the user called themselves. 
ExpLocal - CCMI says it's a local call, but it's more expensive to call than other local calls. Zone 3 calls in California are ExpLocal. PsuedoLocal - equivalent to ExpLocal in almost every respect. Explained below. toll - this is a toll call. It might be a "local toll" handled by the LEC or a long-distance call handled by an IXC. (In the ancient days of yore, there was a distinction between "LOCAL" and "local". The LocalMustEqualCostToSelf feature removed this distinction.) Regardless of how the calls price out, local calls always come before ExpLocal, and ExpLocal calls always come before toll. Toll calls that are cheaper than local calls are extremely rare, so we always prefer the local calls just in case there's an error in the tariff data. Entries with an asterisk (i.e. "toll*") denote a certain kind of IAP. This is explained later. Usually you should just ignore the asterisk. The number after the local/toll indication is the distance in miles between the rate center for the caller and the rate center for the POP, using the "minor" (a/k/a "under 40") MTS coordinates. Put more simply, it's how far apart the phone company thinks the two points are. Calls aren't usually local beyond 10 or 15 miles, but there's one case in Florida where you could make a 135-mile local call for $0.25 per call. The next number in square brackets is the distance between the wire centers for the caller and the POP. In some situations the wire center distance is used when pricing local calls. As you can see in the example above, the MTS coordinate distances are both 0.0, but the wire center distances are slightly different. Usually the numbers are pretty close, but because of the way some POPs are connected to the phone system, the wc numbers can be large (perhaps 20 miles). When tracking down problems, it's usually best to pay attention to the first number (the MTS coordinate) and ignore the wc coordinate. 
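The ordering rule stated earlier in this section (local before ExpLocal before toll, regardless of how the calls price out) can be sketched roughly as follows. This is only an illustration in Python: the `Pop` record, the `CLASS_RANK` table, and the use of cost as the tie-break inside a class are assumptions, and the real PhoneDB generator applies many more filters than this.

```python
from dataclasses import dataclass

# Rank of each call class; lower sorts first. PseudoLocal shares a rank
# with ExpLocal because the text calls the two functionally equivalent.
CLASS_RANK = {"LOCAL": 0, "ExpLocal": 1, "PseudoLocal": 1, "toll": 2}


@dataclass
class Pop:
    name: str        # e.g. "cnc/561-227-0012"
    call_class: str  # "LOCAL", "ExpLocal", "PseudoLocal", or "toll" (maybe with "*")
    cost: int        # estimated cost as printed by lookuppop


def order_pops(pops):
    """Sort by call class first, then by cost (the tie-break is an assumption)."""
    def key(pop):
        kind = pop.call_class.rstrip("*")  # the asterisk only flags the IAP type
        return (CLASS_RANK[kind], pop.cost)
    return sorted(pops, key=key)


pops = [
    Pop("psi/954-971-5720", "toll*", 2927),
    Pop("cnc/561-226-0010", "ExpLocal", 1840),
    Pop("cnc/561-227-0012", "LOCAL", 0),
    # Hypothetical entry: a toll call priced below the ExpLocal one.
    Pop("example/000-000-0000", "toll", 100),
]
ordered = [p.name for p in order_pops(pops)]
print(ordered)
```

Note that the hypothetical cheap toll entry still sorts after the ExpLocal POP, which is the point of ranking by class before cost: a suspiciously cheap toll call is more likely a tariff data error than a bargain.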
The final item on the line is the cost of a call made for a given duration at a specific time of day on a particular day of week with a certain calling plan. Sometimes we average rates from multiple carriers together, which complicates things. At any rate (no pun intended), it's the most important value we use when deciding the order in which to hand out POPs. The last line of the output shows the dialing patterns that we will try, in the order that we will try them. For the first entry we will try 7-digit dialing and then 11-digit dialing (it's a home/local call); for the last entry we just try 11-digit (it's foreign/toll). Occasionally you will see entries that look like this: For 205-526-0000 from LEESBURG, AL (base cost=241): tdsnet/205-927-6200 in or near "Centre, AL" (CENTRE, AL) PsuedoLocal 5.1mi [wc=5.1mi] cost=2040 [LCA not sup] --> 927-6200 then 1-205-927-6200 tdsnet/205-528-6200 in or near "Crossville, AL" (CROSSVILLE, AL) toll 14.5mi [wc=14.5mi] cost=3137 [LCA not sup] --> 1-205-528-6200 then 528-6200 The end of the second line in each set may have a special code in square brackets. The most popular ones are "unsupported local" and "LCA not sup". When you see "unsupported local", it means that we have the LCA (Local Calling Area) definition, but no rate information (this is the "partially supported" LCA data mentioned earlier). Chances are the LCA is not getting updated regularly, but since these LCAs are usually small rural areas, it probably doesn't *need* to get updated very often. When you see "LCA not sup" it means we have no information at all about the LCA for this area. We just plain can't tell what calls are local, and have to punt. Well, that's not *entirely* true. If the caller and POP are in the same exchange area, we go ahead and assume that it's a local call. We also have a feature where we declare that everything within a specific radius (currently 10 miles) of the caller in an "LCA not sup" area is local. 
Since we can't determine the cost, we define them to be ExpLocal. To make the distinction clear, we display ExpLocal calls in "LCA not sup" areas as "PseudoLocal". As mentioned above, PseudoLocal is functionally equivalent to ExpLocal; we just show it differently because the definition of "local" is based purely on MTS distance rather than telco tariffs, and therefore is more prone to problems. The motivation for doing PseudoLocal was that ExpLocal calls are always prioritized ahead of toll calls. Because of weirdnesses in the phone system, it may cost more to call yourself with AT&T than it would to call the other side of the country. Without PseudoLocal, people in some rural areas -- who most likely had local POPs nearby -- were being told to dial distant locations, because an AT&T call cost less, and the only rating information we had was for the IXCs. (You might be tempted to just do the POP assignments by distance rather than cost, but there are many areas where distance and cost don't correlate. Some 50-mile calls in Florida are more expensive than 300-mile calls into a different state.) There's a problem with doing this though. Suppose we're in an area where local calls that cross area code boundaries (FL) require 7-digit dialing. Suppose further that we're in an unsupported LCA. We're now in the uncomfortable position of telling the box to use 7-digit dialing across area codes, based solely on the fact that the POP is less than 10 miles from the caller. Fortunately it's easy to manually verify that we're not doing bad assignments; just dial the 7-digit POP number, using the *caller's* area code. If you get something other than a recording, we're in a lot of trouble. (Turning off UnsupLCADistOnlyRadius fixes it, but then we lose PseudoLocal, which will make us rather unpopular with some customers.) Ideally we would be able to add our own LCA definitions to the CCMI data, and avoid the problems entirely. 
Of 25,000 or so exchange areas, 5,000 are completely unsupported. Maintaining a complete set of data for areas with a tiny handful of people isn't cost-effective, for us or CCMI, but it would be nice if we could fix the areas where we do have some customers.

A more insidious problem has occurred in a few places, notably parts of Texas (Grand Prairie, anyone?). In these cases, CCMI had only one local calling plan in the database, and it was an extended-area "metro" plan that not all of our customers had signed up for. The data that we got out of CCMI showed certain POPs as being free local calls, and sure enough, they were for everybody who had signed up for the extended plan. The rest of the people were a trifle irked.

The PhoneDB generation process scans the entire set of local calling plans, and always uses the most restrictive definition. When a wide-area plan is the most restrictive definition of an LCA, we're in trouble.

This sort of problem is difficult to deal with, because in these situations the CCMI data *is* accurate. It just happens to be incomplete. In this particular case I asked them to add the standard calling plan, and they said they would look into it. This is another scenario where being able to tweak the local calling plan definitions would be useful. We can do a limited amount of fixing with the "ChangeCallCost" PhoneDB feature, but that's clumsy at best.

There are some other odd things you might see in POP-O-Rama output, like:

    For 604-523-0000 from NWESTMNSTR, BC (base cost=??):
      uunetdan/360-383-1000 in or near "Bellingham, WA" (FERNDALE, WA)
        toll?? 29.4mi [wc=0.0mi] cost=?? [origin not in DB]
        --> 1-360-383-1000

"Origin not in DB" happens because the point of origin is in Canada, and we don't currently have data from CCMI for calls made from Canada. Note that "base cost" is "??", which means that we weren't able to figure out what it would cost for someone in 604-523 to call themselves.
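If you ever need to pick these values apart mechanically, the second line of each POP-O-Rama set is regular enough to parse. The following Python sketch assumes the line always looks like the examples in this section (kind, optional "*" or "??" marker, two distances, a cost that may be "??", and an optional bracketed note); the function and field names are made up for illustration and are not part of the lookuppop tool.

```python
import re

# Pattern for the "kind / distance / cost" line, modeled only on the
# examples shown in this document.
STATUS_RE = re.compile(
    r"^\s*(?P<kind>[A-Za-z]+)(?P<flags>[*?]*)\s+"
    r"(?P<miles>[\d.]+)mi\s+"
    r"\[wc=(?P<wc>[\d.]+)mi\]\s+"
    r"cost=(?P<cost>\d+|\?\?)"
    r"(?:\s+\[(?P<note>[^\]]+)\])?\s*$"
)


def parse_status(line):
    """Split one status line into its fields; unknown costs become None."""
    match = STATUS_RE.match(line)
    if match is None:
        raise ValueError("unrecognized status line: %r" % line)
    cost = match.group("cost")
    return {
        "kind": match.group("kind"),
        "flags": match.group("flags"),        # "", "*", or "??"
        "miles": float(match.group("miles")),
        "wc_miles": float(match.group("wc")),
        "cost": None if cost == "??" else int(cost),
        "note": match.group("note"),          # e.g. "origin not in DB", or None
    }


canada = parse_status("toll?? 29.4mi [wc=0.0mi] cost=?? [origin not in DB]")
florida = parse_status("ExpLocal 23.7mi [wc=19.0mi] cost=1840")
print(canada)
print(florida)
```

Treating "??" as `None` rather than zero keeps an unknown cost from being mistaken for a free call.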
For 817-278-0000 from EULESS, TX (base cost=0): cnc/972-375-0501 in or near "Dallas, TX" (GRAND PRAR, TX) ExpLocal 8.9mi [wc=8.2mi] cost=242 [hacked!] --> 1-972-375-0501 then 972-375-0501 You will see "hacked!" when the kind of call and cost of the call have been explicitly changed by the person generating the PhoneDB. (There's probably a better word to use than "hacked".) All of our local cost calculations are actually based on business rate plans. There are residential rate plans available in the CCMI database, but very few of CCMI's customers actually use them, so they're not as carefully scrutinized. A comparison of residential vs. business rates done early in 1997 suggested that, while some areas were more accurately rated using the residential data, other areas seemed wildly inaccurate. The decision was made to avoid residential rate data for now. If you find yourself answering a phone call or an e-mail message from a customer who claims that a POP isn't local even though we think it is, don't jump to any conclusions without some corroborating evidence. I received a handful of bug reports saying that 510-742-xxxx (in Fremont) wasn't local from Palo Alto, even though the pages in the front of the Pacific Bell white pages showed that it was. People in areas with low population densities will often assume that exchanges they don't recognize aren't local. (This problem has returned, too: now people in 510 don't realize that they can dial into the northern part of San Jose. Sigh.) Of course, it would be a bad idea to dismiss such claims out of hand. The best evidence is a phone bill that shows the POP as being non-local. There have been several cases where the phone company mis-billed a call, either because of 11-digit dial patterns or errors on their part; with the bill in hand we can easily get either the telco or CCMI to straighten out their data. 
If they haven't yet received a bill, a call to the business office or even an operator at the telco that handles the call will resolve the matter, but there have been cases where conflicting answers have come from the same source on subsequent calls. Also, be sure that you're talking to the right LEC, because different carriers will have different calling plans. Local vs toll issues should be reported to the SOC. If you're the one investigating a complaint, and we don't have a phone bill to look at, you should talk to the operator about the calls in question and ask whether they are (1) local, (2) local but expensive (e.g. zone 3 calling), (3) local toll, or (4) long distance. Most operators will just say "local" for #1 and "toll" for #2, #3, and #4 to avoid confusing the customer, but the distinction is important for us. -= II.F =- POP, phone line, and network quality issues Not all POPs are created equal. WebTV requires that all POPs we use are capable of 28.8Kbps communication, and we take steps to ensure that there is adequate network capacity between our IAPs and us. Even so, there are cases where an individual POP or individual user will see substandard performance. This section provides a quick overview of symptoms and their causes. The most common problems are in the user's house or apartment. Line splitters, large numbers of phones on the same line, phone extenders that plug into an A/C power outlet (commonly used with DSS systems), and old wiring are common sources of problems. They can interfere with the phone line, resulting in slow connections. The initial connect rate shown on the tricks-info page and in the phone logs doesn't tell the whole story. One of the features of modern modems is that they will "negotiate down", or start talking more slowly, if a lot of errors are detected. This is done because the modems are less susceptible to disruption at lower speeds. If the line conditions improve, the modem will negotiate back up. 
Unfortunately, we have no way to monitor the current speed or know the lowest speed used, so it's difficult to identify problems just by looking at the initial connect rate. Even so, if you see connections being established at 21600bps or lower, there's a good chance that the user's phone connection is poor. If many users are reporting similar troubles with that POP, and you connect at a slow rate when calling the same POP from here (you can do this with Vend-A-Telly, described in a later section), there's a chance that the POP itself is poorly connected. Most phone companies won't guarantee connect rates of 28.8Kbps or higher. Pacific Bell only guarantees 4800bps, which is pretty pathetic. The box will refuse to connect at less than 14.4Kbps, but could conceivably negotiate lower. It may be possible to disable downward negotiation below 14.4, but it's not clear that this is always desirable. In the very early days, before the service went public, we displayed the connect rate right below the WebTV logo that you see before you get to the home page. The information was removed to avoid being swamped with calls from customers wondering why they weren't getting the full 33.6Kbps connections that they paid for. The reality is that not all IAPs have POPs that go above 28.8Kbps, and even then, most 28.8, 33.6, and 56K modem users don't get the speed they would hope for (26.4, 31.2, and 42K are much more common) because of noisy phone lines or other external factors. The reviewers of some 56K modems were unable to get actual data rates above 44K with even the best of modems. The worst couldn't break 30K. When LECs won't even guarantee 14.4Kbps, it's impossible for WebTV to guarantee anything higher. We should make every effort to determine the cause of poor performance, but some things are beyond our control. 
If the user has a PC with a modem that has no trouble connecting, try to get the WebTV box configured as close to what the PC does as possible, or ask the user to have the PC call the POP that the WebTV box is calling. They don't need to log in, just call the POP and watch the connect rate. There's more to POP quality than just modem connect speed. Everything that the box receives has to be sent from our servers, across either the Internet or a private network connection to the IAP, from the IAP to the terminal server at the POP, then out through the modem and down to the user's box. The modem speed is a good place to start, but it's also important to consider the network performance. It's difficult to get a simple performance number out of the network connections, because they may hit peaks where traffic grinds to a crawl for short periods, may exhibit spasmodic behavior with bursts of activity followed by long periods of silence, or may just move at a steady snail's pace. The easiest way to check the performance is to try to download a large image file (say a 150K GIF or JPEG) and see how long it takes to arrive. This feature is also provided by Vend-A-Telly. An issue related to POP performance is line drops. There are a number of reasons why the box might suddenly disconnect from the service, some of which are discussed in a later section on "idle timeouts". Disabling or reducing the sensitivity of call waiting in the Dialing Options screen resolves most problems with unexpected disconnects. The cause of some of our troubles with call waiting is that the box doesn't detect the call waiting "bong" accurately. Any substantial disruption, including somebody picking up an extension phone or a random burst of noise on the line, will be interpreted as an incoming call. 
Adjusting the sensitivity setting will reduce false-positives and missed calls, but for many customers the system is not 100% reliable, and never will be with the modems built into WebTV "Classic" boxes. It appears that "Plus" boxes will be similarly unreliable. Some line drops don't go away with the call waiting setting. There have been cases where the IAP's modems dropped the connection when a significant amount of line noise was detected, regardless of the setting on the WebTV box. This can usually be corrected by the IAP. More information on diagnosing and correcting the above should be available from Customer Care. This document is long enough without having a complete troubleshooting guide in it as well. | | -=*=- III. Service Mechanisms -=*=- | | -= III.A =- Dial overrides, Satan, and you Dial overrides are a quick and easy way to send somebody to a particular number with a specific dial pattern. Unfortunately they're a little too easy. They can solve a problem (or at least placate a customer) quickly, but they don't go away when the underlying problem gets solved. In general dial overrides are a Bad Thing, and alternate solutions should be used whenever possible. In the early days of the service, there was no such thing as a dial override. Because there was no quick solution, the problems were fixed in other ways, or were analyzed until it was determined that the problem was unrelated to the POP number being dialed. This was time-consuming but very effective at identifying the root cause of problems. The issue that drove the existence of dial overrides was that some customers bought special calling plans through their phone company that allowed them to call a specific region or number for a flat rate per month. If the PhoneDB got updated, and their primary number changed, they would no longer be dialing the preferred number. We needed a way to send people to a specific area. 
The initial solution wasn't pretty, but it was the best that could be done with the available facilities: the user's ANI of record was changed to an NPA/NXX that had the target POP as the primary. Since there were only two IAPs (cnc and uunet), and load balancing was a distant dream, this worked fairly well. Unless, of course, the box lost power, and the user said "yes, I've moved". Clearly we needed something else. The first version of dial overrides was added a few hours after a service release had frozen, because by consensus it had been placed on the C-grade "would be nice" list, and wasn't really supposed to be done at all. Consequently it was done in a big hurry. The database stored one override that had an ANI, a provider name, and the exact string of digits needed for dialing the POP. If the ANI matched, we sent a tellyscript for that POP and provider, complete with a warning dialog. This mechanism quickly became popular, and eventually support for it was added to the CMR tool. With a little experience it became clear that the mechanism was insufficient. You couldn't put in an override for a box behind a store's PBX, because the ANI value might be different each time the box logged in. You couldn't override to an 800 number because the warning dialog would show the 800 number (this is a bad thing, as explained in a later section). The override didn't go away if the POP went away. And you couldn't have the override go dormant if the user moved to an area with local coverage. The second generation of dial overrides provided for these, mostly. It was again done at the last minute and at a low priority. Nearly a year later the CMR tool still couldn't (and even now can't?) parse the new format, and some of the features -- like disabling the override when the POP goes away -- weren't implemented. The only way to do the new-style overrides is with "clientpopedit" (the first version of which, incidentally, was a truly frightening piece of work). 
There are things that can be done to make overrides less harmful. The trouble is that they will require CMR changes to make them accessible to Customer Care. High on the SOC's request list are "negative overrides", where you get to specify a number (or perhaps a complete exchange area) that the user says they don't want to be calling. You can remove the POPs that the user doesn't like, and leave all the rest in. Another desirable item is overrides with expiration dates, for cases where a POP is temporarily out of commission, and the user is screaming because they're too impatient to wait for it to give up and try the next number.

One interesting "feature" of overrides is that they are bound to a box, not to a subscriber. If a user swaps a box because of defects and has their account moved over, the dial override doesn't move with them. This isn't necessarily a bad thing, because the dial override might have been entered as part of diagnosing a problematic box. When the old box is "unregistered" prior to adding a new account, the dial override is purged automatically.

Whatever fancy features get added to overrides, the rule of thumb remains: don't use them unless you absolutely need to. And the only valid reason for needing to is for users with specific calling plans that we can't take into account otherwise.

Some common abuses of dial overrides are:

  - Dial pattern fixes. Use "dpedit" for this. Edit the patterns, then if they can't get in at all, tell them to unplug and say they've moved so they'll go back through scriptlessd.

  - Dead POP workarounds. Tell them to be patient, we're working on it. There is support in the service for temporarily removing a POP from everybody's tellyscripts, but it's too clumsy to use at present (the DisablePOP config option).

  - Slow POP workarounds. This is harder, because the POP is connecting but is performing poorly. A simple technique is to turn audible dialing on, then unplug the phone after the first dialing sequence completes. When it gives up it'll try the second number (unless they only have one local call, in which case it tries the first number twice). It's a pain, but it works. If they insist on getting a fix, give them the override but leave the trouble ticket open. Remove the override a few days later when things are better and close the ticket then.

  - PhoneDB local vs. toll problems. Using a dial override to fix these *temporarily* is okay, but the ticket should be left open as long as the override is in place. The problem is not solved until the PhoneDB is correct. When the PhoneDB is fixed, the override gets removed, and only then is the ticket closed.

As the saying goes, "if you don't have time to do it right, when will you have time to do it over?" Every dial override that gets added also has to get removed, because sooner or later that POP will go away or more local numbers will be added or whatever. If everybody gets overridden to POP #2 when POP #1 gets congested, the load balancing algorithms can't do their work, and pretty soon POP #2 is going to be congested and all those people are going to be calling you all over again.

Customers that can get a quick fix by calling Customer Care will do so every time their POP gets slow. Don't encourage people to call up every time they have the slightest problem. Avoid quick fixes that just postpone the inevitable.

-= III.B =- Introduction to tellyscripts

A tellyscript is a C-like program that is interpreted by the box. Its most important and most obvious function is to tell the box what numbers to dial, but it does a lot of other work besides.

Most communication software uses what are known as "send/expect" scripts. Send/expect scripts send a particular string, and then expect a certain response.
The MacPPP configuration is a simple example: generally you send a dial string, expect the word "Login:", send your user name, expect "Password:", and then send your password. The fancier versions will allow you to expect one of several different responses, and perform different actions based on what you get back. Andy Rubin thought this was a little simple-minded, so he combined the send/expect concept with a minimal C interpreter, and named the result after a product from his former company (General Magic). The result was a program that could do all the usual sending and expecting, but with the flexibility of C code. The current batch of tellyscripts will: - Initialize the modem. All of the phone settings in the user interface, including things like dial speed and call waiting sensitivity, are put into practice by the tellyscript. - Update the message on the progress bar in an appropriate language while the box is connecting. - Send the appropriate login and password to one or more of several different ISPs (including OpenISP ISPs). - Parse all modem result codes, and convert them into connect rate and protocol values for display by the box (like on the tricks-info page). - Do some really funky things involving NVRAM and phone settings. - Combine dial prefixes, including the special "only for long-distance calls" prefix on the Obscure Dialing Options page. - Work around bugs in certain versions of the modem firmware. - Deal with several different failure modes, and return appropriate error status codes. - Post "this may be a toll call" alert dialogs. - Dial POPs several times and in different orders, moving on to the next when one fails. - With POPtimization, use one of up to eight different *sets* of POPs based on day of week, time of day, and what month it is. - Set the primary and secondary name servers that the box uses when in proxy-less mode. - Send and expect. Each tellyscript is divided into four sections. 
The pieces are combined on the service, and the full script is then tokenized and compressed before being sent to the client. On disk, the files have a ".tsf" extension, which stands for TellyScript Fragment. The four sections are:

    base.tsf    - common functions.
    locale.tsf  - country-specific features (e.g. Japanese connect messages).
    .tsf        - one or more tellyscript fragments, one per IAP. These are named after the IAP, so CNC's .tsf file would be called cnc.tsf. These are very short; usually they just have the IAP's Radius login info.
                - tellyscript code generated on the fly. This is where the actual phone numbers and "this may be a toll call" warnings go.

The combined size of the four sections is about 40K when in C code form. This boils down to about 12K when tokenized, and 5K when compressed.

When the service sends a script down, it saves a blob of information in the service that looks like this (line broken in half for readability):

    0x34567117-0x4abf9aa7-base:36:-|locale:2:-|__wpb:1:3261095|__cnc:2:6870610|
    __wpb:1:3261095|__cnc:2:16506870610|__artemis:1:18006108918

Translated into human-readable form, it looks like this:

    Hash 0x4abf9aa7, sent Tue Oct 28 15:02:17 1997
      v36 base/-
      v2 locale/-
      v1 wpb/3261095
      v2 cnc/6870610
      v1 wpb/3261095
      v2 cnc/16506870610
      v1 artemis/18006108918

The "vN" part tells you what version of the script was sent down. We gave the user version 36 of base.tsf, version 2 of locale.tsf and cnc.tsf, and version 1 of wpb.tsf and artemis.tsf. The "sent Tue Oct ..." part tells you when the script was sent down, and the numbers after the providers' names show you the exact string of digits that the box is going to dial.

(In the example, the user has the wpb/650-326-1095 and cnc/650-687-0610 POPs. He will use 7-digit dialing on both wpb attempts, but will try 7-digit dialing on the first cnc attempt and 11 on the second. This user has apparently established 7-digit dialing as the pattern for the wpb POP, but hasn't yet determined the pattern to use for the cnc POP.)
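For the curious, the blob is simple enough to take apart by hand. The Python sketch below is a guess at the layout based solely on the one example above: two hex fields separated by dashes, then "|"-separated entries of the form name:version:digits, with provider names carrying a `_X_` prefix whose middle character is either empty or a flag (such as the 'W' toll-warning marker described in the next paragraph). Reading the first hex field as the send time is an inference from the "sent" line, not something the service source confirms here, and all function and field names are invented for illustration.

```python
import re


def parse_blob(blob):
    """Split a tellyscript blob into its hash, raw first field, and entries."""
    blob = "".join(blob.split())          # undo the line break added for readability
    sent_hex, hash_hex, body = blob.split("-", 2)
    entries = []
    for item in body.split("|"):
        name, version, digits = item.split(":")
        # Provider entries look like "__wpb" or "_W_wpb"; base/locale have no prefix.
        match = re.match(r"^_([A-Z]?)_(.+)$", name)
        flag, name = (match.group(1), match.group(2)) if match else ("", name)
        entries.append({
            "name": name,
            "version": int(version),
            "digits": None if digits == "-" else digits,
            "toll_warning": flag == "W",
        })
    return {"sent": int(sent_hex, 16), "hash": hash_hex, "entries": entries}


EXAMPLE = """0x34567117-0x4abf9aa7-base:36:-|locale:2:-|__wpb:1:3261095|__cnc:2:6870610|
__wpb:1:3261095|__cnc:2:16506870610|__artemis:1:18006108918"""

parsed = parse_blob(EXAMPLE)
for entry in parsed["entries"]:
    print("v%d %s/%s" % (entry["version"], entry["name"], entry["digits"] or "-"))
```

The loop at the end prints one "vN name/digits" line per entry, in the same shape as the human-readable listing above.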
If the user were given a toll warning message, the first line for the provider would look something like this: v2 wpb/3261095 {toll warning sent} and "__wpb:1:3261095" would be "_W_wpb:1:3261095" (with a 'W' up front). The "Hash 0x4abf9aa7" part is the key to getting tellyscripts updated. This number is a (hopefully unique) representation of the big blob. It's sent down to the box with the tellyscript and handed back up on every connection. When the box reaches the headwaiter, we recompute the tellyscript that they should have, and compare the new hash value with the box's hash value. If any part of the blob changes, the new "hash" value will be different, and we know that they need a new script. This means that if a provider or dial pattern changes, a tellyscript fragment gets updated, or a toll warning dialog is added or removed, the service will automatically send the box a new tellyscript. Since the box tells the service what it has, there's no risk of the service thinking that the box has a different tellyscript than it actually has. (Which, incidentally, is a real problem, because the box doesn't save the tellyscript into NVRAM until the box is powered off with the remote control or keyboard. If the box crashes or loses A/C power before the tellyscript is written, or the user hits the reset button on a "Classic" box, the previous tellyscript will be used on the next connect. For this reason, the service tracks the *two* most recently sent tellyscripts.) Most people don't need to understand the above in detail. Either trust that the system works, or read the above until you're convinced (one way or the other). -= III.C =- Call ordering and the fallback number Once we've chosen the POPs and checked the available dial patterns, we have to dial the phone. We know which POP to try first, but should we do the first POP twice in a row and then do the second, or alternate between the first and second? What if we have one POP or three POPs? 
The call ordering depends on how many POPs the user has and what kind of call each is. In every script, we bail out when we connect successfully or if we are unable to detect dialtone before dialing. "Black holes", where we connect successfully but then are unable to talk to the WebTV service, are handled specially (explained later).

If we only have one POP:

    1. try number
    2. IF we have a secondary dial pattern, try it; otherwise skip this step
    3. retry number
    4. call 800 fallback

If both POPs have the same cost (i.e. both are LOCAL, or both are ExpLocal or toll but have the same estimated cost):

    1. try pop#1
    2. try pop#2
    3. retry pop#1, using secondary dial pattern if it exists
    4. retry pop#2, using secondary dial pattern if it exists
    5. call 800 fallback

If one POP is more expensive than the other (perhaps one local and one toll):

    1. try pop#1
    2. retry pop#1, using secondary dial pattern if it exists
    3. try pop#2
    4. IF we have a secondary dial pattern for pop#2, try it; otherwise skip
    5. call 800 fallback

In no case do we try more than 5 numbers, and we don't try a more expensive number more than once unless we're trying to figure out what the correct dialing pattern is.

The service doesn't yet support three POPs, so the call ordering for that situation isn't shown here.

We show the toll warning dialog before the first time we call an ExpLocal or toll POP. The warning contains the number to be dialed and the city name where the POP lives, using the "nice" form of the city name.

The toll-free fallback number, sometimes called "fallover" or "failover", has been around since the early days of dialing. The idea was to prevent certain kinds of failures, such as POP outages or number assignment glitches, from giving the service a bad name. It is important to remember that the Terms of Service nowhere guarantee connectivity, and we have never promised customers that they would have unlimited toll-free access at our expense.
The fallback number is supported as a courtesy, and may go away or have its use restricted at any time and without notice. The 800 fallback number will be omitted in certain circumstances. The most significant one is called the "AllTollNoRoll" feature. It was added because some users without local POPs had, strangely, neglected to order long distance service on their WebTV line. Every POP number would fail, until the box called the fallback number. The easiest way to avoid this situation was to leave the fallback number out of tellyscripts for users with nothing but toll calls. A similar situation existed for a customer with phone service that only allowed calls to 800 numbers and 911 (Universal Lifeline Service?). In this case, not even local calls could be made, so despite having two local POPs the user ended up on the fallback number every time. The cure for such users (besides asking them to get a real phone line) is the "disable fallback" flag in the customer's account. It should be possible to set this from the CMR tool. Of course, it's always possible for users to disrupt the dialing sequence several times until the box dials the 800 number. For most people this is unnecessary and inconvenient: if they didn't have (in CCMI's and our opinion) a local call, they wouldn't have the fallback number in their script, so either they'll never get to the fallback number or they're trying really hard to avoid making local calls. We can identify such users through usage reports, and deal with them on an individual basis as necessary. A recent development in the service is the 800 fallback usage cap. This is explained later. Allowing calls on the fallback number to be billed at an hourly rate for customers without local POPs has been suggested. It may be implemented in a future release of the service. "Black hole" is the WebTV term for a POP that accepts modem connections but is unable to carry network traffic between the box and service. 
The tellyscript believes it has made a successful connection, but the box is unable to do anything after getting connected. Early boxes (pre-client 1.1) would connect to black hole POPs and stay there until disconnected by a timeout or an impatient user. As of client 1.1, the box will try to connect to the service for a minute and a half. If it is unable to get a response from the headwaiter in that time, it will disconnect, then restart the tellyscript at the point where it left off. (There was a fun bug related to black holes, where a box would get connected successfully but not realize it. This usually happened during registration. After being connected for about a minute and a half, the box would spontaneously disconnect and redial the service.) -= III.D =- The "clientinfo" command The "clientinfo" tool is a UNIX shell command. It got its name because the database DEVICE table entries are referred to as "Client" structures in the service. The tool was written to dump certain fields from the Client structure, but it has grown beyond that. (For those of you not up on your database lingo, the "device" entry is linked to a physical box, and has a "subscriber" associated with it. When you move a user's account from one box to another, you are changing the link to make the subscriber associated with a different device. The device entry is usually created as part of the manufacturing process so that we can get the back-of-unit serial numbers into the database, but if it doesn't exist it will be created by scriptlessd when the box first connects. The subscriber is always created by registerd when registration is complete.) There are several sections in the clientinfo output. 
The first is the PhoneDB version info:

-----
Using PhoneDB v25 USA (built Mon Sep 8 23:11:48 1997 by uid=1057)
Features: [CCMI] [com] [ld-avg] [wlca] [zd] [lec]
This PhoneDB is for personal services ONLY
PhoneDB_Map() -/- PhoneDB.c:146 (unknown)[15617] 10/28 16:47:24
-----

This tells you what version the PhoneDB is, whether it was built for the US or for a foreign country like Japan, when it was built, who built it, and what features were enabled. You don't usually need to worry about this, but keep an eye out for bad dates.

Next comes the options header:

-----
--- Client info for serial '01100f7401000004' ---
ANI ....................... 99 650-614-5539 (PALO ALTO, CA)
Shared secret ............. 'PVKwgp8nv44='
Script locked? ............ no
Fallback disallowed? ...... no
Revisit scriptlessd? ...... no
Call Waiting Threshold .... 0 (not set)
PSI account ............... 0
AppROM/bootROM versions ... v3049/v2046
Last successful connect ... 324-0657
Category .................. normal
-----

Some of the entries are self-explanatory. For the others:

"Script locked" indicates that scriptlessd handled the box specially for some reason, and doesn't want the headwaiter to override the tellyscript.

"Fallback disallowed" blocks access to the fallback number.

If "Revisit scriptlessd" is set, the box will reboot and go back through scriptlessd the next time it connects to the headwaiter.

"PSI account" tells you if an account has been created with PSI for this user, which isn't something you really need to worry about.

The "Call waiting threshold" field is currently unsupported.

The "Category" field is a little funny. It was added so that we could put certain users into a specific category before they had registered.

After that we see the tellyscript description.
We saw one of these earlier:

-----
Most recent script sent to client:
  Hash 0x7717cd29, sent Tue Oct 28 13:58:09 1997
    v36 base/-
    v2 locale/-
    v1 wpb/3261095
    v2 cnc/6870610
    v1 wpb/16503261095
    v2 cnc/16506870610
    v1 artemis/18006108918
Previous script sent to client:
  Hash 0x6da221d6, sent Tue Oct 28 13:45:04 1997
    v36 base/-
    v2 locale/-
    v4 psi/14062473000 {toll warning sent}
    v4 psi/2473000
    v3 uunet/18013991119 {toll warning sent}
-----

After that we have the set of known dial patterns:

-----
Established dialing patterns:
  ANI 650-463-1671 + POP 650-326-1095 --> mode=7-digit
-----

Naturally we don't have a dial override, but if we did, it would look like this:

-----
Dialing overrides:
  ANI          POP          Cst? Dlg? Lnk? ONL? Provider Digits
  650-463-1671 650-326-1095 N    N    Y    N    wpb      '3261095'
-----

"ANI" should be obvious. "POP" is the full 10-digit POP number. "Cst?" is set if it's a pick-yer-POP override (explained later); "Dlg?" is set if a warning dialog should be shown; "Lnk?" means the override should be linked to the POP, and should go away if the POP goes away [currently unsupported]; and "ONL?" is set if the override should be used Only when the user has No Local POPs [currently unsupported]. The "Provider" field says who owns the POP, and "Digits" is the actual string of digits to use. The output is similar to what "clientpopedit" shows.
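If you end up scraping clientinfo output in a script (say, to check a batch of boxes for leftover overrides), the two tables above are easy to pick apart. The following is a rough Python sketch, not an existing tool. The function names are invented, and the line formats are assumed to match the samples shown above exactly, so real output may need looser matching.

```python
import re

# Matches lines like:
#   ANI 650-463-1671 + POP 650-326-1095 --> mode=7-digit
PATTERN_RE = re.compile(
    r"ANI (\d{3}-\d{3}-\d{4}) \+ POP (\d{3}-\d{3}-\d{4}) --> mode=(\S+)")

OVERRIDE_FLAGS = ("Cst", "Dlg", "Lnk", "ONL")


def parse_dial_patterns(text):
    """Return {(ani, pop): mode} for every established-pattern line."""
    return {(ani, pop): mode for ani, pop, mode in PATTERN_RE.findall(text)}


def parse_override_row(row):
    """Parse one data row of the 'Dialing overrides' table into a dict.

    Assumed columns: ANI, POP, four Y/N flags, provider, quoted digits.
    The digits field is taken as the remainder of the line, so a dial
    string with embedded spaces would survive.
    """
    fields = row.split(None, 7)
    if len(fields) != 8:
        raise ValueError("unexpected override row: %r" % row)
    ani, pop = fields[0], fields[1]
    flags = {name: fields[2 + i] == "Y" for i, name in enumerate(OVERRIDE_FLAGS)}
    return {
        "ani": ani,
        "pop": pop,
        "flags": flags,
        "provider": fields[6],
        "digits": fields[7].strip().strip("'"),
    }


if __name__ == "__main__":
    sample = "ANI 650-463-1671 + POP 650-326-1095 --> mode=7-digit"
    print(parse_dial_patterns(sample))
    print(parse_override_row(
        "650-463-1671 650-326-1095 N N Y N wpb '3261095'"))
```

Remember that the "Lnk?" and "ONL?" flags are currently unsupported by the service, so a script should report them but not act on them.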
After this comes the load-balanced POP assignments for this user (you can see the non-load-balanced version on the POP-O-Rama page):

-----
POPs we would assign to this user (with load-balancing):
  psi/650-390-0900 (MOUNTAINVW, CA) 5.7mi cost=240 (wc=3.5mi)
    (tries 390-0900 then 1-650-390-0900) LOCAL*
  wpb/650-326-1095 (PALO ALTO, CA) 0.0mi cost=240 (wc=2.2mi)
    (tries 326-1095 then 1-650-326-1095) LOCAL
  cnc/650-687-0610 (PALO ALTO, CA) 0.0mi cost=240 (wc=29.7mi)
    (tries 687-0610 then 1-650-687-0610) LOCAL
  ziplink/650-687-2255 (PALO ALTO, CA) 0.0mi cost=240 (wc=29.7mi)
    (tries 687-2255 then 1-650-687-2255) LOCAL
  compuworld/415-423-0070 (REDWOOD CY, CA) 4.7mi cost=240 (wc=7.0mi)
    (tries 1-415-423-0070) LOCAL
  uunet/650-687-0796 (PALO ALTO, CA) 0.0mi cost=240 (wc=29.7mi)
    (tries 687-0796 then 1-650-687-0796) LOCAL
  ziplink/650-429-2255 (MOUNTAINVW, CA) 5.7mi cost=240 (wc=29.7mi)
    (tries 429-2255 then 1-650-429-2255) LOCAL
  ziplink/650-226-2255 (SANCLSBLMT, CA) 7.0mi cost=240 (wc=9.2mi)
    (tries 226-2255 then 1-650-226-2255) LOCAL
-----

We currently compute eight entries for every NPA/NXX. The load balancing is explained later. Each pair of lines has most of the information included in the POP-O-Rama output, but in a slightly different format. See the earlier section on local calling for the POP-O-Rama explanation.

After this is the POPtimization data:

-----
POPtimized assignments:
  MONTH Oct 1997 DAYS SMTWRFS TIMES 00:00 - 00:00
    POP 1 0:650-326-1095 conn=F
    POP 2 1:650-687-0610 conn=P
    POP 3 2:650-687-2255 conn=H
  MONTH Nov 1997 DAYS SMTWRFS TIMES 00:00 - 00:00
    POP 1 0:650-326-1095 conn=F
    POP 2 1:650-687-0610 conn=P
    POP 3 2:650-687-2255 conn=H
-----

This is also explained later.

The nice thing about clientinfo is that it tells you what they *are* dialing, what they *were* dialing, and what they *would be* dialing, all in one place.
POP-O-Rama can show you the set of POPs that the service has to choose from for a particular area, but can't tell you which ones will be given to a specific user, because the actual assignment depends on the box serial number. A potentially useful option for SOC folks is the "-t" flag, which causes clientinfo to write the tellyscript to stdout. If you want to see what tellyscript the user would get if they showed up right now, run "clientinfo -t > script.out". The output is tokenized but not compressed, so it's hard to read but you should still be able to find the phone numbers. "strings -a script.out" may be helpful. Note that there are always two copies of the phone number, a 10-digit version with dashes (e.g. 650-326-1095) and the actual number dialed with no dashes (e.g. 3261095). If you're trying to see if a dial pattern has taken hold, be sure you're looking at the right set of numbers. -= III.E =- Vend-A-Telly Vend-A-Telly is a web page attached to the "WebTV Tricks" page in the service. From there you can tell your box to dial any POP from any provider. You can even include modem AT commands as part of the dial string; these will override some of the features that are usually set by the box, so use only with caution. The page should be used whenever a POP is suspected of being flaky or slow. You can enter the POP number, dial in, check the connect rate, and download a large test image to see if the network is slow. If the POP is dead or deathly slow, DO NOT give the user a dial override unless you leave an open trouble ticket in Remedy that will allow somebody to remove the override when the POP gets better. Only when the override is removed should the matter be marked as "resolved". Network congestion is a fact of life; moving users between POPs will most likely just make the problem move with the users. Troublesome POPs should be reported to the SOC. 
-= III.F =- Visible Dialing

The current generation of WebTV boxes will display the phone number being dialed as part of the connection progress messages. In the early days, because of some weird sense of paranoia, the box didn't tell you what it was dialing. (This same paranoia accounts for the XXXXs over the last four digits of the phone numbers in the WebTV Phone Book on our web site.)

Version 1.1 and later clients support "visible dialing", where we show the phone number to the user as we dial it. It got its name because there was concern that showing phone numbers was a user interface aberration, and people would become greatly disturbed if the deep inner workings of the box were revealed. For this reason we only displayed the phone number when "Audible Dialing" was turned on; hence the nickname "visible dialing".

As it happens, people really like knowing what the box is doing with their phone line, and are better able to identify local/toll problems before they get a huge phone bill.

In some cases, though, we want to mask the phone number, such as when calling a toll-free number. Here are some examples:

    visible dialing off (also v1.0 clients and "Classic" boxes doing upgrades):
        "Dialing WebTV"
    normal case:
        "Dialing 14156145539"
    normal case, with a prefix of "9":
        "Dialing 9,14156145539"
    dialing a toll-free POP (e.g. the fallback number):
        "Dialing WebTV..."
    access number "324-0657" used:
        "Dialing A/N 324-0657"

Toll-free POPs have numbers starting with "1800" or "1888". (Yes, it's checked before the "remove leading 1" function is handled.) If someone puts in an override with clientpopedit that starts with "1-800" instead of "1800", the user is going to be able to see the number. Appropriately nasty warnings have been added to clientpopedit.

The call waiting disable prefix will also be shown. If you have too many numbers to display in the field, the end will be cut off, and "..." will be displayed.

| | -=*=- IV.
Dialing Details -=*=- | |

-= IV.A =- PhoneDB details

You may have noticed when looking at POP-O-Rama that the POPs aren't always sorted in the order you'd expect. In a boring world we could sort by cost and distance and walk away, but in the exciting world of WebTV we don't have that luxury.

The first complicating factor is the amount that the provider costs us to use. Some providers are less expensive than others, or simply have more capacity, and as a result are given a higher priority during PhoneDB generation. Some POPs from the same provider may be more expensive than others. This cost is sometimes referred to as a "static priority". If two calls have the same cost and MTS distance, we sort based on the provider cost.

A second factor is failure containment. If one of our major providers had a serious network outage affecting half the country, it wouldn't be very useful for a user to have a tellyscript with several POPs from the same provider. If a backbone gets backhoed, all the POPs are going to be useless. For this reason we try to hand out POPs from multiple providers whenever possible. Priority is given to leaving the primary provider in place, but the later POPs are shuffled around freely so long as they are listed as LOCAL calls and the provider costs us the same amount. We try to get a mix of different providers in the first few POPs, so that users will have numbers from more than one IAP whenever possible. This is known as "provider interleaving". Toll and ExpLocal calls aren't subject to provider interleaving.

One of the more troublesome aspects of all this POP shuffling is dealing with providers who charge us a flat rate per user. Every month, certain IAPs charge us a fixed amount for each user who touches their system, even if the user only logged in once.
If we gave a flat-rate IAP as a secondary POP to a customer with a very good primary, and the primary failed once at any point during a month, we would have to pay the full charge for that user for that one call. Clearly, we only want to give flat-rate IAPs out as primaries. This is where things start to get messy (it gets worse in the next section). Ensuring that the second POP isn't a flat-rate IAP can require making some tough choices. For example, suppose that the first three POPs listed for an NPA/NXX are an hourly-rate LOCAL, a flat-rate ExpLocal, and an hourly-rate toll. The initial POP layout looks like this: 1. hourly-rate LOCAL 2. flat-rate ExpLocal 3. hourly-rate toll However, we can't leave the flat-rate in the second position. We can't put it in the primary position, because that would take the local call away, and if we swap it with the toll call we replace a relatively inexpensive secondary with a nasty toll one. In cases like these, we do the latter. Because they can cause expensive calls to move ahead of less expensive ones, flat-rate IAPs are marked with an asterisk in POP-O-Rama output (i.e. "LOCAL*", "ExpLocal*", or "toll*"). Flat-rate IAP assignments have an unfortunate tendency to undo POP cost ordering, provider interleaving, and some of the load-balancing measures described in the next section. The problem is alleviated by "hybrid" IAPs, which can be used as either flat-rate or hourly-rate. For hybrid-billed IAPs, we treat the call as flat rate if it's the primary POP, and hourly rate if it's not. This gives us the price savings of a flat-rate IAP with the flexibility of an hourly-rate IAP. -= IV.B =- Intro to POP load balancing and provider rotation The previous section talked about how POPs may be shuffled while the PhoneDB is being created. There are some further things that the service does before sending the POPs down to the client. 
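As a recap of the flat-rate handling from the PhoneDB section, here is a small Python sketch of the "keep flat-rate IAPs out of the secondary slot" rule, including the hybrid-billing twist. This is an illustration only and is not taken from the PhoneDB generator. The Pop class, the billing labels, and the function names are all hypothetical, and the real code also has to juggle cost ordering and provider interleaving, which are ignored here.

```python
from dataclasses import dataclass


@dataclass
class Pop:
    provider: str
    call_type: str  # "LOCAL", "ExpLocal", or "toll"
    billing: str    # "hourly", "flat", or "hybrid" (labels are assumptions)


def effective_billing(pop, slot):
    """Hybrid IAPs count as flat-rate when primary (slot 0), hourly otherwise."""
    if pop.billing == "hybrid":
        return "flat" if slot == 0 else "hourly"
    return pop.billing


def keep_flat_rate_out_of_secondary(pops):
    """Return a copy of the POP list with no flat-rate IAP in slot 1.

    The primary is left alone. If the secondary is flat-rate, it is swapped
    with the next POP that would not be billed flat-rate in that slot, even
    if that POP is a more expensive call. If no such POP exists, the list
    is returned unchanged.
    """
    pops = list(pops)
    if len(pops) > 1 and effective_billing(pops[1], 1) == "flat":
        for j in range(2, len(pops)):
            if effective_billing(pops[j], 1) != "flat":
                pops[1], pops[j] = pops[j], pops[1]
                break
    return pops


if __name__ == "__main__":
    layout = [
        Pop("a", "LOCAL", "hourly"),
        Pop("b", "ExpLocal", "flat"),
        Pop("c", "toll", "hourly"),
    ]
    for pop in keep_flat_rate_out_of_secondary(layout):
        print(pop.call_type, pop.billing)
```

Run on the three-POP example from the previous section, the sketch moves the toll call into the secondary slot, which is the (unpleasant) choice the text describes. A hybrid IAP in the secondary slot is left where it is, because it bills hourly there.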
In some situations our choice of POPs is limited, and we have no choice but to give out two local POPs from a particular provider. If we just used the assignments straight out of POP-O-Rama, we would end up sending everybody to the first POP, and nobody to the second (provider interleaving will in most cases put a POP from a different IAP in the second slot, rather than the second POP from the same IAP). To avoid this situation we use "provider rotation". Provider rotation is a simple form of load balancing. If there are two local POPs from the same provider, it ensures that each will get no more than 50% of the traffic. If there are three, each gets 33%, and so on. This is done by using the last byte of the silicon serial number to choose between the available options. The rotation code swaps the primary POP with one of the others. Nothing else is changed. The POPs must be from the same provider, have the same cost for us, and must be LOCAL. One of the limitations of the data in the PhoneDB is that it operates on entire exchange areas. If the PhoneDB assigns wpb/650-326-1095 as the primary POP for Palo Alto, everybody in Palo Alto will hit that POP. In an attempt to avoid swamping some POPs with users while ignoring others, a simple load balancing system was implemented. As usual, it was done at the last minute and in a big hurry. The basic idea is that we carve up the POP assignment pie into pieces. Some of the providers get a piece, some don't. Each piece can be a different size. The last digit of your silicon serial number (which happens to be a checksum with a very nice distribution over the set of our users) determines which piece you're a part of. If the tellyscript generator can find a LOCAL POP from that provider, it makes that your primary POP; if not, nothing changes. The initial implementation had one definition of the pieces for the entire country. 
Several months later, the system was enhanced to allow the pieces to be defined in individual exchange areas, which came in handy when trying to put Bay Area people on the "wpb" (WebTV PacBell) POP.

As you may have noticed, the system is less than perfect. For example, if the load balancing parameters say "50% cnc, 50% uunet", and the users in a particular area have nothing but psi and ziplink, they won't be affected at all. Chances are they'll all be piled on top of the same primary POP, and the next local POP will always be listed as the secondary. (Yes, they'll end up spilling over onto the secondary when the primary fills, but it's so much nicer to not have to wait for the "all circuits are busy" timeout.) This scheme is expected to be replaced by the POPtimization system, described later.

As mentioned earlier, flat-rate IAPs will cause problems for us. For example, suppose we had three local POPs:

    PSI (flat-rate local)
    ZipLink (hourly-rate local, sort of)
    UUNET (hourly-rate local)

Suppose the load balancing algorithm says we should use UUNET as our primary. The POPs above would get rearranged to be UUNET, then ZipLink. PSI wouldn't be used, because it's in the 3rd position, and we're currently only using two POPs per tellyscript.

Suppose, on the other hand, that the initial arrangement was:

    PSI (flat-rate local)
    UUNET (hourly-rate local)
    some hourly-rate toll number

This is difficult to rearrange, because we can't make PSI the secondary, and we don't want to give them a toll number when they have two local ones.

Refusing to rearrange POPs like the above could lead to situations where a flat-rate provider receives a much heavier load in a certain area than we'd like. To deal with this, the configuration file allows a "tenacity" setting to be adjusted. The primary can be left alone, moved into the secondary slot, or swapped with a more expensive toll call. This decision applies globally.
The default is to leave it alone; in the above case, PSI would still be the primary and UUNET the secondary.

The setting also affects what happens when *all* of a user's local POPs are flat rate. The default behavior is to go ahead and give them the local POPs anyway.

Here's a real-life example from an old PhoneDB:

    For 510-799-0000 from HERCULSROD, CA (base cost=240):
      psi/510-848-1398 in or near "Berkeley, CA" (OAKLAND, CA)
        LOCAL* 10.1mi [wc=10.1mi] cost=240 --> 848-1398 then 1-510-848-1398
      uunet/510-982-1757 in or near "Berkeley, CA" (OAKLAND, CA)
        LOCAL 10.1mi [wc=17.1mi] cost=240 --> 982-1757 then 1-510-982-1757
      psi/510-254-7549 in or near "Orinda, CA" (ORINDA, CA)
        LOCAL* 10.8mi [wc=10.1mi] cost=240 --> 254-7549 then 1-510-254-7549
      psi/510-688-2363 in or near "Concord, CA" (CONCORD, CA)
        ExpLocal* 13.3mi [wc=13.3mi] cost=420 --> 688-2363 then 1-510-688-2363

Four POPs were found. The 1st, 2nd, and 3rd say "LOCAL", which means that they can be swapped in with the primary. The 1st, 3rd, and 4th have an asterisk after the call type, meaning that they're flat-rate and therefore can't be put into the secondary position. (Actually, the asterisk means they can't be moved, and therefore they're flat-rate, but that's a detail worth forgetting.)

This satisfies the POP interleave rules (1st and 2nd POPs are from different providers), and the flat-rate rule (2nd provider isn't flat-rate).

If the load balancing algorithm wanted to use CNC or UUNET as the primary, it would fail, because there's no CNC POP and there's no POP eligible for use as a secondary if UUNET were moved into the first position. There is nothing the POP load balancing routines can do here.

Things are looking better for provider rotation though. If the last byte of the silicon serial for a user at that location was odd, the script handed out would have the first two POPs shown above. If the byte were even, the tellyscript generator would use the 3rd POP as primary instead.
The fourth POP is ExpLocal, and therefore isn't eligible for rotation.

-= IV.C =- Tellyscript return codes

After a failure that occurs while the box is connecting to the service, the box will display a dialog with an error message. If you hit the "Options" key on the keyboard or remote, it will display an "M" code and an "S" code, e.g. "M-26/S10". The "M" code is the box's message code, and the "S" code is the return value from the tellyscript.

The current set of tellyscript return values ("S" codes) is:

     0 ParseError - tellyscript was bad.
     1 Connecting - (not really an error)
     2 Success - tellyscript finished successfully
     3 ConfigurationError - modem and box not on speaking terms.
     4 DialingError - modem not saying what we wanted it to.
     5 NoDialtone - didn't hear a dial tone on the phone line.
     6 NoAnswer - POP number just kept ringing.
     7 Busy - POP number was busy.
     8 HandshakeFailure - modem handshake failure; this is rare.
     9 UnknownError - got an unknown result code back from the modem.
    10 BadPassword - authentication failure.
    11 PPPHandshakeFailure - couldn't negotiate PPP successfully.
    12 NoCarrier - something answered, but it wasn't a modem.
    13 BlackHole - rare; last POP was a black hole, and we ran out of POPs.
    14 VerySlowConnect - modems connected at less than 14.4Kbps.
    15 BadPasswordNR - same as #10, but we don't reboot the box.
    16 UnhappyScript - the tellyscript generator blew it. This is bad.

When dealing with customers who are having trouble calling in, it is important to get both the "M" codes and the "S" codes. The "M" codes are described elsewhere.
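If you find yourself decoding these by hand a lot, a lookup like the following Python sketch may save some squinting. It is not an existing tool. The function name is made up, the code string format is assumed to be exactly the "M-26/S10" form quoted above, and the table covers only the service-side "S" meanings listed in this section ("M" codes are passed through as bare numbers).

```python
import re

# Service-side "S" code names, copied from the list in this section.
S_CODES = {
    0: "ParseError",
    1: "Connecting",
    2: "Success",
    3: "ConfigurationError",
    4: "DialingError",
    5: "NoDialtone",
    6: "NoAnswer",
    7: "Busy",
    8: "HandshakeFailure",
    9: "UnknownError",
    10: "BadPassword",
    11: "PPPHandshakeFailure",
    12: "NoCarrier",
    13: "BlackHole",
    14: "VerySlowConnect",
    15: "BadPasswordNR",
    16: "UnhappyScript",
}

# Assumed format: "M", a possibly negative number, "/S", a number.
CODE_RE = re.compile(r"M(-?\d+)/S(\d+)")


def decode_error_code(text):
    """Split an 'M-26/S10' style string into (m_code, s_code, s_name)."""
    match = CODE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError("not an M/S code: %r" % text)
    m_code = int(match.group(1))
    s_code = int(match.group(2))
    return m_code, s_code, S_CODES.get(s_code, "unrecognized S code")


if __name__ == "__main__":
    print(decode_error_code("M-26/S10"))
```

For the example in the text, "M-26/S10" decodes to message code -26 and tellyscript return value 10, which is BadPassword.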
Incidentally, the codes defined in the current (client 2.2) box sources look like this:

     0 kTellyParseError
     1 kTellyConnecting
     2 kTellyLinkConnected
     3 kTellyConfigurationError
     4 kTellyDialingError
     5 kTellyNoDialtone
     6 kTellyNoAnswer
     7 kTellyBusy
     8 kTellyHandshakeFailure
     9 kTellyUnknownError
    10 kTellyBadPassword
    11 kTellyPPPFailed
    12 kTellyNoCarrier
    13 kTellyBlackHole
    14 kTellyDownloadOK
    15 kTellyNoLoader
    16 kTellyNoFirmware
    17 kTellyLoaderFailed
    18 kTellyNoResponseFromLoader
    19 kTellyFirmwareFailed
    20 kTellyNoResponseFromFirmware
    21 kTellyScriptExpired

The meanings of 14, 15, and 16 don't agree, which is unfortunate but not fatal. Because the box codes have to do with modem firmware initialization and not dialing, it's possible to tell which is which from their context.

-= IV.D =- Dial patterns revisited

An earlier section explained that the service remembers successful dial patterns, and uses them when generating tellyscripts. This section explains the mechanism in more detail.

At about the time that the splash page (the WebTV logo that comes up before you get to the home page) is appearing on the screen, the box is talking to a service called logserverd. The purpose of logserverd isn't to serve anything; rather, it collects different types of logs that are sent up by the box, including crash logs, TCP logs, error and warning logs, TV logs, and phone logs. What we're interested in here are phone logs, which are sometimes referred to as "connection logs" or occasionally "configuration logs".

A simple phone log looks like this:

    PhoneLog from 014f7c8201000055 (version=27, length=195)
      numPhoneBusy=0 tcpInputPackets=1442
      numPhoneNoAnswer=0 tcpOutputPackets=1589
      [ ... blah blah blah we don't care about this blah blah blah ... ]
      realAudio2Used=0 realAudio3Used=0
    Records:
      0x05 Disconnection when=0x3456bde6 (Tue Oct 28 20:39:02 1997)
        disconnectionType=5 "inactivity timeout" flags=0x04
        connectWhen=0x3456bcb6 (Tue Oct 28 20:33:58 1997)
        dialString='3261095' fullPOPNumber='650-326-1095' []
        LastConnectionSpeed=28800 LastConnectionCompression=2
        PowerOnReason=0 "normal"
      0x06 NVRAMWrite when=0x3456bde6 (Tue Oct 28 20:39:02 1997)
      0x01 RunScriptReport when=0x3456bde9 (Tue Oct 28 20:39:05 1997)
        id=0x30e859c5 modWhen=0x344ce6be [Tue Oct 21 10:30:38 1997]
      0x03 GetDialInSuccess when=0x3456be04 (Tue Oct 28 20:39:32 1997)
        dialString='3261095' fullPOPNumber='650-326-1095' []
        callWaitingPrefix='' dialOutsidePrefix='' longDistancePrefix=''
        accessNumber='' tollFreeAccessNumber=''
        flags=0x04 (waittone ) dialSpeed=1 cwSensitivity=1
        dceRate=33600 dteRate=234000 protocol=0 compression=2
        totalScriptTime=1602 boxIPAddress=207.79.32.54
    PhoneLog_Log() 29677074/014f7c8201000055 PhoneLog.c:334 logserverd[26157] 10/28 20:41:28

Every time the box does something "interesting", it adds an entry to its phone log. When the box gets connected to the service, it sends the log up to logserverd, and erases its local copy. The service collects the logs, which are used to generate usage reports and POP health statistics.

A complete discussion of phone logs is beyond the scope of this document. For now we're just interested in the last entry in the log, which tells us that the box connected successfully to the service. (By definition, the last entry is *always* an indication of a successful connection. If you weren't successfully connected, how did you post the log?)

The entry shows that the box connected to the POP at 650-326-1095 by dialing "3261095". When logserverd sees this, it adds an entry to the list of dial patterns indicating that calls from the user's ANI to the POP at 650-326-1095 should be made with 7-digit dialing.

The service screens out numbers that don't correspond to POPs that might be sent to the box.
If you put a number in the access number field or give the box a dial override to a POP that it wouldn't normally use, the dial pattern table will be unaffected. There are two motivations for being so picky: limited space, and the need to avoid garbage. If you had to dial "9" followed by a 10-digit number, you might be given an override or access number like "96503240657". If the service isn't careful, it would record that you needed an 11-digit dial pattern to dial that number, which wouldn't be accurate. Rather than establish a complex set of rules for screening out *bad* numbers, the service uses a restricted notion of the set of *good* numbers. Dial pattern entries are stored in "most recently used" order. What this means is that the most recently used dial pattern is always at the top of the list. The service only holds onto eight entries, so if we already have eight and then make a new discovery, the entry at the bottom is thrown out. If the box logs in, and we see that the ANI, POP, and dial pattern are already known, we just pull the entry up to the top. If the ANI and POP match but the dial pattern is different, we replace the dial pattern field and then pull it up. To make matters more complicated, we try to reduce database accesses by not adjusting the order if the entry is already one of the top three. The feedback mechanism seems pretty clean on the surface, but there's actually a race condition during login. There's no way to be sure that the phone log will get uploaded before the headwaiter checks to see if the box needs a new tellyscript. If the phone log comes up first, then the headwaiter will compute a new tellyscript that takes into account the latest dial pattern information. If the phone log comes up second, the headwaiter will make its tellyscript decisions without the benefit of knowledge learned from the current phone log. Of course, it's even worse than this if you're on the phone with a customer. 
It's possible for them to have the right patterns but not have a tellyscript that includes that knowledge, because the knowledge was gained after they got through the headwaiter. They have to hang up, come in again, get a new tellyscript with the new patterns, then hang up and redial *again* to actually use the new patterns. For these reasons, customers whose dial patterns have been edited manually are usually told to go back through scriptlessd. They will immediately get a script with the latest information. -= IV.E =- Secret codes, NVRAM, and "have you moved?" The "have you moved?" dialog was briefly described in the ANI section. In short, it appears whenever the box is unplugged. The dialog has changed over time, with the wording being updated with almost every client release. Back in v1.0 the default action (i.e. what would happen if you just hit the "go" button without moving the selection rectangle) was "I haven't moved", in v1.1 and later it changed to "I have moved", and in v1.2 we started showing the user's ANI as well. When you tell the box that you've moved, all it really does is throw out the tellyscript and the headwaiter's IP address. When the box sees that it doesn't have these, it heads off to scriptlessd, which gets the ANI data and sends down a new tellyscript. You can get similar behavior by using the "7265" secret code. This is related to the "7264" secret code, which has a long history of not working right. I don't know offhand which client versions implement the code correctly (I'm told that *none* of the 1.x releases through 1.3 does it right!), so unplugging the box and entering "yes, I've moved" is still the most reliable way to wipe out the tellyscript. Unfortunately this also causes the clock to be reset, and in "Plus" boxes this means that the box can't show the current TV listings until it reconnects. The "32768" secret code wipes out all of NVRAM. 
This is generally a bad thing, because it kills some other things like screen centering and TV configuration. The phone log lives in NVRAM when the box is powered off, so if a user uses 32768 he loses all phone log data collected up to that point. Since the information could potentially help us identify a problem with his POP or phone line, losing it is bad. For these reasons, 32768 should only be used as a last resort. On internal boxes, the "93288" secret code allows you to choose which service you want to connect to. The box will wipe out the tellyscript to force the box to go back through scriptlessd. This is necessary because scriptlessd hands out a shared secret that is used for secure communication. If the different services have different notions of what the shared secret should be, the box won't be able to talk to the new service, so we send it through scriptlessd to make sure the secret is in sync. IMPORTANT: the connection setup information is transient until you actually get connected. If you power the box off, or even go into the dialing setup screen through the convenient button at the bottom of the screen on some builds you will lose the information and end up connecting to the default service for that box. The "1-800-GoWebTV" code (actually 18004693288) clears NVRAM and then sets a "force registration" flag. When the service sees the flag, it sends you back through registration so you can set up a new subscriber. The interactions with tellyscripts are a little funny, because unregistering the box causes some fields in the device to get reset. These fields are normally initialized by scriptlessd. Since we've already been through scriptlessd, though, they get cleared and not set again. In the current service this is generally harmless, but could cause unexpected behavior. Historical note: in the very early days, the box really did have NVRAM (Non-Volatile RAM). 
As a cost-cutting measure, we decided to remove the NVRAM part and dedicate a small piece (about 16K for US "Classic" boxes) at the upper end of the flash ROM for storage. The name "NVRAM" stuck, even though it now refers to flash ROM for "Classic" boxes and a disk block for "Plus" boxes.

-= IV.F =- How phone settings work

There are three prefix fields that may be applied to a dial string. The "Basic" screen has the "Prefix" field, "Call Waiting" has the "Block calls" field, and "Obscure Dialing Options" (known as "Spooky Dialing Options" in v1.1 and v1.2 clients) has the "Long-distance prefix" field.

The "block call waiting" prefix always gets sent first. After that comes either the prefix or the long-distance prefix, depending on which were set and what kind of call you're making. The following chart shows all four combinations of prefixes, and what a local and a long distance call would look like for each:

    prefix=(none), LD prefix=(none)    local=6145539     long=18005551212
    prefix=9,      LD prefix=(none)    local=9,6145539   long=9,18005551212
    prefix=(none), LD prefix=8         local=6145539     long=8,18005551212
    prefix=9,      LD prefix=8         local=9,6145539   long=8,18005551212

The determination of "local" or "long" is made by the service when the tellyscript is generated. POPs that are LOCAL or ExpLocal are treated as local, and toll calls are treated as long. The fallback number is always considered long distance, as are numbers entered with Vend-A-Telly or clientpopedit (if you're using the latter two methods, you shouldn't need a long-distance prefix anyway... just enter the full set of digits you need). Ditto for tellyscripts handed out when "IgnoreANI" is set in the config file (only development servers are configured this way).

For the LD prefix we regard LOCAL and ExpLocal as non-LD, but the system works differently for the "this may be a toll call" dialog. Local calls don't get the dialog, but both ExpLocal and toll calls do.
The reason it's like this is that the dial prefix is assigned based on the telco definition of what a local call is, which often has little to do with the call being inexpensive.

While we're here, I should mention that the "Don't dial 1 for long distance" flag in Obscure Dialing Options doesn't really have anything to do with making long distance calls. If the flag is set, and you're not using an access number, it just checks for a leading '1' on the POP number, and removes it if found. It has no effect on leading '1's in prefix fields.

One final note on prefixes: most of the "Classic" boxes on store shelves and in warehouses are v1.0 clients. These boxes only have the "basic" prefix, so the script behaves as if the other prefix fields exist but were left blank.

Understanding how the other phone settings are handled isn't vital but may come in handy. If you need to understand precisely how something is handled, Initialize() in base.tsf (found in the network source tree) has the ultimate answer.

    - Pulse Dialing - we send a DT to the modem for tone or DP for pulse.
    - Call Waiting - for the US, the S10 setting determines all. There are five values, one meaning "off" and four meaning "on" with different sensitivity levels. For Japan we also set S220.
    - Wait For Dialtone - determines whether the modem should wait until it hears a dialtone, or just sit there for three seconds and then go. Set by tweaking S6.
    - Audible Dialing - send M0 or M1. The tellyscript always turns audible dialing off when connecting with a VideoAd.
    - Dial Speed - three settings, set with S11. We now also set &P to control the speed of pulse dialing. This really only applies in Japan, but the US seems to work with the Japanese settings. The cool thing is you can now crank up the pulse speed if you select "fast dialing".

Access numbers are covered in the next section.
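
The prefix rules from the chart earlier in this section can be sketched in a few lines. This is an illustrative Python sketch, not the tellyscript logic in base.tsf: the function name and parameters are hypothetical, and joining the call-waiting prefix to the rest of the string with a comma is an assumption (the chart only shows commas between the prefix and the number). Access numbers are not modeled.

```python
def build_dial_string(pop_number, long_distance, prefix="", ld_prefix="",
                      cw_prefix="", dont_dial_1=False):
    """Compose a dial string from the three prefix fields (illustrative only).

    long_distance is decided by the service when the script is generated;
    here it is simply passed in as a flag.
    """
    number = pop_number
    # "Don't dial 1": strip a leading '1' from the POP number only,
    # never from the prefix fields.
    if dont_dial_1 and number.startswith("1"):
        number = number[1:]

    # Long-distance calls use the LD prefix if one is set; otherwise
    # (and for all local calls) the basic prefix is used.
    if long_distance and ld_prefix:
        chosen = ld_prefix
    else:
        chosen = prefix

    # The call-waiting prefix always goes first. Empty fields are skipped.
    # ASSUMPTION: every field is separated by a comma (a brief pause).
    parts = [p for p in (cw_prefix, chosen, number) if p]
    return ",".join(parts)


# The four rows of the chart, local and long-distance:
for pfx, ld in (("", ""), ("9", ""), ("", "8"), ("9", "8")):
    print(build_dial_string("6145539", False, pfx, ld),
          build_dial_string("18005551212", True, pfx, ld))
```

The one non-obvious row is prefix=9 with no LD prefix: a long-distance call still gets the basic prefix, because the LD prefix only takes over when it has actually been set.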
-= IV.G =- Radius, access numbers, and PSI

When a box logs in, its tellyscript knows how to send a login and password that the provider of the POP will accept. All of our providers use a system called Radius to verify login names and passwords, and all but one use a "proxy Radius" configuration that allows WebTV to make the actual accept/reject decision.

The usual sequence of events during login starts with the box sending up a login and password to the IAP's Radius server, using the PAP authentication protocol. Attached to the login name is a special prefix or suffix that tells the IAP that the request is coming from a WebTV box. The IAP's Radius server forwards the request to our Radius server, which verifies the login and password, and sends back an ACK ("yes") or NAK ("no") response. By doing things this way, we retain control over which boxes are allowed in, and avoid the hoops we had to jump through with a provider like PSI.

PSI refused to do proxy Radius, so we have to create an account with them for every box before the box ever logs in. This means that scriptlessd has to connect to their system and create the account before the box can hang up and redial. Otherwise, if the account creation attempt failed (as it occasionally does), we would end up giving the user a tellyscript with POPs that they can't dial into. If scriptlessd isn't able to contact PSI, all PSI POPs are stripped out of the script, and the box gets whatever is left. A more thorough discussion of service changes and the potential dangers involved with doing things this way can be found in the source tree in network/src/doc/PSI.

Hybrid IAPs, which can be accessed as either flat-rate or hourly-rate providers, have two different Radius prefixes or suffixes available. One prefix indicates the connection should be billed at the flat rate, the other indicates it should be treated as an hourly rate call.
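
As a rough illustration of how a prefix or suffix gets attached to the login name, here is a small Python sketch. It is not service code. The function name and the "ZTVH/%s" hourly template are made up for the example; the "ZTV/%s" template and the rule that the primary POP uses the first (flat-rate) form while later POPs use the second come from the ZipLink discussion in section IV.N.

```python
def radius_login(box_login, pop_index, flat_template, hourly_template):
    """Return the login name the box would send via PAP (illustrative only).

    Templates use "%s" as the placeholder for the box's login name, so a
    prefix looks like "ZTV/%s" and a suffix would look like "%s@example".
    pop_index 0 is the primary POP; anything later is a secondary.
    """
    template = flat_template if pop_index == 0 else hourly_template
    # Replace only the first placeholder, and avoid %-formatting so that
    # stray '%' characters in a login name cannot cause an error.
    return template.replace("%s", box_login, 1)


# "ZTVH/%s" is a HYPOTHETICAL hourly-rate prefix used only for this example.
print(radius_login("box1234", 0, "ZTV/%s", "ZTVH/%s"))
print(radius_login("box1234", 1, "ZTV/%s", "ZTVH/%s"))
```

For a provider that is not hybrid, both templates would simply be the same string.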
In the 1.0 client, if the last POP in your list failed with a Radius authentication error, you would get a message that said "your box needs to be reconfigured". As of v1.1 the box would simply wipe its brain and restart. More recent service releases removed this behavior, but it may come back depending on what sort of security mechanisms we choose to implement. Authentication failures on POPs other than the last in the list just cause the tellyscript to roll on to the next POP. It's only the last POP that has potentially dire consequences. We know that each IAP has a different Radius suffix or prefix. Each may also require a different password. If you type a POP number into the "Access Number" field in the Dialing Options screen, which values should it use? The trouble is that the tellyscript has no way of determining which IAP the phone number in the Access Number field is associated with. The number is held directly on the box, not by the service, so we'd have to download the complete set of POPs to the box to make this work smoothly. The solution we chose to implement was to use whatever IAP happens to come first. If your primary POP is from CNC, then you can enter any CNC number in the access number field and it will work. Entering a POP number for UUNET, ZipLink, PSI, or any of the other IAPs will fail, unless Radius is configured in a particularly forgiving manner. (Incidentally, this is why we're so paranoid about showing toll-free numbers: we were using CNC's 800 number for quite a while, and the Radius authentication information was exactly the same as CNC's regular POPs. If you were one of the 60% of our customers who had CNC as their primary POP at the time, you could get toll-free access at our expense just by putting the CNC 800 number in your Access Number field.) 
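
The "whatever IAP comes first" rule is easy to state as code. The following Python sketch is only a model of the behavior described above, with a hypothetical function name; it ignores the case where an IAP's Radius setup happens to be forgiving enough to accept another provider's credentials.

```python
def access_number_will_authenticate(script_providers, access_number_provider):
    """Predict whether an Access Number entry can log in (illustrative only).

    script_providers: provider abbreviations in tellyscript order, primary
    first. access_number_provider: the IAP that actually owns the number
    typed into the Access Number field.

    The tellyscript cannot tell which IAP owns the access number, so it
    sends the Radius prefix/suffix and password of the first provider.
    """
    if not script_providers:
        return False
    return script_providers[0] == access_number_provider


print(access_number_will_authenticate(["cnc", "uunet"], "cnc"))    # True
print(access_number_will_authenticate(["cnc", "uunet"], "uunet"))  # False
```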
Some people who do international demos have had cause to enter "cnc-palo", "uunet-palo", or "artemis-palo" in the "Enter Your Phone Number" screen that scriptlessd shows when it can't get ANI data. The reason this was added wasn't so much to allow them to use a specific POP as it was to get a specific IAP into the first position. If you know that a UUNET POP comes first in your tellyscript, you know that putting a UUNET POP in the access number field (along with whatever weird things need to be done to dial out of a foreign country or to dial that IAP's POPs within the foreign country) will work. Because of the difficulty in getting the POP number matched up with the first entry in the tellyscript, using this field is strongly discouraged except in certain rare cases. One place where the access number field is useful is when it's not really used as an access number. A special feature was added to the service to support dialing *suffixes* via the access number field. If the '$' character appears in the access number field, the tellyscript will replace it with the POP number currently being dialed. For example, if you set your access number to "10288,$,54321", and the POP numbers assigned by the service are 3261095 and 6145539, the box will dial the string "10288,3261095,54321" (the commas are brief pauses), and if that fails, it will next try "10288,6145539,54321". (Prefixes like 10288 really ought to go in the prefix fields rather than the access number field; I included it here to show that the '$' can be anywhere.) This isn't really the intended use for the Access Number field, but since the intended use is all but useless it was deemed acceptable. In v1.1 and later clients, 77437 brings up the Obscure Dialing Options page. Here you can enter an "800 access" number that will replace the toll-free scriptlessd number. The scripts sent down by the service don't even look at this field, because it only matters when you're dialing into scriptlessd. 
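
The '$' substitution described earlier in this section can be sketched as follows. This is a hedged Python illustration with a hypothetical function name, not the tellyscript implementation. The '$' case follows the example in the text; dialing a plain access number in place of the POP number is an assumption about the field's ordinary behavior.

```python
def expand_access_number(access_number, pop_number):
    """Return the string to dial for one POP (illustrative only)."""
    if not access_number:
        # No access number set: dial the POP number itself.
        return pop_number
    if "$" in access_number:
        # Suffix-style access number: '$' marks where the POP number goes.
        return access_number.replace("$", pop_number)
    # ASSUMPTION: a plain access number is dialed instead of the POP number.
    return access_number


# The example from the text: two POPs tried in order.
for pop in ("3261095", "6145539"):
    print(expand_access_number("10288,$,54321", pop))
```

Run against the two POP numbers from the text, this prints the same two dial strings given in the example.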
In v1.0 boxes, the Access Number field does double duty, and will change the number used to dial scriptlessd. This makes it extremely cumbersome to use, because you have to set it to one thing while dialing scriptlessd, and then change it to another before dialing into a POP. The toll-free access number field was added to help people doing international demos and other situations where an access number was needed just for scriptlessd calls. -= IV.H =- OpenISP OpenISP, which has also been known as Pick-an-ISP, BYOISP, and OpenAccess, has been around conceptually since one of the early "connectfolk" meetings in late 1996. It wasn't until the first part of 1997 that it went from being considered more trouble than it was worth to a high priority. One of the driving factors was competition: every competitor we had claimed to work with arbitrary ISPs, and in fact some of our competitors used this feature as their sole distinguishing characteristic. The idea behind OpenISP is that you can choose to use your own ISP instead of the ones that WebTV provides. Any ISP that supports PPP (a standard network protocol) and PAP (a standard method of sending up login and password information) will work. All you have to enter are your login name, password, and the phone number to dial, and everything else just works. Surprisingly few changes were needed to implement OpenISP. Most had to do with presenting an appropriate user interface, and making sure that the feature was activated and disabled when appropriate. The login, password, phone number, alternate phone number, and an ISP name (which isn't really used) are all stored in NVRAM on the box, and the tellyscript pulls these values out and uses them. The service doesn't store these values (see "Keeping OpenISP Closed" on the DocArchive web site listed in section VI). 
Because of an early design decision that later got changed (for a while the box was going to inject the login and password into the script; now the script goes looking for the data), and also to keep the size of a tellyscript small, tellyscripts are either OpenISP scripts or standard scripts. We don't send down a script that can either dial OpenISP or dial standard POPs. This may change.

You can tell if somebody most recently received an OpenISP tellyscript by looking at the information shown by clientinfo or CMR. It will look something like this:

    Most recent script sent to client:
        Hash 0xdb4fc0fc, sent Tue Oct 28 13:13:03 1997
        v36 base/-
        v2 locale/-
        v2 OpenISP/-

The only IAP listed is "OpenISP". The service doesn't know what provider they're using or what number they're dialing, so those can't be shown.

The call ordering for OpenISP is like this if they entered one number:

    Call first number
    Retry first number

If they entered two numbers, it goes like this:

    Call first number
    Call second number

After making two calls we give up. We never try a fallback number. If you want to use your own ISP, guess what, you're going to use your own ISP.

The Access Number field is ignored for OpenISP users, unless they use the fancy kind of access number that has a '$' in it. Dial patterns and dial overrides have no effect on OpenISP customers.

-= IV.I =- Client upgrades and brain-dead boxes

Client upgrades for WebTV "Plus" boxes are terribly uninteresting, because they can do the download without disconnecting. Also, WebTV "Plus" boxes with damaged approm images go into the "mini-browser", which has most of the features you'd find in a full v1.3 client. The discussion here concentrates on "Classic" boxes, which are far more interesting.

Client upgrades (a/k/a flash downloads) for WebTV "Classic" boxes are done by the boot ROM, because you can't be executing code from a ROM image that you're updating.
The boot ROM has a minimal subset of the features available in the full ROM (it's 1/8th the size). All it really knows how to do is dial in, issue simple requests, and write chunks of data into flash ROM. The usual behavior is that flashromd tells the client to go flash itself. The box hangs up, dials back in with the current tellyscript, reconnects to the same flashromd, and starts asking for parts. When it has all of the pieces, it hangs up and reboots. (More details than you could possibly be interested in are available from network/src/doc/flashromd.) Because the box is essentially a v1.0 client during downloads -- regardless of what client version was running on the box before -- some tellyscript gymnastics are required to get at dialing options added after v1.0, notably the "don't dial 1" flag and OpenISP settings. These were broken for a while, but should work as expected now. (See the "trouble with dial options" document at http://webhost-1/~fadden/DocArchive/ for details.) The term "brain-dead box" refers to a WebTV "Classic" unit with a damaged ROM image. The easiest way to get brain-damaged is to initiate a download and have it interrupted before completion. When the box restarts, it does a checksum on the ROM, and discovers that things don't look the way it had expected. It boots into the boot ROM and immediately starts a flash download. The boot ROM ignores everything in NVRAM, because flash is corrupted and NVRAM is held in flash. It will accept an access number and a dial prefix, which have to be entered with the extremely limited user interface supplied by the boot ROM, but most of the other dial options can't be set. You can't use any secret codes with a brain-dead box, because the codes are handled by the full client ROM, not the boot ROM. Every time the brain-dead box is powered on, it connects to scriptlessd, asks for a tellyscript and an IP address for flashromd, then disconnects and executes the tellyscript. 
After connecting to the local POP, it initiates a download. An interesting problem arises when an OpenISP box becomes brain dead. We no longer have access to the person's OpenISP login and password, because those are kept in NVRAM, and we can no longer believe that NVRAM is valid. We have to send them somewhere else. But where? The obvious choice is to send them to the POPs that they would have if they weren't an OpenISP user, but there are a couple of problems with that. First of all, the user might have signed up for OpenISP because they didn't have any local POPs. Their POPs might be toll calls, which isn't going to make them very happy. Second, it's possible that their primary POP is a flat-rate IAP, which means we will have to pay for a full month of service for this user if they only show up once to do a download. There are two alternate solutions. The first is to send the user to an 800 number. This is a fairly good solution, because it doesn't cost the user anything, and it may well cost us less than the usual primary POP. The down side is that it requires a large short-term increase in port capacity on our 800 lines. If we have a hundred thousand OpenISP users, and even a small percentage go brain-dead, we're going to need to add a lot of modems for a couple of weeks to handle the load. The second solution is better but more difficult. The box ignores the NVRAM settings because the ROM checksum failed, and it can't trust that the values in the NVRAM section of memory are good. However, the tellyscript that we send down is capable of running its own checksum on NVRAM, and using the values there if they're valid. This gets complicated when you consider that, until now, tellyscripts are either OpenISP or non-OpenISP. The second solution requires that the box be able to dial either, and decide which it's going to do when the box starts up. 
The only good news is that the box will arrive at scriptlessd when brain-dead, and won't store the "double" script in NVRAM, so any wackiness in the script doesn't necessarily have to affect anybody else.

We will need to move to the second solution at some point, but for now we're just sending brain-dead OpenISP boxes to an 800 number.

-= IV.J =- ComingSoon and friends

The "coming soon" program was, arguably, a bad idea. What is indisputable is that it cost an arm and a leg. We will likely have some hangers-on for a while yet, so it's worth explaining what it is, why we did it, and why it went away.

In the halcyon days of WebTV's youth, we discovered that our IAPs' claims of covering well over 90% of the country were subject to interpretation. They weren't far off -- the actual figure was around 87% -- but that last 13% was a large and noisy bunch. In an attempt to kick-start an increase in local coverage as we were entering the 1996 holiday season, we were directed to institute what became known as the "coming soon" coverage plan. Rather than wait until we had a signed contract with an IAP, we would provide the same coverage that the IAP did using an 800 number.

To be eligible for "coming soon" access, you had to be in a situation where you didn't have a local call to a "real" POP, but did have a local call to a "coming soon" POP. That meant you didn't have a local call, but you were going to have one real soon. The POP lists for the IAPs that were coming real soon were added to the PhoneDB, and pretty soon we were letting hundreds of people surf the net at our expense.

Getting new IAPs to sign up turned out to be a bit of an ordeal. Some of the IAPs we threw into the mix weren't technically competent or didn't have (and would likely never have) the kind of capacity we needed. Others were unwilling or unable to configure Radius servers the way we wanted, and some took months of negotiation before either they signed or we gave up in frustration.
The net result was that we were paying per-minute charges for several months. The project, which cost several million dollars over its lifetime, was finally killed in October 1997. A couple hundred people still didn't have a "real" ISP, so they were "grandfathered" in with dial overrides to a different 800 number. A similar but less painful situation exists in Phillips, WI and Webb, MS. These two small towns were to be part of an advertising campaign capitalizing on the names of the cities. Since neither had a local ISP, both were granted perpetual free access via an 800 number. Nothing ever came of the marketing plan, but we still shell out money for a box in the library in Phillips. In both cases, the override was done for the entire NPA/NXX by making a special entry in the PhoneDB. -= IV.K =- Pick-yer-POP The Pick-yer-POP program was a good idea that had some serious flaws. The basic idea was to allow the customer to choose their own POPs from a list. They would be able to specify how many digits to dial, and change to a different POP at will. The most significant barrier to implementing this was flat-rate IAPs. If a user switched between three different flat-rate IAPs during the course of a month, we would have had to pay 3x the fees for that one user. A related issue is what happens when a user chooses an hourly-rate IAP as the primary, and then proceeds to use it for a large number of hours. With a flat-rate primary we would pay a fixed amount, but with an hourly primary the costs could be much higher. We can't afford to lose control over POP assignments unless we have some way of making the user share the costs. If they use a POP that costs us more than the POP that we would have given them, we have to bill them for the difference. Unfortunately this is difficult to calculate, and even more difficult to explain to the customer. Pick-yer-POP also removes any hope of load balancing. 
I would expect users struggling to get in during peak hours to change their POP frequently, resulting in large swings between local IAPs and lots of complaints.

The proposed implementation for Pick-yer-POP was essentially a user-driven dial override. Even now clientpopedit allows you to specify whether an override is being set for Pick-yer-POP or not. This will likely be removed in a future service release.

-= IV.L =- MessageWatch and EPG

MessageWatch is the fancy name we use for a feature that allows the box to dial in at a specified time and check for new mail. The idea was to have it log in during the early morning hours, so that you can see if you have new mail without needing to log in when you wake up. Unfortunately, a fairly large number of people configured it to log in around 5pm, so that the mail light is set when they get home from work. This is unfortunate because it means the boxes on the west coast are coming in at the height of peak usage on the east coast.

Whatever the case, MessageWatch connections are vastly simplified versions of normal connections. A few salient facts:

    - The box only talks to the headwaiter. It continues to accumulate phone log data, but doesn't send anything up to logserverd.
    - The box will retry every 30 minutes if it can't get in.
    - The box will shut itself off after 2.5 (?) minutes, no matter what.
    - If a user has one local and one toll call, only the local POP will be used.
    - If a user has nothing but toll calls, only the first toll POP will be used.

If a user is seeing multiple calls starting at a specific time and separated by 30 minutes each on his or her phone bill, chances are MessageWatch is involved.

WebTV "Plus" boxes do something similar with EPG (Electronic Program Guide) data downloads. However, in the 2.1 client, the EPG downloader won't stop with the first POP if the second one is toll. There are plans to fix this for future client releases.
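
The MessageWatch POP rules in the list above amount to a small filter. The Python sketch below is an illustration only, with a hypothetical function name and data layout. The text covers the "one local, one toll" and "all toll" cases; keeping every local POP when there is more than one is an assumption.

```python
def messagewatch_pops(pops):
    """Pick the POPs a MessageWatch call may dial (illustrative only).

    pops: list of (number, is_toll) tuples in tellyscript order.
    """
    local = [number for number, is_toll in pops if not is_toll]
    if local:
        # At least one local POP: toll POPs are never dialed.
        # ASSUMPTION: all local POPs stay eligible, in order.
        return local
    # Nothing but toll calls: only the first toll POP is used.
    return [pops[0][0]] if pops else []


print(messagewatch_pops([("3261095", False), ("6145539", True)]))
print(messagewatch_pops([("6145539", True), ("5551212", True)]))
```

Either way the box makes at most one toll call per attempt, which is why a MessageWatch problem shows up on a phone bill as single calls spaced 30 minutes apart.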
In "Classical" and earlier service releases, MessageWatch is only enabled when the user turns it on. In "Disco" and later, it may be enabled for all new users by default.

-= IV.M =- Idle timeouts

Idle timeouts make the box disconnect from the service and hang up the phone when nothing has happened for a set period of time. There are two kinds of idle timeouts, input timeouts and network timeouts.

Input timeouts happen when the user stops using the box. If the box doesn't see any activity from the user, such as typing on the keyboard or hitting buttons on the remote control, it will disconnect after 10 minutes. This timeout is set by the service. If the user is connected through an 800 number (determined by comparing the box's IP address against a list of known values), the input idle timer is reduced to 5 minutes.

Network timeouts happen when no packets are being transmitted between the box and service. The box used to have a network idle timeout, but this is no longer in use. However, some IAPs, notably CNC, have idle timeouts on their equipment. After 30 minutes with no network activity, CNC's terminal servers will drop the line. If a user is flipping through a large page, or is composing a long e-mail message, there is no network activity. The box won't choose to disconnect, but the terminal server will. If a user is experiencing line drops while composing long e-mail messages, this is probably the cause.

Some providers have time limits that don't care whether you're idle or not. After an hour or two the connection is dropped, so that computer users can't leave their machines running and wander off. (Some computers will just redial when disconnected anyway, but try telling that to the IAPs.) We have added something similar in the form of a usage cap on the fallback 800 number (more details later).

-= IV.N =- Adding new providers

Adding a new provider to the system isn't something that most people will have to do.
If done incorrectly, however, it can adversely affect a large number of people. This section explains the right way to do it, when it should and shouldn't be done, and how things fail if it's done the wrong way.

Each IAP should be a separate provider. A provider is defined by a "Provider:" line in a POP list in the PhoneDB. Several attributes are defined for each:

    - Symbol. This is a single character that represents the provider in certain output formats. The PhoneDB doesn't explicitly check this for uniqueness, so it may be unwise to depend on this value. CNC's symbol is 'C'.
    - Abbreviation. This uniquely identifies the provider, and is used in tellyscripts. SOC is using the IAP's domain name as the basis for choosing abbreviations. CNC's abbreviation is "cnc".
    - Cost (also known as "static priority"). The higher the cost, the less willing we are to use the POP. This only affects PhoneDB generation; it has no effect on load-balancing. The costs are relative to each other, and have no absolute meaning or relationship to actual dollar amounts.
    - Billing method. May be "flat", "hourly", "per-port", or "flat-hybrid". For PhoneDB generation the only thing that matters is whether it's "flat" or not, but other service components (like POPtimization and tellyscript generation) are more discriminating.
    - Full name. For CNC this is "Concentric Network". This is rarely used.

All of the above are included in and available from the PhoneDB. The "dumppops" phonetool command will display them (see the phonetool README).

The choice of abbreviation is important, because it's used in the tellyscript, in reports, and often in casual conversation. It has to follow C syntax rules for function names, which means it has to start with a letter and may only contain letters, digits and the underscore ('_'). No spaces, dashes, periods, or other fancy characters are allowed. It can't be longer than 15 characters, and by convention is entirely lower case.
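
The abbreviation rules above are mechanical enough to check before a POP list goes anywhere near the PhoneDB. The following is a minimal Python sketch of such a check; it is not part of phonetool or rawphonetool, and the function and constant names are invented for the example.

```python
import re

# Starts with a letter; letters, digits, and underscore only.
_ABBREV_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
MAX_ABBREV_LEN = 15  # length limit stated in the text


def check_abbreviation(abbr):
    """Return a list of problems with a provider abbreviation (empty if OK)."""
    problems = []
    if not _ABBREV_RE.fullmatch(abbr):
        problems.append("must start with a letter and contain only "
                        "letters, digits, and '_'")
    if len(abbr) > MAX_ABBREV_LEN:
        problems.append("longer than %d characters" % MAX_ABBREV_LEN)
    if abbr != abbr.lower():
        # Convention rather than a hard rule, but worth flagging.
        problems.append("not entirely lower case")
    return problems


for candidate in ("cnc", "uunetdan", "cnc-800", "1cnc", "CNC"):
    print(candidate, check_abbreviation(candidate) or "ok")
```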
More information on POP lists can be found in the rawphonetool README (network/src/tool/rawphonetool/README).

Adding the "Provider:" line and a few POPs to a POP list is only half the story. The other half is adding a new .tsf file. When tellyscripts are generated, the service gathers up the .tsf files for every provider that might be dialed, and combines them with several other components to form the complete script. The service doesn't attempt to verify that the tellyscript fragment is correctly written, so it is imperative that the script be error-free.

Here's the current script fragment for ZipLink (ziplink.tsf):

-----
/* TLLY ver=2 */
/*
 * This is included from "ziplink.tsf".
 */

Chat_ziplink()
{
    setwindowsize(7);
    return PAPChat("ZTV/%s", 0);
}

Chat_ziplink_2()
{
    setwindowsize(7);
    return PAPChat("ZTV/%s", 0);
}

/* --- end of ziplink.tsf --- */
-----

The "TLLY ver=2" at the top specifies the version number that you see on the "clientinfo" output. This should be incremented every time the script is changed. The first line must look EXACTLY like the one shown above, or the service will reject the script.

There are two C-like functions, both named with the provider abbreviation. They each call the setwindowsize() function, which sets a TCP window size that may be different for each provider (7 works for nearly everyone), then they call PAPChat with an argument that specifies how the Radius prefix or suffix is to be applied. The "%s" gets replaced with the box's login name. In this case, ZipLink uses a Radius prefix of "ZTV/".

There are two functions because there are two different ways to get to ZipLink, the flat-rate way and the hourly-rate way. This is how we support the "flat-hybrid" billing model: the tellyscript calls the first function for the primary POP, and the second function for later POPs. We're not currently taking advantage of the hourly-rate plan for ZipLink, so both prefixes are the same.
It doesn't really hurt to have both functions when we're not using the
feature, but it does hurt if we're missing one and try to use it, so it's
best to define both and make them equivalent.

The format is simple enough, but if you have any doubts you can always run
the .tsf file through a C compiler. (You will want to have some other
things defined if you don't want to be drowning in warnings; see
network/src/lib/tellyscript/scripts/ScriptIncl.h.)

What happens if we have a .tsf file, but no "Provider:" entry in a POP
list, and therefore no information about the provider in the PhoneDB?
Nothing. The service will not have heard about the provider, so it won't
try to use it. Heck, without a POP list there's nothing to use anyway.

What happens if there's an entry in the PhoneDB, but no matching .tsf
file? Bad things. headwaiterd will refuse to send a new tellyscript to
people who would get a script with the partially-defined provider, and
scriptlessd will actually send people off to wtv-*. The reason scriptlessd
was written this way was to avoid sending such users to an 800 number, and
to make it immediately obvious that a serious but easily correctable
problem exists. It would be nicer to send the users to alternative POPs
and inform a pager instead of the customer. This may be implemented in a
future service release, especially since brain-dead boxes will report the
mysterious "couldn't get IP address" error in this situation.

What happens if there's a PhoneDB and a .tsf file, but the .tsf file
contains an error, or is missing the second function for a hybrid
provider? Very, very bad things. The box will probably crash when it tries
to execute the tellyscript. For the case of the missing second function,
the failures will be intermittent, because they will only happen when
users who have the provider as a secondary fail to connect to their
primary POP.
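Because a missing second function only bites intermittently, it is the
kind of mistake worth catching mechanically. The sketch below is a
hypothetical pre-release check, not something the service does today: it
takes the text of a .tsf file and a provider abbreviation and reports
whether both Chat_<abbrev>() and Chat_<abbrev>_2() appear. The function
names and TSF_* flags are made up for this example, and the check is a
plain substring search, so a name that appears only inside a comment would
fool it.

```c
#include <stdio.h>
#include <string.h>

#define TSF_OK                0
#define TSF_MISSING_PRIMARY   1   /* no Chat_<abbrev>( found   */
#define TSF_MISSING_SECONDARY 2   /* no Chat_<abbrev>_2( found */

/*
 * Returns 1 if `tsf_text` contains "Chat_<abbrev><suffix>(", else 0.
 * This is a substring search, not a tellyscript parser.
 */
static int has_function(const char *tsf_text, const char *abbrev,
                        const char *suffix)
{
    char pattern[64];
    int n = snprintf(pattern, sizeof pattern, "Chat_%s%s(", abbrev, suffix);

    if (n < 0 || (size_t)n >= sizeof pattern)
        return 0;   /* too long to be a legal abbreviation anyway */
    return strstr(tsf_text, pattern) != NULL;
}

/*
 * Hypothetical check: returns TSF_OK if both functions are present,
 * otherwise a combination of the TSF_MISSING_* flags.
 */
int check_tsf_functions(const char *tsf_text, const char *abbrev)
{
    int result = TSF_OK;

    if (!has_function(tsf_text, abbrev, ""))
        result |= TSF_MISSING_PRIMARY;
    if (!has_function(tsf_text, abbrev, "_2"))
        result |= TSF_MISSING_SECONDARY;
    return result;
}
```

A check like this says nothing about whether the fragment is otherwise
correct; it only flags the one omission that is easy to miss by eye.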
It is *always* prudent to test new PhoneDB and .tsf combinations with the
Vend-A-Telly page before releasing them to customers.

In general, there should be a 1-to-1 mapping between providers and IAPs.
The load balancing and provider interleaving algorithms do their best to
avoid saturating a user with POPs from the same provider, but the only way
they can tell whose POPs are whose is by the provider abbreviation. If you
split cnc into cnc1 and cnc2, there's nothing to prevent the user from
getting cnc1 as their primary provider and cnc2 as their secondary, and a
localized network outage within CNC will shut out the user.

If a provider has multiple categories of POPs, such as new POPs with
higher capacity that are meant to replace older ones, you can give higher
priority to the better POPs by assigning cost values to individual POPs.
This will cause the PhoneDB to place them ahead of otherwise equivalent
POPs from the same provider, and will prevent the provider rotation in the
service from swapping the POPs around.

There are two cases where we've broken this rule. The first is cnc vs
cnc800. We used a separate provider here to make it easy to spot the users
who were on the 800 number. This was only used for "coming soon" and other
special programs, so there was no risk of multiple CNC assignments causing
trouble.

The second case was "uunet" vs "uunetdan". Again, it was felt important to
distinguish the two because we had radically different pricing on them,
and more importantly we wanted the load balancing parameters to only
affect the uunetdan set. Since uunet was given a very high cost (low
priority), and uunetdan a very low cost (high priority), there was little
chance of a user ending up with one of each unless they had no other POPs
anyway.


-= IV.O =- VideoAds

VideoAds are short (15-second) VideoFlash clips that play when the box is
powered on. They are downloaded during a MessageWatch connect, play once,
and are then thrown away.
This feature was first added in the 1.3 client.

There are a number of restrictions on the set of users that get VideoAds.
The download takes about 5 minutes, which isn't terribly long, but if the
box is making a toll call every night it can add up. We also want to
control our own costs by not sending the VideoAds to users with
hourly-rate POPs. Even if the revenue from an ad impression is more than
the cost of a 5-minute call on an hourly IAP, we won't come out ahead
unless the user logs in almost every day. We only get the revenue if the
ad plays, and the ads are sent down every night whether the box plays the
ad or not.

The rules are:

 - Don't send it if they're using OpenISP.
 - Don't send it if they're making an ExpLocal or toll call. For
   MessageWatch connects, this can only happen if they have no local
   calls at all (see the section on MessageWatch and EPG).
 - Don't send it if they're on an 800# POP. This includes "coming soon"
   POPs and the fallback 800 number. We can determine the former, and the
   script can block the latter.
 - Don't send it if they're connected to an hourly provider. I'm going to
   approximate this by checking the primary, on the assumption that we
   never assign hourly POPs as primary if there's a flat or per-port
   available. (This is a bad assumption when POPtimization is in effect,
   but it'll do for now.)
 - Don't send it if they're not in the right user category.

The VideoAd plays during the first part of the box's connection to the
service. Instead of seeing the Road to Nowhere and the connection status
bar, you watch the movie. Audible dialing is disabled for connections that
start with a VideoAd playing.


-= IV.P =- Automatic Number Frustration

There are cases where ANI doesn't work that weren't worth covering in the
introductory sections, but should be mentioned for completeness.

One of the unnerving things about ANI is that anybody with a T1 or PRI can
convince you that they're calling from anywhere.
Some PBX systems, especially those targeted for use by telemarketers,
explicitly allow you to set the outbound ANI and CallerID information.
This means that sometimes the service will receive ANI that is
inadvertently or deliberately misleading.

A prime example is Microsoft, whose Redmond campus phone system was
sending up ANI values that looked like "100-010-1180" or "100-111-5566".
Clearly these aren't valid US phone numbers. In this situation, the
service will put up the "enter your ANI" page.

A more insidious example is a store whose number was 804-850-xxxx. After
an area code split, their number changed to 757-850-xxxx, but the PBX was
never updated. When CCMI finally removed the exchange from the database,
we no longer recognized the ANI as valid, and (on the assumption that it
was a new exchange that we weren't recognizing yet) the service started
handing out tellyscripts with an 800 number. Not only does this cost us
money, it might cost the store money in the future: if the exchange were
used for a different location in the new area code, it might be a
considerable distance away from the store, and the store would start
making toll calls because their ANI is wrong.

Another fun case was the user showing up with 415-700-xxxx. This exchange
doesn't exist, and apparently never has. As it happens, the caller with
this ANI was in Paris, France, and was using an international 800 code to
get to us. For whatever reason, the carrier decided to return 415-700-xxxx
as the source.


                                     |
                                     |
                      -=*=- V. Extra Goodies -=*=-
                                     |
                                     |

-= V.A =- OraclePhoneDB and POPtimization

Until the "Disco" service at the end of 1997, the PhoneDB had just been a
file on disk. With Disco, the PhoneDB is also kept in the database, and in
some future release the disk file may vanish altogether. The purpose
behind this is to gain greater flexibility and provide direct access to
the PhoneDB for database queries.
One of the more important developments associated with the OraclePhoneDB
(so named because we're using an Oracle database right now) is an
optimized POP assignment system, usually referred to as POPtimization.

The goal of POPtimization is to assign POPs on an individual basis, rather
than on an exchange area basis. The current load balancing system has a
number of flaws, but the biggest of them is that it doesn't consider
groups of people. When you log in, it looks at the usage percentages
assigned to the different providers, looks at your serial number, makes an
assignment, and then forgets all about you. It doesn't know how many ports
each POP has, and even if it did, it wouldn't know how many users have had
that POP assigned to them, or which of those users is likely to dial in
during peak hours.

POPtimization takes into account customer usage patterns (like number of
hours per month and typical time of day logging in), POP capacity, and
several other factors, and assigns POPs to all users in an entire region.
This allows us to use all the capacity that is available while minimizing
costs.

The aspect of POPtimization that most directly affects Customer Care and
Operations is that tellyscripts can now hold multiple sets of POPs, and
can invoke a different set based on what time it is, what day of the week
it is, and what month it is. Here is an example of a tellyscript
assignment with two sets, one for October 1997 and one for November 1997:

    Most recent script sent to client:
      Hash 0x91c63c79, sent Wed Oct 22 20:54:41 1997
        v42 base/-
        v2  locale/-
        v0  ---Poptimized/199710
        v1  wpb/3261095
        v2  cnc/6870610
        v1  wpb/16503261095
        v2  cnc/16506870610
        v1  artemis/18004653537
        v0  ---Poptimized/199711
        v1  wpb/3261095
        v2  cnc/6870610
        v1  wpb/16503261095
        v2  cnc/16506870610
        v1  artemis/18004653537

This example shows only two sets, but a tellyscript might have as many as
eight.
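To make the "which set is in effect right now" idea concrete, here is a
minimal sketch of how a set could be chosen from the month, the day of the
week, and the time of day. It is not the tellyscript's actual logic: the
struct layout, the function names, and the treatment of time windows
(equal start and end meaning "all day", start later than end meaning the
window wraps past midnight) are all assumptions made for illustration. The
example_sets table mirrors the two months in the listing above, with
every-day, all-day windows assumed.

```c
#include <stddef.h>

#define MAX_POP_SETS 8   /* "as many as eight", per the text above */

/* Hypothetical description of one POP set's validity window. */
struct pop_set {
    int year;            /* e.g. 1997 */
    int month;           /* 1..12 */
    unsigned day_mask;   /* bit 0 = Sunday ... bit 6 = Saturday */
    int start_min;       /* minutes after midnight, inclusive */
    int end_min;         /* minutes after midnight, exclusive */
};

/* Illustrative data only: October and November 1997, every day, all day. */
const struct pop_set example_sets[2] = {
    { 1997, 10, 0x7F, 0, 0 },
    { 1997, 11, 0x7F, 0, 0 },
};

/* Returns 1 if `minute` falls inside the set's time window. */
static int time_in_window(const struct pop_set *s, int minute)
{
    if (s->start_min == s->end_min)
        return 1;   /* assumption: equal bounds cover the whole day */
    if (s->start_min < s->end_min)
        return minute >= s->start_min && minute < s->end_min;
    /* assumption: start > end means the window wraps past midnight */
    return minute >= s->start_min || minute < s->end_min;
}

/*
 * Returns the index of the first set matching the given date and time,
 * or -1 if none matches.  `weekday` is 0 (Sunday) through 6 (Saturday).
 */
int select_pop_set(const struct pop_set *sets, size_t count,
                   int year, int month, int weekday, int minute)
{
    size_t i;

    if (weekday < 0 || weekday > 6)
        return -1;
    if (count > MAX_POP_SETS)
        count = MAX_POP_SETS;
    for (i = 0; i < count; i++) {
        const struct pop_set *s = &sets[i];

        if (s->year == year && s->month == month &&
            (s->day_mask & (1u << weekday)) != 0 &&
            time_in_window(s, minute))
            return (int)i;
    }
    return -1;
}
```

What the box should do when no set matches is a policy question this
sketch deliberately leaves to the caller.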
The output of clientinfo will also show the sets in a hierarchical
fashion:

    POPtimized assignments:
      MONTH Oct 1997
        DAYS SMTWRFS
          TIMES 00:00 - 00:00
            POP 1  0:650-326-1095  conn=F
            POP 2  1:650-687-0610  conn=P
            POP 3  2:650-687-2255  conn=H
      MONTH Nov 1997
        DAYS SMTWRFS
          TIMES 00:00 - 00:00
            POP 1  0:650-326-1095  conn=F
            POP 2  1:650-687-0610  conn=P
            POP 3  2:650-687-2255  conn=H

The "conn=X" part tells you if the connection is supposed to be (F)lat
rate, (P)er-port, or (H)ourly rate. These aren't used yet, and can be
ignored for now.

Switching POP sets on calendar month boundaries is especially important
when flat-rate IAPs are used. The IAP bills us if a call starts in a
particular month, so if we can have the box switch between two flat-rate
providers exactly on a month boundary, we won't end up paying two IAPs for
the same box in one month or the other.

The POPtimization data is determined over the set of existing users, and
is updated periodically. New users will either get default POPtimization
data for their area, or will just get the standard load-balanced PhoneDB
selection, depending on how we implement it.

For details on how the POPtimization is performed, contact Joy Mundy
(email=joy).

There are a number of operational issues related to OraclePhoneDB and
POPtimization, but it's not really appropriate to list them all here. Some
notes are available on the http://webhost-1/~fadden/DocArchive/ site, and
Joy has a status page on http://webhost-1/~joy/Poptimization_Status.htm.

The service has a number of safeguards to prevent really bizarre behavior.
For example, every POP in every set must be one of the ones shown by
POP-O-Rama. There is no way for corruption in the POPtimization tables to
cause a box to dial a POP that is completely wrong unless the PhoneDB
itself is damaged somehow.
The results from the OraclePhoneDB are currently compared bit-for-bit
against the results from the file version, and the file is checksummed and
verified in various ways, so PhoneDB corruption is unlikely unless the POP
lists or PhoneDB tools are screwed up. And if the POP lists or PhoneDB
tools are broken, then we'll have problems whether we're using the PhoneDB
in Oracle or in a file. I'm not expecting customers to be adversely
affected by POPtimization.

If a box loses power, it loses track of the date and time. In such an
event, it will behave as if it were Wednesday at 7pm (local time) in the
most recent month in the script.


-= V.B =- Fallback usage cap

This section is rather brief, partly because the details aren't yet
finalized, and partly because I wasn't involved with its design or
implementation. Comments or questions on this feature should be addressed
to Wiltse Carpenter (wiltse@corp.webtv.net).

The basic idea is to cut our costs by reducing the amount of time that
people spend on the fallback 800 number. This is accomplished with two
different mechanisms, a per-session limit and a per-month limit.

The per-session limit is like the 10-minute idle timeout that the box has,
except that it forces you to hang up and redial whether you're idle or
not. The intrusive nature of this timeout is bound to cause complaints.

The per-month limit prevents you from using the fallback number if you
have used more than a set number of hours in a calendar month. This caps
the per-user cost at a tolerable level, while still allowing relief during
temporary POP outages. We found that a handful of users accounted for a
large percentage of the costs, so the cap should dramatically reduce costs
while only affecting a handful of users.

When the monthly usage cap is exceeded, users dialed in through the
fallback number will get an HTML page from the headwaiter telling them
that all local POPs are unavailable.
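The two limits interact in a simple way, which the sketch below tries to
capture. Since the real details aren't finalized, everything here is
hypothetical: the names, the structure, and especially the numbers in
example_limits, which are placeholders rather than actual policy. It also
assumes the caller supplies month-to-date fallback usage that already
includes the current session.

```c
/* Hypothetical limits; neither field reflects a decided policy value. */
struct fallback_limits {
    int session_limit_min;   /* force a redial after this many minutes  */
    int monthly_cap_min;     /* block fallback once usage exceeds this  */
};

/* Placeholder numbers for illustration only. */
const struct fallback_limits example_limits = { 30, 10 * 60 };

enum fallback_action {
    FALLBACK_OK,             /* let the session continue                */
    FALLBACK_FORCE_REDIAL,   /* per-session limit hit: hang up, redial  */
    FALLBACK_SHOW_CAP_PAGE   /* monthly cap exceeded: show the page     */
};

/*
 * Decide what to do with a box that is connected via the fallback
 * number.  `month_minutes` is fallback usage so far this calendar month
 * (assumed to include this session); `session_minutes` is the length of
 * the current call.  The monthly cap is checked first because it
 * overrides everything else.
 */
enum fallback_action fallback_check(const struct fallback_limits *lim,
                                    int month_minutes, int session_minutes)
{
    if (month_minutes > lim->monthly_cap_min)
        return FALLBACK_SHOW_CAP_PAGE;
    if (session_minutes >= lim->session_limit_min)
        return FALLBACK_FORCE_REDIAL;
    return FALLBACK_OK;
}
```

Note that the per-session check ignores whether the user is idle, which
is exactly what distinguishes it from the box's idle timeout.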
Some regional adjustments may have to be made for areas with chronic POP
problems.


-= V.C =- MCI

WebTV and MCI have reached an agreement that will allow WebTV customers to
switch to an MCI/WebTV co-branded service. Such customers will use MCI
POPs exclusively (they may or may not get our fallback number), and will
pay a lower WebTV fee per month if they are also subscribed to MCI's long
distance service.

The trouble with using the MCI POPs is that they only support CHAP
authentication, while the box only supports PAP. To use their POPs we need
to add CHAP support to the box. In the meantime we still want to sign up
customers for the co-branded service, so for late 1997 and early 1998 we
will be using the normal WebTV POPs for MCI customers. This will change
after the next client and service release.

We will have some troubles with flash downloads, because the flash
downloader on "Classic" boxes will always behave like a 1.0 client, and
therefore can't negotiate CHAP. The tellyscript sent to MCI customers will
have to be able to dial either MCI POPs or normal WebTV POPs, and will
switch based on whether or not the currently executing ROM supports CHAP
authentication.

Because of the potentially large number of MCI customers who will briefly
be using our POPs, we only regard a customer as eligible for MCI if they
have a local MCI POP *and* they have two "normal" local POPs that aren't
from flat-rate IAPs. It is possible for a customer to gain or lose
eligibility without any changes in MCI's POP list.


                                     |
                                     |
                   -=*=- VI. For Further Reading -=*=-
                                     |
                                     |

-= VI.A =- On the web

Resources available on the web, internally or externally.

http://webhost-1/~fadden/DocArchive/
    A collection of documents on various subjects, some related to the
    material here, some not. Take a look sometime.

http://webhost-1/~fadden/todo_list.html
    My to-do list. Relevant because most of the items have some relation
    to requested features for PhoneDB generation or dialing.
    If something hasn't been added but you think it should be, check here.

http://hyperarchive.lcs.mit.edu/telecom-archives/
    TELECOM Digest archives. Several years' worth of interesting articles.

http://frodo.bruderhof.com/areacode/
    Area code split details.

http://www.areacode-info.com/
    Assorted area code stuff.

http://www.cnet.com/Content/Reviews/Compare/56kmodems/index.html
    Reviews of 23 56K modems.


-= VI.B =- In the service source tree

Documents checked into the service source tree. Consult your friendly
neighborhood tech pubs person for web versions.

network/src/doc/DialingInfo
    This file!

network/src/doc/ANICodes
    List of OLS codes (found in the first two digits of the ANI number).

network/src/doc/IntlPhoneNotes
    A few notes on how the service deals with the phone systems in
    foreign countries, e.g. Japan.

network/src/doc/POPBalancing
    A detailed technical discussion of the ramifications of POP load
    balancing, written while I was trying to convince myself that the
    system was behaving correctly.

network/src/doc/PSI
    Description of the changes made to the service to support PSI.

network/src/phonedb/README
    Tips and tricks for advanced "phonetool" use.

network/src/clientpopedit/README
    Documentation for the "clientpopedit" tool.

network/src/dpedit/README
    Documentation for the "dpedit" tool.

network/src/tool/phonetool/README (and README_JP)
    Documentation for the "phonetool" tool, which is actually a
    collection of tools. Of particular interest for some people is the
    table of dialing pattern codes that are output by the "dumpnpas"
    sub-command.

network/src/tool/rawphonetool/README (and README_JP)
    Documentation for the "rawphonetool" tool, which is actually a
    collection of tools. This tells you what all the nasty messages
    printed by rgenphonedb mean.

network/src/tool/psiutil/README
    How to use psiutil, if you are ever unfortunate enough to need it.


That's all, folks...

*** WebTV Confidential ***