[LINK] An interesting Peek into the Past.
Tom Koltai
tomk at unwired.com.au
Wed Oct 14 11:33:12 AEDT 2009
Received: (from root at localhost) by ourworld.net; Thu, 12 Jun 1997
22:02:57 +1000 (EST)
Received: from 10.5.0.253 by thistle
Received: from ourworld.net by ogn-gateway.ourworld.net; Thursday, 12
Jun 1997 23:01:44 +1000
Received: from 205.184.2.45 by ogn-gateway.ourworld.net
X-Claimed-Received: ogn-gateway.ourworld.net
X-Claimed-Received: tpk3.ix.netcom.com
Message-ID: <34D46341.7916 at ourworld.net>
Date: Sun, 01 Feb 1998 22:57:53 +1100 (Ignore Date - Laptop
date was wrong)
From: "Thomas P. Koltai" <tpk at ourworld.net>
Reply-To: tpk at ourworld.net
Organization: OGN
X-Mailer: Mozilla 3.01 (Win95; I)
MIME-Version: 1.0
To: rfrawley at cisco.com
CC: tpk at ourworld.net
Subject: This is it!
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=iso-8859-1
X-Mozilla-Status: 0001
Content-Length: 27479
Dear Richard,
Thanks for your time today. In fact the Formus people visited our Sydney
office today. That was interesting.
On with the show.
Historical
OGN was incorporated on 1 November 1996. Original seed funding was
completed in the last week of January 1997 and the company started
trading on the first of February.
OGN is a company funded at this time to 2.6 million dollars. The parent
company (read: major shareholder) is Barton Capital Holdings, listed on
the Sydney Stock Exchange as BCH. BCH is currently capitalised at 8
million dollars. (The point 6 is mine.) Parties that have expressed an
interest in taking a slice of us are Saturn Global Networks, UUNet, FAI
(Adler already owns a slice of BCH) and Saturn in NZ. Our business model
is predicated on our mission statement: "To become the World's first
Global IP Carrier".
It covers several items not discussed by us previously, but I am sure
that the relevance will become self-evident as you wade through the
following.
ABSTRACT
Bandwidth is a scarce resource; its expense makes better use of existing
bandwidth a necessity. This paper details multiple techniques developed
by Ourworld Global Networks to maximise the utilisation of its
international and long-haul bandwidth segments. We extend the caching
concept into split caches, which can perform bandwidth conservation for
inter-cache communication. We note that this technique can be applied to
any proxyable protocol.
INTRODUCTION
Network Service Providers (NSPs) provide Internet access for clients.
They, ideally, have unlimited network bandwidth that their clients may
use, and can guarantee Quality of Service (QoS) for each and every
connection. In reality, NSPs have limitations on the amount of bandwidth
they can supply. They oversell their bandwidth because not all clients
are using full bandwidth at all times. During times of heavy load their
network can become congested, leading to poor performance. Lack of
redundancy further complicates the issue by making it appear that the
NSP is lacking, even if the fault is an upstream technical difficulty.
The meshing of layer two and layer three networking technologies has yet
to mature to the point where IP can emulate layer one reliability.
The data that clients access can be split into two broad categories:
static and dynamic. Static content is composed of once-generated, but
often fetched, data. The majority of the World Wide Web (WWW) is today
still mainly composed of static data; the pages are changed on an
infrequent basis. A page fetched today is identical to one fetched
yesterday, or even last month.
Dynamic content is content that never repeats. An example of this would
be a telnet session. The keystrokes that are sent and the responses that
are received are unique to the one session. All other telnet sessions
are, hopefully, different.
These categories are not as clear-cut as might be suggested a priori. An
audio session could be a static stream, being fed from one source to
multiple destinations, or it could be dynamic, such as two people
holding a private conversation. Some sessions are amenable to bandwidth
maximisation techniques, whilst others are not.
Otherwise idle time on the network, time when bandwidth demand falls
below the available supply, can be used to move traffic that is either
low priority or that may be required later. This time shifting of
bandwidth use also gives users better bandwidth during peak times.
Current Bandwidth Maximisation Techniques
Currently the Internet practices little or no bandwidth maximisation.
The two main protocols that have bandwidth performance enhancing
programs designed for them are the World Wide Web, with the various
caching projects, and Usenet News servers. Each of these systems has the
ability to maximise bandwidth by reducing the amount of unnecessary
traffic that moves across the network.
World Wide Web Caching
WWW caching is currently practiced on an ad-hoc basis by various network
users. Browsers such as Netscape's Navigator and Microsoft's Explorer
can cache pages for an individual user on the local machine, enhancing
load time by reducing the number of times the same page is fetched. They
can also communicate with a local proxy, which can perform the same
function amongst all local users. Once one local user has fetched a page
another local user who requests the same page will have that request
satisfied from the local copy instead of refetching the page across the
network.
Network bandwidth is maximised by locally producing the page instead of
using the network to fetch the page from the ultimate source. If the
page is missing the cache will ask the source for the page. The Squid
cache from NLANR enhances normal caching by having a connected series of
caches. If a page is not in the local cache a set of neighbour caches
are asked. If a neighbour cache has the page, and the neighbour is
closer than the source, the page is fetched from the neighbour instead
of the source. Closer is measured in network response time, so a
geographically close neighbour connected by a congested link would be
considered more distant than a geographically distant neighbour
connected by a clear link.
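As a rough illustration of the neighbour selection idea (the names and
structure below are illustrative only; Squid itself negotiates hits with
its inter-cache protocol rather than anything like this code):
    # Sketch: pick where to fetch a missing page, preferring the fastest
    # responder. "Closer" means response time, not geography.
    from dataclasses import dataclass

    @dataclass
    class Neighbour:
        name: str
        rtt_ms: float      # measured network response time to this cache
        has_page: bool     # did the neighbour report a cache hit?

    def pick_fetch_source(origin_rtt_ms, neighbours):
        """Return a neighbour to fetch from, or None to go to the origin."""
        hits = [n for n in neighbours if n.has_page]
        if not hits:
            return None                        # nobody has the page: ask the source
        best = min(hits, key=lambda n: n.rtt_ms)
        return best if best.rtt_ms < origin_rtt_ms else None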
News
Network News is currently a poor network citizen. It moves between
servers on a push model, where the sending server determines what is
sent, and the recipient discards unwanted items. The links between
servers are arranged purely on an administratively convenient basis,
with feeds being obtained from friendly administrators, irrespective of
their topological location in the network.
The current news situation reflects the way in which network news has
grown over the life of the Internet. The maze of criss-crossing news
connections is certainly sub-optimal, and wastes bandwidth. A more
efficient tree structure would maximise the available bandwidth by
reducing the duplication on the network.
FTP Mirrors
Public file archives are a useful resource on the Internet. Large
collections of commonly requested programs and documents are made
available for public, or anonymous, fetching. Traditionally this occurs
using the FTP protocol, although more and more of these collections are
moving to HTTP servers. FTP mirrors are duplicates of these collections
placed in geographically dispersed locations, allowing quicker local
downloads rather than use of the master server. The mirrors are usually
updated once a day, and are mentioned at the master site. It is
suggested that users use a local mirror when fetching files, but this is
a user function, not a server function.
The FTP mirrors allow quicker downloads by using local, less congested
network links, rather than more central, and therefore congested, links.
They are a convenience, and are only effective after user education.
Their use is not enforced, and discovering where mirrors are located
often requires a reference to the master site, which encourages users to
fetch from the master site once connected.
Cache Arbitration
Cache arbitration is the collective name given to some of the techniques
developed to maximise bandwidth by reducing bandwidth utilisation. These
techniques all reduce bandwidth use by one of:
. not re-sending data already sent,
. sending only the minimum amount of data, or
. sending the data along a particular network link only once.
The whole system uses a series of geographically dispersed machines,
communicating with each other along internal links. Each machine acts as
part of a distributed cache. A cache would hold pages for those machines
closest to it, in a topological sense. When a page is referenced by a
consumer the local cache is queried. That cache will either have the
page, or failing that will ask the cache closest to the source for the
page. The cache closest to the source will either have the page cached,
or will fetch the page and return it to the consumer's cache, which
would return the page to the consumer. It is important to note that the
cache is split amongst the dispersed machines, and the intermediate
links are controlled. It is over these internal links that total control
is exercised, and where bandwidth maximisation occurs.
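A minimal sketch of that split-cache lookup (illustrative names only,
not our production code):
    # Consumer-side and source-side caches co-operating over the internal link.
    class CacheNode:
        def __init__(self, fetch_from_origin):
            self.store = {}                    # url -> page body
            self.fetch_from_origin = fetch_from_origin

        def get_local(self, url, source_side_cache):
            """Consumer side: serve locally, else ask the cache nearest the source."""
            if url not in self.store:
                self.store[url] = source_side_cache.get_remote(url)
            return self.store[url]

        def get_remote(self, url):
            """Source side: serve from cache, else fetch from the origin server."""
            if url not in self.store:
                self.store[url] = self.fetch_from_origin(url)
            return self.store[url]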
The techniques that this paper discusses are Route Discovery, where
intelligent choices are made concerning the movement of data across the
network; Content Compression, where individual documents are compressed
using a dual compression technique; and Dynamic Content Compression,
where dynamic documents are compressed using a split cache technique.
Route Discovery
Route discovery is used to discover the closest, topologically speaking,
cache to a destination site. The route discovery project is slowly
mapping the Internet. Currently the project uses none of the available
information on routing, such as the Autonomous System (AS) numbers of
the Border Gateway Protocol (BGP4). It is used to develop a view of the
Internet from a particular location.
At each cache location a mapping process is continually building and
updating the view of the network. This gives a view of the network from
that cache location. Each cache's view is combined in a central
arbitrator, which can determine the best cache to use for any particular
destination.
The mapping process takes as input a list of addresses to map, inserts
the routes to these addresses into a local database, and also updates
the global database. Initially the database is empty, with routes being
added both on a demand basis and on a pump-prime basis by the central
arbiter. The priming occurs by fetching the top and second level domain
names from the Domain Name Service and using the 'A' resource records to
identify entries as routes to determine.
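A sketch of the pump-priming step; the domain names are placeholders and
the standard resolver stands in for our own DNS walk:
    # Resolve each seed name's 'A' record and queue the address for mapping.
    import socket

    seed_names = ["www.example.com", "ftp.example.org"]   # placeholder names
    to_map = []
    for name in seed_names:
        try:
            to_map.append(socket.gethostbyname(name))     # the 'A' resource record
        except socket.gaierror:
            pass                                          # unresolvable: skip for now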
Each address is mapped by calculating the hops that are passed through
to arrive at the address. The database gradually builds into a list of
the links that are used by packets from this node to other network
nodes. It is important to note that only the radial links are found by
this technique. Transverse links are not found, as no packets from the
search pass across these links. These links are either found by other
mapping searches, or can be discarded as irrelevant to the network
performance of the caches, as no packets will use them.
The Internet is dynamic, with links appearing and disappearing, so the
mapping process continually re-runs to discover any changes to the map.
Any changes are noted on the map and communicated to the central cache
arbiter for update of the global map.
One methodology of "diff"-style routing differentiation is analysis of
BGP updates to identify ASs that have changed, for revisitation and
remapping.
Database Overview
The database holds an entry for each node found in the network. Each
node is represented by a structure that holds the node information (the
time to the node and flags), the parent node (the machine one hop
closer), any sibling nodes (also one hop further from the parent node),
and child nodes (nodes one hop further from this node).
This data structure allows us to quickly discover the distance and path
to each visible node. It also allows visual representations of the
Internet to be obtained for capacity planning purposes. Links that are
used for a large number of growing nodes can be easily seen and are
candidate links for upgrading.
The network database is stored in a flat index file. Since the minimum
routable network address is a /24 (or 24 bits of address) we get a
maximum of 2^24 * sizeof (database structure) bytes for the database. If
the database structure held for each entry is 16 bytes the size of the
database is 256 Megabytes - easily storable on a single disk drive. The
full database, with an entry for each host, is 64 Gigabytes in size,
easily achievable with today's high capacity disk storage arrays. Since
the full network is only held in one central location the use of this
much disk is justified for better routing.
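The arithmetic above assumes a fixed 16-byte record per node; one
possible layout, purely for illustration:
    # A fixed-size node record and the database sizes quoted above.
    import struct

    NODE_FORMAT = "<IIII"   # parent index, time to node, flags, spare (assumed layout)
    NODE_SIZE = struct.calcsize(NODE_FORMAT)                  # 16 bytes

    per_slash24 = (2 ** 24) * NODE_SIZE                       # one entry per /24 network
    per_host = (2 ** 32) * NODE_SIZE                          # one entry per host
    print(per_slash24 // 2 ** 20, "MB at /24 granularity")    # 256 MB
    print(per_host // 2 ** 30, "GB at per-host granularity")  # 64 GB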
Classless Inter-Domain Routing
Classless Inter-Domain Routing is a method of routing that ignores the
traditional Internet class structure. Instead of the top bits of an
address determining the division between network and host portions of
the address, the routing table holds the division. This requires more
complex routing tables, but reduces the amount of routing information
held at central sites. The current classless routing table has about
40,000 routes, instead of the many millions that would be required for
classful routing. This reduction is due to the compaction of many
smaller networks into single supernetworks.
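For illustration, a classless lookup reduces to a longest-prefix match
over the table; the routes below are made up:
    # The prefix lengths in the table, not the address class, decide the split.
    import ipaddress

    table = {
        ipaddress.ip_network("203.0.113.0/24"): "cache-syd",
        ipaddress.ip_network("203.0.0.0/16"): "cache-mel",
        ipaddress.ip_network("0.0.0.0/0"): "default",
    }

    def lookup(addr):
        ip = ipaddress.ip_address(addr)
        matches = [net for net in table if ip in net]
        return table[max(matches, key=lambda net: net.prefixlen)]

    print(lookup("203.0.113.9"))   # the /24 wins over the /16: cache-syd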
Program Performance
The file is indexed by the network address. This means that there is
significant duplication for large CIDR blocks of addresses, but with
subnetting this duplication is reduced, as a single block may contain
many different, and dispersed, networks which have different caches as
the closest cache.
The program used to populate the database is a custom traceroute. It
uses the traceroute method of sending UDP packets with varying Time To
Live (TTL) values, and seeing which host returns the dead packets.
Traditional traceroute uses an initial TTL of 1 and increments the
TTL for each hop examined. This means that close links are mapped many
times, with a corresponding waste of bandwidth. This is useful for
immediate analysis of the network, but wasteful for long term analysis.
Instead of working from close nodes outwards the mapping program works
from the destination node inwards, towards itself. It does this by first
calculating a first guess distance to the target node, and using this as
the TTL. When the correct TTL for the destination node is calculated the
TTL is then decremented for each packet, tracing the parent nodes. This
process continues until a node that is currently in the database is
found. At this point the process stops, as the assumption is that the
network only changes slowly.
To obtain the first TTL the mapping program initially sends a packet
with a maximum TTL. The response determines if the node is currently
reachable or not. Unreachable nodes are delayed until they are reachable
to avoid waiting for timeouts unnecessarily. The TTL from the initial
returned packet is used as the initial guess for the distance to the
node. If the path is symmetrical the number of hops on the inward and
outward legs are identical, so the returned TTL is the best guess for
the TTL to the node. If the inward and outward paths differ, and the
outward path is longer than the inward path then the initial TTL is too
short, and the maximum TTL (30) is used instead. If the inward path is
longer than the outward path then the program will send out unnecessary
packets until the real distance is reached whereupon mapping will occur.
The mapping program sends out packets only when necessary. Any returned
packet is used as the distance, so for a node that responds on the first
attempt the round trip time is used as the distance indicator.
Since the first packet sent to the target node has a large TTL, and
actually makes it to the target node, subsequent packets do not incur
routing overheads. The initial packet is delayed by the intermediate
routers as they calculate the next hop. Subsequent mapping packets use
the pre-calculated path that the routers cache. This technique avoids
the delays of traceroute, which incurs the delay for each router probed.
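A sketch of the inward mapping loop; probe() is an assumed helper that
performs one traceroute-style step, sending a UDP packet with the given
TTL and reporting which node answered plus a hop-count estimate:
    # Map from the target back towards ourselves, stopping at a known node.
    MAX_TTL = 30

    def map_inwards(target, probe, database):
        reached, guess = probe(target, MAX_TTL)   # first packet: large TTL, first guess
        if not reached:
            return                                # unreachable now, retry later
        ttl = min(guess, MAX_TTL)
        child = target
        while ttl > 1:
            ttl -= 1                              # step one hop closer to us
            parent, _ = probe(target, ttl)        # router that returned the expired packet
            database[child] = parent              # record the link child -> parent
            if parent in database:
                return                            # joined the already-known tree
            child = parent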
Switching
Current gigaswitch technology is restricted to 65,000 MAC addresses.
Use of the above database in conjunction with a rapid switching device,
wherein we have high speed access to the backplane, allows rapid
re-addressing of regionally destined information. At this time, due to
the non-availability of BPXs, it is proposed to conduct the required
testing and development utilising the NETWiz (Israeli) developed high
speed Ethernet switch.
Content Compression
Internet content is stored and transmitted in an uncompressed form.
Compression systems build common information structures at both the
compression and decompression ends of the link. Data that is already in
the information structure is referenced, rather than being transmitted.
The more data held in the information structure, and the closer this
data is to the following data in the data stream, the better the
compression that can be achieved. Since HTML documents have a common
structure, with the keywords predefined, the keywords can be preloaded
into the information structure, reducing the need to transmit these
keywords across the network.
The compression technique used is a two stage compression pipeline. The
first stage removes all HTML and HTTP keywords and replaces them with
tags. The second stage is a generic compressor that searches for
repeated strings of information, and replaces these with tokens. The
decompressor is likewise a two stage pipeline. The first stage reverts
the tokens to strings, and the second expands the tags to the respective
keywords. The compression is designed to be 8 bit clean, allowing 8 bit
pages to be compressed and transmitted.
Compression can only occur between systems that agree on the compression
techniques to use. Since the compression occurs between caches in the
split cache system we program the caches with the compression algorithm,
and obtain maximum compression across the links.
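A sketch of the two-stage pipeline under stated assumptions: the keyword
table and tag scheme below are illustrative stand-ins for the preloaded
dictionary, and zlib plays the generic second stage:
    # Stage one swaps known HTML keywords for short tags both caches preload;
    # stage two is a generic repeated-string compressor.
    import zlib

    KEYWORDS = [b"<html>", b"</html>", b"<head>", b"</head>", b"<body>",
                b"</body>", b"<title>", b"</title>", b"href=", b"Content-Type:"]
    ESC = b"\x01"                                  # tag marker byte

    def compress(page: bytes) -> bytes:
        page = page.replace(ESC, ESC + b"\xff")    # escape so 8-bit pages survive
        for i, kw in enumerate(KEYWORDS):          # stage 1: keyword -> two-byte tag
            page = page.replace(kw, ESC + bytes([i]))
        return zlib.compress(page)                 # stage 2: generic compression

    def decompress(blob: bytes) -> bytes:
        page = zlib.decompress(blob)               # undo the generic stage
        for i, kw in enumerate(KEYWORDS):          # expand tags back to keywords
            page = page.replace(ESC + bytes([i]), kw)
        return page.replace(ESC + b"\xff", ESC)

    sample = b"<html><head><title>Demo</title></head><body>hi</body></html>"
    assert decompress(compress(sample)) == sample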
HTTP 1.1 Enhancements
HTTP 1.1 allows for compression of the content, but uses a generic
compression technique. Dedicated compression can achieve better results
by having more common information at both ends. Dedicated compression is
also faster to perform than generic compression, where the same level of
compression is expected, as the input structure is known.
Dynamic Content Compression
Dynamic content on the Internet is becoming more and more widespread.
Customised pages, video and audio streaming, search engines, all produce
pages that cannot be cached using traditional caching techniques. Some
of these pages are cacheable using a technique called differential
content compression. When a page is requested between caches one of four
situations occurs:
1. Neither cache has the page,
2. The remote cache has the page,
3. The local cache has the page, or
4. Both caches have the page.
In the first case the page is fetched as normal, and both caches keep a
copy of the page. In the second case the remote cache checks the
expiry of the page, and if necessary updates and sends a copy of the
page to the local cache, and both caches end with a copy of the page. In
the third case the local cache satisfies the request. In the fourth case
the remote cache will fetch the page from the source and perform a
difference analysis on the page and the cached copy. If the pages are
identical the request is treated as a cache hit. If the page is
different, and the difference information is smaller than the page
information, the difference information is passed between caches. Both
the local and remote caches update their pages to the new version for
further difference analysis.
This technique allows pages that change in detail, but have the same
overall structure and similar content, to be passed between caches with
drastically reduced bandwidth. Pages with minor differences are now
common, as web servers try to produce customised pages from templates
for individual browsers.
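A sketch of the fourth case under assumptions: difflib stands in for
whatever delta encoding the caches would actually agree on:
    # Both caches hold the old page, so only the edit operations travel the link.
    import difflib, pickle

    def make_delta(old: str, new: str) -> bytes:
        ops = difflib.SequenceMatcher(a=old, b=new).get_opcodes()
        # Keep only text the receiver cannot rebuild from its own old copy.
        delta = [(tag, i1, i2, new[j1:j2] if tag in ("replace", "insert") else "")
                 for tag, i1, i2, j1, j2 in ops]
        return pickle.dumps(delta)

    def apply_delta(old: str, blob: bytes) -> str:
        parts = [old[i1:i2] if tag == "equal" else text
                 for tag, i1, i2, text in pickle.loads(blob)]
        return "".join(parts)

    old = "<h1>Weather</h1><p>Forecast: sunny, 24 degrees</p>"
    new = "<h1>Weather</h1><p>Forecast: rainy, 18 degrees</p>"
    delta = make_delta(old, new)
    assert apply_delta(old, delta) == new
    # The delta is sent only when it is smaller than the full new page.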
Audio and Video Reflectors
Audio and video broadcasts are the bane of NSPs. They consume vast
bandwidth, but each stream services only one customer. These systems use
a point to point link for each subscriber. Two subscribers who share a
common path to the service provider have identical data streams moving
on the common path. Audio and video reflectors attempt to remove this
common pathway by using one channel between the producer and the NSP's
internal network. The reflector will break the stream into multiple
identical streams, one for each consumer. This system is a form of
multicast that does not use the IP multicast address range, but instead
uses normal IP links, usually using UDP as the broadcast medium.
Having a reflector, or a chain of reflectors, provides bandwidth savings
whilst still allowing users to use the services they desire. Since the
data streams are compressed using dedicated compression techniques no
additional savings can be achieved by trying to recompress the data
stream, so only savings through removal of duplicate packets are
achieved. When no consumers are utilising the service the reflector
shuts off the stream from the upstream node, ensuring packets are not
sent unnecessarily.
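A sketch of a reflector's inner loop (addresses, ports and the
subscriber list are placeholders):
    # One inbound UDP stream fanned out to every current consumer.
    import socket

    UPSTREAM_PORT = 5004                                      # producer stream arrives here
    consumers = {("192.0.2.10", 6000), ("192.0.2.11", 6000)}  # placeholder subscribers

    def reflect():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("0.0.0.0", UPSTREAM_PORT))
        while consumers:                      # no consumers left: stop pulling the stream
            packet, _ = sock.recvfrom(65536)
            for addr in consumers:            # duplicate only past the common path
                sock.sendto(packet, addr)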
Kernel Enhancements
All the services described above utilise user level processes on
standard operating systems to perform their functions. This requires a
large number of OS kernel to user level transitions, each taking many
milliseconds to perform. Instead it is possible to perform many of the
common functions within the kernel, or device driver, of a dedicated
machine, reducing the number of transitions, and unnecessary data
movements. For the enhanced cache the read operation is performed inside
the kernel, with failure moving to user level where the normal
intercache fetch occurs.
TCP Session Hijacking
This is performed using TCP session hijacking within the kernel. A TCP
session for the cache is captured by the kernel cache software, after
allowing the first two packets through. The first two packets allow the
normal kernel TCP driver to establish the TCP session parameters. The
session then hijacks the captured additional packets on the session,
sending acknowledgements until the HTTP header has been captured. If the
request can be satisfied from the cache it is satisfied there. Otherwise
the header is passed to the kernel TCP driver, which processes it as the
continuation of the normal session.
This session hijacking has the advantage that read requests are
satisfied immediately from the cache, without user level traversals.
Only cache updates are required to proceed to user level.
Conclusion
Network bandwidth utilisation can be maximised using the described
techniques. We have caching to prevent unnecessary network traffic. When
traffic is required we compress static content, and differentially
compress dynamic content. For streaming content we use reflectors to
reduce the number of streams traversing the network. The network mapping
allows us to choose the best routing independent of the traditional
routing methodologies. OGN's initiatives with the AUIX and other
exchange points globally allow OGN both to utilise data collected at
those points and to test the developed switches on a horde of eager
ISPs.
STATUS of the above.
What code has been developed:
The algorithms that have been developed are:
1. IP address space walk
2. HTML dedicated compression
3. Multi point caching
4. Dynamic caching
Code has been developed for (1 & 2) above.
Limited progress has been made on the code for items 3 & 4.
No technical progress has been made with the GRIS, but work on the
interface is almost complete (Sun and Netwiz together) and we expect to
receive our first switch for testing and commencement by July 15th.
What performance/savings have been achieved to date or simulated to be
achieved?
We are currently undergoing peer discussion about the algorithms; this
allows us to tune the algorithms prior to test bedding, with additional
tuning to follow.
How are we protecting this IP?
We have considered 1) trade secret provisions, i.e. telling nobody, and
decided to settle on 2) a patent application - a software patent.
Real time Route Arbitration? Where is it?
The real time route arbitration is a follow-through of the cache
selection strategy. The actual implementation of this would be performed
by Andrew Khoo, Leigh Hart and Justin Newton.
Who are the competition?
The only other groups that are pursuing this work are:
ISI/Merit for route arbitration (RA project)
W3C for cache compression (only using a generic compressor)
Nobody for distributed/regional caching
Technology changes, standards, changing standards, marketability and
value of the IP
The IP is a fundamental move from a centric view to a distributed view
of the Internet. This work can be made a standard easily; however, once
it is a standard we would lose any competitive edge. If we release this
work (when completed) we would gain two things:
1) The edge, as we would release it as a standard, and others would have
to work to our standard.
2) Publicity: we would be seen as the producer of the standard, and as
such gain respect for future work. We would also become well known in
the industry, which would help on industry boards and other such places
for future plans.
Marketability: Other international carriers, dealing in IP traffic,
would like these products. It is highly probable that other, non
US-centric, carriers would be producing similar products.
Please bear in mind that the above is an introduction only. I think we
can make Cisco quite horny by describing the optional packet rerouting
(at layer 2) in case of individual network failure. Some of our
redundancy planning is obvious: network exchange points with both
terrestrial and satellite connectivity, e.g. PAIX and OGN SKG will be
connected via satellite.
OGN is currently entering its second round funding phase and, depending
on the funding outcome, we tentatively expect all of the above R&D to be
completed in 1997.
To date we have expended over three hundred thousand dollars on
developing the above products which represents over ten percent of our
total current capitalisation.
If you, Richard, asked me what OGN would like from Cisco, the list is
very short:
1. Equipment delivered as ordered.
2. Promises made - kept.
3. Access to a couple of the engineers that worked with Tony Li on the
Traffic Cop project.
4. Access at a programming level to a BPX.
5. An in-depth, open rundown of Tag Switching complications.
6. Access to an updated copy of the Cisco IOS source, preferably one
that will compile on current routers.
7. A couple of largish switches and routers for non-commercial research.
8. An agreement that the developed work will be of commercial benefit to
OGN for a predetermined time, to allow OGN to recover development costs.
9. An agreement to slip some non-routing code (e-commerce based) into
the IOS to sniff for ecomm bits in the headers.
That's about it. It's bloody late so I'm off to bed. I'm flying back to
Oz tonight, so will speak to you early next week.
ciao,
Tom
The reply from Cisco was equally enthralling but is not mine to share.
Wasn't that interesting, Boys and Girls?