executes the protocol functions by referring to the
data stored in these local structures (e.g., for
retransmission of packets these structures will
point to the messages to be retransmitted). This
implementation was simplified by assigning a sep-
arate queue for timer handling. A controlling
application program opens the transport driver
and creates a stream to it. All the data to be used
on expiry of the timer are stored in the message
queue of this stream. On timer expiry, the timeout
routine calls the STREAMS qenable() routine to
enable the service() procedure of this queue. The
service() procedure retrieves the relevant
structures from this message queue and executes
the required protocol functions. Reusing the
existing STREAMS queues and scheduling mechanisms
in this way simplified the implementation.
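The timer-queue mechanism described above can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: the real qenable() and service() are kernel STREAMS interfaces, whereas the `TimerQueue` and `Scheduler` classes, the `timeout_routine` function, and the message fields and packet labels below are hypothetical names chosen for illustration. For brevity, the simulated service() procedure acts on every stored message rather than only on the one whose timer expired.

```python
from collections import deque


class TimerQueue:
    """Toy stand-in for the STREAMS queue that holds timer data."""

    def __init__(self):
        self.messages = deque()   # data to be used when a timer expires
        self.enabled = False      # True while scheduled to run service()
        self.retransmitted = []   # records what service() acted on

    def putq(self, msg):
        # Store timer-related data on the queue (simulated putq()).
        self.messages.append(msg)

    def service(self):
        # Drain the queue and act on each stored structure.
        while self.messages:
            msg = self.messages.popleft()
            self.retransmitted.append(msg["packet"])


class Scheduler:
    """Toy stand-in for the STREAMS scheduler's FIFO run list."""

    def __init__(self):
        self.run_list = deque()

    def qenable(self, queue):
        # Schedule the queue's service() procedure once (simulated qenable()).
        if not queue.enabled:
            queue.enabled = True
            self.run_list.append(queue)

    def run_queues(self):
        # Run service() for every enabled queue, in FIFO order.
        while self.run_list:
            queue = self.run_list.popleft()
            queue.enabled = False
            queue.service()


def timeout_routine(scheduler, timer_queue):
    # Called on timer expiry: do no protocol work here, only enable the queue.
    scheduler.qenable(timer_queue)


def demo_timer_queue():
    sched = Scheduler()
    tq = TimerQueue()
    tq.putq({"conn": 1, "packet": "DT-7"})
    tq.putq({"conn": 2, "packet": "DT-3"})
    timeout_routine(sched, tq)
    timeout_routine(sched, tq)  # already enabled, so this call changes nothing
    sched.run_queues()
    return tq.retransmitted
```

The point of the design is that the timeout routine stays trivial: all protocol work is deferred to the service() procedure, which the ordinary scheduler runs.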
The Retix architecture [8] also makes effective
use of STREAMS scheduling mechanisms, through an
independent timer driver built on them. The
difference is that our timer queue resides within
the transport driver, whereas the Retix
architecture uses a single timer driver that
services all protocol layers.
SEQUENT ARCHITECTURE
We now look at the first parallel STREAMS
architecture, from Sequent [9]. This architecture
runs on Sequent's tightly coupled Symmetry
multiprocessor system. For improved scalability on
multiprocessors, multiple tasks should work in
parallel on multiple processors. The Sequent
architecture provides mechanisms for two types of
parallelism: horizontal parallelism, where all
functions within a stream run on one processor
(e.g., in Fig. 1, stream #1 will run on one
processor and stream #2 on another); and vertical
parallelism, where functions within a stream can
themselves run in parallel on different processors
(e.g., the timod and transport driver functions of
a single stream running in parallel). Data
consistency is ensured by fine-grained locking.
An interesting design consideration of the
architecture was improving the STREAMS scheduling
mechanisms to implement parallelism. STREAMS
provides strict FIFO scheduling of service()
procedures on its global queue. This was modified
so that the scheduler can run the procedures on
the global queue on any available processor. The
addition and deletion of procedures on this queue
are synchronized by a global lock. The thread of
execution for the procedures must be provided
intelligently so that the load on the processors
is balanced dynamically. Two alternative
mechanisms were adopted.
It has been observed that symmetric multiprocessor
operating systems inherently balance the total
process load across the available processors.
Thus, if processes from a number of applications
are present, a runnable process will normally be
dispatched to the processor running the
lowest-priority process. This fact was used to
create multiple system daemon processes that
provide the thread of execution for service()
procedures. These processes wait on a common
semaphore and are awakened when a new service()
procedure is enqueued on the global queue. The
service() procedure then executes in the thread of
the daemon process. Since daemon processes are
dynamically balanced by the operating system,
service() procedures are also balanced
automatically, without any extra effort.
Similarly, symmetric multiprocessor hardware
dynamically directs an interrupt to the processor
running the lowest-priority process. This fact was
used to send a scheduling interrupt so that an
instance of the scheduler runs as the handler of
this interrupt. This provides the thread for the
service() procedure. Since interrupt handlers are
dynamically balanced, so are these service()
procedures.
Although [9] does not give any experimental data,
it does mention that a performance improvement was
observed when additional processors were added,
due to this improved parallelism.
The Sequent architecture also improves the message
handling capabilities of STREAMS for both uni- and
multiprocessors. If a message memory allocation
call fails, STREAMS requires a bufcall() routine
to be called, which invokes a user-defined
function when a buffer later becomes available.
But if a stream closes after calling bufcall(),
the user-defined function is unable to find its
data structures and corrupts the kernel. To avoid
this, a new unbufcall() utility was created that
cancels the previous bufcall(). This unbufcall()
must be called as part of any module's close
procedure.
SVR4MP ARCHITECTURE
The Sequent architecture was based on the
assumption that developers perform most protocol
processing in service() procedures rather than in
put() procedures. Views on this point still
differ: the System V Release 4 MultiProcessor
(SVR4MP) parallel architecture [10] is based on
the assumption that put() procedures are invoked
more often than service() procedures. Its
designers analyzed a running TCP/IP stack and
showed a put-to-service ratio of 1.6:1.
Since parallelizing multiple service() procedures
is not required under this assumption, they did
not use the fine-grained vertical parallelism and
complex scheduling mechanisms adopted in the
Sequent architecture. Instead, horizontal
stream-level parallelism was adopted, with each
stream run on a single processor. In effect, only
one operation per stream is allowed at a time. A
single lock per stream was used, which avoids the
large number of locking operations required in
finer-grained approaches. Thus, a single stream
lock is held throughout a system call and the
corresponding put() procedures in the different
modules. Independent streams can run in parallel
on multiple processors, improving performance.
Results showed that TCP/IP performance improved as
additional 33 MHz Intel 486 processors were added
to a Compaq Systempro: the speedups over one CPU
were 1.78 with two processors, 2.35 with three,
2.55 with four, and 2.93 with five.
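To put the reported speedups in perspective, the short calculation below derives two standard figures from them: parallel efficiency (speedup divided by processor count) and the marginal gain from each added processor. Only the four speedup values come from the text; the function names are illustrative, and the choice of these two metrics is an assumption of this sketch rather than an analysis from [10].

```python
# Speedups over one CPU as reported in the text, keyed by processor count.
REPORTED_SPEEDUP = {2: 1.78, 3: 2.35, 4: 2.55, 5: 2.93}


def parallel_efficiency(speedups):
    # Efficiency = speedup / number of processors (1.0 would be ideal scaling).
    return {n: s / n for n, s in speedups.items()}


def marginal_gain(speedups):
    # Extra speedup contributed by each added processor, starting from
    # the single-CPU baseline of 1.0.
    gains = {}
    previous = 1.0
    for n in sorted(speedups):
        gains[n] = speedups[n] - previous
        previous = speedups[n]
    return gains


if __name__ == "__main__":
    efficiency = parallel_efficiency(REPORTED_SPEEDUP)
    gains = marginal_gain(REPORTED_SPEEDUP)
    for n in sorted(REPORTED_SPEEDUP):
        print(f"{n} CPUs: efficiency {efficiency[n]:.2f}, "
              f"gain over previous {gains[n]:.2f}")
```

By this measure, efficiency falls from 0.89 with two processors to about 0.59 with five, so each additional processor contributes less than a full processor's worth of throughput.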
However, the inflexible intermodule communication
of mux drivers prevents the single lock from
working across the streams on the two sides of a
mux. For example, in Fig. 1, after locking stream
#1 and processing data on it, the transport driver
put() cannot invoke the Internet driver put()
procedure, since stream #3 is an independent
stream and has not been locked. Thus, the SVR4MP
model fails.
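The per-stream locking rule and its breakdown at a mux can be sketched with a toy model. This is a user-space illustration, not SVR4MP code: the `Stream` class, the `put` function, the `StreamNotLocked` exception, and the check that the caller holds the lock are all hypothetical constructs, used here only to make the rule "a put() may run only while its own stream's lock is held" explicit.

```python
import threading


class StreamNotLocked(Exception):
    """Raised when put() is invoked on a stream whose lock is not held."""


class Stream:
    """Toy stream protected by one lock, as in stream-level parallelism."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self._owner = None  # thread id of the current lock holder, if any

    def __enter__(self):
        self._lock.acquire()
        self._owner = threading.get_ident()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._owner = None
        self._lock.release()
        return False  # never swallow exceptions

    def held_by_caller(self):
        return self._owner == threading.get_ident()


def put(stream, module, trace):
    # A module's put() may run only under its own stream's lock.
    if not stream.held_by_caller():
        raise StreamNotLocked(f"{module} put() called without {stream.name} lock")
    trace.append((stream.name, module))


def demo_mux_failure():
    stream1, stream3 = Stream("stream #1"), Stream("stream #3")
    trace = []
    try:
        with stream1:  # one lock held for the whole downstream pass
            put(stream1, "timod", trace)
            put(stream1, "transport driver", trace)
            # Crossing the mux: stream #3 is independent and not locked.
            put(stream3, "Internet driver", trace)
    except StreamNotLocked as err:
        return trace, str(err)
    return trace, None
```

In this model the two put() calls on stream #1 succeed under the single lock, and the call into stream #3 is rejected, mirroring the situation described for Fig. 1.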
IEEE Communications Magazine • February 2000