Jose Duato

Contact

Position:: Full Professor (Coordinator)

Address:: Valencia
Phone:: +34963877007x79705

Image & Curriculum Vitae

Short Curriculum Vitae

The most significant achievements of José Duato, including transfer of his research results to industry, publication records, service to profession and international collaborations, are summarized below:

On his own, he developed several adaptive routing strategies to improve interconnection network performance. These strategies are so efficient and cost-effective that some of them have been implemented in the most powerful supercomputers, including the Cray T3E and Cray Black Widow supercomputers, the Compaq Alpha 21364 microprocessor used in the Alphaserver GS320 supercomputer, and the IBM BlueGene/L supercomputer. These supercomputers were among the most powerful ones when they were launched. In particular, the IBM BlueGene/L is the most powerful supercomputer today.
He developed, in collaboration with Xyratex (a UK company), a technique called Regional Explicit Congestion Notification (RECN), which is the only truly scalable congestion management technique for lossless networks to date. This result has been protected with two joint patents and RECN is currently being incorporated into the most important standard for future communication systems: Advanced Switching Interconnect.
He collaborated with the IBM Zurich Research Laboratory (the only IBM research laboratory in Europe) from 2001 until its Communications Department was closed. The main outcomes of this collaboration are four joint patents with IBM (Xmorph, RXS, BFC, and AFC). Also, he and José Flich developed the In-Transit Buffer (ITB) routing technique. Myricom (a USA company) has included support for ITBs in its popular Myrinet network by means of a special ITB packet type.
He is the first author of a 500-page book published in the USA, which has become the most popular book on interconnection networks. Also, he has authored or co-authored more than 340 publications, including book chapters and papers in journals and conference proceedings. He advised or co-advised 22 PhD students.
He served as Associate Editor of IEEE Transactions on Computers, the oldest and most prestigious journal in the area of computers (55 years old). He also served as Associate Editor of IEEE Transactions on Parallel and Distributed Systems, the second most prestigious journal in the area of parallel computers, being the first European researcher to serve in this capacity. Also, he is the only Spanish researcher who has served as associate editor of both of these journals.
He was the General Co-Chair for the 2001 International Conference on Parallel Processing and played a vital role in bringing this prestigious conference to Valencia, being the first time that it was held in Europe. This is the oldest conference in the area of parallel computers (at that time it was in its 30th year). He served as the Program Chair (Chair of the Scientific Program Committee) of the 2004 International Symposium on High Performance Computer Architecture. This is one of the two most relevant international conferences in the broad area of computer architecture. Also, he served as Program Co-Chair of the 2005 International Conference on Parallel Processing, and as Steering Committee Member, Program Co-Chair, Program Vice-Chair or Program Committee Member in more than 55 international conferences and workshops, including the most prestigious ones in the area of parallel computers: ISCA, HPCA, ICS, ICPP, IPPS/SPDP, IPDPS, HiPC, and Euro-Par.
He collaborated with several researchers from foreign countries, including some of the most prestigious professors in the area of interconnection networks (Lionel M. Ni, Sudhakar Yalamanchili, Chita Das, Timothy M. Pinkston, Dhabaleswar K. Panda, and Anand Sivasubramaniam, all of them currently serving or having served as associate editors of one or both of the two most prestigious journals in the area of computers). In particular, he co-authored 51 papers with researchers from six USA universities, a USA national laboratory, and two USA companies, and 22 additional papers with researchers from five European and Asian universities and a European company. He also filed a joint patent with a Norwegian researcher and a U.S. researcher.
He was invited to present keynote speeches in several international conferences (eight) as well as invited talks in several universities and national laboratories in the USA (nine), Europe (two) and Asia (one), including University of Illinois at Urbana-Champaign, University of Southern California, Georgia Institute of Technology, Ohio State University, Michigan State University, Pennsylvania State University, and Los Alamos National Laboratory. He was also invited to present talks at the research laboratories of some leading computer companies (IBM, Compaq, Sun Microsystems, Intel). Also, he was invited to participate in panel sessions in several international conferences and workshops (eight).
He was invited to send recommendation letters to support the promotion of several Professors in the USA to Associate Professor and Full Professor, as well as award nominations. He also served on the PhD dissertation committee of several doctoral candidates at various universities in the USA, Canada and Europe.

An important aspect of the research developed by José Duato is that he followed a disruptive approach in several cases. Under this approach, existing solutions are discarded and a completely new and superior solution is proposed, analyzed, and evaluated, thus rendering previously existing solutions obsolete. Some examples of this approach follow:

Adaptive routing techniques that allow cyclic dependencies between network resources. This is a counterintuitive approach because it appears at first glance that deadlocks may form when allowing cyclic dependencies between network links or buffers. Only a complex mathematical proof can show that deadlock freedom can be guaranteed if certain conditions are met. This research was so disruptive when it was developed that it was rejected by several peers and considered to be incorrect, even by the most prominent researchers at that time. However, it was finally accepted and several well-known researchers developed their own version of this theory. The benefit of this disruptive approach is a dramatic reduction in the number of resources required to implement fully adaptive routing. This drastic reduction in the number of resources has led all the supercomputer designers who wanted to implement adaptive routing in their interconnect to select Duato's technique as the most suitable and efficient one.
Dynamic network reconfiguration techniques for lossless networks. It was thought that dynamic network reconfiguration was not possible in a lossless network like the ones used in most high-performance clusters. The reason is that routing tables for the different routers or switches cannot be synchronously updated, and therefore, routing tables for the old and new network configurations may coexist, usually leading to deadlocks. As a consequence, all the commercial products (e.g. Myrinet) are based on static reconfiguration, thus leading to very significant performance losses every time there is a change in the topology. Again, we developed some disruptive research showing that, although routing tables cannot be asynchronously updated without introducing deadlocks, it is possible to asynchronously perform several rounds of partial routing table reconfigurations in such a way that deadlocks are avoided. Again, a complex theory was required to prove deadlock freedom. This research opened the door to new and much more powerful network reconfiguration strategies.
Network congestion management techniques that eliminate the negative effects of congestion instead of eliminating congestion. In order to use networks in a cost-effective manner, the working point of the network should be close to saturation, but performance degrades dramatically when entering saturation due to congestion trees growing very quickly, thus introducing head-of-line (HOL) blocking among packets. Congestion trees were identified more than two decades ago. Many solutions have been proposed but none of them is scalable with respect to either link bandwidth or network size. Again, we proposed a disruptive approach here. Instead of eliminating congestion, we just eliminated the negative effects produced by congestion. Our approach is to eliminate HOL blocking, thus allowing blocked packets leading to non-congested destinations to proceed. This is achieved in a scalable manner by implementing a small set (e.g. two to four) of additional queues at every switch input port, which are dynamically allocated to congested packet flows whenever congestion is detected. By doing so, blocked packets are moved to set-aside queues (SAQs) and packets leading to non-congested destinations are able to proceed, effectively eliminating HOL blocking with a small number of resources. Moreover, as our congestion management scheme reacts immediately and locally, it does not introduce the performance degradation and instability problems that typically arise in traditional congestion management techniques.
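
The set-aside queue idea described in the last item can be illustrated with a small toy model. The following Python sketch is only an illustration under simplifying assumptions, not the RECN design itself: the class name InputPort, its methods, and the choice of keying each SAQ by destination are hypothetical, and congestion detection and SAQ deallocation are left out.

```python
from collections import deque


class InputPort:
    """Toy switch input port: one normal queue plus a few set-aside queues (SAQs).

    Assumption for illustration: a congested flow is identified by its
    destination, and packets are (destination, payload) tuples.
    """

    def __init__(self, num_saqs=2):
        self.normal = deque()     # packets not known to belong to a congested flow
        self.saqs = {}            # congested destination -> its set-aside queue
        self.num_saqs = num_saqs  # small, fixed number of SAQs per port

    def mark_congested(self, dest):
        """Allocate a SAQ for dest and move its queued packets there.

        Returns True if dest has a SAQ afterwards, False if none was free.
        """
        if dest in self.saqs:
            return True
        if len(self.saqs) >= self.num_saqs:
            return False
        # Packet order within the flow is preserved by both comprehensions.
        self.saqs[dest] = deque(p for p in self.normal if p[0] == dest)
        self.normal = deque(p for p in self.normal if p[0] != dest)
        return True

    def enqueue(self, packet):
        """Store an arriving packet in its flow's SAQ, or in the normal queue."""
        dest = packet[0]
        if dest in self.saqs:
            self.saqs[dest].append(packet)
        else:
            self.normal.append(packet)

    def forward(self, can_accept):
        """Forward at most one packet; can_accept(dest) says if dest is unblocked."""
        if self.normal and can_accept(self.normal[0][0]):
            return self.normal.popleft()
        for dest, queue in self.saqs.items():
            if queue and can_accept(dest):
                return queue.popleft()
        return None


# Usage: "hot" is a congested destination, "cold" is not.
port = InputPort(num_saqs=2)
port.enqueue(("hot", 1))
port.enqueue(("cold", 2))
blocked = lambda dest: dest != "hot"

print(port.forward(blocked))   # None: the "hot" packet at the head blocks "cold"
port.mark_congested("hot")     # congestion detected: set the "hot" flow aside
print(port.forward(blocked))   # ('cold', 2): the non-congested packet proceeds
```

In this sketch the number of extra queues per port stays fixed regardless of how many destinations exist, which is the property the text relies on for scalability; how flows are identified and when queues are released are deliberately not modeled.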

Summary of scientific work and its impact

The following paragraphs comment on the impact of the research developed by José Duato:

On his own, he developed the best adaptive routing techniques for interconnection networks, also formally proving through a necessary and sufficient condition that those routing techniques cannot be improved. This theory is very easy to apply because he also proposed simple design methodologies. Moreover, these algorithms require very few extra hardware resources to be implemented, thus making adaptive routing commercially viable. As a result, these adaptive routing techniques were used in experimental machines, such as the Reliable Router and the M-Machine (developed at MIT). Moreover, these techniques were used in the Cray T3E supercomputer (the fastest supercomputer at the time it was marketed in 1997) and in the on-chip router of the Compaq Alpha 21364, the fastest commercially available microprocessor when it was launched. These routing techniques have also been used in the Cray Black Widow and the IBM BlueGene/L, a 720 Teraflop massively parallel supercomputer with more than 130,000 dual-processor nodes, which is the fastest supercomputer today. The relevance of these results comes from the fact that adaptive routing based on the theory of José Duato drastically increases supercomputer performance at almost no extra cost by allowing much faster communication among processors, and supercomputers are instrumental in many fields. In particular, the BlueGene/L was conceived to accelerate research on protein folding.
His recent research aims at setting new standards for commercial interconnection networks. In particular, he developed, in collaboration with Xyratex (a UK company), a technique called Regional Explicit Congestion Notification (RECN), which is the only truly scalable congestion management technique for lossless networks to date. The importance of this result comes from the fact that it prevents the dramatic performance degradation experienced by communication networks when they reach saturation, thus allowing higher performance, better utilization, and hence, lower cost. This result has been protected with two joint patents, and RECN is currently being incorporated into the most important standard currently under development for future communication systems: Advanced Switching Interconnect. This standard extends the functionality of the popular PCI-Express, which is used in almost every computer today. Also, several members in his research group are working on improving InfiniBand, a recently proposed industry standard for communication between processors and input/output devices as well as interprocessor communication. The main results achieved by his research group up to now are a set of InfiniBand-compatible routing algorithms that significantly improve performance over previously proposed ones while being flexible enough to be defined on any topology, a set of mechanisms to extend the InfiniBand standard to support adaptive routing, a subnet manager protocol to implement network reconfiguration when the topology changes due to hot swapping/replacement of components, and a fast algorithm to compute InfiniBand arbitration tables that supports multiple classes of service. Several of these results are currently being considered by Sun Microsystems for their next-generation InfiniBand products. Also, some of these results are being ported to Advanced Switching Interconnect.
He structured the knowledge amassed over the years in the area of interconnection networks by presenting the fundamental concepts and state-of-the-art techniques in his book "Interconnection Networks: An Engineering Approach", published in the USA by IEEE Computer Society Press in 1997 and by Morgan Kaufmann in 2003. This book is currently used in the leading computer companies (IBM, Compaq, Intel, Sun Microsystems) by engineers who design interconnection networks. It is also used to teach PhD courses (especially in USA Universities). It is the most popular book on interconnection networks in the market today. Most recent research papers on interconnection networks reference this book. Additionally, due to his high international visibility, he has been invited to write a chapter on interconnection networks for the fourth edition of the book "Computer Architecture: A Quantitative Approach", by John Hennessy and Dave Patterson. This chapter is now complete and the book will be published by Morgan Kaufmann in the coming months. This is by far the most popular book on computer architecture, and almost the only one used for teaching this topic worldwide.
He opened new research lines in the area of interconnection networks. Examples of those lines are the design of adaptive routing algorithms with cyclic dependencies between resources, the use of scalable congestion management techniques to prevent performance degradation when the network reaches saturation, and the dynamic reconfiguration of the routing algorithm to support topology changes without stopping network traffic. The first one had a significant impact on industry, as mentioned above. The second line is likely to have a tremendous impact thanks to the joint patents with Xyratex and its inclusion in the Advanced Switching Interconnect standard. The third line has already raised significant interest in academia and industry (Sun Microsystems), and it led to filing a joint patent among three research institutions: Simula Research Laboratory (Norway), University of Southern California (USA) and Universidad Politécnica de Valencia.
He has contributed to the state-of-the-art in practically all the topics related to interconnection networks, including the proposal of new topologies, switching techniques, flow control techniques, deadlock detection and recovery techniques, congestion management, unicast and multicast routing algorithms, techniques to compute and enhance fault tolerance, network reconfiguration, scheduling algorithms for router resources, and router design, including support for multimedia traffic.
He promoted the creation of a large, multidisciplinary research team in Spain, coordinated by him. In just one decade, more than 50 researchers from five Spanish universities have joined his team. This team recently passed the first evaluation stage for the Consolider-Ingenio 2010 program. Taking into account that only 35 research teams passed this stage, it can be concluded that his research team is among the 35 best research teams in Spain across all areas of science.

Publications

Tomas Picornell, Carles Hernández, Jose Flich and Jose Duato. Enforcing Predictability of Many-cores with DCFNoC. IEEE Transactions on Computers, 2020. BibTeX

@article{10.1109/TC.2020.2987797,
author = "Picornell, Tomas and Hern{\'a}ndez, Carles and Flich, Jose and Duato, Jose",
abstract = "The ever need for higher performance forces industry to include technology based on multi-processors system on chip (MPSoCs) in their safety-critical embedded systems. MPSoCs include a network-on-chip (NoC) to interconnect the cores between them and with memory and the rest of shared resources. Unfortunately, the inclusion of NoCs compromises guaranteeing time predictability as network-level conflicts may occur. To overcome this problem, in this paper we propose DCFNoC, a new time-predictable NoC design paradigm where conflicts within the network are eliminated by design. This new paradigm builds on top of the Channel Dependency Graph (CDG) in order to deterministically avoid network conflicts. The network guarantees predictability to applications and is able to naturally inject messages using a TDM period equal to the optimal theoretical bound without the need of using a computationally demanding offline process. DCFNoC is integrated in a tile-based many-core system and adapted to its memory hierarchy. Our results show that DCFNoC guarantees time predictability avoiding network interference among multiple running applications. DCFNoC always guarantees performance and also improves wormhole performance in a 4 × 4 setting by a factor of 3.7× when interference traffic is injected. For a 8 × 8 network differences are even larger. In addition, DCFNoC obtains a total area saving of 10.79% over a standard wormhole implementation.",
journal = "IEEE Transactions on Computers",
title = "{E}nforcing {P}redictability of {M}any-cores with {DCFN}o{C}",
year = 2020
}

Tomas Picornell, Carles Hernández, Jose Duato and Jose Flich. DCFNoC: A Delayed Conflict-Free Time Division Multiplexing Network on Chip. 56th Annual Design Automation Conference 2019, 2019. BibTeX

@article{10.1145/3316781.3317794,
	author = "Picornell, Tomas and Hern{\'a}ndez, Carles and Duato, Jose and Flich, Jose",
	abstract = "The adoption of many-cores in safety-critical systems requires real-time capable networks on chip (NoC). In this paper we propose a new time-predictable NoC design paradigm where contention within the network is eliminated. This new paradigm builds on the Channel Dependency Graph (CDG) and guarantees by design the absence of contention. Our delayed conflict-free NoC (DCFNoC) is able to naturally inject messages using a TDM period equal to the optimal theoretical bound and without the need of using a computationally demanding offline process. Results show that DCFNoC guarantees time predictability with very low implementation cost.",
	journal = "56th Annual Design Automation Conference 2019",
	title = "{DCFN}o{C}: {A} {D}elayed {C}onflict-{F}ree {T}ime {D}ivision {M}ultiplexing {N}etwork on {C}hip",
	year = 2019
}

Carlos Reaño, Federico Silla and Jose Duato. Enhancing the rCUDA Remote GPU Virtualization Framework: from a Prototype to a Production Solution. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017, Madrid, Spain, May 14-17, 2017. 2017, 695–698. URL, DOI BibTeX

@conference{DBLP:conf/ccgrid/ReanoSD17,
	author = "Rea{\~n}o, Carlos and Silla, Federico and Duato, Jose",
	booktitle = "Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017, Madrid, Spain, May 14-17, 2017",
	crossref = "DBLP:conf/ccgrid/2017",
	doi = "10.1109/CCGRID.2017.42",
	pages = "695--698",
	title = "{E}nhancing the r{CUDA} {R}emote {GPU} {V}irtualization {F}ramework: from a {P}rototype to a {P}roduction {S}olution",
	url = "https://doi.org/10.1109/CCGRID.2017.42",
	year = 2017
}

Josué Feliu, Salvador Petit, Julio Sahuquillo and Jose Duato. Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores. IEEE Transactions on Computers PP (99), 2016. BibTeX

@article{10.1109/TC.2016.2620977,
	author = "Feliu, Josu{\'e} and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
	abstract = "Nowadays, high performance multicore processors implement multithreading capabilities. The processes running concurrently on these processors are continuously competing for the shared resources, not only among cores, but also within the core. While resource sharing increases the resource utilization, the interference among processes accessing the shared resources can strongly affect the performance of individual processes and its predictability. In this scenario, process scheduling plays a key role to deal with performance and fairness.",
	journal = "IEEE Transactions on Computers PP",
	number = 99,
	title = "{P}erf{\&}{F}air: {A} {P}rogress-{A}ware {S}cheduler to {E}nhance {P}erformance and {F}airness in {SMT} {M}ulticores",
	year = 2016
}

Francisco Candel, Salvador Petit, Julio Sahuquillo and Jose Duato. Impact of Memory-Level Parallelism on the Performance of GPU Coherence Protocols. 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2016. BibTeX

@article{10.1109/PDP.2016.67,
	author = "Candel, Francisco and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
	journal = "2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)",
	title = "{I}mpact of {M}emory-{L}evel {P}arallelism on the {P}erformance of {GPU} {C}oherence {P}rotocols",
	year = 2016
}

Julio Sahuquillo, Houcine Hassan Mohamed, Salvador Petit, Jose Duato and José Luis March. A dynamic execution time estimation model to save energy in heterogeneous multicores running periodic tasks. Future Generation Computer Systems (56), 2015. BibTeX

@article{10.1016/j.future.2015.06.011,
	author = "Sahuquillo, Julio and Mohamed, Houcine Hassan and Petit, Salvador and Duato, Jose and March, Jos{\'e} Luis",
	abstract = "Nowadays, real-time embedded applications have to cope with an increasing demand of functionalities, which require increasing processing capabilities. With this aim real-time systems are being implemented on top of high-performance multicore processors that run multithreaded periodic workloads by allocating threads to individual cores. In addition, to improve both performance and energy savings, the industry is introducing new multicore designs such as ARM’s big.LITTLE that include heterogeneous cores in the same package. A key issue to improve energy savings in multicore embedded real-time systems and reduce the number of deadline misses is to accurately estimate the execution time of the tasks considering the supported processor frequencies. Two main aspects make this estimation difficult. First, the running threads compete among them for shared resources. Second, almost all current microprocessors implement Dynamic Voltage and Frequency Scaling (DVFS) regulators to dynamically adjust the voltage/frequency at run-time according to the workload behavior. Existing execution time estimation models rely on off-line analysis or on the assumption that the task execution time scales linearly with the processor frequency, which can bring important deviations since the memory system uses a different power supply. In contrast, this paper proposes the Processor–Memory (Proc–Mem) model, which dynamically predicts the distinct task execution times depending on the implemented processor frequencies. A power-aware EDF (Earliest Deadline First)-based scheduler using the Proc–Mem approach has been evaluated and compared against the same scheduler using a typical Constant Memory Access Time model, namely CMAT. Results on a heterogeneous multicore processor show that the average deviation of Proc–Mem is only by 5.55% with respect to the actual measured execution time, while the average deviation of the CMAT model is 36.42%. 
These results turn in important energy savings, by 18% on average and up to 31% in some mixes, in comparison to CMAT for a similar number of deadline misses.",
	journal = "Future Generation Computer Systems",
	number = 56,
	title = "{A} dynamic execution time estimation model to save energy in heterogeneous multicores running periodic tasks",
	year = 2015
}

Francisco Candel, Salvador Petit, Julio Sahuquillo and Jose Duato. Accurately modeling the GPU memory subsystem. 2015 International Conference on High Performance Computing & Simulation (HPCS), 2015. BibTeX

@article{10.1109/HPCSim.2015.7237038,
	author = "Candel, Francisco and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
	journal = "2015 International Conference on High Performance Computing {\&} Simulation (HPCS)",
	title = "{A}ccurately modeling the {GPU} memory subsystem",
	year = 2015
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Addressing Fairness in SMT Multicores with a Progress-Aware Schedule. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2015. BibTeX

@article{10.1109/IPDPS.2015.48,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	journal = "IEEE International Parallel and Distributed Processing Symposium (IPDPS)",
	title = "{A}ddressing {F}airness in {SMT} {M}ulticores with a {P}rogress-{A}ware {S}chedule",
	year = 2015
}

Alejandro Valero, Julio Sahuquillo, Salvador Petit and Jose Duato. Design of Hybrid Second-Level Caches. IEEE Transactions on Computers 64(7):1884-1897, 2015. BibTeX

@article{10.1109/TC.2014.2346185,
	author = "Valero, Alejandro and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	abstract = "In recent years, embedded dynamic random-access memory (eDRAM) technology has been implemented in last-level caches due to its low leakage energy consumption and high density. However, the fact that eDRAM presents slower access time than static RAM (SRAM) technology has prevented its inclusion in higher levels of the cache hierarchy. This paper proposes to mingle SRAM and eDRAM banks within the data array of second-level (L2) caches. The main goal is to achieve the best trade-off among performance, energy, and area. To this end, two main directions have been followed. First, this paper explores the optimal percentage of banks for each technology. Second, the cache controller is redesigned to deal with performance and energy. Performance is addressed by keeping the most likely accessed blocks in fast SRAM banks. In addition, energy savings are further enhanced by avoiding unnecessary destructive reads of eDRAM blocks. Experimental results show that, compared to a conventional SRAM L2 cache, a hybrid approach requiring similar or even lower area speedups the performance on average by 5.9 percent, while the total energy savings are by 32 percent. For a 45 nm technology node, the energy-delay-area product confirms that a hybrid cache is a better design than the conventional SRAM cache regardless of the number of eDRAM banks, and also better than a conventional eDRAM cache when the number of SRAM banks is an eighth of the total number of cache banks.",
	journal = "IEEE Transactions on Computers",
	number = 7,
	pages = "1884-1897",
	title = "{D}esign of {H}ybrid {S}econd-{L}evel {C}aches",
	volume = 64,
	year = 2015
}

Alejandro Valero, Salvador Petit, Julio Sahuquillo and Jose Duato. A reuse-based refresh policy for energy-aware eDRAM caches. 39(1):37-48, 2015. BibTeX

@article{10.1016/j.micpro.2014.12.001,
author = "Valero, Alejandro and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
abstract = "DRAM technology requires refresh operations to be performed in order to avoid data loss due to capacitance leakage. Refresh operations consume a significant amount of dynamic energy, which increases with the storage capacity. To reduce this amount of energy, prior work has focused on reducing refreshes in off-chip memories. However, this problem also appears in on-chip eDRAM memories implemented in current low-level caches. The refresh energy can dominate the dynamic consumption when a high percentage of the chip area is devoted to eDRAM cache structures. Replacement algorithms for high-associativity low-level caches select the victim block avoiding blocks more likely to be reused soon. This paper combines the state-of-the-art MRUT replacement algorithm with a novel refresh policy. Refresh operations are performed based on information produced by the replacement algorithm. The proposed refresh policy is implemented on top of an energy-aware eDRAM cache architecture, which implements bank-prediction and swap operations to save energy. Experimental results show that, compared to a conventional eDRAM design, the proposed energy-aware cache can achieve by 72% refresh energy savings. Considering the entire on-chip memory hierarchy consumption, the overall energy savings are 30%. These benefits come with minimal impact on performance (by 1.2%) and area overhead (by 0.4%).",
number = 1,
pages = "37-48",
title = "{A} reuse-based refresh policy for energy-aware e{DRAM} caches",
volume = 39,
year = 2015
}

Josué Feliu, Salvador Petit, Julio Sahuquillo and Jose Duato. Bandwidth-Aware On-Line Scheduling in SMT Multicores. IEEE Transactions on Computers 1(65), 2015. BibTeX

@article{10.1109/TC.2015.2428694,
	author = "Feliu, Josu{\'e} and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
	abstract = "The memory hierarchy plays a critical role on the performance of current chip multiprocessors. Main memory is shared by all the running processes, which can cause important bandwidth contention. In addition, when the processor implements SMT cores, the L1 bandwidth becomes shared among the threads running on each core. In such a case, bandwidth-aware schedulers emerge as an interesting approach to mitigate the contention. This work investigates the performance degradation that the processes suffer due to memory bandwidth constraints. Experiments show that main memory and L1 bandwidth contention negatively impact the process performance; in both cases, performance degradation can grow up to 40% for some of applications. To deal with contention, we devise a scheduling algorithm that consists of two policies guided by the bandwidth consumption gathered at runtime. The process selection policy balances the number of memory requests over the execution time to address main memory bandwidth contention. The process allocation policy tackles L1 bandwidth contention by balancing the L1 accesses among the L1 caches. The proposal is evaluated on a Xeon E5645 platform using a wide set of multiprogrammed workloads, achieving performance benefits up to 6.7% with respect to the Linux scheduler.",
	journal = "IEEE Transactions on Computers",
	number = 65,
	title = "{B}andwidth-{A}ware {O}n-{L}ine {S}cheduling in {SMT} {M}ulticores",
	volume = 1,
	year = 2015
}

Carlos Reaño, Federico Silla, Adrián Castelló, Antonio J. Peña, Rafael Mayo, Enrique S Quintana-Ortí and Jose Duato. Improving the user experience of the rCUDA remote GPU virtualization framework. Concurrency and Computation: Practice and Experience 27(14):3746–3770, 2015. URL, DOI BibTeX

@article{DBLP:journals/concurrency/ReanoSGPMQD15,
	author = "Rea{\~n}o, Carlos and Silla, Federico and Adri{\'a}n Castell{\'o} and Antonio J. Pe{\~n}a and Rafael Mayo and Enrique S. Quintana-Ort{\'i} and Duato, Jose",
	doi = "10.1002/cpe.3409",
	journal = "Concurrency and Computation: Practice and Experience",
	number = 14,
	pages = "3746--3770",
	title = "{I}mproving the user experience of the r{CUDA} remote {GPU} virtualization framework",
	url = "https://doi.org/10.1002/cpe.3409",
	volume = 27,
	year = 2015
}

Josué Feliu, Salvador Petit, Julio Sahuquillo and Jose Duato. Cache-hierarchy Contention Aware Scheduling in CMPs. IEEE Transactions on Parallel and Distributed Systems 25(3):581 - 590, March 2014. DOI BibTeX

@article{DBLP:journals/tpds/josue2013,
	author = "Feliu, Josu{\'e} and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
	doi = "10.1109/TPDS.2013.61",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	month = "March",
	number = 3,
	pages = "581 - 590",
	title = "{C}ache-hierarchy {C}ontention {A}ware {S}cheduling in {CMP}s",
	volume = 25,
	year = 2014
}

Antonio José Peña, Carlos Reaño, Federico Silla, Rafael Mayo, Enrique S Quintana-Ortí and Jose Duato. A complete and efficient CUDA-sharing solution for HPC clusters. Parallel Computing 40(10):574–588, 2014. URL, DOI BibTeX

@article{DBLP:journals/pc/PenaRSMQD14,
	author = "Pe{\~n}a, Antonio Jos{\'e} and Rea{\~n}o, Carlos and Silla, Federico and Rafael Mayo and Enrique S. Quintana-Ort{\'i} and Duato, Jose",
	doi = "10.1016/j.parco.2014.09.011",
	journal = "Parallel Computing",
	number = 10,
	pages = "574--588",
	title = "{A} complete and efficient {CUDA}-sharing solution for {HPC} clusters",
	url = "http://dx.doi.org/10.1016/j.parco.2014.09.011",
	volume = 40,
	year = 2014
}

Carlos Reaño, Federico Silla, Antonio José Peña, Gilad Shainer, Scot Schultz, Adrián Castelló Gimeno, Enrique S Quintana-Ortí and Jose Duato. Boosting the performance of remote GPU virtualization using InfiniBand connect-IB and PCIe 3.0. In 2014 IEEE International Conference on Cluster Computing, CLUSTER 2014, Madrid, Spain, September 22-26, 2014. 2014, 266–267. URL, DOI BibTeX

@conference{DBLP:conf/cluster/ReanoSPSSGQD14,
	author = "Rea{\~n}o, Carlos and Silla, Federico and Pe{\~n}a, Antonio Jos{\'e} and Gilad Shainer and Scot Schultz and Adri{\'a}n Castell{\'o} Gimeno and Enrique S. Quintana-Ort{\'i} and Duato, Jose",
	booktitle = "2014 IEEE International Conference on Cluster Computing, CLUSTER 2014, Madrid, Spain, September 22-26, 2014",
	crossref = "DBLP:conf/cluster/2014",
	doi = "10.1109/CLUSTER.2014.6968737",
	pages = "266--267",
	title = "{B}oosting the performance of remote {GPU} virtualization using {I}nfini{B}and connect-{IB} and {PCI}e 3.0",
	url = "http://dx.doi.org/10.1109/CLUSTER.2014.6968737",
	year = 2014
}

Sergio Iserte, Adrián Castelló Gimeno, Rafael Mayo, Enrique S Quintana-Ortí, Federico Silla, Jose Duato, Carlos Reaño and Javier Prades. SLURM Support for Remote GPU Virtualization: Implementation and Performance Study. In 26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France, October 22-24, 2014. 2014, 318–325. URL, DOI BibTeX

@conference{DBLP:conf/sbac-pad/IserteGMQSDRP14,
	author = "Sergio Iserte and Adri{\'a}n Castell{\'o} Gimeno and Rafael Mayo and Enrique S. Quintana-Ort{\'i} and Silla, Federico and Duato, Jose and Rea{\~n}o, Carlos and Prades, Javier",
	booktitle = "26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France, October 22-24, 2014",
	crossref = "DBLP:conf/sbac-pad/2014",
	doi = "10.1109/SBAC-PAD.2014.49",
	pages = "318--325",
	title = "{SLURM} {S}upport for {R}emote {GPU} {V}irtualization: {I}mplementation and {P}erformance {S}tudy",
	url = "http://dx.doi.org/10.1109/SBAC-PAD.2014.49",
	year = 2014
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Addressing bandwidth contention in SMT multicores through scheduling. In International Conference on Supercomputing, ICS'14. 2014, 167. BibTeX

@conference{DBLP:conf/ics/FeliuSPD14,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "International Conference on Supercomputing, ICS'14",
	crossref = "DBLP:conf/ics/2014",
	pages = 167,
	title = "{A}ddressing bandwidth contention in {SMT} multicores through scheduling",
	year = 2014
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Planificación Considerando Degradación de Prestaciones por Contención. In XXIV Jornadas de Paralelismo, JP 2013, Madrid, Sep 17-20. 2013, 62-67. BibTeX

@conference{JP/Feliu/13,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "XXIV Jornadas de Paralelismo, JP 2013, Madrid, Sep 17-20",
	isbn = "978-84-695-8330-2",
	pages = "62-67",
	title = "{P}lanificaci{\'o}n {C}onsiderando {D}egradaci{\'o}n de {P}restaciones por {C}ontenci{\'o}n",
	year = 2013
}

Carlos Reaño, Antonio José Peña, Federico Silla, Rafa Mayo, Enrique S Quintana-Ortí and Jose Duato. Influence of InfiniBand FDR on the Performance of Remote GPU Virtualization. In International Conference on Cluster Computing (Cluster). 2013. BibTeX

@conference{reanoInfluence,
	author = "Rea{\~n}o, Carlos and Pe{\~n}a, Antonio Jos{\'e} and Silla, Federico and Rafa Mayo and Enrique S. Quintana-Ort{\'i} and Duato, Jose",
	booktitle = "International Conference on Cluster Computing (Cluster)",
	title = "{I}nfluence of {I}nfini{B}and {FDR} on the {P}erformance of {R}emote {GPU} {V}irtualization",
	year = 2013
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors. In 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT'13, Edinburgh, United Kingdom, Sep 7-11. 2013, 123-132. BibTeX

@conference{PACT/Feliu/13,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "22nd International Conference on Parallel Architectures and Compilation Techniques, PACT'13, Edinburgh, United Kingdom, Sep 7-11",
	isbn = "978-1-4799-1021-2",
	pages = "123-132",
	title = "{L}1-{B}andwidth {A}ware {T}hread {A}llocation in {M}ulticore {SMT} {P}rocessors",
	year = 2013
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Using huge pages and performance counters to determine the LLC architecture. In International Conference on Computational Science, ICCS'13, Barcelona, Jun 5-7. 2013, 2557-2560. BibTeX

@conference{josue_iccs_2013,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "International Conference on Computational Science, ICCS'13, Barcelona, Jun 5-7",
	pages = "2557-2560",
	title = "{U}sing huge pages and performance counters to determine the {LLC} architecture",
	year = 2013
}

Carlos Reaño, Antonio José Peña, Federico Silla, R Mayo, E S Quintana-Ortí and Jose Duato. CU2rCU: towards the Complete rCUDA Remote GPU Virtualization and Sharing Solution. In 19th Annual International Conference on High Performance Computing (HiPC). December 2012. URL BibTeX

@conference{CU2rCU_HiPC12,
	author = "Rea{\~n}o, Carlos and Pe{\~n}a, Antonio Jos{\'e} and Silla, Federico and Mayo, R. and Quintana-Ort{\'i}, E. S. and Duato, Jose",
	booktitle = "19th Annual International Conference on High Performance Computing (HiPC)",
	month = "December",
	title = "{CU}2r{CU}: towards the {C}omplete r{CUDA} {R}emote {GPU} {V}irtualization and {S}haring {S}olution",
	url = "http://ieeexplore.ieee.org/stamp/stamp.jsp?tp={\&}arnumber=6507485{\&}isnumber=6507469",
	year = 2012
}

Roberto Peñaranda, Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. A New Family of Hybrid Topologies for Large-Scale Interconnection Networks. IEEE 11th International Symposium on Network Computing and Applications, pages 220-227, August 2012. BibTeX

@article{HybridTopology,
	author = "Pe{\~n}aranda, Roberto and Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "In large supercomputers the topology of the interconnection network is a key design issue that impacts the performance and cost of the whole system. Direct topologies provide a reduced hardware cost, but, as the number of dimensions is conditioned by 3D wiring restrictions, a high number of nodes per dimension is used, which increases communication latency and reduces network throughput. On the other hand, indirect topologies can provide better performance for large network sizes, but at the cost of a high number of switches and links. In this paper, we propose a new family of topologies that combines the best features of both direct and indirect topologies to efficiently connect an extremely high number of nodes. In particular, we propose an n–dimensional topology, where the nodes of each dimension are connected through a small indirect topology. This combination results in a family of topologies that provides high performance, with latency and throughput figures of merit close to indirect topologies, but with a lower hardware cost. In particular, it is able to double the throughput obtained per switching element of indirect topologies. Moreover, the layout of the topology is much simpler than in indirect topologies. Indeed, its fault–tolerance degree is equal or higher than the one for direct and indirect topologies.",
	journal = "IEEE 11th International Symposium on Network Computing and Applications",
	keywords = "routing algorithm, direct topology, indirect topology",
	month = "August",
	pages = "220-227",
	title = "{A} {N}ew {F}amily of {H}ybrid {T}opologies for {L}arge-{S}cale {I}nterconnection {N}etworks",
	year = 2012
}

Carles Hernández, Antoni Roca, Federico Silla, Jose Flich and Jose Duato. On the Impact of Within-Die Process Variation in GALS-Based NoC Performance. IEEE Trans. on CAD of Integrated Circuits and Systems 31(2):294-307, 2012. BibTeX

@article{DBLP:journals/tcad/HernandezRSFD12,
	author = "Hern{\'a}ndez, Carles and Roca, Antoni and Silla, Federico and Flich, Jose and Duato, Jose",
	journal = "IEEE Trans. on CAD of Integrated Circuits and Systems",
	number = 2,
	pages = "294-307",
	title = "{O}n the {I}mpact of {W}ithin-{D}ie {P}rocess {V}ariation in {GALS}-{B}ased {N}o{C} {P}erformance",
	volume = 31,
	year = 2012
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Planificación considerando el ancho de banda de la jerarquía de cache. In XXIII Jornadas de Paralelismo, JP 2012, Elche, Sep 19-21. 2012, 472-477. BibTeX

@conference{JP/Feliu/12,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "XXIII Jornadas de Paralelismo, JP 2012, Elche, Sep 19-21",
	isbn = "978-84-695-4473-0",
	pages = "472-477",
	title = "{P}lanificaci{\'o}n considerando el ancho de banda de la jerarqu{\'i}a de cache",
	year = 2012
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Understanding Cache Hierarchy Contention in CMPs to Improve Job Scheduling. In 26th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, Shanghai, China, May 21-25. 2012, 508-519. BibTeX

@conference{DBLP:conf/ipps/FeliuSPD12,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "26th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, Shanghai, China, May 21-25",
	isbn = "978-1-4673-0975-2",
	pages = "508-519",
	title = "{U}nderstanding {C}ache {H}ierarchy {C}ontention in {CMP}s to {I}mprove {J}ob {S}cheduling",
	year = 2012
}

Carles Hernández, Federico Silla and Jose Duato. Energy and Performance Efficient Thread Mapping in NoC-Based CMPs under Process Variations. In Parallel Processing (ICPP), 2011 International Conference on. 2011, 41 -50. DOI BibTeX

@conference{6047171,
	author = "Hern{\'a}ndez, Carles and Silla, Federico and Duato, Jose",
	abstract = "Within-die process variation causes cores, memories, and network resources in NoC-based CMPs to present different speeds and leakage power. In this context, thread mapping strategies that consider the effects of process variability on chip resources arise as a suitable choice to maximize performance while energy consumption constraints are satisfied. However, other factors, such as the location of memory controllers and the concurrent execution of several applications in the chip, can bound the possible benefits of such mapping strategies. In this paper we propose a mapping strategy, named uniform regions, that takes variability effects into account when assigning application threads to cores in the chip. More specifically, uniform regions, in terms of operating frequency, that additionally present the highest available frequency, are selected so that the benefits of such a variation-aware mapping strategy in a NoC-based CMP are maximized. We additionally present two different ways of configuring the frequency and voltage of the cores in the selected region. The first one is intended to provide the maximum performance while keeping energy as low as possible, while the second one is much more energy-aware. The first one reduces the execution time by up to 23% while reducing the energy by up to 24%, whereas the second one provides smaller speedups while reducing energy by up to 33%.",
	booktitle = "Parallel Processing (ICPP), 2011 International Conference on",
	doi = "10.1109/ICPP.2011.48",
	issn = "0190-3918",
	month = "sept.",
	pages = "41 -50",
	title = "{E}nergy and {P}erformance {E}fficient {T}hread {M}apping in {N}o{C}-{B}ased {CMP}s under {P}rocess {V}ariations",
	year = 2011
}

Antoni Roca, Carles Hernández, Jose Flich, Federico Silla and Jose Duato. A Distributed Switch Architecture for On-Chip Networks. In Parallel Processing (ICPP), 2011 International Conference on. 2011, 21 -30. DOI BibTeX

@conference{6047169,
	author = "Roca, Antoni and Hern{\'a}ndez, Carles and Flich, Jose and Silla, Federico and Duato, Jose",
	abstract = "It is well-known that current Chip Multiprocessor (CMP) and high-end MultiProcessor System-on-Chip (MPSoC) designs are growing in their number of components. Networks-on-Chip (NoC) provide the required connectivity for such CMP and MPSoC designs at reasonable costs. However, as technology advances, links become the critical component in the NoC. First, because the power consumption of the link is extremely high with respect to the power consumption of the rest of the components (mainly switches), becoming unacceptable for long global interconnects. Second, the delay of a link does not scale with technology, thus degrading the performance of the network. To solve both problems, several solutions have been previously proposed. In this paper, we present a new switch architecture that reduces the negative impact of links on the NoC. We call our proposal distributed switch. The distributed switch moves the circuitry of a standard switch onto the links. Then, packets are buffered, routed, and forwarded at the same time they are crossing the link. Distributing a standard switch onto the link improves the trade-off between the power consumption and the operating frequency of the entire network. In contrast, area requirements are increased. The distributed switch reduces the peak power consumption by up to 14.8% while increasing its area by up to 22%. Furthermore, the distributed switch is able to increase the maximum achievable frequency with respect to the standard switch. In particular, the maximum operating frequency of the distributed switch can be increased by up to 14.3%.",
	booktitle = "Parallel Processing (ICPP), 2011 International Conference on",
	doi = "10.1109/ICPP.2011.28",
	issn = "0190-3918",
	month = "sept.",
	pages = "21 -30",
	title = "{A} {D}istributed {S}witch {A}rchitecture for {O}n-{C}hip {N}etworks",
	year = 2011
}

Jesus Escudero-Sahuquillo, Ernst Gunnar Gran, Pedro Javier Garcia, Jose Flich, Tor Skeie, Olav Lysne, Francisco Jose Quiles and Jose Duato. Combining Congested-Flow Isolation and Injection Throttling in HPC Interconnection Networks. In Parallel Processing (ICPP), 2011 International Conference on. 2011, 662 -672. DOI BibTeX

@conference{6047234,
	author = "Jesus Escudero-Sahuquillo and Ernst Gunnar Gran and Pedro Javier Garcia and Flich, Jose and Tor Skeie and Olav Lysne and Francisco Jose Quiles and Duato, Jose",
	abstract = "Existing congestion control mechanisms in interconnects can be divided into two general approaches. One is to throttle traffic injection at the sources that contribute to congestion, and the other is to isolate the congested traffic in specially designated resources. These two approaches have different, but non-overlapping weaknesses. In this paper we present in detail a method that combines injection throttling and congested-flow isolation. Through simulation studies we first demonstrate the respective flaws of the injection throttling and of flow isolation. Thereafter we show that our combined method extracts the best of both approaches in the sense that it gives fast reaction to congestion, it is scalable and it has good fairness properties with respect to the congested flows.",
	booktitle = "Parallel Processing (ICPP), 2011 International Conference on",
	doi = "10.1109/ICPP.2011.80",
	issn = "0190-3918",
	month = "sept.",
	pages = "662 -672",
	title = "{C}ombining {C}ongested-{F}low {I}solation and {I}njection {T}hrottling in {HPC} {I}nterconnection {N}etworks",
	year = 2011
}

Jose Duato, Antonio José Peña, Federico Silla, Rafael Mayo and Enrique S Quintana-Ortí. Performance of CUDA Virtualized Remote GPUs in High Performance Clusters. In Parallel Processing (ICPP), 2011 International Conference on. 2011, 365 -374. DOI BibTeX

@conference{6047204,
	author = "Duato, Jose and Pe{\~n}a, Antonio Jos{\'e} and Silla, Federico and Rafael Mayo and Enrique S. Quintana-Ort{\'i}",
	abstract = "In a previous work we presented the architecture of rCUDA, a middleware that enables CUDA remoting over a commodity network. That is, the middleware allows an application to use a CUDA-compatible Graphics Processor (GPU) installed in a remote computer as if it were installed in the computer where the application is being executed. This approach is based on the observation that GPUs in a cluster are not usually fully utilized, and it is intended to reduce the number of GPUs in the cluster, thus lowering the costs related with acquisition and maintenance while keeping performance close to that of the fully-equipped configuration. In this paper we model rCUDA over a series of high throughput networks in order to assess the influence of the performance of the underlying network on the performance of our virtualization technique. For this purpose, we analyze the traces of two different case studies over two different networks. Using this data, we calculate the expected performance for these same case studies over a series of high throughput networks, in order to characterize the expected behavior of our solution in high performance clusters. The estimations are validated using real 1 Gbps Ethernet and 40 Gbps InfiniBand networks, showing an error rate in the order of 1% for executions involving data transfers above 40 MB. In summary, although our virtualization technique noticeably increases execution time when using a 1 Gbps Ethernet network, it performs almost as efficiently as a local GPU when higher performance interconnects are used. Therefore, the small overhead incurred by our proposal because of the remote use of GPUs is worth the savings that a cluster configuration with fewer GPUs than nodes reports.",
	booktitle = "Parallel Processing (ICPP), 2011 International Conference on",
	doi = "10.1109/ICPP.2011.58",
	issn = "0190-3918",
	month = "sept.",
	pages = "365 -374",
	title = "{P}erformance of {CUDA} {V}irtualized {R}emote {GPU}s in {H}igh {P}erformance {C}lusters",
	year = 2011
}

, Jose Flich, Antoni Roca and Jose Duato. PC-Mesh: A Dynamic Parallel Concentrated Mesh. In Parallel Processing (ICPP), 2011 International Conference on. 2011, 642 -651. DOI BibTeX

@conference{6047232,
	author = ", and Flich, Jose and Roca, Antoni and Duato, Jose",
	abstract = "We present a novel network on-chip topology, PC-Mesh (Parallel Concentrated Mesh), suitable for tiled CMP systems. The topology is built using four concentrated mesh (C-Mesh) networks and a new network interface able to inject packets through different networks. The goal of the new combined topology is to minimize the power consumption of the network when running applications exhibiting low traffic rates and maximize throughput when applications require high traffic rates. Thus, the topology is dynamically adjusted (switching on and off network components) with a proper injection algorithm, adapting itself to the network on-chip traffic requirements. The PC-Mesh network performs as a C-Mesh network (using one sub network) when the traffic is low obtaining large savings in power consumption. When the load network increases, new sub networks are opened and thus higher traffic rates are supported, thus providing comparable results as the mesh network. Additional benefits of the PC-Mesh network are its fault tolerance degree and the lower latency in terms of hops. An alternative PC-Mesh version is provided to optimize the fault-tolerance degree. Comparative results with detailed evaluations (in area, power, and delay) are provided both for the network interface and switches. Results demonstrate PC-Mesh is able to dynamically adapt to the current traffic situations. Experimental results with a system-level simulation platform (including the application being run and the operating system) are provided. Results show how the PC-Mesh network achieves the same results as the C-Mesh topology reducing execution time of applications by 20% as well as energy consumption by also 20%, when compared with the 2D-Mesh network topology. However, when challenged with higher traffic demands, PC-Mesh outperforms the C-Mesh network by achieving much lower execution time of applications and lower energy consumption. In some scenarios, execution time is reduced by a factor of 2 and power consumption by 50%.",
	booktitle = "Parallel Processing (ICPP), 2011 International Conference on",
	doi = "10.1109/ICPP.2011.21",
	issn = "0190-3918",
	month = "sept.",
	pages = "642 -651",
	title = "{PC}-{M}esh: {A} {D}ynamic {P}arallel {C}oncentrated {M}esh",
	year = 2011
}

Blas Cuesta Sáez, Alberto Ros, Maria E Gomez, Antonio Robles and Jose Duato. Increasing the Effectiveness of Directory Caches by Deactivating Coherence for Private Memory Blocks. In 38th International Symposium on Computer Architecture (ISCA). June 2011, 93–103. URL BibTeX

@conference{bcuesta-isca11,
	author = "Cuesta S{\'a}ez, Blas and Ros, Alberto and Gomez, Maria E. and Robles, Antonio and Duato, Jose",
	address = "San Jose (California)",
	booktitle = "38th International Symposium on Computer Architecture (ISCA)",
	isbn = "978-1-4503-0472-6",
	month = "jun",
	pages = "93--103",
	publisher = "Association for Computing Machinery (ACM)",
	title = "{I}ncreasing the {E}ffectiveness of {D}irectory {C}aches by {D}eactivating {C}oherence for {P}rivate {M}emory {B}locks",
	url = "http://skywalker.inf.um.es/~aros/papers/bcuesta-isca11.pdf",
	year = 2011
}

Samuel Rodrigo, Jose Flich, Antoni Roca, S Medardoni, D Bertozzi, , Federico Silla and Jose Duato. Cost-Efficient On-Chip Routing Implementations for CMP and MPSoC Systems. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 30(4):534 -547, April 2011. URL, DOI BibTeX

@article{5737867,
	author = "Rodrigo, Samuel and Flich, Jose and Roca, Antoni and S. Medardoni and D. Bertozzi and , and Silla, Federico and Duato, Jose",
	abstract = "The high-performance computing domain is enriching with the inclusion of networks-on-chip (NoCs) as a key component of many-core (CMPs or MPSoCs) architectures. NoCs face the communication scalability challenge while meeting tight power, area, and latency constraints. Designers must address new challenges that were not present before. Defective components, the enhancement of application-level parallelism, or power-aware techniques may break topology regularity, thus, efficient routing becomes a challenge. This paper presents universal logic-based distributed routing (uLBDR), an efficient logic-based mechanism that adapts to any irregular topology derived from 2-D meshes, instead of using routing tables. uLBDR requires a small set of configuration bits, thus being more practical than large routing tables implemented in memories. Several implementations of uLBDR are presented highlighting the tradeoff between routing cost and coverage. The alternatives span from the previously proposed LBDR approach (with 30% of coverage) to the uLBDR mechanism achieving full coverage. This comes with a small performance cost, thus exhibiting the tradeoff between fault tolerance and performance. Power consumption, area, and delay estimates are also provided highlighting the efficiency of the mechanism. To do this, different router models (one for CMPs and one for MPSoCs) have been designed as a proof of concept.",
	doi = "10.1109/TCAD.2011.2119150",
	issn = "0278-0070",
	journal = "Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on",
	keywords = "Fault-tolerance , logic design , networks-on-chip , routing",
	month = "april",
	number = 4,
	pages = "534 -547",
	title = "{C}ost-{E}fficient {O}n-{C}hip {R}outing {I}mplementations for {CMP} and {MPS}o{C} {S}ystems",
	url = "http://dx.doi.org/10.1109/TCAD.2011.2119150",
	volume = 30,
	year = 2011
}

Jose Duato, Antonio José Peña, Federico Silla, Rafael Mayo and Enrique S Quintana-Ortí. Enabling CUDA acceleration within virtual machines using rCUDA. Proceedings of HiPC 2011, 2011. URL BibTeX

@article{N/A,
	author = "Duato, Jose and Pe{\~n}a, Antonio Jos{\'e} and Silla, Federico and Rafael Mayo and Enrique S. Quintana-Ort{\'i}",
	abstract = "The hardware and software advances of Graphics Processing Units (GPUs) have favored the development of GPGPU (General-Purpose Computation on GPUs) and its adoption in many scientific, engineering, and industrial areas. Thus, GPUs are increasingly being introduced in high-performance computing systems as well as in datacenters. On the other hand, virtualization technologies are also receiving rising interest in these domains, because of their many benefits on acquisition and maintenance savings. There are currently several works on GPU virtualization. However, there is no standard solution allowing access to GPGPU capabilities from virtual machine environments like, e.g., VMware, Xen, VirtualBox, or KVM. Such lack of a standard solution is delaying the integration of GPGPU into these domains.",
	journal = "Proceedings of HiPC 2011",
	keywords = "Virtual machine;rCUDA",
	note = "Clusters;CUDA;High performance computing;Virtualizations;",
	title = "{E}nabling {CUDA} acceleration within virtual machines using r{CUDA}",
	url = "http://www.hipc.org/hipc2011/program.php",
	year = 2011
}

Carles Hernández, Antoni Roca, Jose Flich, Federico Silla and Jose Duato. Fault-Tolerant Vertical Link Design for Effective 3D Stacking. IEEE Computer Architecture Letters 99(RapidPosts), 2011. URL, DOI BibTeX

@article{10.1109/L-CA.2011.17,
	author = "Hern{\'a}ndez, Carles and Roca, Antoni and Flich, Jose and Silla, Federico and Duato, Jose",
	address = "Los Alamitos, CA, USA",
	doi = "10.1109/L-CA.2011.17",
	issn = "1556-6056",
	journal = "IEEE Computer Architecture Letters",
	number = "RapidPosts",
	publisher = "IEEE Computer Society",
	title = "{F}ault-{T}olerant {V}ertical {L}ink {D}esign for {E}ffective 3{D} {S}tacking",
	url = "http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.17",
	volume = 99,
	year = 2011
}

F O Sem-Jacobsen, T Skeie, O Lysne and Jose Duato. Dynamic Fault Tolerance in Fat Trees. IEEE Transactions on Computers 60(4):508-525, 2011. URL, DOI BibTeX

@article{11837626,
	author = "F.O. Sem-Jacobsen and T. Skeie and O. Lysne and Duato, Jose",
	abstract = "Fat trees are a very common communication architecture in current large-scale parallel computers. The probability of failure in these systems increases with the number of components. We present a routing method for deterministically and adaptively routed fat trees, applicable to both distributed and source routing, that is able to handle several concurrent faults and that transparently returns to the original routing strategy once the faulty components have recovered. The method is local and dynamic, completely masking the fault from the rest of the system. It only requires a small extra functionality in the switches to handle rerouting packets around a fault. The method guarantees connectedness and deadlock and livelock freedom for up to k -1 benign simultaneous switch and/or link faults where k is half the number of ports in the switches. Our simulation experiments show a graceful degradation of performance as more faults occur. Furthermore, we demonstrate that for most fault combinations, our method will even be able to handle significantly more faults beyond the k -1 limit with high probability.",
	address = "USA",
	doi = "10.1109/TC.2010.97",
	issn = "0018-9340",
	journal = "IEEE Transactions on Computers",
	keywords = "failure analysis;fault tolerant computing;large-scale systems;network routing;parallel architectures;parallel machines;trees;",
	note = "dynamic fault tolerance;fat tree;communication architecture;large-scale parallel computer;failure probability;routing method;source routing;distributed routing;concurrent fault;rerouting packet handling;",
	number = 4,
	pages = "508-525",
	title = "{D}ynamic {F}ault {T}olerance in {F}at {T}rees",
	url = "http://dx.doi.org/10.1109/TC.2010.97",
	volume = 60,
	year = 2011
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. How to reduce packet dropping in a bufferless NoC. Concurrency and Computation: Practice and Experience 23(1):86 - 99, 2011. URL, DOI BibTeX

@article{11723780,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on-chip (NoCs) interconnect the components located inside a chip. In multicore chips, NoCs have a strong impact on the overall system performance. NoC bandwidth is limited by the critical path delay. Recent works show that the critical path delay is heavily affected by switch port buffer size. Therefore, by removing buffers, switch clock frequency can be increased. Recently, a new switching technique for NoCs called Blind Packet Switching (BPS) has been proposed, which is based on removing the switch port buffers. Since buffers consume a high percentage of switch power and area, BPS not only improves performance but also reduces power and area. In BPS, as there are no buffers at the switch ports, packets cannot be stopped and stored on them. If contention arises, packets are dropped and later reinjected, negatively affecting performance. In order to prevent packet dropping, some techniques based on resource replication have been proposed. In this paper, we propose some alternative and complementary techniques that do not rely on resource replication. By using them, packet dropping is highly reduced. In particular, packet dropping is completely removed for a very wide network traffic range. Moreover, network throughput is increased and packet latency is reduced.",
	address = "UK",
	doi = "10.1002/cpe.1606",
	issn = "1532-0626",
	journal = "Concurrency and Computation: Practice and Experience",
	keywords = "buffer circuits;circuit switching;network-on-chip;",
	note = "packet dropping reduction;bufferless NoC;networks on-chip;critical path delay;switch clock frequency;blind packet switching;switch port buffers;network traffic range;",
	number = 1,
	pages = "86 - 99",
	title = "{H}ow to reduce packet dropping in a bufferless {N}o{C}",
	url = "http://dx.doi.org/10.1002/cpe.1606",
	volume = 23,
	year = 2011
}

Samuel Rodrigo, Jose Flich, Antoni Roca, S Medardoni, D Bertozzi, , Federico Silla and Jose Duato. Cost-efficient on-chip routing implementations for CMP and MPSoC systems. 2011, 534 - 547. URL, DOI BibTeX

@conference{20111313880819,
	author = "Rodrigo, Samuel and Flich, Jose and Roca, Antoni and S. Medardoni and D. Bertozzi and , and Silla, Federico and Duato, Jose",
	abstract = "The high-performance computing domain is being enriched by the inclusion of networks-on-chip (NoCs) as a key component of many-core (CMPs or MPSoCs) architectures. NoCs face the communication scalability challenge while meeting tight power, area, and latency constraints. Designers must address new challenges that were not present before. Defective components, the enhancement of application-level parallelism, or power-aware techniques may break topology regularity, so efficient routing becomes a challenge. This paper presents universal logic-based distributed routing (uLBDR), an efficient logic-based mechanism that adapts to any irregular topology derived from 2-D meshes, instead of using routing tables. uLBDR requires a small set of configuration bits, thus being more practical than large routing tables implemented in memories. Several implementations of uLBDR are presented highlighting the tradeoff between routing cost and coverage. The alternatives span from the previously proposed LBDR approach (with 30% of coverage) to the uLBDR mechanism achieving full coverage. This comes with a small performance cost, thus exhibiting the tradeoff between fault tolerance and performance. Power consumption, area, and delay estimates are also provided highlighting the efficiency of the mechanism. To do this, different router models (one for CMPs and one for MPSoCs) have been designed as a proof of concept. © 2006 IEEE.",
	address = "445 Hoes Lane / P.O. Box 1331, Piscataway, NJ 08855-1331, United States",
	doi = "10.1109/TCAD.2011.2119150",
	issn = "0278-0070",
	journal = "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems",
	key = "Fault tolerance",
	keywords = "Computer software selection and evaluation;Logic design;Microprocessor chips;Quality assurance;Telecommunication networks;Topology;",
	note = "Cost-efficient;Distributed routing;Efficient routing;High-performance computing;Irregular topology;Key component;Latency constraints;Many-core;Networks on chips;networks-on-chip;On chips;Performance costs;Power Consumption;Power-aware;Router model;routing;Routing table;Universal logic;",
	number = 4,
	pages = "534 - 547",
	title = "{C}ost-efficient on-chip routing implementations for {CMP} and {MPS}o{C} systems",
	url = "http://dx.doi.org/10.1109/TCAD.2011.2119150",
	volume = 30,
	year = 2011
}

J Escudero-Sahuquillo, P J Garcia, F J Quiles, Jose Flich and Jose Duato. Cost-effective queue schemes for reducing head-of-line blocking in fat-trees. Concurrency Computation Practice and Experience 12(15), 2011. URL, DOI BibTeX

@article{IP51411971,
	author = "J. Escudero-Sahuquillo and P.J. Garcia and F.J. Quiles and Flich, Jose and Duato, Jose",
	abstract = "The fat-tree is one of the most common topologies among the interconnection networks of the systems currently used for high-performance parallel computing. Among other advantages, fat-trees allow the use of simple but very efficient routing schemes. One of them is a deterministic routing algorithm that has been recently proposed, offering performance similar to (or better than) adaptive routing while reducing complexity and guaranteeing in-order packet delivery. However, like other deterministic routing proposals, this deterministic routing algorithm cannot react when high traffic loads or hot-spot traffic scenarios produce severe contention for the use of network resources, leading to the appearance of Head-of-Line (HoL) blocking, which spoils the network performance. In that sense, we describe in this paper two simple, cost-effective strategies for dealing with the HoL-blocking problem that may appear in fat-trees with the aforementioned deterministic routing algorithm. From the results presented in the paper, we conclude that, in the mentioned environment, these proposals considerably reduce HoL-blocking without significantly increasing switch complexity and the required silicon area. © 2011 John Wiley {\&} Sons, Ltd.",
	doi = "10.1002/cpe.1764",
	issn = "1532-0626",
	journal = "Concurrency Computation Practice and Experience",
	key = "Trees (mathematics)",
	keywords = "Cost effectiveness;Network performance;Packet networks;Parallel architectures;Routing algorithms;",
	note = "Adaptive routing;Deterministic routing;Deterministic routing algorithms;Efficient routing;Head of line blocking;Hot-spot traffic;In-order packet delivery;Network resource;Parallel Computing;Silicon area;Switch complexity;Traffic loads;",
	number = 15,
	title = "{C}ost-effective queue schemes for reducing head-of-line blocking in fat-trees",
	url = "http://dx.doi.org/10.1002/cpe.1764",
	volume = 12,
	year = 2011
}

Carles Hernández, Antoni Roca, Jose Flich, Federico Silla and Jose Duato. Characterizing the impact of process variation on 45 nm NoC-based CMPs. Journal of Parallel and Distributed Computing 71(5):651 - 663, 2011. URL, DOI BibTeX

@article{20111413888254,
	author = "Hern{\'a}ndez, Carles and Roca, Antoni and Flich, Jose and Silla, Federico and Duato, Jose",
	abstract = "Current integration scales make it possible to design chip multiprocessors with a large number of cores interconnected by a NoC. Unfortunately, they also bring process variation, posing a new burden to processor manufacturers. Regarding the NoC, variability causes the delays of links and routers to differ from those initially established at design time. In this paper we analyze how variability affects the NoC by applying a new variability model to 100 instances of an 8 × 8 mesh NoC synthesized using 45 nm technology. We also show that GALS-based NoCs present communication bottlenecks due to the slower components of the network, which cause congestion, thus reducing performance. This performance reduction finally affects the applications being executed in the CMP because they may be mapped to slower areas of the chip. In this paper we show that using a mapping algorithm that considers variability data may improve application execution time by up to 50%. © 2010 Elsevier Inc. All rights reserved.",
	address = "6277 Sea Harbor Drive, Orlando, FL 32887-4900, United States",
	doi = "10.1016/j.jpdc.2010.09.006",
	issn = "0743-7315",
	journal = "Journal of Parallel and Distributed Computing",
	key = "Routers",
	keywords = "Conformal mapping;Design;Microprocessor chips;Multiprocessing systems;Servers;Systems analysis;VLSI circuits;",
	note = "Chip Multiprocessor;NoC (or Network-on-Chip);Process mapping;Process variations;Router design;",
	number = 5,
	pages = "651 - 663",
	title = "{C}haracterizing the impact of process variation on 45 nm {N}o{C}-based {CMP}s",
	url = "http://dx.doi.org/10.1016/j.jpdc.2010.09.006",
	volume = 71,
	year = 2011
}

Samuel Rodrigo, Jose Flich, Antoni Roca, S Medardoni, D Bertozzi, , Federico Silla and Jose Duato. Cost-Efficient On-Chip Routing Implementations for CMP and MPSoC Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30(4):534 - 47, 2011. URL, DOI BibTeX

@article{11874902,
	author = "Rodrigo, Samuel and Flich, Jose and Roca, Antoni and S. Medardoni and D. Bertozzi and , and Silla, Federico and Duato, Jose",
	abstract = "The high-performance computing domain is being enriched by the inclusion of networks-on-chip (NoCs) as a key component of many-core (CMPs or MPSoCs) architectures. NoCs face the communication scalability challenge while meeting tight power, area, and latency constraints. Designers must address new challenges that were not present before. Defective components, the enhancement of application-level parallelism, or power-aware techniques may break topology regularity, so efficient routing becomes a challenge. This paper presents universal logic-based distributed routing (uLBDR), an efficient logic-based mechanism that adapts to any irregular topology derived from 2-D meshes, instead of using routing tables. uLBDR requires a small set of configuration bits, thus being more practical than large routing tables implemented in memories. Several implementations of uLBDR are presented highlighting the tradeoff between routing cost and coverage. The alternatives span from the previously proposed LBDR approach (with 30% of coverage) to the uLBDR mechanism achieving full coverage. This comes with a small performance cost, thus exhibiting the tradeoff between fault tolerance and performance. Power consumption, area, and delay estimates are also provided highlighting the efficiency of the mechanism. To do this, different router models (one for CMPs and one for MPSoCs) have been designed as a proof of concept.",
	address = "USA",
	doi = "10.1109/TCAD.2011.2119150",
	issn = "0278-0070",
	journal = "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems",
	keywords = "microprocessor chips;network routing;network-on-chip;",
	note = "cost-efficient on-chip routing implementations;chip multiprocessors;CMP;MPSoC Systems;many-core system-on-chip;networks-on-chip;communication scalability;latency constraints;area constraints;power constraints;application-level parallelism;power-aware techniques;topology regularity;universal logic-based distributed routing;logic-based mechanism;2D meshes;fault tolerance;fault performance;power consumption;",
	number = 4,
	pages = "534 - 47",
	title = "{C}ost-{E}fficient {O}n-{C}hip {R}outing {I}mplementations for {CMP} and {MPS}o{C} {S}ystems",
	url = "http://dx.doi.org/10.1109/TCAD.2011.2119150",
	volume = 30,
	year = 2011
}

, Jose Flich, Jose Duato, H Eberle and W Olesinski. A power-efficient network on-chip topology. In Proceedings of the Fifth International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip. 2011, 23–26. URL, DOI BibTeX

@conference{Camacho:2011:PNO:1930037.1930044,
	author = ", and Flich, Jose and Duato, Jose and H. Eberle and W. Olesinski",
	abstract = "NoCs have become a critical component in many-core architectures. Usually, the preferred topology is the 2D-Mesh as it enables a tile-based layout significantly reducing the design effort. However, new emerging challenges such as power consumption need to be addressed. Looking at the NoC, routers and links not being used must be switched off, thus achieving large power savings. Topology and routing algorithm must be carefully designed as they may lack enough flexibility to switch off components for long periods of time. We present the NR-Mesh (Nearest neighboR Mesh) topology. It gives an end node the choice to inject a message through different neighboring routers, thereby reducing hop count and saving latency. At the receiver side, a message may be delivered to the end node through different routers, thus reducing hop count further and increasing flexibility. When allowing links and routers to switch off and combined with adaptive routing, the power management technique is able to achieve significant power savings (up to 36% savings in static power consumed at routers). When compared with the 2D-Mesh, NR-Mesh reduces execution time by 23% and power consumption at routers by 47%.",
	address = "New York, NY, USA",
	booktitle = "Proceedings of the Fifth International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip",
	doi = "10.1145/1930037.1930044",
	isbn = "978-1-4503-0272-2",
	keywords = "Network-on-Chip; Power Efficient Chip Technology; Chip Topology; Routing Algorithms;",
	pages = "23--26",
	publisher = "ACM",
	series = "INA-OCMC '11",
	title = "{A} power-efficient network on-chip topology",
	url = "http://doi.acm.org/10.1145/1930037.1930044",
	year = 2011
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. How to reduce packet dropping in a bufferless NoC. Concurrency and Computation: Practice and Experience 23(1):86-99, 2011. URL, DOI BibTeX

@article{DBLP:journals/concurrency/RequenaGLD11,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on-chip (NoCs) interconnect the components located inside a chip. In multicore chips, NoCs have a strong impact on the overall system performance. NoC bandwidth is limited by the critical path delay. Recent works show that the critical path delay is heavily affected by switch port buffer size. Therefore, by removing buffers, switch clock frequency can be increased. Recently, a new switching technique for NoCs called Blind Packet Switching (BPS) has been proposed, which is based on removing the switch port buffers. Since buffers consume a high percentage of switch power and area, BPS not only improves performance but also reduces power and area. In BPS, as there are no buffers at the switch ports, packets cannot be stopped and stored on them. If contention arises packets are dropped and later reinjected, negatively affecting performance. In order to prevent packet dropping, some techniques based on resource replication have been proposed. In this paper, we propose some alternative and complementary techniques that do not rely on resource replication. By using them, packet dropping is highly reduced. In particular, packet dropping is completely removed for a very wide network traffic range. Moreover, network throughput is increased and packet latency is reduced. Copyright © 2010 John Wiley {\&} Sons, Ltd.",
	doi = "10.1002/cpe.1606",
	issn = "1532-0634",
	journal = "Concurrency and Computation: Practice and Experience",
	keywords = "networks on-chip;buffer limitations;packet dropping reduction",
	number = 1,
	pages = "86-99",
	title = "{H}ow to reduce packet dropping in a bufferless {N}o{C}",
	url = "http://dx.doi.org/10.1002/cpe.1606",
	volume = 23,
	year = 2011
}

, Jose Flich, Jose Duato, H Eberle and W Olesinski. Towards an Efficient NoC Topology through Multiple Injection Ports. In Digital System Design (DSD), 2011 14th Euromicro Conference on. 2011, 165 -172. DOI BibTeX

@conference{6037406,
	author = ", and Flich, Jose and Duato, Jose and H. Eberle and W. Olesinski",
	abstract = "In this paper, we present a flexible network on-chip topology: NR-Mesh (Nearest neighbor Mesh). The topology gives an end node the choice to inject a message through different neighboring routers, thereby reducing hop count and saving latency. At the receiver side, a message may be delivered to the end node through different routers, thus reducing hop count further and increasing flexibility when routing messages. This flexibility maximizes the number of network components that can be switched off, thus enabling power-aware routing algorithms. Additional benefits are reduced congestion/contention levels in the network, support for efficient broadcast operations, savings in power consumption, and partial fault-tolerance. Our second contribution is a power management technique for the adaptive routing. This technique turns router ports and their attached links on and off depending on traffic conditions. The power management technique is able to achieve significant power savings when there is low traffic in the network. We further compare the new topology with the 2D-Mesh, using either deterministic or adaptive routing. When compared with the 2D-Mesh using deterministic routing, executing real applications in a full system simulation platform, the NR-Mesh topology using adaptive routing is able to obtain significant savings: on average, a 7% reduction in execution time and a 75% reduction in energy consumption at the network for a 16-node CMP system. Similar numbers are achieved for a 32-node CMP system.",
	booktitle = "Digital System Design (DSD), 2011 14th Euromicro Conference on",
	doi = "10.1109/DSD.2011.25",
	keywords = "CMP system;NR-mesh topology;NoC topology;adaptive routing;broadcast operation;congestion level;contention level;deterministic routing;energy consumption;fault-tolerance;flexible network on-chip topology;hop count;injection port;nearest neighbor mesh;neigh",
	month = "Aug. 31 2011-Sept. 2",
	pages = "165 -172",
	title = "{T}owards an {E}fficient {N}o{C} {T}opology through {M}ultiple {I}njection {P}orts",
	year = 2011
}

Carles Hernández, Federico Silla and Jose Duato. Energy and Performance Efficient Thread Mapping in NoC-Based CMPs under Process Variations. In ICPP. 2011, 41-50. BibTeX

@conference{DBLP:conf/icpp/HernandezSD11,
	author = "Hern{\'a}ndez, Carles and Silla, Federico and Duato, Jose",
	booktitle = "ICPP",
	crossref = "DBLP:conf/icpp/2011",
	pages = "41-50",
	title = "{E}nergy and {P}erformance {E}fficient {T}hread {M}apping in {N}o{C}-{B}ased {CMP}s under {P}rocess {V}ariations",
	year = 2011
}

Héctor Montaner, Federico Silla, Holger Froning and Jose Duato. A new degree of freedom for memory allocation in clusters. Cluster Computing, pages 1 - 23, 2011. URL BibTeX

@article{IP51265029,
	author = "Montaner, H{\'e}ctor and Silla, Federico and Holger Froning and Duato, Jose",
	abstract = "Improvements in parallel computing hardware usually involve increments in the number of available resources for a given application such as the number of computing cores and the amount of memory. In the case of shared-memory computers, the increase in computing resources and available memory is usually constrained by the coherency protocol, whose overhead rises with system size, limiting the scalability of the final system. In this paper we propose an efficient and cost-effective way to increase the memory available for a given application by leveraging free memory in other computers in the cluster. Our proposal is based on the observation that many applications benefit from having more memory resources but do not require more computing cores, thus reducing the requirements for cache coherency and allowing a simpler implementation and better scalability. Simulation results show that, when additional mechanisms intended to hide remote memory latency are used, execution time of applications that use our proposal is similar to the time required to execute them in a computer populated with enough local memory, thus validating the feasibility of our proposal. We are currently building a prototype that implements our ideas. The first results from real executions in this prototype demonstrate not only that our proposal works but also that it can efficiently execute applications that make use of remote memory resources. © 2011 Springer Science+Business Media, LLC.",
	issn = "1386-7857",
	journal = "Cluster Computing",
	key = "Computer simulation",
	keywords = "Parallel architectures;Scalability;",
	note = "Cache coherency;Computing resource;Degree of freedom;Execution time;Free memory;Local memories;Memory allocation;Memory resources;Parallel Computing;Remote memory;Shared-memory computers;Simulation result;System size;",
	pages = "1 - 23",
	title = "{A} new degree of freedom for memory allocation in clusters",
	url = "http://dx.doi.org/10.1007/s10586-010-0150-7",
	year = 2011
}

Frank Olaf Sem-Jacobsen, Tor Skeie, Olav Lysne and Jose Duato. Dynamic fault tolerance in fat trees. IEEE Transactions on Computers 60(4):508 - 525, 2011. URL BibTeX

@article{20111013718747,
	author = "Frank Olaf Sem-Jacobsen and Tor Skeie and Olav Lysne and Duato, Jose",
	abstract = "Fat trees are a very common communication architecture in current large-scale parallel computers. The probability of failure in these systems increases with the number of components. We present a routing method for deterministically and adaptively routed fat trees, applicable to both distributed and source routing, that is able to handle several concurrent faults and that transparently returns to the original routing strategy once the faulty components have recovered. The method is local and dynamic, completely masking the fault from the rest of the system. It only requires a small extra functionality in the switches to handle rerouting packets around a fault. The method guarantees connectedness and deadlock and livelock freedom for up to k-1 benign simultaneous switch and/or link faults where k is half the number of ports in the switches. Our simulation experiments show a graceful degradation of performance as more faults occur. Furthermore, we demonstrate that for most fault combinations, our method will even be able to handle significantly more faults beyond the k-1 limit with high probability. © 2011 IEEE.",
	address = "445 Hoes Lane - P.O.Box 1331, Piscataway, NJ 08855-1331, United States",
	issn = "0018-9340",
	journal = "IEEE Transactions on Computers",
	key = "Quality assurance",
	keywords = "Computer architecture;Fault tolerance;",
	note = "Adaptive routing;deterministic routing;Dynamic faults;Fat trees;k-ary n-trees;",
	number = 4,
	pages = "508 - 525",
	title = "{D}ynamic fault tolerance in fat trees",
	url = "http://dx.doi.org/10.1109/TC.2010.97",
	volume = 60,
	year = 2011
}

Monica Serrano, Julio Sahuquillo, Salvador Petit, Houcine Hassan and Jose Duato. A cost-effective heuristic to schedule local and remote memory in cluster computers. Journal of Supercomputing, pages 1 - 19, 2011. URL BibTeX

@article{IP51286180,
	author = "Serrano, Monica and Sahuquillo, Julio and Petit, Salvador and Houcine Hassan and Duato, Jose",
	abstract = "Cluster computers represent a cost-effective alternative solution to supercomputers. In these systems, it is common to constrain the memory address space of a given processor to the local motherboard. Constraining the system in this way is much cheaper than using a full-fledged shared memory implementation among motherboards. However, memory usage among motherboards can be unfairly balanced. On the other hand, remote memory access (RMA) hardware provides fast interconnects among the motherboards of a cluster. RMA devices can be used to access remote RAM memory from a local motherboard. This work focuses on this capability in order to achieve a better global use of the total RAM memory in the system. More precisely, the address space of local applications is extended to remote motherboards and is used to access remote RAM memory. This paper presents an ideal memory scheduling algorithm and proposes a cost-effective heuristic to allocate local and remote memory among local applications. Compared to the devised ideal algorithm, the heuristic obtains the same (or closely resembling) results while largely reducing the computational cost. In addition, we analyze the impact on the performance of standalone applications varying the memory distribution among regions (local, local to board, and remote). Then, this study is extended to any number of concurrent applications. Experimental results show that a QoS parameter is needed in order to avoid unacceptable performance degradation. © 2011 Springer Science+Business Media, LLC.",
	issn = "0920-8542",
	journal = "Journal of Supercomputing",
	key = "Multitasking",
	keywords = "Cost effectiveness;Costs;Printed circuits;Random access storage;Scheduling algorithms;Supercomputers;",
	note = "Address space;Cluster computer;Computational costs;Global use;Memory address space;Memory usage;Performance degradation;QoS parameters;Remote memory;Remote memory access;Shared memories;Standalone applications;Work Focus;",
	pages = "1 - 19",
	title = "{A} cost-effective heuristic to schedule local and remote memory in cluster computers",
	url = "http://dx.doi.org/10.1007/s11227-011-0566-8",
	year = 2011
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. How to reduce packet dropping in a bufferless NoC. Concurrency Computation Practice and Experience 23(1):86 - 99, 2011. URL BibTeX

@article{20105213526965,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on-chip (NoCs) interconnect the components located inside a chip. In multicore chips, NoCs have a strong impact on the overall system performance. NoC bandwidth is limited by the critical path delay. Recent works show that the critical path delay is heavily affected by switch port buffer size. Therefore, by removing buffers, switch clock frequency can be increased. Recently, a new switching technique for NoCs called Blind Packet Switching (BPS) has been proposed, which is based on removing the switch port buffers. Since buffers consume a high percentage of switch power and area, BPS not only improves performance but also reduces power and area. In BPS, as there are no buffers at the switch ports, packets cannot be stopped and stored on them. If contention arises packets are dropped and later reinjected, negatively affecting performance. In order to prevent packet dropping, some techniques based on resource replication have been proposed. In this paper, we propose some alternative and complementary techniques that do not rely on resource replication. By using them, packet dropping is highly reduced. In particular, packet dropping is completely removed for a very wide network traffic range. Moreover, network throughput is increased and packet latency is reduced. Copyright © 2010 John Wiley {\&} Sons, Ltd.",
	address = "Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom",
	issn = "1532-0626",
	journal = "Concurrency Computation Practice and Experience",
	key = "Packet switching",
	keywords = "Signal filtering and prediction;",
	note = "buffer limitations;Buffer sizes;Clock frequency;Critical path delays;Multicore chips;Network throughput;Network traffic;On chips;Packet dropping;Packet latencies;Resource replication;Switch ports;Switch power;Switching techniques;",
	number = 1,
	pages = "86 - 99",
	title = "{H}ow to reduce packet dropping in a bufferless {N}o{C}",
	url = "http://dx.doi.org/10.1002/cpe.1606",
	volume = 23,
	year = 2011
}

Jesus Escudero-Sahuquillo, Pedro J Garcia, Francisco J Quiles, Jose Flich and Jose Duato. Cost-Effective Congestion Management for Interconnection Networks Using Distributed Deterministic Routing. In 16th International Conference on Parallel and Distributed Systems (ICPADS 2010). December 2010. BibTeX

@conference{icpads2010,
	author = "Jesus Escudero-Sahuquillo and Pedro J. Garcia and Francisco J. Quiles and Flich, Jose and Duato, Jose",
	abstract = "Interconnection networks are essential elements in current computing systems. For this reason, achieving the best network performance, even in congestion situations, has been a primary goal in recent years. In that sense, there exist several techniques focused on eliminating the main negative effect of congestion: the Head of Line (HOL) blocking. One of the most successful HOL blocking elimination techniques is RECN, which can be applied in source routing networks. FBICM follows the same approach as RECN, but it has been developed for distributed deterministic routing networks. Although FBICM effectively eliminates HOL blocking, it requires too many resources to be implemented. In this paper we present a new FBICM version, based on a new organization of switch memory resources, that significantly reduces the required silicon area, complexity and cost. Moreover, we present new results about FBICM, in network topologies not yet analyzed. From the experimental results we can conclude that a far less complex and feasible FBICM implementation can be achieved by using the proposed improvements, while not losing efficiency.",
	address = "Shanghai, China",
	booktitle = "16th International Conference on Parallel and Distributed Systems (ICPADS 2010)",
	keywords = "Deterministic Routing; Congestion Management; Head-Of-Line Blocking;",
	month = "December",
	title = "{C}ost-{E}ffective {C}ongestion {M}anagement for {I}nterconnection {N}etworks {U}sing {D}istributed {D}eterministic {R}outing",
	year = 2010
}

Antoni Roca, Jose Flich, Federico Silla and Jose Duato. VCTlite: Towards an Efficient Implementation of Virtual Cut-Through Switching in On-Chip Networks. In 17th Int'l Conference on High Performance Computing (HiPC) In Press. December 2010. BibTeX

@conference{roca-hipc10,
	author = "Roca, Antoni and Flich, Jose and Silla, Federico and Duato, Jose",
	address = "Goa, India",
	booktitle = "17th Int'l Conference on High Performance Computing (HiPC)",
	keywords = "on-chip networks; switching;",
	month = "December",
	title = "{VCT}lite: {T}owards an {E}fficient {I}mplementation of {V}irtual {C}ut-{T}hrough {S}witching in {O}n-{C}hip {N}etworks",
	volume = "In Press",
	year = 2010
}

Alberto Ros, Blas Cuesta Sáez, Ricardo Fernández-Pascual, Maria E Gomez, Manuel E Acacio, Antonio Robles, José M García and Jose Duato. EMC^2: Extending Magny-Cours Coherence for Large-Scale Servers. In 17th Int'l Conference on High Performance Computing (HiPC) In Press, Accepted. December 2010. BibTeX

@conference{aros-hipc10,
	author = "Ros, Alberto and Cuesta S{\'a}ez, Blas and Ricardo Fern{\'a}ndez-Pascual and Gomez, Maria E. and Manuel E. Acacio and Robles, Antonio and Jos{\'e} M. Garc{\'i}a and Duato, Jose",
	abstract = "The demand for larger and more powerful high-performance shared-memory servers has been growing over the last few years. To meet this need, AMD has recently launched the twelve-core Magny-Cours processors. They include a directory cache (Probe Filter) that increases the scalability of the coherence protocol applied by Opterons, based on coherent HyperTransport interconnect (cHT). cHT limits the number of nodes that can be addressed to 8. The recent High Node Count HT specification overcomes this limitation. However, the 3-bit pointer used by the Probe Filter prevents Magny-Cours-based servers from being built beyond 8 nodes. In this paper, we propose and develop an external logic to extend the coherence domain of Magny-Cours processors beyond the 8-node limit while maintaining the advantages provided by the Probe Filter. Evaluation results for up to a 32-node system show how the performance offered by our solution scales with the increment in the number of nodes, enhancing the Probe Filter effectiveness by filtering additional messages. Particularly, we reduce runtime by 47% in a 32-die system with respect to the 8-die Magny-Cours system.",
	address = "Goa, India",
	booktitle = "17th Int'l Conference on High Performance Computing (HiPC)",
	month = "December",
	title = "{EMC}^2: {E}xtending {M}agny-{C}ours {C}oherence for {L}arge-{S}cale {S}ervers",
	volume = "In Press, Accepted",
	year = 2010
}

J Flich and D Bertozzi (eds.). Designing Network On-Chip Architectures in the Nanoscale Era. CRC Press, December 2010. URL BibTeX

@book{365336,
	author = "Gilabert, Francisco and Silla, Federico and Gomez, Maria E. and Lodde, Mario and Roca, Antoni and Flich, Jose and Duato, Jose and Hern{\'a}ndez, Carles and Rodrigo, Samuel",
	abstract = "Going beyond isolated research ideas and design experiences, Designing Network On-Chip Architectures in the Nanoscale Era covers the foundations and design methods of network on-chip (NoC) technology. The contributors draw on their own lessons learned to provide strong practical guidance on various design issues. Exploring the design process of the network, the first part of the book focuses on basic aspects of switch architecture and design, topology selection, and routing implementation. In the second part, contributors discuss their experiences in the industry, offering a roadmap to recent products. They describe Tilera’s TILE family of multicore processors, novel Intel products and research prototypes, and the TRIPS operand network (OPN). The last part reveals state-of-the-art solutions to hardware-related issues and explains how to efficiently implement the programming model at the network interface. In the appendix, the microarchitectural details of two switch architectures targeting multiprocessor system-on-chips (MPSoCs) and chip multiprocessors (CMPs) can be used as an experimental platform for running tests. A stepping stone to the evolution of future chip architectures, this volume provides a how-to guide for designers of current NoCs as well as designers involved with 2015 computing platforms. It cohesively brings together fundamental design issues, alternative design paradigms and techniques, and the main design tradeoffs—consistently focusing on topics most pertinent to real-world NoC designers.",
	editor = "Flich, J. and Bertozzi, D.",
	isbn = 9781439837108,
	keywords = "Network on chip;Chip Architectures;",
	month = "December",
	publisher = "CRC Press",
	title = "{D}esigning {N}etwork {O}n-{C}hip {A}rchitectures in the {N}anoscale {E}ra",
	url = "http://www.crcpress.com/product/isbn/9781439837108",
	year = 2010
}

M Serrano, Julio Sahuquillo, Houcine Hassan Mohamed, Salvador Petit and Jose Duato. A Scheduling Heuristic to Handle Local and Remote Memory in Cluster Computers. In High Performance Computing and Communications (HPCC), 2010 12th IEEE International Conference on. 2010, 35 -42. URL, DOI BibTeX

@conference{5581321,
	author = "M. Serrano and Sahuquillo, Julio and Mohamed, Houcine Hassan and Petit, Salvador and Duato, Jose",
	abstract = "In cluster computers, RAM memory is spread among the motherboards hosting the running applications. In these systems, it is common to constrain the memory address space of a given processor to the local motherboard. Constraining the system in this way is much cheaper than using a full-fledged shared memory implementation among motherboards. However, in this case, memory usage might widely differ among motherboards depending on the memory requirements of the applications running on each motherboard. In this context, if an application requires a huge quantity of RAM memory, the only feasible solution is to increase the amount of available memory in its local motherboard, even if the remaining ones are underused. Nevertheless, beyond a certain memory size, this memory budget increase becomes prohibitive. In this paper, we assume that the Remote Memory Access hardware used in a Hyper Transport based system allows applications to allocate the required memory from remote motherboards. We also analyze how the distribution of memory accesses among different memory locations (local or remote) impact on performance. Finally, an heuristic is devised to schedule local and remote memory among applications according to their requirements, and considering quality of service constraints.",
	booktitle = "High Performance Computing and Communications (HPCC), 2010 12th IEEE International Conference on",
	doi = "10.1109/HPCC.2010.75",
	isbn = "978-1-4244-8335-8",
	keywords = "hyper transport based system;local memory handling;random access memory;remote memory access hardware;remote memory handling;remote motherboards;scheduling heuristic;random-access storage;scheduling;storage management;",
	month = "sept.",
	pages = "35 -42",
	title = "{A} {S}cheduling {H}euristic to {H}andle {L}ocal and {R}emote {M}emory in {C}luster {C}omputers",
	url = "http://dx.doi.org/10.1109/HPCC.2010.75",
	year = 2010
}

Antoni Roca, Jose Flich, Federico Silla and Jose Duato. A Latency-Efficient Router Architecture for CMP Systems. In Digital System Design: Architectures, Methods and Tools (DSD), 2010 13th Euromicro Conference on. 2010, 165 -172. URL, DOI BibTeX

@conference{5615623,
	author = "Roca, Antoni and Flich, Jose and Silla, Federico and Duato, Jose",
	abstract = "As technology advances, the number of cores in Chip Multi Processor systems (CMPs) and Multi Processor Systems-on-Chips (MPSoCs) keeps increasing. Current test chips and products reach tens of cores, and it is expected to reach hundreds of cores in the near future. Such complexity demands for an efficient network-on-chip (NoC). The common choice to build such networks is the 2D mesh topology (as it matches the regular tile-based design) and the Dimension-Order Routing (DOR) algorithm (because its simplicity). The network in such systems must provide sustained throughput and ultra low latencies. One of the key components in the network is the router, and thus, it plays a major role when designing for such performance levels. In this paper we propose a new pipelined router design focused in reducing the router latency. As a first step we identify the router components that take most of the critical path, and thus limit the router frequency. In particular, the arbiter is the one limiting the performance of the router. Based on this fact, we simplify the arbiter logic by using multiple smaller arbiters. The initial set of requests in the initial arbiter is then distributed over the smaller arbiters that operate in parallel. With this design procedure, and with a proper internal router organization, different router architectures are evolved. All of them enable the use of smaller arbiters in parallel by replicating ports and assuming the use of the DOR algorithm. The net result of such changes is a faster router. Preliminary results demonstrate a router latency reduction ranging from 10% to 21% with an increase of the router area. Network latency is reduced in a range from 11% to 15%.",
	booktitle = "Digital System Design: Architectures, Methods and Tools (DSD), 2010 13th Euromicro Conference on",
	doi = "10.1109/DSD.2010.42",
	isbn = "978-1-4244-7839-2",
	keywords = "arbiter design;low latency router;network-on-chip;router architecture;router design",
	month = "sept.",
	pages = "165 -172",
	title = "{A} {L}atency-{E}fficient {R}outer {A}rchitecture for {CMP} {S}ystems",
	url = "http://dx.doi.org/10.1109/DSD.2010.42",
	year = 2010
}

Héctor Montaner, Federico Silla, H Fröning and Jose Duato. Getting Rid of Coherency Overhead for Memory-Hungry Applications. In Cluster Computing (CLUSTER), 2010 IEEE International Conference on. 2010, 48 -57. URL, DOI BibTeX

@conference{5600323,
	author = {Montaner, H{\'e}ctor and Silla, Federico and H. Fr{\"o}ning and Duato, Jose},
	abstract = "Current commercial solutions intended to provide additional resources to an application being executed in a cluster usually aggregate processors and memory from different nodes. In this paper we present a 16-node prototype for a shared-memory cluster architecture that follows a different approach by decoupling the amount of memory available to an application from the processing resources assigned to it. In this way, we provide a new degree of freedom so that the memory granted to a process can be expanded with the memory from other nodes in the cluster without increasing the number of processors used by the program. This feature is especially suitable for memory-hungry applications that demand large amounts of memory but present a parallelization level that prevents them from using more cores than available in a single node. The main advantage of this approach is that an application can use more memory from other nodes without involving the processors, and caches, from those nodes. As a result, using more memory no longer implies increasing the coherence protocol overhead because the number of caches involved in the coherent domain has become independent from the amount of available memory. The prototype we present in this paper leverages this idea by sharing 128GB of memory among the cluster. Real executions show the feasibility of our prototype and its scalability.",
	booktitle = "Cluster Computing (CLUSTER), 2010 IEEE International Conference on",
	doi = "10.1109/CLUSTER.2010.14",
	keywords = "16-node prototype;coherence protocol overhead;coherent domain;memory decoupling;memory hungry application;parallelization level;processing resource;shared memory cluster architecture;cache storage;memory architecture;pattern clustering;program processors;",
	month = "sept.",
	pages = "48 -57",
	title = "{G}etting {R}id of {C}oherency {O}verhead for {M}emory-{H}ungry {A}pplications",
	url = "http://dx.doi.org/10.1109/CLUSTER.2010.14",
	year = 2010
}

Héctor Montaner, Federico Silla and Jose Duato. A practical way to extend shared memory support beyond a motherboard at low cost. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. June 2010, 155-166. URL, DOI BibTeX

@conference{Montaner:2010:PWE:1851476.1851495,
	author = "Montaner, H{\'e}ctor and Silla, Federico and Duato, Jose",
	abstract = "Improvements in parallel computing hardware usually involve increments in the number of available resources for a given application such as the number of computing cores and the amount of memory. In the case of shared-memory computers, the increase in computing resources and available memory is usually constrained by the coherency protocol, whose overhead rises with system size, limiting the scalability of the final system. In this paper we propose an efficient and cost-effective way to increase the memory available for a given application by leveraging free memory in other computers in the cluster. Our proposal is based on the observation that many applications benefit from having more memory resources but do not require more computing cores, thus reducing the requirements for cache coherency and allowing a simpler implementation and better scalability. Simulation results show that, when additional mechanisms intended to hide remote memory latency are used, execution time of applications that use our proposal is similar to the time required to execute them in a computer populated with enough local memory, thus validating the feasibility of our proposal. We are currently building a prototype that implements our ideas.",
	address = "Chicago, Illinois",
	booktitle = "Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing",
	doi = "10.1145/1851476.1851495",
	isbn = "978-1-60558-942-8",
	keywords = "memory;",
	month = "June",
	pages = "155-166",
	publisher = "ACM",
	series = "HPDC '10",
	title = "{A} practical way to extend shared memory support beyond a motherboard at low cost",
	url = "http://doi.acm.org/10.1145/1851476.1851495",
	year = 2010
}

Teresa Nachiondo, Jose Flich and Jose Duato. Buffer Management Strategies to Reduce HoL Blocking. Parallel and Distributed Systems, IEEE Transactions on 21(6):739 - 753, June 2010. URL, DOI BibTeX

@article{4815231,
	author = "Nachiondo, Teresa and Flich, Jose and Duato, Jose",
	abstract = "Congestion management is likely to become a critical issue in interconnection networks, as increasing power consumption and cost concerns lead to improve the efficiency of network resources. In previous configurations, networks were usually overdimensioned and underutilized. In a smaller network, however, contention is more likely to happen and blocked packets introduce head-of-line (HoL) blocking to the rest of packets spreading congestion quickly. The best-known solution to HoL blocking is Virtual Output Queues (VOQs). However, the cost of implementing VOQs increases quadratically with the number of output ports in the network, thus, being unpractical. Therefore, a more scalable and cost-effective solution is required to reduce or eliminate HoL blocking. In this paper, we present methodologies, referred to as Destination-Based Buffer Management (DBBM), to reduce/eliminate the HoL blocking effect on interconnection networks. DBBM efficiently uses the resources (mainly memory queues) of the network. These methodologies are comprehensively evaluated in terms of throughput, scalability and fairness. Results show that the use of the DBBM strategy with a reduced number of queues at each switch is able to obtain roughly the same throughput as the VOQ mechanism. Moreover, all the proposed strategies are designed in such a way that can be used in any switch architecture. We compare DBBM with RECN, a sophisticated mechanism that eliminates HoL blocking in congestion situations. Our mechanism is able to achieve almost the same performance with very low logic requirements (in contrast with RECN).",
	doi = "10.1109/TPDS.2009.63",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "computer network management; quality of service; storage management; telecommunication congestion control",
	month = "June",
	number = 6,
	pages = "739 - 753",
	title = "{B}uffer {M}anagement {S}trategies to {R}educe {H}o{L} {B}locking",
	url = "http://dx.doi.org/10.1109/TPDS.2009.63",
	volume = 21,
	year = 2010
}

Samuel Rodrigo, Jose Flich, Antoni Roca, S Medardoni, D Bertozzi, Federico Silla and Jose Duato. Addressing Manufacturing Challenges with Cost-Efficient Fault Tolerant Routing. In Networks-on-Chip (NOCS), 2010 Fourth ACM/IEEE International Symposium on. May 2010, 25 -32. URL, DOI BibTeX

@conference{5507564,
	author = "Rodrigo, Samuel and Flich, Jose and Roca, Antoni and S. Medardoni and D. Bertozzi and Silla, Federico and Duato, Jose",
	abstract = "The high-performance computing domain is enriching with the inclusion of Networks-on-chip (NoCs) as a key component of many-core (CMPs or MPSoCs) architectures. NoCs face the communication scalability challenge while meeting tight power, area and latency constraints. Designers must address new challenges that were not present before. Defective components, the enhancement of application-level parallelism or power-aware techniques may break topology regularity, thus, efficient routing becomes a challenge. In this paper, uLBDR (Universal Logic-Based Distributed Routing) is proposed as an efficient logic-based mechanism that adapts to any irregular topology derived from 2D meshes, being an alternative to the use of routing tables (either at routers or at end-nodes). uLBDR requires a small set of configuration bits, thus being more practical than large routing tables implemented in memories. Several implementations of uLBDR are presented highlighting the trade-off between routing cost and coverage. The alternatives span from the previously proposed LBDR approach (with 30% of coverage) to the uLBDR mechanism achieving full coverage. This comes with a small performance cost, thus exhibiting the trade-off between fault tolerance and performance.",
	booktitle = "Networks-on-Chip (NOCS), 2010 Fourth ACM/IEEE International Symposium on",
	doi = "10.1109/NOCS.2010.12",
	keywords = "NoC;addressing manufacturing challenges;application level parallelism;cost efficient fault tolerant routing;logic based mechanism;networks-on-chip;power aware techniques;universal logic based distributed routing;network routing;network topology;network-on",
	month = "may",
	pages = "25 -32",
	title = "{A}ddressing {M}anufacturing {C}hallenges with {C}ost-{E}fficient {F}ault {T}olerant {R}outing",
	url = "http://dx.doi.org/10.1109/NOCS.2010.12",
	year = 2010
}

Carles Hernández, Antoni Roca, Federico Silla, Jose Flich and Jose Duato. Improving the Performance of GALS-Based NoCs in the Presence of Process Variation. In 2010 ACM/IEEE International Symposium on Networks-on-Chip (NOCS). May 2010, 35 - 42. URL, DOI BibTeX

@conference{11416504,
	author = "Hern{\'a}ndez, Carles and Roca, Antoni and Silla, Federico and Flich, Jose and Duato, Jose",
	abstract = "Current integration scales allow designing chip multiprocessors (CMP) where cores are interconnected by means of a network-on-chip (NoC). Unfortunately, the small feature size of current integration scales cause some unpredictability in manufactured devices because of process variation. In NoCs, variability may affect links and routers causing that they do not match the parameters established at design time. In this paper we first analyze the way that manufacturing deviations affect the components of a NoC by applying a comprehensive and detailed variability model to 200 instances of an 8×8 mesh NoC synthesized using 45 nm technology. A second contribution of this paper is showing that GALS-based NoCs present communication bottlenecks under process variation. To overcome this performance reduction we draft a novel approach, called performance domains, intended to reduce the negative impact of variability on application execution time. This mechanism is suitable when several applications are simultaneously running in the CMP chip.",
	address = "Grenoble, France",
	booktitle = "2010 ACM/IEEE International Symposium on Networks-on-Chip (NOCS)",
	doi = "10.1109/NOCS.2010.13",
	journal = "2010 ACM/IEEE International Symposium on Networks-on-Chip (NOCS)",
	keywords = "integrated circuit design;large scale integration;network-on-chip;performance evaluation;",
	month = "May",
	note = "GALS-based NoCs;chip multiprocessors;network-on-chip;manufacturing deviations;process variation;performance domains;integration scales;",
	pages = "35 - 42",
	publisher = "ACM",
	title = "{I}mproving the {P}erformance of {GALS}-{B}ased {N}o{C}s in the {P}resence of {P}rocess {V}ariation",
	url = "http://dx.doi.org/10.1109/NOCS.2010.13",
	year = 2010
}

Carles Hernández, Federico Silla and Jose Duato. A Methodology for the Characterization of Process Variation in NoC Links. In 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010). March 2010, 685-690. URL BibTeX

@conference{11283352,
	author = "Hern{\'a}ndez, Carles and Silla, Federico and Duato, Jose",
	abstract = "Associated with the ever growing integration scales is the increase in process variability. In the context of network-on-chip, this variability affects the maximum frequency that could be sustained by each link that interconnects two cores in a chip multiprocessor. In this paper we present a methodology to model delay variations in NoC links. We also show its application to several technologies, namely 45nm, 32nm, 22nm, and 16nm. Simulation results show that conclusions about variability greatly depend on the implementation context.",
	address = "Dresden, Germany",
	booktitle = "2010 Design, Automation {\&} Test in Europe Conference {\&} Exhibition (DATE 2010)",
	isbn = "978-3-9810801-6-2",
	journal = "2010 Design, Automation {\&} Test in Europe Conference {\&} Exhibition (DATE 2010)",
	keywords = "multiprocessor interconnection networks;network-on-chip;",
	month = "March",
	note = "process variation;NoC Links;network-on-chip;chip multiprocessor;process variability;",
	pages = "685-690",
	publisher = "EDDA",
	title = "{A} {M}ethodology for the {C}haracterization of {P}rocess {V}ariation in {N}o{C} {L}inks",
	url = "http://www.date-conference.com/proceedings/PAPERS/2010/DATE10/PDFFILES/06.3_2.PDF",
	year = 2010
}

Diana B Rayo, Julio Sahuquillo, Houcine Hassan Mohamed, Salvador Petit and Jose Duato. Balancing Task Resource Requirements in Embedded Multithreaded Multicore Processors to Reduce Power Consumption. In Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010). February 2010, 200 - 4. URL, DOI BibTeX

@conference{11260697,
	author = "Rayo, Diana B. and Sahuquillo, Julio and Mohamed, Houcine Hassan and Petit, Salvador and Duato, Jose",
	abstract = "Power consumption is a major design issue in modern microprocessors. Hence, power reduction techniques, like dynamic voltage scaling (DVS), are being widely implemented. Unfortunately, they impact on the task execution time so difficulting schedulability of hard real-time applications. To deal with this problem, this paper proposes a power-aware scheduler for coarse-grain embedded multicore processors implementing global DVS. To this end, this work presents two heuristics, namely Balanced Memory and Balanced CPU, which distribute the task set among cores focusing on resource utilization. Results show that with respect to a system not implementing DVS, two or five DVS levels achieve energy savings by about 35% or 51%, respectively.",
	booktitle = "Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010)",
	doi = "10.1109/PDP.2010.64",
	isbn = "978-1-4244-5672-7",
	issn = "1066-6192",
	journal = "Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010)",
	keywords = "microprocessor chips;multi-threading;power consumption;scheduling;",
	month = "Feb",
	note = "task resource requirements;embedded multithreaded multicore processors;power consumption reduction;power reduction techniques;dynamic voltage scaling;global DVS;",
	pages = "200 - 4",
	title = "{B}alancing {T}ask {R}esource {R}equirements in {E}mbedded {M}ultithreaded {M}ulticore {P}rocessors to {R}educe {P}ower {C}onsumption",
	url = "http://dx.doi.org/10.1109/PDP.2010.64",
	year = 2010
}

Marina Alonso, Salvador Coll, Juan Miguel Martínez, Vicente Santonja, Pedro Lopez and Jose Duato. Power saving in regular interconnection networks. Parallel Computing 36(12):696 - 712, 2010. URL, DOI BibTeX

@article{MarinaAlonso|Coll2010696,
	author = "Alonso, Marina and Coll, Salvador and Mart{\'i}nez, Juan Miguel and Santonja, Vicente and Lopez, Pedro and Duato, Jose",
	abstract = "The high level of computing power required for some applications can only be achieved by multiprocessor systems. These systems consist of several processors that communicate by means of an interconnection network. The huge increase both in size and complexity of high-end multiprocessor systems has triggered up their power consumption. Complex cooling systems are needed, which, in turn, increases power consumption. Power consumption reduction techniques are being applied everywhere in computer systems and the interconnection network is not an exception, as its contribution is not negligible. In this paper, we propose a mechanism to reduce interconnect power consumption that combines two alternative techniques: (i) dynamically switching on and off network links as a function of traffic (any link can be switched off, provided that network connectivity is guaranteed), (ii) dynamically reducing the available network bandwidth when traffic becomes low. In both cases, the topology of the network is not modified. Therefore, the same routing algorithm can be used regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that the network power consumption can be greatly reduced, at the expense of some increase in latency. However, the achieved power reduction is always higher than the latency penalty.",
	doi = "10.1016/j.parco.2010.08.003",
	issn = "0167-8191",
	journal = "Parallel Computing",
	keywords = "Power saving; Interconnection networks; Routing",
	number = 12,
	pages = "696 - 712",
	title = "{P}ower saving in regular interconnection networks",
	url = "http://www.sciencedirect.com/science/article/B6V12-50VTWG7-1/2/7972b8869966237a0ab6b680fd5fa6ba",
	volume = 36,
	year = 2010
}

Jose Duato, Antonio José Peña, Federico Silla, Rafael Mayo and Enrique S Quintana-Orti. RCUDA: Reducing the number of GPU-based accelerators in high performance clusters. In High Performance Computing and Simulation (HPCS), 2010 International Conference on. 2010, 224 - 231. URL BibTeX

@conference{20103913258676,
	author = "Duato, Jose and Pe{\~n}a, Antonio Jos{\'e} and Silla, Federico and Rafael Mayo and Enrique S. Quintana-Orti",
	abstract = "The increasing computing requirements for GPUs (Graphics Processing Units) have favoured the design and marketing of commodity devices that nowadays can also be used to accelerate general purpose computing. Therefore, future high performance clusters intended for HPC (High Performance Computing) will likely include such devices. However, high-end GPU-based accelerators used in HPC feature a considerable energy consumption, so that attaching a GPU to every node of a cluster has a strong impact on its overall power consumption. In this paper we detail a framework that enables remote GPU acceleration in HPC clusters, thus allowing a reduction in the number of accelerators installed in the cluster. This leads to energy, acquisition, maintenance, and space savings. ©2010 IEEE.",
	address = "Caen, France",
	booktitle = "High Performance Computing and Simulation (HPCS), 2010 International Conference on",
	journal = "Proceedings of the 2010 International Conference on High Performance Computing and Simulation, HPCS 2010",
	key = "Energy conservation",
	keywords = "Energy utilization;Program processors;",
	note = "Clusters;CUDA;Energy saving;High performance computing;Virtualizations;",
	pages = "224 - 231",
	title = "{RCUDA}: {R}educing the number of {GPU}-based accelerators in high performance clusters",
	url = "http://dx.doi.org/10.1109/HPCS.2010.5547126",
	year = 2010
}

P Morillo, S Rueda, J M Orduna and Jose Duato. Ensuring the performance and scalability of peer-to-peer distributed virtual environments. Future Generation Computer Systems 26(7):905 - 915, 2010. URL BibTeX

@article{20103413166817,
	author = "P. Morillo and S. Rueda and J.M. Orduna and Duato, Jose",
	abstract = "Large scale distributed virtual environments (DVEs) have become a major trend in distributed applications. Peer-to-peer (P2P) architectures have been proposed as an efficient and truly scalable solution for these kinds of systems. However, in order to design efficient P2P DVEs these systems must be characterized, measuring the impact of different client behavior on system performance. This paper presents the experimental characterization of P2P DVEs. The results show that the saturation of a given client has an exclusive effect on the surrounding clients in the virtual world, having no noticeable effect at all on the rest of clients. Nevertheless, the interactions among clients that can take place in this types of systems can lead to the temporal saturation of an unbounded number of clients, thus limiting the performance of P2P DVEs. In this paper, we also discuss and propose a technique for avoiding the saturation of the client computers in P2P DVEs. The evaluation results show that the performance and the scalability of P2P DVEs are significantly improved. These results can be used as the basis for an efficient design of P2P DVEs. © 2010 Elsevier B.V. All rights reserved.",
	address = "P.O. Box 211, Amsterdam, 1000 AE, Netherlands",
	booktitle = "Future Generation Computer Systems",
	journal = "Future Generation Computer Systems",
	key = "Peer to peer networks",
	keywords = "Adaptive filtering;Distributed computer systems;Scalability;Virtual reality;",
	note = "Distributed applications;Distributed Virtual Environments;Efficient designs;Evaluation results;Experimental characterization;Peer to peer;Peer-to-peer architectures;Performance evaluation;Scalable solution;Virtual worlds;",
	number = 7,
	pages = "905 - 915",
	title = "{E}nsuring the performance and scalability of peer-to-peer distributed virtual environments",
	url = "http://dx.doi.org/10.1016/j.future.2010.03.003",
	volume = 26,
	year = 2010
}

Jose Duato, Francisco D Igual, Rafael Mayo, Antonio José Peña, Enrique S Quintana-Orti and Federico Silla. An efficient implementation of GPU virtualization in high performance clusters. In Euro-Par 2009 – Parallel Processing Workshops 6043 LNCS. 2010, 385 - 394. URL BibTeX

@conference{20102913080626,
	author = "Duato, Jose and Francisco D. Igual and Rafael Mayo and Pe{\~n}a, Antonio Jos{\'e} and Enrique S. Quintana-Orti and Silla, Federico",
	abstract = "Current high performance clusters are equipped with high bandwidth/low latency networks, lots of processors and nodes, very fast storage systems, etc. However, due to economical and/or power related constraints, in general it is not feasible to provide an accelerating co-processor -such as a graphics processor (GPU)- per node. To overcome this, in this paper we present a GPU virtualization middleware, which makes remote CUDA-compatible GPUs available to all the cluster nodes. The software is implemented on top of the sockets application programming interface, ensuring portability over commodity networks, but it can also be easily adapted to high performance networks. © 2010 Springer-Verlag.",
	address = "Delft, Netherlands",
	booktitle = "Euro-Par 2009 – Parallel Processing Workshops",
	issn = "0302-9743",
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Data storage equipment",
	keywords = "Application programming interfaces;Computer graphics equipment;Computer software portability;Middleware;Nanotechnology;Program processors;",
	note = "Cluster nodes;Co-processors;Efficient implementation;Graphics processor;High performance cluster;High performance computing;High performance networks;Storage systems;Virtualizations;",
	pages = "385 - 394",
	title = "{A}n efficient implementation of {GPU} virtualization in high performance clusters",
	url = "http://dx.doi.org/10.1007/978-3-642-14122-5_44",
	volume = "6043 LNCS",
	year = 2010
}

Joan-Lluis Ferrer, Elvira Baydal, Antonio Robles, Pedro Lopez and Jose Duato. A Scalable and Early Congestion Management Mechanism for MINs. In Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-Based Processing, PDP 2010. 2010, 43 - 50. URL BibTeX

@conference{11260741,
	author = "Ferrer, Joan-Lluis and Baydal, Elvira and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = "Several packet marking-based mechanisms have been proposed to manage congestion in multistage interconnection networks. One of them, the MVCM mechanism obtains very good results for different network configurations and traffic loads. However, as MVCM applies full virtual output queuing at origin, its memory requirements may jeopardize its scalability. Additionally, the applied packet marking technique introduces certain delay to detect congestion. In this paper, we propose and evaluate the Scalable Early Congestion Management mechanism which eliminates the drawbacks exhibited by MVCM. The new mechanism replaces the full virtual output queuing at origin by either a partial virtual output queuing or a shared buffer, in order to reduce its memory requirements, thus making the mechanism scalable. Also, it applies an improved packet marking technique based on marking packets at output buffers regardless of their marking at input buffers, which simplifies the marking technique, allowing also a sooner detection of the root of a congestion tree.",
	address = "Piscataway, NJ, USA",
	booktitle = "Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-Based Processing, PDP 2010",
	journal = "Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010)",
	keywords = "multistage interconnection networks;",
	note = "packet marking based mechanisms;multistage interconnection networks;MVCM mechanism;virtual output queuing;scalable early congestion management mechanism;shared buffer;",
	pages = "43 - 50",
	title = "{A} {S}calable and {E}arly {C}ongestion {M}anagement {M}echanism for {MIN}s",
	url = "http://dx.doi.org/10.1109/PDP.2010.36",
	year = 2010
}

Ricardo Fernandez-Pascual, Jose M Garcia, Manuel E Acacio and Jose Duato. Dealing with transient faults in the interconnection network of CMPs at the cache coherence level. IEEE Transactions on Parallel and Distributed Systems 21(8):1117 - 1131, 2010. URL BibTeX

@article{20102713062175,
	author = "Ricardo Fernandez-Pascual and Jose M. Garcia and Manuel E. Acacio and Duato, Jose",
	abstract = "The importance of transient faults is predicted to grow due to current technology trends of increased scale of integration. One of the components that will be significantly affected by transient faults is the interconnection network of chip multiprocessors (CMPs). To deal efficiently with these faults and differently from other authors, we propose to use fault-tolerant cache coherence protocols that ensure the correct execution of programs when not all messages are correctly delivered. We describe the extensions made to a directory-based cache coherence protocol to provide fault tolerance and provide a modified set of token counting rules which are useful to design fault-tolerant token-based cache coherence protocols. We compare the directory-based fault-tolerant protocol with a token-based fault-tolerant one. We also show how to adjust the fault tolerance parameters to achieve the desired level of fault tolerance and measure the overhead achieved to be able to support very high fault rates. Simulation results using a set of scientific, multimedia, and commercial applications show that the fault tolerance measures have virtually no impact on execution time with respect to a non-fault-tolerant protocol. Additionally, our protocols can support very high rates of transient faults at the cost of slightly increased network traffic. © 2006 IEEE.",
	address = "445 Hoes Lane - P.O.Box 1331, Piscataway, NJ 08855-1331, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Quality assurance",
	keywords = "Fault tolerance;Interconnection networks;Multiprocessing systems;Network protocols;Packet networks;",
	note = "cache coherence;Cache coherence protocols;Chip Multiprocessor;Commercial applications;Current technology;Design faults;Execution time;Fault rates;Fault tolerant protocols;Fault-tolerant;High rate;Network traffic;Scale of integration;Simulation result;Transient faults;",
	number = 8,
	pages = "1117 - 1131",
	title = "{D}ealing with transient faults in the interconnection network of {CMP}s at the cache coherence level",
	url = "http://dx.doi.org/10.1109/TPDS.2009.148",
	volume = 21,
	year = 2010
}

Samuel Rodrigo, Carles Hernández, Jose Flich, Federico Silla, Jose Duato, S Medardoni, D Bertozzi, D Dai and . Yield-oriented evaluation methodology of network-on-chip routing implementations. In System-on-Chip, 2009. SOC 2009. International Symposium on. 2009, 100 -105. URL, DOI BibTeX

@conference{5335667,
	author = "Rodrigo, Samuel and Hern{\'a}ndez, Carles and Flich, Jose and Silla, Federico and Duato, Jose and S. Medardoni and D. Bertozzi and D. Dai and ,",
	abstract = "Network-on-Chip technology is gaining wide popularity for the interconnection of an increasing number of processor cores on the same silicon die. However, growing process variations cause interconnect malfunction or prevent the network from working at the intended frequency, directly impacting yield and manufacturing cost. Topology agnostic routing algorithms have the potential to tolerate process variations without degrading performance. We propose a three step methodology for evaluating routing algorithms in their ability to deal with variability. Using yield enhancement and operation speed preservation as the criteria, we demonstrate how this methodology can be used to select the best design choice among several plausible combinations of routing algorithms and implementations. Also, we show how an efficient table-less routing implementation can be used to minimise the impact of variability on manufacturing and operating frequency.",
	booktitle = "System-on-Chip, 2009. SOC 2009. International Symposium on",
	doi = "10.1109/SOCC.2009.5335667",
	keywords = "Si;interconnect malfunction;network-on-chip routing;processor core interconnection;silicon die;yield enhancement;yield operation;yield oriented evaluation;integrated circuit interconnections;integrated circuit yield;microprocessor chips;network-on-chip;si",
	month = "oct.",
	pages = "100 -105",
	title = "{Y}ield-oriented evaluation methodology of network-on-chip routing implementations",
	url = "http://dx.doi.org/10.1109/SOCC.2009.5335667",
	year = 2009
}

Samuel Rodrigo, S Medardoni, Jose Flich, D Bertozzi and Jose Duato. Efficient implementation of distributed routing algorithms for NoCs. Computers Digital Techniques, IET 3(5):460 -475, September 2009. DOI BibTeX

@article{5200571,
	author = "Rodrigo, Samuel and S. Medardoni and Flich, Jose and D. Bertozzi and Duato, Jose",
	abstract = "Chip multiprocessors (CMPs) are gaining momentum in the high-performance computing domain. Networks-on-chip (NoCs) are key components of CMP architectures, in that they have to deal with the communication scalability challenge while meeting tight power, area and latency constraints. 2D mesh topologies are usually preferred by designers of general purpose NoCs. However, manufacturing faults may break their regularity. Moreover, resource management frameworks may require the segmentation of the network into irregular regions. Under these conditions, efficient routing becomes a challenge. Although the use of routing tables at switches is flexible, it does not scale in terms of latency and area due to its memory requirements. Logic-based distributed routing (LBDR) is proposed as a new routing method that removes the need for routing tables at all. LBDR enables the implementation of many routing algorithms on most of the practical topologies we may find in the near future in a multi-core system. From an initial topology and routing algorithm, a set of three bits per switch/output port is computed. Evaluation results show that, by using a small logic, LBDR mimics the performance of routing algorithms when implemented with routing tables, both in regular and irregular topologies. LBDR implementation in a real NoC switch is also explored, proving its smooth integration in the architecture and its negligible hardware and performance overhead.",
	doi = "10.1049/iet-cdt.2008.0092",
	issn = "1751-8601",
	journal = "Computers Digital Techniques, IET",
	keywords = "2D mesh topologies;chip multiprocessors;communication scalability;distributed routing algorithms;high-performance computing domain;logic-based distributed routing;manufacturing faults;multicore system;network segmentation;networks-on-chip;resource managem",
	month = "september",
	number = 5,
	pages = "460 -475",
	title = "{E}fficient implementation of distributed routing algorithms for {N}o{C}s",
	volume = 3,
	year = 2009
}

Salvador Petit, Rafael Ubal, Julio Sahuquillo, Pedro Lopez and Jose Duato. An Efficient Low-Complexity Alternative to the ROB for Out-of-Order Retirement of Instructions. In Antonio Nunez; Pedro P Carballo (ed.). Digital System Design, Architectures, Methods and Tools, 2009. DSD '09. 12th Euromicro Conference on. 2009, 635 -642. URL, DOI BibTeX

@conference{5350186,
	author = "Petit, Salvador and Ubal, Rafael and Sahuquillo, Julio and Lopez, Pedro and Duato, Jose",
	abstract = "Current superscalar processors use a reorder buffer (ROB) to support speculation, precise exceptions, and register reclamation. Instructions are retired from this structure in program order, which may lead to significant performance degradation if a long latency operation blocks the ROB head. In this paper, a checkpoint-free out-of-order commit architecture is proposed, which replaces the ROB with a small structure called validation buffer (VB) from which instructions are retired as soon as their speculative state is resolved. An aggressive register reclamation mechanism targeted to this microarchitecture is also devised. Experimental results show that the VB microarchitecture is much more efficient than a ROB-based microprocessor. For example, a 32-entry VB provides similar performance to a 256-entry ROB, while reducing the utilization of other major processor structures.",
	booktitle = "Digital System Design, Architectures, Methods and Tools, 2009. DSD '09. 12th Euromicro Conference on",
	doi = "10.1109/DSD.2009.237",
	editor = "Antonio Nunez; Pedro P. Carballo",
	isbn = "978-0-7695-3782-5",
	keywords = "ROB-based microprocessor;checkpoint-free out-of-order commit architecture;out-of-order instruction retirement;register reclamation;register reclamation mechanism;superscalar reorder buffer processors;validation buffer;buffer circuits;microprocessor chips;",
	month = "aug.",
	pages = "635 -642",
	title = "{A}n {E}fficient {L}ow-{C}omplexity {A}lternative to the {ROB} for {O}ut-of-{O}rder {R}etirement of {I}nstructions",
	url = "http://dx.doi.org/10.1109/DSD.2009.237",
	year = 2009
}

Vicente Chirivella, Rosa Alcover, Jose Flich and Jose Duato. Dependability analysis of a fault-tolerant network reconfiguring strategy. In Henk Sips; Dick Epema; Hai-Xiang Lin (ed.). Euro-Par 2009 Parallel Processing 5704. August 2009, 1040 - 1051. URL, DOI BibTeX

@conference{20094612441323,
	author = "Chirivella, Vicente and Alcover, Rosa and Flich, Jose and Duato, Jose",
	abstract = "Fault tolerance mechanisms become indispensable as the number of processors increases in large systems. Measuring the effectiveness of such mechanisms before their implementation becomes mandatory. Research toward understanding the effects of different network parameters on the dependability parameters, like mean time to network failure or availability, becomes necessary. In this paper we analyse in detail such effects with a methodology proposed previously by us. This methodology is based on Markov chains and Analysis of Variance techniques. As a case study we analyse the effects of network size, mean time to node failure, mean time to node repair, mean time to network repair and coverage of the failure when using a 2D mesh network with a fault-tolerant mechanism (similar to the one used in the BlueGene/L system) that is able to remove rows and/or columns in the presence of failures. © 2009 Springer.",
	address = "Delft, Netherlands",
	booktitle = "Euro-Par 2009 Parallel Processing",
	doi = "10.1007/978-3-642-03869-3_96",
	editor = "Henk Sips; Dick Epema; Hai-Xiang Lin",
	isbn = "978-3-642-03869-3",
	issn = "0302-9743",
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Fault tolerant computer systems",
	keywords = "Artificial intelligence;Bioinformatics;Fault tolerance;Markov processes;Quality assurance;Regression analysis;",
	month = "Aug",
	note = "BlueGene/L systems;Dependability analysis;Fault tolerance mechanisms;Fault-tolerant mechanism;Fault-tolerant networks;Large system;Markov Chain;Mesh network;Network failure;Network parameters;Network size;Node failure;",
	pages = "1040 - 1051",
	publisher = "Springer",
	series = "Lecture Notes in Computer Science",
	title = "{D}ependability analysis of a fault-tolerant network reconfiguring strategy",
	url = "http://dx.doi.org/10.1007/978-3-642-03869-3_96",
	volume = 5704,
	year = 2009
}

, M Palesi, Jose Flich, S Kumar, Pedro Lopez, R Holsmark and Jose Duato. Region-Based Routing: A Mechanism to Support Efficient Routing Algorithms in NoCs. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 17(3):356 -369, March 2009. URL, DOI BibTeX

@article{4804124,
	author = ", and M. Palesi and Flich, Jose and S. Kumar and Lopez, Pedro and R. Holsmark and Duato, Jose",
	abstract = "An efficient routing algorithm is important for large on-chip networks [network-on-chip (NoC)] to provide the required communication performance to applications. Implementing a NoC using table-based switches provides many advantages, including the possibility of changing routing algorithms and fault tolerance, due to the option of table reconfigurations. However, table-based switches have been considered unsuitable for NoCs due to their perceived high area and power consumption. In this paper, we describe the region-based routing (RBR) mechanism which groups destinations into network regions allowing an efficient implementation with logic blocks. RBR can also be viewed as a mechanism to reduce the number of entries in routing tables. RBR is general and can be used in conjunction with any adaptive routing algorithm. In particular, we have evaluated the proposed scheme in conjunction with a general routing algorithm, namely segment-based routing (SR) and an application specific routing algorithm (APSRA) using regular and irregular mesh topologies. Our study shows that the number of entries in the table is significantly reduced, especially for large networks. Evaluation results show that RBR requires only four regions to support several routing algorithms in a 2-D mesh with no performance degradation. Considering link failures, our results indicate that RBR combined with SR is able to tolerate up to 7 link failures in an 8$\times$8 mesh. RBR also reduces area and power dissipation of an equivalent table-based implementation by factors of 8 and 10, respectively. Moreover, the degradation in performance of the network is insignificant when using APSRA combined with RBR.",
	doi = "10.1109/TVLSI.2008.2012010",
	issn = "1063-8210",
	journal = "Very Large Scale Integration (VLSI) Systems, IEEE Transactions on",
	keywords = "adaptive routing algorithm;application specific routing algorithm;fault tolerance;large on-chip networks;network-on-chip;region-based routing mechanism;segment-based routing;table-based switches;network topology;network-on-chip;",
	month = "march",
	number = 3,
	pages = "356 -369",
	title = "{R}egion-{B}ased {R}outing: {A} {M}echanism to {S}upport {E}fficient {R}outing {A}lgorithms in {N}o{C}s",
	url = "http://dx.doi.org/10.1109/TVLSI.2008.2012010",
	volume = 17,
	year = 2009
}

A Martinez, P J Garcia, F J Alfaro, J L Sanchez, Jose Flich, F J Quiles and Jose Duato. A Switch Architecture Guaranteeing QoS Provision and HOL Blocking Elimination. Parallel and Distributed Systems, IEEE Transactions on 20(1):13 -24, 2009. DOI BibTeX

@article{4497190,
	author = "A. Martinez and P.J. Garcia and F.J. Alfaro and J.L. Sanchez and Flich, Jose and F.J. Quiles and Duato, Jose",
	abstract = "Both QoS support and congestion management techniques become essential to achieve good network performance in current high-speed interconnection networks. The most effective techniques traditionally considered for both issues, however, require too many resources to be implemented. In this paper we propose a new cost-effective switch architecture able to face the challenges of congestion management and, at the same time, to provide QoS. The efficiency of our proposal is based on using the resources (queues) used by RECN (an efficient Head-Of-Line blocking elimination technique) also for QoS support, without increasing queue requirements. The results provided show that the new switch architecture is able to guarantee QoS levels without any degradation due to congestion situations.",
	doi = "10.1109/TPDS.2008.62",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "HOL blocking elimination;QoS provision;congestion management;high-speed interconnection networks;network performance;switch architecture;quality of service;telecommunication congestion control;telecommunication network management;telecommunication switchi",
	month = "jan.",
	number = 1,
	pages = "13 -24",
	title = "{A} {S}witch {A}rchitecture {G}uaranteeing {Q}o{S} {P}rovision and {HOL} {B}locking {E}limination",
	volume = 20,
	year = 2009
}

Carles Hernández, Federico Silla, Vicente Santonja and Jose Duato. A new mechanism to deal with process variability in NoC links. In IPDPS 2009 - Proceedings of the 2009 IEEE International Parallel and Distributed Processing Symposium. 2009, IEEE Computer Societ. URL BibTeX

@conference{20094812508592,
	author = "Hern{\'a}ndez, Carles and Silla, Federico and Santonja, Vicente and Duato, Jose",
	abstract = "Associated with the ever growing integration scale of VLSI technologies is the increase in process variability, which makes silicon devices less predictable. In the context of network-on-chip (NoC), this variability affects the maximum frequency that could be sustained by each wire of the link that interconnects two cores in a CMP system. Reducing the clock frequency so that all wires can properly work is a trivial solution but, as variability increases, this approach causes an unacceptable performance penalty. In this paper, we propose a new technique to deal with the effects of variability on the links of the NoC that interconnects cores in a CMP system. This technique, called Phit Reduction (PR), retrieves most of the bandwidth still available in links containing wires that are not able to operate at the designed operating frequency. More precisely, our mechanism discards these slow wires and uses all the wires that can work at the design frequency. Two implementations are presented: Local Phit Reduction (LPR), oriented to fabrication processes with very high variability, which requires more hardware but provides higher performance; and Global Phit Reduction (GPR), which requires less additional hardware but is not able to extract all the available bandwidth. The performance evaluation presented in the paper confirms that LPR obtains good results both for low and high variability scenarios. Moreover, in most of our experiments LPR practically achieves the same performance as the ideal network. On the other hand, GPR is appropriate for systems where within-die variations are expected to be low. © 2009 IEEE.",
	address = "Rome, Italy",
	booktitle = "IPDPS 2009 - Proceedings of the 2009 IEEE International Parallel and Distributed Processing Symposium",
	journal = "IPDPS 2009 - Proceedings of the 2009 IEEE International Parallel and Distributed Processing Symposium",
	key = "Wire",
	keywords = "Bandwidth;Distributed parameter networks;Electric network topology;Machine design;Nanotechnology;Radar antennas;",
	note = "Available bandwidth;Clock frequency;Design frequencies;Fabrication process;High variability;Ideal network;In-process;Maximum frequency;Network on chip;New mechanisms;Operating frequency;Performance evaluation;Performance penalties;Process Variability;Silicon devices;Trivial solutions;VLSI technology;",
	pages = "IEEE Computer Societ",
	title = "{A} new mechanism to deal with process variability in {N}o{C} links",
	url = "http://dx.doi.org/10.1109/IPDPS.2009.5161048",
	year = 2009
}

Salvador Coll, Francisco J Mora, Jose Duato and Fabrizio Petrini. Efficient and scalable hardware-based multicast in fat-tree networks. IEEE Transactions on Parallel and Distributed Systems 20(9):1285 - 1298, 2009. URL, DOI BibTeX

@article{20093412267181,
	author = "Coll, Salvador and Francisco J. Mora and Duato, Jose and Fabrizio Petrini",
	abstract = "This article presents an efficient and scalable mechanism to overcome the limitations of collective communication in switched interconnection networks in the presence of faults. Considering that current trends in supercomputing are moving toward massively parallel computers, with many thousands of components, reliability becomes a challenge. In such a scenario, fat-tree networks that provide hardware support for collective communication suffer from serious performance degradation due to the presence of even a single faulty node. This paper describes a new mechanism to provide high-performance collective communication in such situations. The feasibility of the proposed technique is formally demonstrated. We present the design of a new hardware-based routing algorithm for multicast, which is the basis of our proposal. The proposed mechanism is implemented and experimentally evaluated. Our experimental results show that hardware-based multicast trees provide an efficient and scalable solution for collective communication in fat-tree networks, significantly outperforming traditional solutions. © 2009 IEEE.",
	address = "445 Hoes Lane - P.O.Box 1331, Piscataway, NJ 08855-1331, United States",
	doi = "10.1109/TPDS.2008.228",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Communication",
	keywords = "Computer hardware;Convolutional codes;Multicasting;Routing algorithms;Switching circuits;",
	note = "Data communications;Interprocessor communications;Multicast;Network communication;Network problems;Trees;",
	number = 9,
	pages = "1285 - 1298",
	title = "{E}fficient and scalable hardware-based multicast in fat-tree networks",
	url = "http://dx.doi.org/10.1109/TPDS.2008.228",
	volume = 20,
	year = 2009
}

P Morillo, J M Orduna and Jose Duato. M-GRASP: a GRASP with memory for latency-aware partitioning methods in DVE systems. IEEE Transactions on Systems, Man and Cybernetics, Part A (Systems and Humans) 39(6):1214 - 23, 2009. URL BibTeX

@article{10919102,
	author = "P. Morillo and J.M. Orduna and Duato, Jose",
	abstract = "A necessary condition for providing quality of service to distributed virtual environments (DVEs) is to provide a system response below a maximum threshold to the client computers. In this sense, latency-aware partitioning methods try to provide response times below the threshold to as many client computers as possible. These partitioning methods should find an assignment of clients to servers that optimizes system throughput, system latency, and partitioning efficiency. In this paper, we present a new algorithm based on greedy randomized adaptive search procedure with memory for finding the best possible solutions to this problem. We take into account several different alternatives in order to design both the constructive phase and the local search phase of this multistart metaheuristic for combinatorial problems. Additionally, we enhance this basic approach with some intensification strategies that improve the efficiency of the basic search method. Performance evaluation results show that the new algorithm increases the performance provided by other metaheuristics when applied to solve the latency-aware partitioning problem in DVE systems.",
	address = "USA",
	issn = "1083-4427",
	journal = "IEEE Transactions on Systems, Man and Cybernetics, Part A (Systems and Humans)",
	keywords = "client-server systems;combinatorial mathematics;greedy algorithms;quality of service;randomised algorithms;search problems;virtual reality;",
	note = "M-GRASP;distributed virtual environments;quality of service;latency-aware partitioning methods;system latency;greedy randomized adaptive search procedure;local search phase;combinatorial problems;DVE system;",
	number = 6,
	pages = "1214 - 23",
	title = "{M}-{GRASP}: a {GRASP} with memory for latency-aware partitioning methods in {DVE} systems",
	url = "http://dx.doi.org/10.1109/TSMCA.2009.2025024",
	volume = 39,
	year = 2009
}

, Jose Flich, Jose Duato, H Eberle, N Gura and W Olesinski. A performance evaluation of 2D-mesh, ring, and crossbar interconnects for chip multi-processors. In Network on Chip Architectures, 2009. NoCArc 2009. 2nd International Workshop on. 2009, 51 -56. BibTeX

@conference{5375715,
	author = ", and Flich, Jose and Duato, Jose and H. Eberle and N. Gura and W. Olesinski",
	abstract = "As the number of processing nodes on chip multi-processors (CMPs) keeps increasing, providing efficient communication with the on-chip interconnect becomes increasingly critical. With 32-core CMP designs on the drawing table of engineers, there is a demand for accurate simulation models that capture all the complexities and interactions of the different design layers including the application, operating system, cache hierarchy, coherency protocol, and other on-chip resources. These components cannot be modeled anymore in isolation as unpredicted performance anomalies may arise once all the system variables are taken into account. In this paper, we present a simulation framework for CMP systems, focusing our attention on the on-chip network. We show preliminary results for the choice of key network parameters (topology, flit size) with respect to the behavior and performance of applications running on top of different network configurations. This paper tries to convey the need for an overall CMP system simulator as a way to accurately characterize the actual behavior of the on-chip network.",
	booktitle = "Network on Chip Architectures, 2009. NoCArc 2009. 2nd International Workshop on",
	keywords = "2D-mesh interconnects;32-core CMP designs;cache hierarchy;chip multi-processors;coherency protocol;crossbar interconnects;on-chip network;operating system;processing nodes;ring interconnects;integrated circuit design;integrated circuit interconnections;mi",
	month = "12-12",
	pages = "51 -56",
	title = "{A} performance evaluation of 2{D}-mesh, ring, and crossbar interconnects for chip multi-processors",
	year = 2009
}

Francisco J Alfaro, Jose L Sanchez and Jose Duato. A new strategy to manage the InfiniBand arbitration tables. Journal of Parallel and Distributed Computing 69(6):508 - 520, 2009. URL BibTeX

@article{20091912072389,
	author = "Francisco J. Alfaro and Jose L. Sanchez and Duato, Jose",
	abstract = "The InfiniBand Architecture (IBA) is an industry-standard architecture for server I/O and interprocessor communication. InfiniBand is extensively used for interconnection in high-performance clusters. It has been developed by the InfiniBand℠ Trade Association (IBTA) to provide the levels of reliability, availability, performance, scalability, and quality of service (QoS) necessary for present and future server systems. The provision of QoS in data communication networks is currently the focus of much discussion and research in both industry and academia. In that sense, IBA enables QoS support with some mechanisms. In this paper, we examine these mechanisms and describe a way to use them. We propose a traffic segregation strategy based only on delay requirements. Moreover, we propose a very effective methodology to compute the virtual lane arbitration tables. Finally, we evaluate our proposal and performance results show that, with a correct traffic treatment at the output ports, every traffic class meets its QoS requirements. © 2009 Elsevier Inc. All rights reserved.",
	address = "6277 Sea Harbor Drive, Orlando, FL 32887-4900, United States",
	issn = 07437315,
	journal = "Journal of Parallel and Distributed Computing",
	key = "Quality of service",
	keywords = "Interconnection networks;Parallel processing systems;Queueing networks;Telecommunication networks;",
	note = "Arbitration;Clusters;Connection requirements;InfiniBand;QoS;",
	number = 6,
	pages = "508 - 520",
	title = "{A} new strategy to manage the {I}nfini{B}and arbitration tables",
	url = "http://dx.doi.org/10.1016/j.jpdc.2009.02.002",
	volume = 69,
	year = 2009
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. An Efficient Switching Technique for NoCs with Reduced Buffer Requirements. In Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on. 2008, 713 -720. URL, DOI BibTeX

@conference{4724384,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on chip (NoCs) connect the components located inside a chip. Overall system performance depends on NoC performance, which is affected by several factors. One of them is the network clock frequency, imposed by the critical path delay. Recent works show that the switch critical path includes buffer control logic. Consequently, by removing switch buffers, switch frequency can be doubled. In this paper, we exploit this idea, proposing a new switching technique for NoCs which requires a reduced amount of storage at the switches. It is based on replacing switch port buffers by single latches. By doing so, network cycle can be reduced, which reduces packet latency. In addition, power and area consumption requirements can be reduced. However, since there are no buffers at the switch ports, packets cannot be stopped. Packets stopped due to contention are dropped and reinjected from their senders via negative acknowledgments. Packet dropping is strongly reduced by exploiting the wiring capability of NoCs.",
	booktitle = "Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on",
	doi = "10.1109/ICPADS.2008.43",
	issn = "1521-9097",
	keywords = "buffer control logic;critical path delay;network clock frequency;network cycle;networks on chip;packet dropping;reduced buffer requirements;switching technique;network-on-chip;performance evaluation;",
	month = "dec.",
	pages = "713 -720",
	title = "{A}n {E}fficient {S}witching {T}echnique for {N}o{C}s with {R}educed {B}uffer {R}equirements",
	url = "http://dx.doi.org/10.1109/ICPADS.2008.43",
	year = 2008
}

Crispín Gomez, Francisco Gilabert, Maria E Gomez, Pedro Lopez and Jose Duato. Beyond Fat–tree: Unidirectional Load–Balanced Multistage Interconnection Network. Computer Architecture Letters 7(2):49 -52, 2008. URL, DOI BibTeX

@article{4544509,
	author = "Gomez, Crisp{\'i}n and Gilabert, Francisco and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "The fat-tree is one of the most widely-used topologies by interconnection network manufacturers. Recently, it has been demonstrated that a deterministic routing algorithm that optimally balances the network traffic can not only achieve almost the same performance as an adaptive routing algorithm but also outperform it. On the other hand, fat-trees require a high number of switches with a non-negligible wiring complexity. In this paper, we propose replacing the fat-tree by a unidirectional multistage interconnection network (UMIN) that uses a traffic balancing deterministic routing algorithm. As a consequence, switch hardware is reduced by almost half, thus decreasing the power consumption, the arbitration complexity, the switch size itself, and the network cost. Preliminary evaluation results show that the UMIN with the load balancing scheme obtains lower latency than fat-tree for low and medium traffic loads. Furthermore, in networks with a high number of stages or with high radix switches, it obtains the same, or even higher, throughput than fat-tree.",
	doi = "10.1109/L-CA.2008.8",
	issn = "1556-6056",
	journal = "Computer Architecture Letters",
	keywords = "adaptive routing algorithm;interconnection network manufacturers;network traffic;nonnegligible wiring complexity;power consumption;radix switches;traffic balancing deterministic routing algorithm;unidirectional load-balanced multistage interconnection net",
	month = "july-dec.",
	number = 2,
	pages = "49 -52",
	title = "{B}eyond {F}at--tree: {U}nidirectional {L}oad--{B}alanced {M}ultistage {I}nterconnection {N}etwork",
	url = "http://dx.doi.org/10.1109/L-CA.2008.8",
	volume = 7,
	year = 2008
}

O Lysne, J M Montañana, Jose Flich, Jose Duato, T M Pinkston and T Skeie. An Efficient and Deadlock-Free Network Reconfiguration Protocol. Computers, IEEE Transactions on 57(6):762 -779, June 2008. URL, DOI BibTeX

@article{4459311,
	author = "O. Lysne and Monta{\~n}ana, J. M. and Flich, Jose and Duato, Jose and T.M. Pinkston and T. Skeie",
	abstract = {Component failures and planned component replacements cause changes in the topology and routing paths supplied by the interconnection network of a parallel processor system over time. Such changes may require the network to be reconfigured such that the existing routing function is replaced by one that enables packets to reach their intended destinations amid the changes. Efficient reconfiguration methods are desired which allow the network to function uninterruptedly over the course of the reconfiguration process while remaining free from deadlocking behavior. In this paper, we propose, evaluate, and prove the deadlock freedom of a new network reconfiguration protocol that overlaps various phases of "static" reconfiguration processes traditionally used in commercial and research systems to provide performance efficiency on par with that of recently proposed "dynamic" reconfiguration processes but without their complexity. Simulation results show that the proposed Overlapping Static Reconfiguration protocol can reduce reconfiguration time by up to 50 percent, reduce packet latency by several orders of magnitude, reduce packet dropping by an order of magnitude, and provide unhalted packet injection as compared to traditional static reconfiguration while allowing network throughput similar to dynamic reconfiguration.},
	doi = "10.1109/TC.2008.31",
	issn = "0018-9340",
	journal = "Computers, IEEE Transactions on",
	keywords = "deadlock freedom;dynamic reconfiguration processes;interconnection network;network reconfiguration protocol;overlapping static reconfiguration protocol;parallel processor system;reduce packet latency;static reconfiguration processes;multiprocessor interco",
	month = "june",
	number = 6,
	pages = "762 -779",
	title = "{A}n {E}fficient and {D}eadlock-{F}ree {N}etwork {R}econfiguration {P}rotocol",
	url = "http://dx.doi.org/10.1109/TC.2008.31",
	volume = 57,
	year = 2008
}

J M Montañana, Jose Flich and Jose Duato. Epoch-based reconfiguration: Fast, simple, and effective dynamic network reconfiguration. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. April 2008, 1 -12. URL, DOI BibTeX

@conference{4536298,
author = "Monta{\~n}ana, J. M. and Flich, Jose and Duato, Jose",
abstract = "Dynamic network reconfiguration is defined as the process of changing from one routing function to another while the network remains up and running. The main challenge is to avoid deadlocks and reduce packet dropping rate while keeping network service. Current approaches either require the existence of extra network resources such as virtual channels, are so complex that their practical applicability is limited, or affect the performance of the network during the reconfiguration process. In this paper we present EBR, a simple and fast method for dynamic network reconfiguration. EBR guarantees a fast and deadlock-free reconfiguration, but instead of avoiding deadlocks our mechanism is based on regressive deadlock recoveries. Thus, EBR allows cycles to be formed, and in the situation of a deadlock some packets may be dropped. However, as demonstrated, no packets need to be dropped in the working zone of the system. Also, the mechanism works in an asynchronous manner, does not require additional resources and works on any topology. In order to minimize the number of dropped packets, EBR uses an epoch marking system that guarantees that only packets potentially leading to a deadlock will be removed. Evaluation results show that EBR works efficiently in different topologies and with different routing algorithms. When compared with current proposals, EBR always gets the best numbers in all the analyzed parameters (dropped packets, latency, throughput, reconfiguration time and resources required), thus achieving the good properties of all mechanisms.",
booktitle = "Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on",
doi = "10.1109/IPDPS.2008.4536298",
isbn = "978-1-4244-1693-6",
issn = "1530-2075",
keywords = "deadlock-free reconfiguration;dynamic network reconfiguration;epoch-based reconfiguration;network resource;network service;packet dropping rate;regressive deadlock recovery;routing algorithm;routing function;topology;computer networks;telecommunication ne",
month = "april",
pages = "1 -12",
title = "{E}poch-based reconfiguration: {F}ast, simple, and effective dynamic network reconfiguration",
url = "http://dx.doi.org/10.1109/IPDPS.2008.4536298",
year = 2008
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. Exploiting Wiring Resources on Interconnection Network: Increasing Path Diversity. In Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on. 2008, 20 -29. URL, DOI BibTeX

@conference{4457100,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "On-chip networks are the answer to the growing demands for high communication performance of chip multiprocessors. These networks have a number of characteristics that make their design quite different to off-chip networks. In particular, wires are an abundant available resource inside the chip. In this paper, we explore how to organize the huge wiring capabilities available in on-chip networks. In particular, we analyze the option of distributing the wires among several parallel links connecting the same two switches. This technique is known as Space Division Multiplexing (SDM). The number of parallel sub-links and their width are two key parameters that are studied together with the relationship with the mean packet size. The paper shows that SDM is a technique to take into account in on-chip networks since it allows to highly increase the network accepted traffic at the expense of a small latency increase or even no increase. Moreover, in some networks, it allows to reduce the network hardware, providing similar performance results, which results in a reduction in the consumption of area and power.",
	booktitle = "Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on",
	doi = "10.1109/PDP.2008.33",
	isbn = "978-0-7695-3089-5",
	issn = "1066-6192",
	keywords = "chip multiprocessors;interconnection network;mean packet size;on-chip networks;parallel links;path diversity;space division multiplexing;wiring capabilities;wiring resources;multiprocessor interconnection networks;space division multiplexing;wiring;",
	month = "feb.",
	pages = "20 -29",
	title = "{E}xploiting {W}iring {R}esources on {I}nterconnection {N}etwork: {I}ncreasing {P}ath {D}iversity",
	url = "http://dx.doi.org/10.1109/PDP.2008.33",
	year = 2008
}

Jose Flich and Jose Duato. Logic-Based Distributed Routing for NoCs. Computer Architecture Letters 7(1):13 -16, 2008. DOI BibTeX

@article{4407676,
	author = "Flich, Jose and Duato, Jose",
	abstract = "The design of scalable and reliable interconnection networks for multicore chips (NoCs) introduces new design constraints like power consumption, area, and ultra low latencies. Although 2D meshes are usually proposed for NoCs, heterogeneous cores, manufacturing defects, hard failures, and chip virtualization may lead to irregular topologies. In this context, efficient routing becomes a challenge. Although switches can be easily configured to support most routing algorithms and topologies by using routing tables, this solution does not scale in terms of latency and area. We propose a new circuit that removes the need for using routing tables. The new mechanism, referred to as logic-based distributed routing (LBDR), enables the implementation in NoCs of many routing algorithms for most of the practical topologies we might find in the near future in a multicore chip. From an initial topology and routing algorithm, a set of three bits per switch output port is computed. By using a small logic block, LBDR mimics (demonstrated by evaluation) the behavior of routing algorithms implemented with routing tables. This result is achieved both in regular and irregular topologies. Therefore, LBDR removes the need for using routing tables for distributed routing, thus enabling flexible, fast and power-efficient routing in NoCs.",
	doi = "10.1109/L-CA.2007.16",
	issn = "1556-6056",
	journal = "Computer Architecture Letters",
	keywords = "NoC;chip virtualization;heterogeneous cores;interconnection network reliability;logic-based distributed routing;manufacturing defects;networks for multicore chips;circuit reliability;interconnections;logic circuits;network routing;network topology;network",
	month = "january-june",
	number = 1,
	pages = "13 -16",
	title = "{L}ogic-{B}ased {D}istributed {R}outing for {N}o{C}s",
	volume = 7,
	year = 2008
}

Crispín Gomez, Francisco Gilabert, Maria E Gomez, Pedro Lopez and Jose Duato. RUFT: Simplifying the fat-tree topology. In Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on. 2008, 153 - 160. URL BibTeX

@conference{20090911931135,
	author = "Gomez, Crisp{\'i}n and Gilabert, Francisco and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "The fat-tree is one of the most widely-used topologies by interconnection network manufacturers. Recently, a deterministic routing algorithm that optimally balances the network traffic in fat-trees was proposed. It can not only achieve almost the same performance as adaptive routing, but also outperforms it for some traffic patterns. Nevertheless, fat-trees require a high number of switches with a non-negligible wiring complexity. In this paper, we propose replacing the fat-tree by a unidirectional multistage interconnection network referred to as Reduced Unidirectional Fat-tree (RUFT) that uses a simplified version of the aforementioned deterministic routing algorithm. As a consequence, switch hardware is reduced almost by half, thereby decreasing power consumption, arbitration complexity, switch size, and network cost. Evaluation results show that RUFT obtains lower latency than fat-tree for low and medium traffic loads. Furthermore, in large networks, it obtains almost the same throughput as the classical fat-tree. © 2008 IEEE.",
	address = "Melbourne, VIC, Australia",
	booktitle = "Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on",
	issn = 15219097,
	journal = "Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS",
	key = "Trees (mathematics)",
	keywords = "Interconnection networks;Internet;Routing algorithms;Switches;Switching circuits;",
	note = "Adaptive routing;Deterministic routing algorithms;Evaluation results;Large networks;Multi-stage interconnection networks;Network costs;Network traffics;Number of switches;Power consumption;Switch sizes;Traffic loads;Traffic patterns;Tree topologies;",
	pages = "153 - 160",
	title = "{RUFT}: {S}implifying the fat-tree topology",
	url = "http://dx.doi.org/10.1109/ICPADS.2008.44",
	year = 2008
}

Blas Cuesta Sáez, Antonio Robles and Jose Duato. Switch-based packing technique for improving token coherence scalability. 2008, 80 - 87. URL BibTeX

@conference{20090411871352,
	author = "Cuesta S{\'a}ez, Blas and Robles, Antonio and Duato, Jose",
	abstract = "Traditional cache coherence protocols either provide low latency cache misses (snooping protocols) or bandwidth efficiency (directory protocols). To simultaneously capture the best attributes of traditional protocols, Token Coherence has been recently proposed. This protocol can quickly resolve cache misses by transient requests. However, since transient requests are unordered messages, they may sometimes fail in solving cache misses mainly due to the occurrence of protocol races. Thus, when the completion of cache misses is not possible by transient requests, Token Coherence uses a starvation prevention mechanism to ensure their completion. Although several implementation options of starvation prevention mechanisms have been proposed, all of them are broadcast-based. This fact represents a large detriment to the Token Coherence scalability. To tackle this problem, in this work we apply a switch-based packing technique that alleviates the harm of broadcast messages and improves the protocol scalability. © 2008 IEEE.",
	address = "Dunedin, Otago, New Zealand",
	journal = "Parallel and Distributed Computing, Applications and Technologies, PDCAT Proceedings",
	key = "Coherent light",
	keywords = "Multiprocessing systems;Scalability;",
	note = "Bandwidth efficiencies;Broadcast messages;Cache coherence protocols;Cache misses;Directory protocols;Low latencies;Packing techniques;Protocol scalabilities;Token coherences;",
	pages = "80 - 87",
	title = "{S}witch-based packing technique for improving token coherence scalability",
	url = "http://dx.doi.org/10.1109/PDCAT.2008.25",
	year = 2008
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. Reducing packet dropping in a bufferless NoC. In Euro-Par 2008 – Parallel Processing. 2008, 899 - 909. URL BibTeX

@conference{10528093,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on chip (NoCs) have a strong impact on overall chip performance. Interconnection bandwidth is limited by the critical path delay. Recent works show that the critical path includes the switch input buffer control logic. As a consequence, by removing buffers, switch clock frequency can be doubled. Recently, a new switching technique for NoCs called blind packet switching (BPS) has been proposed. It is based on replacing the buffers of the switch ports by simple latches. Since buffers consume a high percentage of switch power and area, BPS not only improves performance but also helps in reducing power and area. In BPS there are no buffers at the switch ports, so packets cannot be stopped. If the required output port is busy, the packet will be dropped. In order to prevent packet dropping, some techniques based on resource replication have been proposed. In this paper, we propose some alternative and complementary techniques that do not rely on resource replication. By using these techniques, packet dropping and its negative effects are highly reduced. In particular, packet dropping is completely removed for a very wide network traffic range. The first dropped packet appears at a 11.6 higher traffic load. As a consequence, network throughput is increased and the packet latency is kept almost constant.",
	address = "Berlin, Germany",
	booktitle = "Euro-Par 2008 – Parallel Processing",
	journal = "Euro-Par 2008 Parallel Processing. 14th International Euro-Par Conference",
	keywords = "delays;multiprocessor interconnection networks;network-on-chip;packet switching;",
	note = "bufferless NoC;networks on chip;interconnection bandwidth;buffer control logic;blind packet switching;resource replication;packet dropping;network traffic load;critical path delay;",
	pages = "899 - 909",
	title = "{R}educing packet dropping in a bufferless {N}o{C}",
	url = "http://dx.doi.org/10.1007/978-3-540-85451-7_97",
	year = 2008
}

Joan-Lluis Ferrer, Elvira Baydal, Antonio Robles, Pedro Lopez and Jose Duato. On the influence of the packet marking and injection control schemes in congestion management for MINs. 2008, 930 - 9. URL BibTeX

@conference{10528096,
	author = "Ferrer, Joan-Lluis and Baydal, Elvira and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = "Several Congestion Management Mechanisms (CMMs) have been proposed for Multistage Interconnection Networks (MINs) in order to avoid the degradation of network performance when congestion appears. Most of them are based on Explicit Congestion Notification (ECN). For this purpose, switches detect congestion and, depending on the applied mechanism, some flags are marked to warn the source hosts. In response, source hosts apply corrective actions to adjust their packet injection rate. These mechanisms have been evaluated by analyzing whether they are able to manage a congestion situation but there is not a comparison study among them. Moreover, marking effects are not separately analyzed from corrective actions. In this paper, we analyze the current proposals for CMMs, showing the impact of the applied packet marking techniques as well as the corrective actions they apply.",
	address = "Berlin, Germany",
	journal = "Euro-Par 2008 Parallel Processing. 14th International Euro-Par Conference",
	keywords = "multistage interconnection networks;packet switching;telecommunication congestion control;",
	note = "packet marking;injection control schemes;congestion management mechanisms;multistage interconnection networks;explicit congestion notification;message throttling;",
	pages = "930 - 9",
	title = "{O}n the influence of the packet marking and injection control schemes in congestion management for {MIN}s",
	url = "http://dx.doi.org/10.1007/978-3-540-85451-7_100",
	year = 2008
}

Héctor Montaner, Vicente Santonja, Federico Silla and Jose Duato. Network reconfiguration suitability for scientific applications. In Parallel Processing, 2008. ICPP '08. 37th International Conference on. 2008, 312 - 319. URL, DOI BibTeX

@conference{10207626,
	author = "Montaner, H{\'e}ctor and Santonja, Vicente and Silla, Federico and Duato, Jose",
	abstract = "This paper analyzes the communication pattern of several scientific applications and how they can benefit from network reconfiguration in order to adapt network topology to the communication needs so that total execution time is reduced. By using an analysis methodology based on real application executions, we study the variation of the required communication bandwidth with time and also the global interprocedural communication patterns. Results show that required bandwidth between each pair of processes does not significantly fluctuate, leading to a constant use of the links and therefore discouraging dynamic reconfigurations of the network during execution time. Nevertheless, the group of busy links changes with each application showing a different communication graph for each of them. Thus, execution time may be accelerated by using an ad-hoc topology, that is, reconfiguring the network before the execution of the application in order to adapt it to the application needs.",
	address = "Piscataway, NJ, USA",
	booktitle = "Parallel Processing, 2008. ICPP '08. 37th International Conference on",
	doi = "10.1109/ICPP.2008.58",
	journal = "2008 37th International Conference on Parallel Processing (ICPP)",
	keywords = "ad hoc networks;application program interfaces;message passing;natural sciences computing;telecommunication network topology;",
	note = "network reconfiguration suitability;scientific applications;network topology;global interprocedural communication patterns;communication graph;ad-hoc topology;message passing interface;",
	pages = "312 - 319",
	title = "{N}etwork reconfiguration suitability for scientific applications",
	url = "http://dx.doi.org/10.1109/ICPP.2008.58",
	year = 2008
}

Blas Cuesta Sáez, Antonio Robles and Jose Duato. Improving token coherence by multicast coherence messages. 2008, 269 - 73. URL BibTeX

@conference{9904937,
	author = "Cuesta S{\'a}ez, Blas and Robles, Antonio and Duato, Jose",
	abstract = "Token coherence is a cache coherence protocol that joins the main advantages of traditional protocols. However, unlike them, token coherence does not handle messages in order, which may lead to races, causing some cache misses not to be solved. To assure their completion, an inefficient mechanism named persistent requests is used. Recently we have proposed the priority request mechanism to efficiently handle races. As acknowledgements are not required, a single node can solve several misses for the same memory block at the same time. When solving a lot of misses, the node may become a bottleneck. To avoid it, in this work we propose the multicast coherence message, which allows to simultaneously resolve several misses by using only one response message. It reduces the network traffic and the average response latency, improving significantly the overall performance.",
	address = "Piscataway, NJ, USA",
	journal = "2008 16th Euromicro Conference on Parallel, Distributed and Network-based Processing - PDP '08",
	keywords = "cache storage;multicast protocols;routing protocols;",
	note = "token coherence;multicast coherence messages;cache coherence protocol;priority request mechanism;network traffic;average response latency;",
	pages = "269 - 73",
	title = "{I}mproving token coherence by multicast coherence messages",
	url = "http://dx.doi.org/10.1109/PDP.2008.36",
	year = 2008
}

Jesus Escudero-Sahuquillo, Pedro Garcia, Francisco Quiles, Jose Flich and Jose Duato. FBICM: Efficient congestion management for high-performance networks using distributed deterministic routing. In High Performance Computing - HiPC 2008 5374 LNCS. 2008, 503 - 517. URL, DOI BibTeX

@conference{20090511881191,
	author = "Jesus Escudero-Sahuquillo and Pedro Garcia and Francisco Quiles and Flich, Jose and Duato, Jose",
	abstract = "As the number of components in cluster-based systems increases, cost and power consumption also increase. One way to reduce both problems is using smaller networks with adequate congestion management mechanisms. Recent successful proposals (RECN) eliminate the negative effects of congestion, the Head-of-Line (HOL) blocking, leaving congestion harmless. RECN relies on source-based network architectures, where the entire route is placed at packet headers before injection. Unfortunately, distributed table-based routing is also common in cluster-based networks, InfiniBand being the most prominent example. We propose a novel congestion management technique for distributed table-based routing. The mechanism relies on additional congestion information located at routing tables. With this information HOL blocking is minimized by smartly using switch queues. Detailed memory organization and the way congestion information is updated/propagated are described. Preliminary results indicate that with modest resource requirements maximum network performance is kept regardless of congestion. © 2008 Springer Berlin Heidelberg.",
	address = "Bangalore, India",
	booktitle = "High Performance Computing - HiPC 2008",
	doi = "10.1007/978-3-540-89894-8_44",
	issn = 03029743,
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Network management",
	keywords = "High performance liquid chromatography;Industrial management;Network performance;Parallel processing systems;Systems engineering;",
	note = "Congestion control;Congestion management;Distributed routing;Head of line (HOL) blocking;High-performance interconnects;InfiniBand (CO);Memory organizations;Negative effects;Network performances;One way;Packet headers;Power consumption (CE);Resource requirements;",
	pages = "503 - 517",
	title = "{FBICM}: {E}fficient congestion management for high-performance networks using distributed deterministic routing",
	url = "http://dx.doi.org/10.1007/978-3-540-89894-8_44",
	volume = "5374 LNCS",
	year = 2008
}

Francisco Gilabert, S Medardoni, D Bertozzi, L Benini, Maria E Gomez, Pedro Lopez and Jose Duato. Exploring high-dimensional topologies for NoC design through an integrated analysis and synthesis framework. 2008, 107 - 16. BibTeX

@conference{9940710,
	author = "Gilabert, Francisco and S. Medardoni and D. Bertozzi and L. Benini and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks-on-chip (NoCs) address the challenge to provide scalable communication bandwidth to tiled architectures in a power-efficient fashion. The 2-D mesh is currently the most popular regular topology used for on-chip networks in tile-based architectures, because it perfectly matches the 2-D silicon surface and is easy to implement. However, a number of limitations have been proved in the open literature, especially for long distance traffic. Two relevant variants of 2-D meshes are explored in this paper: high-dimensional and concentrated topologies. The novelty of our exploration framework includes the use of fast and accurate transaction level simulation to provide constraints to the physical synthesis flow, which is integrated with standard industrial toolchains for accurate physical implementation. Interestingly, this work illustrates how effectively the compared topologies can handle synchronization-intensive traffic patterns and accounts for chip I/O interfaces.",
	address = "Piscataway, NJ, USA",
	journal = "2008 2nd ACM/IEEE International Symposium on Networks-on-Chip (NOCS '08)",
	keywords = "integrated circuit design;logic design;network topology;network-on-chip;",
	note = "NoC design;networks-on-chip;2D mesh topology;on-chip networks;tile-based architectures;industrial toolchains;chip I/O interfaces;",
	pages = "107 - 16",
	title = "{E}xploring high-dimensional topologies for {N}o{C} design through an integrated analysis and synthesis framework",
	year = 2008
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. Exploiting wiring resources on interconnection network: Increasing path diversity. In Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on. 2008, 20 - 29. URL BibTeX

@conference{20083011395413,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "On-chip networks are the answer to the growing demands for high communication performance of chip multiprocessors. These networks have a number of characteristics that make their design quite different to off-chip networks. In particular, wires are an abundant available resource inside the chip. In this paper, we explore how to organize the huge wiring capabilities available in on-chip networks. In particular, we analyze the option of distributing the wires among several parallel links connecting the same two switches. This technique is known as Space Division Multiplexing (SDM). The number of parallel sub-links and their width are two key parameters that are studied together with the relationship with the mean packet size. The paper shows that SDM is a technique to take into account in on-chip networks since it allows to highly increase the network accepted traffic at the expense of a small latency increase or even no increase. Moreover, in some networks, it allows to reduce the network hardware, providing similar performance results, which results in a reduction in the consumption of area and power. © 2008 IEEE.",
	address = "Toulouse, France",
	booktitle = "Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on",
	journal = "Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, PDP 2008",
	key = "Space division multiple access",
	keywords = "Electric network topology;Internet;Telecommunication;Wire;",
	note = "Chip multi processor (CMP);Communication performances;Key parameters;Latency increase;Off chip;On Chip Network (OCN);Packet size (PS);Parallel links;Path diversity;Performance results;Space division multiplexing (SDM);",
	pages = "20 - 29",
	title = "{E}xploiting wiring resources on interconnection network: {I}ncreasing path diversity",
	url = "http://dx.doi.org/10.1109/PDP.2008.33",
	year = 2008
}

Samuel Rodrigo, Jose Flich, Jose Duato and M Hummel. Efficient unicast and multicast support for CMPs. In 2008 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41). 2008, 364 - 75. URL BibTeX

@conference{10428961,
	author = "Rodrigo, Samuel and Flich, Jose and Duato, Jose and M. Hummel",
	abstract = "Beyond a certain number of cores, multi-core processing chips will require a network-on-chip (NoC) to interconnect the cores and overcome the limitations of a bus. NoCs must be carefully designed to meet constraints like power consumption, area, and ultra low latencies. Although 2D meshes with DOR (dimension-order-routing) meet these constraints, the need for partitioning (e.g. virtual machines, coherency domains) and traffic isolation may prevent the use of DOR routing. Also, core heterogeneity and manufacturing and run-time faults may lead to partially irregular topologies. Routing in these topologies is complex, and previously proposed solutions required routing tables, which drastically increase power consumption, area, and latency. The exception is LBDR (logic-based distributed routing), a flexible routing method for irregular topologies that removes the need for using routing tables (both at end-nodes and switches), thus achieving large savings in chip area and power consumption. But LBDR lacks support for multicast and broadcast, which are required to efficiently support cache coherence protocols both for single and multiple coherence domains. In this paper we propose bLBDR, an efficient multicast and broadcast mechanism built on top of LBDR. bLBDR performs multicast operations using a logic-based broadcast within a domain (a region with bounds). This allows us to isolate the traffic into different domains, thus enabling the concept of virtualization at the NoC level. Also, bLBDR extends the concept of routing regions in LBDR by providing a mechanism that allows the flexible definition of multiple domains, sets of network resources. bLBDR fulfills all the practical requirements, including not only low latency and power and area efficiency, but also support for virtualization, partitionability, fault-tolerance, traffic isolation and broadcast across the entire network as well as constrained to coherency domains or regions. All this is achieved by a small and power efficient routing logic (7$\times$ area savings and 17$\times$ power reduction when compared to a routing table in an 8 $\times$ 8 mesh network).",
	address = "Piscataway, NJ, USA",
	booktitle = "2008 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41)",
	journal = "2008 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41)",
	keywords = "microprocessor chips;network topology;network-on-chip;power consumption;protocols;",
	note = "CMP;chip multiprocessors;multicore processing chips;network-on-chip;power consumption;dimension-order-routing;logic-based distributed routing;routing tables;cache coherence protocols;",
	pages = "364 - 75",
	title = "{E}fficient unicast and multicast support for {CMP}s",
	url = "http://dx.doi.org/10.1109/MICRO.2008.4771805",
	year = 2008
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. An efficient switching technique for NoCs with reduced buffer requirements. In Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on. 2008, 713 - 20. URL BibTeX

@conference{10428505,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on chip (NoCs) connect the components located inside a chip. Overall system performance depends on NoC performance, which is affected by several factors. One of them is the network clock frequency, imposed by the critical path delay. Recent works show that the switch critical path includes buffer control logic. Consequently, by removing switch buffers, switch frequency can be doubled. In this paper, we exploit this idea, proposing a new switching technique for NoCs which requires a reduced amount of storage at the switches. It is based on replacing switch port buffers by single latches. By doing so, network cycle can be reduced, which reduces packet latency. In addition, power and area consumption requirements can be reduced. However, since there are no buffers at the switch ports, packets cannot be stopped. Packets stopped due to contention are dropped and reinjected from their senders via negative acknowledgments. Packet dropping is strongly reduced by exploiting the wiring capability of NoCs.",
	address = "Piscataway, NJ, USA",
	booktitle = "Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on",
	journal = "Proceedings of the Fourteenth International Conference on Parallel and Distributed Systems",
	keywords = "network-on-chip;performance evaluation;",
	note = "switching technique;reduced buffer requirements;networks on chip;network clock frequency;critical path delay;buffer control logic;network cycle;packet dropping;",
	pages = "713 - 20",
	title = "{A}n efficient switching technique for {N}o{C}s with reduced buffer requirements",
	url = "http://dx.doi.org/10.1109/ICPADS.2008.43",
	year = 2008
}

Scott Pakin, Craig Stunkel, Jose Flich, Francisco Alfaro, Gheorghe Almasi, Angelos Bilas, Ron Brightwell, Darius Buntinas, Wu-Chun Feng, Mitchell Gusat, Nectarios Koziris, Pedro Lopez, Andrew Lumsdaine, Jarek Nieplocha, Greg Pfister, Jamie Riotto, Vikram Saletore, Evan Speight, Pete Wyckoff, D K Panda, Jose Duato and Mazin Yousif. Workshop 9 Introduction: The Workshop on Communication Architecture for Clusters - CAC 2008. IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, pages IEEE Computer Societ, 2008. URL BibTeX

@article{20083711535136,
	author = "Scott Pakin and Craig Stunkel and Flich, Jose and Francisco Alfaro and Gheorghe Almasi and Angelos Bilas and Ron Brightwell and Darius Buntinas and Wu-Chun Feng and Mitchell Gusat and Nectarios Koziris and Lopez, Pedro and Andrew Lumsdaine and Jarek Nieplocha and Greg Pfister and Jamie Riotto and Vikram Saletore and Evan Speight and Pete Wyckoff and D.K. Panda and Duato, Jose and Mazin Yousif",
	abstract = "No abstract available",
	address = "Miami, FL, United States",
	journal = "IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM",
	pages = "IEEE Computer Societ",
	title = "{W}orkshop 9 {I}ntroduction: {T}he {W}orkshop on {C}ommunication {A}rchitecture for {C}lusters - {CAC} 2008",
	url = "http://dx.doi.org/10.1109/IPDPS.2008.4536118",
	year = 2008
}

A Martinez, F J Alfaro, J L Sanchez and Jose Duato. Providing full QoS with 2 VCs in high-speed switches. 2008, 345 - 354. URL BibTeX

@conference{20090111835975,
	author = "A. Martinez and F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "Current interconnect standards propose 16 or even more virtual channels (VCs) for provision of quality of service (QoS). However, VCs increase the complexity of the switch and the scheduling delays. In a previous paper, we have shown how to use only two VCs for full QoS support at the switches. In this paper, we explore thoroughly two alternative switch designs that take advantage of this reduction. We analyze their feasibility in a single chip implementation and show that they get a noticeable performance while greatly reducing the cost and power consumption of the network. © 2008 Springer Berlin Heidelberg.",
	address = "Estoril, Portugal",
	issn = 03029743,
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Quality of service",
	keywords = "Diesel engines;Paper;Switches;Ubiquitous computing;",
	note = "Current interconnect;Power consumptions;Qos supports;Single chips;Speed switches;Switch designs;Virtual channels;",
	pages = "345 - 354",
	title = "{P}roviding full {Q}o{S} with 2 {VC}s in high-speed switches",
	url = "http://dx.doi.org/10.1007/978-3-540-89524-4-35",
	volume = "5200 LNCS",
	year = 2008
}

, Jose Flich and Jose Duato. On the Potentials of Segment-Based Routing for NoCs. In Parallel Processing, 2008. ICPP '08. 37th International Conference on. 2008, 594 -603. URL, DOI BibTeX

@conference{4625898,
	author = ", and Flich, Jose and Duato, Jose",
	abstract = "The topology, the routing algorithm and the way the traffic pattern is distributed over the network influence the ultimate performance of the interconnection network. Off-chip high-performance interconnects provide mechanisms to support irregular topologies, whereas in on-chip networks the topology is fixed at design time. Continuous trend on device miniaturization and high volume manufacturing increase the probability of faults in embedded systems, leading to irregular topologies. Also, partitionability and virtualization of the entire on-chip network is envisioned for future systems. These trends lead to the need of routing algorithms that adapt to the static or dynamic changes in irregular topologies. In this paper we analyze the benefits of the reconfiguration at the routing algorithm level in order to allow topology changes. That is, support topology changes that appear on the network due to different reasons including switch or link failures, energy reduction decisions or design and manufacturing issues. We perform an exhaustive analysis on the performance impact of the routing algorithm in a NoC system. Our aim is to enable the possibility of reconfiguration of the routing algorithm. We take advantage of the flexibility offered by the segment-based routing methodology that allows a fast computation of many deadlock-free routing algorithms by obtaining different segmentation processes and routing restriction policies. This study analyzes the potentials offered by SR. Results show that the election of the routing algorithm may greatly affect the final performance of the network. Additionally, we propose an organized segmentation process that achieves reliable performance with low variability for all topologies studied under uniform traffic conditions. These results encourage us to search for a dynamic mechanism that adapts the routing algorithm to the traffic.",
	booktitle = "Parallel Processing, 2008. ICPP '08. 37th International Conference on",
	doi = "10.1109/ICPP.2008.56",
	issn = "0190-3918",
	keywords = "NoC;deadlock-free routing algorithms;embedded systems;interconnection network;off-chip high-performance interconnects;routing algorithm;segment-based routing;segment-based routing methodology;traffic pattern;uniform traffic conditions;interconnections;net",
	month = "9-12",
	pages = "594 -603",
	title = "{O}n the {P}otentials of {S}egment-{B}ased {R}outing for {N}o{C}s",
	url = "http://dx.doi.org/10.1109/ICPP.2008.56",
	year = 2008
}

Jose Flich, Samuel Rodrigo, Jose Duato, T Sodring, A G Solheim, T Skeie and O Lysne. On the Potential of NoC Virtualization for Multicore Chips. In Complex, Intelligent and Software Intensive Systems, 2008. CISIS 2008. International Conference on. 2008, 801 -807. DOI BibTeX

@conference{4606771,
	author = "Flich, Jose and Rodrigo, Samuel and Duato, Jose and T. Sodring and A.G. Solheim and T. Skeie and O. Lysne",
	abstract = "As the end of Moores-law is on the horizon, power becomes a limiting factor to continuous increases in performance gains for single-core processors. Processor engineers have shifted to the multicore paradigm and many-core processors are a reality. Within the context of these multi-core chips, three key metrics point themselves out as being of major importance, performance, fault-tolerance (including yield), and power consumption. A solution that optimizes all three of these metrics is challenging. As the number of cores increases the importance of the interconnection network-on-chip (NoC) grows as well, and chip designers should aim to optimize these three key metrics in the NoC context as well. In this paper we identify and discuss the main properties that a NoC must exhibit in order to enable such optimizations. In particular, we propose the use of virtualization techniques at the NoC level. As a major finding, we identify the implementation of routing algorithms to become a key design parameter in order to achieve an effective virtualization of the chip should also supporting broadcast within the virtualized context. The intention behind this paper is for it to serve as a position paper on the topic of virtualization for NoC and the challenges that should be met at the routing layer in order to maximize performance, fault-tolerance and power consumption in multicore chips.",
	booktitle = "Complex, Intelligent and Software Intensive Systems, 2008. CISIS 2008. International Conference on",
	doi = "10.1109/CISIS.2008.97",
	keywords = "Moores-law;NoC virtualization;interconnection network-on-chip;many-core processors;multicore chips;routing algorithms;single-core processors;microprocessor chips;multiprocessor interconnection networks;network-on-chip;",
	month = "4-7",
	pages = "801 -807",
	title = "{O}n the {P}otential of {N}o{C} {V}irtualization for {M}ulticore {C}hips",
	year = 2008
}

H Eberle, P J Garcia, Jose Flich, Jose Duato, R Drost, N Gura, D Hopkins and W Olesinski. High-radix crossbar switches enabled by Proximity Communication. In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for. 2008, 1 -12. DOI BibTeX

@conference{5219754,
	author = "H. Eberle and P.J. Garcia and Flich, Jose and Duato, Jose and R. Drost and N. Gura and D. Hopkins and W. Olesinski",
	abstract = "We describe a novel way to implement high-radix crossbar switches. Our work is enabled by a new chip interconnect technology called proximity communication (PxC) that offers unparalleled chip IO density. First, we show how a crossbar architecture is topologically mapped onto a PxC-enabled multi-chip module (MCM). Then, we describe a first prototype implementation of a small-scale switch based on a PxC MCM. Finally, we present a performance analysis of two large-scale switch configurations with 288 ports and 1,728 ports, respectively, contrasting a 1-stage PxC-enabled switch and a multi-stage switch using conventional technology. Our simulation results show that (a) arbitration delays in a large 1-stage switch can be considerable, (b) multi-stage switches are extremely susceptible to saturation under non-uniform traffic, a problem that becomes worse for higher radices (1-stage switches, in contrast, are not affected by this problem).",
	booktitle = "High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for",
	doi = "10.1109/SC.2008.5219754",
	keywords = "PxC-enabled switch;chip interconnect technology;crossbar architecture;high-radix crossbar switches;multichip module;multistage switch;proximity communication;small-scale switch;unparalleled chip IO density;multichip modules;multiprocessor interconnection",
	month = "15-21",
	pages = "1 -12",
	title = "{H}igh-radix crossbar switches enabled by {P}roximity {C}ommunication",
	year = 2008
}

Ricardo Fernandez-Pascual, Jose M Garcia, Manuel E Acacio and Jose Duato. Fault-tolerant cache coherence protocols for CMPs: Evaluation and trade-offs. 2008, 555 - 568. URL BibTeX

@conference{20090511881194,
	author = "Ricardo Fernandez-Pascual and Jose M. Garcia and Manuel E. Acacio and Duato, Jose",
	abstract = "One way of dealing with transient faults that will affect the interconnection network of future large-scale Chip Multiprocessor (CMP) systems is by extending the cache coherence protocol. Fault tolerance at the level of the cache coherence protocol has been proven to achieve very low performance overhead in absence of faults while being able to support very high fault rates. In this work, we compare two already proposed fault-tolerant cache coherence protocols in a common framework and present a new one based in the cache coherence protocol used in AMD Opteron processors. Also, we thoroughly evaluate the performance of the three protocols, show how to adjust the fault tolerance parameters of the protocols to achieve a desired level of fault tolerance and measure the overhead achieved to be able to support very high transient fault rates. © 2008 Springer Berlin Heidelberg.",
	address = "Bangalore, India",
	issn = 03029743,
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Quality assurance",
	keywords = "Coherent light;Errors;Failure analysis;Fault tolerance;Fault tree analysis;High performance liquid chromatography;Reliability;",
	note = "Cache coherence protocols;Chip multi processor (CMP);Fault rates;Fault-tolerant;One way;Opteron processors;Transient faults;",
	pages = "555 - 568",
	title = "{F}ault-tolerant cache coherence protocols for {CMP}s: {E}valuation and trade-offs",
	url = "http://dx.doi.org/10.1007/978-3-540-89894-8_48",
	volume = "5374 LNCS",
	year = 2008
}

Ricardo Fernandez-Pascual, Jose M Garcia, Manuel E Acacio and Jose Duato. Extending the tokenCMP cache coherence protocol for low overhead fault tolerance in CMP architectures. IEEE Transactions on Parallel and Distributed Systems 19(8):1044 - 1056, 2008. URL BibTeX

@article{20083011390050,
	author = "Ricardo Fernandez-Pascual and Jose M. Garcia and Manuel E. Acacio and Duato, Jose",
	abstract = "It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On the other hand, Chip-multiprocessors (CMP) that integrate several processor cores in a single chip are nowadays the best alternative to more efficient use of the increasing number of transistors that can be placed in a single die. Hence, it is necessary to design new techniques to deal with these faults to be able to build sufficiently reliable Chip Multiprocessors (CMPs). In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using GEMS full system simulator, we compare our proposal against TokenCMP. We show that in absence of failures our proposal does not introduce overhead in terms of increased execution time over TokenCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time more than 15 percent. © 2008 IEEE.",
	address = "445 Hoes Lane - P.O.Box 1331, Piscataway, NJ 08855-1331, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Network architecture",
	keywords = "Coherent light;Fault tolerance;Microprocessor chips;Multiprocessing systems;Nanotechnology;Quality assurance;Reliability;",
	note = "Cache coherence protocols;Chip multi processor (CMP);Chip multi-processors (CMP);CMP architectures;Coherence protocols;Execution time;Low overhead;Message loss;New techniques;Processor cores;Real world;Single chips;System simulators;",
	number = 8,
	pages = "1044 - 1056",
	title = "{E}xtending the token{CMP} cache coherence protocol for low overhead fault tolerance in {CMP} architectures",
	url = "http://dx.doi.org/10.1109/TPDS.2007.70803",
	volume = 19,
	year = 2008
}

Alejandro Martinez, George Apostolopoulos, Francisco J Alfaro, Jose L Sanchez and Jose Duato. Efficient deadline-based QoS algorithms for high-performance networks. IEEE Transactions on Computers 57(7):928 - 939, 2008. URL BibTeX

@article{20091912073930,
	author = "Alejandro Martinez and George Apostolopoulos and Francisco J. Alfaro and Jose L. Sanchez and Duato, Jose",
	abstract = "Quality of service (QoS) is becoming an attractive feature for high-performance networks and parallel machines because, in those environments, there are different traffic types, each one having its own requirements. In that sense, deadline-based algorithms can provide powerful QoS provision. However, the cost associated with keeping ordered lists of packets makes these algorithms impractical for high-performance networks. In this paper, we explore how to efficiently adapt the Earliest Deadline First family of algorithms to high-speed network environments. The results show excellent performance using just two virtual channels, FIFO queues, and a cost feasible with today's technology. © 2008 IEEE.",
	address = "445 Hoes Lane - P.O.Box 1331, Piscataway, NJ 08855-1331, United States",
	issn = 00189340,
	journal = "IEEE Transactions on Computers",
	key = "Quality of service",
	keywords = "Algorithms;Interconnection networks;Quality control;",
	note = "Earliest deadline firsts;Excellent performance;Fifo queues;High-performance networks;High-speed interconnection networks;High-speed networks;Parallel machines;Virtual channels;",
	number = 7,
	pages = "928 - 939",
	title = "{E}fficient deadline-based {Q}o{S} algorithms for high-performance networks",
	url = "http://dx.doi.org/10.1109/TC.2008.39",
	volume = 57,
	year = 2008
}

R Tornero, J M Orduna, , Jose Flich and Jose Duato. CART: Communication-Aware Routing Technique for Application-Specific NoCs. In Digital System Design Architectures, Methods and Tools, 2008. DSD '08. 11th EUROMICRO Conference on. 2008, 26 -31. URL, DOI BibTeX

@conference{4669215,
	author = "R. Tornero and J.M. Orduna and , and Flich, Jose and Duato, Jose",
	abstract = "Networks on Chip (NoCs) have been shown as an efficient solution to the complex on-chip communication problems derived from the increasing number of processor cores. One of the key issues in the design of NoCs is the reduction of both area and power dissipation. As a result, two-dimensional meshes have become the preferred topology, since it offers low and constant link delay. Unfortunately, manufacturing defects or even real-time failures often make the resulting topology to become irregular, preventing the use of traditional routing algorithms. This scenario shows the need for topology-agnostic routing algorithms that provide a valid routing solution when applied over any topology. Moreover, in order to deal with run-time failures, the routing algorithm should be able to fit runtime constraints. This paper proposes a new communication-aware routing technique, referred to as CART, that optimizes the network performance for application-specific NoCs. CART combines a flexible, topology-agnostic routing algorithm with a communication-aware mapping technique that matches the traffic generated by the application with the available network bandwidth. Since the mapping technique can be pruned as needed in order to fit either quality function values or time constraints, CART can be adapted to fit with different computational costs. The evaluation results show that CART significantly improves network performance in terms of both latency and power consumption.",
	booktitle = "Digital System Design Architectures, Methods and Tools, 2008. DSD '08. 11th EUROMICRO Conference on",
	doi = "10.1109/DSD.2008.19",
	isbn = "978-0-7695-3277-6",
	keywords = "CART;application-specific NoC;communication-aware mapping technique;communication-aware routing technique;complex on-chip communication problems;network-on-chip;power dissipation;topology-agnostic routing algorithms;two-dimensional meshes;network routing;",
	month = "3-5",
	pages = "26 -31",
	title = "{CART}: {C}ommunication-{A}ware {R}outing {T}echnique for {A}pplication-{S}pecific {N}o{C}s",
	url = "http://dx.doi.org/10.1109/DSD.2008.19",
	year = 2008
}

Crispin Gomez Requena, Francisco Gilabert Villamon, Maria E Gomez, Pedro Lopez and Jose Duato. Beyond fat-Tree: Unidirectional load-Balanced multistage interconnection network. IEEE Computer Architecture Letters 7(2):49 - 52, 2008. URL BibTeX

@article{20090211850984,
	author = "Crispin Gomez Requena and Francisco Gilabert Villamon and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "The fat-tree is one of the most widely-used topologies by interconnection network manufacturers. Recently, it has been demonstrated that a deterministic routing algorithm that optimally balances the network traffic can not only achieve almost the same performance than an adaptive routing algorithm but also outperforms it. On the other hand, fat-trees require a high number of switches with a non-negligible wiring complexity. In this paper, we propose replacing the fat-tree by a unidirectional multistage interconnection network (UMIN) that uses a traffic balancing deterministic routing algorithm. As a consequence, switch hardware is almost reduced to the half, decreasing, in this way, the power consumption, the arbitration complexity, the switch size itself, and the network cost. Preliminary evaluation results show that the UMIN with the load balancing scheme obtains lower latency than fat-tree for low and medium traffic loads. Furthermore, in networks with a high number of stages or with high radix switches, it obtains the same, or even higher, throughput than fat-tree. © 2006 IEEE.",
	address = "3 Park Avenue, 17th Floor, New York, NY 10016-5997, United States",
	issn = 15566056,
	journal = "IEEE Computer Architecture Letters",
	key = "Computer networks",
	keywords = "Adaptive algorithms;Interconnection networks;Internet;Metropolitan area networks;Routing algorithms;Switches;Switching circuits;Telecommunication networks;Trees;",
	note = "Butterfly network;Deterministic routing;Fat-trees;Multistage Interconnection networks;Traffic balancing;",
	number = 2,
	pages = "49 - 52",
	title = "{B}eyond fat-{T}ree: {U}nidirectional load-{B}alanced multistage interconnection network",
	url = "http://dx.doi.org/10.1109/L-CA.2008.8",
	volume = 7,
	year = 2008
}

Jose Flich, Samuel Rodrigo and Jose Duato. An Efficient Implementation of Distributed Routing Algorithms for NoCs. In Networks-on-Chip, 2008. NoCS 2008. Second ACM/IEEE International Symposium on. 2008, 87 -96. DOI BibTeX

@conference{4492728,
	author = "Flich, Jose and Rodrigo, Samuel and Duato, Jose",
	abstract = "The design of NoCs for multi-core chips introduces new design constraints like power consumption, area, and ultra low latencies. Although 2D meshes are preferred, heterogeneous blocks, fabrication faults, reliability issues, and chip virtualization may lead to the need of irregular topologies or regions. In this situation, efficient routing becomes a challenge. Although the use of routing tables at switches is flexible, it does not scale in terms of latency and area due to its memory requirements. LBDR (logic-based distributed routing) is proposed as a new routing method that removes the need of using routing tables at all. LBDR enables the implementation of many routing algorithms on most of the practical topologies we might find in the near future in a multi-core system. From an initial topology and routing algorithm, a set of three bits per switch/output port is computed. Evaluation results show that, by using a small logic, LBDR mimics the performance of routing algorithms when implemented with routing tables, both in regular and irregular topologies.",
	booktitle = "Networks-on-Chip, 2008. NoCS 2008. Second ACM/IEEE International Symposium on",
	doi = "10.1109/NOCS.2008.4492728",
	keywords = "NoC;distributed routing algorithm;logic-based distributed routing;multicore chip;network-on-chip;routing tables;network routing;network-on-chip;",
	month = "7-10",
	pages = "87 -96",
	title = "{A}n {E}fficient {I}mplementation of {D}istributed {R}outing {A}lgorithms for {N}o{C}s",
	year = 2008
}

Antonio Robles, Aurelio Bermudez, Rafael Casado, Francisco J Quiles, Tor Skeie and Jose Duato. A proposal for managing ASI fabrics. Journal of Systems Architecture 54(7):664 - 678, 2008. URL BibTeX

@article{20083011398981,
	author = "Robles, Antonio and Aurelio Bermudez and Rafael Casado and Francisco J. Quiles and Tor Skeie and Duato, Jose",
	abstract = "In recent years, computer performance has been significantly increased. As a consequence, data I/O systems have become bottlenecks within systems. To alleviate this problem, Advanced Switching was recently proposed as a new standard for future interconnects. The Advanced Switching specification establishes a fabric management infrastructure, which is in charge of updating the set of fabric paths each time a topological change takes place. The use of source routing and passive switches makes unfeasible the adaptation to this new technology of many existing proposals to handle topological changes in switched interconnection networks. This paper presents a fabric management mechanism for Advanced Switching, but also suitable for other source routing interconnects. Furthermore, the work presents a detailed performance evaluation for this proposal. This evaluation allows us to identify the main drawbacks of the mechanism and to define future improvements. © 2007 Elsevier B.V. All rights reserved.",
	address = "P.O. Box 211, Amsterdam, 1000 AE, Netherlands",
	issn = 13837621,
	journal = "Journal of Systems Architecture",
	key = "Fabrics",
	keywords = "Mechanisms;Standards;Switching circuits;Topology;",
	note = "Advanced switching;Computer performance;Elsevier (CO);I/O systems;Management Infrastructure;New technologies;Passive switches;Performance evaluation (PE);Source routing;Topological changes;",
	number = 7,
	pages = "664 - 678",
	title = "{A} proposal for managing {ASI} fabrics",
	url = "http://dx.doi.org/10.1016/j.sysarc.2007.12.002",
	volume = 54,
	year = 2008
}

Ricardo Fernandez-Pascual, Jose M Garcia, Manuel E Acacio and Jose Duato. A fault-tolerant directory-based cache coherence protocol for CMP architectures. 2008, 267 - 276. URL BibTeX

@conference{20084211640662,
	author = "Ricardo Fernandez-Pascual and Jose M. Garcia and Manuel E. Acacio and Duato, Jose",
	abstract = "Current technology trends of increased scale of integration are pushing CMOS technology into the deep-submicron domain, enabling the creation of chips with a significantly greater number of transistors but also more prone to transient failures. Hence, computer architects will have to consider reliability as a prime concern for future chip-multiprocessor designs (CMPs). Since the interconnection network of future CMPs will use a significant portion of the chip real estate, it will be especially affected by transient failures. We propose to deal with this kind of failures at the level of the cache coherence protocol instead of ensuring the reliability of the network itself. Particularly, we have extended a directory-based cache coherence protocol to ensure correct program semantics even in presence of transient failures in the interconnection network. Additionally, we show that our proposal has virtually no impact on execution time with respect to a non fault-tolerant protocol, and just entails modest hardware and network traffic overhead. © 2008 IEEE.",
	address = "Anchorage, AK, United States",
	journal = "Proceedings of the International Conference on Dependable Systems and Networks",
	key = "Computer networks",
	keywords = "CMOS integrated circuits;Coherent light;Information theory;Interconnection networks;Internet;Nanotechnology;Network architecture;Network protocols;Reliability;Sensor networks;",
	note = "Cache coherence protocols;CMOS technologies;CMP architectures;Computer architects;Current technologies;Dependable systems;Execution time;Fault-tolerant;Fault-tolerant protocols;International conferences;Multiprocessor designs;Network traffics;Program semantics;Real state;Scale of integration;",
	pages = "267 - 276",
	title = "{A} fault-tolerant directory-based cache coherence protocol for {CMP} architectures",
	url = "http://dx.doi.org/10.1109/DSN.2008.4630095",
	year = 2008
}

Rafael Tornero, Juan M Orduna, Maurizio Palesi and Jose Duato. A communication-aware topological mapping technique for NoCs. 2008, 910 - 919. URL BibTeX

@conference{20083911589416,
	author = "Rafael Tornero and Juan M. Orduna and Maurizio Palesi and Duato, Jose",
	abstract = "Networks-on-Chip (NoCs) have been proposed as a promising solution to the complex on-chip communication problems derived from the increasing number of processor cores. The design of NoCs involves several key issues, being the topological mapping (the mapping of the Intellectual Properties (IPs) to network nodes) one of them. Several proposals have been focused on topological mapping last years, but they require the experimental validation of each mapping considered. In this paper, we propose a communication-aware topological mapping technique for NoCs. This technique is based on the experimental correlation of the network model with the actual network performance, thus avoiding the need to experimentally evaluate each mapping explored. The evaluation results show that the proposed technique can provide better performance than the currently existing techniques (in terms of both network latency and energy consumption). Additionally, it can be used for both regular and irregular topologies. © 2008 Springer-Verlag Berlin Heidelberg.",
	address = "Las Palmas de Gran Canaria, Spain",
	issn = 03029743,
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Conformal mapping",
	keywords = "Biological materials;Chlorine compounds;Communication;Energy policy;Microprocessor chips;Systems engineering;Telecommunication;Topology;",
	note = "Energy consumption;Evaluation results;Experimental validations;Network latencies;Network modelling;Network nodes;Network performances;Networks-on-chip;On-chip communications;Parallel processing;Processor cores;Topological mapping;",
	pages = "910 - 919",
	title = "{A} communication-aware topological mapping technique for {N}o{C}s",
	url = "http://dx.doi.org/10.1007/978-3-540-85451-7_98",
	volume = "5168 LNCS",
	year = 2008
}

Rafael Ubal, Julio Sahuquillo, Salvador Petit, Pedro Lopez and Jose Duato. VB-MT: Design Issues and Performance of the Validation Buffer Microarchitecture for Multithreaded Processors. In Parallel Architecture and Compilation Techniques, 2007. PACT 2007. 16th International Conference on. 2007, 429 -429. URL, DOI BibTeX

@conference{4336257,
	author = "Ubal, Rafael and Sahuquillo, Julio and Petit, Salvador and Lopez, Pedro and Duato, Jose",
	abstract = "The validation buffer (VB) Microarchitecture retires instructions out of order, by substituting the classical ROB by the VB structure. The VB removes the negative effect of long latency instructions located at the ROB head, which prevent other instructions from retiring and cause frequent pipeline stalls due to lack of space in the ROB. This work analyzes different multithreading models (coarse grain, fine grain and simultaneous multithreading) and a set of different instruction fetch policies.",
	booktitle = "Parallel Architecture and Compilation Techniques, 2007. PACT 2007. 16th International Conference on",
	doi = "10.1109/PACT.2007.4336257",
	issn = "1089-795X",
	keywords = "ROB head;VB structure;instruction fetch policies;multithreaded processors;validation buffer microarchitecture;buffer storage;multi-threading;parallel architectures;storage allocation;",
	month = "sept.",
	pages = "429 -429",
	title = "{VB}-{MT}: {D}esign {I}ssues and {P}erformance of the {V}alidation {B}uffer {M}icroarchitecture for {M}ultithreaded {P}rocessors",
	url = "http://dx.doi.org/10.1109/PACT.2007.4336257",
	year = 2007
}

Crispín Gomez, Francisco Gilabert, Maria E Gomez, Pedro Lopez and Jose Duato. Deterministic versus Adaptive Routing in Fat-Trees. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International. March 2007, 1 -8. URL, DOI BibTeX

@conference{4228210,
	author = "Gomez, Crisp{\'i}n and Gilabert, Francisco and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Clusters of PCs have become very popular to build high performance computers. These machines use commodity PCs linked by a high speed interconnect. Routing is one of the most important design issues of interconnection networks. Adaptive routing usually better balances network traffic, thus allowing the network to obtain a higher throughput. However, adaptive routing introduces out-of-order packet delivery, which is unacceptable for some applications. Concerning topology, most of the commercially available interconnects are based on fat-tree. Fat-trees offer a rich connectivity among nodes, making possible to obtain paths between all source-destination pairs that do not share any link. We exploit this idea to propose a deterministic routing algorithm for fat-trees, comparing it with adaptive routing in several workloads. The results show that deterministic routing can achieve a similar, and in some scenarios higher, level of performance than adaptive routing, while providing in-order packet delivery.",
	booktitle = "Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International",
	doi = "10.1109/IPDPS.2007.370482",
	isbn = "1-4244-0910-1",
	keywords = "PC clusters;adaptive routing;deterministic routing algorithm;fat-tree topology;interconnection networks;packet delivery;multistage interconnection networks;telecommunication network routing;telecommunication network topology;telecommunication traffic;tree",
	month = "march",
	pages = "1 -8",
	title = "{D}eterministic versus {A}daptive {R}outing in {F}at-{T}rees",
	url = "http://dx.doi.org/10.1109/IPDPS.2007.370482",
	year = 2007
}

Crispín Gomez, Francisco Gilabert, Maria E Gomez, Pedro Lopez and Jose Duato. Deterministic versus adaptive routing in fat-trees. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International. 2007, 8 pp. -. URL, DOI BibTeX

@conference{9516533,
	author = "Gomez, Crisp{\'i}n and Gilabert, Francisco and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Clusters of PCs have become very popular to build high performance computers. These machines use commodity PCs linked by a high speed interconnect. Routing is one of the most important design issues of interconnection networks. Adaptive routing usually better balances network traffic, thus allowing the network to obtain a higher throughput. However, adaptive routing introduces out-of-order packet delivery, which is unacceptable for some applications. Concerning topology, most of the commercially available interconnects are based on fat-tree. Fat-trees offer a rich connectivity among nodes, making possible to obtain paths between all source-destination pairs that do not share any link. We exploit this idea to propose a deterministic routing algorithm for fat-trees, comparing it with adaptive routing in several workloads. The results show that deterministic routing can achieve a similar, and in some scenarios higher, level of performance than adaptive routing, while providing in-order packet delivery.",
	booktitle = "Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International",
	doi = "10.1109/IPDPS.2007.370482",
	isbn = "1-4244-0910-1",
	journal = "2007 IEEE International Parallel and Distributed Processing Symposium (IEEE Cat. No.07TH8938)",
	keywords = "multistage interconnection networks;telecommunication network routing;telecommunication network topology;telecommunication traffic;trees;",
	month = "Mar.",
	note = "adaptive routing;fat-tree topology;PC clusters;interconnection networks;packet delivery;deterministic routing algorithm;",
	pages = "8 pp. -",
	publisher = "IEEE Computer Society",
	title = "{D}eterministic versus adaptive routing in fat-trees",
	url = "http://dx.doi.org/10.1109/IPDPS.2007.370482",
	year = 2007
}

, Jose Flich, Jose Duato, Sven-Arne Reinemo and Tor Skeie. Boosting Ethernet Performance by Segment-Based Routing. In Parallel, Distributed and Network-Based Processing, 2007. PDP '07. 15th EUROMICRO International Conference on. 2007, 55 -62. URL, DOI BibTeX

@conference{4135259,
	author = ", and Flich, Jose and Duato, Jose and Sven-Arne Reinemo and Tor Skeie",
	abstract = "Ethernet is turning out to be a cost-effective solution for building cluster networks offering compatibility, simplicity, high bandwidth, scalability and a good performance-to-cost ratio. Nevertheless, Ethernet still makes inefficient use of network resources (links) and suffers from long failure recovery time due to the lack of a suitable routing algorithm. In this paper we embed an efficient routing algorithm into 802.3 Ethernet technology, making it possible to use off-the-shelf equipment to build high-performance and cost-effective Ethernet clusters, with an efficient use of link bandwidth and with fault tolerant capabilities. The algorithm, referred to as segment-based routing (SR), is a deterministic routing algorithm that achieves high performance without the need for virtual channels (not available in Ethernet). Moreover, SR is topology agnostic, meaning it can be applied to any topology, and tolerates any combination of faults derived from the original topology when combined with static reconfiguration. Through simulations we verify an overall improvement in throughput by a factor of 1.2 to 10.0 when compared to the conventional Ethernet routing algorithm, the spanning tree protocol (STP), and other topology agnostic routing algorithms such as Up*/Down* and tree-based turn-prohibition, the last one being recently proposed for Ethernet",
	booktitle = "Parallel, Distributed and Network-Based Processing, 2007. PDP '07. 15th EUROMICRO International Conference on",
	doi = "10.1109/PDP.2007.28",
	issn = "1066-6192",
	keywords = "Ethernet technology;Ethernet clusters;cluster networks;fault tolerant capability;off-the-shelf equipment;routing algorithm;segment-based routing;spanning tree protocol;static reconfiguration;topology agnostic routing algorithms;tree-based turn-prohi",
	month = "feb.",
	pages = "55 -62",
	title = "{B}oosting {E}thernet {P}erformance by {S}egment-{B}ased {R}outing",
	url = "http://dx.doi.org/10.1109/PDP.2007.28",
	year = 2007
}

Marina Alonso, Salvador Coll, Vicente Santonja, Juan Miguel Martínez, Pedro Lopez and Jose Duato. Power-aware fat-tree networks using on/off links. In R Perrott, BM Chapman, J Subhlok, RF DeMello and LT Yang (eds.). HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, PROCEEDINGS 4782. 2007, 472-483. BibTeX

@conference{ISI:000250940200040,
	author = "Alonso, Marina and Coll, Salvador and Santonja, Vicente and Mart{\'i}nez, Juan Miguel and Lopez, Pedro and Duato, Jose",
	abstract = "Nowadays, power consumption reduction techniques are being increasingly used in computer systems, and high-performance computing systems are not an exception. In particular, the power consumed by the interconnect circuitry has a non-negligible contribution to the total system budget. In this scenario, fat-tree interconnection networks are one of the most popular topologies. This topology is particularly well-suited for applying power consumption reduction techniques since it provides multiple alternative paths for each source/destination pair. In this paper, we present a mechanism that dynamically adjusts the available network bandwidth by switching links on and off, according to the traffic requirements. This mechanism provides significant reduction in power consumption while maintaining the original underlying routing algorithm, at the expense of slight latency increase for low loads.",
	booktitle = "HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, PROCEEDINGS",
	editor = "Perrott, R and Chapman, BM and Subhlok, J and DeMello, RF and Yang, LT",
	isbn = "978-3-540-75443-5",
	issn = "0302-9743",
	note = "3rd International Conference on High Performance Computing and Communications (HPCC 2007), Houston, TX, SEP 26-28, 2007",
	pages = "472-483",
	series = "LECTURE NOTES IN COMPUTER SCIENCE",
	title = "{P}ower-aware fat-tree networks using on/off links",
	volume = 4782,
	year = 2007
}

A Martinez-Vicente, P J Garcia, F J Alfaro, J L Sanchez, Jose Flich, F J Quiles and Jose Duato. Integrated QoS provision and congestion management for interconnection networks. In Euro-Par 2007. Parallel Processing. Proceedings 13th International Euro-Par Conference. LNCS 4641. 2007, 837 - 47. BibTeX

@conference{9689023,
	author = "A. Martinez-Vicente and P.J. Garcia and F.J. Alfaro and J.L. Sanchez and Flich, Jose and F.J. Quiles and Duato, Jose",
	abstract = "Both QoS support and congestion management techniques have become essential for achieving good performance in current high-speed interconnection networks. However, traditional techniques proposed for both issues require too many resources for being implemented. In this paper we propose a new switch architecture that efficiently uses the same resources to offer both congestion management and QoS provision. It is as effective as previous proposals, but much more cost-effective.",
	address = "Berlin, Germany",
	booktitle = "Euro-Par 2007. Parallel Processing. Proceedings 13th International Euro-Par Conference.",
	journal = "Euro-Par 2007. Parallel Processing. Proceedings 13th International Euro-Par Conference. (Lecture Notes in Computer Science vol. 4641)",
	keywords = "computer network management;multistage interconnection networks;quality of service;queueing theory;telecommunication congestion control;",
	note = "switch architecture;interconnection network;quality of service;QoS support;congestion management technique;",
	pages = "837 - 47",
	title = "{I}ntegrated {Q}o{S} provision and congestion management for interconnection networks",
	volume = "LNCS 4641",
	year = 2007
}

A Martinez, F J Alfaro, J L Sanchez and Jose Duato. Deadline-based QoS algorithms for high-performance networks. 2007, 9 pp. -. BibTeX

@conference{9516704,
	author = "A. Martinez and F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "Quality of service (QoS) is becoming an attractive feature for high-performance networks and parallel machines because it could allow a more efficient use of resources. Deadline-based algorithms can provide powerful QoS provision. However, the cost associated with keeping ordered lists of packets makes them impractical for high-performance networks. In this paper, we explore how to adapt efficiently the earliest deadline first family of algorithms to the high-speed networks environments. The results show excellent performance using just two virtual channels, FIFO queues, and a cost feasible with today's technology.",
	address = "Piscataway, NJ, USA",
	journal = "2007 IEEE International Parallel and Distributed Processing Symposium (IEEE Cat. No.07TH8938)",
	keywords = "multiprocessor interconnection networks;parallel machines;quality of service;",
	note = "quality of service;QoS;high-performance network;parallel machine;earliest deadline first algorithm;high-speed network;virtual channel;FIFO queue;",
	pages = "9 pp. -",
	title = "{D}eadline-based {Q}o{S} algorithms for high-performance networks",
	year = 2007
}

Joan-Lluis Ferrer, Elvira Baydal, Antonio Robles, Pedro Lopez and Jose Duato. Congestion management in MINs through marked validated packets. 2007, 260 - 7. BibTeX

@conference{10266202,
	author = "Ferrer, Joan-Lluis and Baydal, Elvira and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = {Congestion management is a very critical problem tackled in interconnection networks for years but not solved yet. Although several mechanisms have been recently proposed for lossless multistage interconnection networks (MINs), they either have drawbacks or are partial solutions. Some of them introduce penalty over packets not really addressed to the hot-spots, whereas others can cope only with congestion situations that last a short time. In this paper, we propose an effective and efficient congestion management mechanism for lossless interconnection networks based on explicit congestion notification. The mechanism uses two different flags in ACK packets, a Marking Bit (MB) and a Validation Bit (VB), to detect congestion and warn the origin hosts. In this way, packets belonging to "coldflows" but stopped because of head-of-line (HOL) blocking can be distinguished from "hotflow" packets which are really causing congestion. In response, origin hosts can apply corrective actions only to the "hotflows", minimizing the negative impact on "coldflows" performance. Evaluation results show that the proposed congestion management strategy is able to avoid the degradation of network performance, regardless of traffic load and the location of the congestion in the network.},
	address = "Piscataway, NJ, USA",
	journal = "15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (PDP'07)",
	keywords = "multistage interconnection networks;",
	note = "congestion management;lossless multistage interconnection network;validated packet;marked packet;ACK packet;head-of-line blocking;marking bit;validation bit;",
	pages = "260 - 7",
	title = "{C}ongestion management in {MIN}s through marked validated packets",
	year = 2007
}

Jose Flich, , Pedro Lopez and Jose Duato. Region-Based Routing: An Efficient Routing Mechanism to Tackle Unreliable Hardware in Network on Chips. In Networks-on-Chip, 2007. NOCS 2007. First International Symposium on. 2007, 183 -194. URL, DOI BibTeX

@conference{4209007,
	author = "Flich, Jose and , and Lopez, Pedro and Duato, Jose",
	abstract = "The design of scalable and reliable interconnection networks for system on chips (SoCs) introduce new design constraints not present in current multicomputer systems. Although regular topologies are preferred for building NoCs, heterogeneous blocks, fabrication faults and reliability issues derived from the high integration scale may lead to irregular topologies. In this situation, efficient routing becomes a challenge. Although table-based routing allows the use of most routing algorithms on any topology, it does not scale in terms of latency and area. In this paper we propose the region-based routing mechanism that avoids the scalability problems of table-based solutions. From an initial topology and routing algorithm, the mechanism groups, at every switch, destinations into different regions based on the output ports. By doing this, redundant routing information typically found in routing tables is eliminated. Evaluation results show that the mechanism requires only four regions to support several routing algorithms in a 2D mesh with no performance degradation. Moreover, when dealing with link failures, our results indicate that the mechanism combined with the segment-based routing algorithm is able to pack all the routing information into eight regions providing high throughput. The paper provides also a simple and efficient hardware implementation of the mechanism requiring only 240 logic gates per switch to support eight regions in a 2D mesh topology",
	booktitle = "Networks-on-Chip, 2007. NOCS 2007. First International Symposium on",
	doi = "10.1109/NOCS.2007.39",
	keywords = "2D mesh topology;interconnection networks;multicomputer systems;network on chips;region-based routing;segment-based routing algorithm;system on chips;table-based routing;integrated circuit interconnections;logic design;microprocessor chips;network routing",
	month = "7-9",
	pages = "183 -194",
	title = "{R}egion-{B}ased {R}outing: {A}n {E}fficient {R}outing {M}echanism to {T}ackle {U}nreliable {H}ardware in {N}etwork on {C}hips",
	url = "http://dx.doi.org/10.1109/NOCS.2007.39",
	year = 2007
}

A Martinez, F J Alfaro, J L Sanchez and Jose Duato. Providing full QoS with 2 VCs in high-speed switches. 2007, 345 - 54. BibTeX

@conference{10418557,
	author = "A. Martinez and F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "Current interconnect standards propose 16 or even more virtual channels (VCs) for provision of quality of service (QoS). However, VCs increase the complexity of the switch and the scheduling delays. In a previous paper, we have shown how to use only two VCs for full QoS support at the switches. In this paper, we explore thoroughly two alternative switch designs that take advantage of this reduction. We analyze their feasibility in a single chip implementation and show that they get a noticeable performance while greatly reducing the cost and power consumption of the network.",
	address = "Berlin, Germany",
	journal = "Information Networking. Towards Ubiquitous Networking and Services. International Conference, ICOIN 2007",
	keywords = "quality of service;scheduling;switches;telecommunication switching;",
	note = "QoS;high-speed switches;current interconnect standard;virtual channel;quality of service;scheduling delay;power consumption;",
	pages = "345 - 54",
	title = "{P}roviding full {Q}o{S} with 2 {VC}s in high-speed switches",
	year = 2007
}

Gaspar Mora, P J Garcia, Jose Flich and Jose Duato. RECN-IQ: A Cost-Effective Input-Queued Switch Architecture with Congestion Management. In Parallel Processing, 2007. ICPP 2007. International Conference on. 2007, 74 -74. URL, DOI BibTeX

@conference{4343881,
	author = "Mora, Gaspar and P.J. Garcia and Flich, Jose and Duato, Jose",
	abstract = "As the number of computing and storage nodes keeps increasing, the interconnection network is becoming a key element of many computing and communication systems, where the overall performance directly depends on network performance. This performance may dramatically drop during congestion situations. Although congestion may be avoided by over dimensioning the network, the current trend is to reduce overall cost and power consumption by reducing the number of network components. Thus, the network will be prone to congestion, thereby becoming mandatory the use of congestion management techniques. In that sense, the technique known as Regional Explicit Congestion Notification (RECN) completely eliminates the Head-of-Line (HOL) blocking produced by congested packets, turning congestion harmless. However, RECN has been designed for switches with queues at input and output ports (CIOQ switches), thus it can not be directly applied to other types of switches. Additionally, the method RECN uses for detecting congestion requires several detection queues that increase the memory requirements and thus switch cost. Thus, we completely redefine the RECN mechanism in order to achieve different goals. First, we adapt RECN to a switch organization with queues only at input ports (IQ switches). These switches are simpler and cheaper to produce than CIOQ ones. Second, we propose a new method for detecting congestion that does not require several detection queues, thereby reducing RECN memory requirements. These improvements lead to achieve a cost-effective switch organization that derive maximum performance even in the presence of congestion. Also, we present in detail a realistic switch architecture supporting the new mechanism. Results demonstrate that the new RECN version in an IQ switch achieves maximum network performance in all the analyzed situations. 
These results have been a reduction factor of data memory requirements of 5 with respect to the previous RECN mechanism in CIOQ switches.",
	booktitle = "Parallel Processing, 2007. ICPP 2007. International Conference on",
	doi = "10.1109/ICPP.2007.71",
	issn = "0190-3918",
	keywords = "RECN-IQ memory requirement;cost-effective input-queued switch architecture;head-of-line blocking;interconnection network;packet congestion management technique;power consumption;regional explicit congestion notification;computer architecture;multiprocesso",
	month = "10-14",
	pages = "74 -74",
	title = "{RECN}-{IQ}: {A} {C}ost-{E}ffective {I}nput-{Q}ueued {S}witch {A}rchitecture with {C}ongestion {M}anagement",
	url = "http://dx.doi.org/10.1109/ICPP.2007.71",
	year = 2007
}

S Rueda, P Morillo, J M Orduna and Jose Duato. On the characterization of peer-to-peer distributed virtual environments. 2007, 107 - 114. URL BibTeX

@conference{20073210756837,
	author = "S. Rueda and P. Morillo and J.M. Orduna and Duato, Jose",
	abstract = "Large scale distributed virtual environments (DVEs) have become a major trend in distributed applications, mainly due to the enormous popularity of multi-player online games in the entertainment industry. Since architectures based on networked servers seem to be not scalable enough to support massively multi-player applications, peer-to-peer (P2P) architectures have been proposed as an efficient and truly scalable solution for this kind of systems. However, in order to design efficient DVEs based on peer-to-peer architectures these systems must be characterized, measuring the impact of different client behaviors on system performance. This paper presents the experimental characterization of peer-to-peer distributed virtual environments in regard to well-known performance metrics in distributed systems. Characterization results show that system saturation is inherently avoided due to the peer-to-peer scheme, as it could be expected. Also, these results show that the saturation of a given client exclusively has an effect on the surrounding clients in the virtual world, having no noticeable effect at all on the rest of avatars. Finally, the characterization results show that the response time offered to client computers greatly depends on the number of new connections that these clients have to make when new neighbors appear in the virtual world. These results can be used as the basis for an efficient design of peer-to-peer DVE systems. © 2007 IEEE.",
	address = "Charlotte, NC, United States",
	journal = "Proceedings - IEEE Virtual Reality",
	key = "Virtual reality",
	keywords = "Computer architecture;Distributed computer systems;Interactive computer graphics;Online systems;Servers;",
	note = "Distributed virtual environments;Entertainment industry;Multiplayer online games;Peer-to-peer architectures;",
	pages = "107 - 114",
	title = "{O}n the characterization of peer-to-peer distributed virtual environments",
	url = "http://dx.doi.org/10.1109/VR.2007.352470",
	year = 2007
}

Blas Cuesta Sáez, Antonio Robles and Jose Duato. Improving token coherence by Multicast Coherence Messages. In D ElBaz, J Bourgeois and F Spies (eds.). PROCEEDINGS OF THE 16TH EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING. 2007, 269-273. BibTeX

@conference{ISI:000254266500036,
	author = "Cuesta S{\'a}ez, Blas and Robles, Antonio and Duato, Jose",
	abstract = "Token Coherence is a cache coherence protocol that joins the main advantages of traditional protocols. However, unlike them, Token Coherence does not handle messages in order, which may lead to races, causing some cache misses not to be solved. To assure their completion, an inefficient mechanism named persistent requests is used. Recently we have proposed the priority request mechanism to efficiently handle races. As acknowledgements are not required, a single node can solve several misses for the same memory block at the same time. When solving a lot of misses, the node may become a bottleneck. To avoid it, in this work we propose the Multicast Coherence Message, which allows to simultaneously resolve several misses by using only one response message. It reduces the network traffic and the average response latency, improving significantly the overall performance.",
	booktitle = "PROCEEDINGS OF THE 16TH EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING",
	editor = "ElBaz, D and Bourgeois, J and Spies, F",
	isbn = 9780769530895,
	issn = "1066-6192",
	note = "16th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, Toulouse, FRANCE, FEB 13-15, 2008",
	pages = "269-273",
	series = "Euromicro Workshop on Parallel and Distributed Processing",
	title = "{I}mproving token coherence by {M}ulticast {C}oherence {M}essages",
	year = 2007
}

Aurelio Bermudez, Rafael Casado, Francisco J Quiles and Jose Duato. Handling topology changes in InfiniBand. IEEE Transactions on Parallel and Distributed Systems 18(2):172 - 185, 2007. URL BibTeX

@article{20070710422540,
	author = "Aurelio Bermudez and Rafael Casado and Francisco J. Quiles and Duato, Jose",
	abstract = "InfiniBand is a high-performance switched network. Its topology may change due to devices being turned on/off, hot expansion, link remapping, and component failures. The InfiniBand specification defines a management infrastructure which is responsible for detecting and assimilating any change in the network. When a change occurs, management entities must update switch forwarding tables, in order to maintain the connectivity among end nodes. This implies the acquisition of the current topology and the computation of a new set of routes accordingly. It is desirable that the execution of this process does not affect the performance of the upper-level applications that are using the network. In previous works, we have proposed enhanced implementations for the main tasks involved in the assimilation of a change. Now, we present a detailed performance evaluation of a management mechanism which incorporates all our proposals. © 2007 IEEE.",
	address = "445 Hoes Lane - P.O.Box 1331, Piscataway, NJ 08855-1331, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Switching networks",
	keywords = "Local area networks;Management;Network protocols;Routers;Topology;",
	note = "InfiniBand;Network management;",
	number = 2,
	pages = "172 - 185",
	title = "{H}andling topology changes in {I}nfini{B}and",
	url = "http://dx.doi.org/10.1109/TPDS.2007.26",
	volume = 18,
	year = 2007
}

Eun Jung Kim, Ki Hwan Yum, Chita R Das, Mazin Yousif and Jose Duato. Exploring IBA design space for improved performance. IEEE Transactions on Parallel and Distributed Systems 18(4):498 - 510, 2007. URL BibTeX

@article{20071510536730,
	author = "Eun Jung Kim and Ki Hwan Yum and Chita R. Das and Mazin Yousif and Duato, Jose",
	abstract = "InfiniBand Architecture (IBA) is envisioned to be the default communication fabric for future system area networks (SANs) or clusters. However, IBA design is currently in its infancy since the released specification outlines only higher level functionalities, leaving it open for exploring various design alternatives. In this paper, we investigate four corelated techniques for providing high and predictable performance in IBA. These are: 1) using the Shortest Path First (SPF) algorithm for deterministic packet routing, 2) developing a multipath routing mechanism for minimizing congestion, 3) developing a selective packet dropping scheme to handle deadlock and congestion, and 4) providing multicasting support for customized applications. These designs are implemented in a pipelined, IBA-style switch architecture, and are evaluated using an integrated workload consisting of MPEG-2 video streams, best-effort traffic, and control traffic on a versatile IBA simulation testbed. Simulation results with 15-node and 30-node irregular networks indicate that the SPF routing, multipath routing, packet dropping, and multicasting schemes are quite effective in delivering high and assured performance in clusters. © 2007 IEEE.",
	address = "445 Hoes Lane - P.O.Box 1331, Piscataway, NJ 08855-1331, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Network architecture",
	keywords = "Computer simulation;Congestion control (communication);Multicasting;Network routing;Packet networks;Quality of service;Video streaming;",
	note = "InfiniBand architecture;Packet dropping;System area networks;",
	number = 4,
	pages = "498 - 510",
	title = "{E}xploring {IBA} design space for improved performance",
	url = "http://dx.doi.org/10.1109/TPDS.2007.1010",
	volume = 18,
	year = 2007
}

Alejandro Martinez, Francisco J Alfaro, Jose L Sanchez and Jose Duato. Efficient switches with QoS support for clusters. 2007, IEEE Computer Societ. URL BibTeX

@conference{20073910825291,
	author = "Alejandro Martinez and Francisco J. Alfaro and Jose L. Sanchez and Duato, Jose",
	abstract = "Current interconnect standards providing hardware support for quality of service (QoS) consider up to 16 virtual channels (VCs) for this purpose. However, most implementations do not offer so many VCs because they increase the complexity of the switch and the scheduling delays. We have shown that this number of VCs can be significantly reduced, because it is enough to use two VCs for QoS purposes at each switch port. In this paper, we cover the weaknesses of that proposal and, not only we reduce VCs, but we also improve performance due to the flexibility assigning buffer memory. © 2007 IEEE.",
	address = "Long Beach, CA, United States",
	journal = "Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM",
	key = "Switching systems",
	keywords = "Buffer storage;Communication channels (information theory);Delay circuits;Interconnection networks;Quality of service;Scheduling;Standards;Virtual reality;",
	note = "Buffer memory;Interconnect standards;Virtual channels (VC);",
	pages = "IEEE Computer Societ",
	title = "{E}fficient switches with {Q}o{S} support for clusters",
	url = "http://dx.doi.org/10.1109/IPDPS.2007.370473",
	year = 2007
}

P J Garcia, Jose Flich, Jose Duato, I Johnson, F J Quiles and F Naven. Decongestants for clogged networks. IEEE Potentials 26(6):36 - 41, 2007. BibTeX

@article{9732590,
	author = "P.J. Garcia and Flich, Jose and Duato, Jose and I. Johnson and F.J. Quiles and F. Naven",
	abstract = {Interconnection networks are a key element in a wide variety of systems: massive parallel processors, local and system area networks, clusters of PCs and workstations, and Internet Protocol routers. They are essential to high performance in the form of high-bandwidth communications, with low latency, "quality of service" (guaranteed service levels), efficient switching, and flexibility of network topology, as embodied in Myrinet, InfiniBand, Quadrics, Advanced Switching, and similar interconnects. But, despite all the advances that modern interconnects offer, congestion is a growing problem as "lossless" interconnection networks, those that do not allow data packets to be discarded, come to the fore.},
	address = "USA",
	issn = "0278-6648",
	journal = "IEEE Potentials",
	keywords = "multistage interconnection networks;quality of service;",
	note = "decongestant;clogged network;interconnection network;massive parallel processor;quality of service;network topology;Internet protocol router;",
	number = 6,
	pages = "36 - 41",
	title = "{D}econgestants for clogged networks",
	volume = 26,
	year = 2007
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. An efficient fault-tolerant routing methodology for fat-tree interconnection networks. 2007, 509 - 22. BibTeX

@conference{9683889,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "In large cluster-based machines, fault-tolerance in the interconnection network is an issue of growing importance, since their increasing size rises the probability of failure. The topology used in these machines is usually a fat-tree. This paper proposes a new distributed fault-tolerant routing methodology for fat-trees. It does not require additional network hardware. It is scalable, since the required memory, switch hardware and routing delay do not depend on the network size. The methodology is based on enhancing the interval routing scheme with exclusion intervals. Exclusion intervals are associated to each switch output port, and represent the set of nodes that are unreachable from this port after a failure appears. We propose a mechanism to identify the exclusion intervals that must be updated after detecting a failure, and the values to write on them. Our methodology is able to support a relatively high number of network failures with a low degradation in network performance.",
	address = "Berlin, Germany",
	journal = "Parallel and Distributed Processing and Applications. Proceedings 5th International Symposium, ISPA 2007. (Lecture Notes in Computer Science vol. 4742)",
	keywords = "failure analysis;fault tolerant computing;multiprocessor interconnection networks;network routing;network topology;probability;trees;",
	note = "distributed fault-tolerant routing methodology;fat-tree interconnection networks;large cluster-based machines;failure probability;interval routing scheme;switch output port;",
	pages = "509 - 22",
	title = "{A}n efficient fault-tolerant routing methodology for fat-tree interconnection networks",
	year = 2007
}

Blas Cuesta Sáez, Antonio Robles and Jose Duato. An effective starvation avoidance mechanism to enhance the token coherence protocol. In P DAmbra and MR Guarracino (eds.). 15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing, Proceedings. 2007, 47-54. BibTeX

@conference{ISI:000245942700007,
	author = "Cuesta S{\'a}ez, Blas and Robles, Antonio and Duato, Jose",
	abstract = "Shared-memory multiprocessors are becoming to be formed by an increasingly larger number of nodes. In these systems, implementing cache coherence is a key issue. Token Coherence is a low latency cache coherence protocol that avoids indirection for cache-to-cache misses and which does not require a totally-ordered interconnect. When races are rare, the protocol performs well thanks to the performance policy. Unfortunately, some medium/large systems and some applications that often access the same data simultaneously make races more common. As a result, the protocol does not perform as well as it could because it uses the persistent request mechanism to prevent starvation. This mechanism is too slow and inflexible because it overrides the performance policy. In consequence, the protocol slows down the system and does not take advantage of the flexibility and speed of the common case. We propose a new mechanism, namely priority requests, which replaces the persistent request one. Our mechanism solves races, while still respecting the performance policy, simply by ordering and giving a higher priority to requests suffering from starvation. Thus, our mechanism handles the tokens more efficiently and reduces the network traffic.",
	booktitle = "15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing, Proceedings",
	editor = "DAmbra, P and Guarracino, MR",
	isbn = 9780769527840,
	note = "15th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, Naples, ITALY, FEB 07-09, 2007",
	pages = "47-54",
	title = "{A}n effective starvation avoidance mechanism to enhance the token coherence protocol",
	year = 2007
}

Alejandro Martinez, Francisco J Alfaro, Jose L Sanchez, Francisco J Quiles and Jose Duato. A new cost-effective technique for QoS support in clusters. IEEE Transactions on Parallel and Distributed Systems 18(12):1714 - 1726, 2007. URL BibTeX

@article{20074810948057,
	author = "Alejandro Martinez and Francisco J. Alfaro and Jose L. Sanchez and Francisco J. Quiles and Duato, Jose",
	abstract = "Virtual channels (VCs) are a popular solution for the provision of quality of service (QoS). Current interconnect standards propose 16 or even more VCs for this purpose. However, most implementations do not offer so many VCs because it is too expensive in terms of silicon area. Therefore, a reduction of the number of VCs necessary to support QoS can be very helpful in the switch design and implementation. In this paper, we show that this number of VCs can be reduced if the system is considered as a whole rather than each element being taken separately. The scheduling decisions made at network interfaces can be easily reused at switches without significantly altering the global behavior. In this way, we obtain a noticeable reduction of silicon area, component count, and, thus, power consumption, and we can provide similar performance to a more complex architecture. We also show that this is a scalable technique, suitable for the foreseen demands of traffic. © 2007 IEEE.",
	address = "445 Hoes Lane - P.O.Box 1331, Piscataway, NJ 08855-1331, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Interconnection networks",
	keywords = "Computer architecture;Cost effectiveness;Interfaces;Quality of service;Telecommunication traffic;",
	note = "Scheduling decisions;Switch design;Virtual channels;",
	number = 12,
	pages = "1714 - 1726",
	title = "{A} new cost-effective technique for {Q}o{S} support in clusters",
	url = "http://dx.doi.org/10.1109/TPDS.2007.1108",
	volume = 18,
	year = 2007
}

Ricardo Fernandez-Pascual, Jose M Garcia, Manuel E Acacio and Jose Duato. A low overhead fault tolerant coherence protocol for CMP architectures. 2007, 157 - 168. URL BibTeX

@conference{20073210756804,
	author = "Ricardo Fernandez-Pascual and Jose M. Garcia and Manuel E. Acacio and Duato, Jose",
	abstract = "It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On the other hand, chip-multiprocessors (CMP) that integrate several processor cores in a single chip are nowadays the best alternative to more efficient use of the increasing number of transistors that can be placed in a single die. Hence, it is necessary to design new techniques to deal with these faults to be able to build sufficiently reliable Chip Multi-processors (CMPs). In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using GEMS full system simulator, we compare our proposal against a similar protocol without fault tolerance (TOKENCMP). We show that in absence of failures our proposal does not introduce overhead in terms of increased execution time over TOKENCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time more than 15%. © 2007 IEEE.",
	address = "Scottsdale, AZ, United States",
	issn = 15300897,
	journal = "Proceedings - International Symposium on High-Performance Computer Architecture",
	key = "Network protocols",
	keywords = "Computer architecture;Computer simulation;Fault tolerant computer systems;Microprocessor chips;Multiprocessing systems;Program processors;",
	note = "Chip multiprocessors (CMP);CMP architectures;Fault tolerant coherence protocols;Processor cores;",
	pages = "157 - 168",
	title = "{A} low overhead fault tolerant coherence protocol for {CMP} architectures",
	url = "http://dx.doi.org/10.1109/HPCA.2007.346194",
	year = 2007
}

Pedro Morillo, Silvia Rueda, Juan M Orduna and Jose Duato. A latency-aware partitioning method for distributed virtual environment systems. IEEE Transactions on Parallel and Distributed Systems 18(9):1215 - 1226, 2007. URL BibTeX

@article{20073610796895,
	author = "Pedro Morillo and Silvia Rueda and Juan M. Orduna and Duato, Jose",
	abstract = "Distributed Virtual Environment systems allow multiple users, working on different client computers interconnected through different networks, to interact in a shared virtual world. In these systems, latency is crucial for providing an acceptable quality of service, since it determines how fast client computers are reported about changes in the shared virtual scene produced by other client computers. This paper presents, in a unified manner, a partitioning approach for providing a latency below a threshold to the maximum number of users as possible in Distributed Virtual Environment systems. This partitioning approach searches the assignment of avatars that represents the best trade-off among system latency, system throughput, and partitioning efficiency when solving the partitioning problem. Evaluation results show that the proposed approach not only maximizes system throughput, but it also allows the system to satisfy, if possible, any specific latency requirement needed for providing quality of service. This improvement is achieved without decreasing neither image resolution nor quality of animation, and it can be used together with other techniques already proposed. Therefore, it can contribute to provide quality of service in Distributed Virtual Environments. © 2007 IEEE.",
	address = "445 Hoes Lane - P.O.Box 1331, Piscataway, NJ 08855-1331, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Client server computer systems",
	keywords = "Animation;Image quality;Image resolution;Large scale systems;Quality of service;Virtual reality;",
	note = "Distributed virtual environment system;Partitioning method;",
	number = 9,
	pages = "1215 - 1226",
	title = "{A} latency-aware partitioning method for distributed virtual environment systems",
	url = "http://dx.doi.org/10.1109/TPDS.2007.1055",
	volume = 18,
	year = 2007
}

S Rueda, P Morillo, J M Orduna and Jose Duato. A genetic approach for adding QoS to distributed virtual environments. Computer Communications 30(4):731 - 739, 2007. URL BibTeX

@article{20070610409865,
	author = "S. Rueda and P. Morillo and J.M. Orduna and Duato, Jose",
	abstract = "Distributed virtual environment (DVE) systems have been designed last years as a set of distributed servers. These systems allow a large number of remote users to share a single 3D virtual scene. In order to provide quality of service in a DVE system, clients should be properly assigned to servers taking into account system throughput and system latency. The latter one is composed of both network and computational delays. This highly complex problem is known as the quality of service (QoS) problem. In this paper, we study the implementation of a genetic algorithm (GA) for solving the QoS problem in DVE systems. Performance evaluation results show that, due to its ability of both finding good search paths and keeping diversity, this nature inspired technique can provide significantly better solutions than other heuristic methods while requiring shorter execution times. Therefore, the proposed implementation of GA search method can actually improve the QoS offered by DVE systems. © 2006 Elsevier B.V. All rights reserved.",
	address = "P.O. Box 211, Amsterdam, 1000 AE, Netherlands",
	issn = 01403664,
	journal = "Computer Communications",
	key = "Genetic algorithms",
	keywords = "Computational complexity;Distributed computer systems;Quality of service;Servers;Three dimensional computer graphics;Virtual reality;",
	note = "Distributed virtual environments;Performance evaluation;Search methods;",
	number = 4,
	pages = "731 - 739",
	title = "{A} genetic approach for adding {Q}o{S} to distributed virtual environments",
	url = "http://dx.doi.org/10.1016/j.comcom.2006.08.015",
	volume = 30,
	year = 2007
}

Francisco J Alfaro, Jose L Sanchez, Manuel Menduina and Jose Duato. {A. IEEE Transactions on Computers (8):1024 - 1039. BibTeX

@article{20091912073795,
	author = "Francisco J. Alfaro and Jose L. Sanchez and Manuel Menduina and Duato, Jose",
	abstract = "The InfiniBand architecture (IBA) is an industry-standard architecture for server I/O and interprocessor communication. IBA enables quality-of-service (QoS) support with certain mechanisms. These mechanisms are basically the service levels, the virtual lanes, and the table-based arbitration of those virtual lanes. In previous papers, we have examined these mechanisms and described how we can apply them to the requirements requested by the applications. We have also tested our proposals, showing that the applications achieve the level of QoS requested. In this paper, we present a formal model for the techniques previously proposed. According to this model, each application needs a sequence of entries in the IBA arbitration tables based on its requirements. These requirements are related to the mean bandwidth needed and the maximum latency tolerated by the application. Specifically, each request requires a number of entries with a maximum separation between any consecutive pair. In order to manage the requests, we propose certain algorithms and we prove some propositions and theorems, showing that our method achieves good behavior. © 2007 IEEE.",
	address = "445 Hoes Lane - P.O.Box 1331, Piscataway, NJ 08855-1331, United States",
	issn = 00189340,
	journal = "IEEE Transactions on Computers",
	key = "Queueing networks",
	keywords = "Applications;",
	note = "Formal model;InfiniBand;InfiniBand architectures;Inter-processor communications;QoS;Quality-of-service;Service levels;Standard architectures;",
	number = 8,
	pages = "1024 - 1039",
	title = "{A"
}

P J Garcia, F J Quiles, Jose Flich, Jose Duato, I Johnson and F Naven. Efficient, Scalable Congestion Management for Interconnection Networks. Micro, IEEE 26(5):52 - 66, 2006. DOI BibTeX

@article{1709823,
	author = "P.J. Garcia and F.J. Quiles and Flich, Jose and Duato, Jose and I. Johnson and F. Naven",
	abstract = "Compared to the overdimensioned designs of the past, current interconnection networks operate closer to the point of saturation and run a higher risk of congestion. Among proposed strategies for congestion management, only the regional explicit congestion notification (RECN) mechanism achieves both the required efficiency and the scalability that emerging systems demand",
	doi = "10.1109/MM.2006.88",
	issn = "0272-1732",
	journal = "Micro, IEEE",
	keywords = "RECN mechanism;interconnection networks;regional explicit congestion notification;scalable congestion management;multiprocessor interconnection networks;",
	month = "sept.-oct.",
	number = 5,
	pages = "52 - 66",
	title = "{E}fficient, {S}calable {C}ongestion {M}anagement for {I}nterconnection {N}etworks",
	volume = 26,
	year = 2006
}

Maria E Gomez, N A Nordbotten, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, T Skeie and O Lysne. A routing methodology for achieving fault tolerance in direct networks. Computers, IEEE Transactions on 55(4):400 - 415, April 2006. URL, DOI BibTeX

@article{1608003,
author = "Gomez, Maria E. and N.A. Nordbotten and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and T. Skeie and O. Lysne",
abstract = "Massively parallel computing systems are being built with thousands of nodes. The interconnection network plays a key role for the performance of such systems. However, the high number of components significantly increases the probability of failure. Additionally, failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism to keep the system running, even in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. In order to avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. However, there are scenarios where the faults cannot be avoided solely by using an intermediate node. Thus, we also provide some extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis. We also propose the use of more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance.",
doi = "10.1109/TC.2006.46",
issn = "0018-9340",
journal = "Computers, IEEE Transactions on",
keywords = "adaptive routing; checkpoint-restart mechanism; direct networks; fault-tolerant routing methodology; interconnection network; parallel computing system; fault tolerant computing; multiprocessor interconnection networks; network routing; parallel processi",
month = "april",
number = 4,
pages = "400 - 415",
title = "{A} routing methodology for achieving fault tolerance in direct networks",
url = "http://dx.doi.org/10.1109/TC.2006.46",
volume = 55,
year = 2006
}

Marina Alonso, Salvador Coll, Jose Maria Martínez, Vicente Santonja, Pedro Lopez and Jose Duato. Dynamic power saving in fat-tree interconnection networks using on/off links. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International. April 2006, 8 pp.. URL, DOI BibTeX

@conference{1639599,
	author = "Alonso, Marina and Coll, Salvador and Mart{\'i}nez, Jose Maria and Santonja, Vicente and Lopez, Pedro and Duato, Jose",
	abstract = "Current trends in high-performance parallel computers show that fat-tree interconnection networks are one of the most popular topologies. The particular characteristics of this topology, that provide multiple alternative paths for each source/destination pair, make it an excellent candidate for applying power consumption reduction techniques. Such techniques are being increasingly applied in computer systems and the interconnection network is not an exception, since its contribution to the system power budget is not negligible. In this paper, we present a mechanism that dynamically switches on and off network links as a function of traffic. The mechanism is designed to guarantee network connectivity, according to the underlying routing algorithm. In this way, the default routing algorithm can be used regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that significant network power consumption reductions can be obtained at no cost. Latency remains the same although the number of operating network links is dynamically adjusted.",
	booktitle = "Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International",
	doi = "10.1109/IPDPS.2006.1639599",
	isbn = "0-7695-0990-8",
	keywords = "dynamic power saving; fat-tree interconnection networks; high-performance parallel computers; network power consumption reduction; on-off links; routing algorithm; energy conservation; multiprocessor interconnection networks; parallel processing;",
	month = "april",
	pages = "8 pp.",
	title = "{D}ynamic power saving in fat-tree interconnection networks using on/off links",
	url = "http://dx.doi.org/10.1109/IPDPS.2006.1639599",
	year = 2006
}

, Jose Flich, Jose Duato, S -A Reinemo and T Skeie. Segment-based routing: an efficient fault-tolerant routing algorithm for meshes and tori. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International. April 2006, 10 pp.. URL, DOI BibTeX

@conference{1639341,
author = ", and Flich, Jose and Duato, Jose and S.-A. Reinemo and T. Skeie",
abstract = "Computers get faster every year, but the demand for computing resources seems to grow at an even faster rate. Depending on the problem domain, this demand for more power can be satisfied by either, massively parallel computers, or clusters of computers. Common for both approaches is the dependence on high performance interconnect networks such as Myrinet, Infiniband, or 10 Gigabit Ethernet. While high throughput and low latency are key features of interconnection networks, the issue of fault-tolerance is now becoming increasingly important. As the number of network components grows so does the probability for failure, thus it becomes important to also consider the fault-tolerance mechanism of interconnection networks. The main challenge then lies in combining performance and fault-tolerance, while still keeping cost and complexity low. This paper proposes a new deterministic routing methodology for tori and meshes, which achieves high performance without the use of virtual channels. Furthermore, it is topology agnostic in nature, meaning it can handle any topology derived from any combination of faults when combined with static reconfiguration. The algorithm, referred to as segment-based routing (SR), works by partitioning a topology into subnets, and subnets into segments. This allows us to place bidirectional turn restrictions locally within a segment. As segments are independent, we gain the freedom to place turn restrictions within a segment independently from other segments. This results in a larger degree of freedom when placing turn restrictions compared to other routing strategies. In this paper a way to compute segment-based routing tables is presented and applied to meshes and tori. Evaluation results show that SR increases performance by a factor of 1.8 over FX and up*/down* routing",
booktitle = "Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International",
doi = "10.1109/IPDPS.2006.1639341",
keywords = "deterministic routing;fault-tolerant routing;interconnection networks;meshes;segment-based routing;tori;fault tolerant computing;multiprocessor interconnection networks;telecommunication network routing;telecommunication network topology;",
month = "april",
pages = "10 pp.",
title = "{S}egment-based routing: an efficient fault-tolerant routing algorithm for meshes and tori",
url = "http://dx.doi.org/10.1109/IPDPS.2006.1639341",
year = 2006
}

Marina Alonso, Salvador Coll, Juan Miguel Martínez, Vicente Santonja, Pedro Lopez and Jose Duato. Dynamic power saving in fat-tree interconnection networks using on/off links. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International. 2006, 8 pp. -. URL, DOI BibTeX

@conference{8978456,
	author = "Alonso, Marina and Coll, Salvador and Mart{\'i}nez, Juan Miguel and Santonja, Vicente and Lopez, Pedro and Duato, Jose",
	abstract = "Current trends in high-performance parallel computers show that fat-tree interconnection networks are one of the most popular topologies. The particular characteristics of this topology, that provide multiple alternative paths for each source/destination pair, make it an excellent candidate for applying power consumption reduction techniques. Such techniques are being increasingly applied in computer systems and the interconnection network is not an exception, since its contribution to the system power budget is not negligible. In this paper, we present a mechanism that dynamically switches on and off network links as a function of traffic. The mechanism is designed to guarantee network connectivity, according to the underlying routing algorithm. In this way, the default routing algorithm can be used regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that significant network power consumption reductions can be obtained at no cost. Latency remains the same although the number of operating network links is dynamically adjusted",
	booktitle = "Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International",
	doi = "10.1109/IPDPS.2006.1639599",
	isbn = "1-4244-0054-6",
	journal = "Proceedings. 20th International Parallel and Distributed Processing Symposium (IEEE Cat. No.06TH8860)",
	keywords = "energy conservation;multiprocessor interconnection networks;parallel processing;trees;",
	month = "Apr.",
	note = "dynamic power saving;fat-tree interconnection networks;on/off links;high-performance parallel computers;routing algorithm;network power consumption reduction;",
	pages = "8 pp. -",
	publisher = "IEEE Computer Society",
	title = "{D}ynamic power saving in fat-tree interconnection networks using on/off links",
	url = "http://dx.doi.org/10.1109/IPDPS.2006.1639599",
	year = 2006
}

Gaspar Mora, Jose Flich, Jose Duato, Pedro Lopez, Elvira Baydal and O Lysne. Towards an efficient switch architecture for high-radix switches. 2006, 11 - 20. URL, DOI BibTeX

@conference{10091275,
	author = "Mora, Gaspar and Flich, Jose and Duato, Jose and Lopez, Pedro and Baydal, Elvira and O. Lysne",
	abstract = "The interconnection network plays a key role in the overall performance achieved by high performance computing systems, also contributing an increasing fraction of its cost and power consumption. Current trends in interconnection network technology suggest that high-radix switches will be preferred as networks will become smaller (in terms of switch count) with the associated savings in packet latency, cost, and power consumption. Unfortunately, current switch architectures have scalability problems that prevent them from being effective when implemented with a high number of ports. In this paper, an efficient and cost-effective architecture for high-radix switches is proposed. The architecture, referred to as partitioned crossbar input queued (PCIQ), relies on three key components: a partitioned crossbar organization that allows the use of simple arbiters and crossbars, a packet-based arbiter, and a mechanism to eliminate the switch-level HOL blocking. Under uniform traffic, maximum switch efficiency is achieved. Furthermore, switch-level HOL blocking is completely eliminated under hot-spot traffic, again delivering maximum throughput. Additionally, PCIQ inherently implements an efficient congestion management technique that eliminates all the network-wide HOL blocking. On the contrary, the previously proposed architectures either show poor performance or they require significantly higher costs than PCIQ (in both components and complexity).",
	address = "Piscataway, NJ, USA",
	doi = "10.1109/ANCS.2006.4579519",
	journal = "ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS 2006)",
	keywords = "multistage interconnection networks;",
	note = "high-radix switch architecture;interconnection network;power consumption;partitioned crossbar input queued;switch-level head-of-line block elimination;congestion management technique;",
	pages = "11 - 20",
	title = "{T}owards an efficient switch architecture for high-radix switches",
	url = "http://dx.doi.org/10.1109/ANCS.2006.4579519",
	year = 2006
}

@conference{8969869,
author = ", and Flich, Jose and Duato, Jose and S.-A. Reinemo and T. Skeie",
abstract = "Computers get faster every year, but the demand for computing resources seems to grow at an even faster rate. Depending on the problem domain, this demand for more power can be satisfied by either, massively parallel computers, or clusters of computers. Common for both approaches is the dependence on high performance interconnect networks such as Myrinet, Infiniband, or 10 Gigabit Ethernet. While high throughput and low latency are key features of interconnection networks, the issue of fault-tolerance is now becoming increasingly important. As the number of network components grows so does the probability for failure, thus it becomes important to also consider the fault-tolerance mechanism of interconnection networks. The main challenge then lies in combining performance and fault-tolerance, while still keeping cost and complexity low. This paper proposes a new deterministic routing methodology for tori and meshes, which achieves high performance without the use of virtual channels. Furthermore, it is topology agnostic in nature, meaning it can handle any topology derived from any combination of faults when combined with static reconfiguration. The algorithm, referred to as segment-based routing (SR), works by partitioning a topology into subnets, and subnets into segments. This allows us to place bidirectional turn restrictions locally within a segment. As segments are independent, we gain the freedom to place turn restrictions within a segment independently from other segments. This results in a larger degree of freedom when placing turn restrictions compared to other routing strategies. In this paper a way to compute segment-based routing tables is presented and applied to meshes and tori. Evaluation results show that SR increases performance by a factor of 1.8 over FX and up*/down* routing",
address = "Piscataway, NJ, USA",
booktitle = "Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International",
doi = "10.1109/IPDPS.2006.1639341",
journal = "Proceedings. 20th International Parallel and Distributed Processing Symposium (IEEE Cat. No.06TH8860)",
keywords = "fault tolerant computing;multiprocessor interconnection networks;telecommunication network routing;telecommunication network topology;",
note = "segment-based routing;fault-tolerant routing;meshes;tori;interconnection networks;deterministic routing;",
pages = "10 pp. -",
title = "{S}egment-based routing: an efficient fault-tolerant routing algorithm for meshes and tori",
url = "http://dx.doi.org/10.1109/IPDPS.2006.1639341",
year = 2006
}

Francisco Gilabert, Maria E Gomez, Pedro Lopez and Jose Duato. On the influence of the selection function on the performance of fat-trees. 2006, 864 - 73. BibTeX

@conference{9112992,
	author = "Gilabert, Francisco and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Fat-tree topology has become very popular among switch manufacturers. Routing in fat-trees is composed of two phases, an adaptive upwards phase, and a deterministic downwards phase. The unique downwards path to the destination depends on the switch that has been reached in the upwards phase. As adaptive routing is used in the ascending phase, several output ports are possible at each switch and the final choice depends on the selection function. The impact of the selection function on performance has been previously studied for direct networks and has not resulted to be very important. In fat-trees, the decisions made in the upwards phase by the selection function can be critical, since it determines the switch reached in the upwards phase, and therefore the unique downwards path to the destination. In this paper, we analyze the effect of the selection function on fat-trees. Several selection functions are defined, compared and evaluated. The evaluation shows that selection function has a great impact on fat-trees",
	address = "Berlin, Germany",
	journal = "Euro-Par 2006 Parallel Processing. 12th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol. 4128)",
	keywords = "telecommunication network routing;telecommunication network topology;telecommunication switching;trees;",
	note = "selection function;fat-trees;adaptive routing;interconnection networks;",
	pages = "864 - 73",
	title = "{O}n the influence of the selection function on the performance of fat-trees",
	year = 2006
}

Maria E Gomez, Pedro Lopez and Jose Duato. FIR: An efficient routing strategy for tori and meshes. Journal of Parallel and Distributed Computing 66(7):907 - 21, 2006. URL, DOI BibTeX

@article{8981461,
	author = "Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Recent massively parallel computers are based on clusters of PCs. These machines use one of the recently proposed standard interconnects. These interconnects either use source routing or distributed routing based on forwarding tables. While source routers are simpler, distributed routers provides more flexibility allowing the network to achieve a higher performance. Distributed routing can be implemented by a fixed hardware specific to a routing function on a given topology or by using forwarding tables. The main problem of this approach is the lack of scalability of forwarding tables. In this paper, we propose a distributed routing strategy for commercial switches, flexible interval routing, that is scalable, both in memory and routing time because it is not based on tables. At the same time, the strategy is easy to reconfigure, being able to implement the most commonly used routing algorithms in the most widely used regular topologies. [All rights reserved Elsevier]",
	address = "USA",
	doi = "10.1016/j.jpdc.2005.12.012",
	issn = "0743-7315",
	journal = "Journal of Parallel and Distributed Computing",
	keywords = "multiprocessor interconnection networks;telecommunication network routing;workstation clusters;",
	note = "FIR;flexible interval routing;network routing;PC clusters;network topology;",
	number = 7,
	pages = "907 - 21",
	title = "{FIR}: {A}n efficient routing strategy for tori and meshes",
	url = "http://dx.doi.org/10.1016/j.jpdc.2005.12.012",
	volume = 66,
	year = 2006
}

Teresa Nachiondo, Jose Flich and Jose Duato. Destination-based HoL blocking elimination. In Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on 1. 2006, 10 pp. -. URL, DOI BibTeX

@conference{9077844,
	author = "Nachiondo, Teresa and Flich, Jose and Duato, Jose",
	abstract = "In future interconnection networks, congestion management is likely to become a critical issue owing to increasing power consumption and cost concerns. As congested packets introduce head-of-line (HoL) blocking to the rest of packets, congestion spreads quickly. The best-known solution to HoL blocking, virtual output queues (VOQs), is not scalable at all or too costly when implemented in large networks. In previous works, we proposed an efficient and cost-effective solution, referred to as destination-based buffer management (DBBM). DBBM groups destinations into different sets, and packets addressed to destinations in the same set are mapped to the same queue. DBBM eliminates most of the HoL blocking (among packets addressed to different sets). It achieves very good results in terms of scalability, throughput, and robustness. However, depending on the distribution of packet destinations, it may introduce an uncertain degree of unfairness among packets mapped on the same queue. In order to overcome this problem, we propose the dynamic DBBM mechanism (DDBBM). DDBBM dynamically eliminates completely the HoL blocking. Performance results show that DDBBM keeps (and in some cases improves) the good results achieved by DBBM in terms of throughput and scalability. Moreover, DDBBM solves the unfairness introduced by DBBM. As an example of applicability, in this paper we show that DDBBM can be applied to InfiniBand with no hardware modification",
	booktitle = "Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on",
	doi = "10.1109/ICPADS.2006.34",
	isbn = "0-7695-2612-8",
	issn = "1521-9097",
	journal = "12th International Conference on Parallel and Distributed Systems",
	keywords = "buffer storage;computer network management;packet switching;queueing theory;telecommunication congestion control;",
	note = "destination-based HoL blocking elimination;interconnection network;network congestion management;head-of-line blocking;virtual output queues;dynamic destination-based buffer management;packet destination distribution;InfiniBand;",
	pages = "10 pp. -",
	title = "{D}estination-based {H}o{L} blocking elimination",
	url = "http://dx.doi.org/10.1109/ICPADS.2006.34",
	volume = 1,
	year = 2006
}

Maria E Gomez, N A Nordbotten, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, T Skeie and O Lysne. A routing methodology for achieving fault tolerance in direct networks. IEEE Transactions on Computers 55(4):400 - 15, 2006. URL, DOI BibTeX

@article{8935111,
author = "Gomez, Maria E. and N.A. Nordbotten and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and T. Skeie and O. Lysne",
abstract = "Massively parallel computing systems are being built with thousands of nodes. The interconnection network plays a key role for the performance of such systems. However, the high number of components significantly increases the probability of failure. Additionally, failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism to keep the system running, even in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. In order to avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. However, there are scenarios where the faults cannot be avoided solely by using an intermediate node. Thus, we also provide some extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis. We also propose the use of more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance",
address = "USA",
doi = "10.1109/TC.2006.46",
issn = "0018-9340",
journal = "IEEE Transactions on Computers",
keywords = "fault tolerant computing;multiprocessor interconnection networks;network routing;parallel processing;",
note = "direct networks;parallel computing system;interconnection network;fault-tolerant routing methodology;adaptive routing;checkpoint-restart mechanism;",
number = 4,
pages = "400 - 15",
title = "{A} routing methodology for achieving fault tolerance in direct networks",
url = "http://dx.doi.org/10.1109/TC.2006.46",
volume = 55,
year = 2006
}

J M Montañana, Jose Flich, Antonio Robles and Jose Duato. Reachability-based fault-tolerant routing. In Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on 1. 2006, 10 pp. URL, DOI BibTeX

@conference{1655699,
	author = "Monta{\~n}ana, J. M. and Flich, Jose and Robles, Antonio and Duato, Jose",
	abstract = "Clusters of PCs are being used as a cost-effective alternative to large parallel computers. In most of them it is critical to keep the system running even in the presence of faults. As the number of nodes increases in these systems, the interconnection network grows accordingly. Along with the increase in components the probability of faults increases dramatically, and thus, fault-tolerance in the system, in general, and in the interconnection network, in particular, plays a key role. An interesting approach to provide fault-tolerance consists of migrating on the fly the paths affected by the failure to new fault-free paths. In this paper, we propose a simple and effective fault-tolerant routing methodology, referred to as reachability based fault tolerant routing (RFTR), that can be applied to any topology. RFTR builds new alternative paths by joining subpaths extracted from the set of already computed paths, thus being time-efficient. In order to avoid deadlocks, RFTR performs, if required, a virtual channel transition on the subpath union. As an example of applicability, in this paper we apply RFTR to InfiniBand. Evaluation results on tori show that RFTR exhibits a low computation cost and does not degrade performance significantly",
	booktitle = "Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on",
	doi = "10.1109/ICPADS.2006.89",
	isbn = "0-7695-2612-8",
	issn = "1521-9097",
	keywords = "PC clusters;interconnection network;parallel computers;reachability-based fault-tolerant routing;virtual channel transition;fault tolerant computing;reachability analysis;telecommunication network routing;workstation clusters;",
	pages = "10 pp.",
	title = "{R}eachability-based fault-tolerant routing",
	url = "http://dx.doi.org/10.1109/ICPADS.2006.89",
	volume = 1,
	year = 2006
}

A Martinez, P J Garcia, F J Alfaro, J L Sanchez, Jose Flich, F J Quiles and Jose Duato. Towards a cost-effective interconnection network architecture with QoS and congestion management support. 2006, 884 - 95. BibTeX

@conference{9112994,
	author = "A. Martinez and P.J. Garcia and F.J. Alfaro and J.L. Sanchez and Flich, Jose and F.J. Quiles and Duato, Jose",
	abstract = "Congestion management and quality of service (QoS) provision are two important issues in current network design. The most popular techniques proposed for both issues require the existence of specific resources in the interconnection network, usually a high number of separate queues at switch ports. Therefore, the implementation of these techniques is expensive or even infeasible. However, two novel, efficient, and cost-effective techniques for provision of QoS and for congestion management have been proposed recently. In this paper, we combine those techniques to build a single interconnection network architecture, providing an excellent performance while reducing the number of required resources",
	address = "Berlin, Germany",
	journal = "Euro-Par 2006 Parallel Processing. 12th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol. 4128)",
	keywords = "interconnections;quality of service;telecommunication congestion control;",
	note = "cost-effective interconnection network;quality of service;congestion management;switch port;",
	pages = "884 - 95",
	title = "{T}owards a cost-effective interconnection network architecture with {Q}o{S} and congestion management support",
	year = 2006
}

A Martinez, F J Alfaro, J L Sanchez and Jose Duato. Scalable low-cost QoS support for single-chip switches. 2006, 8 pp. -. BibTeX

@conference{9077868,
	author = "A. Martinez and F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "Virtual channels (VCs) are a popular solution for the provision of quality of service (QoS). Current interconnect standards propose 16 or even more VCs for this purpose. However, most commercial implementations do not offer so many VCs because it is too expensive in terms of silicon area. Therefore, a reduction of the number of VCs necessary to support QoS can be very helpful in the switch design and implementation. We have shown that this number of VCs can be reduced if the system is considered as a whole rather than each element being taken separately. Some of the scheduling decisions made at network interfaces can be easily reused at switches without significantly altering the global behavior. In this paper, our aim is to explore the scalability of the technique, considering the restrictions of the final chip implementation",
	address = "Los Alamitos, CA, USA",
	journal = "12th International Conference on Parallel and Distributed Systems",
	keywords = "network interfaces;performance evaluation;quality of service;scheduling;telecommunication switching;workstation clusters;",
	note = "QoS support;single-chip switches;virtual channels;interconnect standards;switch design;scheduling decisions;network interfaces;interconnection networks;storage area network;performance evaluation;",
	pages = "8 pp. -",
	title = "{S}calable low-cost {Q}o{S} support for single-chip switches",
	year = 2006
}

A Martinez, F J Alfaro, J L Sanchez and Jose Duato. Scalable low-cost QoS support for single-chip switches. 2006, 439 - 446. URL BibTeX

@conference{20071510539214,
	author = "A. Martinez and F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "Virtual channels (VCs) are a popular solution for the provision of quality of service (QoS). Current interconnect standards propose 16 or even more VCs for this purpose. However, most commercial implementations do not offer so many VCs because it is too expensive in terms of silicon area. Therefore, a reduction of the number of VCs necessary to support QoS can be very helpful in the switch design and implementation. We have shown that this number of VCs can be reduced if the system is considered as a whole rather than each element being taken separately. Some of the scheduling decisions made at network interfaces can be easily reused at switches without significantly altering the global behavior. In this paper, our aim is to explore the scalability of the technique, considering the restrictions of the final chip implementation. {\copyright} 2006 IEEE.",
	address = "Minneapolis, MN, United States",
	issn = 15219097,
	journal = "Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS",
	key = "Semiconductor switches",
	keywords = "Clustering algorithms;Interconnection networks;Microprocessor chips;Quality of service;Scheduling algorithms;Silicon;",
	note = "Performance evaluation;Switch design;Virtual channels (VCs);",
	pages = "439 - 446",
	title = "{S}calable low-cost {Q}o{S} support for single-chip switches",
	url = "http://dx.doi.org/10.1109/ICPADS.2006.110",
	volume = 1,
	year = 2006
}

P Morillo, W Moncho, J M Orduña and Jose Duato. Providing full awareness to distributed virtual environments based on peer-to-peer architectures. 2006, 336 - 47. BibTeX

@conference{9027149,
author = "P. Morillo and W. Moncho and J.M. Ordu{\~n}a and Duato, Jose",
abstract = "Large scale distributed virtual environments (DVEs) have become a major trend in distributed applications, mainly due to the enormous popularity of multiplayer online games in the entertainment industry. Since architectures based on networked servers seem to be not scalable enough to support massively multiplayer applications, peer-to-peer (P2P) architectures have been proposed as an efficient and truly scalable solution for this kind of systems. However, the main challenge of P2P architectures consists of providing each avatar with updated information about which other avatars are its neighbors. We have denoted this problem as the awareness problem. Although some proposals have been made, none of them provide total awareness to avatars under any situation. This paper presents a new awareness method based on unicast communication that is capable of providing awareness to 100% of avatars, regardless of both their location and their movement pattern in the virtual world. Therefore, it allows large scale DVEs based on P2P architectures to properly scale with the number of users while fully providing awareness to all of them",
address = "Berlin, Germany",
journal = "Advances in Computer Graphics. 24th Computer Graphics International Conference, CGI 2006. Proceedings (Lecture Notes in Computer Science Vol.4035)",
keywords = "avatars;peer-to-peer computing;",
note = "peer-to-peer architectures;large scale distributed virtual environments;P2P architectures;avatar;unicast communication;",
pages = "336 - 47",
title = "{P}roviding full awareness to distributed virtual environments based on peer-to-peer architectures",
year = 2006
}

P Morillo, W Moncho, J M Orduna and Jose Duato. Providing full awareness to distributed virtual environments based on peer-to-peer architectures. 2006, 336 - 347. BibTeX

@conference{20063010029819,
	author = "P. Morillo and W. Moncho and J.M. Orduna and Duato, Jose",
	abstract = "In recent years, large scale distributed virtual environments (DVEs) have become a major trend in distributed applications, mainly due to the enormous popularity of multiplayer online games in the entertainment industry. Since architectures based on networked servers seem to be not scalable enough to support massively multiplayer applications, peer-to-peer (P2P) architectures have been proposed as an efficient and truly scalable solution for this kind of systems. However, the main challenge of P2P architectures consists of providing each avatar with updated information about which other avatars are its neighbors. We have denoted this problem as the awareness problem. Although some proposals have been made, none of them provide total awareness to avatars under any situation. This paper presents a new awareness method based on unicast communication that is capable of providing awareness to 100% of avatars, regardless of both their location and their movement pattern in the virtual world. Therefore, it allows large scale DVEs based on P2P architectures to properly scale with the number of users while fully providing awareness to all of them. {\copyright} Springer-Verlag Berlin Heidelberg 2006.",
	address = "Hangzhou, China",
	issn = 03029743,
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Virtual reality",
	keywords = "Communication systems;Computer architecture;Computer graphics;Computer science;Information technology;Online systems;",
	note = "Distributed virtual environments (DVE);Multiplayer online games;Peer-to-peer (P2P) architectures;",
	pages = "336 - 347",
	title = "{P}roviding full awareness to distributed virtual environments based on peer-to-peer architectures",
	volume = "4035 LNCS",
	year = 2006
}

A Martinez, G Apostolopoulos, F J Alfaro, J L Sanchez and Jose Duato. QoS support for video transmission in high-speed interconnects. 2006, 631 - 641. BibTeX

@conference{20064410206368,
	author = "A. Martinez and G. Apostolopoulos and F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "Multimedia traffic presents some special requirements that are unattainable with a best-effort service. Current interconnect standards provide mechanisms to overcome the limitations of the best-effort model, but they do not suffice to satisfy the strict requirements of video transmissions. This problem has been extensively addressed by the general networking community. Several solutions have arisen, but they are too complex to be applied to high-speed interconnects. In this paper, we propose a network architecture that is at the same time compatible with the requirements of high-speed interconnects and provides video traffic with the QoS it demands. {\copyright} Springer-Verlag Berlin Heidelberg 2006.",
	address = "Munich, Germany",
	issn = 03029743,
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Video signal processing",
	keywords = "Communication channels (information theory);Multimedia systems;Optical interconnects;Quality of service;Scheduling;Telecommunication traffic;",
	note = "Clusters;Switch design;Video transmissions;Virtual channels;",
	pages = "631 - 641",
	title = "{Q}o{S} support for video transmission in high-speed interconnects",
	volume = "4208 LNCS",
	year = 2006
}

A Martinez, G Apostolopoulos, F J Alfaro, J L Sanchez and Jose Duato. QoS support for video transmission in high-speed interconnects. 2006, 631 - 41. BibTeX

@conference{9132719,
	author = "A. Martinez and G. Apostolopoulos and F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "Multimedia traffic presents some special requirements that are unattainable with a best-effort service. Current interconnect standards provide mechanisms to overcome the limitations of the best-effort model, but they do not suffice to satisfy the strict requirements of video transmissions. This problem has been extensively addressed by the general networking community. Several solutions have arisen, but they are too complex to be applied to high-speed interconnects. In this paper, we propose a network architecture that is at the same time compatible with the requirements of high-speed interconnects and provides video traffic with the QoS it demands",
	address = "Berlin, Germany",
	journal = "High Performance Computing and Communications. Second International Conference, HPCC 2006. Proceedings (Lecture Notes in Computer Science Vol.4208)",
	keywords = "multimedia communication;quality of service;telecommunication traffic;video communication;",
	note = "QoS;video transmission;high-speed interconnects;multimedia traffic;network architecture;",
	pages = "631 - 41",
	title = "{Q}o{S} support for video transmission in high-speed interconnects",
	year = 2006
}

P J Garcia, F J Quiles, Jose Flich, Jose Duato and I Johnson. RECN-DD: A Memory-Efficient Congestion Management Technique for Advanced Switching. In Parallel Processing, 2006. ICPP 2006. International Conference on. 2006, 23 - 32. DOI BibTeX

@conference{1690602,
	author = "P.J. Garcia and F.J. Quiles and Flich, Jose and Duato, Jose and I. Johnson",
	abstract = "As VLSI technology advances, the interconnection network represents a larger percentage of the total system cost and power consumption. In fact, a current trend in network design is to reduce the number of components. However, this leads to systems working closer to saturation point, and therefore an efficient congestion management technique is required. In that sense, RECN has been recently proposed for advanced switching (AS). RECN detects the formation of congestion trees and dynamically allocates queues for storing congested packets, thus, eliminating the HOL blocking introduced by congestion trees. These queues are deallocated when congestion vanishes. We have identified two shortcomings that may affect RECN scalability and implementation. Firstly, although RECN allocates queues in an efficient way, resource deallocation is performed in-order, thus losing efficiency and wasting resources. This leads to an excessive requirement of memory at switch ports. Secondly, both allocation and deallocation mechanisms involve the use of specific control packets not supported by the AS standard, thus preventing RECN implementation. In this sense we provide a detailed description of the current RECN deallocation mechanism. In this paper we present an enhanced RECN version (RECN-DD) where these problems have been eliminated. Specifically, we propose a new distributed queue deallocation mechanism that reduces the number of required resources and does not require the use of control packets. Moreover, we propose a new congestion notification mechanism that does not require non-standard AS packets. Instead, flow control packets are used to notify congestion, thus simplifying the implementation of RECN-DD in AS",
	booktitle = "Parallel Processing, 2006. ICPP 2006. International Conference on",
	doi = "10.1109/ICPP.2006.62",
	issn = "0190-3918",
	keywords = "distributed queue deallocation;flow control packet;memory-efficient congestion management;regional explicit congestion notification;resource deallocation;multiprocessor interconnection networks;packet switching;queueing theory;telecommunication congestion",
	month = "14-18",
	pages = "23 - 32",
	title = "{RECN}-{DD}: {A} {M}emory-{E}fficient {C}ongestion {M}anagement {T}echnique for {A}dvanced {S}witching",
	year = 2006
}

B Caminero, C Carrion, F J Quiles, Jose Duato and S Yalamanchili. MMR: A MultiMedia Router architecture to support hybrid workloads. Journal of Parallel and Distributed Computing 66(2):307 - 321, 2006. URL BibTeX

@article{2006039654031,
	author = "B. Caminero and C. Carrion and F.J. Quiles and Duato, Jose and S. Yalamanchili",
	abstract = "During the past few years, multimedia traffic with quality of service (QoS) requirements has become of widespread use. Media servers are usually built as clusters of workstations connected by a high-performance interconnection network. However, these high-performance networks do not usually offer differentiated support for multimedia traffic. The MultiMedia Router (MMR) is a proposal to address the QoS issue in cluster networks within a compact architecture, while also integrating conventional best-effort traffic. This paper describes the main architectural features of the MMR, such as the use of a hybrid switching technique, credit-based flow control, or small input buffers. Also, the main design parameters are tuned by means of simulation. It can be seen how proper differentiation among the different traffic classes is achieved, while retaining a compact design with small buffers. {\copyright} 2005 Elsevier Inc. All rights reserved.",
	issn = 07437315,
	journal = "Journal of Parallel and Distributed Computing",
	key = "Hybrid computers",
	keywords = "Buffer circuits;Computer simulation;Flow control;Interconnection networks;Quality of service;Servers;Telecommunication traffic;",
	note = "Clusters of workstations (COW);Hybrid switching technique;Multimedia transmissions;Router architecture;",
	number = 2,
	pages = "307 - 321",
	title = "{MMR}: {A} {M}ulti{M}edia {R}outer architecture to support hybrid workloads",
	url = "http://dx.doi.org/10.1016/j.jpdc.2005.10.002",
	volume = 66,
	year = 2006
}

A Martinez, F J Alfaro, J L Sanchez and Jose Duato. Full QoS support with 2 VCs for single-chip switches. 2006, 4 pp. -. BibTeX

@conference{9077009,
	author = "A. Martinez and F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "Current interconnection standards providing hardware support for quality of service (QoS) consider up to 16 virtual channels (VCs) for this purpose. However, most implementations do not offer so many because VCs increase the complexity of the switch and the scheduling delays. We have shown that this number of VCs can be significantly reduced, because it is enough to use two VCs for QoS purposes at each switch port. In this paper, we explore two alternative switch designs that take advantage of this reduction",
	address = "Los Alamitos, CA, USA",
	journal = "5th IEEE International Symposium on Network Computing and Applications",
	keywords = "communication complexity;logic design;multiprocessor interconnection networks;network interfaces;network-on-chip;packet switching;quality of service;telecommunication channels;telecommunication traffic;",
	note = "single-chip switches;interconnection standard;quality of service;virtual channel;switch complexity;scheduling delay;switch design;network interface;",
	pages = "4 pp. -",
	title = "{F}ull {Q}o{S} support with 2 {VC}s for single-chip switches",
	year = 2006
}

Alejandro Martinez, Francisco J Alfaro, Jose L Sanchez and Jose Duato. Full QoS support with 2 VCs for single-chip switches. 2006, 239 - 242. URL BibTeX

@conference{20071710566900,
	author = "Alejandro Martinez and Francisco J. Alfaro and Jose L. Sanchez and Duato, Jose",
	abstract = "Current interconnection standards providing hardware support for quality of service (QoS) consider up to 16 virtual channels (VCs) for this purpose. However, most implementations do not offer so many because VCs increase the complexity of the switch and the scheduling delays. We have shown that this number of VCs can be significantly reduced, because it is enough to use two VCs for QoS purposes at each switch port. In this paper, we explore two alternative switch designs that take advantage of this reduction. {\copyright} 2006 IEEE.",
	address = "Cambridge, MA, United States",
	journal = "Proceedings - Fifth IEEE International Symposium on Network Computing and Applications, NCA 2006",
	key = "Quality of service",
	keywords = "Computational complexity;Computer hardware;Interconnection networks;Switching systems;",
	note = "Hardware support;Interconnection standards;Scheduling delays;Virtual channels (VC);",
	pages = "239 - 242",
	title = "{F}ull {Q}o{S} support with 2 {VC}s for single-chip switches",
	url = "http://dx.doi.org/10.1109/NCA.2006.33",
	volume = 2006,
	year = 2006
}

Aurelio Bermudez, Rafael Casado, Francisco J Quiles and Jose Duato. Fast routing computation on InfiniBand networks. IEEE Transactions on Parallel and Distributed Systems 17(3):215 - 226, 2006. URL BibTeX

@article{2006079693287,
	author = "Aurelio Bermudez and Rafael Casado and Francisco J. Quiles and Duato, Jose",
	abstract = "The InfiniBand architecture has been proposed as a technology both for communication between processing nodes and I/O devices, and for interprocessor communication. Its specification defines a basic management infrastructure that is responsible for subnet configuration and fault tolerance. Each time a topology change is detected, new forwarding tables have to be computed and uploaded to devices. The time required to compute these tables is a critical issue, because application traffic is negatively affected by the temporary lack of connectivity. In this paper, we show the way to integrate several routing algorithms, in order to combine their advantages. In particular, we merge a new proposal, characterized by its high computation speed but low efficiency, with a traditional one, slower but more efficient. Our goal is to provide new routes in a short period of time, minimizing the degradation mentioned before, and maintaining, at the same time, high network performance. {\copyright} 2006 IEEE.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Computer architecture",
	keywords = "Algorithms;Fault tolerant computer systems;Local area networks;Network protocols;Topology;",
	note = "Fast routing computation;High-speed LANs;InfiniBand networks;Network management;Network topology;Routing protocols;",
	number = 3,
	pages = "215 - 226",
	title = "{F}ast routing computation on {I}nfini{B}and networks",
	url = "http://dx.doi.org/10.1109/TPDS.2006.35",
	volume = 17,
	year = 2006
}

F O Sem-Jacobsen, T Skeie, O Lysne and Jose Duato. Dynamic fault tolerance with misrouting in fat trees. 2006, 10 pp. -. BibTeX

@conference{9165284,
	author = "F.O. Sem-Jacobsen and T. Skeie and O. Lysne and Duato, Jose",
	abstract = "Fault tolerance is critical for efficient utilisation of large computer systems. Dynamic fault tolerance allows the network to remain available through the occurrence of faults as opposed to static fault tolerance which requires the network to be halted to reconfigure it. Although dynamic fault tolerance may lead to less efficient solutions than static fault tolerance, it allows for a much higher availability of the system. In this paper we devise a dynamic fault tolerant adaptive routing algorithm for the fat tree, a much used interconnect topology, which relies on misrouting around link faults. We show that we are guaranteed to tolerate any combination of less than (num_switch_ports)/2 link faults without the need for additional network resources for deadlock freedom. There is also a high probability of tolerating an even larger number of link faults. Simulation results show that network performance degrades very little when faults are dynamically tolerated",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 2006 International Conference on Parallel Processing",
	keywords = "fault tolerant computing;multiprocessor interconnection networks;telecommunication network routing;trees (mathematics);",
	note = "dynamic fault tolerance;fat tree;dynamic fault tolerant adaptive routing;interconnect topology;link fault misrouting;network performance;",
	pages = "10 pp. -",
	title = "{D}ynamic fault tolerance with misrouting in fat trees",
	year = 2006
}

Frank Olaf Sem-Jacobsen, Tor Skeie, Olav Lysne and Jose Duato. Dynamic fault tolerance with misrouting in fat trees. 2006, 33 - 42. URL BibTeX

@conference{20073110737857,
	author = "Frank Olaf Sem-Jacobsen and Tor Skeie and Olav Lysne and Duato, Jose",
	abstract = "Fault tolerance is critical for efficient utilisation of large computer systems. Dynamic fault tolerance allows the network to remain available through the occurrence of faults as opposed to static fault tolerance which requires the network to be halted to reconfigure it. Although dynamic fault tolerance may lead to less efficient solutions than static fault tolerance, it allows for a much higher availability of the system. In this paper we devise a dynamic fault tolerant adaptive routing algorithm for the fat tree, a much used interconnect topology, which relies on misrouting around link faults. We show that we are guaranteed to tolerate any combination of less than num_switch_ports/2 link faults without the need for additional network resources for deadlock freedom. There is also a high probability of tolerating an even larger number of link faults. Simulation results show that network performance degrades very little when faults are dynamically tolerated. {\copyright} 2006 IEEE.",
	address = "Columbus, OH, United States",
	issn = 01903918,
	journal = "Proceedings of the International Conference on Parallel Processing",
	key = "Fault tolerance",
	keywords = "Adaptive algorithms;Computer networks;Computer resource management;Computer simulation;Network routing;Trees (mathematics);",
	note = "Dynamic fault tolerance;Large computer systems;Link faults;",
	pages = "33 - 42",
	title = "{D}ynamic fault tolerance with misrouting in fat trees",
	url = "http://dx.doi.org/10.1109/ICPP.2006.36",
	year = 2006
}

P Morillo, J M Orduna and Jose Duato. A scalable synchronization technique for distributed virtual environments based on networked-server architectures. 2006, 74 - 81. URL BibTeX

@conference{20073110720815,
	author = "P. Morillo and J.M. Orduna and Duato, Jose",
	abstract = "In recent years, large scale distributed virtual environments have become a major trend in distributed applications, mainly due to the enormous popularity of multiplayer online games in the entertainment industry. Thus, scalability has become an essential issue for these highly interactive systems. In this paper, we propose a new synchronization technique for those distributed virtual environments that are based on networked-server architectures. Unlike other methods described in the literature, the proposed technique takes into account the updating messages exchanged by avatars, thus releasing the servers from updating the location of such avatars when synchronizing the state of the system. As a result, the communications required for synchronization are greatly reduced, and this method is more scalable. Also, these communications are distributed along the whole synchronization period, in order to reduce workload peaks. Performance evaluation results show that the proposed approach significantly reduces the percentage of CPU utilization in the servers when compared with other existing methods, therefore supporting a higher number of avatars. Additionally, the system response time is reduced accordingly. {\copyright} 2006 IEEE.",
	address = "Columbus, OH, United States",
	issn = 15302016,
	journal = "Proceedings of the International Conference on Parallel Processing Workshops",
	key = "Distributed computer systems",
	keywords = "Communication systems;Computer architecture;Data processing;Interactive computer systems;Program processors;Servers;Virtual reality;",
	note = "Distributed applications;Entertainment industry;Synchronization technique;Virtual environments;",
	pages = "74 - 81",
	title = "{A} scalable synchronization technique for distributed virtual environments based on networked-server architectures",
	url = "http://dx.doi.org/10.1109/ICPPW.2006.16",
	year = 2006
}

P Morillo, J M Orduña and Jose Duato. A scalable synchronization technique for distributed virtual environments based on networked-server architectures. 2006, 8 pp. -. BibTeX

@conference{9089294,
author = "P. Morillo and J.M. Ordu{\~n}a and Duato, Jose",
abstract = "Large scale distributed virtual environments have become a major trend in distributed applications, mainly due to the enormous popularity of multi-player online games in the entertainment industry. Thus, scalability has become an essential issue for these highly interactive systems. In this paper, we propose a new synchronization technique for those distributed virtual environments that are based on networked-server architectures. Unlike other methods described in the literature, the proposed technique takes into account the updating messages exchanged by avatars, thus releasing the servers from updating the location of such avatars when synchronizing the state of the system. As a result, the communications required for synchronization are greatly reduced, and this method is more scalable. Also, these communications are distributed along the whole synchronization period, in order to reduce workload peaks. Performance evaluation results show that the proposed approach significantly reduces the percentage of CPU utilization in the servers when compared with other existing methods, therefore supporting a higher number of avatars. Additionally, the system response time is reduced accordingly",
address = "Los Alamitos, CA, USA",
journal = "2006 International Conference on Parallel Processing Workshops",
keywords = "avatars;distributed processing;interactive systems;network servers;performance evaluation;resource allocation;",
note = "scalable synchronization technique;distributed virtual environments;networked-server architectures;multiplayer online games;entertainment industry;avatars;CPU utilization;interactive systems;",
pages = "8 pp. -",
title = "{A} scalable synchronization technique for distributed virtual environments based on networked-server architectures",
year = 2006
}

Elvira Baydal, Pedro Lopez and Jose Duato. A family of mechanisms for congestion control in wormhole networks. Parallel and Distributed Systems, IEEE Transactions on 16(9):772 - 784, 2005. URL, DOI BibTeX

@article{1490509,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Multiprocessor interconnection networks may reach congestion with high traffic loads, which prevents reaching the wished performance. Unfortunately, many of the mechanisms proposed in the literature for congestion control either suffer from a lack of robustness, being unable to work properly with different traffic patterns or message lengths, or detect congestion relying on global information that wastes some network bandwidth. This paper presents a family of mechanisms to avoid network congestion in wormhole networks. All of them need only local information, applying message throttling when it is required. The proposed mechanisms use different strategies to detect network congestion and also apply different corrective actions. The mechanisms are evaluated and compared for several network loads and topologies, noticeably improving network performance with high loads but without penalizing network behavior for low and medium traffic rates, where no congestion control is required.",
	doi = "10.1109/TPDS.2005.102",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "message throttling; multiprocessor interconnection network; network bandwidth; network congestion control; traffic load; wormhole network; wormhole switching; multiprocessor interconnection networks; telecommunication congestion control; telecommunicatio",
	month = "sept.",
	number = 9,
	pages = "772 - 784",
	title = "{A} family of mechanisms for congestion control in wormhole networks",
	url = "http://dx.doi.org/10.1109/TPDS.2005.102",
	volume = 16,
	year = 2005
}

Teresa Nachiondo, Jose Flich, Jose Duato and M Gusat. Cost/performance trade-offs and fairness evaluation of queue mapping policies. In José Cunha; Pedro C D Medeiros (ed.). Euro-Par 2005 Parallel Processing 3648. August 2005, 1024 - 1034. URL, DOI BibTeX

@conference{8746125,
	author = "Nachiondo, Teresa and Flich, Jose and Duato, Jose and M. Gusat",
	abstract = "Whereas the established interconnection networks (ICTN) achieve low latency by operating in the linear region, i.e. oversizing the fabric, the strict cost and power constraints demand more efficient utilization of future networks. Increasing the utilization of lossless ICTNs may, however, lead to saturation and performance degradation owing to HOL-blocking. The current solution to HOL-blocking consists of using virtual output queueing (VOQ), whose quadratical scalability is expensive in large networks. To improve VOQ's scalability we have proposed the destination-based buffer management (DBBM), a scheme that compares well with VOQ. Whereas previously we have analyzed DBBM's basic operation and performance, in this paper we have set two different goals. First we focus on how the different DBBM mappings can impact the cost/performance of multistage ICTNs. Next, because DBBM can introduce unfairness, this constitutes the second theme of our paper. The new results show that DBBM with modulo-4/8 mapping performs very well for only a fraction of the VOQ cost. Also in terms of fairness DBBM shows promise, because it (i) keeps the unfairness degree independent of both topology and routing, while (ii) minimizing the number of flows affected by unfairness",
	booktitle = "Euro-Par 2005 Parallel Processing",
	doi = "10.1007/11549468_112",
	editor = "Jos{\'e} C. Cunha; Pedro D. Medeiros",
	isbn = "978-3-540-28700-1",
	journal = "Euro-Par 2005 Parallel Processing. 11th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol. 3648)",
	keywords = "buffer storage;multistage interconnection networks;performance evaluation;queueing theory;",
	month = "Aug",
	note = "fairness evaluation;queue mapping policies;interconnection networks;destination-based buffer management;multistage ICTN;",
	pages = "1024 - 1034",
	series = "Lecture Notes in Computer Science",
	title = "{C}ost/performance trade-offs and fairness evaluation of queue mapping policies",
	url = "http://dx.doi.org/10.1007/11549468_112",
	volume = 3648,
	year = 2005
}

Maria E Gomez, Pedro Lopez and Jose Duato. A Memory-Effective Fault-Tolerant Routing Strategy for Direct Interconnection Networks. In Parallel and Distributed Computing, 2005. ISPDC 2005. The 4th International Symposium on. July 2005, 341 - 348. URL, DOI BibTeX

@conference{1609988,
	author = "Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "High-performance interconnection networks are crucial in massively parallel computers. Routing is one of the most important design issues of interconnection networks. Moreover, the huge amount of hardware of these machines makes fault-tolerance another important design issue. In this paper, we propose a mechanism that combines scalable routing and fault-tolerance for commercial switches to build direct regular topologies, which are the topologies used in large machines. The hardware required is not complex. Furthermore, it allows a high degree of fault-tolerance inflicting a minimal decrease of performance",
	booktitle = "Parallel and Distributed Computing, 2005. ISPDC 2005. The 4th International Symposium on",
	doi = "10.1109/ISPDC.2005.6",
	keywords = "adaptive routing;direct interconnection networks;distributed routing;memory-effective fault-tolerant routing;fault tolerance;multiprocessor interconnection networks;telecommunication network reliability;telecommunication network routing;",
	month = "july",
	pages = "341 - 348",
	title = "{A} {M}emory-{E}ffective {F}ault-{T}olerant {R}outing {S}trategy for {D}irect {I}nterconnection {N}etworks",
	url = "http://dx.doi.org/10.1109/ISPDC.2005.6",
	year = 2005
}

Marina Alonso, Juan Miguel Martínez, Vicente Santonja, Pedro Lopez and Jose Duato. Power Saving in Regular Interconnection Networks Built with High-Degree Switches. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International. April 2005, 5b - 5b. URL, DOI BibTeX

@conference{1419820,
	author = "Alonso, Marina and Mart{\'i}nez, Juan Miguel and Santonja, Vicente and Lopez, Pedro and Duato, Jose",
	abstract = "Nowadays, high-degree switches are available as building blocks of the interconnection network of clusters of PCs. An alternative to take advantage of the high number of switch ports is to connect every pair of switches through not only one but several links (this is known as link trunking in other environments). This extra connectivity can be exploited by using adaptive routing algorithms, thus improving network throughput and reducing network congestion. However, with low traffic loads, all the links that compose the trunk link will not be utilized, but these idle links continue consuming power. Power consumption reduction techniques are being applied everywhere in computer systems and the interconnection network is not an exception, as its contribution is not negligible. In this paper, we present a mechanism that dynamically switches on and off network links as a function of traffic. It is specially targeted to those networks where trunk links are used. The mechanism can switch off any link, provided that network connectivity is guaranteed (i.e., every pair of switches should be connected through at least one active link). Indeed, this restriction makes it possible to use the same routing algorithm regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that the network power consumption can be greatly reduced, at the expense of some increase in latency. Nevertheless, it is shown that the power reduction is always higher than this latency increase.",
	booktitle = "Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International",
	doi = "10.1109/IPDPS.2005.349",
	isbn = "0-7695-2312-9",
	keywords = "PC clusters; adaptive routing algorithm; high-degree switch; link trunking; network congestion; network link; network throughput; power consumption; power saving; regular interconnection network; telecommunication traffic; power consumption; telecommunic",
	month = "april",
	pages = "5b - 5b",
	title = "{P}ower {S}aving in {R}egular {I}nterconnection {N}etworks {B}uilt with {H}igh-{D}egree {S}witches",
	url = "http://dx.doi.org/10.1109/IPDPS.2005.349",
	year = 2005
}

Maria E Gomez, Pedro Lopez and Jose Duato. A Memory-Effective Routing Strategy for Regular Interconnection Networks. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International. April 2005, 41b - 41b. URL, DOI BibTeX

@conference{1419862,
	author = "Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Massively parallel computing systems have been or are being built with thousands of nodes. In such systems, high-performance interconnection networks are crucial to achieve the maximum performance. Routing is one of the most important design issues of interconnection networks. Routing strategies can be mainly classified as source and distributed routing. Source routing has been used in some networks because routers are very simple. On the other hand, distributed routing allows more flexibility, but the routers are more complex. Distributed routing can be implemented by a fixed hardware specific to a routing function on a given topology, or by using forwarding tables that are very flexible but suffer from a lack of scalability. In this paper, we propose a distributed routing strategy for commercial switches, Flexible Interval Routing, that is scalable for the most widely used regular topologies (tori and meshes) because it is not based on tables. At the same time, the strategy is easy to reconfigure to deal with changes in the topology or in the routing algorithm for a given topology, being able to implement the most commonly-used routing algorithms in regular topologies.",
	booktitle = "Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International",
	doi = "10.1109/IPDPS.2005.44",
	keywords = "distributed routing; flexible interval routing; high-performance interconnection networks; memory-effective routing strategy; parallel computing system; multiprocessor interconnection networks; network routing; parallel machines; performance evaluation;",
	month = "april",
	pages = "41b - 41b",
	title = "{A} {M}emory-{E}ffective {R}outing {S}trategy for {R}egular {I}nterconnection {N}etworks",
	url = "http://dx.doi.org/10.1109/IPDPS.2005.44",
	year = 2005
}

Teresa Nachiondo, Jose Flich and Jose Duato. Efficient reduction of HOL blocking in multistage networks. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International. April 2005, 8 pp.. URL, DOI BibTeX

@conference{1420115,
	author = "Nachiondo, Teresa and Flich, Jose and Duato, Jose",
	abstract = "Head-of-line blocking is one of the main problems arising in input-buffered switches. The best-known solution to this problem consists of using virtual output queues (VOQs). However this strategy is not scalable. Its implementation cost increases quadratically with the number of ports in the switch. Taking into account current trends, the demand for larger number of ports in high-performance switches is likely to increase very rapidly in the future. Therefore, a scalable and cost-effective solution is required. In this paper we propose an efficient and cost-effective strategy (belonging to a family of strategies previously proposed, referred to as destination-based buffer management (DBBM)), to reduce HOL blocking in single-stage and multistage networks. The proposed strategy is based on allowing certain destinations to share the same queue. Its main purpose is to maximize network throughput whereas keeping HOL blocking to negligible values. In this paper, we apply the strategy at every switch included in a bidirectional multistage network (BMIN). We have evaluated DBBM, VOQ, and alternative strategies in different BMIN sizes and with different traffic conditions (synthetic traffic, IP traces, and I/O traces). Results show that DBBM with a reduced number of queues at each switch obtains roughly the same throughput as the VOQ mechanism. Moreover, VOQ at the switch level (as many queues as output ports at every switch) has also been analyzed. Results demonstrate that it does not scale. As the number of stages in the network increases, the VOQ solution at the switch level introduces more HOL blocking that leads to a severe degradation in network throughput. With the DBBM using 16 queues, maximum network throughput is sustained for all the traffic cases analyzed. Moreover, as the network size increases (up to a 2048 × 2048 BMIN), DBBM keeps roughly the same performance with the same number of queues.",
	booktitle = "Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International",
	doi = "10.1109/IPDPS.2005.193",
	isbn = "0-7695-2312-9",
	keywords = "bidirectional multistage network; destination-based buffer management; head-of-line blocking; high-performance switch; virtual output queue; multistage interconnection networks; queueing theory; storage management; telecommunication traffic;",
	month = "April",
	pages = "8 pp.",
	title = "{E}fficient reduction of {HOL} blocking in multistage networks",
	url = "http://dx.doi.org/10.1109/IPDPS.2005.193",
	year = 2005
}

Jose Duato, I Johnson, Jose Flich, F Naven, P Garcia and Teresa Nachiondo. A new scalable and cost-effective congestion management strategy for lossless multistage interconnection networks. In High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on. February 2005, 108 - 119. URL, DOI BibTeX

@conference{1385933,
	author = "Duato, Jose and I. Johnson and Flich, Jose and F. Naven and P. Garcia and Nachiondo, Teresa",
	abstract = "In this paper, we propose a new congestion management strategy for lossless multistage interconnection networks that scales as network size and/or link bandwidth increase. Instead of eliminating congestion, our strategy avoids performance degradation beyond the saturation point by eliminating the HOL blocking produced by congestion trees. This is achieved in a scalable manner by using separate queues for congested flows. These are dynamically allocated only when congestion arises, and deallocated when congestion subsides. Performance evaluation results show that our strategy responds to congestion immediately and completely eliminates the performance degradation produced by HOL blocking while using only a small number of additional queues.",
	booktitle = "High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on",
	doi = "10.1109/HPCA.2005.1",
	isbn = "0-7695-2275-0",
	issn = "1530-0897",
	keywords = "HOL blocking; congestion management; congestion trees; lossless multistage interconnection networks; network queue; computer network management; multistage interconnection networks; queueing theory; telecommunication congestion control;",
	month = "Feb",
	pages = "108 - 119",
	title = "{A} new scalable and cost-effective congestion management strategy for lossless multistage interconnection networks",
	url = "http://dx.doi.org/10.1109/HPCA.2005.1",
	year = 2005
}

Elvira Baydal, Pedro Lopez and Jose Duato. A family of mechanisms for congestion control in wormhole networks. IEEE Transactions on Parallel and Distributed Systems 16(9):772 - 84, 2005. URL BibTeX

@article{8570709,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Multiprocessor interconnection networks may reach congestion with high traffic loads, which prevents reaching the wished performance. Unfortunately, many of the mechanisms proposed in the literature for congestion control either suffer from a lack of robustness, being unable to work properly with different traffic patterns or message lengths, or detect congestion relying on global information that wastes some network bandwidth. This paper presents a family of mechanisms to avoid network congestion in wormhole networks. All of them need only local information, applying message throttling when it is required. The proposed mechanisms use different strategies to detect network congestion and also apply different corrective actions. The mechanisms are evaluated and compared for several network loads and topologies, noticeably improving network performance with high loads but without penalizing network behavior for low and medium traffic rates, where no congestion control is required",
	address = "USA",
	issn = "1045-9219",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	keywords = "multiprocessor interconnection networks;telecommunication congestion control;telecommunication network routing;telecommunication network topology;telecommunication switching;telecommunication traffic;",
	note = "multiprocessor interconnection network;traffic load;network congestion control;network bandwidth;wormhole network;message throttling;wormhole switching;",
	number = 9,
	pages = "772 - 84",
	title = "{A} family of mechanisms for congestion control in wormhole networks",
	url = "http://dx.doi.org/10.1109/TPDS.2005.102",
	volume = 16,
	year = 2005
}

Marina Alonso, Juan Miguel Martínez, Vicente Santonja, Pedro Lopez and Jose Duato. Power saving in regular interconnection networks built with high-degree switches. 2005, 10 pp. -. BibTeX

@conference{8539357,
	author = "Alonso, Marina and Mart{\'i}nez, Juan Miguel and Santonja, Vicente and Lopez, Pedro and Duato, Jose",
	abstract = "Nowadays, high-degree switches are available as building blocks of the interconnection network of clusters of PCs. An alternative to take advantage of the high number of switch ports is to connect every pair of switches through not only one but also several links (this is known as link trunking in other environments). This extra connectivity can be exploited by using adaptive routing algorithms, thus improving network throughput and reducing network congestion. However, with low traffic loads, all the links that compose the trunk link will not be utilized, but these idle links continue consuming power. Power consumption reduction techniques are being applied everywhere in computer systems and the interconnection network is not an exception, as its contribution is not negligible. In this paper, we present a mechanism that dynamically switches on and off network links as a function of traffic. It is specially targeted to those networks where trunk links are used. The mechanism can switch off any link, provided that network connectivity is guaranteed (i.e., every pair of switches should be connected through at least one active link). Indeed, this restriction makes it possible to use the same routing algorithm regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that the network power consumption can be greatly reduced, at the expense of some increase in latency. Nevertheless, it is shown that the power reduction is always higher than this latency increase",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 19th IEEE International Parallel and Distributed Processing Symposium",
	keywords = "power consumption;telecommunication congestion control;telecommunication links;telecommunication network routing;telecommunication switching;telecommunication traffic;workstation clusters;",
	note = "power saving;regular interconnection network;high-degree switch;PC clusters;link trunking;adaptive routing algorithm;network throughput;network congestion;power consumption;telecommunication traffic;network link;",
	pages = "10 pp. -",
	title = "{P}ower saving in regular interconnection networks built with high-degree switches",
	year = 2005
}

Michihiro Koibuchi, Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Enforcing in-order packet delivery in system area networks with adaptive routing. Journal of Parallel and Distributed Computing 65(10):1223 - 1236, 2005. URL BibTeX

@article{2005379355213,
	author = "Michihiro Koibuchi and Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = "Adaptive routing, which dynamically selects the route of packets, has been widely studied for interconnection networks in massively parallel computers and system area networks. Although adaptive routing has the advantage of providing high bandwidth, it may deliver packets out-of-order, which some message passing libraries do not accept. In this paper, we propose two mechanisms called (1) FIFO transmission and (2) couple limitation to guarantee in-order packet delivery in adaptive routing. Both of them limit packet injection at source hosts. The FIFO transmission completely avoids packet sorting at destination hosts, while the couple limitation uses a few buffers to sort packets at destination hosts. Evaluation results show that the FIFO transmission and the couple limitation achieve a similar throughput to that of a method equipped with huge (infinite) buffers enough to store all out-of-order packets at destination hosts under both synthetic traffic and NAS Parallel Benchmarks. © 2005 Elsevier Inc. All rights reserved.",
	issn = 07437315,
	journal = "Journal of Parallel and Distributed Computing",
	key = "Packet networks",
	keywords = "Bandwidth;Benchmarking;Interconnection networks;Routers;Telecommunication traffic;",
	note = "Adaptive routing;In-order packet delivery;PC clusters;System area networks;",
	number = 10,
	pages = "1223 - 1236",
	title = "{E}nforcing in-order packet delivery in system area networks with adaptive routing",
	url = "http://dx.doi.org/10.1016/j.jpdc.2005.04.007",
	volume = 65,
	year = 2005
}

Blanca Caminero, Carmen Carrion, Francisco J Quiles, Jose Duato and Sudhakar Yalamanchili. Traffic scheduling solutions with QoS support for an input-buffered multimedia router. IEEE Transactions on Parallel and Distributed Systems 16(11):1009 - 1021, 2005. URL BibTeX

@article{2005499529197,
	author = "Blanca Caminero and Carmen Carrion and Francisco J. Quiles and Duato, Jose and Sudhakar Yalamanchili",
	abstract = "Quality of Service (QoS) support in local and cluster area environments has become an issue of great interest in recent years. Most current high-performance interconnection solutions for these environments have been designed to enhance conventional best-effort traffic performance, but are not well-suited to the special requirements of the new multimedia applications. The MultiMedia Router (MMR) aims at offering hardware-based QoS support within a compact interconnection component. One of the key elements in the MMR architecture are the algorithms used in traffic scheduling. These algorithms are responsible for the order in which information is forwarded through the internal switch. Thus, they are closely related to the QoS-provisioning mechanisms. In this paper, several traffic scheduling algorithms developed for the MMR architecture are described. Their general organization is motivated by chances for parallelization and pipelining, while providing the necessary support both to multimedia flows and to best-effort traffic. Performance evaluation results show that the QoS requirements of different connections are met, in spite of the presence of best-effort traffic, while achieving high link utilizations. © 2005 IEEE.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Data communication systems",
	keywords = "Local area networks;Multimedia systems;Parallel processing systems;Pipeline processing systems;Quality of service;Routers;Switching networks;Telecommunication traffic;",
	note = "Cluster networks;Input-buffered multimedia router;Link scheduling;Switch architecture;Switch scheduling;",
	number = 11,
	pages = "1009 - 1021",
	title = "{T}raffic scheduling solutions with {Q}o{S} support for an input-buffered multimedia router",
	url = "http://dx.doi.org/10.1109/TPDS.2005.140",
	volume = 16,
	year = 2005
}

Francisco J Alfaro, Jose L Sanchez and Jose Duato. {S. 989 - 994. BibTeX

@conference{2005459464579,
	author = "Francisco J. Alfaro and Jose L. Sanchez and Duato, Jose",
	abstract = "InfiniBand (IBA) has been proposed as an industry-standard architecture both for I/O server and interprocessor communication. IBA employs a switched point-to-point network, instead of using a shared bus. IBA is being developed by the InfiniBand℠ Trade Association to provide present and future server systems with the required levels of reliability, availability, performance, scalability, and quality of service (QoS). In previous papers we have proposed an effective strategy for configuring the IBA networks to provide users with the required levels of QoS. This strategy is based on the proper configuration of the mechanisms IBA carries to support QoS. Specifically, our methodology configures the InfiniBand Arbitration Tables and uses the different Service Levels and Virtual Lanes that are available, in order to segregate the different traffic flows. Thus, each flow receives the treatment it has previously requested. Moreover, by using our methodology, applications can be assured that their requirements will be satisfied. In this paper, we review the basis of our methodology and we study the influence of the packet size on the QoS guaranteed to the applications. © 2005 IEEE.",
	address = "Murcia, Spain",
	issn = 15301346,
	journal = "Proceedings - IEEE Symposium on Computers and Communications",
	key = "Servers",
	keywords = "Computer architecture;Program processors;Quality of service;",
	note = "Interprocessor communication;Point-to-point network;Point-to-point networks;",
	pages = "989 - 994",
	title = "{S"
}

F J Alfaro, J L Sanchez and Jose Duato. {S. 989 - 94. BibTeX

@conference{8642751,
	author = "F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "InfiniBand (IBA) has been proposed as an industry-standard architecture both for I/O server and interprocessor communication. IBA employs a switched point-to-point network, instead of using a shared bus. IBA is being developed by the InfiniBand℠ Trade Association to provide present and future server systems with the required levels of reliability, availability, performance, scalability, and quality of service (QoS). In previous papers we have proposed an effective strategy for configuring the IBA networks to provide users with the required levels of QoS. This strategy is based on the proper configuration of the mechanisms IBA carries to support QoS. Specifically, our methodology configures the InfiniBand arbitration tables and uses the different service levels and virtual lanes that are available, in order to segregate the different traffic flows. Thus, each flow receives the treatment it has previously requested. Moreover, by using our methodology, applications can be assured that their requirements will be satisfied. In this paper, we review the basis of our methodology and we study the influence of the packet size on the QoS guaranteed to the applications",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 10th IEEE Symposium on Computers and Communications",
	keywords = "network servers;packet switching;quality of service;telecommunication traffic;",
	note = "InfiniBand packet size;QoS;industry-standard architecture;I/O server;interprocessor communication;switched point-to-point network;server systems;quality of service;InfiniBand arbitration tables;service levels;virtual lanes;traffic flows;",
	pages = "989 - 94",
	title = "{S"
}

Jose Duato, Olav Lysne, Ruoming Pang and Timothy M Pinkston. Part I: A theory for deadlock-free dynamic network reconfiguration. IEEE Transactions on Parallel and Distributed Systems 16(5):412 - 427, 2005. URL BibTeX

@article{2005259162434,
	author = "Duato, Jose and Olav Lysne and Ruoming Pang and Timothy M. Pinkston",
	abstract = "This paper develops theoretical support useful for determining deadlock properties of dynamic network reconfiguration techniques and also serves as a basis for the development of design methodologies useful for deriving deadlock-free reconfiguration techniques. It is applicable to interconnection networks typically used in multiprocessor servers, network-based computing clusters, and distributed storage systems, and also has potential application to system-on-chip networks. This theory builds on basic principles established by previous theories while pioneering new concepts fundamental to the case of dynamic network reconfiguration. © 2005 IEEE.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Interconnection networks",
	keywords = "Communication channels (information theory);Computer system recovery;Data storage equipment;Servers;Theorem proving;",
	note = "Deadlock freedom theory;Dynamic reconfiguration;",
	number = 5,
	pages = "412 - 427",
	title = "{P}art {I}: {A} theory for deadlock-free dynamic network reconfiguration",
	url = "http://dx.doi.org/10.1109/TPDS.2005.58",
	volume = 16,
	year = 2005
}

Olav Lysne, Timothy Mark Pinkston and Jose Duato. Part II: A methodology for developing deadlock-free dynamic network reconfiguration processes. IEEE Transactions on Parallel and Distributed Systems 16(5):428 - 443, 2005. URL BibTeX

@article{2005259162435,
	author = "Olav Lysne and Timothy Mark Pinkston and Duato, Jose",
	abstract = "Dynamic network reconfiguration is defined as the process of changing from one routing function to another while the network remains up and running. The main challenge is in avoiding deadlock anomalies while keeping restrictions on packet injection and forwarding minimal. Current approaches either require virtual channels in the network or they work only for a limited set of routing algorithms and/or fault patterns. In this paper, we present a methodology for devising deadlock free and dynamic transitions between old and new routing functions that is consistent with newly proposed theory [1]. The methodology is independent of topology, can be applied to any deadlock-free routing function, and puts no restrictions on the routing function changes that can be supported. Furthermore, it does not require any virtual channels to guarantee deadlock freedom. This research is motivated by current trends toward using increasingly larger Internet and transaction processing servers based on clusters of PCs that have very high availability and dependability requirements, as well as other local, system, and storage area network-based computing systems. © 2005 IEEE.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Interconnection networks",
	keywords = "Algorithms;Communication channels (information theory);Computer system recovery;Computer systems;Internet;Routers;Servers;",
	note = "Deadlock freedom methodology;Dynamic reconfiguration;",
	number = 5,
	pages = "428 - 443",
	title = "{P}art {II}: {A} methodology for developing deadlock-free dynamic network reconfiguration processes",
	url = "http://dx.doi.org/10.1109/TPDS.2005.59",
	volume = 16,
	year = 2005
}

A Martinez, F J Alfaro, J L Sanchez and Jose Duato. Providing full QoS support in clusters using only two VCs at the switches. 2005, 158 - 169. BibTeX

@conference{2006219900053,
	author = "A. Martinez and F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "Current interconnect standards providing hardware support for quality of service (QoS) consider up to 16 virtual channels (VCs) for this purpose. However, most implementations do not offer so many VCs because they increase the complexity of the switch and the scheduling delays. In this paper, we show that this number of VCs can be significantly reduced. Some of the scheduling decisions made at network interfaces can be easily reused at switches without significantly altering the global behavior. Specifically, we show that it is enough to use two VCs for QoS purposes at each switch port, thereby simplifying the design and reducing its cost. © Springer-Verlag Berlin Heidelberg 2005.",
	address = "Goa, India",
	issn = 03029743,
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Optical interconnects",
	keywords = "Computational complexity;Interfaces (computer);Quality of service;Scheduling;Standards;Virtual reality;",
	note = "delays;Hardware support;Virtual channels (VCs);",
	pages = "158 - 169",
	title = "{P}roviding full {Q}o{S} support in clusters using only two {VC}s at the switches",
	volume = "3769 LNCS",
	year = 2005
}

A Martinez, F J Alfaro, J L Sanchez and Jose Duato. Providing full QoS support in clusters using only two VCs at the switches. 2005, 158 - 69. BibTeX

@conference{8815927,
	author = "A. Martinez and F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "Current interconnect standards providing hardware support for quality of service (QoS) consider up to 16 virtual channels (VCs) for this purpose. However, most implementations do not offer so many VCs because they increase the complexity of the switch and the scheduling delays. In this paper, we show that this number of VCs can be significantly reduced. Some of the scheduling decisions made at network interfaces can be easily reused at switches without significantly altering the global behavior. Specifically, we show that it is enough to use two VCs for QoS purposes at each switch port, thereby simplifying the design and reducing its cost",
	address = "Berlin, Germany",
	journal = "High Performance Computing-HiPC 2005. 12th International Conference. Proceedings (Lecture Notes in Computer Science Vol.3769)",
	keywords = "network interfaces;processor scheduling;quality of service;workstation clusters;",
	note = "full QoS support;workstation clusters;quality of service;virtual channels;scheduling decision reuse;network interfaces;",
	pages = "158 - 69",
	title = "{P}roviding full {Q}o{S} support in clusters using only two {VC}s at the switches",
	year = 2005
}

P J Garcia, Jose Flich, Jose Duato, F J Quiles, I Johnson and F Naven. On the correct sizing on meshes through an effective congestion management strategy. 2005, 1035 - 45. BibTeX

@conference{8746126,
	author = "P.J. Garcia and Flich, Jose and Duato, Jose and F.J. Quiles and I. Johnson and F. Naven",
	abstract = "Interconnection networks used in clusters of PCs are often dimensioned with certain restrictions. One restriction could be the reduction of power consumption and overall cost. In this sense, the network size must be reduced. Another restriction is to guarantee that the system offers a minimum bandwidth. In this case, the network size must be increased. In both cases, the head-of-line (HOL) blocking effect (related to network congestion) may appear, degrading network performance and thus, preventing the correct sizing of the network. Therefore, some mechanisms should be implemented for reducing or eliminating this problem, in order to dimension the network as desired while keeping network performance at maximum. In this paper we analyze the impact on network performance when using different mechanisms for handling HOL blocking when interconnection networks with mesh topology are dimensioned in several ways. We show that the previously proposed RECN congestion control mechanism is key in order to efficiently eliminate HOL blocking in meshes and, therefore, it allows the correct network sizing",
	address = "Berlin, Germany",
	journal = "Euro-Par 2005 Parallel Processing. 11th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol. 3648)",
	keywords = "computer network management;multiprocessor interconnection networks;performance evaluation;telecommunication congestion control;",
	note = "mesh network sizing;congestion management;interconnection networks;head-of-line blocking reduction;HOL blocking handling;RECN congestion control;",
	pages = "1035 - 45",
	title = "{O}n the correct sizing on meshes through an effective congestion management strategy",
	year = 2005
}

Wu-Chun Feng and Jose Duato. Message from the program co-chairs. Proceedings of the International Conference on Parallel Processing 2005:xii - xii, 2005. URL BibTeX

@article{2006259954343,
	author = "Wu-Chun Feng and Duato, Jose",
	abstract = "No abstract available",
	address = "Oslo, Norway",
	issn = 01903918,
	journal = "Proceedings of the International Conference on Parallel Processing",
	pages = "xii - xii",
	title = "{M}essage from the program co-chairs",
	url = "http://dx.doi.org/10.1109/ICPP.2005.52",
	volume = 2005,
	year = 2005
}

Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez, Jose Duato and M Koibuchi. In-Order Packet Delivery in Interconnection Networks using Adaptive Routing. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International. 2005, 101 - 101. DOI BibTeX

@conference{1419928,
	author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose and M. Koibuchi",
	abstract = "Most commercial switch-based network technologies for PC clusters use deterministic routing. Alternatively, adaptive routing could be used to improve network performance. In this case, switches decide the path to reach the destination by using local information about the state of the possible outgoing links. However, there are two drawbacks that discourage adaptive routing from being applied to commercial interconnects. The first one concerns the possible switch complexity increase with respect to deterministic routing. The second drawback is due to the fact that adaptive routing may introduce out-of-order packet delivery, which is not acceptable for some applications. To the best of our knowledge, there are no works that analyze the degree of out-of-order packet delivery caused by different network and traffic conditions. In this paper, we take on such a challenge. We show that only for high traffic conditions (reaching saturation) out-of-order delivery is introduced. Moreover, by using small buffers and simple sorting mechanisms at destination, we show that high network throughput can be obtained at the same time packets are delivered in order. Thus, the paper demonstrates that it is possible to use adaptive routing, while still guaranteeing in-order packet delivery, without using large buffer resources nor degrading significantly its performance.",
	booktitle = "Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International",
	doi = "10.1109/IPDPS.2005.255",
	keywords = "PC clusters; adaptive routing; deterministic routing; interconnection networks; out-of-order packet delivery; sorting mechanisms; switch-based network technologies; multiprocessor interconnection networks; network routing; packet switching; sorting; work",
	month = "04-08",
	pages = "101 - 101",
	title = "{I}n-{O}rder {P}acket {D}elivery in {I}nterconnection {N}etworks using {A}daptive {R}outing",
	year = 2005
}

Pedro Morillo, Juan M Orduna, Marcos Fernandez and Jose Duato. Improving the performance of distributed virtual environment systems. IEEE Transactions on Parallel and Distributed Systems 16(7):637 - 649, 2005. URL BibTeX

@article{2005329281291,
author = "Pedro Morillo and Juan M. Orduna and Marcos Fernandez and Duato, Jose",
abstract = "The last years have witnessed a dramatic growth in the number as well as in the variety of distributed virtual environment systems. These systems allow multiple users, working on different client computers that are interconnected through different networks, to interact in a shared virtual world. One of the key issues in the design of scalable and cost-effective DVE systems is the partitioning problem. This problem consists of efficiently assigning the existing clients to the servers in the system and some techniques have been already proposed for solving it. This paper experimentally analyzes the correlation of the quality function proposed in the literature for solving the partitioning problem with the performance of DVE systems. Since the results show an absence of correlation, we also propose the experimental characterization of DVE systems. The results show that the reason for that absence of correlation is the nonlinear behavior of DVE systems with regard to the number of clients in the system. DVE systems reach saturation when any of the servers reaches 100 percent of CPU utilization. The system performance greatly decreases if this limit is exceeded in any server. Also, as a direct application of these results, we present a partitioning method that is targeted to keep all the servers in the system below a certain threshold value of CPU utilization, regardless of the amount of network traffic. Evaluation results show that the proposed partitioning method can improve DVE system performance, regardless of both the movement pattern of clients and the initial distribution of clients in the virtual world. © 2005 IEEE.",
issn = 10459219,
journal = "IEEE Transactions on Parallel and Distributed Systems",
key = "Distributed computer systems",
keywords = "Computer simulation;Correlation methods;Evaluation;Performance;Servers;Telecommunication traffic;Virtual reality;",
note = "Distributed applications;Distributed network graphics;",
number = 7,
pages = "637 - 649",
title = "{I}mproving the performance of distributed virtual environment systems",
url = "http://dx.doi.org/10.1109/TPDS.2005.83",
volume = 16,
year = 2005
}

P J Garcia, Jose Flich, Jose Duato, I Johnson, F J Quiles and F Naven. Dynamic evolution of congestion trees: Analysis and impact on switch architecture. 2005, 266 - 285. BibTeX

@conference{2006229908739,
	author = "P.J. Garcia and Flich, Jose and Duato, Jose and I. Johnson and F.J. Quiles and F. Naven",
	abstract = "Designers of large parallel computers and clusters are becoming increasingly concerned with the cost and power consumption of the interconnection network. A simple way to reduce them consists of reducing the number of network components and increasing their utilization. However, doing so without a suitable congestion management mechanism may lead to dramatic throughput degradation when the network enters saturation. Congestion management strategies for lossy networks (computer networks) are well known, but relatively little effort has been devoted to congestion management in lossless networks (parallel computers, clusters, and on-chip networks). Additionally, congestion is much more difficult to solve in this context due to the formation of congestion trees. In this paper we study the dynamic evolution of congestion trees. We show that, contrary to the common belief, trees do not only grow from the root toward the leaves. There exist cases where trees grow from the leaves to the root, cases where several congestion trees grow independently and later merge, and even cases where some congestion trees completely overlap while being independent. This complex evolution and its implications on switch architecture are analyzed, proposing enhancements to a recently proposed congestion management mechanism and showing the impact on performance of different design decisions. © Springer-Verlag Berlin Heidelberg 2005.",
	address = "Barcelona, Spain",
	issn = 03029743,
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Trees (mathematics)",
	keywords = "Computer networks;Congestion control (communication);Interconnection networks;Network components;Switching theory;Throughput;",
	note = "Congestion management;Congestion trees;Lossless networks;Throughput degradation;",
	pages = "266 - 285",
	title = "{D}ynamic evolution of congestion trees: {A}nalysis and impact on switch architecture",
	volume = "3793 LNCS",
	year = 2005
}

M Gusat, D Craddock, W Denzel, T Engbersen, N Ni, G Pfister, W Rooney and Jose Duato. Congestion control in InfiniBand networks. 2005, 158 - 9. BibTeX

@conference{8703878,
	author = "M. Gusat and D. Craddock and W. Denzel and T. Engbersen and N. Ni and G. Pfister and W. Rooney and Duato, Jose",
	abstract = {Driving computer interconnection networks closer to saturation minimizes cost/performance and power consumption, but requires efficient congestion control to prevent catastrophic performance degradation during traffic peaks or "hot spot" traffic patterns. The InfiniBand™ Architecture provides such congestion control, but lacks guidance for setting its parameters. At its adoption, it was unproven that there were any settings that would work at all, avoid instability or oscillations. This paper reports on a simulation-driven exploration of that parameter space which verifies that the architected scheme can, in fact, work properly despite inherent delays in its feedback mechanism},
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 13th Symposium on High Performance Interconnects",
	keywords = "multistage interconnection networks;telecommunication congestion control;telecommunication traffic;",
	note = "congestion control;InfiniBand network;computer interconnection network;power consumption;catastrophic performance degradation;traffic pattern;feedback mechanism;InfiniBand™ architecture;simulation-driven exploration;",
	pages = "158 - 9",
	title = "{C}ongestion control in {I}nfini{B}and networks",
	year = 2005
}

M Gusat, D Craddock, W Denzel, T Engbersen, N Ni, G Pfister, W Rooney and Jose Duato. Congestion control in InfiniBand networks. 2005, 158 - 159. URL BibTeX

@conference{20064710260609,
	author = "M. Gusat and D. Craddock and W. Denzel and T. Engbersen and N. Ni and G. Pfister and W. Rooney and Duato, Jose",
	abstract = {Driving computer interconnection networks closer to saturation minimizes cost/performance and power consumption, but requires efficient congestion control to prevent catastrophic performance degradation during traffic peaks or "hot spot" traffic patterns. The InfiniBand™ Architecture provides such congestion control, but lacks guidance for setting its parameters. At its adoption, it was unproven that there were any settings that would work at all, avoid instability or oscillations. This paper reports on a simulation-driven exploration of that parameter space which verifies that the architected scheme can, in fact, work properly despite inherent delays in its feedback mechanism. © 2005 IEEE.},
	address = "Stanford, CA, United States",
	issn = 15504794,
	journal = "Proceedings - Symposium on the High Performance Interconnects, Hot Interconnects",
	key = "Computer networks",
	keywords = "Computer simulation;Congestion control (communication);Energy utilization;Interconnection networks;Oscillations;Telecommunication traffic;",
	note = "Feedback mechanism;InfiniBand networks;Performance degradation;Traffic peaks;",
	pages = "158 - 159",
	title = "{C}ongestion control in {I}nfini{B}and networks",
	url = "http://dx.doi.org/10.1109/CONECT.2005.14",
	volume = 2005,
	year = 2005
}

Manuel E Acacio, Jose Gonzalez, Jose M Garcia and Jose Duato. A two-level directory architecture for highly scalable cc-NUMA multiprocessors. IEEE Transactions on Parallel and Distributed Systems 16(1):67 - 79, 2005. URL BibTeX

@article{2005078834390,
	author = "Manuel E. Acacio and Jose Gonzalez and Jose M. Garcia and Duato, Jose",
	abstract = "One important issue the designer of a scalable shared-memory multiprocessor must deal with is the amount of extra memory required to store the directory information. It is desirable that the directory memory overhead be kept as low as possible, and that it scales very slowly with the size of the machine. Unfortunately, current directory architectures provide scalability at the expense of performance. This work presents a scalable directory architecture that significantly reduces the size of the directory for large-scale configurations of a multiprocessor without degrading performance. First, we propose multilayer clustering as an effective approach to reduce the width of directory entries. Based on this concept, we derive three new compressed sharing codes, some of them with a space complexity of O(log2(log2(N))) for an N-node system. Then, we present a novel two-level directory architecture to eliminate the penalty caused by compressed directories in general. The proposed organization consists of a small full-map first-level directory (which provides precise information for the most recently referenced lines) and a compressed second-level directory (which provides in-excess information for all the lines). The proposals are evaluated based on extensive execution-driven simulations (using RSIM) of a 64-node cc-NUMA multiprocessor. Results demonstrate that a system with a two-level directory architecture achieves the same performance as a multiprocessor with a big and nonscalable full-map directory, with a very significant reduction of the memory overhead. © 2005 IEEE.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Multiprocessing systems",
	keywords = "Cache memory;Computer architecture;Computer simulation;Network protocols;Optimization;",
	note = "Compressed sharing codes;Directory memory overhead;Shared memory multiprocessor;Two level directory architecture;Unnecessary coherence messages;",
	number = 1,
	pages = "67 - 79",
	title = "{A} two-level directory architecture for highly scalable cc-{NUMA} multiprocessors",
	url = "http://dx.doi.org/10.1109/TPDS.2005.4",
	volume = 16,
	year = 2005
}

S Rueda, P Morillo, J M Orduna and Jose Duato. A sexual elitist genetic algorithm for providing QoS in distributed virtual environment systems. 2005, 8 pp. -. BibTeX

@conference{8548873,
	author = "S. Rueda and P. Morillo and J.M. Orduna and Duato, Jose",
	abstract = "Architectures based on networked servers have become a de-facto standard for distributed virtual environment (DVE) systems. These systems allow a large number of remote users to share a single 3D virtual scene. In order to provide quality of service in a DVE system, clients should be assigned to servers taking into account system throughput and system latency. This highly complex problem is known as the quality of service (QoS) problem. This paper proposes an elitist sexual genetic algorithm for solving the QoS problem in distributed virtual environment systems. Performance evaluation results show that, due to its ability of both finding good search paths and keeping diversity escaping from local minima, this nature inspired technique can provide significantly better solutions than other heuristic methods with shorter execution times. Therefore, the proposed implementation of GA search method can improve the QoS offered by DVE systems",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 19th IEEE International Parallel and Distributed Processing Symposium",
	keywords = "client-server systems;genetic algorithms;quality of service;search problems;virtual reality;",
	note = "sexual elitist genetic algorithm;QoS;quality of service;distributed virtual environment system;networked server;client-server system;performance evaluation;search path;",
	pages = "8 pp. -",
	title = "{A} sexual elitist genetic algorithm for providing {Q}o{S} in distributed virtual environment systems",
	year = 2005
}

S Rueda, P Morillo, J M Orduna and Jose Duato. A sexual elitist genetic algorithm for providing QoS in distributed virtual environment systems. 2005, IEEE Computer Societ. URL BibTeX

@conference{20063010031238,
	author = "S. Rueda and P. Morillo and J.M. Orduna and Duato, Jose",
	abstract = "Architectures based on networked servers have become a de-facto standard for Distributed Virtual Environment (DVE) systems. These systems allow a large number of remote users to share a single 3D virtual scene. In order to provide quality of service in a DVE system, clients should be assigned to servers taking into account system throughput and system latency. This highly complex problem is known as the quality of service (QoS) problem. This paper proposes an elitist sexual genetic algorithm for solving the QoS problem in Distributed Virtual Environment systems. Performance evaluation results show that, due to its ability of both finding good search paths and keeping diversity escaping from local minima, this nature inspired technique can provide significantly better solutions than other heuristic methods with shorter execution times. Therefore, the proposed implementation of GA search method can improve the QoS offered by DVE systems.",
	address = "Denver, CO, United States",
	journal = "Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005",
	key = "Distributed computer systems",
	keywords = "Genetic algorithms;Problem solving;Quality of service;Servers;Virtual reality;",
	note = "Distributed Virtual Environment (DVE) systems;Execution times;System latency;Virtual scene;",
	pages = "IEEE Computer Societ",
	title = "{A} sexual elitist genetic algorithm for providing {Q}o{S} in distributed virtual environment systems",
	url = "http://dx.doi.org/10.1109/IPDPS.2005.67",
	volume = 2005,
	year = 2005
}

Jose Duato, O Lysne, R Pang and T M Pinkston. {A. IEEE Transactions on Parallel and Distributed Systems (5):412 - 27. BibTeX

@article{8399916,
	author = "Duato, Jose and O. Lysne and R. Pang and T.M. Pinkston",
	abstract = "This paper develops theoretical support useful for determining deadlock properties of dynamic network reconfiguration techniques and also serves as a basis for the development of design methodologies useful for deriving deadlock-free reconfiguration techniques. It is applicable to interconnection networks typically used in multiprocessor servers, network-based computing clusters, and distributed storage systems, and also has potential application to system-on-chip networks. This theory builds on basic principles established by previous theories while pioneering new concepts fundamental to the case of dynamic network reconfiguration",
	address = "USA",
	issn = "1045-9219",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	keywords = "multiprocessing systems;multiprocessor interconnection networks;reconfigurable architectures;reliability;system recovery;",
	note = "deadlock-free dynamic network reconfiguration;interconnection network;multiprocessor server;network-based computing cluster;distributed storage system;system-on-chip network;system reliability;system availability;",
	number = 5,
	pages = "412 - 27",
	title = "{A}"
}

O Lysne, T M Pinkston and Jose Duato. A methodology for developing deadlock-free dynamic network reconfiguration processes. Part II. IEEE Transactions on Parallel and Distributed Systems (5):428 - 43. BibTeX

@article{8399917,
	author = "O. Lysne and T.M. Pinkston and Duato, Jose",
	abstract = "For pt.I see ibid., vol.16, no.5, p.412-427 (2005). Dynamic network reconfiguration is defined as the process of changing from one routing function to another while the network remains up and running. The main challenge is in avoiding deadlock anomalies while keeping restrictions on packet injection and forwarding minimal. Current approaches either require virtual channels in the network or they work only for a limited set of routing algorithms and/or fault patterns. In this paper, we present a methodology for devising deadlock free and dynamic transitions between old and new routing functions that is consistent with newly proposed theory [J. Duato et al., (2005)]. The methodology is independent of topology, can be applied to any deadlock-free routing function, and puts no restrictions on the routing function changes that can be supported. Furthermore, it does not require any virtual channels to guarantee deadlock freedom. This research is motivated by current trends toward using increasingly larger Internet and transaction processing servers based on clusters of PCs that have very high availability and dependability requirements, as well as other local, system, and storage area network-based computing systems",
	address = "USA",
	issn = "1045-9219",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	keywords = "Internet;network routing;network servers;reconfigurable architectures;storage area networks;system recovery;transaction processing;workstation clusters;",
	note = "dynamic network reconfiguration;routing algorithm;fault pattern;deadlock free dynamic transition;deadlock-free routing function;virtual channel;Internet;transaction processing server;clusters PC;storage area network-based computing system;",
	number = 5,
	pages = "428 - 43",
	title = "{A} methodology for developing deadlock-free dynamic network reconfiguration processes. {P}art {II}"
}

P Morillo, J M Orduna, M Fernandez and Jose Duato. A method for providing QoS in distributed virtual environments. 2005, 152 - 9. BibTeX

@conference{8486327,
	author = "P. Morillo and J.M. Orduna and M. Fernandez and Duato, Jose",
	abstract = "One of the key issues in the design of scalable and cost-effective distributed virtual environment systems is the partitioning problem. It consists of efficiently assigning clients (3D avatars) to the servers in the system, and some proposed methods allow system throughput to be significantly increased. However, these methods are not focused on satisfying any specific time constraint. In this paper, we show that the problem of providing quality of service in distributed virtual environment systems can be addressed by means of the partitioning method. Additionally, we propose a partitioning method that not only provides a high system throughput, but also satisfies (if possible) any time constraint that avatars can require. This method is based on a heuristic search technique that looks for the best trade-off between system latency, system throughput and partitioning efficiency. The evaluation results show that this partitioning method greatly increases the number of avatars provided with quality of service while also providing the highest possible system throughput",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 13th Euromicro Conference on Parallel, Distributed and Network-based Processing",
	keywords = "client-server systems;quality of service;search problems;virtual reality;",
	note = "distributed virtual environments;quality of service;partitioning method;heuristic search technique;",
	pages = "152 - 9",
	title = "{A} method for providing {Q}o{S} in distributed virtual environments",
	year = 2005
}

Maria E Gomez, Pedro Lopez and Jose Duato. A memory-effective routing strategy for regular interconnection networks. 2005, 41 -. BibTeX

@conference{2005509538034,
	author = "Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Massively parallel computing systems are being built with thousands of nodes. In such systems, high-performance interconnection networks are crucial to achieve the maximum performance. Routing is one of the most important design issues of interconnection networks. Routing strategies can be mainly classified as source and distributed routing. Source routing has been used in some networks because routers are very simple. On the other hand, distributed routing allows more flexibility, but the routers are more complex. Distributed routing can be implemented by a fixed hardware specific to a routing function on a given topology, or by using forwarding tables that are very flexible but suffer from a lack of scalability. In this paper, we propose a distributed routing strategy for commercial switches, Flexible Interval Routing, that is scalable for the most widely used regular topologies (tori and meshes) because it is not based on tables. At the same time, the strategy is easy to reconfigure to deal with changes in the topology or in the routing algorithm for a given topology, being able to implement the most commonly-used routing algorithms in regular topologies.",
	address = "Denver, CO, United States",
	journal = "Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium",
	key = "Interconnection networks",
	keywords = "Algorithms;Computer hardware;Data storage equipment;Parallel processing systems;Routers;Switches;Topology;",
	note = "Distributed routing;Routing algorithms;Routing strategies;Source routing;",
	pages = "41 -",
	title = "{A} memory-effective routing strategy for regular interconnection networks",
	year = 2005
}

Maria E Gomez, Pedro Lopez and Jose Duato. A memory-effective fault-tolerant routing strategy for direct interconnection networks. 2005, 341 - 8. BibTeX

@conference{8762349,
	author = "Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "High-performance interconnection networks are crucial in massively parallel computers. Routing is one of the most important design issues of interconnection networks. Moreover, the huge amount of hardware of these machines makes fault-tolerance another important design issue. In this paper, we propose a mechanism that combines scalable routing and fault-tolerance for commercial switches to build direct regular topologies, which are the topologies used in large machines. The hardware required is not complex. Furthermore, it allows a high degree of fault-tolerance inflicting a minimal decrease of performance",
	address = "Los Alamitos, CA, USA",
	journal = "ISPDC 2005. The 4th International Workshop on Parallel and Distributed Computing",
	keywords = "fault tolerance;multiprocessor interconnection networks;telecommunication network reliability;telecommunication network routing;",
	note = "memory-effective fault-tolerant routing;direct interconnection networks;distributed routing;adaptive routing;",
	pages = "341 - 8",
	title = "{A} memory-effective fault-tolerant routing strategy for direct interconnection networks",
	year = 2005
}

Maria E Gomez, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, N A Nordbotten, O Lysne and T Skeie. An effective fault-tolerant routing methodology for direct networks. In Parallel Processing, 2004. ICPP 2004. International Conference on. 2004, 222 - 231 vol.1. URL, DOI BibTeX

@conference{1327925,
author = "Gomez, Maria E. and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and N.A. Nordbotten and O. Lysne and T. Skeie",
abstract = "Current massively parallel computing systems are being built with thousands of nodes, which significantly affect the probability of failure. M. E. Gomez proposed a methodology to design fault-tolerant routing algorithms for direct interconnection networks. The methodology uses a simple mechanism: for some source-destination pairs, packets are first forwarded to an intermediate node, and later, from this node to the destination node. Minimal adaptive routing is used along both subpaths. For those cases where the methodology cannot find a suitable intermediate node, it combines the use of intermediate nodes with two additional mechanisms: disabling adaptive routing and using misrouting on a per-packet basis. While the combination of these three mechanisms tolerates a large number of faults, each one requires adding some hardware support in the network and also introduces some overhead. In this paper, we perform an in-depth detailed analysis of the impact of these mechanisms on network behaviour. We analyze the impact of the three mechanisms separately and combined. The ultimate goal of this paper is to obtain a suitable combination of mechanisms that is able to meet the trade-off between fault-tolerance degree, routing complexity, and performance.",
booktitle = "Parallel Processing, 2004. ICPP 2004. International Conference on",
doi = "10.1109/ICPP.2004.1327925",
issn = "0190-3918",
keywords = "direct networks; fault-tolerant routing algorithm; in-depth detailed analysis; interconnection networks; minimal adaptive routing; parallel computing system; communication complexity; fault tolerant computing; multiprocessor interconnection networks; par",
month = "aug.",
pages = "222 - 231 vol.1",
title = "{A}n effective fault-tolerant routing methodology for direct networks",
url = "http://dx.doi.org/10.1109/ICPP.2004.1327925",
year = 2004
}

JC Sancho, Antonio Robles and Jose Duato. An effective methodology to improve the performance of the Up*/down* routing algorithm. IEEE Transactions on Parallel and Distributed Systems 15(8):740-754, August 2004.

@article{ISI:000222073200006,
	author = "JC Sancho and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are being considered as a cost-effective alternative to parallel computers. Most NOWs are arranged as a switch-based network and provide mechanisms for discovering the network topology. Hence, they provide support for both regular and irregular topologies, which makes routing and deadlock avoidance quite complicated. Current proposals use the Up{*}/down{*} routing algorithm to remove cyclic dependencies between channels and avoid deadlock. However, routing is considerably restricted and most messages must follow nonminimal paths, increasing latency and wasting resources. In this work, we propose and evaluate a simple and effective methodology to compute Up{*}/down{*} routing tables. The new methodology is based on computing a depth-first search (DFS) spanning tree on the network graph that decreases the number of routing restrictions with respect to the breadth-first search (BFS) spanning tree used by the traditional methodology. Additionally, we propose different heuristic rules for computing the spanning trees to improve the efficiency of Up{*}/down{*} routing. Evaluation results for several different topologies show that computing the Up{*}/down{*} routing tables by using the new methodology increases throughput by a factor of up to 2.48 in large networks with respect to the traditional methodology, and also reduces latency significantly.",
	issn = "1045-9219",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	month = "AUG",
	number = 8,
	pages = "740-754",
	title = "{A}n effective methodology to improve the performance of the {U}p{*}/down{*} routing algorithm",
	volume = 15,
	year = 2004
}

J M Montañana, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. A transition-based fault-tolerant routing methodology for InfiniBand networks. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. April 2004, 186.

@conference{1303198,
author = "Monta{\~n}ana, J. M. and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
abstract = "Summary form only given. Currently, clusters of PCs are considered a cost-effective alternative to large parallel computers. As the number of elements increases in these systems, the probability of faults increases dramatically. Therefore, it is critical to keep the system running even in the presence of faults. The interconnection network plays a key role in its performance. InfiniBand (IBA) is a new standard interconnect suitable for clusters. Most of the fault-tolerant routing strategies proposed for massively parallel computers cannot be applied to IBA because routing and virtual channel transitions are deterministic, which prevents packets from avoiding the faults. A possible approach to provide fault-tolerance in IBA consists of using several disjoint paths between every source-destination pair of nodes and selecting the appropriate path at the source host. However, to this end, a routing algorithm able to provide enough disjoint paths, while still guaranteeing deadlock freedom, is required. We propose a simple and effective fault-tolerant methodology for IBA networks that can be applied to any network topology and meets the trade-off between fault-tolerance degree and the number of network resources devoted to it. Preliminary results show that the proposed methodology scales well and supports up to three faults in 2D and five in 3D tori using only two virtual channels.",
booktitle = "Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International",
doi = "10.1109/IPDPS.2004.1303198",
isbn = "0-7695-2132-0",
keywords = "fault tolerant computing;multiprocessor interconnection networks;network topology;parallel machines;telecommunication network routing;workstation clusters;",
month = "april",
pages = 186,
title = "{A} transition-based fault-tolerant routing methodology for {I}nfini{B}and networks",
url = "http://dx.doi.org/10.1109/IPDPS.2004.1303198",
year = 2004
}

Jose Duato, Jose Flich and Teresa Nachiondo. A cost-effective technique to reduce HOL blocking in single-stage and multistage switch fabrics. In Parallel, Distributed and Network-Based Processing, 2004. Proceedings. 12th Euromicro Conference on. February 2004, 48 - 53.

@conference{1271426,
	author = "Duato, Jose and Flich, Jose and Nachiondo, Teresa",
	abstract = "Head-of-line (HOL) blocking is one of the main problems arising in input-buffered switches. The best-known solution to this problem consists of using virtual output queues (VOQs). However, this strategy is not scalable at all. Its implementation cost increases quadratically with the number of ports in the switch. Taking into account current trends, the demand for a larger number of ports in high-performance switches is likely to increase very rapidly in the near future. Therefore, a more scalable and cost-effective solution is required. We propose a very efficient and cost-effective technique, referred to as destination-based buffer management (DBBM), to reduce HOL blocking in single-stage and multistage switch fabrics. Results show that the use of the DBBM technique with a reduced number of queues at each IA is able to obtain roughly the same throughput as the VOQ mechanism. In particular, the number of queues can be reduced by a factor of up to 8 with the DBBM technique.",
	booktitle = "Parallel, Distributed and Network-Based Processing, 2004. Proceedings. 12th Euromicro Conference on",
	doi = "10.1109/EMPDP.2004.1271426",
	isbn = "0-7695-2083-9",
	issn = "1066-6192",
	keywords = "cost-effective technique; destination-based buffer management; head-of-line blocking; input-buffered switches; multistage switch fabrics; single-stage switch fabrics; virtual output queues; IP networks; buffer storage; packet switching; queueing theory;",
	month = "Feb",
	pages = "48 - 53",
	title = "{A} cost-effective technique to reduce {HOL} blocking in single-stage and multistage switch fabrics",
	url = "http://dx.doi.org/10.1109/EMPDP.2004.1271426",
	year = 2004
}

Maria E Gomez, Jose Duato, Jose Flich, Pedro Lopez, Antonio Robles, N A Nordbotten, O Lysne and T Skeie. An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori. Computer Architecture Letters 3(1):3 - 3, 2004.

@article{1650124,
	author = "Gomez, Maria E. and Duato, Jose and Flich, Jose and Lopez, Pedro and Robles, Antonio and N.A. Nordbotten and O. Lysne and T. Skeie",
	abstract = "In this paper we present a methodology to design fault-tolerant routing algorithms for regular direct interconnection networks. It supports fully adaptive routing, does not degrade performance in the absence of faults, and supports a reasonably large number of faults without significantly degrading performance. The methodology is mainly based on the selection of an intermediate node (if needed) for each source-destination pair. Packets are adaptively routed to the intermediate node and, at this node, without being ejected, they are adaptively forwarded to their destinations. In order to allow deadlock-free minimal adaptive routing, the methodology requires only one additional virtual channel (for a total of three), even for tori. Evaluation results for a 4 x 4 x 4 torus network show that the methodology is 5-fault tolerant. Indeed, for up to 14 link failures, the percentage of fault combinations supported is higher than 99.96%. Additionally, network throughput degrades by less than 10% when injecting three random link faults without disabling any node. In contrast, a mechanism similar to the one proposed in the BlueGene/L, that disables some network planes, would strongly degrade network throughput by 79%.",
	doi = "10.1109/L-CA.2004.1",
	issn = "1556-6056",
	journal = "Computer Architecture Letters",
	month = "january-december",
	number = 1,
	pages = "3 - 3",
	title = "{A}n {E}fficient {F}ault-{T}olerant {R}outing {M}ethodology for {M}eshes and {T}ori",
	url = "http://dx.doi.org/10.1109/L-CA.2004.1",
	volume = 3,
	year = 2004
}

F J Alfaro, J L Sanchez and Jose Duato. QoS in InfiniBand subnetworks. IEEE Transactions on Parallel and Distributed Systems 15(9):810 - 23, 2004.

@article{8094175,
	author = "F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "The InfiniBand architecture (IBA) has been proposed as an industry standard both for communication between processing nodes and I/O devices and for interprocessor communication. It replaces the traditional bus-based interconnect with a switch-based network for connecting processing nodes and I/O devices. It is being developed by the InfiniBand\textsuperscript{SM} Trade Association (IBTA) in the aim to provide the levels of reliability, availability, performance, scalability, and quality of service (QoS) required by present and future server systems. For this purpose, IBA provides a series of mechanisms that are able to guarantee QoS to the applications. In previous papers, we have proposed a strategy to compute the InfiniBand arbitration tables. In one of these, we presented and evaluated our proposal to treat traffic with bandwidth requirements. In another, we evaluated our strategy to compute the InfiniBand arbitration tables for traffic with delay requirements, which is a more complex task. In this paper, we evaluate both these proposals together. Furthermore, we also adapt these proposals in order to treat VBR traffic without QoS guarantees, but achieving very good results. Performance results show that, with a correct treatment of each traffic class in the arbitration of the output port, all traffic classes reach their QoS requirements",
	address = "USA",
	issn = "1045-9219",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	keywords = "bandwidth allocation;multiplexing;quality of service;queueing theory;telecommunication traffic;workstation clusters;",
	note = "InfiniBand architecture;I/O devices;interprocessor communication;bus-based interconnect;switch-based network;processing nodes;InfiniBand Trade Association;quality of service;QoS;InfiniBand arbitration tables;VBR traffic;InfiniBand subnetworks;",
	number = 9,
	pages = "810 - 23",
	title = "{Q}o{S} in {I}nfini{B}and subnetworks",
	volume = 15,
	year = 2004
}

Jose C Sancho, Antonio Robles and Jose Duato. An effective methodology to improve the performance of the up*/down* routing algorithm. IEEE Transactions on Parallel and Distributed Systems 15(8):740 - 754, 2004.

@article{2004368344586,
	author = "Jose C. Sancho and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are being considered as a cost-effective alternative to parallel computers. Most NOWs are arranged as a switch-based network and provide mechanisms for discovering the network topology. Hence, they provide support for both regular and irregular topologies, which makes routing and deadlock avoidance quite complicated. Current proposals use the Up*/down* routing algorithm to remove cyclic dependencies between channels and avoid deadlock. However, routing is considerably restricted and most messages must follow nonminimal paths, increasing latency and wasting resources. In this work, we propose and evaluate a simple and effective methodology to compute Up*/down* routing tables. The new methodology is based on computing a depth-first search (DFS) spanning tree on the network graph that decreases the number of routing restrictions with respect to the breadth-first search (BFS) spanning tree used by the traditional methodology. Additionally, we propose different heuristic rules for computing the spanning trees to improve the efficiency of Up*/down* routing. Evaluation results for several different topologies show that computing the Up*/down* routing tables by using the new methodology increases throughput by a factor of up to 2.48 in large networks with respect to the traditional methodology, and also reduces latency significantly. © 2004 IEEE.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Computer networks",
	keywords = "Algorithms;Computer simulation;Computer system recovery;Interconnection networks;Parallel processing systems;Trees;",
	note = "Deadlock avoidance;Irregular topologies;Routing algorithms;Spanning tree;",
	number = 8,
	pages = "740 - 754",
	title = "{A}n effective methodology to improve the performance of the up*/down* routing algorithm",
	url = "http://dx.doi.org/10.1109/TPDS.2004.28",
	volume = 15,
	year = 2004
}

J C Sancho, Antonio Robles and Jose Duato. An effective methodology to improve the performance of the up*/down* routing algorithm. IEEE Transactions on Parallel and Distributed Systems 15(8):740 - 54, 2004.

@article{8115437,
	author = "J.C. Sancho and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are being considered as a cost-effective alternative to parallel computers. Most NOWs are arranged as a switch-based network and provide mechanisms for discovering the network topology. Hence, they provide support for both regular and irregular topologies, which makes routing and deadlock avoidance quite complicated. Current proposals use the up*/down* routing algorithm to remove cyclic dependencies between channels and avoid deadlock. However, routing is considerably restricted and most messages must follow nonminimal paths, increasing latency and wasting resources. We propose and evaluate a simple and effective methodology to compute up*/down* routing tables. The new methodology is based on computing a depth-first search (DFS) spanning tree on the network graph that decreases the number of routing restrictions with respect to the breadth-first search (BFS) spanning tree used by the traditional methodology. Additionally, we propose different heuristic rules for computing the spanning trees to improve the efficiency of up*/down* routing. Evaluation results for several different topologies show that computing the up*/down* routing tables by using the new methodology increases throughput by a factor of up to 2.48 in large networks with respect to the traditional methodology, and also reduces latency significantly",
	address = "USA",
	issn = "1045-9219",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	keywords = "concurrency theory;network operating systems;network topology;telecommunication network routing;tree searching;workstation clusters;",
	note = "up*/down* routing algorithm;networks of workstations;depth-first search spanning tree;network graph;breadth-first search;irregular topologies;deadlock avoidance;",
	number = 8,
	pages = "740 - 54",
	title = "{A}n effective methodology to improve the performance of the up*/down* routing algorithm",
	url = "http://dx.doi.org/10.1109/TPDS.2004.28",
	volume = 15,
	year = 2004
}

Maria E Gomez, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, N A Nordbotten, O Lysne and T Skeie. An effective fault-tolerant routing methodology for direct networks. 2004, 222 - 31.

@conference{8279975,
author = "Gomez, Maria E. and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and N.A. Nordbotten and O. Lysne and T. Skeie",
abstract = "Current massively parallel computing systems are being built with thousands of nodes, which significantly affect the probability of failure. M. E. Gomez proposed a methodology to design fault-tolerant routing algorithms for direct interconnection networks. The methodology uses a simple mechanism: for some source-destination pairs, packets are first forwarded to an intermediate node, and later, from this node to the destination node. Minimal adaptive routing is used along both subpaths. For those cases where the methodology cannot find a suitable intermediate node, it combines the use of intermediate nodes with two additional mechanisms: disabling adaptive routing and using misrouting on a per-packet basis. While the combination of these three mechanisms tolerates a large number of faults, each one requires adding some hardware support in the network and also introduces some overhead. In this paper, we perform an in-depth detailed analysis of the impact of these mechanisms on network behaviour. We analyze the impact of the three mechanisms separately and combined. The ultimate goal of this paper is to obtain a suitable combination of mechanisms that is able to meet the trade-off between fault-tolerance degree, routing complexity, and performance",
address = "Los Alamitos, CA, USA",
journal = "2004 International Conference on Parallel Processing",
keywords = "communication complexity;fault tolerant computing;multiprocessor interconnection networks;parallel processing;",
note = "parallel computing system;fault-tolerant routing algorithm;interconnection networks;minimal adaptive routing;in-depth detailed analysis;direct networks;",
pages = "222 - 31",
title = "{A}n effective fault-tolerant routing methodology for direct networks",
volume = "vol.1",
year = 2004
}

Maria E Gomez, Jose Duato, Jose Flich, Pedro Lopez, Antonio Robles, N A Nordbotten, T Skeie and O Lysne. A new adaptive fault-tolerant routing methodology for direct networks. 2004, 462 - 73.

@conference{8426282,
	author = "Gomez, Maria E. and Duato, Jose and Flich, Jose and Lopez, Pedro and Robles, Antonio and N.A. Nordbotten and T. Skeie and O. Lysne",
	abstract = "Interconnection networks play a key role in the fault tolerance of massively parallel computers, since faults may isolate a large fraction of the machine containing many healthy nodes. In this paper, we present a methodology to design fully adaptive fault-tolerant routing algorithms for direct interconnection networks that can be applied to different regular topologies. The methodology is mainly based on the selection of an intermediate node (if needed) for each source-destination pair. Packets are adaptively routed to the intermediate node and, from this node, they are adaptively forwarded to their destination. This methodology requires only one additional virtual channel, even for tori. Evaluation results show that the methodology is 7-fault tolerant, and for up to 14 faults, more than 99% of the combinations are tolerated, also without significantly degrading performance in the presence of faults",
	address = "Berlin, Germany",
	journal = "High Performance Computing-HiPC 2004. 11th International Conference (Lecture notes in Computer Science Vol.3296)",
	keywords = "fault tolerant computing;multiprocessor interconnection networks;parallel processing;telecommunication network routing;telecommunication network topology;",
	note = "adaptive fault-tolerant routing;direct interconnection networks;massively parallel computers;",
	pages = "462 - 73",
	title = "{A} new adaptive fault-tolerant routing methodology for direct networks",
	year = 2004
}

Aurelio Bermudez, Rafael Casado, Francisco J Quiles and Jose Duato. Use of provisional routes to speed-up change assimilation in InfiniBand networks. 2004, 2621 - 2628.

@conference{2005058819970,
	author = "Aurelio Bermudez and Rafael Casado and Francisco J. Quiles and Duato, Jose",
	abstract = "The InfiniBand architecture has been proposed as a technology both for communication between processing nodes and I/O devices, and for interprocessor communication. The InfiniBand specification defines a basic management infrastructure that is responsible for subnet configuration, activation, and fault tolerance. Each time a topology change is detected, management entities collect the current subnet topology. After that, new forwarding tables have to be computed and uploaded to routing devices. The time required to compute these tables is a critical issue, due to application traffic being negatively affected by the temporary lack of connectivity. In this paper we present a way to compute a valid set of subnet routes in a short period of time. These provisional routes can be immediately distributed to routing devices. After that, final routes can be later uploaded without affecting user traffic.",
	address = "Santa Fe, NM, United States",
	journal = "Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM)",
	key = "Computer networks",
	keywords = "Algorithms;Information theory;Interfaces (computer);Packet networks;Program processors;Topology;",
	note = "Host channel adapters (HCA);InfiniBand networks;Processor nodes;Subnets;",
	pages = "2621 - 2628",
	title = "{U}se of provisional routes to speed-up change assimilation in {I}nfini{B}and networks",
	volume = 18,
	year = 2004
}

A Bermudez, R Casado, F J Quiles and Jose Duato. Use of provisional routes to speed-up change assimilation in InfiniBand networks. 2004, 186 -.

@conference{8126616,
	author = "A. Bermudez and R. Casado and F.J. Quiles and Duato, Jose",
	abstract = "Summary form only given. The InfiniBand architecture has been proposed as a technology both for communication between processing nodes and I/O devices, and for interprocessor communication. The InfiniBand specification defines a basic management infrastructure that is responsible for subnet configuration, activation, and fault tolerance. Each time a topology change is detected, management entities collect the current subnet topology. After that, new forwarding tables have to be computed and uploaded to routing devices. The time required to compute these tables is a critical issue, due to application traffic being negatively affected by the temporary lack of connectivity. We present a way to compute a valid set of subnet routes in a short period of time. These provisional routes can be immediately distributed to routing devices. After that, final routes can be later uploaded without affecting user traffic",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 18th International Parallel and Distributed Processing Symposium",
	keywords = "fault tolerant computing;multiprocessing systems;network topology;telecommunication network routing;telecommunication traffic;",
	note = "provisional route;infiniband network;InfiniBand architecture;processing node;I/O device;interprocessor communication;subnet configuration;fault tolerance;routing device;application traffic;user traffic;",
	pages = "186 -",
	title = "{U}se of provisional routes to speed-up change assimilation in {I}nfini{B}and networks",
	url = "http://dx.doi.org/10.1109/IPDPS.2004.1303199",
	year = 2004
}

Jose Duato. Program chair's message. 2004, x - x.

@conference{2004228179040,
	author = "Duato, Jose",
	abstract = "No abstract available",
	address = "Madrid, Spain",
	issn = 15300897,
	journal = "IEEE High-Performance Computer Architecture Symposium Proceedings",
	pages = "x - x",
	title = "{P}rogram chair's message",
	volume = 10,
	year = 2004
}

Francisco J Alfaro, Jose L Sanchez and Jose Duato. QoS in InfiniBand subnetworks. IEEE Transactions on Parallel and Distributed Systems 15(9):810 - 823, 2004.

@article{2004408393368,
	author = "Francisco J. Alfaro and Jose L. Sanchez and Duato, Jose",
	abstract = "The InfiniBand Architecture (IBA) has been proposed as an industry standard both for communication between processing nodes and I/O devices and for interprocessor communication. It replaces the traditional bus-based interconnect with a switch-based network for connecting processing nodes and I/O devices. It is being developed by the InfiniBand\textsuperscript{SM} Trade Association (IBTA) in the aim to provide the levels of reliability, availability, performance, scalability, and quality of service (QoS) required by present and future server systems. For this purpose, IBA provides a series of mechanisms that are able to guarantee QoS to the applications. In previous papers, we have proposed a strategy to compute the InfiniBand arbitration tables. In one of these, we presented and evaluated our proposal to treat traffic with bandwidth requirements. In another, we evaluated our strategy to compute the InfiniBand arbitration tables for traffic with delay requirements, which is a more complex task. In this paper, we will evaluate both these proposals together. Furthermore, we will also adapt these proposals in order to treat VBR traffic without QoS guarantees, but achieving very good results. Performance results show that, with a correct treatment of each traffic class in the arbitration of the output port, all traffic classes reach their QoS requirements. © 2004 IEEE.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Interconnection networks",
	keywords = "Availability;Bandwidth;Computer architecture;Computer simulation;Mathematical models;Performance;Quality of service;Reliability;Servers;Telecommunication links;Telecommunication traffic;",
	note = "InfiniBand architecture;Interprocessor communication;Physical link;QoS requirements;",
	number = 9,
	pages = "810 - 823",
	title = "{Q}o{S} in {I}nfini{B}and subnetworks",
	url = "http://dx.doi.org/10.1109/TPDS.2004.46",
	volume = 15,
	year = 2004
}

Juan Manuel Orduna, Federico Silla and Jose Duato. On the development of a communication-aware task mapping technique. Journal of Systems Architecture 50(4):207 - 220, 2004.

@article{2004178128206,
	author = "Juan Manuel Orduna and Silla, Federico and Duato, Jose",
	abstract = "Clusters have become a very cost-effective platform for high-performance computing. In these systems, although currently existing networks actually provide enough bandwidth for the existing applications and workstations, the trend is towards the interconnection network becoming the system bottleneck. Therefore, in the future, scheduling strategies will have to take into account the communication requirements of the applications and the communication bandwidth that the network can offer. One of the key issues in these strategies is the task mapping technique used when the network becomes the system bottleneck. In this paper, we propose a communication-aware mapping technique that tries to match as well as possible the existing network resources to the communication requirements of the applications running on the system. Also, we evaluate the mapping technique using real MPI application traces with timestamps. Evaluation results show that the use of the proposed mapping technique better exploits the available network bandwidth, improving load balancing and increasing the throughput that can be delivered by the network. Therefore, the proposed technique can be used in the design of communication-aware scheduling strategies for those situations where the communication requirements lead the network bandwidth to become the system performance bottleneck. © 2003 Elsevier B.V. All rights reserved.",
	issn = 13837621,
	journal = "Journal of Systems Architecture",
	key = "Interconnection networks",
	keywords = "Bandwidth;Computational complexity;Computer systems;Cost effectiveness;Evaluation;Mapping;Problem solving;Program processors;Scheduling;",
	note = "Cluster computing;Task scheduling;",
	number = 4,
	pages = "207 - 220",
	title = "{O}n the development of a communication-aware task mapping technique",
	url = "http://dx.doi.org/10.1016/j.sysarc.2003.09.002",
	volume = 50,
	year = 2004
}

T Skeie, O Lysne, Jose Flich, Pedro Lopez, Antonio Robles and Jose Duato. LASH-TOR: a generic transition-oriented routing algorithm. In Parallel and Distributed Systems, 2004. ICPADS 2004. Proceedings. Tenth International Conference on. 2004, 595 - 604.

@conference{1316144,
	author = "T. Skeie and O. Lysne and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose",
	abstract = "Cluster networks are seen as the future access networks for multimedia streaming, e-commerce, network storage, etc. For these applications, performance and high availability are particularly crucial. Regular topologies are preferred when performance is the primary concern. However, due to spatial constraints or fault-related issues, the network structure may become irregular, which makes it more difficult to find deadlock-free minimal paths. Over the recent years, several solutions have been proposed. One of them is the LASH routing, which enables minimal routing by assigning paths to different virtual layers. In this paper, we propose an extension of LASH in order to reduce the number of required virtual layers by allowing transitions between virtual layers. Evaluation results show that the new routing scheme (LASH-TOR) is able to obtain full minimal routing with a reduced number of virtual channels. For torus and mesh networks, with only two virtual channels, LASH throughput is increased by an average factor of improvement of 3.30 for large networks. For regular networks with some unconnected (faulty) links, equal performance improvements are achieved. Even for highly irregular networks of size up to 128 switches the new routing scheme only needs three virtual channels for guaranteeing minimal routing. Besides, LASH-TOR performs well compared to dimension order routing for mesh and torus networks.",
	booktitle = "Parallel and Distributed Systems, 2004. ICPADS 2004. Proceedings. Tenth International Conference on",
	doi = "10.1109/ICPADS.2004.1316144",
	isbn = "0-7695-2152-5",
	issn = "1521-9097",
	keywords = "LASH routing; LASH-TOR; access networks; cluster networks; deadlock-free minimal paths; e-commerce; mesh network; multimedia streaming; network storage; network structure; spatial constraints; torus network; transition-oriented routing algorithm; virtual",
	month = "7-9",
	pages = "595 - 604",
	title = "{LASH}-{TOR}: a generic transition-oriented routing algorithm",
	url = "http://dx.doi.org/10.1109/ICPADS.2004.1316144",
	year = 2004
}

Bilal Zafar, Timothy M Pinkston, Aurelio Bermudez and Jose Duato. Deadlock-free dynamic reconfiguration over InfiniBand networks. 2004, 127 - 143.

@conference{2004398371068,
	author = "Bilal Zafar and Timothy M. Pinkston and Aurelio Bermudez and Duato, Jose",
	abstract = "InfiniBand Architecture (IBA) is a newly established general-purpose interconnect standard applicable to local area, system area and storage area networking and I/O. Networks based on this standard should be capable of tolerating topological changes due to resource failures, link/switch activations, and/or hot swapping of components. In order to maintain connectivity, the network's routing function may need to be reconfigured on each topological change. Although the architecture has various mechanisms useful for configuring the network, no strategy or procedure is specified for ensuring deadlock freedom during dynamic network reconfiguration. In this paper, a method for applying the Double Scheme over InfiniBand networks is proposed. The Double Scheme provides a systematic way of reconfiguring a network dynamically while ensuring freedom from deadlocks. We show how features and mechanisms available in IBA for other purposes can also be used to implement dynamic network reconfiguration based on the Double Scheme. We also propose new mechanisms that may be considered in future versions of the IBA specification for making dynamic reconfiguration and other subnet management operations more efficient.",
	issn = 10637192,
	journal = "Parallel Algorithms and Applications",
	key = "Computer networks",
	keywords = "Bandwidth;Costs;Input output programs;Interconnection networks;Optimization;Probability;Quality of service;Routers;Servers;",
	note = "Deadlock-free dynamic reconfiguration;Double scheme;InfiniBand architecture;Network management;",
	number = "2-3",
	pages = "127 - 143",
	title = "{D}eadlock-free dynamic reconfiguration over {I}nfini{B}and networks",
	url = "http://dx.doi.org/10.1080/10637190410001725463",
	volume = 19,
	year = 2004
}

Manuel E Acacio, Jose Gonzalez, Jose M Garcia and Jose Duato. An architecture for high-performance scalable shared-memory multiprocessors exploiting on-chip integration. IEEE Transactions on Parallel and Distributed Systems 15(8):755 - 768, 2004. URL BibTeX

@article{2004368344587,
	author = "Manuel E. Acacio and Jose Gonzalez and Jose M. Garcia and Duato, Jose",
	abstract = "Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware, and the network interface/router. In this paper, we exploit such integration scale, presenting a novel node architecture aimed at reducing the long L2 miss latencies and the memory overhead of using directories that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture and adds a small shared data cache to each of the nodes of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip in every node, which enhances performance by saving accesses to the slower main memory. Scalability is guaranteed by having the second and third-level directories out of the processor chip and using compressed data structures. A taxonomy of the L2 misses, according to the actions performed by the directory to satisfy them, is also presented. Using execution-driven simulations, we show that significant latency reductions can be obtained by using the proposed node architecture, which translates into reductions of more than 30 percent in application execution time in several cases. © 2004 IEEE.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Data storage equipment",
	keywords = "Cache memory;Computer architecture;Computer simulation;Computer systems;Interfaces;Microprocessor chips;Routers;",
	note = "Directory memory overhead;Multiprocessor system;Shared data cache;Shared memory multiprocessors;Three level directory;",
	number = 8,
	pages = "755 - 768",
	title = "{A}n architecture for high-performance scalable shared-memory multiprocessors exploiting on-chip integration",
	url = "http://dx.doi.org/10.1109/TPDS.2004.27",
	volume = 15,
	year = 2004
}

P J Garcia, F J Quiles, F J Alfaro, J L Sanchez and Jose Duato. An analysis of deadlock risk during centralized network mapping. 2004, 601 - 6. BibTeX

@conference{8081452,
	author = "P.J. Garcia and F.J. Quiles and F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = {Modern high-performance interconnection networks implement mechanisms for obtaining explicit topology information. One of these mechanisms is based on a host that explores the network in a centralized way, using "scouting" messages, for discovering the network topology. In this paper, we analyze the risk of deadlock due to such centralized mapping processes in a source-routing network, also evaluating the probability of deadlock configurations and the impact of those deadlocks on network performance},
	address = "Anaheim, CA, USA",
	journal = "IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2004",
	keywords = "concurrency control;multiprocessor interconnection networks;network topology;packet switching;performance evaluation;system recovery;",
	note = "interconnection networks;topology detection;mapping;deadlock;scouting messages;network topology;source-routing network;network performance;",
	pages = "601 - 6",
	title = "{A}n analysis of deadlock risk during centralized network mapping",
	year = 2004
}

N A Nordbotten, Maria E Gomez, Jose Flich, Pedro Lopez, Antonio Robles, T Skeie, O Lysne and Jose Duato. A fully adaptive fault-tolerant routing methodology based on intermediate nodes. 2004, 341 - 56. BibTeX

@conference{8322959,
	author = "N.A. Nordbotten and Gomez, Maria E. and Flich, Jose and Lopez, Pedro and Robles, Antonio and T. Skeie and O. Lysne and Duato, Jose",
	abstract = "Massively parallel computing systems are being built with thousands of nodes. Because of the high number of components, it is critical to keep these systems running even in the presence of failures. Interconnection networks play a key role in these systems, and this paper proposes a fault-tolerant routing methodology for use in such networks. The methodology supports any minimal routing function (including fully adaptive routing), does not degrade performance in the absence of faults, does not disable any healthy node, and is easy to implement both in meshes and tori. In order to avoid network failures, the methodology uses a simple mechanism: for some source-destination pairs, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network). The methodology is shown to tolerate a large number of faults (e.g., five/nine faults when using two/three intermediate nodes in a 3D torus). Furthermore, the methodology offers graceful performance degradation: in an 8 × 8 × 8 torus network with 14 faults the throughput is only decreased by 6.49%.",
	address = "Germany, Germany",
	journal = "Network and Parallel Computing. IFIP International Conference, NPC 2004. Proceedings (Lecture Notes in Computer Science Vol.3222)",
	keywords = "fault tolerant computing;multiprocessor interconnection networks;packet switching;parallel processing;telecommunication network routing;",
	note = "fully adaptive fault-tolerant routing;intermediate nodes;massively parallel computing systems;interconnection networks;minimal routing function;network failures;source-destination pairs;",
	pages = "341 - 56",
	title = "{A} fully adaptive fault-tolerant routing methodology based on intermediate nodes",
	year = 2004
}

P Morillo, J M Orduna, M Fernandez and Jose Duato. A fine-grain method for solving the partitioning problem in distributed virtual environment systems. 2004, 292 - 297. BibTeX

@conference{2005048802401,
	author = "P. Morillo and J.M. Orduna and M. Fernandez and Duato, Jose",
	abstract = "Distributed Virtual Environment (DVE) systems have experienced spectacular growth in recent years. The partitioning problem has been proven to be the most critical issue in designing scalable and efficient DVE systems. It consists of efficiently assigning clients (3-D avatars) to the servers in the system, and some methods have been proposed for solving it. However, only two of these methods take into account the non-linear behavior of DVE servers with the number of avatars attached to them. In this paper, we propose a fine-grain load balancing technique for solving the partitioning problem in DVE systems. Unlike a previously proposed technique, this proposal takes into account the estimated state of the target server before re-assigning avatars. The excess workload that causes the saturation of a given server is proportionally distributed among several servers, if necessary. This method avoids the cascading effect, and it allows system throughput to be increased with few re-assignments of avatars. Evaluation results show that the proposed method can improve DVE system performance, regardless of both the movement pattern and the initial distribution of avatars in the virtual world.",
	address = "Cambridge, MA, United States",
	issn = 10272658,
	journal = "Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems",
	key = "Distributed computer systems",
	keywords = "Cascade connections;Client server computer systems;Computer simulation;Internet;Network protocols;Problem solving;Virtual reality;",
	note = "Cascading effect;Distributed virtual environments (DVE);Inter-server communications;Load balancing;",
	pages = "292 - 297",
	title = "{A} fine-grain method for solving the partitioning problem in distributed virtual environment systems",
	volume = 16,
	year = 2004
}

P Morillo, J M Orduna, M Fernandez and Jose Duato. A comparison study of metaheuristic techniques for providing QoS to avatars in DVE systems. 2004, 661 - 70. BibTeX

@conference{8179516,
	author = "P. Morillo and J.M. Orduna and M. Fernandez and Duato, Jose",
	abstract = "Network-server architecture has become a de-facto standard for distributed virtual environment (DVE) systems. In these systems, a large set of remote users share a 3D virtual scene. In order to design scalable DVE systems, different approaches have been proposed to maintain the DVE system working under its saturation point, maximizing system throughput. Also, in order to provide quality of service to avatars in a DVE system, avatars should be assigned to servers taking into account, among other factors, system throughput and system latency. This highly complex problem is called the quality of service (QoS) problem in DVE systems. This paper proposes two different approaches for solving the QoS problem, based on modern heuristics (simulated annealing and GRASP). Performance evaluation results show that the proposed strategies are able not only to provide quality of service to avatars in a DVE system, but also to keep the system away from the saturation point.",
	address = "Berlin, Germany",
	journal = "Computational Science and its Applications - ICCSA 2004. International Conference. Proceedings (Lecture Notes in Comput. Sci. Vol.3044)",
	keywords = "client-server systems;quality of service;simulated annealing;virtual reality;",
	note = "QoS;avatars;network-server architecture;distributed virtual environment;simulated annealing;GRASP;quality of service;",
	pages = "661 - 70",
	title = "{A} comparison study of metaheuristic techniques for providing {Q}o{S} to avatars in {DVE} systems",
	volume = "Vol.2",
	year = 2004
}

Salvador Coll, Jose Duato, F Petrini and F J Mora. Scalable Hardware-Based Multicast Trees. In Supercomputing, 2003 ACM/IEEE Conference. 2003, 54 - 54. URL, DOI BibTeX

@conference{1592957,
author = "Coll, Salvador and Duato, Jose and F. Petrini and F.J. Mora",
abstract = "This paper presents an algorithm for implementing optimal hardware-based multicast trees on networks that provide hardware support for collective communication. Although the proposed methodology can be generalized to a wide class of networks, we apply our methodology to the Quadrics network, a state-of-the-art network that provides hardware-based multicast communication. The proposed mechanism is intended to improve the performance of the collective communication patterns on the network in those cases where the hardware support cannot be directly used, for instance, due to some faulty nodes. This scheme provides a significant reduction in multicast latencies compared to the original system primitives, which use multicast trees based on unicast communication. A backtracking algorithm to find the optimal solution to the problem is presented. In addition, a greedy algorithm is presented and shown to provide near-optimal solutions. Finally, our experimental results show the good performance and scalability of the proposed multicast tree in comparison to the traditional unicast-based multicast trees. Our multicast mechanism doubles barrier synchronization and broadcast performance when compared to the production-level MPI library.",
booktitle = "Supercomputing, 2003 ACM/IEEE Conference",
doi = "10.1109/SC.2003.10058",
isbn = "1-58113-695-1",
month = "nov.",
pages = "54 - 54",
publisher = "IEEE Computer Society",
title = "{S}calable {H}ardware-{B}ased {M}ulticast {T}rees",
url = "http://doi.ieeecomputersociety.org/10.1109/SC.2003.10058",
year = 2003
}

Jose Flich, Pedro Lopez, M P Malumbres, Jose Duato and T Rokicki. Applying in-transit buffers to boost the performance of networks with source routing. Computers, IEEE Transactions on 52(9):1134 - 1153, 2003. DOI BibTeX

@article{1228510,
author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose and T. Rokicki",
abstract = "In this paper, we analyze in depth the effect of using ITB in the network, showing that they not only serve for guaranteeing minimal routing, but also that they are a powerful mechanism able to balance network traffic and reduce network contention. To demonstrate these capabilities, we apply the ITB mechanism to improved routing schemes, such as DFS and smart-routing. These routing algorithms (without ITB) are able to improve the performance of up*/down* by 30 percent and 90 percent, respectively, for a 32-switch network. The evaluation results show that, when ITB are used together with these improved routing algorithms, network throughput achieved by DFS and smart-routing can still be improved by 56 percent and 23 percent, respectively. However, the time smart-routing requires to compute the routing tables grows rapidly with network size, making it impossible in practice to build networks with more than 32 switches. This high computational cost is mainly motivated by the need to obtain deadlock-free routing tables. However, when ITB are used, one can decouple the stages of computing routing tables and breaking cycles. Moreover, as stated above, ITB can be used to reduce network contention. In this way, in this paper, we also propose a completely new routing algorithm that tries to balance network traffic by using a simple strategy with low computation time. The proposed algorithm guarantees deadlock freedom and reduces network contention with the use of ITB. The evaluation results show that our algorithm obtains unprecedented throughputs in 32-switch networks, tripling the original up*/down* and almost doubling smart-routing.",
doi = "10.1109/TC.2003.1228510",
issn = "0018-9340",
journal = "Computers, IEEE Transactions on",
keywords = "32-switch network; DFS; ITB; NOW; breaking cycles; deadlock-free routing tables; in-transit buffers; minimal routing; network contention reduction; network performance; network throughput; network traffic balancing; networks of workstations; performance;",
month = "sept.",
number = 9,
pages = "1134 - 1153",
title = "{A}pplying in-transit buffers to boost the performance of networks with source routing",
volume = 52,
year = 2003
}

J M M Rubio, Pedro Lopez and Jose Duato. FC3D: flow control-based distributed deadlock detection mechanism for true fully adaptive routing in wormhole networks. Parallel and Distributed Systems, IEEE Transactions on 14(8):765 - 779, 2003. URL, DOI BibTeX

@article{1225056,
	author = "J.M.M. Rubio and Lopez, Pedro and Duato, Jose",
	abstract = "Two general approaches have been proposed for deadlock handling in wormhole networks. Traditionally, deadlock-avoidance strategies have been used. In this case, either routing is restricted so that there are no cyclic dependencies between channels or cyclic dependencies between channels are allowed provided that there are some escape paths to avoid deadlock. More recently, deadlock recovery strategies have begun to gain acceptance. These strategies allow the use of unrestricted fully adaptive routing, usually outperforming deadlock avoidance techniques. However, they require a deadlock detection mechanism and a deadlock recovery mechanism that is able to recover from deadlocks faster than they occur. In particular, progressive deadlock recovery techniques are very attractive because they allocate a few dedicated resources to quickly deliver deadlocked messages, instead of killing them. Unfortunately, distributed deadlock detection is usually based on crude time-outs, which detect many false deadlocks. As a consequence, messages detected as deadlocked may saturate the bandwidth offered by recovery resources, thus degrading performance. Additionally, the threshold required by the detection mechanism (the time-out) strongly depends on network load, which is not known in advance at the design stage. This limits the applicability of deadlock recovery on actual networks. We propose a novel distributed deadlock detection mechanism that uses only local information, detects all the deadlocks, considerably reduces the probability of false deadlock detection over previously proposed techniques, and is not significantly affected by variations in message length and/or message destination distribution.",
	doi = "10.1109/TPDS.2003.1225056",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "FC3D mechanism; crude time-out; deadlock detection mechanism; deadlock-avoidance strategy; deadlocked message; false deadlock detection probability; flow control-based distributed deadlock detection; message destination distribution; message length distr",
	month = "aug.",
	number = 8,
	pages = "765 - 779",
	title = "{FC}3{D}: flow control-based distributed deadlock detection mechanism for true fully adaptive routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/TPDS.2003.1225056",
	volume = 14,
	year = 2003
}

Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting fully adaptive routing in InfiniBand networks. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International. April 2003, 10 pp.. URL, DOI BibTeX

@conference{1213130,
author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed routing. However, routing in IBA is deterministic because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However, using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper, we propose a simple strategy to implement forwarding tables for IBA switches that support adaptive routing while still maintaining compatibility with the IBA specs. Adaptive routing can be enabled or disabled individually for each packet at the source node. Also, the proposed strategy enables the use in IBA of fully adaptive routing algorithms without using additional network resources to improve network performance. Evaluation results show that extending IBA switch capabilities with fully adaptive routing noticeably increases network performance. In particular, network throughput increases up to an average factor of 3.9.",
booktitle = "Parallel and Distributed Processing Symposium, 2003. Proceedings. International",
doi = "10.1109/IPDPS.2003.1213130",
issn = "1530-2075",
keywords = "InfiniBand networks; distributed routing; fully adaptive routing; interprocessor communication; network performance; network throughput; processing nodes; computer networks; multiprocessor interconnection networks; performance evaluation;",
month = "april",
pages = "10 pp.",
title = "{S}upporting fully adaptive routing in {I}nfini{B}and networks",
url = "http://dx.doi.org/10.1109/IPDPS.2003.1213130",
year = 2003
}

Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting adaptive routing in InfiniBand networks. In Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on. 2003, 165 - 172. URL, DOI BibTeX

@conference{1183583,
author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed deterministic routing because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper we propose a simple strategy to implement forwarding tables for IBA switches that supports adaptive routing while still maintaining compatibility with the IBA specifications. Adaptive routing can be individually enabled or disabled for each packet at the source node. The proposed strategy enables the use in IBA of any adaptive routing algorithm with an acyclic channel dependence graph. In this paper, we have taken advantage of the partial adaptivity provided by the well-known up*/down* routing algorithm. Evaluation results show that extending IBA switch capabilities with adaptive routing may noticeably increase network performance. In particular network throughput improvement can be, on average, as high as 46%.",
booktitle = "Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on",
doi = "10.1109/EMPDP.2003.1183583",
issn = "1066-6192",
keywords = "I-O devices; IBA switches; InfiniBand Architecture; InfiniBand networks; acyclic channel dependence graph; adaptive routing; deterministic routing; forwarding tables; interprocessor communication; network performance; network throughput; processing node",
month = "feb.",
pages = "165 - 172",
title = "{S}upporting adaptive routing in {I}nfini{B}and networks",
url = "http://dx.doi.org/10.1109/EMPDP.2003.1183583",
year = 2003
}

Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting fully adaptive routing in InfiniBand networks. 2003, 10 pp. URL BibTeX

@conference{7891311,
author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed routing. However, routing in IBA is deterministic because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However, using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper, we propose a simple strategy to implement forwarding tables for IBA switches that support adaptive routing while still maintaining compatibility with the IBA specs. Adaptive routing can be enabled or disabled individually for each packet at the source node. Also, the proposed strategy enables the use in IBA of fully adaptive routing algorithms without using additional network resources to improve network performance. Evaluation results show that extending IBA switch capabilities with fully adaptive routing noticeably increases network performance. In particular, network throughput increases up to an average factor of 3.9",
address = "Los Alamitos, CA, USA",
journal = "Proceedings International Parallel and Distributed Processing Symposium",
keywords = "computer networks;multiprocessor interconnection networks;performance evaluation;",
note = "fully adaptive routing;InfiniBand networks;processing nodes;interprocessor communication;distributed routing;network performance;network throughput;",
pages = "10 pp. -",
title = "{S}upporting fully adaptive routing in {I}nfini{B}and networks",
url = "http://dx.doi.org/10.1109/IPDPS.2003.1213130",
year = 2003
}

Maria E Gomez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. VOQSW: a methodology to reduce HOL blocking in InfiniBand networks. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International. 2003, 10 pp.. DOI BibTeX

@conference{1213134,
	author = "Gomez, Maria E. and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = "InfiniBand is a new switch-based standard interconnect for communication between processor nodes and I/O devices as well as for interprocessor communication. InfiniBand architecture allows switches to support up to 15 virtual lanes per port for data traffic. To route packets through a given virtual lane (VL), packets are labeled with a certain service level (SL) at injection time, and SLtoVL mapping tables are used at each switch to determine the VL to be used. Many previous works in the literature have shown that separate virtual lanes are able to reduce the influence of the well-known head-of-line (HOL) blocking effect on network performance. However, using virtual lanes to form separate virtual networks is not enough to eliminate the HOL blocking problem. Alternative solutions such as Virtual Output Queuing (VOQ) are able to eliminate it at the expense of modifying the switch buffer organization. In this paper, we propose an effective strategy to implement the VOQ scheme in IBA switches by using virtual lanes. This strategy does not require modifying the switch architecture; the SLtoVL tables simply must be properly filled. Evaluation results show that our proposed VOQ scheme is able to outperform the results obtained with the virtual network approach using the same number of resources. Moreover, the methodology proposed to implement the VOQ scheme in IBA only requires a small number of resources in order to significantly improve network throughput.",
	booktitle = "Parallel and Distributed Processing Symposium, 2003. Proceedings. International",
	doi = "10.1109/IPDPS.2003.1213134",
	keywords = "HOL blocking; InfiniBand networks; SL to VL mapping tables; head-of-line blocking effect; interprocessor communication; network performance; network throughput; switch buffer organization; switch-based standard interconnect; virtual lane; virtual output",
	month = "22-26",
	pages = "10 pp.",
	title = "{VOQSW}: a methodology to reduce {HOL} blocking in {I}nfini{B}and networks",
	year = 2003
}

JC Sancho, Antonio Robles, Pedro Lopez, Jose Flich and Jose Duato. Routing in InfiniBand (TM) torus network topologies. In P Sadayappan and CS Yang (eds.). 2003 INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, PROCEEDINGS. 2003, 509-518. BibTeX

@conference{ISI:000186828800056,
	author = "JC Sancho and Robles, Antonio and Lopez, Pedro and Flich, Jose and Duato, Jose",
	abstract = "InfiniBand is an interconnect standard for communication between processing nodes and I/O devices as well as for interprocessor communication (NOWs). The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links whose topology can be established by the customer. When performance is the primary concern, regular topologies are preferred. Low-dimensional tori (2D and 3D) are some of the regular topologies most widely used in commercial parallel computers. Routing in tori requires the use of virtual channels. Although InfiniBand provides support for deterministic routing and virtual channels, they are selected at each switch by service level (SL) identifiers associated with packets and do not depend on packet destination. This makes routing algorithm implementation more complex. In particular, a large number of SLs may be required, which is a scarce resource. In this paper we analyze the way several routing strategies can be applied in torus InfiniBand networks, also evaluating their resource requirements. In particular, we analyze and compare the well-known e-cube and up{*}/down{*} routing algorithms and the recently proposed Flexible routing algorithm.",
	booktitle = "2003 INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, PROCEEDINGS",
	editor = "Sadayappan, P and Yang, CS",
	isbn = 0769520170,
	note = "International Conference on Parallel Processing, KAOHSIUNG, TAIWAN, OCT 06-09, 2003",
	pages = "509-518",
	title = "{R}outing in {I}nfini{B}and ({TM}) torus network topologies",
	year = 2003
}

Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting adaptive routing in IBA switches. 2003, 441 - 456. URL BibTeX

@conference{2003487758791,
author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed deterministic routing because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However, using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper, we propose a simple strategy to implement forwarding tables for IBA switches that supports adaptive routing while still maintaining compatibility with the IBA specs. Adaptive routing can be individually enabled or disabled for each packet at the source node. The proposed strategy enables the use in IBA of any adaptive routing algorithm with an acyclic channel dependence graph. In this paper, we have taken advantage of the partial adaptivity provided by the well-known up*/down* routing algorithm. Evaluation results show that extending IBA switch capabilities with adaptive routing may noticeably increase network performance. In particular, network throughput improvement can be, on average, as high as 66%. © 2003 Elsevier B.V. All rights reserved.",
issn = 13837621,
journal = "Journal of Systems Architecture",
key = "Systems engineering",
keywords = "Algorithms;Communication;Information technology;Switches;Telecommunication networks;",
note = "Adaptive routing;",
number = "10-11",
pages = "441 - 456",
title = "{S}upporting adaptive routing in {IBA} switches",
url = "http://dx.doi.org/10.1016/S1383-7621(03)00103-6",
volume = 49,
year = 2003
}

Eun Jung Kim, Ki Hwan Yum, C R Das, M Yousif and Jose Duato. Performance enhancement techniques for InfiniBand (TM) Architecture. 2003, 253 - 62. URL BibTeX

@conference{7703806,
	author = "Eun Jung Kim and Ki Hwan Yum and C.R. Das and M. Yousif and Duato, Jose",
	abstract = "The InfiniBand (TM) Architecture (IBA) is envisioned to be the default communication fabric for future system area networks (SAN). However, the released IBA specification outlines only higher level functionalities, leaving it open for exploring various design alternatives. In this paper we investigate four correlated techniques to provide high and predictable performance in IBA. These are: (i) using the shortest path first (SPF) algorithm for deterministic packet routing; (ii) developing a multipath routing mechanism for minimizing congestion; (iii) developing a selective packet dropping scheme to handle deadlock and congestion; and (iv) providing multicasting support for customized applications. These designs are evaluated using an integrated workload on a versatile IBA simulation testbed. Simulation results indicate that the SPF routing, multipath routing, packet dropping, and multicasting schemes are quite effective in delivering high and assured performance in clusters. One of the major contributions of this research is the IBA simulation testbed, which is an essential tool to evaluate various design tradeoffs.",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings the Ninth International Symposium on High-Performance Computer Architecture. HPCA-9 2003",
	keywords = "concurrency control;deterministic algorithms;local area networks;multicast communication;packet switching;performance evaluation;telecommunication congestion control;telecommunication network routing;",
	note = "performance enhancement techniques;InfiniBand Architecture;system area networks;SAN;IBA specification;shortest path first algorithm;SPF algorithm;deterministic packet routing;multipath routing mechanism;congestion minimization;selective packet dropping;deadlock;multicasting support;customized applications;integrated workload;simulation testbed;clusters;design tradeoffs;",
	pages = "253 - 62",
	title = "{P}erformance enhancement techniques for {I}nfini{B}and{TM} {A}rchitecture",
	url = "http://dx.doi.org/10.1109/HPCA.2003.1183543",
	year = 2003
}

J C Sancho, Juan Carlos Martinez, Antonio Robles, Pedro Lopez, Jose Flich and Jose Duato. Performance evaluation of COWS under real parallel applications. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International. 2003, 10 pp.. DOI BibTeX

@conference{1213371,
	author = "J.C. Sancho and Martinez, Juan Carlos and Robles, Antonio and Lopez, Pedro and Flich, Jose and Duato, Jose",
	abstract = "Clusters of workstations (COWS) are often arranged as a switch-based network with irregular topology. Usually, the evaluation of interconnection networks for COWS has been carried out by simulation using synthetic traffic and by traces from real parallel applications. Although both types of traffic are used as a first approximation of the behavior of the system, a more accurate behavior can be obtained by using real parallel applications. In this paper, a new simulation framework has been developed in order to evaluate interconnection networks under real parallel applications by using an execution-driven simulator. Moreover, the new simulator can be used to evaluate the impact on the performance of the whole system of several design parameters in addition to the interconnection network. Evaluation results show that the execution time of real parallel applications can be reduced by using an effective routing algorithm. Moreover, in some cases, the achieved improvements are higher than the ones achieved by improving other design issues, such as the processor instruction issue rate, the cache size or the network bandwidth.",
	booktitle = "Parallel and Distributed Processing Symposium, 2003. Proceedings. International",
	doi = "10.1109/IPDPS.2003.1213371",
	issn = "1530-2075",
	keywords = "COWS; cache size; clusters of workstations; execution-driven simulator; interconnection networks; network bandwidth; performance evaluation; processor instruction issue rate; simulation framework; switch-based network; discrete event simulation; performa",
	month = "22-26",
	pages = "10 pp.",
	title = "{P}erformance evaluation of {COWS} under real parallel applications",
	year = 2003
}

F J Alfaro, J L Sanchez, Luis Orozco and Jose Duato. Providing QoS in InfiniBand for Regular and Irregular Topologies. 2003, 1079 - 1082. URL BibTeX

@conference{2003407661216,
	author = "F.J. Alfaro and J.L. Sanchez and Luis Orozco and Duato, Jose",
	abstract = "The InfiniBand Architecture (IBA) is becoming an industry standard for communication between processing nodes and I/O devices or for interprocessor communication. It is being developed by the InfiniBand^SM Trade Association (IBTA) to provide the levels of reliability, availability, performance, scalability, and quality of service (QoS) necessary for present and future server systems. In [1] we proposed a new strategy to address these issues. We have evaluated this new strategy only for irregular topology networks [4]. In this paper we evaluate our proposal for regular topologies (hypercube and mesh) and we compare the results obtained. In this way, we want to study the influence of the topology on the QoS mechanisms.",
	address = "Montreal, Canada",
	issn = 08407789,
	journal = "Canadian Conference on Electrical and Computer Engineering",
	key = "Communication systems",
	keywords = "Computer simulation;Quality of service;Telecommunication links;Topology;",
	note = "Interprocessor communication;",
	pages = "1079 - 1082",
	title = "{P}roviding {Q}o{S} in {I}nfini{B}and for {R}egular and {I}rregular {T}opologies",
	url = "http://dx.doi.org/10.1109/CCECE.2003.1226083",
	volume = 2,
	year = 2003
}

A Bermudez, R Casado, F J Quiles, T M Pinkston and Jose Duato. On the Infiniband subnet discovery process. 2003, 512 - 17. URL BibTeX

@conference{7962798,
	author = "A. Bermudez and R. Casado and F.J. Quiles and T.M. Pinkston and Duato, Jose",
	abstract = "InfiniBand is becoming an industry standard both for communication between processing nodes and I/O devices, and for interprocessor communication. Instead of using a shared bus, InfiniBand employs an arbitrary (possibly irregular) switched point-to-point network. InfiniBand specification defines a basic management infrastructure that is responsible for subnet configuration, activation, and fault tolerance. After the detection of a topology change, management entities collect the current subnet topology. The topology discovery algorithm is one of the management issues that are outside the scope of the current specification. Preliminary implementations obtain the entire topological information each time a change is detected. In this work, we present and analyze an optimized implementation, based on exploring only the region that has been affected by the change",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. IEEE International Conference on Cluster Computing",
	keywords = "communication complexity;computer communications software;computer network management;data communication;fault tolerant computing;local area networks;message passing;network operating systems;optimisation;telecommunication network routing;telecommunicatio",
	note = "Infiniband;subnet discovery process;processing nodes;I/O devices;interprocessor communication;shared bus;arbitrary switched point-to-point network;basic management infrastructure;subnet configuration;subnet activation;fault tolerance;subnet topology;topology discovery algorithm;",
	pages = "512 - 17",
	title = "{O}n the {I}nfiniband subnet discovery process",
	url = "http://dx.doi.org/10.1109/CLUSTR.2003.1253361",
	year = 2003
}

R Garcia, Jose Duato and Federico Silla. LSOM: A Link State protocol Over MAC addresses for metropolitan backbones using Optical Ethernet switches. 2003, 315 - 21. URL BibTeX

@conference{7659346,
	author = "R. Garcia and Duato, Jose and Silla, Federico",
	abstract = {This paper presents a new protocol named "Link State Over MAC" (LSOM) for Optical Ethernet switches to allow the use of active loop topologies, like meshes, in Metropolitan Area Networks (MAN) or even Wide Area Networks (WAN) backbone. In this respect, LSOM is an alternative to a ring topology as proposed in draft IEEE 802.17 Resilient Packet Ring (RPR) or a tree topology using IEEE 802.1D Rapid Spanning Tree Protocol (RSTP). LSOM provides higher scalability and is able to achieve better bandwidth utilization and lower latency than RSTP and RPR. Simulation results for 4-node and 9-node topologies show that LSOM can improve throughput over RPR by a factor of up to 1.7. Furthermore, full freedom to choose any MAN active topology allows an effective use of the available dark fiber resources},
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings Second IEEE International Symposium on Network Computing and Applications. NCA 2003",
	keywords = "metropolitan area networks;protocols;SONET;",
	note = "Metropolitan Area Networks;protocol;Link State Over MAC;LSOM;Optical Ethernet switches;active loop topologies;scalability;bandwidth utilization;latency;",
	pages = "315 - 21",
	title = "{LSOM}: {A} {L}ink {S}tate protocol {O}ver {MAC} addresses for metropolitan backbones using {O}ptical {E}thernet switches",
	url = "http://dx.doi.org/10.1109/NCA.2003.1201170",
	year = 2003
}

Juan-Miguel Martinez Rubio, Pedro Lopez and Jose Duato. FC3D: Flow control-based distributed deadlock detection mechanism for true fully adaptive routing in wormhole networks. IEEE Transactions on Parallel and Distributed Systems 14(8):765 - 779, 2003. URL BibTeX

@article{2003407655842,
	author = "Juan-Miguel Martinez Rubio and Lopez, Pedro and Duato, Jose",
	abstract = "Two general approaches have been proposed for deadlock handling in wormhole networks. Traditionally, deadlock avoidance strategies have been used. In this case, either routing is restricted so that there are no cyclic dependencies between channels or cyclic dependencies between channels are allowed provided that there are some escape paths to avoid deadlock. More recently, deadlock recovery strategies have begun to gain acceptance. These strategies allow the use of unrestricted fully adaptive routing, usually outperforming deadlock avoidance techniques. However, they require a deadlock detection mechanism and a deadlock recovery mechanism that is able to recover from deadlocks faster than they occur. In particular, progressive deadlock recovery techniques are very attractive because they allocate a few dedicated resources to quickly deliver deadlocked messages, instead of killing them. Unfortunately, distributed deadlock detection is usually based on crude time-outs, which detect many false deadlocks. As a consequence, messages detected as deadlocked may saturate the bandwidth offered by recovery resources, thus degrading performance. Additionally, the threshold required by the detection mechanism (the time-out) strongly depends on network load, which is not known in advance at the design stage. This limits the applicability of deadlock recovery on actual networks. In this paper, we propose a novel distributed deadlock detection mechanism that uses only local information, detects all the deadlocks, considerably reduces the probability of false deadlock detection over previously proposed techniques, and is not significantly affected by variations in message length and/or message destination distribution.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Distributed computer systems",
	keywords = "Adaptive control systems;Command and control systems;Congestion control (communication);Data communication systems;Probability distributions;Requirements engineering;Resource allocation;",
	note = "Adaptive routing;Deadlock recovery;Distributed deadlock detection;Wormhole networks;",
	number = 8,
	pages = "765 - 779",
	title = "{FC}3{D}: {F}low control-based distributed deadlock detection mechanism for true fully adaptive routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/TPDS.2003.1225056",
	volume = 14,
	year = 2003
}

A Bermudez, R Casado, F J Quiles, T M Pinkston and Jose Duato. Evaluation of a subnet management mechanism for InfiniBand networks. 2003, 117 - 24. BibTeX

@conference{8301788,
	author = "A. Bermudez and R. Casado and F.J. Quiles and T.M. Pinkston and Duato, Jose",
	abstract = "The InfiniBand architecture is a high-performance network technology for the interconnection of processor nodes and I/O devices using a point-to-point switch-based fabric. The InfiniBand specification defines a basic management infrastructure that is responsible for subnet configuration, activation, and fault tolerance. Subnet management entities and functions are described, but the specifications do not impose any particular implementation. We present and analyze a complete subnet management mechanism for this architecture. We allow to anticipate future directions to obtain efficient management protocols",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 2003 International Conference on Parallel Processing",
	keywords = "computer network management;fault tolerant computing;multiprocessor interconnection networks;parallel architectures;protocols;",
	note = "subnet management mechanism;InfiniBand networks;InfiniBand architecture;point-to-point switch-based fabric;management infrastructure;subnet configuration;fault tolerance;",
	pages = "117 - 24",
	title = "{E}valuation of a subnet management mechanism for {I}nfini{B}and networks",
	year = 2003
}

Timothy Mark Pinkston, Ruoming Pang and Jose Duato. Deadlock-free dynamic reconfiguration schemes for increased network dependability. IEEE Transactions on Parallel and Distributed Systems 14(8):780 - 794, 2003. URL BibTeX

@article{2003407655843,
	author = "Timothy Mark Pinkston and Ruoming Pang and Duato, Jose",
	abstract = "Network-based parallel computing systems often require the ability to reconfigure the routing algorithm to reflect changes in network topology if and when voluntary or involuntary changes occur. The process of reconfiguring a network's routing capabilities may be very inefficient and/or deadlock-prone if not handled properly. In this paper, we propose efficient and deadlock-free dynamic reconfiguration schemes that are applicable to routing algorithms and networks which use wormhole, virtual cut-through, or store-and-forward switching, combined with hard link-level flow control. One requirement is that the network architecture use virtual channels or duplicate physical channels for deadlock-handling as well as performance purposes. The proposed schemes do not impede the injection, transmission, or delivery of user packets during the reconfiguration process. Instead, they provide uninterrupted service, increased availability/reliability, and improved overall quality-of-service support as compared to traditional techniques based on static reconfiguration.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Parallel processing systems",
	keywords = "Algorithms;Communication channels (information theory);Congestion control (communication);Dynamic programming;Interconnection networks;Packet networks;Quality of service;Requirements engineering;Switching networks;Virtual reality;",
	note = "Deadlock free dynamic reconfiguration;Hard link level flow control;Routing algorithm;Virtual channels;",
	number = 8,
	pages = "780 - 794",
	title = "{D}eadlock-free dynamic reconfiguration schemes for increased network dependability",
	url = "http://dx.doi.org/10.1109/TPDS.2003.1225057",
	volume = 14,
	year = 2003
}

P Morillo, J M Orduna, M Fernandez and Jose Duato. An adaptive load balancing technique for distributed virtual environment systems. 2003, 256 - 61. BibTeX

@conference{8116366,
	author = "P. Morillo and J.M. Orduna and M. Fernandez and Duato, Jose",
	abstract = "One of the key issues in the design of scalable and cost-effective distributed virtual environment (DVE) systems is the partitioning problem. This problem consists of efficiently assigning clients (3D avatars) to the servers in the system, and some methods have been already proposed for solving it. However, only one of these methods takes into account the nonlinear behavior of DVE servers with the number of avatars they support, and this method uses a load balancing technique of local scope. As a result, it only provides good performance if the movement pattern of avatars is uniform. In this paper, we propose an adaptive load balancing technique of global scope for solving the partitioning problem in DVE systems. The global scope of the proposed technique allows to avoid DVE saturation as long as possible. Evaluation results show that the proposed strategy can improve DVE system performance, regardless of both the movement patterns of avatars and also the initial distribution of avatars in the virtual world",
	address = "Anaheim, CA, USA",
	journal = "Proceedings of the Fifteenth IASTED International Conference on Parallel and Distributed Computing and Systems",
	keywords = "distributed processing;resource allocation;virtual reality;",
	note = "adaptive load balancing;distributed virtual environment;3D avatars;virtual world;dynamic partitioning;",
	pages = "256 - 61",
	title = "{A}n adaptive load balancing technique for distributed virtual environment systems",
	volume = "vol. 1",
	year = 2003
}

P Morillo, J M Orduna, M Fernandez and Jose Duato. An adaptive load balancing technique for distributed virtual environment systems. 2003, 256 - 261. BibTeX

@conference{2004138084519,
	author = "P. Morillo and J.M. Orduna and M. Fernandez and Duato, Jose",
	abstract = "One of the key issues in the design of scalable and cost-effective Distributed Virtual Environment (DVE) systems is the partitioning problem. This problem consists of efficiently assigning clients (3-D avatars) to the servers in the system, and some methods have been already proposed for solving it. However, only one of these methods takes into account the non-linear behavior of DVE servers with the number of avatars they support, and this method uses a load balancing technique of local scope. As a result, it only provides good performance if the movement pattern of avatars is uniform. In this paper, we propose an adaptive load balancing technique of global scope for solving the partitioning problem in DVE systems. The global scope of the proposed technique allows to avoid DVE saturation as long as possible. Evaluation results show that the proposed strategy can improve DVE system performance, regardless of both the movement patterns of avatars and also the initial distribution of avatars in the virtual world.",
	address = "Marina del Rey, CA, United States",
	journal = "Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems",
	key = "Client server computer systems",
	keywords = "Adaptive control systems;Bandwidth;Computer aided instruction;Computer supported cooperative work;Virtual reality;",
	note = "Distributed virtual environments;Dynamic partitioning;Load balancing;",
	number = 1,
	pages = "256 - 261",
	title = "{A}n adaptive load balancing technique for distributed virtual environment systems",
	volume = 15,
	year = 2003
}

B Caminero, C Carrion, F J Quiles, Jose Duato and S Yalamanchili. {A. 8 pp. -. BibTeX

@conference{7891389,
	author = "B. Caminero and C. Carrion and F.J. Quiles and Duato, Jose and S. Yalamanchili",
	abstract = "The primary objective of the MultiMedia Router (MMR) project is the design and implementation of a compact router optimized for multimedia applications. The router is targeted for use in cluster and LAN interconnection networks, which offer different constraints and therefore differing router solutions than WANs. One of the key elements within the router are the algorithms used to decide the forwarding order of the information that traverses it: the link and switch scheduling algorithms. They help greatly to determine the QoS guarantees delivered to the application flows. Also, conventional best-effort traffic should be seamlessly integrated by scheduling algorithms, in such a way that link bandwidth is efficiently used, but without degrading the QoS guarantees of the multimedia connections. In this paper, two solutions for switch scheduling are thoroughly evaluated with mixed workloads (i.e., composed of multimedia and best-effort traffic), and their performance is compared to another well-known approach for switch scheduling, that does not consider QoS requirements when performing scheduling decisions. Results show that, when a QoS-aware switch scheduler is used, the QoS received by the multimedia flows is not affected by the presence of best-effort traffic",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings International Parallel and Distributed Processing Symposium",
	keywords = "LAN interconnection;multimedia communication;performance evaluation;quality of service;telecommunication network routing;wide area networks;workstation clusters;",
	note = "hybrid traffic handling;clustered environments;multimedia router MMR;LAN interconnection networks;cluster networks;WANs;QoS guarantees;link bandwidth;",
	pages = "8 pp. -",
	title = "{A"
}

F J Alfaro, J L Sanchez and Jose Duato. A new proposal to fill in the InfiniBand arbitration tables. 2003, 133 - 40. BibTeX

@conference{8301790,
	author = "F.J. Alfaro and J.L. Sanchez and Duato, Jose",
	abstract = "The InfiniBand architecture (IBA) is a new industry-standard architecture for server I/O and interprocessor communication. InfiniBand is very likely to become the de facto standard in a few years. It is being developed by the InfiniBand^SM Trade Association (IBTA) to provide the levels of reliability, availability, performance, scalability, and quality of service (QoS) necessary for present and future server systems. We propose a simple and effective strategy for configuring the IBA networks to provide the required levels of QoS. This is a global frame that allows one to do a different treatment to each kind of traffic based on its QoS requirements. It is based on the correct configuration of the mechanisms IBA provides to support QoS. We also propose a simple algorithm to maximize the number of requests to be allocated in the arbitration table that the output ports have. This proposal is evaluated and the results show that every traffic class meets its QoS requirements",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 2003 International Conference on Parallel Processing",
	keywords = "multiprocessor interconnection networks;parallel architectures;performance evaluation;quality of service;telecommunication network reliability;telecommunication traffic;",
	note = "InfiniBand arbitration table;InfiniBand architecture;server I/O;interprocessor communication;reliability;performance evaluation;scalability;quality of service;QoS;telecommunication traffic;",
	pages = "133 - 40",
	title = "{A} new proposal to fill in the {I}nfini{B}and arbitration tables",
	year = 2003
}

O Lysne, T M Pinkston and Jose Duato. A methodology for developing dynamic network reconfiguration processes. 2003, 77 - 86. BibTeX

@conference{8301784,
	author = "O. Lysne and T.M. Pinkston and Duato, Jose",
	abstract = "Dynamic network reconfiguration is defined as the change from one routing function to another while the network is up and running. The main challenge is avoidance of deadlocks, while keeping restrictions on packet injection and forwarding minimal. Current approaches either require virtual channels in the network, or they work only for a limited set of routing algorithms. We present a methodology for devising deadlock free and dynamic transitions between an old and a new routing function. The methodology is independent of topology and puts no restrictions on either routing function. Furthermore, it does not require any virtual channels to guarantee deadlock freedom. This research is motivated by the current trend toward using increasingly larger Internet servers based on clusters of PCs and the very high availability requirements of those as well as other local, system, and storage area network-based systems",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 2003 International Conference on Parallel Processing",
	keywords = "concurrency control;Internet;multiprocessor interconnection networks;telecommunication network routing;",
	note = "dynamic network reconfiguration;deadlock avoidance;virtual channels;network routing;Internet server;storage area network-based system;local area network;system area network;interconnection network architecture;",
	pages = "77 - 86",
	title = "{A} methodology for developing dynamic network reconfiguration processes",
	year = 2003
}

Timothy Mark Pinkston, Bilal Zafar and Jose Duato. A method for applying double scheme dynamic reconfiguration over infiniBand. 2003, 793 - 800. BibTeX

@conference{2004148099398,
	author = "Timothy Mark Pinkston and Bilal Zafar and Duato, Jose",
	abstract = "InfiniBand Architecture is a newly established general-purpose interconnect standard applicable to local area, system area and storage area networking and I/O. Networks based on this standard should be capable of tolerating topological changes due to resource failures, link/switch activations, and/or hot swapping of components. In order to maintain connectivity, the network's routing function may need to be reconfigured on each topological change. Although the architecture has various mechanisms useful for configuring the network, no strategy or procedure is specified for ensuring deadlock freedom during dynamic network reconfiguration. In this paper, a method for applying the Double Scheme [1] over InfiniBand networks is proposed. The Double Scheme provides a systematic way of reconfiguring a network dynamically while ensuring freedom from deadlocks. We show how features and mechanisms available in InfiniBand Architecture for other purposes can also be used to implement dynamic network reconfiguration based on the Double Scheme. We also propose new mechanisms that may be considered in future versions of the spec for making dynamic reconfiguration and other subnet management operations more efficient.",
	address = "Las Vegas, NV, United States",
	journal = "Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications",
	key = "Data communication systems",
	keywords = "Local area networks;Multiprocessing systems;Packet networks;Packet switching;Real time systems;Routers;Servers;",
	note = "Deadlock-free dynamic reconfiguration;Double scheme;Infiniband architecture;Routing;Storage area networking;System area networking;",
	pages = "793 - 800",
	title = "{A} method for applying double scheme dynamic reconfiguration over infini{B}and",
	volume = 2,
	year = 2003
}

B Caminero, C Carrion, F J Quiles, Jose Duato and S Yalamanchili. {A. 220 - 226. BibTeX

@conference{2004148099314,
author = "B. Caminero and C. Carrion and F.J. Quiles and Duato, Jose and S. Yalamanchili",
abstract = "Over the past few years a dramatic increase in the use of multimedia applications has taken place, due mainly to the availability of fast processors and sophisticated peripherals at a low cost. Many of these applications either are inherently distributed, or need the resources of a cluster of computers. The traffic generated by multimedia applications has very different requirements than the best-effort traffic generated by conventional applications. Now, the network must deliver some sort of Quality of Service (QoS) to the flows that require it. The Multimedia Router (MMR) arises as a solution for providing such QoS support within a compact interconnection element, aimed for use in local and clustered environments. The particular needs for every kind of traffic, both multimedia and best-effort, are addressed in order to provide the multimedia flows the QoS guarantees they need, while still achieving high link utilizations. In this work, the Multimedia Router architecture is described, and some insight is given on its performance, especially regarding the interaction between buffer size and the algorithms used for link scheduling.",
address = "Las Vegas, NV, United States",
journal = "Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications",
key = "Data communication systems",
keywords = "Distributed computer systems;Interconnection networks;Parallel algorithms;Quality of service;Routers;Telecommunication traffic;",
note = "Link-switch scheduling;Multimedia communication;Router architecture;",
pages = "220 - 226",
title = "{A"
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Boosting the performance of Myrinet networks. Parallel and Distributed Systems, IEEE Transactions on 13(11):1166 - 1182, November 2002. URL, DOI BibTeX

@article{1058099,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. These networks allow the customer to connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Some of these networks use source routing and wormhole switching. In particular, we are interested in Myrinet networks because it is a well-known commercial product and its behavior can be controlled by the software running in network interfaces (Myrinet Control Program, MCP). Usually, the Myrinet network uses up*/down* routing for computing the paths for every source-destination pair. We propose the In-Transit Buffer (ITB) mechanism to improve network performance. We apply the ITB mechanism to NOWs with up*/down* source routing, like Myrinet, analyzing its behavior on both networks with regular and irregular topologies. The proposed scheme can be implemented on Myrinet networks by only modifying the MCP, without changing the network hardware. We evaluate by simulation several networks with different traffic patterns using timing parameters taken from the Myrinet network. Results show that the current routing schemes used in Myrinet networks can be strongly improved by applying the ITB mechanism. In general, our proposed scheme is able to double the network throughput on medium and large NOWs. Finally, we present a first implementation of the ITB mechanism on a Myrinet network.",
	doi = "10.1109/TPDS.2002.1058099",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "In-Transit Buffer; Myrinet network; irregular topologies; network interfaces; network performance boosting; network traffic; parallel computers; performance evaluation; scalability; simulation; throughput; up down source routing; workstation networks; wo",
	month = "nov",
	number = 11,
	pages = "1166 - 1182",
	title = "{B}oosting the performance of {M}yrinet networks",
	url = "http://dx.doi.org/10.1109/TPDS.2002.1058099",
	volume = 13,
	year = 2002
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Boosting the performance of Myrinet networks. Parallel and Distributed Systems, IEEE Transactions on 13(7):693 - 709, July 2002. URL, DOI BibTeX

@article{1019859,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. These networks allow the customer to connect processors using irregular topologies, providing the wiring flexibility, scalability and incremental expansion capability required in this environment. Some of these networks use source routing and wormhole switching. In particular, we are interested in Myrinet networks because they are a well-known commercial product and their behavior can be controlled by the software running on the network interfaces (the Myrinet Control Program, MCP). Usually, the Myrinet network uses up*/down* routing for computing the paths for every source-destination pair. In this paper, we propose an in-transit buffer (ITB) mechanism to improve the network performance. We apply the ITB mechanism to NOWs with up*/down* source routing, like the Myrinet, analyzing its behavior on networks with both regular and irregular topologies. The proposed scheme can be implemented on Myrinet networks by simply modifying the MCP, without changing the network hardware. We evaluate by simulation several networks with different traffic patterns using timing parameters taken from the Myrinet network. The results show that the current routing schemes used in Myrinet networks can be strongly improved by applying the ITB mechanism. In general, our proposed scheme is able to double the network throughput on medium and large NOWs. Finally, we present a first implementation of the ITB mechanism on a Myrinet network",
	doi = "10.1109/TPDS.2002.1019859",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "Myrinet Control Program;Myrinet network performance;in-transit buffer mechanism;incremental expansion capability;irregular topologies;minimal routing;network interfaces;network throughput;network traffic patterns;performance evaluation;regular topologies;",
	month = "jul",
	number = 7,
	pages = "693 -709",
	title = "{B}oosting the performance of {M}yrinet networks",
	url = "http://dx.doi.org/10.1109/TPDS.2002.1019859",
	volume = 13,
	year = 2002
}

Elvira Baydal, Pedro Lopez and Jose Duato. Increasing the adaptivity of routing algorithms for k-ary n-cubes. In Parallel, Distributed and Network-based Processing, 2002. Proceedings. 10th Euromicro Workshop on. 2002, 455 -462. URL, DOI BibTeX

@conference{994333,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "In this paper, we show that routing algorithms may exploit not only the flexibility obtained by crossing network dimensions in any order but also that obtained in the same network dimension, thanks to the availability of bidirectional channels. We analyze the behavior of adaptive routing algorithms both for deadlock avoidance and recovery, exploiting this increased routing flexibility, and compare them with previous proposals in order to evaluate the contribution of the additional routing freedom on network performance. Simulation results show that this simple improvement in the routing algorithm allows one to achieve throughput improvements of up to 45% in networks with low radix, for a uniform distribution of message destinations",
	booktitle = "Parallel, Distributed and Network-based Processing, 2002. Proceedings. 10th Euromicro Workshop on",
	doi = "10.1109/EMPDP.2002.994333",
	isbn = "0-7695-1444-8",
	keywords = "adaptive routing algorithms;additional routing freedom;algorithm adaptivity;bidirectional channels;deadlock avoidance;deadlock recovery;hypercube networks;k-ary n-cubes;network dimension crossing;network performance;network radix;routing flexibility;simul",
	pages = "455 -462",
	title = "{I}ncreasing the adaptivity of routing algorithms for k-ary n-cubes",
	url = "http://dx.doi.org/10.1109/EMPDP.2002.994333",
	year = 2002
}

Elvira Baydal, Pedro Lopez and Jose Duato. Avoiding network congestion with local information. 2002, 35 - 48. URL BibTeX

@conference{20093412265277,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Congestion leads to a severe performance degradation in multiprocessor interconnection networks. Therefore, the use of techniques that prevent network saturation is of crucial importance. Some recent proposals use global network information, thus requiring that nodes exchange some control information, which consumes a far from negligible bandwidth. As a consequence, the behavior of these techniques in practice is not as good as expected. In this paper, we propose a mechanism that uses only local information to avoid network saturation. Each node estimates traffic locally by using the percentage of free virtual output channels that can be used to forward a message towards its destination. When this number is below a threshold value, network congestion is assumed to exist and message throttling is applied. The main contributions of the proposed mechanism are two: i) it is more selective than previous approaches, as it only prevents the injection of messages when they are destined to congested areas; and ii) it outperforms recent proposals that rely on global information. © 2002 Springer Berlin Heidelberg.",
	address = "Kansai Science City, Japan",
	issn = "0302-9743",
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Interconnection networks",
	keywords = "Computer science;Telecommunication networks;",
	note = "Control information;Global informations;Global network information;Local information;Message throttling;Multiprocessor interconnections;Network congestions;Network saturation;Performance degradation;Virtual output;",
	pages = "35 - 48",
	title = "{A}voiding network congestion with local information",
	url = "http://dx.doi.org/10.1007/3-540-47847-7_6",
	volume = "2327 LNCS",
	year = 2002
}

Elvira Baydal, Pedro Lopez and Jose Duato. Increasing the adaptivity of routing algorithms for k-ary n-cubes. 2002, 455 - 62. URL BibTeX

@conference{7205121,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "In this paper, we show that routing algorithms may exploit not only the flexibility obtained by crossing network dimensions in any order but also that obtained in the same network dimension, thanks to the availability of bidirectional channels. We analyze the behavior of adaptive routing algorithms both for deadlock avoidance and recovery, exploiting this increased routing flexibility, and compare them with previous proposals in order to evaluate the contribution of the additional routing freedom on network performance. Simulation results show that this simple improvement in the routing algorithm allows one to achieve throughput improvements of up to 45% in networks with low radix, for a uniform distribution of message destinations",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing",
	keywords = "adaptive systems;concurrency control;hypercube networks;network routing;parallel algorithms;performance evaluation;system recovery;",
	note = "adaptive routing algorithms;algorithm adaptivity;k-ary n-cubes;hypercube networks;wormhole switching;routing flexibility;network dimension crossing;bidirectional channels;deadlock avoidance;deadlock recovery;additional routing freedom;network performance;simulation;throughput;network radix;uniform message destination distribution;",
	pages = "455 - 62",
	title = "{I}ncreasing the adaptivity of routing algorithms for k-ary n-cubes",
	url = "http://dx.doi.org/10.1109/EMPDP.2002.994333",
	year = 2002
}

Maria E Gomez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Evaluation of routing algorithms for InfiniBand networks. 2002, 775 - 80. BibTeX

@conference{7568237,
	author = "Gomez, Maria E. and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = "Storage area networks (SAN) provide the scalability required by the IT servers. The InfiniBand (IBA) interconnect is very likely to become the de facto standard for SAN as well as for NOW. The routing algorithm is a key design issue in irregular networks. Moreover, as several virtual lanes can be used and different network issues can be considered, the performance of the routing algorithms may be affected. In this paper we evaluate three existing routing algorithms (up*/down*, DFS, and smart-routing) suitable for being applied to IBA. Evaluation has been performed by simulation under different synthetic traffic patterns and I/O traces. Simulation results show that the smart-routing algorithm achieves the highest performance",
	address = "Berlin, Germany",
	journal = "Euro-Par 2002 Parallel Processing. 8th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol.2400)",
	keywords = "parallel algorithms;performance evaluation;telecommunication network routing;telecommunication standards;telecommunication traffic;workstation clusters;",
	note = "routing algorithms;InfiniBand networks;storage area networks;SAN;scalability;de facto standard;IBA interconnect;NOW;irregular networks;virtual lanes;performance;up*/down* routing;DFS routing;smart routing;synthetic traffic patterns;I/O traces;simulation;IT servers;",
	pages = "775 - 80",
	title = "{E}valuation of routing algorithms for {I}nfini{B}and networks",
	year = 2002
}

Elvira Baydal, Pedro Lopez and Jose Duato. Congestion control based on transmission times. 2002, 781 - 90. BibTeX

@conference{7568238,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Congestion leads to a severe performance degradation in multiprocessor interconnection networks. Therefore, the use of techniques that prevent network saturation is of crucial importance to avoid high execution times. We propose a new mechanism that uses only local information to avoid network saturation in wormhole networks. In order to detect congestion, each network node computes the quotient between the real transmission time of messages and its minimum theoretical value. If this ratio is greater than a threshold, the physical channel used by the message is considered congested. Depending on the number of congested channels, the available bandwidth to inject messages is reduced. The main contributions of the new mechanism are three: i) it can detect congestion in a remote way, but without transmitting control information through the network; ii) it tries to dynamically adjust the effective injection bandwidth available at each node; and iii) it is starvation-free. Evaluation results show that the proposed mechanism avoids network performance degradation for different network loads and topologies. Indeed, the mechanism does not introduce any penalty for low and medium network loads, where no congestion control mechanism is required",
	address = "Berlin, Germany",
	journal = "Euro-Par 2002 Parallel Processing. 8th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol.2400)",
	keywords = "multiprocessor interconnection networks;network routing;parallel architectures;parallel machines;performance evaluation;",
	note = "congestion control;transmission times;performance degradation;multiprocessor interconnection networks;network saturation;execution times;massively parallel computers;wormhole networks;bandwidth;starvation-free;network topologies;",
	pages = "781 - 90",
	title = "{C}ongestion control based on transmission times",
	year = 2002
}

Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Removing the latency overhead of the ITB mechanism in COWs with source routing. 2002, 463 - 70. URL BibTeX

@conference{7205122,
	author = "Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose",
	abstract = "Clusters of workstations (COWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. The in-transit buffer (ITB) mechanism can improve network performance when applied to COWs with irregular topology and source routing. This mechanism considerably improves the performance of this kind of network when compared to current source routing algorithms; however, it introduces a latency penalty. An implementation of this mechanism was performed, showing that the latency overhead of the mechanism may be noticeable, especially for short messages and at low network loads. In this paper, we analyze in detail the latency overhead of ITBs, proposing several mechanisms to reduce, hide and remove it. Firstly, we show, by simulation, the effect of an ITB implementation that is much slower than the one implemented. Then we propose three mechanisms that try to overcome the latency penalty. All the mechanisms are simple and can be easily implemented; also, they are out of the critical path of the ITB packet-processing procedure. The results show very good behaviour of the proposed mechanisms, considerably reducing or even completely removing the latency overhead",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing",
	keywords = "buffer storage;delays;performance evaluation;telecommunication network routing;workstation clusters;",
	note = "latency overhead removal;in-transit buffer mechanism;workstation clusters;source routing;network performance;irregular network topology;short messages;network loads;simulation;latency penalty;critical path;packet processing procedure;",
	pages = "463 - 70",
	title = "{R}emoving the latency overhead of the {ITB} mechanism in {COW}s with source routing",
	url = "http://dx.doi.org/10.1109/EMPDP.2002.994334",
	year = 2002
}

M E Acacio, J Gonzalez, J M Garcia and Jose Duato. The use of prediction for accelerating upgrade misses in cc-NUMA multiprocessors. 2002, 155 - 64. URL BibTeX

@conference{7503575,
	author = "M.E. Acacio and J. Gonzalez and J.M. Garcia and Duato, Jose",
	abstract = "This work is focused on accelerating upgrade misses in cc-NUMA multiprocessors. These misses are caused by store instructions for which a read-only copy of the line is found in the L2 cache. Upgrade misses require a message sent from the missing node to the directory, a directory lookup in order to find the set of sharers, invalidation messages being sent to the sharers and responses to the invalidations being sent back. Therefore, the penalty paid by these misses is not negligible, mainly if we consider that they account for a high percentage of the total miss rate. We propose the use of prediction as a means of providing cc-NUMA multiprocessors with a more efficient support for upgrade misses by directly invalidating sharers from the missing node. Our proposal comprises an effective prediction scheme achieving high hit rates as well as a coherence protocol extended to support the use of prediction. Our work is motivated by two key observations: first, upgrade misses present a repetitive behavior and, second, the total number of sharers being invalidated is small (one, in some cases). Using execution-driven simulations, we show that the use of prediction can significantly accelerate upgrade misses (latency reductions of more than 40% in some cases). These important improvements translate into speed-ups on application performance up to 14%. Finally, these results can be obtained including a predictor with a total size of less than 48 KB in every node",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 2002 International Conference on Parallel Architectures and Compilation Techniques. PACT 2002",
	keywords = "cache storage;delays;memory protocols;shared memory systems;",
	note = "prediction;upgrade miss acceleration;cc-NUMA multiprocessors;L2 cache;direct invalidation;coherence protocol;repetitive behavior;sharers;execution-driven simulations;latency reductions;",
	pages = "155 - 64",
	title = "{T}he use of prediction for accelerating upgrade misses in cc-{NUMA} multiprocessors",
	url = "http://dx.doi.org/10.1109/PACT.2002.1106014",
	year = 2002
}

M E Acacio, J Gonzalez, J M Garcia and Jose Duato. Reducing the latency of L2 misses in shared-memory multiprocessors through on-chip directory integration. 2002, 368 - 75. URL BibTeX

@conference{7205111,
	author = "M.E. Acacio and J. Gonzalez and J.M. Garcia and Duato, Jose",
	abstract = "Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller and the network interface. In this paper, we exploit such an integration scale, presenting a new three-level directory architecture aimed at reducing the long L2 miss latencies and the memory overhead that characterize cc-NUMA machines and limit their scalability. The proposed architecture is based on the integration into the processor chip of the directory controller and a small first-level directory cache that stores precise information for the most recently referenced memory lines, as the means to reduce miss latencies. The second- and third-level directories are located near the main memory and they are only accessed when a directory entry for a certain memory line is not present in the first-level directory. This off-chip structure achieves the performance of a large and non-scalable full-map directory with a very significant reduction in the memory overhead. Using execution-driven simulations, we show that substantial latency reductions can be obtained by using the proposed directory architecture. Load, store and read-modify-write misses are significantly accelerated (latency reductions of more than 35% in some cases). These reductions translate into important improvements on the final application performance (reductions up to 20% in execution time)",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing",
	keywords = "cache storage;delays;microprocessor chips;parallel architectures;performance evaluation;shared memory systems;",
	note = "L2 miss latency reduction;shared-memory multiprocessors;on-chip directory integration;technology improvements;memory controller;network interface;integration scale;3-level directory architecture;memory overhead reduction;cc-NUMA machines;cache-coherent nonuniform memory access;scalability;directory controller;directory cache;recently referenced memory lines;main memory;off-chip structure;performance;execution-driven simulations;load misses;store misses;read-modify-write misses;application performance;execution time;",
	pages = "368 - 75",
	title = "{R}educing the latency of {L}2 misses in shared-memory multiprocessors through on-chip directory integration",
	url = "http://dx.doi.org/10.1109/EMPDP.2002.994312",
	year = 2002
}

F J Alfaro, J L Sanchez, Luis Orozco and Jose Duato. Performance evaluation of VBR traffic in InfiniBand. 2002, 1532 - 1537. URL BibTeX

@conference{2002317038403,
	author = "F.J. Alfaro and J.L. Sanchez and Luis Orozco and Duato, Jose",
	abstract = "The InfiniBand Architecture (IBA) is becoming an industry standard both for communication between processing nodes and I/O devices and for interprocessor communication. It replaces the traditional I/O bus with a switch-based interconnect for connecting processing nodes and I/O devices. It is being developed by the InfiniBand\textsuperscript{SM} Trade Association (IBTA) to provide the levels of reliability, availability, performance, scalability, and quality of service (QoS) necessary for present and future server systems. For this, IBA provides a series of mechanisms that are able to guarantee QoS to the applications. In [2, 4], we proposed a strategy to compute the InfiniBand arbitration tables. We only evaluated our proposal for CBR traffic with fixed mean bandwidth requirements. In this paper, we evaluate our strategy to compute the InfiniBand arbitration tables with VBR traffic. Performance results show that this class of traffic also has its QoS requirements met.",
	address = "Winnipeg, Manitoba, Canada",
	issn = "0840-7789",
	journal = "Canadian Conference on Electrical and Computer Engineering",
	key = "Telecommunication networks",
	keywords = "Bandwidth;Packet switching;Quality of service;Servers;Telecommunication traffic;",
	note = "InfiniBand Architecture (IBA);",
	pages = "1532 - 1537",
	title = "{P}erformance evaluation of {VBR} traffic in {I}nfini{B}and",
	url = "http://dx.doi.org/10.1109/CCECE.2002.1012981",
	volume = 3,
	year = 2002
}

JC Sancho, Antonio Robles and Jose Duato. Performance sensitivity of routing algorithms to failures in networks of workstations with regular and irregular topologies. In F Vajda and N Podhorszki (eds.). 10TH EUROMICRO WORKSHOP ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, PROCEEDINGS. 2002, 81-90. BibTeX

@conference{ISI:000173566600010,
	author = "JC Sancho and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations (NOWs) provide a cost-effective alternative to parallel computers. Components in NOWs may fail, degrading the network operation until the faults are repaired. In this paper, we analyze the influence of both switch and link failures on the network performance. In particular, given that network performance in NOWs strongly depends on the applied routing algorithm, we quantify the sensitivity to failures of two routing algorithms: flexible routing and up{*}/down{*} routing algorithms. In the case of up{*}/down{*} routing, two methodologies to compute routing tables are evaluated. Evaluation results modeling a Myrinet network show that, in general, up{*}/down{*} routing is more robust to failures, although its behavior strongly depends on the type of network topology, regular or irregular, and the methodology used to compute routing tables. However, the flexible routing algorithm presents a better performance, regardless of the network topology, even in presence of failures, but at expense of a larger sensitivity.",
	booktitle = "10TH EUROMICRO WORKSHOP ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, PROCEEDINGS",
	editor = "Vajda, F and Podhorszki, N",
	isbn = 0769514448,
	note = "10th Euromicro Workshop on Parallel, Distributed and Network-based Processing (PDP 2002), LAS PALMAS GC, SPAIN, JAN 09-11, 2002",
	pages = "81-90",
	title = "{P}erformance sensitivity of routing algorithms to failures in networks of workstations with regular and irregular topologies",
	year = 2002
}

J C Sancho, Antonio Robles and Jose Duato. Performance sensitivity of routing algorithms to failures in networks of workstations with regular and irregular topologies. 2002, 81 - 90. URL BibTeX

@conference{7205079,
	author = "J.C. Sancho and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations (NOWs) provide a cost-effective alternative to parallel computers. Components in NOWs may fail, degrading the network operation until the faults are repaired. In this paper, we analyze the influence of both switch and link failures on the network performance. In particular, given that network performance in NOWs strongly depends on the applied routing algorithm, we quantify the sensitivity to failures of two routing algorithms: flexible routing and up*/down* routing algorithms. In the case of up*/down* routing, two methodologies to compute routing tables are evaluated. Evaluation results modeling a Myrinet network show that, in general, up*/down* routing is more robust to failures, although its behavior strongly depends on the type of network topology, regular or irregular, and the methodology used to compute routing tables. However, the flexible routing algorithm presents a better performance, regardless of the network topology, even in presence of failures, but at expense of a larger sensitivity",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing",
	keywords = "computer networks;performance evaluation;workstation clusters;",
	note = "performance sensitivity;routing algorithms;networks of workstations;irregular topologies;regular topologies;link failures;switch failures;network performance;Myrinet network;",
	pages = "81 - 90",
	title = "{P}erformance sensitivity of routing algorithms to failures in networks of workstations with regular and irregular topologies",
	url = "http://dx.doi.org/10.1109/EMPDP.2002.994237",
	year = 2002
}

G Bernabe, J Gonzalez, J M Garcia and Jose Duato. Memory conscious 3D wavelet transform. 2002, 108 - 13. URL BibTeX

@conference{7480885,
	author = "G. Bernabe and J. Gonzalez and J.M. Garcia and Duato, Jose",
	abstract = "The video compression algorithms based on the 3D wavelet transform obtain excellent compression rates at the expense of huge memory requirements, which drastically affect the execution time of such applications. The goal of this work is to mitigate the memory problem by exploiting the memory hierarchy of the processor through blocking. In particular, we present two blocking approaches, cube and rectangular, that differ in the way the original working set is divided. We also propose the reuse of previous computations in order to decrease the number of memory accesses and floating point operations. Results show that the rectangular overlapped approach with computation reuse obtains the best results in terms of execution time, a speedup of 2.42 over the non-blocking non-overlapped wavelet transform, maintaining the compression ratio and the video quality (PSNR) of the original encoder based on the 3D wavelet transform",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 28th Euromicro Conference",
	keywords = "data compression;medical image processing;performance evaluation;storage management;transform coding;video coding;wavelet transforms;",
	note = "3D wavelet transform;processor memory hierarchy;cube blocking;rectangular blocking;previous computation reuse;floating point operations;rectangular overlapped approach;execution time;speedup;compression ratio;medical video;video quality;PSNR;video compression algorithms;",
	pages = "108 - 13",
	title = "{M}emory conscious 3{D} wavelet transform",
	url = "http://dx.doi.org/10.1109/EURMIC.2002.1046141",
	year = 2002
}

Ki Hwan Yum, Eun Jung Kim, C R Das, M Yousif and Jose Duato. Integrated admission and congestion control for QoS support in clusters. 2002, 325 - 32. URL BibTeX

@conference{7503603,
	author = "Ki Hwan Yum and Eun Jung Kim and C.R. Das and M. Yousif and Duato, Jose",
	abstract = "Admission and congestion control mechanisms are integral parts of any Quality of Service (QoS) design for networks that support integrated traffic. In this paper we propose an admission control algorithm and a congestion control algorithm for clusters, which are increasingly being used in a diverse set of applications that require QoS guarantees. The uniqueness of our approach is that we develop these algorithms for wormhole-switched networks. We use QoS-capable wormhole routers and QoS-capable network interface cards (NICs), referred to as Host Channel Adapters (HCAs) in InfiniBand\texttrademark{} Architecture (IBA), to evaluate the effectiveness of these algorithms. The admission control is applied at the HCAs and the routers, while the congestion control is deployed only at the HCAs. Simulation results indicate that the admission and congestion control algorithms are quite effective in delivering the assured performance. The proposed credit-based congestion control algorithm is simple and practical in that it relies on hardware already available in the HCA to regulate traffic injection",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 2002 IEEE International Conference on Cluster Computing",
	keywords = "quality of service;telecommunication congestion control;telecommunication network routing;workstation clusters;",
	note = "admission control;congestion control;Quality of Service;integrated traffic;clusters;wormhole-switched networks;network interface cards;Host Channel Adapters;",
	pages = "325 - 32",
	title = "{I}ntegrated admission and congestion control for {Q}o{S} support in clusters",
	url = "http://dx.doi.org/10.1109/CLUSTR.2002.1137761",
	year = 2002
}

J Fernandez, J M Garcia and Jose Duato. Improving the performance of real-time communication services on high-speed LANs under topology changes. 2002, 385 - 94. URL BibTeX

@conference{7670833,
	author = "J. Fernandez and J.M. Garcia and Duato, Jose",
	abstract = "In this paper, we propose and evaluate a new protocol that provides topology change- and fault-tolerant real-time communication services on NOW and clusters. This protocol overcomes the main drawback of our previously proposed protocol, called Dynamically Re-established Real-Time Channels (DRRTC), which is physically limited by the number of virtual channels per port. The new protocol allows different real-time channels to share the same virtual channel. In this way, the new protocol allows us to establish a greater number of real-time channels than the previous one. Moreover, its only limitation is the bandwidth devoted to real-time traffic. However, this introduces two new problems that are successfully managed by the new protocol: the existence of cyclic dependencies among different real-time channels and the increased complexity of deadline requirements. We present and analyze the performance evaluation results when a single switch or a single link is deactivated/activated for different topologies and workloads. The new protocol outperforms the DRRTC protocol while guaranteeing deadline requirements and channel recovery",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings LCN 2002. 27th Annual IEEE Conference on Local Computer Networks",
	keywords = "fault tolerance;network topology;performance evaluation;protocols;quality of service;workstation clusters;",
	note = "real-time communication services;high-speed LAN;topology changes;protocol;fault-tolerant real-time communication;NOW;clusters;Dynamically Re-established Real-Time Channels;DRRTC;virtual channel;cyclic dependencies;performance evaluation;deadline requirements;channel recovery;",
	pages = "385 - 94",
	title = "{I}mproving the performance of real-time communication services on high-speed {LAN}s under topology changes",
	url = "http://dx.doi.org/10.1109/LCN.2002.1181810",
	year = 2002
}

Jose Flich, Pedro Lopez, J C Sancho, Antonio Robles and Jose Duato. Improving InfiniBand routing through multiple virtual networks. 2002, 49 - 63. BibTeX

@conference{7387421,
	author = "Flich, Jose and Lopez, Pedro and J.C. Sancho and Robles, Antonio and Duato, Jose",
	abstract = "InfiniBand is very likely to become the de facto standard for communication between nodes and I/O devices as well as for interprocessor communication. Often, the interconnection pattern is irregular. Up*/down* is the most popular routing scheme currently used in NOWs with irregular topologies. However, the main drawbacks of up*/down* routing are the unbalanced channel utilization and the difficulty of routing most packets through minimal paths, which negatively affects network performance. Using additional virtual lanes can improve up*/down* routing performance by reducing the head-of-line blocking effect, but their use is not aimed at removing these drawbacks. We propose a methodology that uses a reduced number of virtual lanes in an efficient way to achieve a better traffic balance and a higher number of minimal paths. This methodology is based on routing packets simultaneously through several properly selected up*/down* trees. To guarantee deadlock freedom, each up*/down* tree is built over a different virtual network. Simulation results show that the proposed methodology increases throughput up to an average factor ranging from 1.18 to 2.18 for 8, 16, and 32-switch networks by using only two virtual lanes. For larger networks with an additional virtual lane, network throughput is tripled, on average",
	address = "Berlin, Germany",
	journal = "High Performance Computing. 4th International Symposium, ISHPC 2002. Proceedings (Lecture Notes in Computer Science Vol.2327)",
	keywords = "multiplexing;multiprocessor interconnection networks;telecommunication network routing;workstation clusters;",
	note = "InfiniBand routing;networks of workstations;multiple virtual networks;interprocessor communication;NOWs;switch-based network;point-to-point links;up*/down* routing;head-of-line blocking effect;deadlock freedom;",
	pages = "49 - 63",
	title = "{I}mproving {I}nfini{B}and routing through multiple virtual networks",
	year = 2002
}

J C Sancho, Antonio Robles, Jose Flich, Pedro Lopez and Jose Duato. Effective methodology for deadlock-free minimal routing in InfiniBand networks. In Parallel Processing, 2002. Proceedings. International Conference on. 2002, 409 - 418. DOI BibTeX

@conference{1040897,
	author = "J.C. Sancho and Robles, Antonio and Flich, Jose and Lopez, Pedro and Duato, Jose",
	abstract = "The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links whose topology is arbitrarily established by the customer. We propose a simple and effective methodology for designing deadlock-free routing strategies that are able to route packets through minimal paths in InfiniBand networks. This methodology can meet the trade-off between network performance and the number of resources dedicated to deadlock avoidance. Evaluation results show that the resulting routing strategies significantly outperform up*/down* routing. In particular, throughput improvement ranges, on average, from 1.33 for small networks to 4.05 for large networks. Also, it is shown that just two virtual lanes and three service levels are enough to achieve more than 80% of the throughput improvement achieved by the best proposed routing strategy (the one that always provides minimal paths without limiting the number of resources).",
	booktitle = "Parallel Processing, 2002. Proceedings. International Conference on",
	doi = "10.1109/ICPP.2002.1040897",
	issn = "0190-3918",
	keywords = "InfiniBand architecture; InfiniBand networks; NOWs; deadlock-free minimal routing; interconnection pattern; minimal paths; network performance; packet routing; point-to-point links; service levels; switch-based network; throughput improvement; up*/down*",
	pages = "409 - 418",
	title = "{E}ffective methodology for deadlock-free minimal routing in {I}nfini{B}and networks",
	year = 2002
}

Jose Flich, Pedro Lopez, Perez M Malumbres and Jose Duato. Boosting the performance of Myrinet networks. IEEE Transactions on Parallel and Distributed Systems 13(7):693 - 709, 2002.

@article{2002367073594,
	author = "Flich, Jose and Lopez, Pedro and M. Perez Malumbres and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. These networks allow the customer to connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Some of these networks use source routing and wormhole switching. In particular, we are interested in Myrinet networks because it is a well-known commercial product and its behavior can be controlled by the software running in network interfaces (Myrinet Control Program, MCP). Usually, the Myrinet network uses up*/down* routing for computing the paths for every source-destination pair. In this paper, we propose the In-Transit Buffer (ITB) mechanism to improve network performance. We apply the ITB mechanism to NOWs with up*/down* source routing, like Myrinet, analyzing its behavior on both networks with regular and irregular topologies. The proposed scheme can be implemented on Myrinet networks by only modifying the MCP, without changing the network hardware. We evaluate by simulation several networks with different traffic patterns using timing parameters taken from the Myrinet network. Results show that the current routing schemes used in Myrinet networks can be strongly improved by applying the ITB mechanism. In general, our proposed scheme is able to double the network throughput on medium and large NOWs. Finally, we present a first implementation of the ITB mechanism on a Myrinet network.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Computer networks",
	keywords = "Buffer storage;Computer hardware;Computer simulation;Computer workstations;Interfaces;Parallel processing systems;Program processors;Routers;Telecommunication traffic;Topology;",
	note = "Myrinet networks;",
	number = 7,
	pages = "693 - 709",
	title = "{B}oosting the performance of {M}yrinet networks",
	url = "http://dx.doi.org/10.1109/TPDS.2002.1019859",
	volume = 13,
	year = 2002
}

J C Sancho, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Analyzing the influence of virtual lanes on the performance of infiniband networks. In Parallel and Distributed Processing Symposium., Proceedings International, IPDPS 2002, Abstracts and CD-ROM. 2002, 166 -175.

@conference{1016568,
	author = "J.C. Sancho and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	booktitle = "Parallel and Distributed Processing Symposium., Proceedings International, IPDPS 2002, Abstracts and CD-ROM",
	pages = "166 -175",
	title = "{A}nalyzing the influence of virtual lanes on the performance of infiniband networks",
	year = 2002
}

I Paul, S Yalamanchili and Jose Duato. Algorithms for switch-scheduling in the multimedia router for LANs. 2002, 219 - 31.

@conference{7748982,
	author = "I. Paul and S. Yalamanchili and Duato, Jose",
	abstract = "The primary objective of the multimedia router (MMR) [Jose Duato et al., (1999)] project is to design and implement a single chip router targeted for use in cluster and LAN interconnection networks. The goal can be concisely captured in the phrase 'QoS routing at link speeds'. We study a set of algorithms for switch-scheduling based on a highly concurrent implementation for capturing output port requests. Two different switch-scheduling algorithms called row-column ordering and diagonal ordering are proposed and implemented in a switch-scheduling framework which involves a matrix data structure, and therefore enables concurrent and parallel operations at high-speed. Their performance has been evaluated with constant bit rate (CBR), variable bit rate (VBR), and a mixture of CBR and VBR traffic. At high offered loads both these ordering functions have been shown to deliver superior quality of service (QoS) to connections at a high scheduling rate and high utilization",
	address = "Berlin, Germany",
	journal = "High Performance Computing - HiPC 2002. 9th International Conference. Proceedings (Lecture Notes in Computer Science Vol.2552)",
	keywords = "concurrency control;LAN interconnection;multimedia communication;packet switching;parallel processing;quality of service;telecommunication network routing;telecommunication traffic;",
	note = "multimedia router;MMR;LAN interconnection network;QoS routing;quality of service;output port request;switch-scheduling algorithm;row-column ordering;diagonal ordering;matrix data structure;concurrent operation;parallel operation;constant bit rate;variable bit rate;",
	pages = "219 - 31",
	title = "{A}lgorithms for switch-scheduling in the multimedia router for {LAN}s",
	year = 2002
}

M E Acacio, J Gonzalez, J M Garcia and Jose Duato. A novel approach to reduce L2 miss latency in shared-memory multiprocessors. 2002, 580 - 7.

@conference{7342351,
	author = "M.E. Acacio and J. Gonzalez and J.M. Garcia and Duato, Jose",
	abstract = "Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware and the network interface/router. In this work we exploit such integration scale, presenting a novel node architecture aimed at reducing the long L2 miss latencies and the memory overhead of using directories that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture and adds a small shared data cache to each of the nodes of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip in every node. A taxonomy of the L2 misses, according to the actions performed by the directory to satisfy them, is also presented. Using execution-driven simulations, we show significant L2 miss latency reductions (more than 60% in some cases). These important improvements translate into reductions of more than 30% in the application execution time in some cases",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 16th International Parallel and Distributed Processing Symposium",
	keywords = "cache storage;parallel architectures;performance evaluation;shared memory systems;",
	note = "L2 miss latency reduction;shared-memory multiprocessors;memory controller;coherence hardware;network interface;node architecture;memory overhead;cc-NUMA machines;shared data cache;execution-driven simulations;scalability;three-level directory architecture;",
	pages = "580 - 7",
	title = "{A} novel approach to reduce {L}2 miss latency in shared-memory multiprocessors",
	url = "http://dx.doi.org/10.1109/IPDPS.2002.1015554",
	year = 2002
}

F J Alfaro, J L Sanchez, Jose Duato and C R Das. A strategy to compute the InfiniBand arbitration tables. 2002, 43 - 8.

@conference{7342290,
	author = "F.J. Alfaro and J.L. Sanchez and Duato, Jose and C.R. Das",
	abstract = "The InfiniBand Architecture (IBA) is a new industry standard architecture for server I/O and interprocessor communication. InfiniBand is very likely to become the de facto standard in a few years. It is being developed by the InfiniBand Trade Association (IBTA) to provide the levels of reliability, availability, performance, scalability, and quality of service (QoS) necessary for present and future server systems. The provision of QoS in data communication networks is currently the focus of much discussion and research in industry and academia. IBA enables QoS support with some mechanisms. In this paper, we examine these mechanisms and describe a way to use them. We propose a traffic segregation strategy based on mean bandwidth requirements. Moreover, we propose a very effective strategy to compute the virtual lane arbitration tables for IBA switches. We evaluate our proposal with different network topologies. Performance results show that, with a correct treatment of each traffic class in the arbitration of the output port, every traffic class meets its QoS requirements",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 16th International Parallel and Distributed Processing Symposium",
	keywords = "quality of service;standards;system buses;",
	note = "InfiniBand arbitration tables;industry standard architecture;server I/O;interprocessor communication;reliability;availability;performance;scalability;quality of service;data communication networks;traffic segregation strategy;mean bandwidth requirements;virtual lane arbitration tables;QoS requirements;",
	pages = "43 - 8",
	title = "{A} strategy to compute the {I}nfini{B}and arbitration tables",
	url = "http://dx.doi.org/10.1109/IPDPS.2002.1015474",
	year = 2002
}

B Caminero, C Carrion, F J Quiles, Jose Duato and S Yalamanchili. A new switch scheduling algorithm to improve QoS in the multimedia router. 2002, 376 - 9.

@conference{7810072,
	author = "B. Caminero and C. Carrion and F.J. Quiles and Duato, Jose and S. Yalamanchili",
	abstract = "The multimedia router (MMR) is aimed at providing QoS to multimedia flows, which coexist with conventional best-effort traffic, by means of a single-chip, compact router designed for cluster and local area environments. As the router is based on a multiplexed crossbar, hardware efficient link and switch scheduling algorithms are needed. Their goal is to achieve a high utilization, while the QoS needed by the multimedia connections is guaranteed. This work presents a novel switch scheduling algorithm, the candidate conflict arbiter (CCA), that can be efficiently implemented in the MMR. Simulation results show that this proposal beats other previous algorithms in terms of maximum throughput achieved while still providing QoS to the multimedia flows",
	address = "Piscataway, NJ, USA",
	journal = "Proceedings of 2002 IEEE Workshop on Multimedia Signal Processing (Cat. No.02TH8661)",
	keywords = "local area networks;multimedia communication;quality of service;scheduling;telecommunication network routing;telecommunication switching;",
	note = "switch scheduling algorithm;QoS;multimedia router;quality of service;best-effort traffic;LAN;single-chip;multiplexed crossbar;hardware efficient link;candidate conflict arbiter;multimedia flow;cluster;local area environment;",
	pages = "376 - 9",
	title = "{A} new switch scheduling algorithm to improve {Q}o{S} in the multimedia router",
	url = "http://dx.doi.org/10.1109/MMSP.2002.1203324",
	year = 2002
}

B Caminero, C Carrion, F J Quiles, Jose Duato and S Yalamanchili. A multimedia router architecture to provide high performance and QoS guarantees to mixed traffic. 2002, 313 - 16.

@conference{7540635,
	author = "B. Caminero and C. Carrion and F.J. Quiles and Duato, Jose and S. Yalamanchili",
	abstract = "The explosive growth in using scalable and cost-effective clusters and local area environments involves the design of high performance networks aimed at providing QoS to multimedia flows. Thus, the main goal pursued by the Multimedia Router (MMR) project is to design a single-chip router able to efficiently handle multimedia flows and best-effort traffic. In this paper we focus on the performance evaluation of the MMR architecture using a mix of CBR, VBR and best effort workload. Preliminary simulation results show that, by using simple link and switch scheduling algorithms, the router is able to achieve a link bandwidth utilization of 80%, while still providing QoS guarantees to both CBR and VBR traffic in the presence of best-effort traffic",
	address = "Piscataway, NJ, USA",
	journal = "Proceedings 2002 IEEE International Conference on Multimedia and Expo (Cat. No.02TH8604)",
	keywords = "bandwidth allocation;multimedia communication;quality of service;telecommunication network routing;telecommunication traffic;",
	note = "multimedia router architecture;QoS guarantees;Multi-Media project;MMR project;single-chip router;best-effort traffic;performance evaluation;CBR;VBR;link bandwidth utilization;mixed traffic;multimedia flows;high performance networks;",
	pages = "313 - 16",
	title = "{A} multimedia router architecture to provide high performance and {Q}o{S} guarantees to mixed traffic",
	url = "http://dx.doi.org/10.1109/ICME.2002.1035781",
	volume = "vol.1",
	year = 2002
}

J M Orduna, Federico Silla and Jose Duato. A clustering method for modeling the communication requirements of message-passing applications. Computing and Informatics 21(1):1 - 16, 2002.

@article{7407405,
	author = "J.M. Orduna and Silla, Federico and Duato, Jose",
	abstract = "Clusters have become a very cost-effective platform for high-performance computing. Usually these systems become heterogeneous as they grow, due to their incremental capabilities. Many research activities have focused on the problem of task scheduling in heterogeneous systems from the computational point of view. However, an ideal scheduling strategy would also take into account the communication requirements of the applications and the communication bandwidth available in the network. One of the key issues in this strategy is the measurement of the communication requirements for each application. We propose a clustering-based method to characterize the communications between processes generated by message-passing applications. This technique provides a model consisting of several partitions of the processes generated by the application. Also, we propose a criterion to measure the quality of the obtained partitions. This approach can be used when a given application is repeatedly executed with different input data. Results show that the proposed method can provide a partition with the highest ratio between the intracluster and the intercluster required communication bandwidth. This partition can be used to map groups of processes to processors in the heterogeneous system",
	address = "Slovakia",
	issn = "0232-0274",
	journal = "Computing and Informatics",
	keywords = "message passing;performance evaluation;resource allocation;scheduling;workstation clusters;",
	note = "clustering method;communication requirements;message-passing applications;cost-effective;high-performance computing;task scheduling;heterogeneous systems;interconnection networks;cluster computing;communication bandwidth;intracluster;intercluster;",
	number = 1,
	pages = "1 - 16",
	title = "{A} clustering method for modeling the communication requirements of message-passing applications",
	volume = 21,
	year = 2002
}

Vicente Chirivella, Rosa Alcover and Jose Duato. Accurate reliability and availability models for direct interconnection networks. In Parallel Processing, International Conference on, 2001.. September 2001, 517 - 24.

@conference{7075325,
	author = "Chirivella, Vicente and Alcover, Rosa and Duato, Jose",
	abstract = "Fault tolerance in multicomputer interconnection networks has been traditionally studied by determining the worst possible combination of faulty components that causes its failure and then assuming that this will occur. But, the probability of the worst possible combination is usually low, and the routing algorithm may be able to find a route between source and destination nodes. The network dependability parameters computed according to this approach will be underestimated. In this paper we propose a methodology for accurately evaluating interconnection network dependability. In addition, we apply it to obtain an accurate estimation of the reliability and availability parameters in a 2-D mesh, taking into account network size, routing algorithm, failure and repair rates of nodes, and coverage. Finally we compare the computed results under both approaches",
	booktitle = "Parallel Processing, International Conference on, 2001.",
	doi = "10.1109/ICPP.2001.952099",
	isbn = "0-7695-1257-7",
	journal = "Proceedings International Conference on Parallel Processing",
	keywords = "fault tolerant computing;multiprocessor interconnection networks;network routing;",
	month = "Sep",
	note = "accurate reliability;availability models;direct interconnection networks;fault tolerance;multicomputer interconnection networks;faulty components;routing algorithm;network dependability parameters;network size;",
	pages = "517 - 24",
	title = "{A}ccurate reliability and availability models for direct interconnection networks",
	url = "http://dx.doi.org/10.1109/ICPP.2001.952099",
	year = 2001
}

Rosa Alcover, Vicente Chirivella and Jose Duato. Improving the accuracy of reliability models for direct interconnection networks. In Rizos Sakellariou; John Gurd; Len Freeman; John Keane (ed.). Euro-Par 2001 Parallel Processing 2150. August 2001, 621 - 629.

@conference{7211185,
	author = "Alcover, Rosa and Chirivella, Vicente and Duato, Jose",
	abstract = "Fault-tolerance in multicomputer interconnection networks has been traditionally studied by determining the worst possible combination of faulty components that causes a network failure and then assuming that this will occur. But the worst possible combination may occur with low probability and the routing algorithm may allow the network to work, even when there is a large number of faults. Thus, the network dependability parameters computed according to this approach will be underestimated. Previously (V. Chirivella and R. Alcover, 2000), we proposed a new methodology based on Markov chains, for evaluating interconnection network dependability. Using this methodology, we can accurately compute the network reliability behavior. We apply it to evaluate dependability parameters in a 2-D mesh, taking into account network size, routing algorithm, failure and repair rates of nodes and coverage. Finally, we compare the computed results to a traditional approach",
	address = "Berlin, Germany",
	booktitle = "Euro-Par 2001 Parallel Processing",
	doi = "10.1007/3-540-44681-8_89",
	editor = "Rizos Sakellariou; John Gurd; Len Freeman; John Keane",
	isbn = "978-3-540-42495-6",
	journal = "Euro-Par 2001 Parallel Processing. 7th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol.2150)",
	keywords = "fault tolerant computing;Markov processes;multiprocessor interconnection networks;network routing;",
	month = "Aug",
	note = "reliability model accuracy;direct interconnection networks;fault-tolerance;multicomputer interconnection networks;worst possible combination;faulty components;network failure;probability;routing algorithm;network dependability parameters;Markov chains;interconnection network dependability;network reliability behavior;dependability parameters;2D mesh;network size;repair rates;failure rates;",
	pages = "621 - 629",
	publisher = "Springer",
	series = "Lecture Notes in Computer Science",
	title = "{I}mproving the accuracy of reliability models for direct interconnection networks",
	url = "http://dx.doi.org/10.1007/3-540-44681-8_89",
	volume = 2150,
	year = 2001
}

Juan Miguel Martínez, Pedro Lopez and Jose Duato. A cost-effective approach to deadlock handling in wormhole networks. Parallel and Distributed Systems, IEEE Transactions on 12(7):716 -729, July 2001.

@article{940746,
	author = "Mart{\'i}nez, Juan Miguel and Lopez, Pedro and Duato, Jose",
	abstract = "Wormhole networks have traditionally used deadlock avoidance strategies. More recently, deadlock recovery strategies have begun to gain acceptance. In particular, progressive deadlock recovery techniques allocate a few dedicated resources to quickly deliver deadlocked packets. Deadlock recovery is based on the assumption that deadlocks are rare; otherwise, recovery techniques are not efficient. Measurements of deadlock occurrence frequency show that deadlocks are highly unlikely when enough routing freedom is provided. However, networks are more prone to deadlocks when the network is close to or beyond saturation, causing some network performance degradation. Similar performance degradation behavior at saturation was also observed in networks using deadlock avoidance strategies. In this paper, we take a different approach to handling deadlocks and performance degradation. We propose the use of an injection limitation mechanism that prevents performance degradation near the saturation point and, at the same time, reduces the probability of deadlock to negligible values. We also propose an improved deadlock detection mechanism that uses only local information, detects all deadlocks, and considerably reduces the probability of false deadlock detection over previous proposals. In the rare case when impending deadlock is detected, our proposal consists of using a simple recovery technique that absorbs the deadlocked message at the current node and later reinjects it for continued routing toward its destination. Performance evaluation results show that our new approach to handling deadlock is more efficient than previously proposed techniques",
	doi = "10.1109/71.940746",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "cost-effective approach;deadlock avoidance;deadlock handling;deadlock occurrence frequency;deadlock recovery;injection limitation mechanism;network performance degradation;performance degradation;performance evaluation;wormhole networks;concurrency contro",
	month = "jul",
	number = 7,
	pages = "716 -729",
	title = "{A} cost-effective approach to deadlock handling in wormhole networks",
	url = "http://dx.doi.org/10.1109/71.940746",
	volume = 12,
	year = 2001
}

Salvador Coll, Jose Flich, M P Malumbres, Pedro Lopez, Jose Duato and F J Mora. A first implementation of in-transit buffers on myrinet gm software. In Parallel and Distributed Processing Symposium., Proceedings 15th International. April 2001, 1640 -1647.

@conference{925150,
author = "Coll, Salvador and Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose and F.J. Mora",
abstract = "Clusters of workstations (COWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. In these systems, the interconnection network connects hosts using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Myrinet is the most popular network used to build COWs. It uses source routing with the up*/down* routing algorithm. In previous papers we proposed the In-Transit Buffer (ITB) mechanism that improves network performance by allowing minimal routing, balancing network traffic, and reducing network contention. The mechanism is based on ejecting packets at some intermediate hosts and later re-injecting them into the network. Moreover, the ITB mechanism does not require additional hardware as it can be implemented on the software running at Myrinet network adapters. In this paper, we present a first implementation of the ITB mechanism on Myrinet GM software. We show the changes required in packet format and the modifications performed in the Myrinet Control Program (MCP). In addition, both the overhead introduced by the new code and the cost of extracting and re-injecting packets are measured. Results show that, even for this simple implementation, code overhead is only about 125 ns per packet and the message latency increase for messages that use the ITB mechanism is around 1.3 µs per ITB. This is the first attempt to implement this mechanism, showing that a real implementation of ITBs is feasible on Myrinet COWs, and the associated overhead does not restrict the potential benefits of this mechanism.",
booktitle = "Parallel and Distributed Processing Symposium., Proceedings 15th International",
doi = "10.1109/IPDPS.2001.925150",
isbn = "0-7695-0990-8",
issn = "1530-2075",
month = "apr",
pages = "1640 -1647",
title = "{A} first implementation of in-transit buffers on myrinet gm software",
url = "http://dx.doi.org/10.1109/IPDPS.2001.925150",
year = 2001
}

Jose Flich, Pedro Lopez, M P Malumbres, Jose Duato and T Rokicki. Improving network performance by reducing network contention in source-based COWS with a low path-computation overhead. In Parallel and Distributed Processing Symposium., Proceedings 15th International. April 2001, 8 pp..

@conference{925016,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose and T. Rokicki",
	abstract = "In previous papers, we have proposed the in-transit buffer mechanism (ITB) to improve network performance in COWs with irregular topology and source routing. This mechanism allows the use of minimal paths among all hosts, breaking cyclic dependences between channels by storing and later re-injecting packets at some intermediate hosts. However it also has two additional features that can improve even more network performance. First, the ITB mechanism reduces network contention because some messages are ejected from the network freeing network links. Second the ITB mechanism allows the use of any path between each source-destination pair improving traffic balance. In this paper we present a new routing algorithm that takes advantage of ITB by exploiting both issues: traffic balance and network contention reduction. The evaluation results show that network throughput can be considerably improved. On average, network throughput increases with respect to up*/down* by factors of 2.51 and 3.77 in 32 and 64-switch networks, respectively",
	booktitle = "Parallel and Distributed Processing Symposium., Proceedings 15th International",
	doi = "10.1109/IPDPS.2001.925016",
	keywords = "in-transit buffer mechanism;network contention;network performance;network throughput;source routing;source-based COWS;traffic balance;performance evaluation;workstation clusters;",
	month = "apr",
	pages = "8 pp.",
	title = "{I}mproving network performance by reducing network contention in source-based {COWS} with a low path-computation overhead",
	year = 2001
}

Elvira Baydal, Pedro Lopez and Jose Duato. A congestion control mechanism for wormhole networks. In Parallel and Distributed Processing, 2001. Proceedings. Ninth Euromicro Workshop on. 2001, 19 -26.

@conference{904965,
author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
abstract = "Deadlock avoidance and recovery techniques suffer from severe performance degradation when the network is close to or beyond saturation. Many parallel applications produce bursty traffic that may saturate the network during some intervals, and increase execution time. Therefore, the use of techniques that prevent network saturation is of crucial importance in both deadlock avoidance and recovery strategies. Several mechanisms have been proposed in the literature to reach this goal. However some of them do not work well under all network load conditions. Others introduce some penalty when the network is not fully saturated, or complicate network and/or node implementation. In this paper we propose a new mechanism to avoid network saturation that overcomes these drawbacks. In this mechanism, each node estimates network traffic locally by using the percentage of free virtual output channels that can be used for forwarding a message towards its destination. When this number surpasses a threshold value, network congestion is assumed to exist and message injection is forbidden",
booktitle = "Parallel and Distributed Processing, 2001. Proceedings. Ninth Euromicro Workshop on",
doi = "10.1109/EMPDP.2001.904965",
keywords = "bursty traffic;congestion control mechanism;deadlock avoidance;deadlock recovery;free virtual output channels;message injection;network congestion;network load conditions;network saturation;network traffic;performance degradation;threshold value;wormhole",
pages = "19 -26",
title = "{A} congestion control mechanism for wormhole networks",
url = "http://dx.doi.org/10.1109/EMPDP.2001.904965",
year = 2001
}

JC Sancho, Antonio Robles and Jose Duato. On the relative behavior of source and distributed routing in NOWs using up*/down* routing schemes. In K Klockner (ed.). NINTH EUROMICRO WORKSHOP ON PARALLEL AND DISTRIBUTED PROCESSING, PROCEEDINGS. 2001, 11-18.

@conference{ISI:000166833400002,
	author = "JC Sancho and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are arranged as a switch-based network with irregular topology which makes routing and deadlock avoidance quite complicated. Current proposals use the up{*}/down{*} routing algorithm to remove cyclic dependencies between channels and avoid deadlock. Recently a simple and effective methodology to compute up{*}/down{*} routing tables has been proposed by us. The resulting up{*}/down{*} routing scheme increases the number of alternative paths between every pair of switches and allows most messages to follow minimal paths. Also, up{*}/down{*} routing is suitable to be implemented using source or distributed routing. Source routing provides a safer and lower cost implementation of up{*}/down{*} routing than that provided by distributed routing. However distributed routing may benefit from routing messages through alternative paths to reach their destination. In this paper we evaluate the performance of up{*}/down{*} routing when using two methodologies to compute routing tables, and when both source and distributed routing are used. Evaluation results show that it is not worth implementing up{*}/down{*} routing in a distributed way in a NOW environment, since its performance is very close to that achieved by implementing it with source routing when a traffic-balancing algorithm is used. Moreover it is shown that a greater improvement in performance can be achieved by modifying the method to compute up{*}/down{*} routing tables when source routing is used.",
	booktitle = "NINTH EUROMICRO WORKSHOP ON PARALLEL AND DISTRIBUTED PROCESSING, PROCEEDINGS",
	editor = "Klockner, K",
	isbn = 0769509886,
	note = "9th Euromicro Workshop on Parallel and Distributed Processing, MANTOVA, ITALY, FEB 07-09, 2001",
	pages = "11-18",
	title = "{O}n the relative behavior of source and distributed routing in {NOW}s using up{*}/down{*} routing schemes",
	year = 2001
}

JC Sancho, Antonio Robles and Jose Duato. Effective strategy to compute forwarding tables for InfiniBand networks. In LM Ni and M Valero (eds.). PROCEEDINGS OF THE 2001 INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING. 2001, 48-57.

@conference{ISI:000171882100006,
	author = "JC Sancho and Robles, Antonio and Duato, Jose",
	abstract = "InfiniBand is very likely to become the de facto standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links that support any topology defined by the user. Routing in IBA is distributed, based on forwarding tables, and only considers the packet destination ID for routing within subnets. Up{*}/down{*} routing is the simplest and most popular routing algorithm for irregular topologies. Unfortunately, up{*}/down{*} routing cannot be used in IBA switches because it may lead to deadlock. In this paper we address this issue, proposing an easy-to-implement strategy to compute up{*}/down{*} forwarding tables for IBA switches that guarantees deadlock freedom, and is effective whatever the methodology applied to compute up{*}/down{*} routing tables. Preliminary evaluation results modeling an InfiniBand network at register transfer level show that the proposed strategy allows up{*}/down{*} routing algorithms to be implemented on InfiniBand networks with minimal performance degradation.",
	booktitle = "PROCEEDINGS OF THE 2001 INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING",
	editor = "Ni, LM and Valero, M",
	isbn = 0769512585,
	issn = "0190-3918",
	note = "30th International Conference on Parallel Processing (ICPP 01), VALENCIA, SPAIN, SEP 03-07, 2001",
	pages = "48-57",
	series = "PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING",
	title = "{E}ffective strategy to compute forwarding tables for {I}nfini{B}and networks",
	year = 2001
}

R Casado, A Bermudez, Jose Duato, J Quiles and J L Sanchez. A protocol for deadlock-free dynamic reconfiguration in high-speed local area networks. IEEE Transactions on Parallel and Distributed Systems 12(2):115 - 132, 2001. URL BibTeX

@article{2001206504242,
	author = "R. Casado and A. Bermudez and Duato, Jose and J. Quiles and J.L. Sanchez",
	abstract = "High-speed local area networks (LANs) consist of a set of switches interconnected by point-to-point links, and hosts linked to those switches through a network interface card. High-speed LANs may change their topology due to switches being turned on/off, hot expansion, link remapping, and component failures. In these cases, a distributed reconfiguration protocol analyzes the topology, computes the new routing tables, and downloads them to the corresponding switches. Unfortunately, in most cases, user traffic is stopped during the reconfiguration process to avoid deadlock. These strategies are called static reconfiguration techniques. Although network reconfigurations are not frequent, static reconfiguration such as this may take hundreds of milliseconds to execute, thus degrading system availability significantly. Several distributed real-time applications have strict communication requirements. Distributed multimedia applications have similar, although less strict, quality of service (QoS) requirements [3], [4]. Both stopping packet transmission and discarding packets due to the reconfiguration process prevent the system from satisfying the above requirements. Therefore, in order to support hard real-time and distributed multimedia applications over a high-speed LAN, we need to avoid stopping user traffic and discarding packets when the topology changes. In this paper, we propose a new deadlock-free distributed reconfiguration protocol that is able to asynchronously update routing tables without stopping user traffic. This protocol is valid for any topology, including regular as well as irregular topologies. It is also valid for packet switching as well as for cut-through switching techniques and does not rely on the existence of virtual channels to work. Simulation results show that the behaviour of our protocol is significantly better than for other protocols based on stopping user traffic.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Distributed computer systems",
	keywords = "Computer simulation;Computer system recovery;Interconnection networks;Interfaces;Local area networks;Multimedia systems;Network protocols;Packet switching;Quality of service;Real time systems;Telecommunication traffic;",
	note = "Deadlock avoidance;Dynamic reconfiguration;High speed networks;Irregular topologies;Static reconfiguration;System availability;",
	number = 2,
	pages = "115 - 132",
	title = "{A} protocol for deadlock-free dynamic reconfiguration in high-speed local area networks",
	url = "http://dx.doi.org/10.1109/71.910868",
	volume = 12,
	year = 2001
}

Juan Miguel Martínez, Pedro Lopez and Jose Duato. A cost-effective approach to deadlock handling in wormhole networks. IEEE Transactions on Parallel and Distributed Systems 12(7):716 - 729, 2001. URL, DOI BibTeX

@article{2001376648866,
	author = "Mart{\'i}nez, Juan Miguel and Lopez, Pedro and Duato, Jose",
	abstract = "Wormhole networks have traditionally used deadlock avoidance strategies. More recently, deadlock recovery strategies have begun to gain acceptance. In particular, progressive deadlock recovery techniques allocate a few dedicated resources to quickly deliver deadlocked packets. Deadlock recovery is based on the assumption that deadlocks are rare; otherwise, recovery techniques are not efficient. Measurements of deadlock occurrence frequency show that deadlocks are highly unlikely when enough routing freedom is provided [36], [32]. However, networks are more prone to deadlocks when the network is close to or beyond saturation, causing some network performance degradation. Similar performance degradation behavior at saturation was also observed in networks using deadlock avoidance strategies [13]. In this paper, we take a different approach to handling deadlocks and performance degradation. We propose the use of an injection limitation mechanism that prevents performance degradation near the saturation point and, at the same time, reduces the probability of deadlock to negligible values. We also propose an improved deadlock detection mechanism that uses only local information, detects all deadlocks, and considerably reduces the probability of false deadlock detection over previous proposals. In the rare case when impending deadlock is detected, our proposal consists of using a simple recovery technique that absorbs the deadlocked message at the current node and later reinjects it for continued routing toward its destination. Performance evaluation results show that our new approach to handling deadlock is more efficient than previously proposed techniques.",
	doi = "10.1109/71.940746",
	issn = "1045-9219",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Interconnection networks",
	keywords = "Communication channels;Computer system recovery;Multiprocessing programs;Packet networks;",
	note = "Wormhole networks;",
	number = 7,
	pages = "716 - 729",
	title = "{A} cost-effective approach to deadlock handling in wormhole networks",
	url = "http://dx.doi.org/10.1109/71.940746",
	volume = 12,
	year = 2001
}

Elvira Baydal, Pedro Lopez and Jose Duato. A congestion control mechanism for wormhole networks. 2001, 19 - 26. URL BibTeX

@conference{6867163,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Deadlock avoidance and recovery techniques suffer from severe performance degradation when the network is close to or beyond saturation. Many parallel applications produce bursty traffic that may saturate the network during some intervals, and increase execution time. Therefore, the use of techniques that prevent network saturation are of crucial importance in both deadlock avoidance and recovery strategies. Several mechanisms have been proposed in the literature to reach this goal. However some of them do not work well under all network load conditions. Others introduce some penalty when the network is not fully saturated, or complicate network and/or node implementation. In this paper we propose a new mechanism to avoid network saturation that overcomes these drawbacks. In this mechanism, each node estimates network traffic locally by using the percentage of free virtual output channels that can be used for forwarding a message towards its destination. When this number surpasses a threshold value, network congestion is assumed to exist and message injection is forbidden",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings Ninth Euromicro Workshop on Parallel and Distributed Processing",
	keywords = "multiprocessor interconnection networks;network routing;performance evaluation;system recovery;telecommunication congestion control;",
	note = "congestion control mechanism;wormhole networks;deadlock avoidance;deadlock recovery;performance degradation;bursty traffic;network saturation;network load conditions;network traffic;free virtual output channels;threshold value;network congestion;message injection;",
	pages = "19 - 26",
	title = "{A} congestion control mechanism for wormhole networks",
	url = "http://dx.doi.org/10.1109/EMPDP.2001.904965",
	year = 2001
}

J M Orduna, Federico Silla and Jose Duato. Towards a communication-aware task scheduling strategy for heterogeneous systems. Computing and Informatics 20(3):245 - 67, 2001. BibTeX

@article{7109983,
	author = "J.M. Orduna and Silla, Federico and Duato, Jose",
	abstract = "Many research activities have focused on the problem of task scheduling in heterogeneous systems from the computational point of view. However, a scheduling strategy should also take into account the communication requirements of the applications and the communication bandwidth offered by the network. Towards this end, we first propose a model of communication cost between network nodes. This model can be used to properly characterize the existing network resources. Second, we propose a criterion to measure the suitability of each allocation of network resources to each parallel application, according to the communication requirements. Third, we propose a scheduling technique based exclusively on this criterion that provides a near-optimal mapping of processes to processors according to the communication requirements. Evaluation results show that the use of this scheduling technique fully exploits the available network bandwidth, greatly improving network performance. Therefore, the proposed scheduling technique can be used in the design of communication-aware scheduling strategies for those situations where the communication requirements are the system performance bottleneck",
	address = "Slovakia",
	issn = "0232-0274",
	journal = "Computing and Informatics",
	keywords = "directed graphs;performance evaluation;processor scheduling;resource allocation;trees (mathematics);workstation clusters;",
	note = "communication-aware task scheduling strategy;heterogeneous systems;communication cost;network nodes;network resources;parallel application;near-optimal mapping;available network bandwidth;network performance;performance bottleneck;interconnection networks;cluster computing;",
	number = 3,
	pages = "245 - 67",
	title = "{T}owards a communication-aware task scheduling strategy for heterogeneous systems",
	volume = 20,
	year = 2001
}

J C Sancho, Antonio Robles and Jose Duato. On the relative behavior of source and distributed routing in NOWs using Up*/Down* routing schemes. 2001, 11 - 18. URL BibTeX

@conference{6867162,
	author = "J.C. Sancho and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are arranged as a switch-based network with irregular topology, which makes routing and deadlock avoidance quite complicated. Current proposals use the up*/down* routing algorithm to remove cyclic dependencies between channels and avoid deadlock. Recently, a simple and effective methodology to compute up*/down* routing tables has been proposed by us. The resulting up*/down* routing scheme increases the number of alternative paths between every pair of switches and allows most messages to follow minimal paths. Also, up*/down* routing is suitable to be implemented using source or distributed routing. Source routing provides a safer and lower cost implementation of up*/down* routing than that provided by distributed routing. However distributed routing may benefit from routing messages through alternative paths to reach their destination. In this paper we evaluate the performance of up*/down* routing when using two methodologies to compute routing tables, and when both source and distributed routing are used. Evaluation results show that it is not worth to implement up*/down* routing in a distributed way in a NOW environment, since its performance is very close to that achieved by implementing it with source routing when a traffic-balancing algorithm is used. Moreover it is shown that a greater improvement in performance can be achieved by modifying the method to compute up*/down* routing tables when source routing is used",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings Ninth Euromicro Workshop on Parallel and Distributed Processing",
	keywords = "network routing;performance evaluation;system recovery;workstation clusters;",
	note = "distributed routing;NOWs;Up*/Down* routing schemes;networks of workstations;switch-based network;irregular topology;deadlock avoidance;cyclic dependencies;lower cost implementation;performance;",
	pages = "11 - 18",
	title = "{O}n the relative behavior of source and distributed routing in {NOW}s using {U}p*/{D}own* routing schemes",
	url = "http://dx.doi.org/10.1109/EMPDP.2001.904962",
	year = 2001
}

Xavier Molero, Federico Silla, Vicente Santonja and Jose Duato. On the scalability of topologies for storage area networks in building environments. 2001, 332 - 5. URL BibTeX

@conference{7114065,
	author = "Molero, Xavier and Silla, Federico and Santonja, Vicente and Duato, Jose",
	abstract = "Nowadays, the fast growth of data intensive applications is changing the way storage is devised. The traditional server-to-disk approach is being replaced by storage area networks (SANs), which are a separate network for storage, isolated from the messaging network and optimized for the movement of data between servers and storage devices (usually disks). We analyze the performance and cost scalability of a family of network topologies devised to be used in building environments. Performance simulation results combined with cost estimations have revealed that slight modifications in network topology can affect the overall scalability. In particular wraparound links connecting the lowest and highest floors in the building significantly affect the scalability of the network. Anyway, the use of this kind of links by itself does not provide the best solution. It is also necessary to have a good interconnection pattern in the backbone",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings IEEE International Symposium on Network Computing and Applications. NCA 2001",
	keywords = "digital storage;local area networks;network topology;telecommunication network routing;",
	note = "storage area networks;building environments;data intensive applications;servers;storage devices;cost scalability;performance simulation;cost estimations;network topology;wraparound links;interconnection pattern;backbone;",
	pages = "332 - 5",
	title = "{O}n the scalability of topologies for storage area networks in building environments",
	url = "http://dx.doi.org/10.1109/NCA.2001.962549",
	year = 2001
}

Xavier Molero, Federico Silla, Vicente Santonja and Jose Duato. On the switch architecture for fibre channel storage area networks. 2001, 484 - 491. URL BibTeX

@conference{2001416673902,
	author = "Molero, Xavier and Silla, Federico and Santonja, Vicente and Duato, Jose",
	abstract = "The fast growth of data intensive applications has caused a change in the traditional storage model. The server-to-disk approach is being replaced by storage area networks (SANs), which enable storage to be externalized from servers, thus allowing storage devices to be shared among multiple servers. Nowadays, the majority of SANs use Fibre Channel. The standard for Fibre Channel defines several issues related to the switch interface, but does not make any suggestion about the internal switch architecture to be implemented by manufacturers. In this paper we analyze the key architectural switch characteristics for building Fibre Channel storage area networks. To do so, our starting point is the performance analysis of two different switch architectures, identifying their strongest and weakest points, and thus taking advantage of the best features from both of them. After this first analysis, we introduce several other features in the switch, concluding with a proposed architecture that doubles network throughput while reducing response delay.",
	address = "Kyongju, Korea, Republic of",
	journal = "Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS",
	key = "Client server computer systems",
	keywords = "Computer architecture;Computer networks;Data storage equipment;Network protocols;",
	note = "Fiber channel;Storage area network;",
	pages = "484 - 491",
	title = "{O}n the switch architecture for fibre channel storage area networks",
	url = "http://dx.doi.org/10.1109/ICPADS.2001.934857",
	year = 2001
}

J Fernandez, J M Garcia and Jose Duato. Performance evaluation of real-time communication services on high-speed LANs under topology changes. 2001, 341 - 50. BibTeX

@conference{7307079,
	author = "J. Fernandez and J.M. Garcia and Duato, Jose",
	abstract = "Topology changes, such as switches being turned on/off, hot expansion, hot replacement or link re-mapping, are very likely to occur in NOWs and clusters. Moreover, topology changes are much more frequent than faults. However, their impact on real-time communications has not been considered a major problem up to now, mostly because they are not feasible in traditional environments, such as massive parallel processors (MPPs), which have fixed topologies. Topology changes are supported and handled by some current and future interconnects, such as Myrinet or Infiniband. Unfortunately, they do not include support for real-time communications in the presence of topology changes. In this paper, we evaluate a previously proposed protocol, called Dynamically Re-established Real-Time Channels (DRRTC) protocol, that provides topology change- and fault-tolerant real-time communication services on NOWs. We present and analyze the performance evaluation results when a single switch or a single link is deactivated/activated for different topologies and workloads. The simulation results suggest that topology change tolerance is only limited by the resources available to establish real-time channels as well as by the topology connectivity",
	address = "Berlin, Germany",
	journal = "High Performance Computing - HiPC 2001. 8th International Conference. Proceedings (Lecture Notes in Computer Science Vol.2238)",
	keywords = "performance evaluation;protocols;workstation clusters;",
	note = "NOWs;clusters;protocol;Dynamically Re-established Real-Time Channels;DRRTC;fault-tolerant;topology change;real-time communication;topology connectivity;networks of workstations;",
	pages = "341 - 50",
	title = "{P}erformance evaluation of real-time communication services on high-speed {LAN}s under topology changes",
	year = 2001
}

Xavier Molero, Federico Silla, Vicente Santonja and Jose Duato. On the impact of message packetization in networks of workstations with irregular topology. 2001, 3 - 10. URL BibTeX

@conference{6867161,
	author = "Molero, Xavier and Silla, Federico and Santonja, Vicente and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming an increasingly popular alternative to parallel computers for those applications with high needs of resources such as memory capacity and input/output storage space, and also for small scale parallel computing. Usually, the software messaging layers in these systems become a bottleneck due to the overhead they introduce. Some proposals like FM and BIP considerably reduce this overhead by splitting long messages into several packets. These proposals have been shown to improve communication performance. However, the effect of message packetization on the network interconnects has not been analyzed yet. In this paper we examine the effect of message packetization from the point of view of the interconnection network in the context of bimodal traffic. Two different routing algorithms have been considered: up*/down* and minimal adaptive routing. Our study shows that when the up*/down* routing algorithm is used, message packetization dramatically increases latency and reduces throughput for both long and short messages. On the other hand, if minimal adaptive routing is used, short messages could benefit from message packetization, but at the cost of increasing latency for long messages. In any case, network throughput is considerably reduced",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings Ninth Euromicro Workshop on Parallel and Distributed Processing",
	keywords = "multiprocessor interconnection networks;network routing;performance evaluation;workstation clusters;",
	note = "message packetization;networks of workstations;irregular topology;resources;memory capacity;input/output storage space;software messaging layers;interconnection network;bimodal traffic;routing algorithms;minimal adaptive routing;latency;",
	pages = "3 - 10",
	title = "{O}n the impact of message packetization in networks of workstations with irregular topology",
	url = "http://dx.doi.org/10.1109/EMPDP.2001.904960",
	year = 2001
}

Manuel E Acacio, Jose Gonzalez, Jose M Garcia and Jose Duato. New scalable directory architecture for large-scale multiprocessors. 2001, 97 - 106. BibTeX

@conference{2001385584315,
	author = "Manuel E. Acacio and Jose Gonzalez and Jose M. Garcia and Duato, Jose",
	abstract = "The memory overhead introduced by directories constitutes a major hurdle in the scalability of cc-NUMA architectures, which makes the shared-memory paradigm unfeasible for very large-scale systems. This work is focused on improving the scalability of shared-memory multiprocessors by significantly reducing the size of the directory. We propose multilayer clustering as an effective approach to reduce the directory-entry width. Detailed evaluation for 64 processors shows that using this approach we can drastically reduce the memory overhead, while suffering a performance degradation very similar to previous compressed schemes (such as Coarse Vector). In addition, a novel two-level directory architecture is proposed in order to eliminate the penalty caused by these compressed directories. This organization consists of a small Full-Map first-level directory (which provides precise information for the most recently referenced lines) and a compressed second-level directory (which provides in-excess information). Results show that a system with this directory architecture can achieve the same performance as a multiprocessor with a big and non-scalable Full-Map directory, with a very significant reduction of the memory overhead.",
	address = "Nuevo Leon, Mex",
	journal = "IEEE High-Performance Computer Architecture Symposium Proceedings",
	key = "Multiprocessing systems",
	keywords = "Computer architecture;Data storage equipment;Program processors;Storage allocation (computer);",
	note = "Multilayer clustering;Shared-memory multiprocessors;",
	pages = "97 - 106",
	title = "{N}ew scalable directory architecture for large-scale multiprocessors",
	year = 2001
}

R Casado, A Bermudez, F J Quiles and Jose Duato. Influence of network size and load on the performance of reconfiguration protocols. 2001, 46 - 57. URL BibTeX

@conference{7114036,
	author = "R. Casado and A. Bermudez and F.J. Quiles and Duato, Jose",
	abstract = "Switched point-to-point interconnection networks provide the high bandwidth and low latency required by current distributed applications. When the topology changes, a reconfiguration of the routing tables is performed to maintain network connectivity. In order to prevent deadlock, traditional reconfiguration schemes discard application traffic during the reconfiguration process. The consequence is that the network cannot provide the bandwidth demanded by user applications. In order to solve this problem, we proposed two deadlock-free schemes that allow traffic through the network while the reconfiguration is being performed. By using these schemes, the network is able to fulfill the application requirements. In this paper, we evaluate these traditional and novel reconfiguration schemes. In particular, we analyze the impact of network size and load on their behavior. Application traffic has been modeled by means of a self-similar pattern. Simulation results clearly show the large performance degradation associated with the traditional approach and the significant benefits that can be obtained by using dynamic reconfiguration techniques",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings IEEE International Symposium on Network Computing and Applications. NCA 2001",
	keywords = "local area networks;multiprocessor interconnection networks;performance evaluation;protocols;reconfigurable architectures;workstation clusters;",
	note = "interconnection networks;network connectivity;reconfiguration schemes;deadlock-free schemes;dynamic reconfiguration;point-to-point interconnection networks;system area networks;networks of workstations;reconfiguration protocol;performance evaluation;protocol;",
	pages = "46 - 57",
	title = "{I}nfluence of network size and load on the performance of reconfiguration protocols",
	url = "http://dx.doi.org/10.1109/NCA.2001.962515",
	year = 2001
}

Xavier Molero, Federico Silla, Vicente Santonja and Jose Duato. Improving network performance by efficiently dealing with short control messages in fibre channel SANs. 2001, 901 - 10. BibTeX

@conference{7219763,
	author = "Molero, Xavier and Silla, Federico and Santonja, Vicente and Duato, Jose",
	abstract = "Traffic in storage area networks (SANs) is bimodal, composed of long messages carrying several KBytes of data, and short messages containing control information (I/O commands). From the network point of view, latency of control messages is highly affected by the transmission of data messages, due to their length. As a consequence, it is necessary to establish management policies that benefit the transmission of short control messages, thus reducing the overall response time for I/O operations and increasing network throughput. We propose several strategies for dealing with short control messages and analyze their impact on the performance of storage area networks. This analysis is carried out for a fully adaptive routing algorithm in the context of two different network topology environments: buildings and departments. Simulation results show that both I/O response time and network throughput may be improved when efficiently managing control messages",
	address = "Berlin, Germany",
	journal = "Euro-Par 2001 Parallel Processing. 7th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol.2150)",
	keywords = "digital storage;local area networks;",
	note = "network performance;short control messages;fibre channel SANs;storage area networks;bimodal traffic;latency;data messages;management policies;response time;I/O operations;network topology environments;",
	pages = "901 - 10",
	title = "{I}mproving network performance by efficiently dealing with short control messages in fibre channel {SAN}s",
	year = 2001
}

E Moyano, F J Quiles, A Garrido, T Orozco-Barbosa and Jose Duato. Efficient 3D wavelet transform decomposition for video compression. 2001, 118 - 25. URL BibTeX

@conference{7005751,
	author = "E. Moyano and F.J. Quiles and A. Garrido and T. Orozco-Barbosa and Duato, Jose",
	abstract = "We present an efficient three-dimensional wavelet transform (3D-WT) algorithm for video compression. This algorithm performs the temporal decomposition of a video sequence in a more efficient way than the classical 3D-WT algorithm. We have conducted a set of experimental evaluations of the proposed algorithm using various video sequences. Experimental results show that our algorithm exhibits lower memory demands and lower latencies for the compression and decompression processes than the classical algorithm at the same compression ratio",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings Second International Workshop on Digital and Computational Video",
	keywords = "data compression;image sequences;transform coding;video coding;wavelet transforms;",
	note = "3D wavelet transform;video compression;three-dimensional wavelet transform;3D-WT;temporal decomposition;video sequences;memory demands;latencies;decompression;",
	pages = "118 - 25",
	title = "{E}fficient 3{D} wavelet transform decomposition for video compression",
	url = "http://dx.doi.org/10.1109/DCV.2001.929950",
	year = 2001
}

J C Sancho, Antonio Robles and Jose Duato. Effective strategy to compute forwarding tables for InfiniBand networks. 2001, 48 - 57. BibTeX

@conference{7081877,
	author = "J.C. Sancho and Robles, Antonio and Duato, Jose",
	abstract = "InfiniBand is very likely to become the de facto standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links that support any topology defined by the user. Routing in IBA is distributed, based on forwarding tables, and only considers the packet destination ID for routing within subnets. Up*/down* routing is the simplest and most popular routing algorithm for irregular topologies. Unfortunately, up*/down* routing cannot be used in IBA switches because it may lead to deadlock. In this paper we address this issue, proposing an easy-to-implement strategy to compute up*/down* forwarding tables for IBA switches that guarantees deadlock freedom, and is effective whatever the methodology applied to compute up*/down* routing tables. Preliminary evaluation results modeling an InfiniBand network at register transfer level show that the proposed strategy allows up*/down* routing algorithms to be implemented on InfiniBand networks with minimal performance degradation",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings International Conference on Parallel Processing",
	keywords = "concurrency control;multiprocessor interconnection networks;network routing;performance evaluation;",
	note = "forwarding tables;infiniBand networks;processing nodes;I/O devices;interprocessor communication;switch-based network;point-to-point links;packet destination;easy-to-implement strategy;register transfer level;minimal performance degradation;",
	pages = "48 - 57",
	title = "{E}ffective strategy to compute forwarding tables for {I}nfini{B}and networks",
	year = 2001
}

Pedro Lopez, Jose Flich and Jose Duato. Deadlock-free routing in InfiniBandTM through destination renaming. In Parallel Processing, International Conference on, 2001. 2001, 427 - 434. DOI BibTeX

@conference{952089,
	author = "Lopez, Pedro and Flich, Jose and Duato, Jose",
	abstract = "The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links that supports any topology defined by the user including irregular ones, in order to provide flexibility and incremental expansion capability. Routing in IBA is distributed, based on forwarding tables, and only considers the packet destination ID for routing within subnets in order to drastically reduce forwarding table size. Unfortunately, the forwarding tables for most of the previously proposed routing algorithms for irregular topologies consider both the destination ID and the input channel. Therefore, these popular routing algorithms for irregular topologies may not be usable in InfiniBand networks because they do not conform to the IBA specifications. In this paper we propose an easy-to-implement strategy to adapt the forwarding tables already computed following any routing algorithm that considers the destination ID and the input channel into the required IBA forwarding table format. The resulting routing algorithm is deadlock-free on IBA. Indeed, the originally computed paths are not modified at all. Hence, the proposed strategy does not degrade performance with respect to the original routing scheme.",
	booktitle = "Parallel Processing, International Conference on, 2001.",
	doi = "10.1109/ICPP.2001.952089",
	issn = "",
	keywords = "InfiniBand Architecture; deadlock-free; destination renaming; packet destination; routing algorithms; switch-based network; multiprocessor interconnection networks; network routing;",
	month = "3-7",
	pages = "427 - 434",
	title = "{D}eadlock-free routing in {I}nfini{B}and{TM} through destination renaming",
	year = 2001
}

Xavier Molero, Federico Silla, Vicente Santonja and Jose Duato. A tool for the design and evaluation of fibre channel storage area networks. 2001, 133 - 140. URL BibTeX

@conference{2001296584391,
author = "Molero, Xavier and Silla, Federico and Santonja, Vicente and Duato, Jose",
abstract = "The fast growth of data intensive applications has caused a change in the traditional storage model. The server-to-disk approach, usually implemented with SCSI buses, is being replaced by storage area networks (SANs), which enable storage to be externalized from servers, thus allowing storage devices to be shared among multiple servers. A SAN is a separate network for storage, isolated from the messaging network and optimized for the movement of data between servers and storage devices. Nowadays, most current SANs use Fibre Channel as the technology to move data between servers and storage devices. In order to design and evaluate the performance of these systems it is necessary to have adequate tools. Usually, performance evaluation may be based on analytical modeling or simulation. Each of them differs in their scope and applicability. However, the simulation modeling technique offers more freedom, flexibility, and accuracy than analytical methods. Thus, when evaluating the performance of SANs, simulation modeling should be used. In this paper we present the main capabilities of a simulator for Fibre Channel SANs, focusing on its input parameters and output variables. We also show several simple examples of performance measurements that can be obtained using this tool.",
address = "Seattle, WA, United States",
issn = 02724715,
journal = "Proceedings of the IEEE Annual Simulation Symposium",
key = "Data storage equipment",
keywords = "Client server computer systems;Communication channels;Computer simulation;Local area networks;Mathematical models;Optimization;",
note = "Fiber channel storage area networks;Multiple servers;",
pages = "133 - 140",
title = "{A} tool for the design and evaluation of fibre channel storage area networks",
url = "http://dx.doi.org/10.1109/SIMSYM.2001.922125",
year = 2001
}

J M Orduna, Federico Silla and Jose Duato. A new task mapping technique for communication-aware scheduling strategies. 2001, 349 - 54. URL BibTeX

@conference{7075370,
	author = "J.M. Orduna and Silla, Federico and Duato, Jose",
	abstract = "Clusters have become a very cost-effective platform for high-performance computing. In these systems, the trend is towards the interconnection network becoming the system bottleneck. Therefore, in the future, scheduling strategies will have to take into account the communication requirements of the applications and the communication bandwidth that the network can offer. One of the key issues in these strategies is the task mapping technique used when the network becomes the system bottleneck. In this paper, we propose an enhanced version of a previously proposed mapping technique that takes into account not only the existing network resources, but also the traffic generated by the applications. Also, we evaluate the mapping technique using real MPI application traces with timestamps. Evaluation results show that the use of the new mapping technique fully exploits the available network bandwidth, improving load balancing and increasing the throughput that can be delivered by the network",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings International Conference on Parallel Processing Workshops",
	keywords = "multiprocessor interconnection networks;performance evaluation;processor scheduling;workstation clusters;",
	note = "interconnection network;scheduling;clusters;communication-aware scheduling;mapping technique;MPI application traces;task mapping;",
	pages = "349 - 54",
	title = "{A} new task mapping technique for communication-aware scheduling strategies",
	url = "http://dx.doi.org/10.1109/ICPPW.2001.951971",
	year = 2001
}

M E Acacio, J Gonzalez, J M Garcia and Jose Duato. A new scalable directory architecture for large-scale multiprocessors. 2001, 97 - 106. URL BibTeX

@conference{6846670,
	author = "M.E. Acacio and J. Gonzalez and J.M. Garcia and Duato, Jose",
	abstract = "The memory overhead introduced by directories constitutes a major hurdle in the scalability of cc-NUMA architectures, which makes the shared-memory paradigm unfeasible for very large-scale systems. This work is focused on improving the scalability of shared-memory multiprocessors by significantly reducing the size of the directory. We propose multilayer clustering as an effective approach to reduce the directory-entry width. Detailed evaluation for 64 processors shows that using this approach we can drastically reduce the memory overhead, while suffering a performance degradation similar to previous compressed schemes (such as Coarse Vector). In addition, a novel two-level directory architecture is proposed in order to eliminate the penalty caused by these compressed directories. This organization consists of a small Full-Map first-level directory (which provides precise information for the most recently referenced lines) and a compressed second-level directory (which provides in-excess information). Results show that a system with this directory architecture can achieve the same performance as a multiprocessor with a big and non-scalable Full-Map directory with a very significant reduction of the memory overhead",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture",
	keywords = "parallel architectures;performance evaluation;shared memory systems;",
	note = "large-scale multiprocessors;scalable directory architecture;memory overhead;scalability;shared-memory multiprocessors;multilayer clustering;",
	pages = "97 - 106",
	title = "{A} new scalable directory architecture for large-scale multiprocessors",
	url = "http://dx.doi.org/10.1109/HPCA.2001.903255",
	year = 2001
}

Jose Duato and Timothy Mark Pinkston. A general theory for deadlock-free adaptive routing using a mixed set of resources. IEEE Transactions on Parallel and Distributed Systems 12(12):1219 - 1235, 2001. URL BibTeX

@article{2002056842591,
	author = "Duato, Jose and Timothy Mark Pinkston",
	abstract = "This paper presents a theoretical framework for the design of deadlock-free fully adaptive routing algorithms for a general class of network topologies and switching techniques in a single, unified theory. A general theory is proposed that allows the design of deadlock avoidance-based as well as deadlock recovery-based wormhole and virtual cut-through adaptive routing algorithms that use a homogeneous or a heterogeneous (mixed) set of resources. The theory also allows channel queues to be allocated nonatomically, utilizing resources efficiently. A general methodology for the design of fully adaptive routing algorithms applicable to arbitrary network topologies is also proposed. The proposed theory and methodology allow the design of efficient network routers that require minimal resources for handling infrequent deadlocks.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Parallel processing systems",
	keywords = "Adaptive algorithms;Interconnection networks;Queueing networks;Resource allocation;Routers;",
	note = "Adaptive routing algorithms;Irregular networks;Network routers;Network topologies;Nonatomic queue allocation;Regular networks;",
	number = 12,
	pages = "1219 - 1235",
	title = "{A} general theory for deadlock-free adaptive routing using a mixed set of resources",
	url = "http://dx.doi.org/10.1109/71.970556",
	volume = 12,
	year = 2001
}

M B Caminero, C Carrion, F J Quiles, Jose Duato and S Yalamanchili. A cost-effective hardware link scheduling algorithm for the multimedia router (MMR). 2001, 358 - 69. BibTeX

@conference{7175396,
	author = "M.B. Caminero and C. Carrion and F.J. Quiles and Duato, Jose and S. Yalamanchili",
	abstract = "The primary objective of the Multimedia Router (MMR) project is the design and implementation of a compact router optimized for multimedia applications. The router is targeted for use in cluster and LAN interconnection networks, which offer different constraints and therefore differing router solutions than WANs. One of the key elements in order to achieve these goals is the scheduling algorithm. The authors have proposed a link/switch scheduling algorithm that is capable of providing different QoS guarantees to flows as needed. This work focuses on the reduction of the hardware complexity necessary to implement such an algorithm. A novel priority algorithm is presented, and its hardware complexity is compared to that of the original proposal",
	address = "Berlin, Germany",
	journal = "Networking - ICN 2001. First International Conference on Networking. Proceedings, Part II (Lecture Notes in Computer Science Vol.2094)",
	keywords = "communication complexity;firmware;LAN interconnection;multimedia communication;performance evaluation;quality of service;scheduling;telecommunication computing;telecommunication network routing;telecommunication switching;",
	note = "multimedia router;cost-effective hardware link scheduling algorithm;optimized compact router;cluster networks;LAN interconnection networks;constraints;link/switch scheduling algorithm;service quality guarantees;hardware complexity reduction;priority algorithm;multimedia communications switching;performance evaluation;",
	pages = "358 - 69",
	title = "{A} cost-effective hardware link scheduling algorithm for the multimedia router ({MMR})",
	year = 2001
}

Jose Duato, Antonio Robles, Federico Silla and R Beivide. A Comparison of Router Architectures for Virtual Cut-Through and Wormhole Switching in a NOW Environment. Journal of Parallel and Distributed Computing 61(2):224 - 253, 2001. URL BibTeX

@article{2004488488316,
	author = "Duato, Jose and Robles, Antonio and Silla, Federico and R. Beivide",
	abstract = "Most multicomputer interconnection networks use wormhole switching, leading to fast and compact routers. Current routers incorporate virtual channels and even fully adaptive routing. Networks of workstations (NOWs) inherited multicomputer technology. Most commercial routers designed for NOWs implement wormhole switching. However, wormhole switching is not well suited for NOWs. The long wires required in this environment lead to large buffers to prevent buffer overflow during flow control signaling. Moreover, wire length is limited by buffer size. Virtual cut-through (VCT) achieves a higher throughput than wormhole switching. However, buffer requirements and packetizing overhead prevented its widespread use in multicomputers. Nevertheless, wormhole and VCT switching require similar buffer capacity in NOWs. Moreover, some messaging layers such as Illinois Fast Messages (FM) and BIP split messages into packets for increased performance. Therefore, the traditional disadvantages of VCT switching disappear in NOWs. In this paper, we show that VCT routers can be simpler than wormhole routers, while still achieving the advantages of using virtual channels and adaptive routing. We also propose a fully adaptive routing algorithm for VCT switching in a NOW environment. Moreover, we show that VCT routers outperform wormhole routers in a NOW environment at a lower cost. Also, VCT routers require buffer capacity independent of wire length, making them suitable for networks of workstations. © 2001 Academic Press.",
	address = "Orlando, United States",
	issn = 07437315,
	journal = "Journal of Parallel and Distributed Computing",
	number = 2,
	pages = "224 - 253",
	title = "{A} {C}omparison of {R}outer {A}rchitectures for {V}irtual {C}ut-{T}hrough and {W}ormhole {S}witching in a {NOW} {E}nvironment",
	url = "http://dx.doi.org/10.1006/jpdc.2000.1679",
	volume = 61,
	year = 2001
}

Rosa Alcover, Vicente Chirivella and Jose Duato. An accurate analysis of reliability parameters in meshes with fault-tolerant adaptive routing. In Parallel Architectures, Algorithms and Networks, 2000. I-SPAN 2000. Proceedings. International Symposium on. December 2000, 88 - 93. URL, DOI BibTeX

@conference{6832471,
	author = "Alcover, Rosa and Chirivella, Vicente and Duato, Jose",
	abstract = "The traditional approach to study fault-tolerance in multicomputer interconnection networks consists of determining the worst possible combination of faulty components that causes a network failure, and then assuming that this will occur. But the worst possible combination does not always occur, and the routing algorithm allows the network to work in the presence of a greater number of failures. The network reliability parameters computed according to the traditional approach will be under-estimated. In this paper we use a new methodology to compute accurately the reliability and availability functions. The reliability parameters have been computed for a network with mesh topology, taking into account size, routing algorithm, failure and repair rates of the network channels and coverage",
	booktitle = "Parallel Architectures, Algorithms and Networks, 2000. I-SPAN 2000. Proceedings. International Symposium on",
	doi = "10.1109/ISPAN.2000.900267",
	isbn = "0-7695-0936-3",
	journal = "Proceedings International Symposium on Parallel Architectures, Algorithms and Networks. I-SPAN 2000",
	keywords = "fault tolerant computing;multiprocessor interconnection networks;network routing;",
	month = "Dec",
	note = "reliability parameters;meshes;fault-tolerant adaptive routing;multicomputer interconnection networks;faulty components;network failure;routing algorithm;network reliability parameters;mesh topology;network channels;",
	pages = "88 - 93",
	title = "{A}n accurate analysis of reliability parameters in meshes with fault-tolerant adaptive routing",
	url = "http://dx.doi.org/10.1109/ISPAN.2000.900267",
	year = 2000
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Improving routing performance in Myrinet networks. In Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International. 2000, 27 - 32. URL, DOI BibTeX

@conference{845961,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. In some of these networks, packets are delivered using source routing. Due to the irregular topology, the routing scheme is often non-minimal. In this paper we analyze the routing scheme used in Myrinet networks in order to improve its performance. We propose new routing algorithms that balance the utilization of the available routes and always use minimal paths. We show through simulation that the current routing schemes used in Myrinet networks can be improved by modifying only the routing software without increasing the software overhead significantly. The overall throughput can be doubled without modifying the network hardware",
	booktitle = "Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International",
	doi = "10.1109/IPDPS.2000.845961",
	keywords = "Myrinet networks;NOWs;networks of workstations;routing performance;routing scheme;network routing;workstation clusters;",
	pages = "27 - 32",
	title = "{I}mproving routing performance in {M}yrinet networks",
	url = "http://dx.doi.org/10.1109/IPDPS.2000.845961",
	year = 2000
}

Elvira Baydal, Pedro Lopez and Jose Duato. A simple and efficient mechanism to prevent saturation in wormhole networks. In Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International. 2000, 617 - 622. URL, DOI BibTeX

@conference{846043,
author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
abstract = "Both deadlock avoidance and recovery techniques suffer from severe performance degradation when the network is close to or beyond saturation. This performance degradation appears because messages block in the network faster than they are drained by the escape paths in the deadlock avoidance strategies or the deadlock recovery mechanism. Many parallel applications produce bursty traffic that may saturate the network during some intervals, significantly increasing execution time. Therefore, the use of techniques that prevent network saturation is of crucial importance. Although several mechanisms have been proposed in the literature to reach this goal, some of them introduce some penalty when the network is not fully saturated, require complex hardware to be implemented or do not behave well under all network load conditions. In this paper we propose a new mechanism to avoid network saturation that overcomes these drawbacks",
booktitle = "Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International",
doi = "10.1109/IPDPS.2000.846043",
keywords = "deadlock avoidance;deadlock recovery;network saturation;performance degradation;wormhole networks;computer networks;concurrency control;multiprocessor interconnection networks;",
pages = "617 - 622",
title = "{A} simple and efficient mechanism to prevent saturation in wormhole networks",
url = "http://dx.doi.org/10.1109/IPDPS.2000.846043",
year = 2000
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Improving the performance of regular networks with source routing. In Parallel Processing, 2000. Proceedings. 2000 International Conference on. 2000, 353 - 361. URL, DOI BibTeX

@conference{876151,
author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. In these machines, the network connects processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Also, when performance is the primary concern, these network products are being used to build large commodity clusters with regular topologies. In previous papers, we have proposed the in-transit buffer mechanism to improve network performance, applying it to NOWs with irregular topology and source routing. This mechanism allows the use of minimal paths among all hosts, breaking cyclic dependencies between channels by storing and later re-injecting packets at some intermediate hosts. In this paper we apply the in-transit buffer mechanism to regular networks with source routing in order to improve their performance. Also, two path selection policies are evaluated. The first one will always choose the same minimal path from source to destination, whereas the second one will choose from different alternative minimal paths in a round-robin fashion. The evaluation results show that the overall network throughput can be doubled for large networks",
booktitle = "Parallel Processing, 2000. Proceedings. 2000 International Conference on",
doi = "10.1109/ICPP.2000.876151",
keywords = "NOWs;networks of workstations;parallel computers;path selection policies;regular networks;round-robin;source routing;buffer storage;network routing;performance evaluation;workstation clusters;",
pages = "353 - 361",
title = "{I}mproving the performance of regular networks with source routing",
url = "http://dx.doi.org/10.1109/ICPP.2000.876151",
year = 2000
}

Elvira Baydal, Pedro Lopez and Jose Duato. Simple and efficient mechanism to prevent saturation in wormhole networks. Proceedings of the International Parallel Processing Symposium, IPPS, pages 617 - 622, 2000. BibTeX

@article{2000265175264,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Both deadlock avoidance and recovery techniques suffer from severe performance degradation when the network is close to or beyond saturation. This performance degradation appears because messages block in the network faster than they are drained by the escape paths in the deadlock avoidance strategies or the deadlock recovery mechanism. Many parallel applications produce bursty traffic that may saturate the network during some intervals, significantly increasing execution time. Therefore, the use of techniques that prevent network saturation is of crucial importance. Although several mechanisms have been proposed in the literature to reach this goal, some of them introduce some penalty when the network is not fully saturated, require complex hardware to be implemented or do not behave well under all network load conditions. In this paper, we propose a new mechanism to avoid network saturation that overcomes these drawbacks.",
	address = "United States",
	issn = "1063-7133",
	journal = "Proceedings of the International Parallel Processing Symposium, IPPS",
	key = "Parallel processing systems",
	keywords = "Computer system recovery;Congestion control;Fault tolerant computer systems;Response time;Telecommunication traffic;",
	note = "Deadlock recovery methods;Wormhole networks;",
	pages = "617 - 622",
	title = "{S}imple and efficient mechanism to prevent saturation in wormhole networks",
	year = 2000
}

Juan Carlos Martinez, Federico Silla, Pedro Lopez and Jose Duato. On the influence of the selection function on the performance of networks of workstations. 2000, 292 - 9. BibTeX

@conference{6977556,
	author = "Martinez, Juan Carlos and Silla, Federico and Lopez, Pedro and Duato, Jose",
	abstract = "Previous research has pointed out the influence of adaptive routing on the performance improvement of interconnection networks for clusters of workstations. One of the design issues of adaptive routing algorithms is the selection function, which selects the output channel among all the available choices. We analyze in detail several selection functions in order to evaluate their influence on network performance. Simulation results show that network throughput may be increased up to 10%. When the network is close to saturation, improvements in latency up to 40% may be achieved",
	address = "Berlin, Germany",
	journal = "High Performance Computing. Third International Symposium, ISHPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1940)",
	keywords = "delays;multiprocessor interconnection networks;network routing;network topology;performance evaluation;workstation clusters;",
	note = "selection function;networks of workstations;interconnection networks;workstation clusters;adaptive routing algorithms;performance evaluation;network throughput;latency;",
	pages = "292 - 9",
	title = "{O}n the influence of the selection function on the performance of networks of workstations",
	year = 2000
}

JC Sancho, Antonio Robles and Jose Duato. Improving minimal adaptive routing in networks with irregular topology. In G Chaudhry and E Sha (eds.). PARALLEL AND DISTRIBUTED COMPUTING SYSTEMS. 2000, 314-319. BibTeX

@conference{ISI:000179773600050,
	author = "JC Sancho and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are being considered as a cost-effective alternative to parallel computers. Many NOWs are arranged as a switch-based network with irregular topology, which makes routing and deadlock avoidance quite complicated. Several current proposals, like up{*}/down{*} routing, avoid deadlock by removing cyclic dependencies between channels. A more efficient approach consists of allowing cyclic dependencies between channels while providing some escape paths to avoid deadlock. Minimal adaptive routing (MA) is a distributed adaptive routing algorithm that is able to use all the minimal paths and guarantees deadlock freedom by using up{*}/down{*} routing to route messages through the escape paths. Recently, a simple and effective methodology to compute up{*}/down{*} routing tables has been proposed by us. The resulting up{*}/down{*} routing scheme makes use of a different link direction assignment to compute routing tables. Assignment of link direction is based on generating an underlying acyclic connected graph from the network graph. In this paper, we analyze the influence of using the new methodology to compute up{*}/down{*} routing tables on the performance of the minimal adaptive routing algorithm. Evaluation results show that when the methodology to compute up{*}/down{*} routing tables is combined with minimal adaptive routing, an improvement in throughput of up to 40\% is achieved, also reducing latency.",
	booktitle = "PARALLEL AND DISTRIBUTED COMPUTING SYSTEMS",
	editor = "Chaudhry, G and Sha, E",
	isbn = "188084334X",
	note = "13th International Conference on Parallel and Distributed Computing Systems, LAS VEGAS, NV, AUG 08-10, 2000",
	pages = "314-319",
	title = "{I}mproving minimal adaptive routing in networks with irregular topology",
	year = 2000
}

M P Malumbres and Jose Duato. An efficient implementation of tree-based multicast routing for distributed shared-memory multiprocessors. Journal of Systems Architecture 46(11):1019 - 32, 2000. URL BibTeX

@article{6711597,
	author = "M.P. Malumbres and Duato, Jose",
	abstract = "This paper presents an efficient routing and flow control mechanism to implement multidestination message passing in wormhole networks. The mechanism is a variation of tree-based multicast with pruning to recover from deadlocks and it is well suited for distributed shared-memory multiprocessors (DSMs) with hardware cache coherence. It does not require any preprocessing of multicast messages reducing notably the software overhead required to send a multicast message. Also, it allows messages to use any deadlock-free routing function. The new scheme has been evaluated by simulation using synthetic loads. It achieves multicast latency reductions of 30% on average. Also it was compared with other multicast mechanisms proving its benefits. Finally, it can be easily implemented in hardware with minimal changes to existing unicast wormhole routers",
	address = "Netherlands",
	issn = "1383-7621",
	journal = "Journal of Systems Architecture",
	keywords = "distributed shared memory systems;message passing;multiprocessor interconnection networks;network routing;",
	note = "tree-based multicast routing;distributed shared-memory multiprocessors;flow control mechanism;multidestination message passing;wormhole networks;tree-based multicast;deadlocks;hardware cache coherence;multicast messages;software overhead;deadlock-free routing function;synthetic loads;multicast latency reductions;unicast wormhole routers;",
	number = 11,
	pages = "1019 - 32",
	title = "{A}n efficient implementation of tree-based multicast routing for distributed shared-memory multiprocessors",
	url = "http://dx.doi.org/10.1016/S1383-7621(00)00007-2",
	volume = 46,
	year = 2000
}

Elvira Baydal, Pedro Lopez and Jose Duato. A simple and efficient mechanism to prevent saturation in wormhole networks. 2000, 617 - 22. URL BibTeX

@conference{6590366,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Both deadlock avoidance and recovery techniques suffer from severe performance degradation when the network is close to or beyond saturation. This performance degradation appears because messages block in the network faster than they are drained by the escape paths in the deadlock avoidance strategies or the deadlock recovery mechanism. Many parallel applications produce bursty traffic that may saturate the network during some intervals, significantly increasing execution time. Therefore, the use of techniques that prevent network saturation is of crucial importance. Although several mechanisms have been proposed in the literature to reach this goal, some of them introduce some penalty when the network is not fully saturated, require complex hardware to be implemented or do not behave well under all network load conditions. In this paper we propose a new mechanism to avoid network saturation that overcomes these drawbacks",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000",
	keywords = "computer networks;concurrency control;multiprocessor interconnection networks;",
	note = "wormhole networks;deadlock avoidance;performance degradation;deadlock recovery;network saturation;",
	pages = "617 - 22",
	title = "{A} simple and efficient mechanism to prevent saturation in wormhole networks",
	url = "http://dx.doi.org/10.1109/IPDPS.2000.846043",
	year = 2000
}

JC Sancho, Antonio Robles and Jose Duato. A new methodology to compute deadlock-free routing tables for irregular networks. In B Falsafi and M Lauria (eds.). NETWORK-BASED PARALLEL COMPUTING, PROCEEDINGS - COMMUNICATION, ARCHITECTURE, AND APPLICATIONS 1797. 2000, 45-60. BibTeX

@conference{ISI:000171691200004,
	author = "JC Sancho and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are being considered as a cost-effective alternative to parallel computers. Many NOWs are arranged as a switch-based network with irregular topology, which makes routing and deadlock avoidance quite complicated. Current proposals use the up{*}/down{*} routing algorithm to remove cyclic dependencies between channels and avoid deadlock. However, routing is considerably restricted and most messages must follow non-minimal paths, increasing latency and wasting resources. In this paper, we propose a new methodology to compute deadlock-free routing tables for NOWs. The methodology tries to minimize the limitations of the current proposals in order to improve network performance. It is based on generating an underlying acyclic connected graph from the network graph and assigning a sequence number to each switch, which is used to remove cyclic dependencies. Evaluation results show that the routing algorithm based on the new methodology increases throughput by a factor of up to 2 in large networks, also reducing latency significantly.",
	booktitle = "NETWORK-BASED PARALLEL COMPUTING, PROCEEDINGS - COMMUNICATION, ARCHITECTURE, AND APPLICATIONS",
	editor = "Falsafi, B and Lauria, M",
	isbn = 3540678794,
	issn = "0302-9743",
	note = "4th International Workshop on Communication, Architecture, and Applications for Network-Based Parallel Computing (CANPC 2000), TOULOUSE, FRANCE, JAN 08, 2000",
	pages = "45-60",
	series = "LECTURE NOTES IN COMPUTER SCIENCE",
	title = "{A} new methodology to compute deadlock-free routing tables for irregular networks",
	volume = 1797,
	year = 2000
}

Young-Joo Suh, Binh Vien Dao, Jose Duato and Sudhakar Yalamanchili. Software-based rerouting for fault-tolerant pipelined communication. IEEE Transactions on Parallel and Distributed Systems 11(3):193 - 211, 2000. URL BibTeX

@article{2000295197161,
	author = "Young-Joo Suh and Binh Vien Dao and Duato, Jose and Sudhakar Yalamanchili",
	abstract = "This paper presents a software-based approach to fault-tolerant routing in networks using wormhole or virtual cut-through switching. When a message encounters a faulty output link, it is removed from the network by the local router and delivered to the messaging layer of the local node's operating system. The message passing software can reroute this message, possibly along nonminimal paths. Alternatively, the message may be addressed to an intermediate node, which will forward the message to the destination. A message may encounter multiple faults and pass through multiple intermediate nodes. The proposed techniques are applicable to both obliviously and adaptively routed networks. The techniques are specifically targeted toward commercial multiprocessors where the mean time to repair (MTTR) is much smaller than the mean time between router failures (MTBF), i.e., it is sufficient to tolerate a maximum of three failures. This paper presents requirements for buffer management, deadlock freedom, and livelock freedom. Simulation results are presented to evaluate the degradation in latency and throughput as a function of the number and distribution of faults. There are several advantages of such an approach. Router designs are minimally impacted, and thus remain compact and fast. Only messages that encounter faulty components are affected, while the machine is ensured of continued operation until the faulty components can be replaced. The technique leverages existing network technology, and the concepts are portable across evolving switch and router designs. Therefore, we feel that the technique is a good candidate for incorporation into the next generation of multiprocessor networks.",
	address = "Los Alamitos, CA, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Pipeline processing systems",
	keywords = "Computer simulation;Computer software;Congestion control (communication);Fault tolerant computer systems;Routers;Telecommunication traffic;",
	note = "Mean time between router failures (MTBF);Mean time to repair (MTTR);Wormhole switching;",
	number = 3,
	pages = "193 - 211",
	title = "{S}oftware-based rerouting for fault-tolerant pipelined communication",
	url = "http://dx.doi.org/10.1109/71.841738",
	volume = 11,
	year = 2000
}

Ruoming Pang, T M Pinkston and Jose Duato. The double scheme: deadlock-free dynamic reconfiguration of cut-through networks. 2000, 439 - 48. URL BibTeX

@conference{6742429,
	author = "Ruoming Pang and T.M. Pinkston and Duato, Jose",
	abstract = "Network-based computing systems often require the ability to reconfigure the routing algorithm to reflect changes in network topology if and when those changes occur. The process of reconfiguring a network's routing capabilities may lead to deadlock if not handled properly. In this paper we propose efficient and deadlock-free dynamic reconfiguration techniques that are generically applicable to distributed routing algorithms and networks, including those which use wormhole switching. The proposed techniques do not impede the transmission of packets during the reconfiguration process, thus providing increased network availability and quality-of-service (QoS) support as compared to traditional techniques based on static reconfiguration",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 2000 International Conference on Parallel Processing",
	keywords = "concurrency control;local area networks;multiprocessor interconnection networks;reconfigurable architectures;",
	note = "deadlock-free;dynamic reconfiguration;cut-through networks;network topology;distributed routing;wormhole switching;",
	pages = "439 - 48",
	title = "{T}he double scheme: deadlock-free dynamic reconfiguration of cut-through networks",
	url = "http://dx.doi.org/10.1109/ICPP.2000.876160",
	year = 2000
}

D Love, S Yalamanchili, Jose Duato, M B Caminero and F J Quiles. Switch scheduling in the multimedia router (MMR). 2000, 5 - 11. URL BibTeX

@conference{6590288,
	author = "D. Love and S. Yalamanchili and Duato, Jose and M.B. Caminero and F.J. Quiles",
	abstract = "The primary goal of the Multimedia Router (MMR) project is the design and implementation of a router optimized for multimedia applications. The router is targeted for use in cluster and LAN interconnection networks which offer different constraints and therefore differing router solutions than WANs. This paper describes and evaluates a switch scheduling algorithm based on a priority biasing scheme for dynamically updating the priorities of the connections established through the router. Unlike existing schemes that simply use the age of a flit as its priority, the novel feature of the proposed approach is that the priority is biased using the measured quality of service (QoS) values for the connection. Furthermore, the structure of the switch scheduling algorithm is motivated by opportunities for pipelined and concurrent operation so that scheduling decisions could be made at switching speeds. The performance of two of the many possible biasing functions is evaluated",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000",
	keywords = "LAN interconnection;local area networks;multimedia communication;",
	note = "multimedia router;multimedia;cluster;LAN interconnection networks;switch scheduling;quality of service;",
	pages = "5 - 11",
	title = "{S}witch scheduling in the multimedia router ({MMR})",
	url = "http://dx.doi.org/10.1109/IPDPS.2000.845958",
	year = 2000
}

D Love, S Yalamanchili, Jose Duato, M B Caminero and F J Quiles. Switch scheduling in the Multimedia Router (MMR). Proceedings of the International Parallel Processing Symposium, IPPS, pages 5 - 11, 2000. BibTeX

@article{2000265175186,
	author = "D. Love and S. Yalamanchili and Duato, Jose and M.B. Caminero and F.J. Quiles",
	abstract = "The primary goal of the Multimedia Router (MMR) project is the design and implementation of a router optimized for multimedia applications. The router is targeted for use in cluster and LAN interconnection networks which offer different constraints and therefore differing router solutions than WANs. This paper describes and evaluates a switch scheduling algorithm based on a priority biasing scheme for dynamically updating the priorities of the connections established through the router. Unlike existing schemes that simply use the age of a flit as its priority, the novel feature of the proposed approach is that the priority is biased using the measured quality of service (QoS) values for the connection. Furthermore, the structure of the switch scheduling algorithm is motivated by opportunities for pipelined and concurrent operation so that scheduling decisions could be made at switching speeds. The performance of two of the many possible biasing functions is evaluated.",
	address = "United States",
	issn = 10637133,
	journal = "Proceedings of the International Parallel Processing Symposium, IPPS",
	key = "Multimedia systems",
	keywords = "Algorithms;Data communication systems;Interconnection networks;Local area networks;Pipeline processing systems;Routers;Telecommunication services;",
	note = "Multimedia routers (MMR);Switch scheduling;",
	pages = "5 - 11",
	title = "{S}witch scheduling in the {M}ultimedia {R}outer ({MMR})",
	year = 2000
}

Federico Silla and Jose Duato. On the use of virtual channels in networks of workstations with irregular topology. IEEE Transactions on Parallel and Distributed Systems 11(8):813 - 828, 2000. URL BibTeX

@article{2000515393317,
	author = "Silla, Federico and Duato, Jose",
	abstract = "Networks of workstations are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect workstations using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Recently, we proposed two methodologies for the design of adaptive routing algorithms for networks with irregular topology, as well as fully adaptive routing algorithms for these networks. These algorithms increase throughput considerably with respect to previously existing ones, but require the use of at least two virtual channels. In this paper, we propose a very efficient flow control protocol to support virtual channels when link wires are very long and/or have different lengths. This flow control protocol relies on the use of channel pipelining and control flits. Control traffic is minimized by assigning physical bandwidth to virtual channels until the corresponding message blocks or it is completely transmitted. Simulation results show that this flow control protocol performs as efficiently as an ideal network with short wires and flit-by-flit multiplexing. The effect of additional virtual channels per physical channel has also been studied, revealing that the optimal number of virtual channels varies with network size. The use of virtual channel priorities is also analyzed. The proposed flow control protocol may increase short message latency, due to long messages monopolizing channels and hindering the progress of short messages. Therefore, we have analyzed the impact of limiting the number of flits (block size) that a virtual channel may forward once it gets the link. Simulation results show that limiting the maximum block size causes the overall network performance to decrease.",
	address = "Los Alamitos, CA, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Network protocols",
	keywords = "Adaptive algorithms;Bandwidth;Communication channels;Computer simulation;Computer workstations;Congestion control;Multiplexing;Pipeline processing systems;Telecommunication traffic;",
	note = "Adaptive routing algorithms;Block multiplexing;Channel pipelining;Virtual channels;Wormhole switching;",
	number = 8,
	pages = "813 - 828",
	title = "{O}n the use of virtual channels in networks of workstations with irregular topology",
	url = "http://dx.doi.org/10.1109/71.877939",
	volume = 11,
	year = 2000
}

Xavier Molero, Federico Silla, Vicente Santonja and Jose Duato. Performance analysis of storage area networks using high-speed LAN interconnects. 2000, 474 - 8. URL BibTeX

@conference{6783964,
	author = "Molero, Xavier and Silla, Federico and Santonja, Vicente and Duato, Jose",
	abstract = "Storage area networks (SANs) are an emerging data communications platform which interconnects servers and storage devices (such as disks, disk arrays, and tape drives) to create a pool of storage that users can access directly. SANs eliminate the bandwidth bottlenecks and scalability limitations imposed by previous SCSI bus-based architectures and LAN connections between servers and the stored data. This networking approach reports benefits such as computer clustering, topological flexibility, fault tolerance, high availability, and remote management. The prominent technology for implementing SANs is the fibre channel, due to the suitability of this technology for storage networking. Other technologies for high performance interconnects have also been developed. These interconnects provide switch-based networks with links transferring data at more than 1 Gigabit per second, being mainly used in the LAN environments. We analyze whether these high-speed LAN technologies could also be an interesting alternative to storage networking. We perform this analysis using real-world I/O traces. The main conclusion from our study is that most of the messages present the base network latency, meaning that the network is not heavily loaded. Moreover the response time is, in general, acceptable, being dominated by the time disks need to process the requests",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings IEEE International Conference on Networks 2000 (ICON 2000). Networking Trends and Challenges in the New Millennium",
	keywords = "data communication;digital storage;disc storage;fault tolerance;LAN interconnection;network servers;network topology;performance evaluation;",
	note = "storage area networks;high-speed LAN interconnects;performance analysis;data communications platform;servers interconnection;storage devices;disks;disk arrays;tape drives;computer clustering;topological flexibility;fault tolerance;high availability;fibre channel;switch-based networks;real-world I/O traces;network latency;response time;remote management;",
	pages = "474 - 8",
	title = "{P}erformance analysis of storage area networks using high-speed {LAN} interconnects",
	url = "http://dx.doi.org/10.1109/ICON.2000.875833",
	year = 2000
}

Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Performance evaluation of a new routing strategy for irregular networks with source routing. 2000, 34 - 43. URL BibTeX

@conference{7144248,
	author = "Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. In some of these networks, messages are delivered using the up*/down* routing algorithm. However, the up*/down* routing scheme is often non-minimal. Also, some of these networks use source routing. With this technique, the entire path to destination is generated at the source host before the message is sent. In this paper we develop a new mechanism in order to improve the performance of irregular networks with source routing, increasing overall throughput. With this mechanism, messages always use minimal paths. To avoid possible deadlocks, when necessary, routes between a pair of hosts are divided into sub-routes, and a special kind of virtual cut-through is performed at some intermediate hosts. We evaluate the new mechanism by simulation using parameters taken from the Myrinet network. We show that the current routing schemes used in Myrinet can be improved by modifying only the routing software without increasing its overhead significantly and, most importantly, without modifying the network hardware. The benefits of using the new routing scheme are noticeable for networks with 16 or more switches, and increase with network size. For 32 and 64-switch networks, throughput is increased on average by a factor ranging from 1.3 to 3.3",
	address = "New York, NY, USA",
	journal = "Conference Proceedings of the 2000 International Conference on Supercomputing",
	keywords = "multiprocessor interconnection networks;network routing;performance evaluation;",
	note = "performance evaluation;routing strategy;irregular networks;source routing;networks of workstations;deadlocks;virtual cut-through;Myrinet network;routing software;wormhole switching;minimal routing;",
	pages = "34 - 43",
	title = "{P}erformance evaluation of a new routing strategy for irregular networks with source routing",
	url = "http://dx.doi.org/10.1145/335231.335235",
	year = 2000
}

Rafael Casado, Aurelio Bermudez, Francisco J Quiles, Jose L Sanchez and Jose Duato. Performance evaluation of dynamic reconfiguration in high-speed local area networks. 2000, 85 - 96. BibTeX

@conference{2002126889956,
	author = "Rafael Casado and Aurelio Bermudez and Francisco J. Quiles and Jose L. Sanchez and Duato, Jose",
	abstract = "A new deadlock-free distributed reconfiguration algorithm that is able to asynchronously update routing tables without stopping the user traffic is proposed. This algorithm is valid for any topology, including regular as well as irregular topologies. Simulation results show that the behavior of such algorithm is significantly better than for other algorithms based on a spanning-tree formation.",
	address = "Toulouse, France",
	journal = "IEEE High-Performance Computer Architecture Symposium Proceedings",
	key = "Local area networks",
	keywords = "Algorithms;Computer simulation;Computer system recovery;Distributed computer systems;Multimedia systems;Packet switching;Quality of service;Real time systems;",
	note = "Dynamic reconfiguration;High speed networks;Network interface card;",
	pages = "85 - 96",
	title = "{P}erformance evaluation of dynamic reconfiguration in high-speed local area networks",
	year = 2000
}

Xavier Molero, Federico Silla, Vicente Santonja and Jose Duato. Performance sensitivity of routing algorithms to failures in networks of workstations. 2000, 230 - 42. BibTeX

@conference{6977549,
	author = "Molero, Xavier and Silla, Federico and Santonja, Vicente and Duato, Jose",
	abstract = "Networks of workstations (NOW) are becoming an increasingly popular alternative to parallel computers for those applications with high needs of resources such as memory capacity and input/output storage space, and also for small-scale parallel computing. Although the mean time between failures (MTBF) for individual links and switches in a NOW is very high, the probability of a failure occurrence dramatically increases as the network size becomes larger. Moreover, there are external factors, such as accidental link disconnections, that also can affect the overall NOW reliability. Until the faulty element is replaced, the NOW is functioning in a degraded mode. Thus, it becomes necessary to quantify how much the global NOW performance is reduced during the time the system remains in this state. We analyze the performance degradation of networks of workstations when failures in links or switches occur. Because the routing algorithm is a key issue in the design of a NOW, we quantify the sensitivity to failures of two routing algorithms: up*/down* and minimal adaptive routing algorithms. Simulation results show that, in general, up*/down* routing is highly robust to failures. On the other hand, the minimal adaptive routing algorithm presents a better performance, even in the presence of failures, but at the expense of a larger sensitivity",
	address = "Berlin, Germany",
	journal = "High Performance Computing. Third International Symposium, ISHPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1940)",
	keywords = "computer network reliability;network routing;performance evaluation;probability;workstation clusters;",
	note = "performance sensitivity;networks of workstations;NOW;small-scale parallel computing;mean time between failures;MTBF;failure probability;reliability;performance degradation;up*/down* routing algorithm;minimal adaptive routing algorithm;",
	pages = "230 - 42",
	title = "{P}erformance sensitivity of routing algorithms to failures in networks of workstations",
	year = 2000
}

F J Alfaro, A Bermudez, R Casado, Jose Duato, F J Quiles and J L Sanchez. On the performance of up*/down* routing. 2000, 61 - 72. BibTeX

@conference{6826450,
	author = "F.J. Alfaro and A. Bermudez and R. Casado and Duato, Jose and F.J. Quiles and J.L. Sanchez",
	abstract = "Networks of Workstations (NOWs) are usually arranged as a set of interconnected switches with hosts connected to switch ports through interface cards. Several commercial interconnects for high-speed NOWs use up*/down* routing. Every time the network is powered on or the topology is changed, a configuration algorithm is executed, which provides information about the topology and generates a directed graph. Routing tables are computed from this directed graph. There are several ways to obtain the directed graph. The most frequent way is by means of algorithms based on minimum-depth spanning-trees (MDST) or propagation-order spanning-trees (POST). This paper shows that, for most networks, graphs obtained by means of these methods can be improved in order to achieve higher network performance",
	address = "Berlin, Germany",
	journal = "Network-Based Parallel Computing. Communication, Architecture, and Applications. 4th International Workshop, CANPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1797)",
	keywords = "directed graphs;network routing;performance evaluation;workstation clusters;",
	note = "performance;up*/down* routing;networks of workstations;interconnected switches;configuration algorithm;directed graph;routing tables;minimum-depth spanning-trees;propagation-order spanning-trees;",
	pages = "61 - 72",
	title = "{O}n the performance of up*/down* routing",
	year = 2000
}

Xavier Molero, Federico Silla, Vicente Santonja and Jose Duato. On the effect of link failures in fibre channel storage area networks. 2000, 102 - 11. URL BibTeX

@conference{6832473,
	author = "Molero, Xavier and Silla, Federico and Santonja, Vicente and Duato, Jose",
	abstract = "The fast growth of data intensive applications has caused a change in the traditional storage model. The server-to-disk approach is being replaced by storage area networks (SANs), which enable storage to be externalized from servers, thus allowing storage devices to be shared among multiple servers. The prominent technology for implementing SANs is Fibre Channel, due to its suitability for storage networking. Although the probability of a link failure for individual links in a SAN is very low, this probability dramatically increases as the network size becomes larger. Moreover, there are external factors, such as accidental link disconnections, that also can affect the overall SAN reliability. Until the faulty element is replaced, the SAN is functioning in a degraded mode. In this paper we analyze by simulation the performance degradation of Fibre Channel storage area networks when failures in links occur, quantifying how much the global SAN performance is reduced during the time the system remains in the degraded state. We perform this analysis by using both synthetic and real I/O traffic. Simulation results show that performance degradation mainly depends on the routing algorithm and the switch architecture used",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings International Symposium on Parallel Architectures, Algorithms and Networks. I-SPAN 2000",
	keywords = "optical fibre LAN;optical storage;performance evaluation;",
	note = "link failures;fibre channel storage area networks;storage model;server-to-disk approach;multiple servers;link failure;network size;performance degradation;real I/O traffic;routing algorithm;switch architecture;",
	pages = "102 - 11",
	title = "{O}n the effect of link failures in fibre channel storage area networks",
	url = "http://dx.doi.org/10.1109/ISPAN.2000.900269",
	year = 2000
}

J M Orduna, Vicente Arnau, A Ruiz, R Valero and Jose Duato. On the design of communication-aware task scheduling strategies for heterogeneous systems. 2000, 391 - 8. URL BibTeX

@conference{6742424,
	author = "J.M. Orduna and Arnau, Vicente and A. Ruiz and R. Valero and Duato, Jose",
	abstract = "Many research activities have focused on the problem of task scheduling in heterogeneous systems from the computational point of view. However an ideal scheduling strategy would also take into account the communication requirements of the applications and the communication bandwidth that the network can offer. In this paper, we first propose a criterion to measure the suitability of each allocation of network resources to each parallel application, according to the communication requirements. Second, we propose a scheduling technique based exclusively on this criterion that provides a near-optimal mapping of processes to processors according to the communication requirements. Evaluation results show that the use of this scheduling technique fully exploits the available network bandwidth, greatly improving network performance. Therefore, the proposed scheduling technique may be used in the design of communication-aware scheduling strategies for those situations where the communication requirements are the system performance bottleneck",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 2000 International Conference on Parallel Processing",
	keywords = "processor scheduling;resource allocation;",
	note = "communication-aware task scheduling;heterogeneous systems;task scheduling;network resources;parallel application;scheduling technique;near-optimal mapping;",
	pages = "391 - 8",
	title = "{O}n the design of communication-aware task scheduling strategies for heterogeneous systems",
	url = "http://dx.doi.org/10.1109/ICPP.2000.876155",
	year = 2000
}

Xavier Molero, Federico Silla, Vicente Santonja and Jose Duato. Modeling and simulation of storage area networks. 2000, 307 - 14. URL BibTeX

@conference{6735495,
	author = "Molero, Xavier and Silla, Federico and Santonja, Vicente and Duato, Jose",
	abstract = "Storage area networks (SANs) are an emerging data communications platform which interconnects servers and storage devices (such as disks, disk arrays, and tape drives) to create a pool of storage that users can access directly. This networking approach reports benefits such as computer clustering, topological flexibility, fault tolerance, high availability, and remote management. In order to evaluate the performance of these systems it is necessary to have the adequate tools. Usually, performance evaluation may be based on analytical modeling or simulation. Each of them differs in their scope and applicability. However the simulation modeling technique offers more freedom, flexibility, and accuracy than the analytical methods. Thus, when evaluating the performance of SANs, simulation modeling should be used. In this paper the issues involved in the modeling and design of a very flexible and easy to use SAN simulator are presented. This tool is able to consider among others, both real-world I/O traces and synthetic I/O traffic, message packetization, faults in links and switches, virtual channels, different routing algorithms, etc. We describe its main internal organization, the basic modeling mechanisms the simulator is based on, the main input parameters and output performance variables. Also, the analysis of preliminary results using I/O traces is presented, showing that the storage network increases self-similarity of the traffic received by servers, latency variations are more important for control messages than for data messages, and links have a low utilization",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (Cat. No.PR00728)",
	keywords = "local area networks;performance evaluation;storage management;virtual machines;",
	note = "storage area networks;modeling;simulation;data communications platform;servers;storage devices;computer clustering;topological flexibility;fault tolerance;high availability;remote management;performance evaluation;real-world I/O traces;synthetic I/O traffic;message packetization;faults;virtual channel;routing algorithms;traffic self-similarity;control messages;data messages;",
	pages = "307 - 14",
	title = "{M}odeling and simulation of storage area networks",
	url = "http://dx.doi.org/10.1109/MASCOT.2000.876553",
	year = 2000
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Improving the performance of regular networks with source routing. 2000, 353 - 61. URL BibTeX

@conference{6742420,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. In these machines, the network connects processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Also, when performance is the primary concern, these network products are being used to build large commodity clusters with regular topologies. In previous papers, we have proposed the in-transit buffer mechanism to improve network performance, applying it to NOWs with irregular topology and source routing. This mechanism allows the use of minimal paths among all hosts, breaking cyclic dependencies between channels by storing and later re-injecting packets at some intermediate hosts. In this paper we apply the in-transit buffer mechanism to regular networks with source routing in order to improve their performance. Also, two path selection policies are evaluated. The first one will always choose the same minimal path from source to destination, whereas the second one will choose from different alternative minimal paths in a round-robin fashion. The evaluation results show that the overall network throughput can be doubled for large networks",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 2000 International Conference on Parallel Processing",
	keywords = "buffer storage;network routing;performance evaluation;workstation clusters;",
	note = "regular networks;source routing;networks of workstations;NOWs;parallel computers;path selection policies;round-robin;",
	pages = "353 - 61",
	title = "{I}mproving the performance of regular networks with source routing",
	url = "http://dx.doi.org/10.1109/ICPP.2000.876151",
	year = 2000
}

Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Improving routing performance in Myrinet networks. 2000, 27 - 32. URL BibTeX

@conference{6590291,
	author = "Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. In some of these networks, packets are delivered using source routing. Due to the irregular topology, the routing scheme is often non-minimal. In this paper we analyze the routing scheme used in Myrinet networks in order to improve its performance. We propose new routing algorithms that balance the utilization of the available routes and always use minimal paths. We show through simulation that the current routing schemes used in Myrinet networks can be improved by modifying only the routing software without increasing the software overhead significantly. The overall throughput can be doubled without modifying the network hardware",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000",
	keywords = "network routing;workstation clusters;",
	note = "routing performance;Myrinet networks;NOWs;networks of workstations;routing scheme;",
	pages = "27 - 32",
	title = "{I}mproving routing performance in {M}yrinet networks",
	url = "http://dx.doi.org/10.1109/IPDPS.2000.845961",
	year = 2000
}

Federico Silla and Jose Duato. High-performance routing in networks of workstations with irregular topology. IEEE Transactions on Parallel and Distributed Systems 11(7):699 - 719, 2000. URL BibTeX

@article{2000465351371,
	author = "Silla, Federico and Duato, Jose",
	abstract = "Networks of workstations are rapidly emerging as a cost-effective alternative to parallel computers. Switch-based interconnects with irregular topology allow the wiring flexibility, scalability, and incremental expansion capability required in this environment. However, the irregularity also makes routing and deadlock avoidance on such systems quite complicated. In current proposals, many messages are routed following nonminimal paths, increasing latency and wasting resources. In this paper, we propose two general methodologies for the design of adaptive routing algorithms for networks with irregular topology. Routing algorithms designed according to these methodologies allow messages to follow minimal paths in most cases, reducing message latency and increasing network throughput. As an example of application, we propose two adaptive routing algorithms for AN1 (previously known as Autonet). They can be implemented either by duplicating physical channels or by splitting each physical channel into two virtual channels. In the former case, the implementation does not require a new switch design. It only requires changing the routing tables and adding links in parallel with existing ones, taking advantage of spare switch ports. In the latter case, a new switch design is required, but the network topology is not changed. Evaluation results for several different topologies and message distributions show that the new routing algorithms are able to increase throughput for random traffic by a factor of up to 4 with respect to the original up*/down* algorithm, also reducing latency significantly. For other message distributions, throughput is increased more than seven times. We also show that most of the improvement comes from the use of minimal routing.",
	address = "Los Alamitos, CA, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Parallel processing systems",
	keywords = "Adaptive algorithms;Communication channels;Computer workstations;Congestion control;Interconnection networks;Response time;Telecommunication traffic;Topology;",
	note = "Adaptive routing algorithms;Wormhole switching;",
	number = 7,
	pages = "699 - 719",
	title = "{H}igh-performance routing in networks of workstations with irregular topology",
	url = "http://dx.doi.org/10.1109/71.877816",
	volume = 11,
	year = 2000
}

O Lysne and Jose Duato. Fast dynamic reconfiguration in irregular networks. 2000, 449 - 58. URL BibTeX

@conference{6742430,
	author = "O. Lysne and Duato, Jose",
	abstract = "Exploitation of the wiring flexibility in Networks of Workstations demands configuration methods that can handle dynamic changes in irregular topologies. During reconfiguration of a network based on virtual cut-through or wormhole switching, however, deadlocks in the transition phase between the old and the new routing function must be avoided. The avoidance of such deadlocks will in general make the performance of the network suffer during reconfiguration. Keeping reconfiguration time as short as possible, and leaving as much as possible of the network untouched is therefore of importance. We propose a method for dynamic reconfiguration of networks using up*/down* routing that aims at reducing the consequences of reconfiguration. This is done by identifying a restricted part of the network, the skyline, as the only part where a full reconfiguration is necessary. This means that most of the network does not need to take part in the reconfiguration at all (other than adding entries for new nodes, and removing entries for removed nodes). Experiments show that for the most frequent configuration changes the skyline will be empty in 85-95% of the cases, leaving the whole of the network operational through the entire reconfiguration. For the most dramatic changes in topology (the addition of a link connecting two previously disjoint networks), an average of 90% of the links can start using the new routing function immediately for some topologies. Our approach is in principle orthogonal to other approaches, thus existing methods for dynamic reconfiguration can be applied in the reconfiguration of the skyline",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 2000 International Conference on Parallel Processing",
	keywords = "concurrency control;multiprocessor interconnection networks;reconfigurable architectures;workstation clusters;",
	note = "dynamic reconfiguration;irregular networks;wiring flexibility;Networks of Workstations;deadlocks;reconfiguration;routing function;",
	pages = "449 - 58",
	title = "{F}ast dynamic reconfiguration in irregular networks",
	url = "http://dx.doi.org/10.1109/ICPP.2000.876161",
	year = 2000
}

A Bermudez, F J Alfaro, R Casado, Jose Duato, F J Quiles and J L Sanchez. Extending dynamic reconfiguration to NOWs with adaptive routing. 2000, 73 - 83. BibTeX

@conference{6826451,
	author = "A. Bermudez and F.J. Alfaro and R. Casado and Duato, Jose and F.J. Quiles and J.L. Sanchez",
	abstract = "Many distributed applications executed on networks of workstations (NOWs) require the interconnection network to provide some quality of service (QoS) support. These networks must be able to support topology changes (due to component failures, hot replacement, hot expansion, etc.) without stopping traffic, in order to satisfy QoS requirements. Traditional network reconfiguration methods do not take this into account, causing a serious performance degradation while the network is being reconfigured. Previously, we proposed a new dynamic network reconfiguration protocol, called Partial Progressive Reconfiguration. It significantly reduces the negative effects produced by traditional methods. For this reason, it is especially suitable for applications requiring QoS. This reconfiguration protocol requires that messages are routed using up*/down* routing. In this paper, we extend this dynamic reconfiguration technique to support adaptive routing, based on the design methodology for adaptive algorithms proposed previously. We also present performance evaluation results, clearly showing the benefits of using dynamic reconfiguration combined with adaptive routing",
	address = "Berlin, Germany",
	journal = "Network-Based Parallel Computing. Communication, Architecture, and Applications. 4th International Workshop, CANPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1797)",
	keywords = "performance evaluation;quality of service;workstation clusters;",
	note = "dynamic reconfiguration;adaptive routing;network of workstation;quality of service;topology changes;network reconfiguration methods;dynamic network reconfiguration protocol;Partial Progressive Reconfiguration;reconfiguration protocol;performance evaluation;",
	pages = "73 - 83",
	title = "{E}xtending dynamic reconfiguration to {NOW}s with adaptive routing",
	year = 2000
}

Jose Flich, Pedro Lopez, M P Malumbres, Jose Duato and T Rokicki. Combining in-transit buffers with optimized routing schemes to boost the performance of networks with source routing. 2000, 300 - 9. BibTeX

@conference{6977557,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose and T. Rokicki",
	abstract = "In previous papers we proposed the ITB mechanism to improve the performance of up*/down* routing in irregular networks with source routing. With this mechanism, both minimal routing and a better use of network links are guaranteed, resulting in an overall network performance improvement. In this paper, we show that the ITB mechanism can be used with any source routing scheme in the NOW environment. In particular, we apply ITB to DFS and Smart routing algorithms, which provide better routes than up*/down* routing. Results show that ITB strongly improves DFS (by 63%, for 64-switch networks) and Smart throughput (23%, for 32-switch networks)",
	address = "Berlin, Germany",
	journal = "High Performance Computing. Third International Symposium, ISHPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1940)",
	keywords = "buffer storage;network routing;performance evaluation;workstation clusters;",
	note = "in-transit buffers;optimized routing schemes;network performance;source routing;ITB mechanism;NOW;Smart routing algorithm;DFS routing algorithm;",
	pages = "300 - 9",
	title = "{C}ombining in-transit buffers with optimized routing schemes to boost the performance of networks with source routing",
	year = 2000
}

P Holenarsipur, V Yarmolenko, Jose Duato, D K Panda and P Sadayappan. Characterization and enhancement of static mapping heuristics for heterogeneous systems. 2000, 37 - 48. BibTeX

@conference{6984097,
	author = "P. Holenarsipur and V. Yarmolenko and Duato, Jose and D.K. Panda and P. Sadayappan",
	abstract = "Heterogeneous computing environments have become attractive platforms to schedule computationally intensive jobs. We consider the problem of mapping independent tasks onto machines in a heterogeneous environment where expected execution time of each task on each machine is known. Although this problem has been much studied in the past, we derive new insights into the effectiveness of different mapping heuristics by use of two metrics: efficacy (E) and utilization (U). Whereas there is no consistent rank ordering of the various previously proposed mapping heuristics on the basis of total task completion time, we find a very consistent rank ordering of the mapping schemes with respect to the new metrics. Minimization of total completion time requires maximization of the product E $\times$ U. Using the insights provided by the metrics, we develop a new matching heuristic that produces high-quality mappings using much less time than the most effective previously proposed schemes",
	address = "Berlin, Germany",
	journal = "High Performance Computing - HiPC 2000. 7th International Conference. Proceedings (Lecture Notes in Computer Science Vol.1970)",
	keywords = "performance evaluation;processor scheduling;resource allocation;workstation clusters;",
	note = "static mapping heuristics;heterogeneous systems;heterogeneous computing;cluster computing;scheduling;task assignment;mapping heuristics;performance evaluation;",
	pages = "37 - 48",
	title = "{C}haracterization and enhancement of static mapping heuristics for heterogeneous systems",
	year = 2000
}

J M Orduna, Vicente Arnau and Jose Duato. Characterization of communications between processes in message-passing applications. 2000, 91 - 8. URL BibTeX

@conference{6805977,
author = "J.M. Orduna and Arnau, Vicente and Duato, Jose",
abstract = "Many research activities have focused on the problem of task scheduling in heterogeneous systems from the computational point of view. However, an ideal scheduling strategy would also take into account the communication requirements of the applications and the communication bandwidth available in the network. One of the major problems to be solved in the development of this scheduling strategy is precisely the measurement of the communication requirements for each application. We propose a clustering-based method to characterize the communications between processes generated by message-passing applications. This technique provides a model consisting of several partitions of the processes generated by the application. Also, we propose a criterion to measure the quality of the obtained partitions. This approach can be used when a given application is repeatedly executed with different input data. Results show that the proposed method can provide a partition with the highest ratio between the intracluster and the intercluster required communication bandwidth. This partition can be used to map groups of processes to processors in the heterogeneous system",
address = "Los Alamitos, CA, USA",
journal = "Proceedings IEEE International Conference on Cluster Computing. CLUSTER 2000",
keywords = "communication complexity;message passing;parallel programming;processor scheduling;workstation clusters;",
note = "interprocess communication;message passing applications;task scheduling;heterogeneous systems;clustering-based method;intracluster communication bandwidth;intercluster communication bandwidth;",
pages = "91 - 8",
title = "{C}haracterization of communications between processes in message-passing applications",
url = "http://dx.doi.org/10.1109/CLUSTR.2000.889009",
year = 2000
}

V Yarmolenko, Jose Duato, D K Panda and P Sadayappan. Characterization and enhancement of dynamic mapping heuristics for heterogeneous systems. 2000, 437 - 44. URL BibTeX

@conference{6728039,
	author = "V. Yarmolenko and Duato, Jose and D.K. Panda and P. Sadayappan",
	abstract = "Clusters of heterogeneous PCs/workstations have become attractive systems for executing a set of computationally intensive independent tasks. This paper focuses on scheduling schemes in a dynamic context - i.e. where scheduling decisions are made periodically as jobs arrive, in contrast to static scheduling where scheduling is performed after all jobs have been submitted. The paper characterizes different scheduling schemes with respect to varying arrival rates and burstiness in the job arrival rate. Using the insights gained by the characterization, a set of approaches are proposed to improve the previously developed strategies with respect to turnaround time. Simulation results indicate improvements of up to 40% in turnaround time by using the proposed enhancements",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 2000. International Workshop on Parallel Processing",
	keywords = "performance evaluation;processor scheduling;queueing theory;workstation clusters;",
	note = "dynamic mapping heuristics;heterogeneous PC clusters;heterogeneous workstation clusters;computationally intensive independent tasks;task scheduling schemes;periodic scheduling decisions;bursty job arrival rate;turnaround time;simulation;performance evaluation;",
	pages = "437 - 44",
	title = "{C}haracterization and enhancement of dynamic mapping heuristics for heterogeneous systems",
	url = "http://dx.doi.org/10.1109/ICPPW.2000.869149",
	year = 2000
}

D Buntinas, D K Panda, Jose Duato and P Sadayappan. Broadcast/multicast over Myrinet using NIC-assisted multidestination messages. 2000, 115 - 29. BibTeX

@conference{6826454,
	author = "D. Buntinas and D.K. Panda and Duato, Jose and P. Sadayappan",
	abstract = "Broadcasting and multicasting are common operations in parallel and distributed programs. Some modern Network Interface Cards (NICs) have programmable processors which can be used to provide support for these operations. However, these processors are 5-15 times slower than the host processor. In this paper we propose a design and an implementation of a multi-send primitive to support efficient broadcast/multicast that requires minimal assistance from the NIC. Our scheme is designed with the idea that as much processing as possible should be done by the host processor. This gives us more flexibility with, for example, creating multicast trees which would be optimal for a particular message size, or choosing a multicast tree dynamically based on requirements of bandwidth versus latency for a particular message. We have designed a multi-send primitive and implemented it as an addition to Fast-Messages (FM) 2.1 running over a Myrinet network. The proposed scheme does less processing at the NIC. The impact of adding such NIC-assisted multicast operation to a run-time system is also very small, less than 500 ns for non-multi-send packets. To fully utilize the benefits of this primitive, we propose a method for constructing an optimal multicast tree using the new primitive. We have evaluated this scheme and obtained a speedup factor of up to 1.85 for multicasting 16 K messages with 16 nodes",
	address = "Berlin, Germany",
	journal = "Network-Based Parallel Computing. Communication, Architecture, and Applications. 4th International Workshop, CANPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1797)",
	keywords = "computer networks;multicast communication;network interfaces;",
	note = "Myrinet;multidestination messages;Network Interface Cards;multi-send primitive;multicast tree;speedup factor;",
	pages = "115 - 29",
	title = "{B}roadcast/multicast over {M}yrinet using {NIC}-assisted multidestination messages",
	year = 2000
}

J C Sancho, Antonio Robles and Jose Duato. A new methodology to compute deadlock-free routing tables for irregular networks. 2000, 45 - 60. BibTeX

@conference{6826449,
	author = "J.C. Sancho and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are being considered as a cost-effective alternative to parallel computers. Many NOWs are arranged as a switch-based network with irregular topology, which makes routing and deadlock avoidance quite complicated. Current proposals use the up*/down* routing algorithm to remove cyclic dependencies between channels and avoid deadlock. However, routing is considerably restricted and most messages must follow non-minimal paths, increasing latency and wasting resources. In this paper, we propose a new methodology to compute deadlock-free routing tables for NOWs. The methodology tries to minimize the limitations of the current proposals in order to improve network performance. It is based on generating an underlying acyclic connected graph from the network graph and assigning a sequence number to each switch, which is used to remove cyclic dependencies. Evaluation results show that the routing algorithm based on the new methodology increases throughput by a factor of up to 2 in large networks, also reducing latency significantly",
	address = "Berlin, Germany",
	journal = "Network-Based Parallel Computing. Communication, Architecture, and Applications. 4th International Workshop, CANPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1797)",
	keywords = "performance evaluation;workstation clusters;",
	note = "deadlock-free routing tables;irregular networks;networks of workstations;switch-based network;irregular topology;routing;deadlock avoidance;cyclic dependencies;latency;network performance;acyclic connected graph;network graph;",
	pages = "45 - 60",
	title = "{A} new methodology to compute deadlock-free routing tables for irregular networks",
	year = 2000
}

G Bernabe, J Gonzalez, J M Garcia and Jose Duato. A new lossy 3-D wavelet transform for high-quality compression of medical video. 2000, 226 - 31. URL BibTeX

@conference{6806067,
	author = "G. Bernabe and J. Gonzalez and J.M. Garcia and Duato, Jose",
	abstract = "The authors present a new compression scheme based on applying the 3D Fast Wavelet Transform, to code medical video. This video has special features such as its representation in gray scale, the small amount of interframe variations, and the quality requirements of the reconstructed images. These characteristics as well as the social impact of desired applications deserve the design and implementation of coding schemes especially oriented to exploit its features. We analyze different parameters of the codification process, such as the utilization of different wavelet functions, the number of steps this function is applied, the way the thresholds are chosen, and the selected methods in the quantization and entropy encoder. Our coder achieves a good trade-off between compression ratio and quality of the reconstructed video. These results are better than MPEG-2, without the complexity of motion compensation",
	address = "Piscataway, NJ, USA",
	journal = "Proceedings 2000 IEEE EMBS International Conference on Information Technology Applications in Biomedicine. ITAB-ITIS 2000. Joint Meeting Third IEEE EMBS International Conference on Information Technology Applications in Biomedicine (ITAB'00). Third Worksh",
	keywords = "data compression;medical image processing;telemedicine;video coding;wavelet transforms;",
	note = "lossy 3D wavelet transform;high-quality compression;medical video compression;compression scheme;3D Fast Wavelet Transform;medical video coding;interframe variations;quality requirements;reconstructed images;social impact;coding schemes;codification process;wavelet functions;quantization;entropy encoder;compression ratio;reconstructed video;MPEG-2;motion compensation;",
	pages = "226 - 31",
	title = "{A} new lossy 3-{D} wavelet transform for high-quality compression of medical video",
	url = "http://dx.doi.org/10.1109/ITAB.2000.892391",
	year = 2000
}

J C Sancho, Antonio Robles and Jose Duato. A flexible routing scheme for networks of workstations. 2000, 260 - 7. BibTeX

@conference{6977552,
	author = "J.C. Sancho and Robles, Antonio and Duato, Jose",
	abstract = "NOWs are arranged as a switch-based network which allows the layout of both regular and irregular topologies. However, the irregular pattern interconnect makes routing and deadlock avoidance quite complicated. Current proposals use the up*/down* routing algorithm to remove cyclic dependencies between channels and avoid deadlock. Recently, a simple and effective methodology to compute up*/down* routing tables has been proposed by us. The resulting routing algorithm is very effective in irregular topologies. However, its behavior is very poor in regular networks with orthogonal dimensions. Therefore, we propose a more flexible routing scheme that is effective in both regular and irregular topologies. Unlike up*/down* routing algorithms, the proposed routing algorithm breaks cycles at different nodes for each direction in the cycle, thus providing better traffic balancing than that provided by up*/down* routing algorithms. Evaluation results modeling a Myrinet network show that the new routing algorithm increases throughput with respect to the original up*/down* routing algorithm by a factor of up to 3.5 for regular networks, also maintaining the performance of the improved up*/down* routing scheme proposed in Sancho et al. (2000) when applied to irregular networks",
	address = "Berlin, Germany",
	journal = "High Performance Computing. Third International Symposium, ISHPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1940)",
	keywords = "concurrency control;network routing;network topology;performance evaluation;workstation clusters;",
	note = "networks of workstations;routing scheme;NOW;regular topologies;irregular topologies;deadlock avoidance;traffic balancing;Myrinet network;performance;",
	pages = "260 - 7",
	title = "{A} flexible routing scheme for networks of workstations",
	year = 2000
}

Pedro Lopez, Rosa Alcover, Jose Duato and L Zunica. Optimizing network throughput: optimal versus robust design. In Parallel and Distributed Processing, 1999. PDP '99. Proceedings of the Seventh Euromicro Workshop on. February 1999, 45 - 52. URL, DOI BibTeX

@conference{746644,
author = "Lopez, Pedro and Alcover, Rosa and Duato, Jose and L. Zunica",
abstract = "Interconnection network performance is usually measured in terms of its latency (time required to deliver a message) and throughput (maximum traffic accepted by the network). At first glance, minimizing average message latency is the main designer goal, because average network traffic is usually far from saturation. However, applications can also generate very high peak traffic. In order to deal with such situations, it is important that network throughput is also high. On the other hand, interconnection network performance depends on several parameters. Some of them can be chosen by the designer: routing algorithm, switching technique, topology and node design parameters. However, there are other parameters that cannot be selected by the designer. Among these, there are parameters that depend on the application, such as message size, message destination distribution and message traffic, as well as parameters defined by the customer, such as network size. The network designer can select the design parameters that maximize average (optimal design) or the design parameters that achieve a good performance under all the feasible combinations of the parameters that cannot be selected by the designer (robust design). Notice that both alternatives do not always lead to the same parameter configuration. Previously, we chose the design parameters of a k-ary n-cube network to optimize latency. In this case, optimal and robust design lead to the same choice. In this paper, we obtain these design parameters to optimize network throughput. Unfortunately, there is a discrepancy between optimal and robust design criteria, the former being the best choice",
booktitle = "Parallel and Distributed Processing, 1999. PDP '99. Proceedings of the Seventh Euromicro Workshop on",
doi = "10.1109/EMPDP.1999.746644",
isbn = "0-7695-0059-5",
issn = "1066-6192",
keywords = "average message latency;average network traffic;interconnection network performance;latency;message destination distribution;network throughput optimisation;node design parameters;optimal design;parameter configuration;robust design;routing algorithm;swit",
month = "feb",
pages = "45 - 52",
title = "{O}ptimizing network throughput: optimal versus robust design",
url = "http://dx.doi.org/10.1109/EMPDP.1999.746644",
year = 1999
}

Pedro Lopez, Rosa Alcover, Jose Duato and L Zunica. Optimizing network throughput: optimal versus robust design. 1999, 45 - 52. URL BibTeX

@conference{6169182,
author = "Lopez, Pedro and Alcover, Rosa and Duato, Jose and L. Zunica",
abstract = "Interconnection network performance is usually measured in terms of its latency (time required to deliver a message) and throughput (maximum traffic accepted by the network). At first glance, minimizing average message latency is the main designer goal, because average network traffic is usually far from saturation. However, applications can also generate very high peak traffic. In order to deal with such situations, it is important that network throughput is also high. On the other hand, interconnection network performance depends on several parameters. Some of them can be chosen by the designer: routing algorithm, switching technique, topology and node design parameters. However, there are other parameters that cannot be selected by the designer. Among these, there are parameters that depend on the application, such as message size, message destination distribution and message traffic, as well as parameters defined by the customer, such as network size. The network designer can select the design parameters that maximize average (optimal design) or the design parameters that achieve a good performance under all the feasible combinations of the parameters that cannot be selected by the designer (robust design). Notice that both alternatives do not always lead to the same parameter configuration. Previously, we chose the design parameters of a k-ary n-cube network to optimize latency. In this case, optimal and robust design lead to the same choice. In this paper, we obtain these design parameters to optimize network throughput. Unfortunately, there is a discrepancy between optimal and robust design criteria, the former being the best choice",
address = "Los Alamitos, CA, USA",
journal = "Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99",
keywords = "multiprocessor interconnection networks;performance evaluation;telecommunication network routing;",
note = "network throughput optimisation;robust design;optimal design;interconnection network performance;latency;average message latency;average network traffic;routing algorithm;switching technique;node design parameters;message destination distribution;parameter configuration;",
pages = "45 - 52",
title = "{O}ptimizing network throughput: optimal versus robust design",
url = "http://dx.doi.org/10.1109/EMPDP.1999.746644",
year = 1999
}

Rafael Casado, Aurelio Bermudez, Francisco J Quiles, Jose L Sanchez and Jose Duato. Performance evaluation of dynamic reconfiguration in high-speed local area networks. 1999, 85 - 96. BibTeX

@conference{2000215131150,
	author = "Rafael Casado and Aurelio Bermudez and Francisco J. Quiles and Jose L. Sanchez and Duato, Jose",
	abstract = "A new deadlock-free distributed reconfiguration algorithm that is able to asynchronously update routing tables without stopping user traffic is proposed. The algorithm is valid for any topology, including regular as well as irregular topologies. Simulation results show that the behavior of the algorithm is significantly better than for other algorithms based on a spanning-tree formation.",
	address = "Toulouse, France",
	journal = "IEEE High-Performance Computer Architecture Symposium Proceedings",
	key = "Local area networks",
	keywords = "Algorithms;Computer simulation;Computer system recovery;Congestion control (communication);Distributed computer systems;Electric network topology;Multimedia systems;Packet switching;Performance;Telecommunication traffic;",
	note = "Deadlock free distributed reconfiguration algorithm;Network interface card;Quality of service;Spanning tree formation;",
	pages = "85 - 96",
	title = "{P}erformance evaluation of dynamic reconfiguration in high-speed local area networks",
	year = 1999
}

R Casado, A Bermudez, F J Quiles, J L Sanchez and Jose Duato. Performance evaluation of dynamic reconfiguration in high-speed local area networks. 1999, 85 - 96. URL BibTeX

@conference{6498655,
	author = "R. Casado and A. Bermudez and F.J. Quiles and J.L. Sanchez and Duato, Jose",
	abstract = "High-speed local area networks (LANs) consist of a set of switches connected by point-to-point links, and hosts linked to switches through a network interface card. High-speed LANs may change their topology due to switches and hosts being turned on/off, link remapping, and component failures. In these cases, a distributed reconfiguration algorithm analyzes the topology, computes the new routing tables, and downloads them to the corresponding switches. Unfortunately, in most cases, user traffic is stopped during the reconfiguration process to avoid deadlock. Although network reconfigurations are not frequent, static reconfiguration such as this may take hundreds of milliseconds to execute, thus degrading system availability significantly. In this paper, we propose a new deadlock-free distributed reconfiguration algorithm that is able to asynchronously update routing tables without stopping user traffic. This algorithm is valid for any topology, including regular as well as irregular topologies. Simulation results show that the behavior of our algorithm is significantly better than for other algorithms based on a spanning-tree formation",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550)",
	keywords = "concurrency control;digital simulation;local area networks;performance evaluation;quality of service;system recovery;",
	note = "performance evaluation;dynamic reconfiguration;high-speed local area networks;point-to-point links;network interface card;link remapping;component failures;distributed reconfiguration algorithm;deadlock;simulation results;spanning-tree formation;",
	pages = "85 - 96",
	title = "{P}erformance evaluation of dynamic reconfiguration in high-speed local area networks",
	url = "http://dx.doi.org/10.1109/HPCA.2000.824341",
	year = 1999
}

Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Performance evaluation of networks of workstations with hardware shared memory model using execution-driven simulation. In Parallel Processing, 1999. Proceedings. 1999 International Conference on. 1999, 146 - 153. DOI BibTeX

@conference{797399,
author = "Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose",
abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Similar to the evolution of parallel computers, NOWs are also evolving from the distributed memory to the shared memory programming model. However, physical distances between processors are longer in NOWs than in tightly-coupled distributed shared-memory multiprocessors (DSMs), leading to higher message latency and lower network bandwidth. Therefore, the network may be a bottleneck when executing some parallel applications in a NOW supporting a shared-memory programming paradigm. In this paper we analyze whether the interconnection network is able to efficiently handle the traffic generated in a NOW with the shared memory model. In particular, we are interested in analyzing the influence of the routing mechanism on the performance of the system. We evaluate the behavior of a NOW with irregular topology by means of an execution-driven simulator using SPLASH-2 applications as the input load. The results show that the routing algorithm can considerably reduce the total execution time of applications. In particular, routing adaptivity can reduce the total execution time by 58% in some applications. These results confirm the behavior observed in previous works using synthetic traffic loads",
booktitle = "Parallel Processing, 1999. Proceedings. 1999 International Conference on",
doi = "10.1109/ICPP.1999.797399",
keywords = "SPLASH-2;distributed shared-memory multiprocessors;execution-driven simulation;execution-driven simulator;hardware shared memory model;incremental expansion capability;interconnection network;irregular topologies;message latency;networks of workstations;p",
pages = "146 - 153",
title = "{P}erformance evaluation of networks of workstations with hardware shared memory model using execution-driven simulation",
year = 1999
}

B Caminero, F J Quiles, Jose Duato, D S Love and S Yalamanchili. Performance evaluation of the multimedia router with MPEG-2 video traffic. 1999, 62 - 76. BibTeX

@conference{6429570,
	author = "B. Caminero and F.J. Quiles and Duato, Jose and D.S. Love and S. Yalamanchili",
	abstract = "The Multimedia Router (MMR) architecture is aimed at providing QoS to multimedia traffic in a local area environment, while retaining a compact and simple design. In this paper, we show some preliminary performance evaluation results. The workload was composed of a mix of synthetic CBR traffic and semi-synthetic VBR traffic. The latter was obtained from real MPEG-2 video sequences. We show that, with a simple scheduling algorithm, amenable for single-chip implementation, the link bandwidth utilization is quite satisfactory, while still providing acceptable delays to both CBR and VBR traffic",
	address = "Berlin, Germany",
	journal = "Network-Based Parallel Computing. Communication, Architecture, and Applications. Third International Workshop, CANPC'99 Proceedings",
	keywords = "local area networks;multimedia communication;performance evaluation;",
	note = "performance evaluation;multimedia router;MPEG-2 video traffic;multimedia traffic;workload;scheduling;",
	pages = "62 - 76",
	title = "{P}erformance evaluation of the multimedia router with {MPEG}-2 video traffic",
	year = 1999
}

Jose Duato, S Yalamanchili, M B Caminero, D Love and F J Quiles. MMR: a high-performance MultiMedia Router-architecture and design trade-offs. 1999, 300 - 9. URL BibTeX

@conference{6169107,
	author = "Duato, Jose and S. Yalamanchili and M.B. Caminero and D. Love and F.J. Quiles",
	abstract = "This paper presents the architecture of a router designed to efficiently support traffic generated by multimedia applications. The router is targeted for use in clusters and LANs rather than in WANs, the latter being served by communication substrates such as ATM. The distinguishing features of the proposed router architecture are the use of small fixed-size buffers, a large number of virtual channels, link-level virtual channel flow control, support for dynamic modification of connection bandwidth and priorities, and coordinated scheduling of connections across all output channels. The paper begins with a discussion of the design choices and architectural trade-offs made in the current MultiMedia Router (MMR) project. The performance evaluation section presents some preliminary results of the coordinated scheduling of constant bit rate (CBR) traffic streams",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings Fifth International Symposium on High-Performance Computer Architecture",
	keywords = "local area networks;multimedia systems;multiprocessor interconnection networks;performance evaluation;",
	note = "MMR;high-performance multimedia router;LANs;ATM;virtual channels;performance evaluation;coordinated scheduling;constant bit rate traffic streams;",
	pages = "300 - 9",
	title = "{MMR}: a high-performance {M}ulti{M}edia {R}outer-architecture and design trade-offs",
	url = "http://dx.doi.org/10.1109/HPCA.1999.744383",
	year = 1999
}

Jose Duato, Sudhakar Yalamanchili, M. Blanca Caminero, Damon Love and Francisco J Quiles. MMR: A high-performance multimedia router - architecture and design trade-offs. 1999, 300 - 309. BibTeX

@conference{1999164582264,
	author = "Duato, Jose and Sudhakar Yalamanchili and M. Blanca Caminero and Damon Love and Francisco J. Quiles",
	abstract = "This paper presents the architecture of a router designed to efficiently support traffic generated by multimedia applications. The router is targeted for use in clusters and LANs rather than in WANs, the latter being served by communication substrates such as ATM. The distinguishing features of the proposed router architecture are the use of small fixed-size buffers, a large number of virtual channels, link-level virtual channel flow control, support for dynamic modification of connection bandwidth and priorities, and coordinated scheduling of connections across all output channels. The paper begins with a discussion of the design choices and architectural trade-offs made in the current MultiMedia Router (MMR) project. The performance evaluation section presents some preliminary results of the coordinated scheduling of constant bit rate (CBR) traffic streams.",
	address = "Orlando, FL, USA",
	journal = "IEEE High-Performance Computer Architecture Symposium Proceedings",
	key = "Computer architecture",
	keywords = "Bandwidth;Communication channels (information theory);Congestion control (communication);Local area networks;Multimedia systems;Routers;Telecommunication traffic;Wide area networks;",
	note = "Constant bit rate (CBR);Coordination scheduling;Multimedia routers (MMR);",
	pages = "300 - 309",
	title = "{MMR}: {A} high-performance multimedia router - architecture and design trade-offs",
	year = 1999
}

Federico Silla and Jose Duato. Is it worth the flexibility provided by irregular topologies in networks of workstations? 1999, 47 - 61. BibTeX

@conference{6439234,
	author = "Silla, Federico and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming a cost-effective alternative for small-scale parallel computing. Usually, NOWs present an irregular topology as a consequence of the needs in a local area network. Routing algorithms used in NOWs are inherently different from those used in regular networks, mainly due to the irregular connections between switches. In these algorithms, routing is considerably restricted in order to avoid deadlocks. Recently, a general methodology for the design of adaptive routing algorithms for irregular networks has been proposed by the authors. The resulting algorithms increase the maximum achievable throughput while reducing message latency. In this paper, we study how much network performance we are losing due to the irregular topology of NOWs. We analyze the performance of the up*/down* routing algorithm in a 2D mesh topology and compare it with the performance achieved by the XY routing scheme in the same network, in order to answer the following two questions: 1) in a 2D mesh, which of the two routing algorithms achieves better performance?, and 2) where does the up*/down* routing algorithm work better, in a 2D mesh or in an irregular network? Simulation results show that the up*/down* routing strategy performs better in a regular network than in an irregular one. On the other hand, the XY routing algorithm considerably outperforms the up*/down* scheme. However, when the adaptive routing algorithm proposed by the authors is used, differences in performance are much smaller. Thus, the higher performance of a regular topology could not compensate for the loss in wiring flexibility with respect to irregular networks, or their capability of adding a single switch at any moment",
	address = "Berlin, Germany",
	journal = "Network-Based Parallel Computing. Communication, Architecture, and Applications. Third International Workshop, CANPC'99 Proceedings",
	keywords = "multiprocessor interconnection networks;network routing;network topology;performance evaluation;workstation clusters;",
	note = "NOWs;networks of workstations;irregular topology;local area network;routing;adaptive routing algorithms;2D mesh topology;performance;",
	pages = "47 - 61",
	title = "{I}s it worth the flexibility provided by irregular topologies in networks of workstations?",
	year = 1999
}

J F Martinez, J Torrellas and Jose Duato. Improving the performance of bristled CC-NUMA systems using virtual channels and adaptivity. 1999, 202 - 9. URL BibTeX

@conference{6734273,
	author = "J.F. Martinez and J. Torrellas and Duato, Jose",
	abstract = "Current high-end parallel systems achieve low-latency, high-bandwidth network communication through the use of aggressive design techniques and expensive mechanical and electrical parts. High-speed interconnection networks, which are crucial to achieving acceptable system performance, may account for an important fraction of the total cost of the machine. To reduce the network cost and still maintain scalability, bristled configurations, in which each router connects to several processing nodes, pose an attractive alternative. Their lower bandwidth, however, may adversely affect the efficiency of the parallel codes. We show how virtual channels and adaptive routing can make bristled systems more attractive: overall performance improves in congested scenarios while remaining practically unaltered under light traffic conditions. Experimental results are obtained by using execution-driven simulation of a complete state-of-the-art CC-NUMA system, with dynamic superscalar processors and contemporary pipelined routers. The results show that, in bristled hypercubes with 2 processing nodes per router, SPLASH-2 applications with significant communication run 5-15% faster if we make use of virtual channels and adaptive routing. The resulting systems are only 1-10% slower than systems with non-bristled hypercubes and similar routing support, even though the former only need about half of the network hardware components present in the latter. Additionally, virtual channels and adaptivity are shown to be of negligible effect in non-bristled hypercubes",
	address = "New York, NY, USA",
	journal = "Conference Proceedings of the 1999 International Conference on Supercomputing",
	keywords = "discrete event simulation;distributed shared memory systems;hypercube networks;network routing;parallel architectures;performance evaluation;pipeline processing;",
	note = "bristled CC-NUMA systems;virtual channels;high-end parallel systems;low-latency network communication;high-speed interconnection networks;system performance;network cost;scalability;processing nodes;parallel code efficiency;adaptive routing;congestion;traffic conditions;execution-driven simulation;dynamic superscalar processors;contemporary pipelined routers;bristled hypercubes;SPLASH-2 applications;",
	pages = "202 - 9",
	title = "{I}mproving the performance of bristled {CC}-{NUMA} systems using virtual channels and adaptivity",
	url = "http://dx.doi.org/10.1145/305138.305194",
	year = 1999
}

J M Martinez, Pedro Lopez and Jose Duato. Impact of buffer size on the efficiency of deadlock detection. 1999, 315 - 18. URL BibTeX

@conference{6169109,
	author = "J.M. Martinez and Lopez, Pedro and Duato, Jose",
	abstract = "Deadlock detection is one of the most important design issues in recovery strategies for routing in interconnection networks. In a previous paper, we presented an efficient deadlock detection mechanism. This mechanism requires that when a message header blocks it must be quickly notified to all the channels reserved by that message. To achieve this goal, the detection mechanism uses the information provided by flow control. Some recent commercial multiprocessors use deep buffers, since they may increase network throughput and efficiently allow transmission over long wires. However, deep buffers may increase the elapsed time between header blocking at a router and the propagation of flow control signals, thus negatively affecting the behavior of our deadlock detection mechanism. On the other hand, deeper buffers reduce deadlock frequency. As a consequence, buffer size has opposing effects on deadlock detection. In this paper, we analyze by simulation the influence of these effects on the efficiency of our deadlock detection mechanism, showing that overall performance improves with buffer size",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings Fifth International Symposium on High-Performance Computer Architecture",
	keywords = "concurrency control;multiprocessor interconnection networks;",
	note = "buffer size;deadlock detection;recovery strategies;interconnection networks routing;multiprocessors;deep buffers;simulation;",
	pages = "315 - 18",
	title = "{I}mpact of buffer size on the efficiency of deadlock detection",
	url = "http://dx.doi.org/10.1109/HPCA.1999.744385",
	year = 1999
}

Binh Vien Dao, Jose Duato and Sudhakar Yalamanchili. Dynamically configurable message flow control for fault-tolerant routing. IEEE Transactions on Parallel and Distributed Systems 10(1):7 - 22, 1999. URL BibTeX

@article{1999160018138,
	author = "Binh Vien Dao and Duato, Jose and Sudhakar Yalamanchili",
	abstract = "Fault-tolerant routing protocols in modern interconnection networks rely heavily on the network flow control mechanisms used. Optimistic flow control mechanisms, such as wormhole switching (WS), realize very good performance, but are prone to deadlock in the presence of faults. Conservative flow control mechanisms, such as pipelined circuit switching (PCS), ensure the existence of a path to the destination prior to message transmission, achieving reliable transmission at the expense of performance. This paper proposes a general class of flow control mechanisms that can be dynamically configured to trade off reliability and performance. Routing protocols can then be designed such that, in the vicinity of faults, protocols use a more conservative flow control mechanism, while the majority of messages that traverse fault-free portions of the network utilize a WS-like flow control to maximize performance. We refer to such protocols as two-phase protocols. This ability provides new avenues for optimizing message passing performance in the presence of faults. A fully adaptive two-phase protocol is proposed, and compared via simulation to those based on WS and PCS. The architecture of a network router supporting configurable flow control is also described.",
	address = "Los Alamitos, CA, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Interconnection networks",
	keywords = "Communication channels;Computer system recovery;Data communication systems;Fault tolerant computer systems;Network protocols;Pipeline processing systems;Virtual reality;",
	note = "Message flow control;Pipelined circuit switching;Wormhole switching;",
	number = 1,
	pages = "7 - 22",
	title = "{D}ynamically configurable message flow control for fault-tolerant routing",
	url = "http://dx.doi.org/10.1109/71.744829",
	volume = 10,
	year = 1999
}

R Casado, F J Quiles, J L Sanchez and Jose Duato. Deadlock-free routing in irregular networks with dynamic reconfiguration. 1999, 165 - 80. BibTeX

@conference{6429577,
	author = "R. Casado and F.J. Quiles and J.L. Sanchez and Duato, Jose",
	abstract = "High-speed local area networks (LANs) support many distributed applications. These applications require some system availability guarantees. However, LANs may change their topology due to switches and hosts being turned on/off, link remapping, and component failures. In these cases, a distributed reconfiguration algorithm is executed. This algorithm analyzes the topology, computes the new routing tables, and downloads them to the corresponding switches. Unfortunately, in most cases user traffic is stopped during the reconfiguration process to avoid deadlock. Although network reconfigurations are not frequent, they may take hundreds of milliseconds to execute, thus degrading system availability significantly. In this paper, we propose a new deadlock-free distributed reconfiguration algorithm that is able to asynchronously update the routing tables without stopping user traffic. This dynamic reconfiguration algorithm is valid for any topology, including regular as well as irregular topologies",
	address = "Berlin, Germany",
	journal = "Network-Based Parallel Computing. Communication, Architecture, and Applications. Third International Workshop, CANPC'99 Proceedings",
	keywords = "concurrency control;local area networks;reconfigurable architectures;telecommunication network routing;",
	note = "local area networks;distributed reconfiguration;deadlock;distributed reconfiguration algorithm;routing tables;dynamic reconfiguration;irregular networks;deadlock-free routing;",
	pages = "165 - 80",
	title = "{D}eadlock-free routing in irregular networks with dynamic reconfiguration",
	year = 1999
}

Jose Duato, Antonio Robles, Federico Silla and R Beivide. Comparison of router architectures for virtual cut-through and wormhole switching in a NOW environment. Proceedings of the International Parallel Processing Symposium, IPPS, pages 240 - 247, 1999. BibTeX

@article{1999394752205,
	author = "Duato, Jose and Robles, Antonio and Silla, Federico and R. Beivide",
	abstract = "Most commercial routers designed for networks of workstations (NOWs) implement wormhole switching. However, wormhole switching is not well suited for NOWs. The long wires required in this environment lead to large buffers to prevent buffer overflow during flow control signaling. Moreover, wire length is limited by buffer size. Virtual cut-through (VCT) achieves a higher throughput than wormhole switching. Moreover, the traditional disadvantages of VCT switching, as buffer requirements and packetizing overhead, disappear in NOWs. In this paper, we show that VCT routers can be simpler than wormhole ones, while still achieving the advantages of using virtual channels and adaptive routing. We also propose a fully adaptive routing algorithm for VCT switching in NOWs. Moreover, we show that VCT routers outperform wormhole routers in a NOW environment at a lower cost.",
	address = "San Juan",
	issn = 10637133,
	journal = "Proceedings of the International Parallel Processing Symposium, IPPS",
	key = "Pipeline processing systems",
	keywords = "Adaptive algorithms;Computer architecture;Computer workstations;Switching networks;",
	note = "Virtual cut-through (VCT);Wormhole switching;",
	pages = "240 - 247",
	title = "{C}omparison of router architectures for virtual cut-through and wormhole switching in a {NOW} environment",
	year = 1999
}

V Puente, R Beivide, J A Gregorio, J M Prellezo, Jose Duato and C Izu. Adaptive bubble router: a design to improve performance in torus networks. 1999, 58 - 67. URL BibTeX

@conference{6397188,
	author = "V. Puente and R. Beivide and J.A. Gregorio and J.M. Prellezo and Duato, Jose and C. Izu",
	abstract = "A router design for torus networks that significantly reduces message latency over traditional wormhole routers is presented in this paper. This new router implements virtual cut-through switching and fully-adaptive minimal routing. Packet deadlock is avoided by providing escape ways governed by Bubble flow control, a mechanism that guarantees enough free buffer space in the network to allow continuous packet movement. Both deterministic and adaptive Bubble routers have been designed in VLSI using VHDL synthesis tools. Adopting a fair quantitative comparison, we demonstrate that Bubble routers exhibit a reduction in base latency values over 40% with respect to the corresponding wormhole routers, without any penalty in network throughput. With much lower VLSI costs than adaptive wormhole routers, the adaptive Bubble router is even faster than deterministic wormhole routers based on virtual channels",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings of the 1999 International Conference on Parallel Processing",
	keywords = "multiprocessor interconnection networks;network routing;performance evaluation;VLSI;",
	note = "adaptive bubble router;performance;torus networks;router design;message latency;virtual cut-through switching;fully-adaptive minimal routing;VLSI;VHDL synthesis tools;base latency values;virtual channels;",
	pages = "58 - 67",
	title = "{A}daptive bubble router: a design to improve performance in torus networks",
	url = "http://dx.doi.org/10.1109/ICPP.1999.797388",
	year = 1999
}

Jose Duato, Antonio Robles, Federico Silla and R Beivide. A comparison of router architectures for virtual cut-through and wormhole switching in a NOW environment. 1999, 240 - 7. URL BibTeX

@conference{6245442,
	author = "Duato, Jose and Robles, Antonio and Silla, Federico and R. Beivide",
	abstract = "Most commercial routers designed for networks of workstations (NOWs) implement wormhole switching. However, wormhole switching is not well suited for NOWs. The long wires required in this environment lead to large buffers to prevent buffer overflow during flow control signaling. Moreover, wire length is limited by buffer size. Virtual cut-through (VCT) achieves a higher throughput than wormhole switching. Moreover, the traditional disadvantages of VCT switching, as buffer requirements and packetizing overhead, disappear in NOWs. In this paper, we show that VCT routers can be simpler than wormhole ones, while still achieving the advantages of using virtual channels and adaptive routing. We also propose a fully adaptive routing algorithm for VCT switching in NOWs. Moreover, we show that VCT routers outperform wormhole routers in a NOW environment at a lower cost",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999",
	keywords = "multiprocessor interconnection networks;network routing;workstation clusters;",
	note = "router architectures;virtual cut-through;wormhole switching;NOW environment;networks of workstations;buffer requirements;packetizing overhead;VCT routers;",
	pages = "240 - 7",
	title = "{A} comparison of router architectures for virtual cut-through and wormhole switching in a {NOW} environment",
	url = "http://dx.doi.org/10.1109/IPPS.1999.760469",
	year = 1999
}

Pedro Lopez, Juan Miguel Martínez and Jose Duato. DRIL: dynamically reduced message injection limitation mechanism for wormhole networks. In Parallel Processing, 1998. Proceedings. 1998 International Conference on. August 1998, 535 - 542. URL, DOI BibTeX

@conference{708527,
	author = "Lopez, Pedro and Mart{\'i}nez, Juan Miguel and Duato, Jose",
	abstract = "Deadlock avoidance and recovery techniques are alternatives to deal with the interconnection network deadlock problem. Both techniques allow fully adaptive routing on some set of resources while providing dedicated resources to escape from deadlock. They mainly differ in the way they supply escape paths and when those paths are used. As the escape paths only provide limited bandwidth to escape from deadlocks, both techniques suffer from severe performance degradation when the network is close to saturation. On the other hand, deadlock recovery is based on the assumption that deadlocks are rare. Several studies show that deadlocks are more likely when the network is close to or beyond saturation. In this paper we propose a new mechanism that prevents network saturation by dynamically adjusting message injection limitation into the network. As a consequence, this mechanism will avoid the performance degradation problem that typically occurs in both deadlock avoidance and recovery techniques, making fully adaptive routing feasible. Also, it will guarantee that the frequency of deadlock is really negligible, allowing the use of simple low-cost recovery strategies",
	booktitle = "Parallel Processing, 1998. Proceedings. 1998 International Conference on",
	doi = "10.1109/ICPP.1998.708527",
	isbn = "0-8186-8650-2",
	issn = "0190-3918",
	keywords = "DRIL;deadlock avoidance;interconnection network deadlock;message injection limitation;network saturation;performance degradation;recovery techniques;wormhole networks;concurrency control;multiprocessor interconnection networks;performance evaluation;syste",
	month = "aug",
	pages = "535 - 542",
	title = "{DRIL}: dynamically reduced message injection limitation mechanism for wormhole networks",
	url = "http://dx.doi.org/10.1109/ICPP.1998.708527",
	year = 1998
}

Pedro Lopez, Juan Miguel Martínez and Jose Duato. Very efficient distributed deadlock detection mechanism for wormhole networks. 1998, 57 - 66. BibTeX

@conference{1998534159795,
	author = "Lopez, Pedro and Mart{\'i}nez, Juan Miguel and Duato, Jose",
	abstract = "Networks using wormhole switching have traditionally relied upon deadlock avoidance strategies for the design of routing algorithms. More recently, deadlock recovery strategies have begun to gain acceptance. Progressive deadlock recovery techniques are very attractive because they allocate a few dedicated resources to quickly deliver deadlocked messages, instead of killing them. However, the distributed deadlock detection techniques proposed up to now detect many false deadlocks, especially when the network is heavily loaded and messages have different lengths. As a consequence, messages detected as deadlocked may saturate the bandwidth offered by recovery resources, thus degrading performance considerably. In this paper we propose an improved distributed deadlock detection mechanism that uses only local information, detects all the deadlocks, considerably reduces the probability of false deadlock detection and is not strongly affected by variations in message length and message destination distribution.",
	address = "Las Vegas, NV, USA",
	journal = "IEEE High-Performance Computer Architecture Symposium Proceedings",
	key = "Computer system recovery",
	keywords = "Algorithms;Bandwidth;Computer networks;Distributed computer systems;Error detection;",
	note = "Distributed deadlock detection mechanisms;Wormhole networks;",
	pages = "57 - 66",
	title = "{V}ery efficient distributed deadlock detection mechanism for wormhole networks",
	year = 1998
}

Pedro Lopez, J M Martinez and Jose Duato. A very efficient distributed deadlock detection mechanism for wormhole networks. 1998, 57 - 66. URL BibTeX

@conference{5842955,
	author = "Lopez, Pedro and J.M. Martinez and Duato, Jose",
	abstract = "Networks using wormhole switching have traditionally relied upon deadlock avoidance strategies for the design of routing algorithms. More recently, deadlock recovery strategies have begun to gain acceptance. Progressive deadlock recovery techniques are very attractive because they allocate a few dedicated resources to quickly deliver deadlocked messages, instead of killing them. However, the distributed deadlock detection techniques proposed up to now detect many false deadlocks, especially when the network is heavily loaded and messages have different lengths. As a consequence, messages detected as deadlocked may saturate the bandwidth offered by recovery resources, thus degrading performance considerably. In this paper we propose an improved distributed deadlock detection mechanism that uses only local information, detects all the deadlocks, considerably reduces the probability of false deadlock detection and is not strongly affected by variations in message length and message destination distribution",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture (Cat. No.98TB100224)",
	keywords = "multiprocessor interconnection networks;performance evaluation;system recovery;",
	note = "distributed deadlock detection mechanism;wormhole networks;wormhole switching;deadlock avoidance strategies;routing algorithms;deadlock recovery strategies;deadlock recovery techniques;performance degradation;local information;false deadlock detection;message length;message destination distribution;",
	pages = "57 - 66",
	title = "{A} very efficient distributed deadlock detection mechanism for wormhole networks",
	url = "http://dx.doi.org/10.1109/HPCA.1998.650546",
	year = 1998
}

Federico Silla, Jose Duato, A Sivasubramaniam and C R Das. Virtual channel multiplexing in networks of workstations with irregular topology. 1998, 147 - 54. URL BibTeX

@conference{6129280,
	author = "Silla, Federico and Duato, Jose and A. Sivasubramaniam and C.R. Das",
	abstract = "Networks of workstations are becoming a cost-effective alternative for small-scale parallel computing. Although they may not provide the closely coupled environment of multicomputers and multiprocessors, they meet the needs of a great variety of parallel computing problems at a lower cost. However, in order to achieve a high efficiency, the interconnects used to build the network of workstations must provide a very high bandwidth and low latencies, making their design a critical issue. Recently, a very efficient flow control protocol for networks of workstations has been proposed by the authors. This protocol multiplexes physical channels between several virtual channels and minimizes the use of control flits by transmitting several data flits each time a virtual channel gets the link. In this protocol, a virtual channel sends data flits until the message blocks or is completely transmitted. However, it can reduce network throughput, by increasing short message latency, due to long messages monopolizing channels and hindering the progress of short messages. In this paper, we analyze the impact of limiting the number of flits (block size) that a virtual channel can send once it gets the link. We propose a new version of the previous flow control protocol that is easily implementable in hardware. Simulation results show that limiting the maximum block size is not a good design decision, because the overall network performance decreases. Only when short message latency is crucial is it acceptable to limit the block size",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238)",
	keywords = "multiplexing;parallel processing;performance evaluation;protocols;workstation clusters;",
	note = "virtual channel multiplexing;workstation networks;irregular topology;small-scale parallel computing;high efficiency;interconnects;high bandwidth;low latency;flow control protocol;physical channels;minimized control flit use;data flit transmission;network throughput;simulation;network performance;short message latency;",
	pages = "147 - 54",
	title = "{V}irtual channel multiplexing in networks of workstations with irregular topology",
	url = "http://dx.doi.org/10.1109/HIPC.1998.737983",
	year = 1998
}

J L Sanchez, Jose Duato and J M Garcia. Using channel pipelining in reconfigurable interconnection networks. 1998, 120 - 6. URL BibTeX

@conference{5836358,
	author = "J.L. Sanchez and Duato, Jose and J.M. Garcia",
	abstract = "The major problem in wormhole routing networks is related to the contention due to message blocking. Reconfigurable networks are an alternative to reduce the negative effect that congestion produces on the performance of the network. Our work is focused on dynamic reconfiguration. This technique consists basically of placing the different processors in the network in those positions which, at each computational moment and according to the existing communication pattern among them, are more adequate for the development of such computation. In a reconfigurable architecture, the clock period is determined by the transmission time across the switch. To increase the clock frequency, the channel pipelining technique is used. In this paper we present the foundations of reconfigurable network architecture. We show the general structure of the reconfigurable systems and we indicate the characteristics of the channel pipelining technique. Finally, we evaluate the performance of a reconfigurable system",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings of the Sixth Euromicro Workshop on Parallel and Distributed Processing - PDP'98 - (Cat. No.98EX134)",
	keywords = "multiprocessor interconnection networks;performance evaluation;reconfigurable architectures;telecommunication network routing;",
	note = "channel pipelining;reconfigurable interconnection networks;wormhole routing networks;message blocking;performance;dynamic reconfiguration;communication pattern;reconfigurable architecture;transmission time;",
	pages = "120 - 6",
	title = "{U}sing channel pipelining in reconfigurable interconnection networks",
	url = "http://dx.doi.org/10.1109/EMPDP.1998.647188",
	year = 1998
}

R Garcia and Jose Duato. Suboptimal-optimal routing for LAN internetworking using transparent bridges. International Journal of Foundations of Computer Science 9(2):139 - 56, 1998. URL BibTeX

@article{5996619,
	author = "R. Garcia and Duato, Jose",
	abstract = "The current standard transparent bridge protocol IEEE-802.1D is based on the Spanning Tree (ST) algorithm. It has a very important restriction: it cannot work when the topology has active loops. Therefore, a tree is the only possible interconnection topology that can be used. The ST algorithm guarantees that the active topology is a tree discarding lines that form loops. However, because of this, network bandwidth cannot be fully utilized. Moreover, trees have a very serious bottleneck near the root. This paper proposes a new transparent bridge protocol for LAN interconnection that allows active loops. Therefore, strongly connected regular topologies like tori, hypercubes, meshes, etc., as well as irregular topologies can be used without wasting bandwidth. As loops provide alternative paths, the new protocol (named OSR for Optimal-Suboptimal Routing) uses optimal routing or, in the worst case, suboptimal routing",
	address = "Singapore",
	issn = "0129-0541",
	journal = "International Journal of Foundations of Computer Science",
	keywords = "hypercube networks;LAN interconnection;telecommunication network routing;transport protocols;",
	note = "suboptimal-optimal routing;LAN internetworking;transparent bridges;transparent bridge protocol IEEE-802.1D;spanning tree algorithm;interconnection topology;strongly connected regular topologies;tori;hypercubes;meshes;suboptimal routing;",
	number = 2,
	pages = "139 - 56",
	title = "{S}uboptimal-optimal routing for {LAN} internetworking using transparent bridges",
	url = "http://dx.doi.org/10.1142/S0129054198000118",
	volume = 9,
	year = 1998
}

Pedro Lopez, J M Martinez, Jose Duato and F Petrini. On the reduction of deadlock frequency by limiting message injection in wormhole networks. 1998, 295 - 307. BibTeX

@conference{5992388,
	author = "Lopez, Pedro and J.M. Martinez and Duato, Jose and F. Petrini",
	abstract = "Recently, deadlock recovery strategies have begun to gain acceptance in networks using wormhole switching. In particular, progressive deadlock recovery techniques are very attractive because they allocate a few dedicated resources to quickly deliver deadlocked packets, instead of killing them. Deadlock recovery is based on the assumption that deadlocks are really rare. Otherwise, recovery techniques are not efficient. We propose the use of a message injection limitation mechanism that reduces the probability of deadlock to negligible values, even when fully adaptive routing is used. The main new feature is that it can be used with different message destination distributions. The proposed mechanism can be combined with any deadlock detection mechanism. In particular, we use the deadlock detection mechanism proposed in Martinez (1997). In addition, the proposed injection limitation mechanism considerably reduces performance degradation when the network reaches the saturation point",
	address = "Berlin, Germany",
	journal = "Parallel Computer Routing and Communication. Second International Workshop, PCRCW'97. Proceedings",
	keywords = "multiprocessor interconnection networks;network routing;packet switching;performance evaluation;probability;resource allocation;system recovery;",
	note = "deadlock frequency;wormhole switching;progressive deadlock recovery;resource allocation;deadlocked packets;message injection limitation;probability;fully adaptive routing;message destination distributions;network performance;",
	pages = "295 - 307",
	title = "{O}n the reduction of deadlock frequency by limiting message injection in wormhole networks",
	year = 1998
}

Federico Silla and Jose Duato. On the use of virtual channels in networks of workstations with irregular topology. 1998, 203 - 16.

@conference{5992382,
	author = "Silla, Federico and Duato, Jose",
	abstract = "Networks of workstations are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability and incremental expansion capability required in this environment. Recently, we proposed a design methodology as well as fully adaptive routing algorithms for irregular topologies. These algorithms increase throughput considerably with respect to previously existing ones but require the use of virtual channels. In this paper we propose a very efficient flow control mechanism to support virtual channels when link wires are very long, and/or have different lengths. This flow control mechanism relies on the use of channel pipelining and control flits. Control traffic is minimized by assigning physical bandwidth to virtual channels until the corresponding message blocks or it is completely transmitted. Simulations show that the resulting flow control protocol performs almost as efficiently as an ideal network with short wires and flit-by-flit multiplexing",
	address = "Berlin, Germany",
	journal = "Parallel Computer Routing and Communication. Second International Workshop, PCRCW'97. Proceedings",
	keywords = "message passing;multiprocessor interconnection networks;network topology;pipeline processing;resource allocation;shared memory systems;",
	note = "virtual channels;networks of workstations;irregular topology;processor interconnection;flow control mechanism;link wires;channel pipelining;control flits;traffic minimization;physical bandwidth assignment;message transmission;simulations;adaptive routing;",
	pages = "203 - 16",
	title = "{O}n the use of virtual channels in networks of workstations with irregular topology",
	year = 1998
}

J M Orduna and Jose Duato. On the design of network routers for multimedia applications. 1998, 13 - 20.

@conference{6076091,
	author = "J.M. Orduna and Duato, Jose",
	abstract = "Parallel computing systems based on high performance interconnection networks are being used nowadays for online multimedia applications. The potential market of such applications seems to be large enough to justify a specific architecture oriented to support them. Wave switching is a hybrid switching technique for high performance routers which combines wormhole switching and circuit switching in the same router architecture. We propose the use of wave switching in parallel computing systems for applications like distributed multimedia systems or multicomputer based databases. These applications generate an intensive, bursty traffic together with a small percentage of control message traffic. For this kind of traffic, wave switching can considerably improve the throughput of a parallel computing system. Performance evaluation results show a drastic reduction in latency and an improvement in throughput with regard to networks with the same channel width using wormhole switching",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings of the 1998 ICPP Workshop on Architectural and OS Support for Multimedia Applications Flexible Communication Systems. Wireless Networks and Mobile Computing (Cat. No.98EX206)",
	keywords = "circuit switching;multimedia systems;parallel processing;telecommunication congestion control;",
	note = "network router design;multimedia applications;parallel computing systems;high performance interconnection networks;online multimedia applications;wave switching;hybrid switching technique;high performance routers;wormhole switching;circuit switching;router architecture;distributed multimedia systems;multicomputer based databases;bursty traffic;control message traffic;parallel computing system;channel width;",
	pages = "13 - 20",
	title = "{O}n the design of network routers for multimedia applications",
	url = "http://dx.doi.org/10.1109/ICPPW.1998.721869",
	year = 1998
}

Federico Silla, Antonio Robles and Jose Duato. Improving performance of networks of workstations by using Disha Concurrent. 1998, 80 - 7.

@conference{6034697,
	author = "Silla, Federico and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations are currently emerging as a cost-effective alternative to parallel computers. Recently, deadlock recovery techniques have been shown to be an alternative to deadlock avoidance. Disha Concurrent is a progressive deadlock recovery scheme able to simultaneously redirect several deadlocked messages through a deadlock-free lane. Unlike deadlock avoidance techniques, Disha provides true fully adaptive routing without using virtual channels to guarantee deadlock freedom. In this paper, we analyze the application of Disha to networks of workstations. We propose an implementation of Disha on irregular networks that allows concurrent deadlock recovery, proving that this implementation is always able to recover from deadlock. A new switch organization and a new flow control protocol are proposed to support Disha. Performance evaluation results show that applying Disha to irregular networks increases network throughput by a factor of up to 3.5, and also reduces latency with regard to other routing algorithms based on deadlock avoidance techniques",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 1998 International Conference on Parallel Processing (Cat. No.98EX205)",
	keywords = "concurrency control;local area networks;parallel processing;performance evaluation;system recovery;workstations;",
	note = "performance improvement;networks of workstations;Disha Concurrent;deadlock recovery techniques;deadlock avoidance;flow control protocol;latency;",
	pages = "80 - 7",
	title = "{I}mproving performance of networks of workstations by using {D}isha {C}oncurrent",
	url = "http://dx.doi.org/10.1109/ICPP.1998.708466",
	year = 1998
}

Federico Silla, Antonio Robles and Jose Duato. Improving performance of networks of workstations by using Disha Concurrent. In TH Lai (ed.). 1998 INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING - PROCEEDINGS. 1998, 80-87.

@conference{ISI:000075698400010,
	author = "Silla, Federico and Robles, Antonio and Duato, Jose",
	abstract = "Networks of workstations are currently emerging as a cost-effective alternative to parallel computers. Recently, deadlock recovery techniques have been shown to be an alternative to deadlock avoidance. Disha Concurrent is a progressive deadlock recovery scheme able to simultaneously redirect several deadlocked messages through a deadlock-free lane. Unlike deadlock avoidance techniques, Disha provides true fully adaptive routing without using virtual channels to guarantee deadlock freedom. In this paper, we analyze the application of Disha to networks of workstations. We propose an implementation of Disha on irregular networks that allows concurrent deadlock recovery, proving that this implementation is always able to recover from deadlock. A new switch organization and a new flow control protocol are proposed to support Disha. Performance evaluation results show that applying Disha to irregular networks increases network throughput by a factor of up to 3.5, and also reduces latency with regard to other routing algorithms based on deadlock avoidance techniques.",
	booktitle = "1998 INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING - PROCEEDINGS",
	editor = "Lai, TH",
	isbn = 0818686510,
	issn = "0190-3918",
	note = "International Conference on Parallel Processing (ICPP), MINNEAPOLIS, MN, AUG 10-14, 1998",
	pages = "80-87",
	series = "PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING",
	title = "{I}mproving performance of networks of workstations by using {D}isha {C}oncurrent",
	year = 1998
}

Federico Silla, M P Malumbres, Jose Duato, D Dai and D K Panda. Impact of adaptivity on the behavior of networks of workstations under bursty traffic. 1998, 88 - 95.

@conference{6034698,
	author = "Silla, Federico and M.P. Malumbres and Duato, Jose and D. Dai and D.K. Panda",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as an alternative to parallel computers. Typically, these networks present irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Similar to the evolution of parallel computers, NOWs are also evolving from distributed memory to shared memory. However, distances between processors are longer in NOWs, leading to higher message latency and lower network bandwidth. Therefore, one can expect the network to be a bottleneck when executing some parallel applications on a NOW supporting a shared-memory programming paradigm. The authors analyze whether the interconnection network in a NOW is able to efficiently handle the traffic generated in a DSM with the same number of processors. They evaluate the behavior of a NOW using application traces captured during the execution of several SPLASH2 applications on a DSM simulator. They show through simulation that the adaptive routing algorithm previously proposed by them almost eliminates network saturation due to its ability to support a higher sustained throughput. Therefore, adaptive routing becomes a key design issue to achieve similar performance in NOWs and tightly-coupled DSMs",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 1998 International Conference on Parallel Processing (Cat. No.98EX205)",
	keywords = "distributed memory systems;local area networks;parallel processing;shared memory systems;telecommunication network routing;telecommunication traffic;virtual machines;workstations;",
	note = "workstation network behaviour;bursty traffic;adaptivity;irregular topologies;wiring flexibility;wiring scalability;incremental expansion capability;distributed memory;shared memory;message latency;network bandwidth;parallel applications;shared-memory programming paradigm;interconnection network;traffic handling;application traces;SPLASH2 applications;simulator;adaptive routing algorithm;network saturation;",
	pages = "88 - 95",
	title = "{I}mpact of adaptivity on the behavior of networks of workstations under bursty traffic",
	url = "http://dx.doi.org/10.1109/ICPP.1998.708467",
	year = 1998
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Edinet: an execution driven interconnection network simulator for DSM systems. 1998, 336 - 9.

@conference{6161583,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
	abstract = "Evaluation studies on interconnection networks for distributed memory multiprocessors usually assume synthetic or trace-driven workloads. However, when the final design choices must be made, a more precise evaluation study should be performed. In this paper, we describe a new execution-driven simulation tool to evaluate interconnection networks for distributed memory multiprocessors using real application workloads. As an example, we have developed an NCC-NUMA memory model and obtained some simulation results from the SPLASH-2 suite, using different network routing algorithms",
	address = "Berlin, Germany",
	journal = "Computer Performance Evaluation. Modelling Techniques and Tools. 10th International Conference, Tools'98. Proceedings",
	keywords = "discrete event simulation;distributed shared memory systems;multiprocessor interconnection networks;performance evaluation;",
	note = "Edinet;execution driven interconnection network simulator;distributed memory multiprocessors;trace-driven workloads;execution-driven simulation tool;NCC-NUMA memory model;simulation results;SPLASH-2 suite;network routing algorithms;",
	pages = "336 - 9",
	title = "{E}dinet: an execution driven interconnection network simulator for {DSM} systems",
	year = 1998
}

Pedro Lopez, J M Martinez and Jose Duato. DRIL: dynamically reduced message injection limitation mechanism for wormhole networks. 1998, 535 - 42.

@conference{6034749,
	author = "Lopez, Pedro and J.M. Martinez and Duato, Jose",
	abstract = "Deadlock avoidance and recovery techniques are alternatives to deal with the interconnection network deadlock problem. Both techniques allow fully adaptive routing on some set of resources while providing dedicated resources to escape from deadlock. They mainly differ in the way they supply escape paths and when those paths are used. As the escape paths only provide limited bandwidth to escape from deadlocks, both techniques suffer from severe performance degradation when the network is close to saturation. On the other hand, deadlock recovery is based on the assumption that deadlocks are rare. Several studies show that deadlocks are more likely when the network is close to or beyond saturation. In this paper we propose a new mechanism that prevents network saturation by dynamically adjusting message injection limitation into the network. As a consequence, this mechanism will avoid the performance degradation problem that typically occurs in both deadlock avoidance and recovery techniques, making fully adaptive routing feasible. Also, it will guarantee that the frequency of deadlock is really negligible, allowing the use of simple low-cost recovery strategies",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 1998 International Conference on Parallel Processing (Cat. No.98EX205)",
	keywords = "concurrency control;multiprocessor interconnection networks;performance evaluation;system recovery;",
	note = "DRIL;wormhole networks;interconnection network deadlock;network saturation;message injection limitation;performance degradation;deadlock avoidance;recovery techniques;",
	pages = "535 - 42",
	title = "{DRIL}: dynamically reduced message injection limitation mechanism for wormhole networks",
	url = "http://dx.doi.org/10.1109/ICPP.1998.708527",
	year = 1998
}

Jose Duato. Deadlock avoidance and adaptive routing in interconnection networks. 1998, 359 - 64.

@conference{5842933,
	author = "Duato, Jose",
	abstract = "Networks of workstations are rapidly emerging as a cost effective alternative to parallel computers. Switch based interconnects with irregular topologies allow the wiring flexibility, scalability and incremental expansion capability required in this environment. The irregularity also makes routing and deadlock avoidance on such systems quite complicated. Current proposals avoid deadlock by removing cyclic dependencies between channels. As a consequence, many messages are routed following non minimal paths, therefore increasing latency and wasting resources. We describe a methodology for the design of adaptive routing algorithms for networks with irregular topology. The resulting algorithms allow messages to follow minimal paths in most cases, reducing message latency and balancing channel utilization. The proposed routing algorithms can be implemented simply by changing the routing tables and adding some links in parallel with existing links, taking advantage of spare switch ports. Alternatively, routing algorithms can be implemented by designing new switches that support virtual channels. Evaluation results show that the new routing algorithms are able to increase throughput by a factor of more than four for random traffic, also reducing latency",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings of the Sixth Euromicro Workshop on Parallel and Distributed Processing - PDP'98 - (Cat. No.98EX134)",
	keywords = "concurrency control;message passing;multiprocessor interconnection networks;workstations;",
	note = "deadlock avoidance;adaptive routing;interconnection networks;networks of workstations;switch based interconnects;irregular topologies;wiring flexibility;incremental expansion capability;cyclic dependencies;message routing;irregular topology;message latency;channel utilization;spare switch ports;virtual channels;random traffic;",
	pages = "359 - 64",
	title = "{D}eadlock avoidance and adaptive routing in interconnection networks",
	url = "http://dx.doi.org/10.1109/EMPDP.1998.647220",
	year = 1998
}

Pedro Lopez, Rosa Alcover, Jose Duato and L Zunica. Cost-effective methodology for the evaluation of interconnection networks. Journal of Systems Architecture 44(9-10):815 - 830, 1998.

@article{1998384306573,
	author = "Lopez, Pedro and Alcover, Rosa and Duato, Jose and L. Zunica",
	abstract = "Interconnection network performance depends on several parameters, including network design parameters, network size, message traffic and message length. Simulation is the methodology usually followed in evaluation studies, because the model can more faithfully represent hardware implementation, taking into account more details. Nevertheless, the number of parameter combinations is often very high, and simulations also take a long time to complete. Therefore, evaluation studies must choose a subset of the parameters and restrict the variability of each of them. In this paper, we propose a methodology for evaluating interconnection networks. It is based on experimental design used in statistical studies. Using this methodology, we can study network behavior considering many parameters, running only a subset of the simulations required to study all the combinations. In addition, the methodology makes it possible to quantify the effect of interactions among the parameters. We apply this methodology to adjust node design parameters such as number of virtual channels, input buffer size, and output buffer size for an 8-ary 3-cube with adaptive (both partially and fully) wormhole routing. We show that, by running only one third of the simulations required to study all the combinations, the most significant effects can be estimated without a noticeable loss in precision.",
	address = "Amsterdam, Netherlands",
	issn = 13837621,
	journal = "Journal of Systems Architecture",
	key = "Interconnection networks",
	keywords = "Buffer storage;Communication channels (information theory);Computer simulation;Cost effectiveness;Data communication systems;Statistical methods;Telecommunication traffic;",
	note = "Adaptive routing;Virtual channels;Wormhole routing;",
	number = "9-10",
	pages = "815 - 830",
	title = "{C}ost-effective methodology for the evaluation of interconnection networks",
	url = "http://dx.doi.org/10.1016/S1383-7621(97)00019-2",
	volume = 44,
	year = 1998
}

R Casado, B Caminero, P Cuenca, F Quiles, A Garrido and Jose Duato. A tool for the analysis of reconfiguration and routing algorithms in irregular networks. 1998, 159 - 73.

@conference{5959258,
	author = "R. Casado and B. Caminero and P. Cuenca and F. Quiles and A. Garrido and Duato, Jose",
	abstract = "High performance interconnection networking is one of the most active research fields in the area of communications. Its rapid development has been accelerated by the interest in using multiple workstations for parallel processing. These local networks use ideas that are already successfully applied in parallel computer interconnection networks. However, their more flexible and dynamic environment exposes new problems, such as topology configuration and message routing, which are difficult to solve with the current methods used in regular networks. Therefore, it is advisable to apply tools that help the researcher to develop and verify the behaviour of new algorithms for these new networks. The RAAP group (Redes y Arquitecturas de Altas Prestaciones, High Performance Networks and Architectures) of the University of Castilla-La Mancha is currently working in this direction. In this paper, we present a software tool developed by the RAAP group with the aim of helping in this research. It does not try to simulate the communications within the network (where a long computation process would not be able to guarantee any of its properties) but to analyze its behaviour through the channel dependency graph. The result is an agile and practical tool that provides conclusions in a quick and reliable way",
	address = "Berlin, Germany",
	journal = "Network-Based Parallel Computing. Communication, Architecture, and Applications. Second International Workshop, CANPC '98 Proceedings",
	keywords = "local area networks;parallel processing;reconfigurable architectures;software tools;workstations;",
	note = "routing algorithms;reconfiguration algorithms;irregular networks;high performance interconnection networking;multiple workstations;local networks;parallel computer interconnection networks;dynamic environment;topology configuration;message routing;software tool;computation process;channel dependency graph;",
	pages = "159 - 73",
	title = "{A} tool for the analysis of reconfiguration and routing algorithms in irregular networks",
	year = 1998
}

R Garcia, Jose Duato and J J Serrano. A new transparent bridge protocol for LAN internetworking using topologies with active loops. 1998, 295 - 303.

@conference{6034722,
	author = "R. Garcia and Duato, Jose and J.J. Serrano",
	abstract = "This paper proposes a new transparent bridge protocol for LAN interconnection that considerably improves the performance of current standard IEEE-802.1D bridges. The current standard is based on the Spanning Tree (ST) algorithm and the most important restriction is that it cannot work when the topology has active loops. The new protocol (named OSR for Optimal-Suboptimal Routing) allows them. Therefore, strongly connected regular topologies, like tori, hypercubes, meshes, etc., as well as irregular topologies, can be used without wasting bandwidth. As loops imply alternative paths, the OSR protocol uses optimal routing or, in the worst case, suboptimal routing. The new protocol has been evaluated on highly connected regular topologies, like meshes. The results are compared with those of a network of the same size managed by the standard spanning tree protocol, showing the superior behavior of the OSR protocol",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 1998 International Conference on Parallel Processing (Cat. No.98EX205)",
	keywords = "hypercube networks;LAN interconnection;performance evaluation;protocols;",
	note = "transparent bridge protocol;LAN internetworking;active loops topologies;performance;standard IEEE-802.1D bridges;spanning tree algorithm;torus;hypercubes;meshes;highly connected regular topologies;spanning tree protocol;",
	pages = "295 - 303",
	title = "{A} new transparent bridge protocol for {LAN} internetworking using topologies with active loops",
	url = "http://dx.doi.org/10.1109/ICPP.1998.708499",
	year = 1998
}

J M Orduna and Jose Duato. A high performance router architecture for multimedia applications. 1998, 142 - 9.

@conference{5973112,
	author = "J.M. Orduna and Duato, Jose",
	abstract = "Parallel computing systems and networks of workstations (NOWs) are being used nowadays for on-line multimedia applications. The potential market of such applications seems to be large enough to justify a specific architecture oriented to support them more efficiently. Wave switching is a hybrid switching technique for high performance routers that combines wormhole switching and circuit switching in the same router architecture. This switching technique is very well suited for parallel computers and NOWs using optical interconnections. In this paper we propose the use of wave switching for applications like distributed multimedia systems or MPEG video encoding. These applications generate an intensive, bursty traffic together with a small percentage of control message traffic. For this kind of traffic, wave switching can considerably improve the throughput of parallel computing systems. Performance evaluation results for an MPEG video encoding application show a drastic reduction in latency and an improvement in throughput, making it easier for these systems to support real-time constraints",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. Fifth International Conference on Massively Parallel Processing (Cat. No.98EX182)",
	keywords = "distributed memory systems;multimedia communication;multiprocessor interconnection networks;optical interconnections;",
	note = "network of workstations;on-line multimedia;router architecture;wormhole switching;circuit switching;wave switching;distributed multimedia;MPEG video encoding;parallel computing;",
	pages = "142 - 9",
	title = "{A} high performance router architecture for multimedia applications",
	url = "http://dx.doi.org/10.1109/MPPOI.1998.682137",
	year = 1998
}

Juan Miguel Martínez, Pedro Lopez, Jose Duato and T M Pinkston. Software-based deadlock recovery technique for true fully adaptive routing in wormhole networks. In Parallel Processing, 1997., Proceedings of the 1997 International Conference on. August 1997, 182 -189.

@conference{622586,
	author = "Mart{\'i}nez, Juan Miguel and Lopez, Pedro and Duato, Jose and T.M. Pinkston",
	abstract = "In this paper, we take a different approach to handle deadlocks and performance degradation. We propose the use of an injection limitation mechanism that prevents performance degradation near the saturation point and reduces the probability of deadlock to negligible values even when fully adaptive routing is used. We also propose an improved deadlock detection mechanism that only uses local information, detects all the deadlocks, and considerably reduces the probability of false deadlock detection over previous proposals. In the rare case when impending deadlock is detected, our proposed recovery technique absorbs the deadlocked message at the current node and later re-injects it for continued routing towards its destination. Performance evaluation results show that our new approach to deadlock handling is more efficient than previously proposed techniques",
	booktitle = "Parallel Processing, 1997., Proceedings of the 1997 International Conference on",
	doi = "10.1109/ICPP.1997.622586",
	keywords = "deadlock detection mechanism;deadlocked message;fully adaptive routing;injection limitation mechanism;performance degradation;performance evaluation;software-based deadlock recovery technique;true fully adaptive routing;wormhole networks;concurrency contr",
	month = "aug",
	pages = "182 -189",
	title = "{S}oftware-based deadlock recovery technique for true fully adaptive routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/ICPP.1997.622586",
	year = 1997
}

Jose Duato, Pedro Lopez and S Yalamanchili. Deadlock- and livelock-free routing protocols for wave switching. In Parallel Processing Symposium, 1997. Proceedings., 11th International. April 1997, 570 -577.

@conference{580958,
	author = "Duato, Jose and Lopez, Pedro and S. Yalamanchili",
	abstract = "Wave switching is a hybrid switching technique for high performance routers. It combines wormhole switching and circuit switching in the same router architecture. Wave switching achieves very high performance by exploiting communication locality. When two nodes are going to communicate frequently, a physical circuit is established between them. By combining circuit switching, pre-established physical circuits and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, also reducing latency for communications that use pre-established physical circuits. In this paper we propose two protocols for routers implementing wave switching. The first protocol handles the network as a cache of circuits, automatically establishing a circuit when two nodes are going to communicate. Subsequent communications use the previously established circuit. When a new circuit requests channels belonging to another circuit, a replacement algorithm selects the circuit to be torn down. The second protocol relies on the programmer and/or the compiler to decide when a circuit should be established or torn down for a set of messages. Also, we show that the proposed protocols are always able to deliver messages, and are deadlock- and livelock-free",
	booktitle = "Parallel Processing Symposium, 1997. Proceedings., 11th International",
	doi = "10.1109/IPPS.1997.580958",
	keywords = "circuit switching;deadlock-free;high performance routers;livelock-free;protocol;routing protocols;wave switching;wormhole switching;circuit switching;concurrency control;multiprocessor interconnection networks;network routing;protocols;",
	month = "apr",
	pages = "570 -577",
	title = "{D}eadlock- and livelock-free routing protocols for wave switching",
	url = "http://dx.doi.org/10.1109/IPPS.1997.580958",
	year = 1997
}

Fabrizio Petrini, Jose Duato, Pedro Lopez and Juan Miguel Martínez. LIFE: A Limited Injection, Fully adaptivE, recovery-based routing algorithm. 1997, 316 - 321.

@conference{1998104020143,
	author = "Fabrizio Petrini and Duato, Jose and Lopez, Pedro and Mart{\'i}nez, Juan Miguel",
	abstract = "Networks using wormhole switching have traditionally relied upon deadlock avoidance strategies for the design of deadlock-free algorithms. The past few years have seen a rise in popularity of deadlock recovery strategies, which are based on the property that deadlocks are quite rare in practice and happen only at or beyond the network saturation point. In fact, recovery-based routing algorithms have a higher potential performance than the deadlock avoidance-based ones, which allow less routing freedom. In this paper we present a recovery-based fully adaptive routing algorithm, LIFE, which is based on an innovative injection policy that reduces the probability of deadlocks to negligible values, both with uniform and non-uniform traffic patterns. The experimental results, conducted on an 8-ary 3-cube with 512 nodes, show that it is possible to implement true fully adaptive routing using only two virtual channels. Also, LIFE outperforms state-of-the-art avoidance- and recovery-based algorithms of the same cost, both in terms of throughput and message latency under uniform traffic, and provides stable throughput under non-uniform traffic patterns.",
	address = "Bangalore, India",
	journal = "Proceedings of the International Conference on High Performance Computing, HiPC",
	key = "Computer system recovery",
	keywords = "Adaptive algorithms;Communication channels;Computer networks;Congestion control;Switching circuits;Telecommunication traffic;",
	note = "Deadlock free algorithms;Non uniform traffic patterns;",
	pages = "316 - 321",
	title = "{LIFE}: {A} {L}imited {I}njection, {F}ully adaptiv{E}, recovery-based routing algorithm",
	year = 1997
}

Federico Silla and Jose Duato. Tuning the number of virtual channels in networks of workstations. 1997, 72 - 5.

@conference{5870025,
	author = "Silla, Federico and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using switch-based interconnects with irregular topology. We proposed a design methodology as well as fully adaptive routing algorithms for irregular topologies. These algorithms require the use of, at least, two virtual channels. We have also proposed a very efficient flow control mechanism to support virtual channels in the environment of irregular networks with varying wire lengths. We study the effect that additional virtual channels have on the performance of irregular networks built using the routing algorithms and the flow control mechanism. Results reveal that the optimal number of virtual channels per physical channel varies with network size",
	address = "Raleigh, NC, USA",
	journal = "Proceedings of the ISCA 10th International Conference on Parallel and Distributed Computing Systems",
	keywords = "local area networks;multiprocessor interconnection networks;performance evaluation;telecommunication channels;telecommunication network routing;",
	note = "virtual channel tuning;workstation networks;cost effective;parallel computers;processor interconnection networks;switch based interconnects;irregular topology;design methodology;adaptive routing algorithms;flow control;varying wire length;network performance;routing algorithms;network size;wormhole switching;",
	pages = "72 - 5",
	title = "{T}uning the number of virtual channels in networks of workstations",
	year = 1997
}

J M Martinez, Pedro Lopez, Jose Duato and T M Pinkston. Software-based deadlock recovery technique for true fully adaptive routing in wormhole networks. 1997, 182 - 9. URL BibTeX

@conference{5698560,
	author = "J.M. Martinez and Lopez, Pedro and Duato, Jose and T.M. Pinkston",
	abstract = "In this paper, we take a different approach to handle deadlocks and performance degradation. We propose the use of an injection limitation mechanism that prevents performance degradation near the saturation point and reduces the probability of deadlock to negligible values even when fully adaptive routing is used. We also propose an improved deadlock detection mechanism that only uses local information, detects all the deadlocks, and considerably reduces the probability of false deadlock detection over previous proposals. In the rare case when impending deadlock is detected, our proposed recovery technique absorbs the deadlocked message at the current node and later re-injects it for continued routing towards its destination. Performance evaluation results show that our new approach to deadlock handling is more efficient than previously proposed techniques",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)",
	keywords = "concurrency control;hypercube networks;network routing;software performance evaluation;system recovery;",
	note = "software-based deadlock recovery technique;true fully adaptive routing;wormhole networks;performance degradation;injection limitation mechanism;fully adaptive routing;deadlock detection mechanism;deadlocked message;performance evaluation;",
	pages = "182 - 9",
	title = "{S}oftware-based deadlock recovery technique for true fully adaptive routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/ICPP.1997.622586",
	year = 1997
}

Jose Duato. Theory of fault-tolerant routing in wormhole networks. IEEE Transactions on Parallel and Distributed Systems 8(8):790 - 802, 1997. URL BibTeX

@article{1997463838107,
	author = "Duato, Jose",
	abstract = "Fault-tolerant systems aim at providing continuous operation in the presence of faults. Multicomputers rely on an interconnection network between processors to support the message-passing mechanism. Therefore, the reliability of the interconnection network is very important for the reliability of the whole system. This paper analyzes the effective redundancy available in a wormhole network by combining connectivity and deadlock freedom. Redundancy is defined at the channel level. We propose a sufficient condition for channel redundancy, also computing the set of redundant channels. The redundancy level of the network is also defined, proposing a theorem that supplies its value. This theory is developed on top of our necessary and sufficient condition for deadlock-free adaptive routing. The new theory also considers the failure of physical channels when virtual channels are used. Finally, we propose a methodology for the design of fault-tolerant routing algorithms, showing its application to n-dimensional meshes.",
	address = "Los Alamitos, CA, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Fault tolerant computer systems",
	keywords = "Algorithms;Communication channels (information theory);Computational complexity;Computer system recovery;Interconnection networks;Redundancy;Reliability;Virtual reality;",
	note = "Adaptive routing;Channel redundancy;Fault tolerant routing;Wormhole networks;",
	number = 8,
	pages = "790 - 802",
	title = "{T}heory of fault-tolerant routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/71.605766",
	volume = 8,
	year = 1997
}

Jose Duato. Switching techniques, adaptive routing and deadlock handling in interconnection networks. 1997, 88 -. BibTeX

@conference{1997403781776,
	author = "Duato, Jose",
	abstract = "Three key issues in the design of interconnection networks are discussed: switching techniques, mechanisms for deadlock handling, and routing algorithms. These three issues are closely related to each other. Several switching techniques are described, including hybrid techniques, highlighting the relationship between switching technique and network technology. Mechanisms for deadlock handling in interconnection networks and their application to the design of adaptive routing algorithms are presented. Techniques for deadlock avoidance and recovery, focusing mainly on proposals that allow cyclic dependencies between network resources, are also described.",
	address = "Montreal, Can",
	journal = "International Conference on Massively Parallel Processing Using Optical Interconnections (MPPOI), Proceedings",
	key = "Interconnection networks",
	keywords = "Adaptive algorithms;Computer networks;Computer system recovery;",
	note = "Adaptive routing algorithms;Deadlock handling;",
	pages = "88 -",
	title = "{S}witching techniques, adaptive routing and deadlock handling in interconnection networks",
	year = 1997
}

T Olivares, P Cuenca, F J Quiles, A Garrido, J L Sanchez and Jose Duato. Interconnection network behavior on a multicomputer in the parallelization of the MPEG coding algorithm. Worm-hole vs. packet-switching routing. 1997, 48 - 53. URL BibTeX

@conference{5767620,
	author = "T. Olivares and P. Cuenca and F.J. Quiles and A. Garrido and J.L. Sanchez and Duato, Jose",
	abstract = "We propose the implementation of an MPEG encoder developed by the University of California at Berkeley on a multicomputer system. Since this application is in real time, we present a mapping of the video sequence between the EPs of the architecture, where the communication between EPs is minimized. We also propose the necessary load/store process with a simple input/output mechanism, where the global distribution process latency is compensated. The suitability of the topology of the system is analyzed, together with the most adequate switching technique for the interconnection network. Finally, the impact of the frame format on the system communication performance is analyzed",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. Fourth International Conference on High-Performance Computing (Cat. No.97TB100185)",
	keywords = "data compression;multiprocessing systems;multiprocessor interconnection networks;packet switching;parallel algorithms;real-time systems;video coding;",
	note = "interconnection network behavior;parallelization;MPEG coding algorithm;packet switching routing;MPEG encoder;multicomputer system;video sequence;load/store process;simple mechanism input/output;global distribution process latency;system topology;commutation technique;frame format;system communication performance;",
	pages = "48 - 53",
	title = "{I}nterconnection network behavior on a multicomputer in the parallelization of the {MPEG} coding algorithm. {W}orm-hole vs. packet-switching routing",
	url = "http://dx.doi.org/10.1109/HIPC.1997.634469",
	year = 1997
}

T Olivares, P Cuenca, F J Quiles, A Garrido, J L Sanchez and Jose Duato. Interconnection network behavior on a multicomputer in the parallelization of the MPEG coding algorithm. Worm-hole vs Packet-Switching Routing. 1997, 48 - 53. BibTeX

@conference{1998104020104,
	author = "T. Olivares and P. Cuenca and F.J. Quiles and A. Garrido and J.L. Sanchez and Duato, Jose",
	abstract = "In this work we propose the implementation of an MPEG encoder developed by the University of California at Berkeley on a multicomputer system. Since this application is in real time, we present a mapping of the video sequence between the EPs of the architecture, where the communication between EPs is minimized. We also propose the necessary load/store process with a simple input/output mechanism, where the global distribution process latency is compensated. The suitability of the topology of the system is analyzed, together with the most adequate switching technique for the interconnection network. Finally, the impact of the frame format on the system communication performance will be analyzed.",
	address = "Bangalore, India",
	journal = "Proceedings of the International Conference on High Performance Computing, HiPC",
	key = "Image coding",
	keywords = "Algorithms;Computer architecture;Data communication systems;Image compression;Interconnection networks;Packet switching;Parallel processing systems;Real time systems;Standards;",
	note = "Motion Picture Experts Group (MPEG) standards;Worm hole routing;",
	pages = "48 - 53",
	title = "{I}nterconnection network behavior on a multicomputer in the parallelization of the {MPEG} coding algorithm. {W}orm-hole vs {P}acket-{S}witching {R}outing",
	year = 1997
}

Federico Silla and Jose Duato. Improving the efficiency of adaptive routing in networks with irregular topology. 1997, 330 - 5. URL BibTeX

@conference{5767661,
	author = "Silla, Federico and Duato, Jose",
	abstract = "Networks of workstations are emerging as a cost-effective alternative to parallel computers. The interconnection between workstations usually relies on switch-based networks with irregular topologies. This irregularity makes routing and deadlock avoidance quite complicated. Current proposals avoid deadlock by removing cyclic dependencies between channels and therefore, many messages are routed along non-minimal paths, increasing latency and wasting resources. We propose a general methodology for the design of adaptive routing algorithms for networks with irregular topology that improves a previously proposed one by reducing the probability of routing over non-minimal paths. The resulting routing algorithms allow messages to follow minimal paths in most cases, reducing message latency and increasing network throughput. As an example of application, we propose an improved adaptive routing algorithm for Autonet",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. Fourth International Conference on High-Performance Computing (Cat. No.97TB100185)",
	keywords = "concurrency control;graph theory;local area networks;message switching;parallel processing;performance evaluation;telecommunication network routing;",
	note = "adaptive routing;irregular topology networks;workstation networks;cost-effective;parallel computers;switch-based networks;deadlock avoidance;cyclic dependencies;message routing;latency;probability;minimal paths;message latency;network throughput;Autonet;local area networks;",
	pages = "330 - 5",
	title = "{I}mproving the efficiency of adaptive routing in networks with irregular topology",
	url = "http://dx.doi.org/10.1109/HIPC.1997.634511",
	year = 1997
}

Federico Silla and Jose Duato. Improving the efficiency of adaptive routing in networks with irregular topology. 1997, 330 - 335. BibTeX

@conference{1998104020145,
	author = "Silla, Federico and Duato, Jose",
	abstract = "Networks of workstations are emerging as a cost-effective alternative to parallel computers. The interconnection between workstations usually relies on switch-based networks with irregular topologies. This irregularity makes routing and deadlock avoidance quite complicated. Current proposals avoid deadlock by removing cyclic dependencies between channels and therefore, many messages are routed along non-minimal paths, increasing latency and wasting resources. In this paper, we propose a general methodology for the design of adaptive routing algorithms for networks with irregular topology that improves over a previously proposed one by reducing the probability of routing over non-minimal paths. The resulting routing algorithms allow messages to follow minimal paths in most cases, reducing message latency and increasing network throughput. As an example of application, we propose an improved adaptive routing algorithm for Autonet.",
	address = "Bangalore, India",
	journal = "Proceedings of the International Conference on High Performance Computing, HiPC",
	key = "Computer networks",
	keywords = "Adaptive algorithms;Communication channels (information theory);Computer system recovery;Computer workstations;Congestion control (communication);Electric network topology;Probability;Response time (computer systems);Switching circuits;",
	note = "Adaptive routing algorithms;",
	pages = "330 - 335",
	title = "{I}mproving the efficiency of adaptive routing in networks with irregular topology",
	year = 1997
}

Jose Duato, Pedro Lopez and S Yalamanchili. Deadlock- and livelock-free routing protocols for wave switching. 1997, 570 - 7. URL BibTeX

@conference{5559828,
	author = "Duato, Jose and Lopez, Pedro and S. Yalamanchili",
	abstract = "Wave switching is a hybrid switching technique for high performance routers. It combines wormhole switching and circuit switching in the same router architecture. Wave switching achieves very high performance by exploiting communication locality. When two nodes are going to communicate frequently, a physical circuit is established between them. By combining circuit switching, pre-established physical circuits and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, also reducing latency for communications that use pre-established physical circuits. In this paper we propose two protocols for routers implementing wave switching. The first protocol handles the network as a cache of circuits, automatically establishing a circuit when two nodes are going to communicate. Subsequent communications use the previously established circuit. When a new circuit requests channels belonging to another circuit, a replacement algorithm selects the circuit to be torn down. The second protocol relies on the programmer and/or the compiler to decide when a circuit should be established or torn down for a set of messages. Also, we show that the proposed protocols are always able to deliver messages, and are deadlock- and livelock-free",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. 11th International Parallel Processing Symposium (Cat. No.97TB100107)",
	keywords = "circuit switching;concurrency control;multiprocessor interconnection networks;network routing;protocols;",
	note = "wave switching;routing protocols;high performance routers;wormhole switching;circuit switching;protocol;livelock-free;deadlock-free;",
	pages = "570 - 7",
	title = "{D}eadlock- and livelock-free routing protocols for wave switching",
	url = "http://dx.doi.org/10.1109/IPPS.1997.580958",
	year = 1997
}

Vien B Dao, S Yalamanchili and Jose Duato. Architectural support for reducing communication overhead in multiprocessor interconnection networks. 1997, 343 - 52. URL BibTeX

@conference{5514285,
	author = "B. Vien Dao and S. Yalamanchili and Duato, Jose",
	abstract = "Modern multicomputer interconnection networks offer the delivery of messages with very low latency. However, the message in-flight time is only a small portion of the total time that is required to send a message from source to destination. For fine to medium grained message sizes, the majority of time is spent in overheads for setting up and managing message transmission. It is often possible for compilers/programmers to separate inter-processor communication traffic into messages that exhibit communication locality and messages that do not. This paper proposes architectural modifications to network interfaces and routers to enable compilers/programmers to exploit known locality properties of programs in reducing the fixed overhead of transmission. These techniques work well on traffic exhibiting communication locality without unduly penalizing `ordinary' message traffic. The proposed techniques are evaluated using communication traces from 5 application program kernels. Significant reductions in average message latency are possible, and we argue that the approach can be used in the next generation of cluster interconnects",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. Third International Symposium on High-Performance Computer Architecture (Cat. No.97TB100094)",
	keywords = "message passing;multiprocessor interconnection networks;network interfaces;",
	note = "architectural support;communication overhead reduction;multiprocessor interconnection networks;message in-flight time;inter-processor communication traffic;communication locality;network interfaces;routers;communication traces;application program kernels;cluster interconnects;",
	pages = "343 - 52",
	title = "{A}rchitectural support for reducing communication overhead in multiprocessor interconnection networks",
	url = "http://dx.doi.org/10.1109/HPCA.1997.569699",
	year = 1997
}

Binh Vien Dao, Sudhakar Yalamanchili and Jose Duato. Architectural support for reducing communication overhead in multiprocessor interconnection networks. 1997, 343 - 352. BibTeX

@conference{1997173562519,
	author = "Binh Vien Dao and Sudhakar Yalamanchili and Duato, Jose",
	abstract = "Modern multicomputer interconnection networks offer the delivery of messages with very low latency. However, the message in-flight time is only a small portion of the total time that is required to send a message from source to destination. For fine to medium grained message sizes, the majority of time is spent in overheads for setting up and managing message transmission. It is often possible for compilers/programmers to separate inter-processor communication traffic into messages that exhibit communication locality, and messages that do not. This paper proposes architectural modifications to network interfaces and routers to enable compilers/programmers to exploit known locality properties of programs in reducing the fixed overhead of transmission. These techniques work well on traffic exhibiting communication locality without unduly penalizing `ordinary' message traffic. The proposed techniques are evaluated using communication traces from 5 application program kernels. Significant reductions in average message latency are possible, and we argue that the approach can be used in the next generation of cluster interconnects.",
	address = "San Antonio, TX, USA",
	journal = "IEEE High-Performance Computer Architecture Symposium Proceedings",
	key = "Interconnection networks",
	keywords = "Buffer storage;Computer architecture;Data communication systems;Interfaces;Pipeline processing systems;Program compilers;Telecommunication traffic;",
	note = "Communication locality;",
	pages = "343 - 352",
	title = "{A}rchitectural support for reducing communication overhead in multiprocessor interconnection networks",
	year = 1997
}

Rosa Alcover, Pedro Lopez, Jose Duato and L Zunica. A methodology for optimal interconnection network design. 1997, 81 - 4. BibTeX

@conference{5863027,
	author = "Alcover, Rosa and Lopez, Pedro and Duato, Jose and L. Zunica",
	abstract = "Interconnection network performance depends on several parameters. Some of them can be chosen by the designer: routing algorithm, switching technique, topology and node design parameters. However, there are other parameters that cannot be selected by the designer. Among these, there are parameters that depend on the application, such as message size, message destination distribution and message traffic, as well as parameters defined by the customer, such as network size. The optimization criterion that the network designer should follow is not only maximizing performance, but also selecting the design parameters that achieve good performance under all the feasible combinations of the parameters that cannot be selected by the designer. We propose a methodology for optimal network design based on robust experimental design techniques used in statistics. As an application, we choose the most important design parameters of a k-ary n-cube network based on that methodology",
	address = "Raleigh, NC, USA",
	journal = "Proceedings of the ISCA 10th International Conference on Parallel and Distributed Computing Systems",
	keywords = "design of experiments;message passing;multiprocessor interconnection networks;optimisation;parallel architectures;performance evaluation;",
	note = "optimal interconnection network design;interconnection network performance;routing algorithm;switching technique;topology;node design parameters;message size;message destination distribution;message traffic;network size;optimization criteria;experimental design techniques;statistics;k-ary n-cube network;",
	pages = "81 - 4",
	title = "{A} methodology for optimal interconnection network design",
	year = 1997
}

Rosa Alcover, Pedro Lopez, Jose Duato and L Zunica. Interconnection network design: a statistical analysis of interactions between factors. In Parallel and Distributed Processing, 1996. PDP '96. Proceedings of the Fourth Euromicro Workshop on. January 1996, 211 -218. URL, DOI BibTeX

@conference{500589,
	author = "Alcover, Rosa and Lopez, Pedro and Duato, Jose and L. Zunica",
	abstract = "Interconnection network performance depends on several parameters, including network design parameters, network size, message traffic and message length. Simulation is the methodology usually followed in evaluation studies, because the model can more faithfully represent hardware implementation, taking into account more details. Nevertheless, the number of parameter combinations is often high, and simulations also take long to complete. Therefore, evaluation studies must choose a subset of the parameters and restrict the variability of each of them. In a previous paper (IEEE Computer Soc. TCCA Newsletter, pp. 32-37, Aug. 1995), we have proposed a methodology for evaluating interconnection networks. It is based on experimental design used in statistical studies. Using this methodology, we can study network behavior considering many parameters, running only a subset of the simulations required to study all the combinations. In addition, the methodology permits us to quantify the effect of interactions among the parameters. In this paper, we make use of the second advantage of this methodology, analysing the effect of node design parameters and their interactions for an 8-ary 3-cube with adaptive wormhole routing",
	booktitle = "Parallel and Distributed Processing, 1996. PDP '96. Proceedings of the Fourth Euromicro Workshop on",
	doi = "10.1109/EMPDP.1996.500589",
	keywords = "8-ary 3-cube;adaptive wormhole routing;evaluation studies;interconnection network design;interconnection network performance;message length;message traffic;network behavior;network design parameters;network size;node design parameters;parameter combinatio",
	month = "jan",
	pages = "211 -218",
	title = "{I}nterconnection network design: a statistical analysis of interactions between factors",
	url = "http://dx.doi.org/10.1109/EMPDP.1996.500589",
	year = 1996
}

Jose Duato and M P Malumbres. Optimal topology for distributed shared memory multiprocessors: Hypercubes again?. Lecture Notes in Computer Science 1123:205 - 205, 1996. BibTeX

@article{1996123450253,
	author = "Duato, Jose and M.P. Malumbres",
	address = "Lyon, France",
	issn = 03029743,
	journal = "Lecture Notes in Computer Science",
	pages = "205 - 205",
	title = "{O}ptimal topology for distributed shared memory multiprocessors: {H}ypercubes again?",
	volume = 1123,
	year = 1996
}

Jose Duato and M P Malumbres. Optimal topology for distributed shared-memory multiprocessors: hypercubes again?. 1996, 205 - 12. BibTeX

@conference{5464760,
	author = "Duato, Jose and M.P. Malumbres",
	abstract = "Many distributed shared memory multiprocessors (DSM) use a direct interconnection network to implement a cache coherence protocol. An interesting characteristic of the message traffic produced by coherence protocols is that all the messages are very short. Most current multicomputers use low dimensional meshes or tori because these topologies usually achieve a higher performance. However, when messages are very short, latency is mainly dominated by the distance traveled in the network. As a consequence, higher dimensional topologies may achieve a lower latency than low dimensional topologies. We compare the 2D mesh and the hypercube topologies assuming a very detailed router model. Network load has been modeled taking into account the traffic produced by cache coherence protocols. Performance results show that average latency for hypercubes is slightly lower than for meshes. Moreover, hypercubes achieve a much higher throughput than meshes, making them suitable for DSMs",
	address = "Berlin, Germany",
	journal = "Euro-Par '96 Parallel Processing. Second International Euro-Par Conference. Proceedings",
	keywords = "distributed memory systems;hypercube networks;memory protocols;message passing;performance evaluation;shared memory systems;",
	note = "optimal topology;distributed shared memory multiprocessors;DSM;direct interconnection network;message traffic;multicomputers;higher dimensional topologies;low dimensional topologies;2D mesh;hypercube topologies;router model;network load;cache coherence protocols;average latency;",
	pages = "205 - 12",
	title = "{O}ptimal topology for distributed shared-memory multiprocessors: hypercubes again?",
	volume = "vol.1",
	year = 1996
}

Jose Duato. Necessary and sufficient condition for deadlock-free routing in cut-through and store-and-forward networks. IEEE Transactions on Parallel and Distributed Systems 7(8):841 - 854, 1996. URL BibTeX

@article{1996463341409,
	author = "Duato, Jose",
	abstract = "This paper develops the theoretical background for the design of deadlock-free adaptive routing algorithms for virtual cut-through and store-and-forward switching. This theory is valid for networks using either central buffers or edge buffers. Some basic definitions and three theorems are proposed, developing conditions to verify that an adaptive algorithm is deadlock-free, even when there are cyclic dependencies between routing resources. Moreover, we propose a necessary and sufficient condition for deadlock-free routing. Also, a design methodology is proposed. It supplies fully adaptive, minimal and non-minimal routing algorithms, guaranteeing that they are deadlock-free. The theory proposed in this paper extends the necessary and sufficient condition for wormhole switching previously proposed by us. The resulting routing algorithms are more flexible than the ones for wormhole switching. Also, the design methodology is much easier to apply because it automatically supplies deadlock-free routing algorithms.",
	address = "Los Alamitos, CA, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Computer networks",
	keywords = "Adaptive algorithms;Buffer storage;Communication channels;Interconnection networks;Network protocols;Storage allocation;Switching theory;Systems analysis;Telecommunication traffic;",
	note = "Adaptive routing;Deadlock free routing;Store and forward networks;Virtual cut through;",
	number = 8,
	pages = "841 - 854",
	title = "{N}ecessary and sufficient condition for deadlock-free routing in cut-through and store-and-forward networks",
	url = "http://dx.doi.org/10.1109/71.532115",
	volume = 7,
	year = 1996
}

Rosa Alcover, Pedro Lopez, Jose Duato and L Zunica. Interconnection network design: a statistical analysis of interactions between factors. 1996, 211 - 18. URL BibTeX

@conference{5242395,
	author = "Alcover, Rosa and Lopez, Pedro and Duato, Jose and L. Zunica",
	abstract = "Interconnection network performance depends on several parameters, including network design parameters, network size, message traffic and message length. Simulation is the methodology usually followed in evaluation studies, because the model can more faithfully represent hardware implementation, taking into account more details. Nevertheless, the number of parameter combinations is often high, and simulations also take long to complete. Therefore, evaluation studies must choose a subset of the parameters and restrict the variability of each of them. In a previous paper (IEEE Computer Soc. TCCA Newsletter, pp. 32-37, Aug. 1995), we have proposed a methodology for evaluating interconnection networks. It is based on experimental design used in statistical studies. Using this methodology, we can study network behavior considering many parameters, running only a subset of the simulations required to study all the combinations. In addition, the methodology permits us to quantify the effect of interactions among the parameters. In this paper, we make use of the second advantage of this methodology, analysing the effect of node design parameters and their interactions for an 8-ary 3-cube with adaptive wormhole routing",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings of the Fourth Euromicro Workshop on Parallel and Distributed Processing - PDP '96",
	keywords = "design of experiments;multiprocessor interconnection networks;network routing;network synthesis;network topology;performance evaluation;statistical analysis;",
	note = "interconnection network design;statistical analysis;interconnection network performance;network design parameters;network size;message traffic;message length;simulations;parameter combinations;evaluation studies;parameter variability;network behavior;parameter interactions;node design parameters;8-ary 3-cube;adaptive wormhole routing;",
	pages = "211 - 18",
	title = "{I}nterconnection network design: a statistical analysis of interactions between factors",
	url = "http://dx.doi.org/10.1109/EMPDP.1996.500589",
	year = 1996
}

Anjan K V., Timothy Mark Pinkston and Jose Duato. Generalized theory for deadlock-free adaptive wormhole routing and its application to Disha concurrent. 1996, 815 - 821. BibTeX

@conference{1996363246596,
	author = "Anjan K. V. and Timothy Mark Pinkston and Duato, Jose",
	abstract = "This paper generalizes a theory for deadlock-free adaptive wormhole routing by considering a mixed set of resources: edge and central buffers. This generalized theory is then applied to a concurrent version of Disha deadlock-recovery which relaxes the sequential recovery requirement for simultaneous recovery from deadlocks. The proposed extension to Disha does not necessitate any additional resource cost; rather, it serves to eliminate the requirement of mutually exclusive access to the deadlock-free lane implemented by a Token. With this extension, Disha Concurrent remains applicable to any topology with a Hamiltonian path including k-ary n-cube networks and is also applicable to tree-based networks.",
	address = "Honolulu, HI, USA",
	issn = 10636374,
	journal = "IEEE Symposium on Parallel and Distributed Processing - Proceedings",
	key = "Computation theory",
	keywords = "Adaptive algorithms;Buffer storage;Communication channels (information theory);Computer system recovery;Electric network topology;Interconnection networks;Packet switching;",
	note = "Deadlock free adaptive wormhole routing;Deadlocks;Disha concurrent;Generalized theory;",
	pages = "815 - 821",
	title = "{G}eneralized theory for deadlock-free adaptive wormhole routing and its application to {D}isha concurrent",
	year = 1996
}

A K Venkatramani, T M Pinkston and Jose Duato. Generalized theory for deadlock-free adaptive wormhole routing and its application to Disha Concurrent. 1996, 815 - 21. URL BibTeX

@conference{5309807,
	author = "A.K. Venkatramani and T.M. Pinkston and Duato, Jose",
	abstract = "This paper generalizes a theory for deadlock-free adaptive wormhole routing by considering a mixed set of resources: edge and central buffers. This generalized theory is then applied to a concurrent version of Disha deadlock-recovery which relaxes the sequential recovery requirement for simultaneous recovery from deadlocks. The proposed extension to Disha does not necessitate any additional resource cost; rather, it serves to eliminate the requirement of mutually exclusive access to the deadlock-free lane implemented by a Token. With this extension, Disha Concurrent remains applicable to any topology with a Hamiltonian path including k-ary n-cube networks and is also applicable to tree-based networks",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings of IPPS '96. The 10th International Parallel Processing Symposium (Cat. No.96TB100038)",
	keywords = "concurrency control;fault tolerant computing;multiprocessor interconnection networks;network routing;parallel architectures;performance evaluation;system recovery;",
	note = "deadlock-free adaptive wormhole routing;Disha Concurrent;edge;central buffers;deadlock recovery;sequential recovery;resource cost;mutual exclusive access;deadlock-free lane;Token;Hamiltonian path;k-ary n-cube networks;tree-based networks;multiprocessor interconnection networks;",
	pages = "815 - 21",
	title = "{G}eneralized theory for deadlock-free adaptive wormhole routing and its application to {D}isha {C}oncurrent",
	url = "http://dx.doi.org/10.1109/IPPS.1996.508153",
	year = 1996
}

M P Malumbres, Jose Duato and Josep Torrellas. Efficient implementation of tree-based multicast routing for distributed shared-memory multiprocessors. 1996, 186 - 189.

@conference{1997093486439,
	author = "M.P. Malumbres and Duato, Jose and Josep Torrellas",
	abstract = "This paper presents an efficient routing and flow control mechanism to implement multidestination message passing in wormhole networks. It is targeted to situations where the size of message data is very small, like in invalidation and update messages in distributed shared-memory multiprocessors (DSMs) with hardware cache coherence. The mechanism is a variation of tree-based multicast with pruning to avoid deadlocks. The new scheme does not require that the destination addresses in a given multicast message be ordered, thereby avoiding any ordering overhead. It allows messages to use any deadlock-free routing function and only requires one startup for each multicast message. The new scheme has been evaluated on several k-ary n-cube networks under synthetic loads. The results show that the proposed scheme is faster than other multicast mechanisms when the multicast traffic is composed of short messages.",
	address = "New Orleans, LA, USA",
	issn = 10636374,
	journal = "IEEE Symposium on Parallel and Distributed Processing - Proceedings",
	key = "Distributed computer systems",
	keywords = "Computer system recovery;Data communication systems;Distributed database systems;Synchronization;Telecommunication traffic;Trees;",
	note = "Distributed shared memory multiprocessors (DSM);Tree based multicast routing;",
	pages = "186 - 189",
	title = "{E}fficient implementation of tree-based multicast routing for distributed shared-memory multiprocessors",
	year = 1996
}

M P Malumbres, Jose Duato and J Torrellas. An efficient implementation of tree-based multicast routing for distributed shared-memory multiprocessors. 1996, 186 - 9.

@conference{5465328,
	author = "M.P. Malumbres and Duato, Jose and J. Torrellas",
	abstract = "This paper presents an efficient routing and flow control mechanism to implement multidestination message passing in wormhole networks. It is targeted to situations where the size of message data is very small, like in invalidation and update messages in distributed shared-memory multiprocessors (DSMs) with hardware cache coherence. The mechanism is a variation of tree-based multicast with pruning to avoid deadlocks. The new scheme does not require that the destination addresses in a given multicast message be ordered, thereby avoiding any ordering overhead. It allows messages to use any deadlock-free routing function and only requires one startup for each multicast message. The new scheme has been evaluated on several k-ary n-cube networks under synthetic loads. The results show that the proposed scheme is faster than other multicast mechanisms when the multicast traffic is composed of short messages",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. Eighth IEEE Symposium on Parallel and Distributed Processing (Cat. No.96TB100088)",
	keywords = "communication complexity;distributed memory systems;message passing;shared memory systems;tree data structures;",
	note = "tree-based;multicast routing;shared-memory multiprocessors;distributed shared-memory;wormhole networks;multidestination message passing;synthetic loads;multicast mechanisms;",
	pages = "186 - 9",
	title = "{A}n efficient implementation of tree-based multicast routing for distributed shared-memory multiprocessors",
	url = "http://dx.doi.org/10.1109/SPDP.1996.570332",
	year = 1996
}

Jose Duato, Pedro Lopez, Federico Silla and S Yalamanchili. A high performance router architecture for interconnection networks. 1996, 61 - 8.

@conference{5376067,
	author = "Duato, Jose and Lopez, Pedro and Silla, Federico and S. Yalamanchili",
	abstract = "We propose a new router architecture that supports wormhole switching and circuit switching concurrently. This architecture has been designed to take advantage of temporal communication locality. This can be done by establishing a circuit between nodes that are going to communicate frequently. Messages using those circuits face no contention. By combining circuit switching, pre-established physical circuits and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, also reducing latency for communications that use pre-established physical circuits. This router architecture also allows to reduce the overhead of the software messaging layer in multicomputers by offering a better hardware support. Preliminary performance evaluation results show a drastic reduction in latency and increment in throughput when messages are long enough, even if circuits are established for a single transmission and locality is not exploited",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings of the 1996 International Conference on Parallel Processing. Vol.1 Architecture",
	keywords = "message passing;multiprocessor interconnection networks;parallel architectures;performance evaluation;",
	note = "high performance router architecture;interconnection networks;wormhole switching;circuit switching;temporal communication locality;router architecture;software messaging layer;performance evaluation;",
	pages = "61 - 8",
	title = "{A} high performance router architecture for interconnection networks",
	url = "http://dx.doi.org/10.1109/ICPP.1996.537144",
	volume = "vol.1",
	year = 1996
}

Jose Duato. Theory of deadlock-free adaptive multicast routing in wormhole networks. IEEE Transactions on Parallel and Distributed Systems 6(9):976 - 987, 1995.

@article{1995512906712,
	author = "Duato, Jose",
	abstract = "A theory for the design of deadlock-free adaptive routing algorithms for wormhole networks was proposed in [2], [16]. This theory supplies the sufficient conditions for an adaptive routing algorithm to be deadlock-free, even when there are cyclic dependencies between channels. Also, two design methodologies were proposed. Multicast communication refers to the delivery of the same message from one source node to an arbitrary number of destination nodes. A tree-like routing scheme is not suitable for hardware-supported multicast in wormhole networks because it produces many headers for each message, drastically increasing the probability of a message being blocked. A path-based multicast routing model was proposed in [25] for multicomputers with 2D-mesh and hypercube topologies. In this model, messages are not replicated at intermediate nodes. This paper develops the theoretical background for the design of deadlock-free adaptive multicast routing algorithms. This theory is valid for wormhole networks using the path-based routing model. It is also valid when messages with a single destination and multiple destinations are mixed together. The new channel dependencies produced by messages with several destinations are studied. Also, two theorems are proposed, developing conditions to verify that an adaptive multicast routing algorithm is deadlock-free, even when there are cyclic dependencies between channels. As an example, the multicast routing algorithms presented in [25] are extended, so that they can take advantage of the alternative paths offered by the network.",
	address = "Los Alamitos, CA, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Interconnection networks",
	keywords = "Algorithms;Bandwidth;Communication channels (information theory);Congestion control (communication);Data communication systems;Graph theory;Multiprocessing systems;",
	note = "Adaptive routing;Deadlock avoidance;Multicast routing;Path based multicast;Virtual channels;Wormhole routing;",
	number = 9,
	pages = "976 - 987",
	title = "{T}heory of deadlock-free adaptive multicast routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/71.466634",
	volume = 6,
	year = 1995
}

Jose Duato. Necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks. IEEE Transactions on Parallel and Distributed Systems 6(10):1055 - 1067, 1995.

@article{1996032932822,
	author = "Duato, Jose",
	abstract = "Deadlock avoidance is a key issue in wormhole networks. A first approach [8] consists of removing the cyclic dependencies between channels. Many deterministic and adaptive routing algorithms have been proposed based on that approach. Although the absence of cyclic dependencies is a necessary and sufficient condition for deadlock-free deterministic routing, it is only a sufficient condition for deadlock-free adaptive routing. A more powerful approach [11] only requires the absence of cyclic dependencies on a connected channel subset. The remaining channels can be used in almost any way. In this paper, we show that the previously mentioned approach is also a sufficient condition. Moreover, we propose a necessary and sufficient condition for deadlock-free adaptive routing. This condition is the key for the design of fully adaptive routing algorithms with minimum restrictions. An example shows the application of the new theory.",
	address = "Los Alamitos, CA, United States",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Data communication systems",
	keywords = "Adaptive algorithms;Bandwidth;Communication channels;Computer networks;Computer system recovery;Multiprocessing systems;Theorem proving;",
	note = "Adaptive routing;Deadlock avoidance;Routing algorithms;Virtual channels;Wormhole networks;",
	number = 10,
	pages = "1055 - 1067",
	title = "{N}ecessary and sufficient condition for deadlock-free adaptive routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/71.473515",
	volume = 6,
	year = 1995
}

Jose Duato and Pedro Lopez. Highly adaptive wormhole routing algorithms for n-dimensional torus. 1995, 87 - 104.

@conference{5513276,
	author = "Duato, Jose and Lopez, Pedro",
	abstract = "Deadlock avoidance is a key issue in wormhole networks. A first approach consists of removing the cyclic dependencies between channels. Many deterministic and adaptive routing algorithms have been proposed based on that approach. The absence of cyclic dependencies is a necessary and sufficient condition for deadlock-free deterministic routing. However, it can be relaxed for adaptive routing. A more powerful approach was proposed by us. It only requires the absence of cyclic dependencies on a connected channel subset. The remaining channels can be used in almost any way. In this paper, we show that there exists a more relaxed condition for deadlock-free adaptive routing. This condition is the key for the design of more powerful adaptive routing algorithms. We apply this condition to the design of adaptive routing algorithms for n-dimensional torus. In particular, we propose a partially adaptive routing algorithm which doubles the throughput achieved by the deterministic algorithm without increasing the hardware complexity significantly",
	address = "New York, NY, USA",
	journal = "Interconnection Networks and Mapping and Scheduling Parallel Computations. DIMACS Workshop",
	keywords = "deterministic algorithms;multiprocessor interconnection networks;telecommunication network routing;",
	note = "wormhole networks;n-dimensional torus;wormhole routing;deadlock avoidance;cyclic dependencies;deterministic routing;deterministic algorithm;",
	pages = "87 - 104",
	title = "{H}ighly adaptive wormhole routing algorithms for n-dimensional torus",
}

Pedro Lopez and Jose Duato. Deadlock-free fully-adaptive minimal routing algorithms: limitations and solutions. Computers and Artificial Intelligence 14(2):105 - 25, 1995.

@article{5024414,
	author = "Lopez, Pedro and Duato, Jose",
	abstract = "In previous papers, a theory for the design of deadlock-free adaptive routing algorithms as well as a design methodology have been proposed. In this paper, an adaptive routing algorithm, obtained from the application of this theory to the 3D-torus, is evaluated under different load conditions and compared with other algorithms. The results show that this algorithm is very fast, also increasing the network throughput considerably. Nevertheless, this adaptive algorithm has cycles in its channel dependency graph. Consequently, when the network is heavily loaded messages may temporarily block cyclically, drastically reducing the performance of the algorithm. Two mechanisms are proposed to avoid this problem",
	address = "Slovakia",
	issn = "0232-0274",
	journal = "Computers and Artificial Intelligence",
	keywords = "concurrency control;distributed algorithms;distributed memory systems;distributed processing;message passing;processor scheduling;",
	note = "deadlock-free fully-adaptive minimal routing algorithm;distributed memory computer;interconnection network;multiprocessor design;theory;3D-torus;three dimensional torus;network throughput;channel dependency graph;message passing;temporary block;",
	number = 2,
	pages = "105 - 25",
	title = "{D}eadlock-free fully-adaptive minimal routing algorithms: limitations and solutions",
	volume = 14,
	year = 1995
}

Binh Vien Dao, Jose Duato and Sudhakar Yalamanchili. Configurable flow control mechanisms for fault-tolerant routing. 1995, 220 - 229.

@conference{1995482886519,
	author = "Binh Vien Dao and Duato, Jose and Sudhakar Yalamanchili",
	abstract = "Fault-tolerant routing protocols in modern interconnection networks rely heavily on the network flow control mechanisms used. Optimistic flow control mechanisms such as wormhole routing (WR) realize very good performance, but are prone to deadlock in the presence of faults. Conservative flow control mechanisms such as pipelined circuit switching (PCS) insures existence of a path to the destination prior to message transmission, but incurs increased overhead. Existing fault-tolerant routing protocols are designed with one or the other, and must accommodate their associated constraints. This paper proposes the use of configurable flow control mechanisms. Routing protocols can then be designed such that in the vicinity of faults, protocols use a more conservative flow control mechanism, while the majority of messages that traverse fault-free portions of the network utilize a WR like flow control to maximize performance. Such protocols are referred to as two-phase protocols, where routing decisions are provided some control over the operation of the virtual channels. This ability provides new avenues for optimizing message passing performance in the presence of faults. A fully adaptive two-phase protocol is proposed and compared via simulation to those based on WR and PCS. The architecture of a network router supporting configurable flow control is described, and the paper concludes with avenues for future research.",
	address = "Santa Margherita Ligure, Italy",
	journal = "ACM SIGARCH (Association for Computing Machinery Special Interest Group on Computer Architecture) - Conference Proceedings",
	key = "Fault tolerant computer systems",
	keywords = "Algorithms;Computer architecture;Computer simulation;Constraint theory;Interconnection networks;Multiprocessing systems;Network protocols;Pipeline processing systems;Switching;",
	note = "Fault tolerant routing;Pipeline circuit switching;Scouting routing;Virtual channels;Wormhole routing;",
	pages = "220 - 229",
	title = "{C}onfigurable flow control mechanisms for fault-tolerant routing",
	year = 1995
}

Binh Vien Dao, Jose Duato and Sudhakar Yalamanchili. Configurable flow control mechanisms for fault-tolerant routing. 1995, 220 - 229.

@conference{1995492892924,
	author = "Binh Vien Dao and Duato, Jose and Sudhakar Yalamanchili",
	abstract = "Fault-tolerant routing protocols in modern interconnection networks rely heavily on the network flow control mechanisms used. Optimistic flow control mechanisms such as wormhole routing (WR) realize very good performance, but are prone to deadlock in the presence of faults. Conservative flow control mechanisms such as pipelined circuit switching (PCS) insures existence of a path to the destination prior to message transmission, but incurs increased overhead. Existing fault-tolerant routing protocols are designed with one or the other, and must accommodate their associated constraints. This paper proposes the use of configurable flow control mechanisms. Routing protocols can then be designed such that in the vicinity of faults, protocols use a more conservative flow control mechanism, while the majority of messages that traverse fault-free portions of the network utilize a WR like flow control to maximize performance. Such protocols are referred to as two-phase protocols, where routing decisions are provided some control over the operation of the virtual channels. This ability provides new avenues for optimizing message passing performance in the presence of faults. A fully adaptive two-phase protocol is proposed and compared via simulation to those based on WR and PCS. The architecture of a network router supporting configurable flow control is described, and the paper concludes with avenues for future research.",
	address = "Santa Margherita Ligure, Italy",
	issn = 08847495,
	journal = "Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA",
	key = "Network protocols",
	keywords = "Congestion control;Critical path analysis;Data communication systems;Interconnection networks;Pipeline processing systems;Telecommunication traffic;",
	note = "Fault tolerant routing;Flow control mechanisms;Pipelined circuit switching;Wormhole routing;",
	pages = "220 - 229",
	title = "{C}onfigurable flow control mechanisms for fault-tolerant routing",
	year = 1995
}

Binh Vien Dao, Jose Duato and S Yalamanchili. Configurable flow control mechanisms for fault-tolerant routing. 1995, 220 - 9.

@conference{5086788,
	author = "Binh Vien Dao and Duato, Jose and S. Yalamanchili",
	abstract = "Fault-tolerant routing protocols in modern interconnection networks rely heavily on the network flow control mechanisms used. Optimistic flow control mechanisms such as wormhole routing (WR) realize very good performance, but are prone to deadlock in the presence of faults. Conservative flow control mechanisms such as pipelined circuit switching (PCS) insures existence of a path to the destination prior to message transmission, but incurs increased overhead. Existing fault-tolerant routing protocols are designed with one or the other, and must accommodate their associated constraints. This paper proposes the use of configurable flow control mechanisms. Routing protocols can then be designed such that in the vicinity of faults, protocols use a more conservative flow control mechanism, while the majority of messages that traverse fault-free portions of the network utilize a WR like flow control to maximize performance. Such protocols are referred to as two-phase protocols where routing decisions are provided some control over the operation of the virtual channels. This ability provides new avenues for optimizing message passing performance in the presence of faults. A fully adaptive two-phase protocol is proposed and compared via simulation to those based on WR and PCS. The architecture of a network router supporting configurable flow control is described, and the paper concludes with avenues for future research",
	address = "New York, NY, USA",
	journal = "Proceedings 22nd Annual International Symposium on Computer Architecture (IEEE Cat. No.95CB35801)",
	keywords = "fault tolerant computing;message passing;multiprocessor interconnection networks;protocols;",
	note = "configurable flow control mechanisms;fault-tolerant routing;protocols;interconnection networks;wormhole routing;pipelined circuit switching;message transmission;fault-free portions;message passing performance;",
	pages = "220 - 9",
	title = "{C}onfigurable flow control mechanisms for fault-tolerant routing",
	url = "http://dx.doi.org/10.1109/ISCA.1995.524563",
	year = 1995
}

Jose Duato, B V Dao, P T Gaughan and S Yalamanchili. Scouting: fully adaptive, deadlock-free routing in faulty pipelined networks. 1994, 608 - 13.

@conference{4864749,
	author = "Duato, Jose and B.V. Dao and P.T. Gaughan and S. Yalamanchili",
	abstract = "Adaptive routing protocols based on message pipelining using wormhole routing (WR) can provide superior performance. However, the occurrence of faults can lead to situations that may produce deadlock. Variants of adaptive WR have been introduced (P.T. Gaughan and S. Yalamanchili, 1992) that employ backtracking and misrouting to first establish a path, followed by message pipelining (pipelined circuit switching, or PCS). This scheme avoids deadlock due to faults, but is overly conservative leading to reduced performance. The paper introduces a new family of flow control mechanisms ranging from WR to PCS that offers a compromise by only decoupling the routing probe and the data flits the minimal extent required to provide deadlock-free routing in the presence of faults",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 1994 International Conference on Parallel and Distributed Systems (Cat. No.94TH06817)",
	keywords = "adaptive systems;concurrency control;fault tolerant computing;multiprocessor interconnection networks;network routing;parallel architectures;pipeline processing;reliability;",
	note = "deadlock-free routing;faulty pipelined networks;scouting;adaptive routing protocols;message pipelining;wormhole routing;adaptive WR;pipelined circuit switching;PCS;flow control mechanisms;routing probe;minimal extent;fault tolerant routing;",
	pages = "608 - 13",
	title = "{S}couting: fully adaptive, deadlock-free routing in faulty pipelined networks",
	url = "http://dx.doi.org/10.1109/ICPADS.1994.590406",
	year = 1994
}

Jose Duato, B V Dao, P T Gaughan and S Yalamanchili. Scouting: fully adaptive, deadlock-free routing in faulty pipelined networks. 1994, 608 - 613.

@conference{1995282705474,
	author = "Duato, Jose and B.V. Dao and P.T. Gaughan and S. Yalamanchili",
	abstract = "Adaptive routing protocols based on message pipelining using wormhole routing (WR) can provide superior performance. However, the occurrence of faults can lead to situations that may produce deadlock. Variants of adaptive WR have been introduced [10] that employ backtracking and misrouting to first establish a path, followed by message pipelining (pipelined circuit switching, or PCS). This scheme avoids deadlocks due to faults, but is overly conservative leading to reduced performance. This paper introduces a new family of flow control mechanisms ranging from WR to PCS that offers a compromise by only decoupling the routing probe and the data flits the minimal extent required to provide deadlock-free routing in the presence of faults.",
	address = "Hsinchu, China",
	journal = "Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS",
	key = "Network protocols",
	keywords = "Computer networks;Computer system recovery;Fault tolerant computer systems;Performance;Pipeline processing systems;",
	note = "Deadlock free routing;Faulty pipelined networks;Message pipelining;Pipelined circuit switching;Scouting;Wormhole routing;",
	pages = "608 - 613",
	title = "{S}couting: fully adaptive, deadlock-free routing in faulty pipelined networks",
	year = 1994
}

Jose Duato. Theory to increase the effective redundancy in wormhole networks. Parallel Processing Letters 4(1-2):125 - 138, 1994.

@article{1995042477806,
	author = "Duato, Jose",
	abstract = "Fault-tolerant systems aim at providing continuous operations in the presence of faults. Multicomputers rely on an interconnection network between processors to support the message-passing mechanism. Therefore, the reliability of the interconnection network is very important for the reliability of the whole system. This paper analyses the effective redundancy available in a wormhole network by combining connectivity and deadlock freedom. Redundancy is defined at the channel level, giving a sufficient condition for a channel to be redundant and computing the set of redundant channels. The redundancy level of the network is also defined, proposing a theorem that supplies a lower bound for it. Finally, a fault-tolerant routing algorithm based on the former theory is proposed.",
	address = "Singapore, Singapore",
	issn = 01296264,
	journal = "Parallel Processing Letters",
	key = "Fault tolerant computer systems",
	keywords = "Algorithms;Communication channels (information theory);Computational methods;Computer system recovery;Data communication systems;Error analysis;Large scale systems;Multiprocessing systems;Program processors;Redundancy;",
	note = "Adaptive routing;Deadlock avoidance;Wormhole routing;",
	number = "1-2",
	pages = "125 - 138",
	title = "{T}heory to increase the effective redundancy in wormhole networks",
	volume = 4,
	year = 1994
}

Jose Duato. Theory to increase the effective redundancy in wormhole networks. Parallel processing letters 4(1-2):125 - 138, 1994.

@article{1995132546653,
	author = "Duato, Jose",
	abstract = "Fault-tolerant systems aim at providing continuous operations in the presence of faults. Multicomputers rely on an interconnection network between processors to support the message-passing mechanism. Therefore, the reliability of the interconnection network is very important for the reliability of the whole system. This paper analyses the effective redundancy available in a wormhole network by combining connectivity and deadlock freedom. Redundancy is defined at the channel level, giving a sufficient condition for a channel to be redundant and computing the set of redundant channels. The redundancy level of the network is also defined, proposing a theorem that supplies a lower bound for it. Finally, a fault-tolerant routing algorithm based on the former theory is proposed.",
	issn = 01296264,
	journal = "Parallel processing letters",
	key = "Fault tolerant computer systems",
	keywords = "Adaptive systems;Algorithms;Computer networks;Computer system recovery;Multiprocessing systems;Redundancy;Reliability;System theory;",
	note = "Adaptive routing;Deadlock avoidance;Fault tolerance;Multicomputers;Wormhole networks;",
	number = "1-2",
	pages = "125 - 138",
	title = "{T}heory to increase the effective redundancy in wormhole networks",
	volume = 4,
	year = 1994
}

Jose Duato. Theory of fault-tolerant routing in wormhole networks. 1994, 600 - 607.

@conference{1995282705473,
	author = "Duato, Jose",
	abstract = "Fault-tolerant systems aim at providing continuous operations in the presence of faults. Multicomputers rely on an interconnection network between processors to support the message-passing mechanism. Therefore, the reliability of the interconnection network is very important for the reliability of the whole system. This paper analyzes the effective redundancy available in a wormhole network by combining connectivity and deadlock freedom. Redundancy is defined at the channel level. We propose a sufficient condition for channel redundancy, also computing the set of redundant channels. The redundancy level of the network is also defined, proposing a theorem that supplies its value. This theory is developed on top of our necessary and sufficient condition for deadlock-free adaptive routing. Finally, a fault-tolerant routing algorithm for n-dimensional meshes is proposed.",
	address = "Hsinchu, China",
	journal = "Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS",
	key = "Fault tolerant computer systems",
	keywords = "Algorithms;Communication channels (information theory);Computation theory;Computer networks;Computer system recovery;Data communication systems;Interconnection networks;Multiprocessing systems;Redundancy;Reliability;Theorem proving;",
	note = "Deadlock free adaptive routing;Fault tolerant routing;Message passing mechanism;Virtual channels;Wormhole networks;",
	pages = "600 - 607",
	title = "{T}heory of fault-tolerant routing in wormhole networks",
	year = 1994
}

Jose Duato and Pedro Lopez. Performance evaluation of adaptive routing algorithms for k-ary n-cubes. Number 853, pages 45 - 59, 1994.

@inbook{1994122484814,
	author = "Duato, Jose and Lopez, Pedro",
	address = "Seattle, WA, United States",
	issn = 03029743,
	journal = "Lecture Notes in Computer Science",
	number = 853,
	pages = "45 - 59",
	title = "{P}erformance evaluation of adaptive routing algorithms for k-ary n-cubes",
	year = 1994
}

Jose Duato and Pedro Lopez. Performance evaluation of adaptive routing algorithms for k-ary n-cubes. 1994, 45 - 59.

@conference{4897362,
	author = "Duato, Jose and Lopez, Pedro",
	abstract = "Deadlock avoidance is a key issue in wormhole networks. A first approach consists in removing the cyclic dependencies between channels. Although the absence of cyclic dependencies is a necessary and sufficient condition for deadlock-free deterministic routing, it is only a sufficient condition for deadlock-free adaptive routing. A more powerful approach only requires the absence of cyclic dependencies on a connected channel subset. Moreover, we proposed a necessary and sufficient condition for deadlock-free adaptive routing previously (1994). In this paper, we design adaptive routing algorithms for k-ary n-cubes. In particular, we propose partially adaptive and fully adaptive routing algorithms which considerably increase the throughput achieved by the deterministic routing algorithm. Also, we evaluate the performance of the new routing algorithms under both, uniform and non-uniform distribution of message destinations",
	address = "Berlin, Germany",
	journal = "Parallel Computer Routing and Communication. First International Workshop, PCRCW '94. Proceedings",
	keywords = "concurrency control;multiprocessor interconnection networks;performance evaluation;telecommunication network routing;",
	note = "performance evaluation;adaptive routing algorithms;k-ary n-cubes;deadlock avoidance;wormhole networks;cyclic dependencies;necessary and sufficient condition;connected channel subset;deterministic routing algorithm;routing algorithms;",
	pages = "45 - 59",
	title = "{P}erformance evaluation of adaptive routing algorithms for k-ary n-cubes",
	year = 1994
}

Jose Duato. Improving the efficiency of virtual channels with time-dependent selection functions. Future Generation Computer Systems 10(1):45 - 58, 1994.

@article{1994081333099,
	author = "Duato, Jose",
	abstract = "In previous papers, a new theory for the design of deadlock-free adaptive routing algorithms for wormhole and store-and-forward networks as well as two design methodologies have been proposed. Also, a new adaptive routing algorithm, obtained from the application of the former theory to the binary n-cube, has been evaluated using both, a uniform and an exponential distribution for message destination. The results are good, especially for large networks and a uniform distribution for message destination. When locality is exploited, the results are comparatively worse, mainly due to the reduction in channel bandwidth produced by channel multiplexing. In this paper, we analyze the advantages and disadvantages produced by the use of virtual channels, proposing a new approach to maximize their efficiency. This approach uses time-dependent selection functions, associating a threshold to some virtual channels. Those channels cannot be selected by a message unless it is waiting for longer than the corresponding threshold. The evaluation of the new selection function for the binary n-cube shows an important improvement, especially when locality is exploited.",
	address = "Amsterdam, Netherlands",
	issn = "0167739X",
	journal = "Future Generation Computer Systems",
	key = "Virtual storage",
	keywords = "Adaptive control systems;Algorithms;Communication channels (information theory);Computer networks;Computer system recovery;Critical path analysis;Data communication systems;Data processing;Large scale systems;Multiprocessing systems;Packet switching;",
	note = "Adaptive routing;Time dependent selection functions;Virtual channels;Wormhole routing;",
	number = 1,
	pages = "45 - 58",
	title = "{I}mproving the efficiency of virtual channels with time-dependent selection functions",
	url = "http://dx.doi.org/10.1016/0167-739X(94)90050-7",
	volume = 10,
	year = 1994
}

Jose Duato. Improving the efficiency of virtual channels with time-dependent selection functions. Computers and Artificial Intelligence 13(1):25 - 44, 1994. BibTeX

@article{4717789,
	author = "Duato, Jose",
	abstract = "In previous papers, a new theory for the design of deadlock-free adaptive routing algorithms for wormhole and store-and-forward networks as well as two design methodologies have been proposed. A new adaptive routing algorithm, obtained from the application of the former theory to the binary n-cube, has been also evaluated using both, a uniform and an exponential distribution for message destination. The results are good, especially for large networks and a uniform distribution for message destination. When locality is exploited, the results are comparatively worse, mainly due to the reduction in channel bandwidth produced by channel multiplexing. We analyse the advantages and disadvantages produced by the use of virtual channels proposing a new approach to maximize their efficiency. This approach uses time-dependent selection functions associating a threshold to some virtual channels. Those channels cannot be selected by a message unless it is waiting for longer than the corresponding threshold. The evaluation of the new selection function for the binary n-cube shows an important improvement, especially when locality is exploited",
	address = "Slovakia",
	issn = "0232-0274",
	journal = "Computers and Artificial Intelligence",
	keywords = "adaptive systems;algorithm theory;multiplexing;multiprocessor interconnection networks;",
	note = "virtual channels;efficiency;time-dependent selection functions;deadlock-free adaptive routing algorithms;store-and-forward networks;wormhole networks;design methodologies;adaptive routing algorithm;binary n-cube;message destination;locality;channel bandwidth;channel multiplexing;",
	number = 1,
	pages = "25 - 44",
	title = "{I}mproving the efficiency of virtual channels with time-dependent selection functions",
	volume = 13,
	year = 1994
}

Ziqiang Liu and Jose Duato. Adaptive unicast and multicast in 3D mesh networks. 1994, 173 - 182. BibTeX

@conference{1994101398696,
	author = "Ziqiang Liu and Duato, Jose",
	abstract = "In this paper, we present an adaptive unicast and multicast routing algorithm for 3D mesh networks with wormhole routing and virtual channel flow control, which is called adaptive-cast. The unique feature of the adaptive-cast is that it is valid when messages with a single destination (unicast) and with multiple destinations (multicast) are mixed together, which drastically simplifies the implementation of the router. Also, only two virtual channels per physical channel are needed to support the adaptive-cast. Our simulation experiment results in a 10 × 10 × 10 3D mesh have confirmed that the adaptive-cast achieves better performance than the corresponding static routing algorithm under both uniform and nonuniform traffic patterns.",
	address = "Wailea, HI, USA",
	issn = 10603425,
	journal = "Proceedings of the Hawaii International Conference on System Sciences",
	key = "Algorithms",
	keywords = "Communication channels;Computer hardware;Computer networks;Computer simulation;Critical path analysis;Data communication systems;Network protocols;Packet switching;Telecommunication traffic;Three dimensional;Virtual storage;",
	note = "Adaptive multicast;Adaptive unicast;Multicast communication;Router;Three dimensional mesh networks;Virtual channel;Virtual channel flow control;Wormhole routing;",
	pages = "173 - 182",
	title = "{A}daptive unicast and multicast in 3{D} mesh networks",
	volume = 1,
	year = 1994
}

Ziqiang Liu and Jose Duato. Adaptive unicast and multicast in 3D mesh networks. 1994, 173 - 82. URL BibTeX

@conference{4682102,
	author = "Ziqiang Liu and Duato, Jose",
	abstract = "Presents an adaptive unicast and multicast routing algorithm for 3D mesh networks with wormhole routing and virtual channel flow control, which is called adaptive-cast. The unique feature of the adaptive-cast is that it is valid when messages with a single destination (unicast) and with multiple destinations (multicast) are mixed together, which drastically simplifies the implementation of the router. Also, only two virtual channels per physical channel are needed to support the adaptive-cast. The authors' simulation experiment results in 10 × 10 × 10 3D mesh have confirmed that the adaptive-cast achieves better performance than the corresponding static routing algorithm under both uniform and nonuniform traffic patterns",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences Vol. I: Architecture (Cat. No.94TH0607-2)",
	keywords = "multiprocessor interconnection networks;network routing;performance evaluation;",
	note = "3D mesh networks;multicast routing algorithm;unicast routing algorithm;adaptive unicast;adaptive-cast;wormhole routing;virtual channel flow control;performance;",
	pages = "173 - 82",
	title = "{A}daptive unicast and multicast in 3{D} mesh networks",
	url = "http://dx.doi.org/10.1109/HICSS.1994.323174",
	year = 1994
}

Jose Duato. A theory of fault-tolerant routing in wormhole networks. 1994, 600 - 7. URL BibTeX

@conference{4864748,
	author = "Duato, Jose",
	abstract = "Fault-tolerant systems aim at providing continuous operations in the presence of faults. Multicomputers rely on an interconnection network between processors to support the message-passing mechanism. Therefore, the reliability of the interconnection network is very important for the reliability of the whole system. This paper analyzes the effective redundancy available in a wormhole network by combining connectivity and deadlock freedom. Redundancy is defined at the channel level. We propose a sufficient condition for channel redundancy, also computing the set of redundant channels. The redundancy level of the network is also defined, proposing a theorem that supplies its value. This theory is developed on top of our necessary and sufficient condition for deadlock-free adaptive routing. Finally, a fault-tolerant routing algorithm for n-dimensional meshes is proposed",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings 1994 International Conference on Parallel and Distributed Systems (Cat. No.94TH06817)",
	keywords = "concurrency control;fault tolerant computing;message passing;multiprocessor interconnection networks;network routing;parallel algorithms;reliability;",
	note = "fault-tolerant routing;wormhole networks;fault-tolerant systems;continuous operations;multicomputers;interconnection network;message-passing;interconnection network reliability;redundancy;connectivity;deadlock;channel level;channel redundancy;deadlock-free adaptive routing;fault-tolerant routing algorithm;n-dimensional meshes;",
	pages = "600 - 7",
	title = "{A} theory of fault-tolerant routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/ICPADS.1994.590404",
	year = 1994
}

Jose Duato. A theory to increase the effective redundancy in wormhole networks. Parallel Processing Letters 4(1-2):125 - 38, 1994. BibTeX

@article{4749319,
	author = "Duato, Jose",
	abstract = "Fault-tolerant systems aim at providing continuous operations in the presence of faults. Multicomputers rely on an interconnection network between processors to support the message-passing mechanism. Therefore, the reliability of the interconnection network is very important for the reliability of the whole system. This paper analyses the effective redundancy available in a wormhole network by combining connectivity and deadlock freedom. Redundancy is defined at the channel level, giving a sufficient condition for a channel to be redundant and computing the set of redundant channels. The redundancy level of the network is also defined, proposing a theorem that supplies a lower bound for it. Finally, a fault-tolerant routing algorithm based on the former theory is proposed",
	address = "Singapore",
	issn = "0129-6264",
	journal = "Parallel Processing Letters",
	keywords = "concurrency control;fault tolerant computing;multiprocessor interconnection networks;redundancy;",
	note = "effective redundancy;wormhole networks;fault-tolerant systems;continuous operations;message-passing mechanism;interconnection network;deadlock freedom;lower bound;fault-tolerant routing algorithm;",
	number = "1-2",
	pages = "125 - 38",
	title = "{A} theory to increase the effective redundancy in wormhole networks",
	volume = 4,
	year = 1994
}

Jose Duato. A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks. 1994, 142 - 9. BibTeX

@conference{5247791,
	author = "Duato, Jose",
	abstract = "Deadlock avoidance is a key issue in wormhole networks. A first approach (Dally and Seitz, 1987) consists of removing the cyclic dependencies between channels. Although this is a necessary and sufficient condition for deadlock-free deterministic routing, it is only a sufficient condition for deadlock-free adaptive routing. A more powerful approach (Duato, 1991) only requires the absence of cyclic dependencies on a connected channel subset. The remaining channels can be used in almost any way. In this paper, we propose a necessary and sufficient condition for deadlock-free adaptive routing. This condition is the key for the design of maximally adaptive routing algorithms with minimum restrictions. Some examples are given, showing the application of the new theory. In particular, we propose a partially adaptive routing algorithm for k-ary n-cubes which doubles the throughput without increasing the hardware complexity significantly",
	address = "Boca Raton, FL, USA",
	journal = "Proceedings of the 1994 International Conference on Parallel Processing",
	keywords = "computational complexity;fault tolerant computing;multiprocessor interconnection networks;telecommunication network routing;",
	note = "deadlock-free;adaptive routing;wormhole networks;deadlock avoidance;maximally adaptive routing algorithms;minimum restrictions;k-ary n-cubes;hardware complexity;partially adaptive routing;",
	pages = "142 - 9",
	title = "{A} necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks",
	volume = "vol.1",
	year = 1994
}

Jose Duato. Theory to increase the effective redundancy in wormhole networks. Number A-39, pages 277 - 288, 1993. BibTeX

@inbook{1994111410611,
	author = "Duato, Jose",
	abstract = "Fault-tolerant systems aim at providing continuous operations in the presence of faults. Multicomputers rely on an interconnection network between processors to support the message-passing mechanism. Therefore, the reliability of the interconnection network is very important for the reliability of the whole system. This paper analyses the effective redundancy available in a wormhole network by combining connectivity and deadlock freedom. Redundancy is defined at the channel level, giving a sufficient condition for a channel to be redundant and computing the set of redundant channels. The redundancy level of the network is also defined, proposing a theorem that supplies a lower bound for it. Finally, a fault-tolerant routing algorithm based on the former theory is proposed.",
	address = "Palma de Mallorca, Spain",
	issn = 09265473,
	journal = "IFIP Transactions A: Computer Science and Technology",
	key = "Multiprocessing systems",
	keywords = "Algorithms;Communication channels;Computer networks;Computer system recovery;Critical path analysis;Data communication systems;Fault tolerant computer systems;Graph theory;Redundancy;Reliability;Theorem proving;",
	note = "Connectivity;Deadlock freedom;Routing;Wormhole networks;",
	number = "A-39",
	pages = "277 - 288",
	title = "{T}heory to increase the effective redundancy in wormhole networks",
	year = 1993
}

Jose Duato. Theory to increase the effective redundancy in wormhole networks. 1993, 277 - 277. BibTeX

@conference{1995072767955,
	author = "Duato, Jose",
	address = "Palma de Mallorca, Spain",
	pages = "277 - 277",
	title = "{T}heory to increase the effective redundancy in wormhole networks",
	year = 1993
}

Jose Duato. On the design of deadlock-free adaptive multicast routing algorithms. Parallel Processing Letters 3(4):321 - 33, 1993. BibTeX

@article{4704978,
	author = "Duato, Jose",
	abstract = "Multicast communication refers to the delivery of the same message from one source node to an arbitrary number of destination nodes. Two multicast wormhole routing methods have been presented previously by X. Lin and L.M. Ni (1991) for multicomputers with 2D-mesh and hypercube topologies. Also, a theory for the design of deadlock-free adaptive routing algorithms for wormhole networks has been proposed previously by J. Duato (1991, 1993). This theory supplies the sufficient conditions for an adaptive routing algorithm to be deadlock-free, even when there are cyclic dependencies between channels. This paper analyses the additional channel dependencies produced by multicast routing algorithms on wormhole networks. Then, the theory proposed previously is extended by considering them. As an example, the multicast routing algorithms are extended, taking advantage of the alternative paths offered by the network",
	address = "Singapore",
	issn = "0129-6264",
	journal = "Parallel Processing Letters",
	keywords = "concurrency control;hypercube networks;system recovery;",
	note = "deadlock-free adaptive multicast routing algorithms;multicast communication;2D-mesh;hypercube topologies;wormhole networks;sufficient conditions;cyclic dependencies;",
	number = 4,
	pages = "321 - 33",
	title = "{O}n the design of deadlock-free adaptive multicast routing algorithms",
	volume = 3,
	year = 1993
}

Jose Duato. New theory of deadlock-free adaptive routing in wormhole networks. IEEE Transactions on Parallel and Distributed Systems 4(12):1320 - 1331, 1993. URL BibTeX

@article{1994041229381,
	author = "Duato, Jose",
	abstract = "Second generation multicomputers use wormhole routing, allowing a very low channel setup time and drastically reducing the dependency between network latency and internode distance. Deadlock-free routing strategies have been developed, allowing the implementation of fast hardware routers that reduce the communication bottleneck. Also, adaptive routing algorithms with deadlock-avoidance or deadlock-recovery techniques have been proposed for some topologies, being very effective and outperforming static strategies. This paper develops the theoretical background for the design of deadlock-free adaptive routing algorithms for wormhole networks. Some basic definitions and two theorems are proposed, developing conditions to verify that an adaptive algorithm is deadlock-free, even when there are cycles in the channel dependency graph. Also, two design methodologies are proposed. The first one supplies algorithms with a high degree of freedom, without increasing the number of physical channels. The second methodology is intended for the design of fault-tolerant algorithms. Some examples are given, showing the application of the methodologies. Finally, some simulations show the performance improvement that can be achieved by designing the routing algorithms with the new theory.",
	issn = 10459219,
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Computer networks",
	keywords = "Algorithms;Computer system recovery;Data communication systems;Data processing;Electric network topology;Fault tolerant computer systems;Graph theory;Multiprocessing systems;Virtual storage;",
	note = "Adaptive routing;Deadlock avoidance;Design methodologies;Fault tolerance;Multicomputers;Virtual channels;Wormhole routing;",
	number = 12,
	pages = "1320 - 1331",
	title = "{N}ew theory of deadlock-free adaptive routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/71.250114",
	volume = 4,
	year = 1993
}

Jose Duato. New theory of deadlock-free adaptive multicast routing in wormhole networks. 1993, 64 - 71. BibTeX

@conference{1994041213869,
	author = "Duato, Jose",
	abstract = "A theory for the design of deadlock-free adaptive routing algorithms for wormhole networks has been proposed in [11, 14]. This theory supplies the sufficient conditions for an adaptive routing algorithm to be deadlock-free, even when there are cyclic dependencies between channels. Also, two design methodologies have been proposed. Multicast communication refers to the delivery of the same message from one source node to an arbitrary number of destination nodes. Two multicast wormhole routing methods have been presented in [22] for multicomputers with 2D-mesh and hypercube topologies. This paper develops the theoretical background for the design of deadlock-free adaptive multicast routing algorithms for wormhole networks. Some basic definitions and two theorems are proposed, developing conditions to verify that an adaptive multicast routing algorithm is deadlock-free, even when there are cyclic dependencies between channels. As an example, the multicast routing algorithms presented in [22] are extended, so that they can take advantage of the alternative paths offered by the network.",
	address = "Dallas, TX, USA",
	journal = "Proceedings of the 5th IEEE Symposium on Parallel and Distributed Processing",
	key = "Multiprocessing systems",
	keywords = "Adaptive systems;Algorithms;Computer networks;Computer system recovery;Data communication systems;Data processing;Electric network topology;Program processors;Storage allocation (computer);",
	note = "Cyclic dependencies;Dally's theorem;Deadlock free adaptive multicast routing;Hypercube topologies;Multicast communications;Wormhole networks;Wormhole routing;",
	pages = "64 - 71",
	title = "{N}ew theory of deadlock-free adaptive multicast routing in wormhole networks",
	year = 1993
}

Z Liu, Jose Duato and L -E Thorelli. Grouping virtual channels for deadlock-free adaptive wormhole routing. 1993, 254 - 65. BibTeX

@conference{4607908,
	author = "Z. Liu and Duato, Jose and L.-E. Thorelli",
	abstract = "Recently, intensive research has been done to develop adaptive deadlock-free wormhole routing strategies for interconnection networks. One effective method is to partition the physical network into several virtual networks such that there is no channel dependency cycle in each of them even if full or partial adaptive routing strategies are used. However, each physical channel can be split into more virtual channels than the number necessary to set up the virtual networks. The additional virtual channels can be considered as one resource pool for all virtual networks. This means that a packet which is blocked in one virtual network can borrow one free valid virtual channel from the resource pool, returning it to the resource pool when it is released. The authors call this scheme the grouping technique and have applied it to double-y adaptive routing on a 2D mesh network, producing a new fully adaptive routing algorithm called group-double-y. The simulation results show that, with a heavily loaded network, it can double the average physical channel utilization under the uniform traffic pattern and increase it by 26% under the matrix-transpose traffic pattern. They have also applied the grouping technique in the Turn model on a 2D mesh network, producing a fully adaptive, minimum and nonminimum routing algorithm called group-turn-model. Compared with group-double-y, the simulation results show that, with a heavily loaded network, the group-turn-model increases the average physical channel utilization by 12% under the matrix-transpose traffic pattern and decreases it by 2% under the uniform traffic pattern",
	address = "Berlin, Germany",
	journal = "PARLE '93 Parallel Architectures and Languages Europe. 5th International PARLE Conference Proceedings",
	keywords = "digital simulation;multiprocessor interconnection networks;",
	note = "virtual channels grouping;deadlock-free adaptive wormhole routing;interconnection networks;virtual networks;grouping technique;2D mesh network;simulation results;Turn model;nonminimum routing algorithm;group-turn-model;group-double-y;average physical channel utilization;",
	pages = "254 - 65",
	title = "{G}rouping virtual channels for deadlock-free adaptive wormhole routing",
	year = 1993
}

J M Garcia and Jose Duato. Dynamic reconfiguration of multicomputer networks: limitations and tradeoffs. 1993, 317 - 23. URL BibTeX

@conference{4658021,
	author = "J.M. Garcia and Duato, Jose",
	abstract = "The dynamic reconfiguration of the interconnection network is an advanced feature of some multicomputers to reduce the communication overhead. Up to now, the work carried out in this field has focused on static switching, i.e., the network changes its topology before starting the execution of a phase of an application program and then it remains constant throughout the phase execution. However, the authors' work focuses on true dynamic reconfiguration, i.e., the network topology can change almost arbitrarily at runtime. In a previous paper (see Garcia and Duato, 1991), they presented an algorithm to handle the dynamic reconfiguration and some simulation results, showing the benefits achieved by this reconfiguration algorithm. In this paper, they expound in depth the reconfiguration algorithm and the different concepts related to it. The previous work is analyzed and compared with their algorithm, showing the improvements achieved",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings. Euromicro Workshop on Parallel and Distributed Processing",
	keywords = "multiprocessor interconnection networks;network topology;parallel algorithms;reconfigurable architectures;",
	note = "dynamic reconfiguration;multicomputer networks;interconnection network;communication overhead;network topology;reconfiguration algorithm;application program;runtime;",
	pages = "317 - 23",
	title = "{D}ynamic reconfiguration of multicomputer networks: limitations and tradeoffs",
	url = "http://dx.doi.org/10.1109/EMPDP.1993.336386",
	year = 1993
}

Pedro Lopez and Jose Duato. Deadlock-free adaptive routing algorithms for the 3D-torus: limitations and solutions. 1993, 684 - 7. BibTeX

@conference{4585304,
	author = "Lopez, Pedro and Duato, Jose",
	abstract = "A deadlock-free adaptive routing algorithm, obtained from the application of the theory proposed by J. Duato (1991) to the 3D-torus, is evaluated under different load conditions and compared with other algorithms. The results show that this algorithm is very fast, also increasing the network throughput considerably. Nevertheless, this adaptive algorithm has cycles in its channel dependency graph. As a consequence, when the network is heavily loaded messages may temporarily block cyclically, drastically reducing the performance of the algorithm. Two mechanisms are proposed to avoid this problem",
	address = "Berlin, Germany",
	journal = "PARLE '93 Parallel Architectures and Languages Europe. 5th International PARLE Conference Proceedings",
	keywords = "multiprocessor interconnection networks;performance evaluation;",
	note = "deadlock-free adaptive routing algorithms;3D-torus;network throughput;channel dependency graph;",
	pages = "684 - 7",
	title = "{D}eadlock-free adaptive routing algorithms for the 3{D}-torus: limitations and solutions",
	year = 1993
}

Jose Duato. A theory to increase the effective redundancy in wormhole networks. 1993, 277 - 88. BibTeX

@conference{4616928,
	author = "Duato, Jose",
	abstract = "Fault-tolerant systems aim at providing continuous operations in the presence of faults. Multicomputers rely on an interconnection network between processors to support the message-passing mechanism. Therefore, the reliability of the interconnection network is very important for the reliability of the whole system. This paper analyses the effective redundancy available in a wormhole network by combining connectivity and deadlock freedom. Redundancy is defined at the channel level, giving a sufficient condition for a channel to be redundant and computing the set of redundant channels. The redundancy level of the network is also defined, proposing a theorem that supplies a lower bound for it. Finally, a fault tolerant routing algorithm based on the former theory is proposed",
	address = "Netherlands",
	issn = "0926-5473",
	journal = "IFIP Transactions A (Computer Science and Technology)",
	keywords = "computer networks;distributed memory systems;fault tolerant computing;message passing;multiprocessor interconnection networks;redundancy;",
	note = "redundancy;wormhole networks;fault-tolerant systems;multicomputers;interconnection network;message-passing;reliability;connectivity;deadlock freedom;channel level;",
	pages = "277 - 88",
	title = "{A} theory to increase the effective redundancy in wormhole networks",
	volume = "A-39",
	year = 1993
}

Jose Duato. A new theory of deadlock-free adaptive multicast routing in wormhole networks. 1993, 64 - 71. URL BibTeX

@conference{4945959,
	author = "Duato, Jose",
	abstract = "A theory for the design of deadlock-free adaptive routing algorithms for wormhole networks has been proposed previously. This theory supplies the sufficient conditions for an adaptive routing algorithm to be deadlock-free, even when there are cyclic dependencies between channels. Also, two design methodologies have been proposed. Multicast communication refers to the delivery of the same message from one source node to an arbitrary number of destination nodes. Two multicast wormhole routing methods have been presented previously for multicomputers with 2D-mesh and hypercube topologies. This paper develops the theoretical background for the design of deadlock-free adaptive multicast routing algorithms for wormhole networks. Some basic definitions and two theorems are proposed, developing conditions to verify that an adaptive multicast routing algorithm is deadlock-free, even when there are cyclic dependencies between channels. As an example, the multicast routing algorithms presented previously are extended, so that they can take advantage of the alternative paths offered by the network",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing (Cat. No.93TH0584-3)",
	keywords = "hypercube networks;message passing;network routing;",
	note = "deadlock-free adaptive multicast routing;wormhole networks;deadlock-free adaptive routing algorithms;cyclic dependencies;destination nodes;multicomputers;2D-mesh;hypercube topologies;",
	pages = "64 - 71",
	title = "{A} new theory of deadlock-free adaptive multicast routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/SPDP.1993.395549",
	year = 1993
}

Antonio Robles and Jose Duato. Multilinks: a new approach to the design of adaptive routing algorithms for multicomputers. 1992, 405 - 10. BibTeX

@conference{4214095,
	author = "Robles, Antonio and Duato, Jose",
	abstract = "A new methodology for the design of deadlock-free adaptive routing algorithms is proposed, which is based on the use of multilinks. This is a new concept consisting of a virtual link formed by several adjacent physical channels simultaneously reserved by the router. Through simulation, the paper investigates the performance of two adaptive strategies for wormhole routing based on multilinks, comparing them with static routing. All adaptive routing strategies outperformed static routing significantly. Different network sizes have been evaluated, showing that the relative improvement of adaptive routing with regard to static routing increases with the network size",
	address = "Amsterdam, Netherlands",
	journal = "Parallel and Distributed Computing in Engineering Systems. Proceedings of the IMACS/IFAC International Symposium",
	keywords = "multiprocessor interconnection networks;switching theory;",
	note = "adaptive routing algorithms;multicomputers;deadlock-free adaptive routing algorithms;multilinks;virtual link;physical channels;wormhole routing;network size;",
	pages = "405 - 10",
	title = "{M}ultilinks: a new approach to the design of adaptive routing algorithms for multicomputers",
	year = 1992
}

Jose Duato. Improving the efficiency of virtual channels with time-dependent selection functions. 1992, 635 - 50. BibTeX

@conference{4325039,
	author = "Duato, Jose",
	abstract = "In previous papers by the author (1991, 1992), a new theory for the design of deadlock-free adaptive routing algorithms for wormhole and store-and-forward routing as well as two design methodologies have been proposed. Also, a new adaptive routing algorithm, obtained from the application of the former theory to the binary n-cube, has been evaluated using both, a uniform and an exponential distribution for message destination. The results are good, especially for large networks and a uniform distribution for message destination. When locality is exploited, the results are comparatively worse, mainly due to the reduction in channel bandwidth produced by channel multiplexing. In this paper, the author analyses the advantages and disadvantages produced by the use of virtual channels, proposing a new approach to maximize their efficiency. This approach uses time-dependent selection functions, associating a threshold to some virtual channels. Those channels cannot be selected by a message unless it is waiting for longer than the corresponding threshold. The evaluation of the new selection function for the binary n-cube shows an important improvement, especially when locality is exploited",
	address = "Berlin, Germany",
	journal = "PARLE '92. Parallel Architectures and Languages Europe. 4th International PARLE Conference. Proceedings",
	keywords = "message passing;multiprocessor interconnection networks;",
	note = "wormhole routing;virtual channels;time-dependent selection functions;deadlock-free adaptive routing algorithms;store-and-forward routing;binary n-cube;message destination;channel multiplexing;",
	pages = "635 - 50",
	title = "{I}mproving the efficiency of virtual channels with time-dependent selection functions",
	year = 1992
}

Jose Duato. Impact of locality on the performance of some adaptive routing algorithms for the hypercube. Proceedings of the European Workshops on Parallel Computing, pages 123 - 123, 1992. BibTeX

@article{1993041545718,
	author = "Duato, Jose",
	address = "Barcelona, Spain",
	journal = "Proceedings of the European Workshops on Parallel Computing",
	pages = "123 - 123",
	title = "{I}mpact of locality on the performance of some adaptive routing algorithms for the hypercube",
	year = 1992
}

Jose Duato. Impact of locality on the performance of some adaptive routing algorithms for the hypercube. 1992, 123 - 6. BibTeX

@conference{4391376,
	author = "Duato, Jose",
	abstract = "In previous papers, a new theory for the design of deadlock-free adaptive routing algorithms for wormhole and store-and-forward routing as well as two design methodologies have been proposed. Also, a new adaptive routing algorithm, obtained from the application of the former theory to the binary n-cube, has been evaluated using a uniform message distribution. The current paper analyses the effect of locality using a decreasing probability distribution for message destination. For that distribution, the results show that adaptive algorithms outperform static ones, except for very small networks with little traffic",
	address = "Amsterdam, Netherlands",
	journal = "Parallel Computing: From Theory to Sound Practice. Proceedings of EWPC '92, the European Workshops on Parallel Computing",
	keywords = "hypercube networks;parallel algorithms;performance evaluation;",
	note = "wormhole routing;adaptive routing algorithms;hypercube;deadlock-free;store-and-forward routing;message distribution;locality;decreasing probability distribution;message destination;",
	pages = "123 - 6",
	title = "{I}mpact of locality on the performance of some adaptive routing algorithms for the hypercube",
	year = 1992
}

Jose Duato. Channel classes: a new concept for deadlock avoidance in wormhole networks. Parallel Processing Letters 2(4):347 - 54, 1992. BibTeX

@article{4573086,
	author = "Duato, Jose",
	abstract = "The author has developed the theoretical background for the design of deadlock-free adaptive routing algorithms for store-and-forward and wormhole networks. Some definitions and theorems have been proposed, developing conditions to verify that an adaptive algorithm is deadlock-free, even when there are cyclic dependencies between channels. Also, two design methodologies have been proposed. He proposes a partial order between channels as well as an equivalence relation. This relation splits the set of channels into equivalence classes. Then, he extends his previous theory by considering equivalence classes (channel classes) instead of channels. This extension drastically simplifies the verification of deadlock freedom for adaptive routing algorithms with cyclic dependencies between channels. Finally, he presents an example",
	address = "Singapore",
	issn = "0129-6264",
	journal = "Parallel Processing Letters",
	keywords = "concurrency control;message passing;multiprocessor interconnection networks;",
	note = "channel classes;deadlock avoidance;wormhole networks;deadlock-free adaptive routing algorithms;store-and-forward;adaptive algorithm;cyclic dependencies;partial order;equivalence relation;verification;",
	number = 4,
	pages = "347 - 54",
	title = "{C}hannel classes: a new concept for deadlock avoidance in wormhole networks",
	volume = 2,
	year = 1992
}

J M Garcia and Jose Duato. An advanced environment for programming transputer networks with dynamic reconfiguration. 1992, 601 - 10. BibTeX

@conference{4513529,
	author = "J.M. Garcia and Duato, Jose",
	abstract = "The authors present a programming environment for multicomputers. Among other features, it allows them to evaluate the performance of parallel algorithms running on a multicomputer with both static and dynamically reconfigurable topologies. The results of this evaluation are obtained by simulating a machine model based on a transputer network. Their environment, called FDP, permits the simulation of the behaviour of a multicomputer. Several machine parameters can be adjusted. For example, the authors can vary the network topology, the number of nodes, the routing algorithm in the network, etc. Choosing different options is easy, because FDP has a friendly user interface. In their environment, a parallel algorithm is programmed in the distributed Pascal language. This new parallel language, which they have developed, is based on standard Pascal. Some extensions allow an easy and elegant programming of parallel algorithms, consisting of processes which communicate by means of message-passing",
	address = "Barcelona, Spain",
	journal = "Parallel Computing and Transputer Applications",
	keywords = "message passing;parallel algorithms;parallel processing;Pascal;programming environments;user interfaces;",
	note = "performance evaluation;static topologies;advanced environment;programming transputer networks;dynamic reconfiguration;programming environment;parallel algorithms;dynamically reconfigurable topologies;machine model;transputer network;routing algorithm;user interface;distributed Pascal language;message-passing;",
	pages = "601 - 10",
	title = "{A}n advanced environment for programming transputer networks with dynamic reconfiguration",
	year = 1992
}

Jose Duato. On the design of deadlock-free adaptive routing algorithms for multicomputers: theoretical aspects. 1991, 234 - 43. BibTeX

@conference{3928381,
	author = "Duato, Jose",
	abstract = "Second generation multicomputers use wormhole routing, drastically reducing the dependency between network latency and internode distance. Deadlock-free routing strategies have been developed, allowing the implementation of fast hardware routers. Also, adaptive routing algorithms with deadlock-avoidance or deadlock-recovery techniques have been proposed for some topologies, being very effective and outperforming static strategies. This paper develops the theoretical aspects for the design of deadlock-free adaptive routing algorithms. Some basic definitions and three theorems are proposed, developing conditions to verify that an adaptive algorithm is deadlock-free, even when there are cycles in the channel dependency graph. As an example, a new adaptive algorithm for 2D-meshes is presented",
	address = "Berlin, Germany",
	journal = "Distributed Memory Computing. 2nd European Conference, EDMCC2 Proceedings",
	keywords = "multiprocessor interconnection networks;parallel architectures;switching theory;",
	note = "message passing;deadlock-free adaptive routing algorithms;multicomputers;wormhole routing;network latency;deadlock-avoidance;channel dependency graph;2D-meshes;",
	pages = "234 - 43",
	title = "{O}n the design of deadlock-free adaptive routing algorithms for multicomputers: theoretical aspects",
	year = 1991
}

Jose Duato. On the design of deadlock-free adaptive routing algorithms for multicomputers: design methodologies. 1991, 390 - 405. BibTeX

@conference{3967718,
	author = "Duato, Jose",
	abstract = "The paper develops the theoretical background for the design of deadlock-free adaptive routing algorithms for wormhole as well as store-and-forward routing. Some basic definitions and four theorems are proposed, developing conditions to verify that an adaptive algorithm is deadlock-free, even when there are cycles in the channel dependency graph. Also, two design methodologies are proposed. The first one supplies algorithms with a high degree of freedom, without increasing the number of physical channels. The second methodology is intended for the design of fault-tolerant algorithms. Some examples are given, showing the application of the methodologies",
	address = "Berlin, Germany",
	journal = "PARLE '91. Parallel Architectures and Languages Europe. Volume I: Parallel Architectures and Algorithms",
	keywords = "fault tolerant computing;parallel processing;system recovery;",
	note = "deadlock-free adaptive routing algorithms;wormhole;store-and-forward routing;adaptive algorithm;deadlock-free;channel dependency graph;fault-tolerant algorithms;",
	pages = "390 - 405",
	title = "{O}n the design of deadlock-free adaptive routing algorithms for multicomputers: design methodologies",
	year = 1991
}

Jose Duato. Deadlock-free adaptive routing algorithms for multi-computers. Evaluation of a new algorithm. 1991, 840 - 840. URL BibTeX

@conference{1993031511897,
	author = "Duato, Jose",
	address = "Dallas, TX, USA",
	pages = "840 - 840",
	title = "{D}eadlock-free adaptive routing algorithms for multi-computers. {E}valuation of a new algorithm",
	url = "http://dx.doi.org/10.1109/SPDP.1991.218233",
	year = 1991
}

Jose Duato. Deadlock-free adaptive routing algorithms for multicomputers. Technique et Science Informatiques 10(4):275 - 85, 1991. BibTeX

@article{4002820,
	author = "Duato, Jose",
	abstract = "The paper proposes a very simple and powerful methodology to design deadlock-free adaptive routing algorithms for wormhole networks. The routing algorithms obtained from the application of that methodology to 2D and 3D-meshes are evaluated by simulation. As simulations are time consuming and adaptive algorithms are interesting when the network traffic is high, the simulations are restricted to the evaluation of networks of different sizes under worst conditions for medium to high message injection rates",
	address = "France",
	issn = "0752-4072",
	journal = "Technique et Science Informatiques",
	keywords = "concurrency control;parallel algorithms;",
	note = "2D meshes;multicomputers;deadlock-free adaptive routing algorithms;wormhole networks;3D-meshes;simulation;worst conditions;message injection rates;",
	number = 4,
	pages = "275 - 85",
	title = "{D}eadlock-free adaptive routing algorithms for multicomputers",
	volume = 10,
	year = 1991
}

J M Garcia and Jose Duato. An algorithm for dynamic reconfiguration of a multicomputer network. 1991, 848 - 55. URL BibTeX

@conference{4368132,
	author = "J.M. Garcia and Duato, Jose",
	abstract = "The dynamic reconfiguration of the interconnection network is an advanced feature of some multicomputers to reduce the communication overhead. The authors present an algorithm for the dynamic reconfiguration of the network. Reconfiguration is limited, preserving the original topology. Long distance message passing is minimized by positioning communication partners close to each other. The algorithm is transparent to the application programmer and is not restricted to a particular class of applications, being very well suited for parallel applications whose communication pattern varies over time. The paper also presents some simulation results, showing the benefits from the new reconfiguration algorithm",
	address = "Los Alamitos, CA, USA",
	journal = "Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing (Cat. No.91TH0396-2)",
	keywords = "message passing;multiprocessor interconnection networks;parallel processing;",
	note = "algorithm;dynamic reconfiguration;multicomputer network;interconnection network;message passing;simulation;",
	pages = "848 - 55",
	title = "{A}n algorithm for dynamic reconfiguration of a multicomputer network",
	url = "http://dx.doi.org/10.1109/SPDP.1991.218232",
	year = 1991
}

Jose Duato and J Pons. Parallel triangularization of sparse matrices on distributed memory multiprocessors. 1989, 133 - 133. BibTeX

@conference{1991021057189,
	author = "Duato, Jose and J. Pons",
	address = "Rennes, France",
	pages = "133 - 133",
	title = "{P}arallel triangularization of sparse matrices on distributed memory multiprocessors",
	year = 1989
}

Jose Duato and J Pons. Parallel triangularization of sparse matrices on distributed memory multiprocessors. 1989, 133 - 46. BibTeX

@conference{3601992,
	author = "Duato, Jose and J. Pons",
	abstract = "Square root free Givens rotations and their suitability for the parallel triangularization of sparse matrices are studied. Also, ways to split the overall problem into several tasks and the distribution of these tasks among the processors are analysed. A parallel algorithm to implement the triangularization of a sparse matrix on a distributed memory multiprocessor is proposed. This paper also presents the results of the performance evaluation on a multicomputer simulator, showing that very good speedups can be obtained, even with relatively small matrices. The evaluation also shows that a ring supports the algorithm efficiently",
	address = "Amsterdam, Netherlands",
	journal = "Hypercube and Distributed Computers. Proceedings of the First European Workshop",
	keywords = "linear algebra;multiprocessing systems;parallel algorithms;performance evaluation;",
	note = "parallel triangularization;square root free Givens rotation;sparse matrices;distributed memory multiprocessors;parallel algorithm;distributed memory multiprocessor;performance evaluation;multicomputer simulator;",
	pages = "133 - 46",
	title = "{P}arallel triangularization of sparse matrices on distributed memory multiprocessors",
	year = 1989
}

R Bru, Jose Duato, A Gonzalez, J Mas and A Urbano. Performance evaluation of a parallel algorithm for inverting dense matrices on distributed memory multiprocessors. 1989, 647 - 50. BibTeX

@conference{3799459,
	author = "R. Bru and Duato, Jose and A. Gonzalez and J. Mas and A. Urbano",
	abstract = "The authors present a parallel algorithm to invert a square dense matrix A, based on the Sherman-Morrison formula. It has been developed for distributed memory multiprocessors, obtaining a high degree of parallelism for matrices with a very large size. They have implemented this algorithm on a simulation tool in order to check its correctness. They also give results about the efficiency and speed-up as a function of some variables (size of A, number of processors, arithmetic and communication times) for three interconnection networks",
	address = "Los Altos, CA, USA",
	journal = "Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications",
	keywords = "hypercube networks;matrix algebra;parallel algorithms;performance evaluation;",
	note = "performance evaluation;matrix inversion;correctness;checking;parallel algorithm;distributed memory multiprocessors;square dense matrix;Sherman-Morrison formula;parallelism;simulation tool;interconnection networks;",
	pages = "647 - 50",
	title = "{P}erformance evaluation of a parallel algorithm for inverting dense matrices on distributed memory multiprocessors",
	year = 1989
}

Jose Duato and A Gonzalez. Multicomputer simulator. 1989, 367 - 367. BibTeX

@conference{1991021057214,
	author = "Duato, Jose and A. Gonzalez",
	address = "Rennes, France",
	pages = "367 - 367",
	title = "{M}ulticomputer simulator",
	year = 1989
}

Jose Duato and A Gonzalez. Multicomputer simulator. 1989, 367 - 8. BibTeX

@conference{3602017,
	author = "Duato, Jose and A. Gonzalez",
	abstract = "Presents a multicomputer simulator, specially oriented to the development, debugging and evaluation of parallel numerical algorithms. This simulator implements a model, based on the features of real multicomputers and is not intended to substitute real machines, but to complement them. Additionally, the simulator allows the use of friendly languages, like Modula-2, including the definition of complex data structures and dynamical memory allocation. The model currently implemented is described together with the way to implement parallel algorithms with this simulator. The paper also indicates how to use the simulator and proposes some future developments",
	address = "Amsterdam, Netherlands",
	journal = "Hypercube and Distributed Computers. Proceedings of the First European Workshop",
	keywords = "mathematics computing;parallel algorithms;",
	note = "multicomputer simulator;development;debugging;evaluation;parallel numerical algorithms;friendly languages;Modula-2;data structures;dynamical memory allocation;",
	pages = "367 - 8",
	title = "{M}ulticomputer simulator",
	year = 1989
}

Jose Duato and P Albertos. Simplified non-linear control of lifts. 1988, 149 - 54. BibTeX

@conference{3147950,
	author = "Duato, Jose and P. Albertos",
	abstract = "A non-linear controller for lift installations is proposed in this paper. After a brief description of the system, the main difficulties that arose during the evaluation of a prototype with a linear regulator are reported, showing the load influence and the non-linearity of the motor. Later, a simplified dynamic model for the lift motor is proposed, analysing two methods for load estimation. Finally, some non-linear functions are included in the regulator, thus improving the system behaviour. Also, the main results obtained with an 85 HP motor are reported",
	address = "Oxford, UK",
	journal = "Components, Instruments and Techniques for Low Cost Automation and Applications. IFAC Symposium",
	keywords = "computerised control;controllers;lifts;nonlinear control systems;",
	note = "nonlinear controller;lifts;load influence;dynamic model;load estimation;regulator;",
	pages = "149 - 54",
	title = "{S}implified non-linear control of lifts",
	year = 1988
}

Jose Duato. Parallel processing of the square root free Givens rotations by means of a transputer network. 1988, 257 - 64. BibTeX

@conference{3228374,
author = "Duato, Jose",
abstract = "There are many applications that require the use of orthogonal transformations. Since the cost of microprocessors is decreasing rapidly, they allow the construction of inexpensive multiprocessors. In particular, the transputer from Inmos allows an easy implementation of a powerful multiprocessor, with great flexibility in the design of the interconnection network topology. Among the orthogonal transformations, the Givens rotations are best suited for a parallel computer because they exhibit great potential parallelism. Moreover, the square root free Givens rotations are twice as fast as the conventional ones. In this paper, the possibility to split the transformation of a matrix into several tasks and their distribution among the processors are analysed. Besides, a parallel algorithm to implement the fast Givens rotations with a network of transputers is proposed. Also, the topology of the transputer network is proposed. This topology can be implemented with a different number of processors, depending on the processing speed required. Finally, the results of the algorithm simulation on different sized networks are shown",
address = "Amsterdam, Netherlands",
journal = "Parallel Processing and Applications. Proceedings of the International Conference",
keywords = "microprocessor chips;parallel algorithms;",
note = "parallel processing;square root free Givens rotations;transputer network;orthogonal transformations;Inmos;interconnection network topology;parallel algorithm;algorithm simulation;",
pages = "257 - 64",
title = "{P}arallel processing of the square root free {G}ivens rotations by means of a transputer network",
year = 1988
}

Jose Duato. A network topology for parallel processing on message-passing architectures. 1988, 167 - 73. BibTeX

@conference{3531987,
author = "Duato, Jose",
abstract = "In order to construct general purpose massively parallel systems, message-passing architectures appear as a trade-off between flexibility and cost. In this kind of system, the communication among processors relies on an interconnection network. Point-to-point topologies are normally used, each node sending, receiving and routing messages in a distributed manner. Some authors have tried to find a trade-off between diameter and node degree, also maintaining the possibility to design a simple routing algorithm, such as the hypernets or the cube-connected cycles. In this paper, a new topology with a node degree equal to four is defined. Its main feature is a very small diameter, which is only slightly larger than the diameter of a hypercube with the same number of nodes, whatever the network size is. The diameter and the distances between nodes are also presented for different sized networks, comparing the proposed topology with other topologies. Finally, a distributed algorithm to route messages through the network is given",
address = "St. Petersburg, FL, USA",
journal = "ICS 88. Third International Conference on Supercomputing. Proceedings, Supercomputing '88",
keywords = "computer architecture;parallel processing;",
note = "network topology;parallel processing;message-passing architectures;massively parallel systems;interconnection network;hypercube;distributed algorithm;",
pages = "167 - 73",
title = "{A} network topology for parallel processing on message-passing architectures",
year = 1988
}

Jose Duato. A network topology with small diameter and constant node degree. 1988, 107 - 10. BibTeX

@conference{3451387,
	author = "Duato, Jose",
	abstract = "In order to construct general purpose massively parallel computers, message-passing architectures appear as a trade-off between flexibility and cost. In this kind of system, the communication among processors relies on an interconnection network. Point-to-point topologies are normally used, each node sending, receiving and routing messages in a distributed manner. In the paper, a topology with a node degree equal to four is defined. Its main feature is a very small diameter, which is only slightly larger than the diameter of a hypercube with the same number of nodes, whatever the network size is. Besides, distributed algorithms to route and broadcast messages through the network are proposed, also giving an Occam implementation",
	address = "Anaheim, CA, USA",
	journal = "Proceedings of the IASTED International Symposium Applied Informatics - AI '88",
	keywords = "multiprocessor interconnection networks;",
	note = "network topology;constant node degree;massively parallel computers;message-passing architectures;interconnection network;small diameter;hypercube;distributed algorithms;route;broadcast;Occam implementation;",
	pages = "107 - 10",
	title = "{A} network topology with small diameter and constant node degree",
	year = 1988
}

Jose Duato and J Pons. A new family of network topologies for message-passing architectures. 1988, 111 - 14. BibTeX

@conference{3532077,
	author = "Duato, Jose and J. Pons",
	abstract = "The communication among processors in message-passing architectures relies on an interconnection network. Some authors have tried to find a trade-off between diameter and node degree, also allowing the design of a simple routing algorithm such as the hypernets or the cube-connected cycles. In this paper, a new family of topologies is defined, allowing the diameter and node degree to be chosen independently. Its main feature is a very small diameter, obtained with a small node degree. The diameter and normalized diameter are presented for different sized networks, comparing the proposed family with other topologies",
	address = "Amsterdam, Netherlands",
	journal = "Parallel Processing. Proceedings of the IFIP WG 10.3 Working Conference",
	keywords = "computer architecture;multiprocessor interconnection networks;network topology;",
	note = "network topologies;message-passing architectures;interconnection network;routing algorithm;hypernets;",
	pages = "111 - 14",
	title = "{A} new family of network topologies for message-passing architectures",
	year = 1988
}

Jose Duato. Nonlinear digital control of three-phase asynchronous motors. Automatica e Instrumentacion 21(170):177 - 81, 1987. BibTeX

@article{3015985,
	author = "Duato, Jose",
	abstract = "The problem of the speed control of three-phase asynchronous motors driving, for example, elevators (lifts) in cases in which the control may be effected by varying the voltage on the stator is addressed. The article proposes a simplified dynamic model of the motor and its load and analyses simple methods of estimating the actual load. It suggests methods of digital control based upon the use of microprocessors. It discusses the technology in some detail and indicates the nature of results obtained experimentally",
	address = "Spain",
	issn = "0213-3113",
	journal = "Automatica e Instrumentacion",
	keywords = "computerised control;electric drives;induction motors;lifts;nonlinear control systems;",
	note = "nonlinear digital control;stator-voltage variation;load estimation;microprocessor-based digital control;three-phase asynchronous motors;speed control;elevators;lifts;simplified dynamic model;",
	number = 170,
	pages = "177 - 81",
	title = "{N}onlinear digital control of three-phase asynchronous motors",
	volume = 21,
	year = 1987
}

Jose Duato. Fault-tolerant microprocessor-based control system for multiple lift installations. 1987, 209 - 13. BibTeX

@conference{3067181,
	author = "Duato, Jose",
	abstract = "A fault-tolerant control system for multiple lift installations is proposed. The system has a modular structure. It is composed of a system controller and as many motor controllers as lifts. The motor controller supports transient failure recovery through software routines and watchdog timers. Each lift may be assigned to two motor controllers, thus taking advantage of the existence of similar components. The system controller has a redundant architecture and, among other things, governs the system reconfiguration in the case of a failure. The additional cost of true redundancy is less than 1% of the overall installation cost. Also, the main causes of failure are analysed as well as the suitability of the proposed architecture to recover from them",
	address = "Oxford, UK",
	journal = "Microcomputer Application in Process Control. Selected Papers from the IFAC Symposium",
	keywords = "control systems;induction motors;lifts;microcomputer applications;position control;",
	note = "induction motors;position control;fault-tolerant microprocessor based control system;multiple lift installations;transient failure recovery;watchdog timers;motor controllers;",
	pages = "209 - 13",
	title = "{F}ault-tolerant microprocessor-based control system for multiple lift installations",
	year = 1987
}

Jose Duato, P Albertos and J M Valiente. Position and speed control for lift motors based on a micro. Regulacion y Mando Automatico 19(147):155 - 9, 1985. BibTeX

@article{2503597,
	author = "Duato, Jose and P. Albertos and J.M. Valiente",
	abstract = "Many industrial applications involve the movement of large bodies with a precise end position but which do not require high precision speed control. This article describes the principal operating features of a micro-based control system for lift motors which offers the appropriate desirable features",
	address = "Spain",
	issn = "0040-1722",
	journal = "Regulacion y Mando Automatico",
	keywords = "computerised control;lifts;position control;velocity control;",
	note = "speed control;position control;computerised control;lift motors;micro;industrial applications;operating features;",
	number = 147,
	pages = "155 - 9",
	title = "{P}osition and speed control for lift motors based on a micro",
	volume = 19,
	year = 1985
}

P Albertos and Jose Duato. Position control with power induction motors. 1984, 547 - 52. BibTeX

@conference{2318470,
	author = "P. Albertos and Duato, Jose",
	abstract = "Some industrial control applications require large mass displacement with accurate final positioning control, displacement speed being limited by the maximum available motor torque. A microprocessor-based control system is presented. To reach the target position a multimode control is implemented, including acceleration, maximum speed and deceleration controls, as well as final position control. A squirrel cage induction motor is used as actuator, speed being controlled by stator voltage control. The control system has been implemented on a 10 HP induction motor",
	address = "Oxford, UK",
	journal = "Control in Power Electronics and Electrical Drives. Proceedings of the Third IFAC Symposium",
	keywords = "computerised control;induction motors;position control;squirrel cage motors;voltage control;",
	note = "induction motors;industrial control applications;positioning control;microprocessor-based control system;position control;squirrel cage;actuator;stator voltage control;",
	pages = "547 - 52",
	title = "{P}osition control with power induction motors",
	year = 1984
}

J A Puente, A Crespo, Jose Duato and P Albertos. KERNEL FOR HIGH LEVEL REAL-TIME PROGRAMMING.. 1983, IEEE Region 8 -. BibTeX

@conference{1984020034052,
	author = "de la Puente, J.A. and A. Crespo and Duato, Jose and P. Albertos",
	address = "Athens, Greece",
	key = "COMPUTER PROGRAMMING LANGUAGES",
	note = "HIGH LEVEL LANGUAGES;KERNEL;MICROCOMPUTERS;MINICOMPUTERS;REAL-TIME PROGRAMMING;SOFTWARE INTERFACES;",
	pages = "IEEE Region 8 -",
	title = "{KERNEL} {FOR} {HIGH} {LEVEL} {REAL}-{TIME} {PROGRAMMING}.",
	volume = 1,
	year = 1983
}

J A Puente, A Crespo, Jose Duato and P Albertos. A kernel for high level real-time programming. 1983, 7 - 10. BibTeX

@conference{2237347,
	author = "de la Puente, J.A. and A. Crespo and Duato, Jose and P. Albertos",
	abstract = "A kernel for the implementation of real-time primitives in high-level languages is presented. The kernel is suited for developing control applications in minicomputers and microcomputers and is portable from one system to another. An application to a pilot-scale process control system has been developed using the kernel",
	address = "New York, NY, USA",
	journal = "Proceedings of MELECON '83. Mediterranean Electrotechnical Conference",
	keywords = "high level languages;process computer control;programmed control;real-time systems;",
	note = "high level real-time programming;kernel;real-time primitives;high-level languages;control applications;minicomputers;microcomputers;pilot-scale process control system;",
	pages = "7 - 10",
	title = "{A} kernel for high level real-time programming",
	year = 1983
}

Thesis

Addressing Manufacturing Challenges in NoC-based ULSI Designs. Jose Duato, Federico Silla (Network-On-Chip)

High-performance arch. for high-radix switches. Jose Flich, Jose Duato (Switch Architectures)

Efficient techniques to provide scalability for token-based cache coherence protocols. Antonio Robles, Jose Duato (Routing Algorithms)

Routing and flow control in networks of workstations. Jose Duato (Networks of Workstations)

On the Enhancement of Remote GPU Virtualization in High Performance Clusters. Jose Duato, Federico Silla (High Performance Clusters)