Pedro López

Contact

Position: Full Professor

Address: Valencia
Phone: +34963877007x77572

Image & Curriculum Vitae

Publications

Roberto Peñaranda, Maria E Gomez and Pedro Lopez. A Fault-Tolerant Routing Strategy for KNS Topologies Based on Intermediate Nodes. Concurrency and Computation Practice and Experience 29(SI HiPINEB 2016), 2017. BibTeX

@article{10.1002/cpe.4065,
	author = "Pe{\~n}aranda, Roberto and Gomez, Maria E. and Lopez, Pedro",
	abstract = "Exascale computing systems are being built with thousands of nodes. The high number of components of these systems significantly increases the probability of failure. A key component for them is the interconnection network. If failures occur in the interconnection network, they may isolate a large fraction of the machine. For this reason, an efficient fault-tolerant mechanism is needed to keep the system interconnected, even in the presence of faults. A recently proposed topology for these large systems is the hybrid k-ary n-direct s-indirect (KNS) family that provides optimal performance and connectivity at a reduced hardware cost. This paper presents a fault-tolerant routing methodology for the KNS topology that degrades performance gracefully in the presence of faults and tolerates a large number of faults without disabling any healthy computing node. In order to tolerate network failures, the methodology uses a simple mechanism. For any source-destination pair, if necessary, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network) with the aim of circumventing faults. The evaluation results show that the proposed methodology tolerates a large number of faults. For instance, it is able to tolerate more than 99.5% of fault combinations when there are ten faults in a 3-D network with 1,000 nodes using only one intermediate node and more than 99.98% if two intermediate nodes are used. Furthermore, the methodology offers a graceful performance degradation. As an example, performance degrades only by 1% for a 2-D network with 1,024 nodes and 1% faulty links.",
	journal = "Concurrency and Computation: Practice and Experience",
	note = "SI HiPINEB 2016",
	volume = 29,
	title = "{A} {F}ault-{T}olerant {R}outing {S}trategy for {KNS} {T}opologies {B}ased on {I}ntermediate {N}odes",
	year = 2017
}

Roberto Peñaranda, Pedro Lopez and Maria E Gomez. A New Fault-Tolerant Routing Methodology for KNS Topologies. 2016 2nd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB), 2016. BibTeX

@conference{10.1109/HIPINEB.2016.9,
	author = "Pe{\~n}aranda, Roberto and Lopez, Pedro and Gomez, Maria E.",
	booktitle = "2016 2nd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB)",
	title = "{A} {N}ew {F}ault-{T}olerant {R}outing {M}ethodology for {KNS} {T}opologies",
	year = 2016
}

Roberto Peñaranda, Crispín Gomez, Maria E Gomez and Pedro Lopez. XORAdap: A HoL-Blocking Aware Adaptive Routing Algorithm. 2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2015. BibTeX

@conference{10.1109/PDP.2015.50,
	author = "Pe{\~n}aranda, Roberto and Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro",
	booktitle = "2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)",
	title = "{XORA}dap: {A} {H}o{L}-{B}locking {A}ware {A}daptive {R}outing {A}lgorithm",
	year = 2015
}

Salvador Petit, Rafael Ubal, Julio Sahuquillo and Pedro Lopez. Efficient Register Renaming and Recovery for High-Performance Processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 22(7):1506-1514, 2014. BibTeX

@article{10.1109/TVLSI.2013.2270001,
	author = "Petit, Salvador and Ubal, Rafael and Sahuquillo, Julio and Lopez, Pedro",
	abstract = "Modern superscalar processors implement register renaming using either random access memory (RAM) or content-addressable memories (CAM) tables. The design of these structures should address both access time and misprediction recovery penalty. Although direct-mapped RAMs provide faster access times, CAMs are more appropriate to avoid recovery penalties. The presence of associative ports in CAMs, however, prevents them from scaling with the number of physical registers and pipeline width, negatively impacting performance, area, and energy consumption at the rename stage. In this paper, we present a new hybrid RAM–CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM provides fast and energy-efficient access to register mappings. On misspeculation, a low-complexity CAM enables immediate recovery. Experimental results show that in a four-way state-of-the-art superscalar processor, the new approach provides almost the same performance as an ideal CAM-based renaming scheme, while dissipating only between 17% and 26% of the original energy and, in some cases, consuming less energy than purely RAM-based renaming schemes. Overall, the silicon area required to implement the hybrid RAM–CAM scheme does not exceed the area required by conventional renaming mechanisms.",
	journal = "IEEE Transactions on Very Large Scale Integration (VLSI) Systems",
	number = 7,
	pages = "1506-1514",
	title = "{E}fficient {R}egister {R}enaming and {R}ecovery for {H}igh-{P}erformance {P}rocessors",
	volume = 22,
	year = 2014
}

Roberto Peñaranda, Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. A New Family of Hybrid Topologies for Large-Scale Interconnection Networks. IEEE 11th International Symposium on Network Computing and Applications, pages 220-227, August 2012. BibTeX

@conference{HybridTopology,
	author = "Pe{\~n}aranda, Roberto and Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "In large supercomputers the topology of the interconnection network is a key design issue that impacts the performance and cost of the whole system. Direct topologies provide a reduced hardware cost, but, as the number of dimensions is conditioned by 3D wiring restrictions, a high number of nodes per dimension is used, which increases communication latency and reduces network throughput. On the other hand, indirect topologies can provide better performance for large network sizes, but at the cost of a high number of switches and links. In this paper, we propose a new family of topologies that combines the best features of both direct and indirect topologies to efficiently connect an extremely high number of nodes. In particular, we propose an n-dimensional topology, where the nodes of each dimension are connected through a small indirect topology. This combination results in a family of topologies that provides high performance, with latency and throughput figures of merit close to indirect topologies, but with a lower hardware cost. In particular, it is able to double the throughput obtained per switching element of indirect topologies. Moreover, the layout of the topology is much simpler than in indirect topologies. Indeed, its fault-tolerance degree is equal or higher than the one for direct and indirect topologies.",
	booktitle = "IEEE 11th International Symposium on Network Computing and Applications",
	keywords = "routing algorithm, direct topology, indirect topology",
	month = "August",
	pages = "220-227",
	title = "{A} {N}ew {F}amily of {H}ybrid {T}opologies for {L}arge-{S}cale {I}nterconnection {N}etworks",
	year = 2012
}

Roberto Peñaranda, Crispín Gomez, Maria E Gomez and Pedro Lopez. A New Family of Hybrid Topologies for Large-Scale Interconnection Networks. Network Computing and Applications (NCA), 2012 11th IEEE International Symposium on, 2012. BibTeX

@conference{10.1109/NCA.2012.22,
	author = "Pe{\~n}aranda, Roberto and Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro",
	abstract = "In large supercomputers the topology of the interconnection network is a key design issue that impacts the performance and cost of the whole system. Direct topologies provide a reduced hardware cost, but as the number of dimensions is conditioned by 3D wiring restrictions, a high number of nodes per dimension is used, which increases communication latency and reduces network throughput. On the other hand, indirect topologies can provide better performance for large network sizes, but at the cost of a high number of switches and links. In this paper we propose a new family of topologies that combines the best features of both direct and indirect topologies to efficiently connect an extremely high number of nodes. In particular, we propose an n-dimensional topology where the nodes of each dimension are connected through a small indirect topology. This combination results in a family of topologies that provides high performance, with latency and throughput figures of merit close to indirect topologies, but with a lower hardware cost. In particular, it is able to double the throughput obtained per switching element of indirect topologies. Moreover, the layout of the topology is much simpler than in indirect topologies. Indeed, its fault-tolerance degree is equal or higher than the one for direct and indirect topologies.",
	booktitle = "Network Computing and Applications (NCA), 2012 11th IEEE International Symposium on",
	title = "{A} {N}ew {F}amily of {H}ybrid {T}opologies for {L}arge-{S}cale {I}nterconnection {N}etworks",
	year = 2012
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. How to reduce packet dropping in a bufferless NoC. Concurrency and Computation: Practice and Experience 23(1):86 - 99, 2011. URL, DOI BibTeX

@article{11723780,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on-chip (NoCs) interconnect the components located inside a chip. In multicore chips, NoCs have a strong impact on the overall system performance. NoC bandwidth is limited by the critical path delay. Recent works show that the critical path delay is heavily affected by switch port buffer size. Therefore, by removing buffers, switch clock frequency can be increased. Recently, a new switching technique for NoCs called Blind Packet Switching (BPS) has been proposed, which is based on removing the switch port buffers. Since buffers consume a high percentage of switch power and area, BPS not only improves performance but also reduces power and area. In BPS, as there are no buffers at the switch ports, packets cannot be stopped and stored on them. If contention arises packets are dropped and later reinjected, negatively affecting performance. In order to prevent packet dropping, some techniques based on resource replication have been proposed. In this paper, we propose some alternative and complementary techniques that do not rely on resource replication. By using them, packet dropping is highly reduced. In particular, packet dropping is completely removed for a very wide network traffic range. Moreover, network throughput is increased and packet latency is reduced. © 2010 John Wiley {\&} Sons, Ltd.",
	address = "UK",
	doi = "10.1002/cpe.1606",
	issn = "1532-0626",
	journal = "Concurrency and Computation: Practice and Experience",
	keywords = "buffer circuits;circuit switching;network-on-chip;",
	note = "packet dropping reduction;bufferless NoC;networks on-chip;critical path delay;switch clock frequency;blind packet switching;switch port buffers;network traffic range;",
	number = 1,
	pages = "86 - 99",
	title = "{H}ow to reduce packet dropping in a bufferless {N}o{C}",
	url = "http://dx.doi.org/10.1002/cpe.1606",
	volume = 23,
	year = 2011
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. How to reduce packet dropping in a bufferless NoC. Concurrency and Computation: Practice and Experience 23(1):86-99, 2011. URL, DOI BibTeX

@article{DBLP:journals/concurrency/RequenaGLD11,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on-chip (NoCs) interconnect the components located inside a chip. In multicore chips, NoCs have a strong impact on the overall system performance. NoC bandwidth is limited by the critical path delay. Recent works show that the critical path delay is heavily affected by switch port buffer size. Therefore, by removing buffers, switch clock frequency can be increased. Recently, a new switching technique for NoCs called Blind Packet Switching (BPS) has been proposed, which is based on removing the switch port buffers. Since buffers consume a high percentage of switch power and area, BPS not only improves performance but also reduces power and area. In BPS, as there are no buffers at the switch ports, packets cannot be stopped and stored on them. If contention arises packets are dropped and later reinjected, negatively affecting performance. In order to prevent packet dropping, some techniques based on resource replication have been proposed. In this paper, we propose some alternative and complementary techniques that do not rely on resource replication. By using them, packet dropping is highly reduced. In particular, packet dropping is completely removed for a very wide network traffic range. Moreover, network throughput is increased and packet latency is reduced. Copyright © 2010 John Wiley {\&} Sons, Ltd.",
	doi = "10.1002/cpe.1606",
	issn = "1532-0634",
	journal = "Concurrency and Computation: Practice and Experience",
	keywords = "networks on-chip;buffer limitations;packet dropping reduction",
	number = 1,
	pages = "86-99",
	title = "{H}ow to reduce packet dropping in a bufferless {N}o{C}",
	url = "http://dx.doi.org/10.1002/cpe.1606",
	volume = 23,
	year = 2011
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. How to reduce packet dropping in a bufferless NoC. Concurrency Computation Practice and Experience 23(1):86 - 99, 2011. URL BibTeX

@article{20105213526965,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on-chip (NoCs) interconnect the components located inside a chip. In multicore chips, NoCs have a strong impact on the overall system performance. NoC bandwidth is limited by the critical path delay. Recent works show that the critical path delay is heavily affected by switch port buffer size. Therefore, by removing buffers, switch clock frequency can be increased. Recently, a new switching technique for NoCs called Blind Packet Switching (BPS) has been proposed, which is based on removing the switch port buffers. Since buffers consume a high percentage of switch power and area, BPS not only improves performance but also reduces power and area. In BPS, as there are no buffers at the switch ports, packets cannot be stopped and stored on them. If contention arises packets are dropped and later reinjected, negatively affecting performance. In order to prevent packet dropping, some techniques based on resource replication have been proposed. In this paper, we propose some alternative and complementary techniques that do not rely on resource replication. By using them, packet dropping is highly reduced. In particular, packet dropping is completely removed for a very wide network traffic range. Moreover, network throughput is increased and packet latency is reduced. Copyright © 2010 John Wiley {\&} Sons, Ltd.",
	address = "Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom",
	issn = "1532-0626",
	journal = "Concurrency Computation Practice and Experience",
	key = "Packet switching",
	keywords = "Signal filtering and prediction;",
	note = "buffer limitations;Buffer sizes;Clock frequency;Critical path delays;Multicore chips;Network throughput;Network traffic;On chips;Packet dropping;Packet latencies;Resource replication;Switch ports;Switch power;Switching techniques;",
	number = 1,
	pages = "86 - 99",
	title = "{H}ow to reduce packet dropping in a bufferless {N}o{C}",
	url = "http://dx.doi.org/10.1002/cpe.1606",
	volume = 23,
	year = 2011
}

Marina Alonso, Salvador Coll, Juan Miguel Martínez, Vicente Santonja, Pedro Lopez and Jose Duato. Power saving in regular interconnection networks. Parallel Computing 36(12):696 - 712, 2010. URL, DOI BibTeX

@article{MarinaAlonso|Coll2010696,
	author = "Alonso, Marina and Coll, Salvador and Mart{\'i}nez, Juan Miguel and Santonja, Vicente and Lopez, Pedro and Duato, Jose",
	abstract = "The high level of computing power required for some applications can only be achieved by multiprocessor systems. These systems consist of several processors that communicate by means of an interconnection network. The huge increase both in size and complexity of high-end multiprocessor systems has triggered up their power consumption. Complex cooling systems are needed, which, in turn, increases power consumption. Power consumption reduction techniques are being applied everywhere in computer systems and the interconnection network is not an exception, as its contribution is not negligible. In this paper, we propose a mechanism to reduce interconnect power consumption that combines two alternative techniques: (i) dynamically switching on and off network links as a function of traffic (any link can be switched off, provided that network connectivity is guaranteed), (ii) dynamically reducing the available network bandwidth when traffic becomes low. In both cases, the topology of the network is not modified. Therefore, the same routing algorithm can be used regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that the network power consumption can be greatly reduced, at the expense of some increase in latency. However, the achieved power reduction is always higher than the latency penalty.",
	doi = "10.1016/j.parco.2010.08.003",
	issn = "0167-8191",
	journal = "Parallel Computing",
	keywords = "Power saving; Interconnection networks; Routing",
	number = 12,
	pages = "696 - 712",
	title = "{P}ower saving in regular interconnection networks",
	url = "http://www.sciencedirect.com/science/article/B6V12-50VTWG7-1/2/7972b8869966237a0ab6b680fd5fa6ba",
	volume = 36,
	year = 2010
}

Joan-Lluis Ferrer, Elvira Baydal, Antonio Robles, Pedro Lopez and Jose Duato. A Scalable and Early Congestion Management Mechanism for MINs. In Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-Based Processing, PDP 2010. 2010, 43 - 50. URL BibTeX

@conference{11260741,
	author = "Ferrer, Joan-Lluis and Baydal, Elvira and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = "Several packet marking-based mechanisms have been proposed to manage congestion in multistage interconnection networks. One of them, the MVCM mechanism obtains very good results for different network configurations and traffic loads. However, as MVCM applies full virtual output queuing at origin, its memory requirements may jeopardize its scalability. Additionally, the applied packet marking technique introduces certain delay to detect congestion. In this paper, we propose and evaluate the Scalable Early Congestion Management mechanism which eliminates the drawbacks exhibited by MVCM. The new mechanism replaces the full virtual output queuing at origin by either a partial virtual output queuing or a shared buffer, in order to reduce its memory requirements, thus making the mechanism scalable. Also, it applies an improved packet marking technique based on marking packets at output buffers regardless of their marking at input buffers, which simplifies the marking technique, allowing also a sooner detection of the root of a congestion tree.",
	address = "Piscataway, NJ, USA",
	booktitle = "Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-Based Processing, PDP 2010",
	journal = "Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010)",
	keywords = "multistage interconnection networks;",
	note = "packet marking based mechanisms;multistage interconnection networks;MVCM mechanism;virtual output queuing;scalable early congestion management mechanism;shared buffer;",
	pages = "43 - 50",
	title = "{A} {S}calable and {E}arly {C}ongestion {M}anagement {M}echanism for {MIN}s",
	url = "http://dx.doi.org/10.1109/PDP.2010.36",
	year = 2010
}

Salvador Petit, Rafael Ubal, Julio Sahuquillo and Pedro Lopez. A power-aware hybrid RAM-CAM renaming mechanism for fast recovery. In Computer Design, 2009. ICCD 2009. IEEE International Conference on. 2009, 150 -157. URL, DOI BibTeX

@conference{5413160,
	author = "Petit, Salvador and Ubal, Rafael and Sahuquillo, Julio and Lopez, Pedro",
	abstract = "Modern superscalar processors implement register renaming by using either RAM or CAM tables. The design of these structures should address their access time and misprediction recovery penalty. While direct-mapped RAMs provide faster access times, CAMs are more appropriate to avoid recovery penalties. Although they are more complex and slower, CAMs usually match the processor cycle in current designs. However, they do not scale with the number of physical registers and the pipeline width. In this paper we present a new hybrid RAM-CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM provides the current mappings quickly; on misspeculation, a low-complexity CAM enables immediate recovery and further register renaming. Compared to an ideal CAM in a 4-way state-of-the-art superscalar microprocessor, and for almost the same performance (1% slowdown) and area (95% of the ideal CAM size), the proposed scheme consumes about 90% less dynamic energy.",
	booktitle = "Computer Design, 2009. ICCD 2009. IEEE International Conference on",
	doi = "10.1109/ICCD.2009.5413160",
	issn = "1063-6404",
	keywords = "direct-mapped RAM;misprediction recovery penalty;physical registers;pipeline width;power-aware hybrid RAM-CAM renaming mechanism;processor cycle;register renaming;superscalar processors;microprocessor chips;power aware computing;random-access storage;",
	month = "oct.",
	pages = "150 -157",
	title = "{A} power-aware hybrid {RAM}-{CAM} renaming mechanism for fast recovery",
	url = "http://dx.doi.org/10.1109/ICCD.2009.5413160",
	year = 2009
}

Salvador Petit, Rafael Ubal, Julio Sahuquillo, Pedro Lopez and Jose Duato. An Efficient Low-Complexity Alternative to the ROB for Out-of-Order Retirement of Instructions. In Antonio Nunez; Pedro P Carballo (ed.). Digital System Design, Architectures, Methods and Tools, 2009. DSD '09. 12th Euromicro Conference on. 2009, 635 -642. URL, DOI BibTeX

@conference{5350186,
	author = "Petit, Salvador and Ubal, Rafael and Sahuquillo, Julio and Lopez, Pedro and Duato, Jose",
	abstract = "Current superscalar processors use a reorder buffer (ROB) to support speculation, precise exceptions, and register reclamation. Instructions are retired from this structure in program order, which may lead to significant performance degradation if a long latency operation blocks the ROB head. In this paper, a checkpoint-free out-of-order commit architecture is proposed, which replaces the ROB with a small structure called validation buffer (VB) from which instructions are retired as soon as their speculative state is resolved. An aggressive register reclamation mechanism targeted to this microarchitecture is also devised. Experimental results show that the VB microarchitecture is much more efficient than a ROB-based microprocessor. For example, a 32-entry VB provides similar performance to a 256-entry ROB, while reducing the utilization of other major processor structures.",
	booktitle = "Digital System Design, Architectures, Methods and Tools, 2009. DSD '09. 12th Euromicro Conference on",
	doi = "10.1109/DSD.2009.237",
	editor = "Antonio Nunez; Pedro P. Carballo",
	isbn = "978-0-7695-3782-5",
	keywords = "ROB-based microprocessor;checkpoint-free out-of-order commit architecture;out-of-order instruction retirement;register reclamation;register reclamation mechanism;superscalar reorder buffer processors;validation buffer;buffer circuits;microprocessor chips;",
	month = "aug.",
	pages = "635 -642",
	title = "{A}n {E}fficient {L}ow-{C}omplexity {A}lternative to the {ROB} for {O}ut-of-{O}rder {R}etirement of {I}nstructions",
	url = "http://dx.doi.org/10.1109/DSD.2009.237",
	year = 2009
}

, M Palesi, Jose Flich, S Kumar, Pedro Lopez, R Holsmark and Jose Duato. Region-Based Routing: A Mechanism to Support Efficient Routing Algorithms in NoCs. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 17(3):356 -369, March 2009. URL, DOI BibTeX

@article{4804124,
	author = ", and M. Palesi and Flich, Jose and S. Kumar and Lopez, Pedro and R. Holsmark and Duato, Jose",
	abstract = "An efficient routing algorithm is important for large on-chip networks [network-on-chip (NoC)] to provide the required communication performance to applications. Implementing NoC using table-based switches provides many advantages, including possibility of changing routing algorithms and fault tolerance, due to the option of table reconfigurations. However, table-based switches have been considered unsuitable for NoCs due to their perceived high area and power consumption. In this paper, we describe the region-based routing (RBR) mechanism which groups destinations into network regions allowing an efficient implementation with logic blocks. RBR can also be viewed as a mechanism to reduce the number of entries in routing tables. RBR is general and can be used in conjunction with any adaptive routing algorithm. In particular, we have evaluated the proposed scheme in conjunction with a general routing algorithm, namely segment-based routing (SR) and an application specific routing algorithm (APSRA) using regular and irregular mesh topologies. Our study shows that the number of entries in the table is significantly reduced, especially for large networks. Evaluation results show that RBR requires only four regions to support several routing algorithms in a 2-D mesh with no performance degradation. Considering link failures, our results indicate that RBR combined with SR is able to tolerate up to 7 link failures in an $8 \times 8$ mesh. RBR also reduces area and power dissipation of an equivalent table-based implementation by factors of 8 and 10, respectively. Moreover, the degradation in performance of the network is insignificant when using APSRA combined with RBR.",
	doi = "10.1109/TVLSI.2008.2012010",
	issn = "1063-8210",
	journal = "Very Large Scale Integration (VLSI) Systems, IEEE Transactions on",
	keywords = "adaptive routing algorithm;application specific routing algorithm;fault tolerance;large on-chip networks;network-on-chip;region-based routing mechanism;segment-based routing;table-based switches;network topology;network-on-chip;",
	month = "march",
	number = 3,
	pages = "356 -369",
	title = "{R}egion-{B}ased {R}outing: {A} {M}echanism to {S}upport {E}fficient {R}outing {A}lgorithms in {N}o{C}s",
	url = "http://dx.doi.org/10.1109/TVLSI.2008.2012010",
	volume = 17,
	year = 2009
}

D Ludovici, Francisco Gilabert, S Medardoni, Crispín Gomez, Maria E Gomez, Pedro Lopez, G N Gaydadjiev and D Bertozzi. Assessing fat-tree topologies for regular network-on-chip design under nanoscale technology constraints. In 2009 Design, Automation & Test in Europe Conference & Exhibition (DATE'09). 2009, 4 pp. -. BibTeX

@conference{10730481,
	author = "D. Ludovici and Gilabert, Francisco and S. Medardoni and Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and G.N. Gaydadjiev and D. Bertozzi",
	abstract = "Most of past evaluations of fat-trees for on-chip interconnection networks rely on oversimplifying or even irrealistic architecture and traffic pattern assumptions, and very few layout analyses are available to relieve practical feasibility concerns in nanoscale technologies. This work aims at providing an in-depth assessment of physical synthesis efficiency of fat-trees and at extrapolating silicon-aware performance figures to back-annotate in the system-level performance analysis. A 2D mesh is used as a reference architecture for comparison, and a 65 nm technology is targeted by our study. Finally, in an attempt to mitigate the implementation cost of k-ary n-tree topologies, we also review an alternative unidirectional multi-stage interconnection network which is able to simplify the fat-tree architecture and to minimally impact performance.",
	address = "Piscataway, NJ, USA",
	booktitle = "2009 Design, Automation {\&} Test in Europe Conference {\&} Exhibition (DATE'09)",
	journal = "2009 Design, Automation {\&} Test in Europe Conference {\&} Exhibition (DATE'09)",
	keywords = "extrapolation;integrated circuit interconnections;integrated circuit layout;nanoelectronics;network topology;network-on-chip;",
	note = "fat-tree topology;network-on-chip design;nanoscale technology;on-chip interconnection network;traffic pattern;layout analysis;extrapolation;system-level performance analysis;",
	pages = "4 pp. -",
	title = "{A}ssessing fat-tree topologies for regular network-on-chip design under nanoscale technology constraints",
	year = 2009
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. An Efficient Switching Technique for NoCs with Reduced Buffer Requirements. In Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on. 2008, 713 -720. URL, DOI BibTeX

@conference{4724384,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on chip (NoCs) communicate the components located inside a chip. Overall system performance depends on NoC performance, that is affected by several factors. One of them is the network clock frequency, imposed by the critical path delay. Recent works show that switch critical path includes buffer control logic. Consequently, by removing switch buffers, switch frequency can be doubled. In this paper, we exploit this idea, proposing a new switching technique for NoCs which requires a reduced amount of storage at the switches. It is based on replacing switch port buffers by single latches. By doing so, network cycle can be reduced, which reduces packet latency. On the other hand, power and area consumption requirements can be reduced. However, since there are no buffers at the switch ports, packets can not be stopped. Stopped packets due to contention are dropped and reinjected from their senders via negative acknowledgments. Packet dropping is strongly reduced by exploiting NoCs wiring capability.",
	booktitle = "Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on",
	doi = "10.1109/ICPADS.2008.43",
	issn = "1521-9097",
	keywords = "buffer control logic;critical path delay;network clock frequency;network cycle;networks on chip;packet dropping;reduced buffer requirements;switching technique;network-on-chip;performance evaluation;",
	month = "dec.",
	pages = "713 -720",
	title = "{A}n {E}fficient {S}witching {T}echnique for {N}o{C}s with {R}educed {B}uffer {R}equirements",
	url = "http://dx.doi.org/10.1109/ICPADS.2008.43",
	year = 2008
}

Crispín Gomez, Francisco Gilabert, Maria E Gomez, Pedro Lopez and Jose Duato. Beyond Fat–tree: Unidirectional Load–Balanced Multistage Interconnection Network. Computer Architecture Letters 7(2):49 -52, 2008. URL, DOI BibTeX

@article{4544509,
	author = "Gomez, Crisp{\'i}n and Gilabert, Francisco and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "The fat-tree is one of the most widely-used topologies by interconnection network manufacturers. Recently, it has been demonstrated that a deterministic routing algorithm that optimally balances the network traffic can not only achieve almost the same performance than an adaptive routing algorithm but also outperforms it. On the other hand, fat-trees require a high number of switches with a non-negligible wiring complexity. In this paper, we propose replacing the fat-tree by a unidirectional multistage interconnection network (UMIN) that uses a traffic balancing deterministic routing algorithm. As a consequence, switch hardware is almost reduced to the half, decreasing, in this way, the power consumption, the arbitration complexity, the switch size itself, and the network cost. Preliminary evaluation results show that the UMIN with the load balancing scheme obtains lower latency than fat-tree for low and medium traffic loads. Furthermore, in networks with a high number of stages or with high radix switches, it obtains the same, or even higher, throughput than fat-tree.",
	doi = "10.1109/L-CA.2008.8",
	issn = "1556-6056",
	journal = "Computer Architecture Letters",
	keywords = "adaptive routing algorithm;interconnection network manufacturers;network traffic;nonnegligible wiring complexity;power consumption;radix switches;traffic balancing deterministic routing algorithm;unidirectional load-balanced multistage interconnection net",
	month = "july-dec.",
	number = 2,
	pages = "49 -52",
	title = "{B}eyond {F}at--tree: {U}nidirectional {L}oad--{B}alanced {M}ultistage {I}nterconnection {N}etwork",
	url = "http://dx.doi.org/10.1109/L-CA.2008.8",
	volume = 7,
	year = 2008
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. Exploiting Wiring Resources on Interconnection Network: Increasing Path Diversity. In Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on. 2008, 20 -29. URL, DOI BibTeX

@conference{4457100,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "On-chip networks are the answer to the growing demands for high communication performance of chip multiprocessors. These networks have a number of characteristics that make their design quite different to off-chip networks. In particular, wires are an abundant available resource inside the chip. In this paper, we explore how to organize the huge wiring capabilities available in on-chip networks. In particular, we analyze the option of distributing the wires among several parallel links connecting the same two switches. This technique is known as Space Division Multiplexing (SDM). The number of parallel sub-links and their width are two key parameters that are studied together with the relationship with the mean packet size. The paper shows that SDM is a technique to take into account in on-chip networks since it allows to highly increase the network accepted traffic at the expense of a small latency increase or even no increase. Moreover, in some networks, it allows to reduce the network hardware, providing similar performance results, which results in a reduction in the consumption of area and power.",
	booktitle = "Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on",
	doi = "10.1109/PDP.2008.33",
	isbn = "978-0-7695-3089-5",
	issn = "1066-6192",
	keywords = "chip multiprocessors;interconnection network;mean packet size;on-chip networks;parallel links;path diversity;space division multiplexing;wiring capabilities;wiring resources;multiprocessor interconnection networks;space division multiplexing;wiring;",
	month = "feb.",
	pages = "20 -29",
	title = "{E}xploiting {W}iring {R}esources on {I}nterconnection {N}etwork: {I}ncreasing {P}ath {D}iversity",
	url = "http://dx.doi.org/10.1109/PDP.2008.33",
	year = 2008
}

Crispín Gomez, Francisco Gilabert, Maria E Gomez, Pedro Lopez and Jose Duato. RUFT: Simplifying the fat-tree topology. In Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on. 2008, 153 - 160. URL BibTeX

@conference{20090911931135,
	author = "Gomez, Crisp{\'i}n and Gilabert, Francisco and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "The fat-tree is one of the most widely-used topologies by interconnection network manufacturers. Recently, a deterministic routing algorithm that optimally balances the network traffic in fat-trees was proposed. It can not only achieve almost the same performance than adaptive routing, but also outperforms it for some traffic patterns. Nevertheless, fat-trees require a high number of switches with a non-negligible wiring complexity. In this paper, we propose replacing the fat-tree by a unidirectional multistage interconnection network referred to as Reduced Unidirectional Fat-tree (RUFT) that uses a simplified version of the aforementioned deterministic routing algorithm. As a consequence, switch hardware is almost reduced to the half, decreasing, in this way, power consumption, arbitration complexity, switch size, and network cost. Evaluation results show that RUFT obtains lower latency than fat-tree for low and medium traffic loads. Furthermore, in large networks, it obtains almost the same throughput than the classical fat-tree.",
	address = "Melbourne, VIC, Australia",
	booktitle = "Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on",
	issn = "1521-9097",
	journal = "Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS",
	key = "Trees (mathematics)",
	keywords = "Interconnection networks;Internet;Routing algorithms;Switches;Switching circuits;",
	note = "Adaptive routing;Deterministic routing algorithms;Evaluation results;Large networks;Multi-stage interconnection networks;Network costs;Network traffics;Number of switches;Power consumption;Switch sizes;Traffic loads;Traffic patterns;Tree topologies;",
	pages = "153 - 160",
	title = "{RUFT}: {S}implifying the fat-tree topology",
	url = "http://dx.doi.org/10.1109/ICPADS.2008.44",
	year = 2008
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. Reducing packet dropping in a bufferless NoC. In Euro-Par 2008 – Parallel Processing. 2008, 899 - 909. URL BibTeX

@conference{10528093,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on chip (NoCs) have a strong impact on overall chip performance. Interconnection bandwidth is limited by the critical path delay. Recent works show that the critical path includes the switch input buffer control logic. As a consequence, by removing buffers, switch clock frequency can be doubled. Recently, a new switching technique for NoCs called blind packet switching (BPS) has been proposed. It is based on replacing the buffers of the switch ports by simple latches. Since buffers consume a high percentage of switch power and area, BPS not only improves performance but also helps in reducing power and area. In BPS there are no buffers at the switch ports, so packets can not be stopped. If the required output port is busy, the packet will be dropped. In order to prevent packet dropping, some techniques based on resource replication have been proposed. In this paper, we propose some alternative and complementary techniques that do not rely on resource replication. By using these techniques, packet dropping and its negative effects are highly reduced. In particular, packet dropping is completely removed for a very wide network traffic range. The first dropped packet appears at a 11.6 higher traffic load. As a consequence, network throughput is increased and the packet latency is kept almost constant.",
	address = "Berlin, Germany",
	booktitle = "Euro-Par 2008 – Parallel Processing",
	journal = "Euro-Par 2008 Parallel Processing. 14th International Euro-Par Conference",
	keywords = "delays;multiprocessor interconnection networks;network-on-chip;packet switching;",
	note = "bufferless NoC;networks on chip;interconnection bandwidth;buffer control logic;blind packet switching;resource replication;packet dropping;network traffic load;critical path delay;",
	pages = "899 - 909",
	title = "{R}educing packet dropping in a bufferless {N}o{C}",
	url = "http://dx.doi.org/10.1007/978-3-540-85451-7_97",
	year = 2008
}

Joan-Lluis Ferrer, Elvira Baydal, Antonio Robles, Pedro Lopez and Jose Duato. On the influence of the packet marking and injection control schemes in congestion management for MINs. 2008, 930 - 939. URL BibTeX

@conference{10528096,
	author = "Ferrer, Joan-Lluis and Baydal, Elvira and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = "Several Congestion Management Mechanisms (CMMs) have been proposed for Multistage Interconnection Networks (MINs) in order to avoid the degradation of network performance when congestion appears. Most of them are based on Explicit Congestion Notification (ECN). For this purpose, switches detect congestion and, depending on the applied mechanism, some flags are marked to warn the source hosts. In response, source hosts apply corrective actions to adjust their packet injection rate. These mechanisms have been evaluated by analyzing whether they are able to manage a congestion situation but there is not a comparison study among them. Moreover, marking effects are not separately analyzed from corrective actions. In this paper, we analyze the current proposals for CMMs, showing the impact of the applied packet marking techniques as well as the corrective actions they apply.",
	address = "Berlin, Germany",
	journal = "Euro-Par 2008 Parallel Processing. 14th International Euro-Par Conference",
	keywords = "multistage interconnection networks;packet switching;telecommunication congestion control;",
	note = "packet marking;injection control schemes;congestion management mechanisms;multistage interconnection networks;explicit congestion notification;message throttling;",
	pages = "930 - 939",
	title = "{O}n the influence of the packet marking and injection control schemes in congestion management for {MIN}s",
	url = "http://dx.doi.org/10.1007/978-3-540-85451-7_100",
	year = 2008
}

Francisco Gilabert, S Medardoni, D Bertozzi, L Benini, Maria E Gomez, Pedro Lopez and Jose Duato. Exploring high-dimensional topologies for NoC design through an integrated analysis and synthesis framework. 2008, 107 - 116. BibTeX

@conference{9940710,
	author = "Gilabert, Francisco and S. Medardoni and D. Bertozzi and L. Benini and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks-on-chip (NoCs) address the challenge to provide scalable communication bandwidth to tiled architectures in a power-efficient fashion. The 2-D mesh is currently the most popular regular topology used for on-chip networks in tile-based architectures, because it perfectly matches the 2-D silicon surface and is easy to implement. However, a number of limitations have been proved in the open literature, especially for long distance traffic. Two relevant variants of 2-D meshes are explored in this paper: high-dimensional and concentrated topologies. The novelty of our exploration framework includes the use of fast and accurate transaction level simulation to provide constraints to the physical synthesis flow, which is integrated with standard industrial toolchains for accurate physical implementation. Interestingly, this work illustrates how effectively the compared topologies can handle synchronization-intensive traffic patterns and accounts for chip I/O interfaces.",
	address = "Piscataway, NJ, USA",
	journal = "2008 2nd ACM/IEEE International Symposium on Networks-on-Chip (NOCS '08)",
	keywords = "integrated circuit design;logic design;network topology;network-on-chip;",
	note = "NoC design;networks-on-chip;2D mesh topology;on-chip networks;tile-based architectures;industrial toolchains;chip I/O interfaces;",
	pages = "107 - 116",
	title = "{E}xploring high-dimensional topologies for {N}o{C} design through an integrated analysis and synthesis framework",
	year = 2008
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. Exploiting wiring resources on interconnection network: Increasing path diversity. In Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on. 2008, 20 - 29. URL BibTeX

@conference{20083011395413,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "On-chip networks are the answer to the growing demands for high communication performance of chip multiprocessors. These networks have a number of characteristics that make their design quite different to off-chip networks. In particular, wires are an abundant available resource inside the chip. In this paper, we explore how to organize the huge wiring capabilities available in on-chip networks. In particular, we analyze the option of distributing the wires among several parallel links connecting the same two switches. This technique is known as Space Division Multiplexing (SDM). The number of parallel sub-links and their width are two key parameters that are studied together with the relationship with the mean packet size. The paper shows that SDM is a technique to take into account in on-chip networks since it allows to highly increase the network accepted traffic at the expense of a small latency increase or even no increase. Moreover, in some networks, it allows to reduce the network hardware, providing similar performance results, which results in a reduction in the consumption of area and power. © 2008 IEEE.",
	address = "Toulouse, France",
	booktitle = "Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on",
	journal = "Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, PDP 2008",
	key = "Space division multiple access",
	keywords = "Electric network topology;Internet;Telecommunication;Wire;",
	note = "Chip multi processor (CMP);Communication performances;Key parameters;Latency increase;Off chip;On Chip Network (OCN);Packet size (PS);Parallel links;Path diversity;Performance results;Space division multiplexing (SDM);",
	pages = "20 - 29",
	title = "{E}xploiting wiring resources on interconnection network: {I}ncreasing path diversity",
	url = "http://dx.doi.org/10.1109/PDP.2008.33",
	year = 2008
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. An efficient switching technique for NoCs with reduced buffer requirements. In Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on. 2008, 713 - 720. URL BibTeX

@conference{10428505,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Networks on chip (NoCs) communicate the components located inside a chip. Overall system performance depends on NoC performance, that is affected by several factors. One of them is the network clock frequency, imposed by the critical path delay. Recent works show that switch critical path includes buffer control logic. Consequently, by removing switch buffers, switch frequency can be doubled. In this paper, we exploit this idea, proposing a new switching technique for NoCs which requires a reduced amount of storage at the switches. It is based on replacing switch port buffers by single latches. By doing so, network cycle can be reduced, which reduces packet latency. On the other hand, power and area consumption requirements can be reduced. However, since there are no buffers at the switch ports, packets can not be stopped. Stopped packets due to contention are dropped and reinjected from their senders via negative acknowledgments. Packet dropping is strongly reduced by exploiting NoCs wiring capability.",
	address = "Piscataway, NJ, USA",
	booktitle = "Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on",
	journal = "Proceedings of the Fourteenth International Conference on Parallel and Distributed Systems",
	keywords = "network-on-chip;performance evaluation;",
	note = "switching technique;reduced buffer requirements;networks on chip;network clock frequency;critical path delay;buffer control logic;network cycle;packet dropping;",
	pages = "713 - 720",
	title = "{A}n efficient switching technique for {N}o{C}s with reduced buffer requirements",
	url = "http://dx.doi.org/10.1109/ICPADS.2008.43",
	year = 2008
}

Scott Pakin, Craig Stunkel, Jose Flich, Francisco Alfaro, Gheorghe Almasi, Angelos Bilas, Ron Brightwell, Darius Buntinas, Wu-Chun Feng, Mitchell Gusat, Nectarios Koziris, Pedro Lopez, Andrew Lumsdaine, Jarek Nieplocha, Greg Pfister, Jamie Riotto, Vikram Saletore, Evan Speight, Pete Wyckoff, D K Panda, Jose Duato and Mazin Yousif. Workshop 9 Introduction: The Workshop on Communication Architecture for Clusters - CAC 2008. IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, 2008. URL BibTeX

@article{20083711535136,
	author = "Scott Pakin and Craig Stunkel and Flich, Jose and Francisco Alfaro and Gheorghe Almasi and Angelos Bilas and Ron Brightwell and Darius Buntinas and Wu-Chun Feng and Mitchell Gusat and Nectarios Koziris and Lopez, Pedro and Andrew Lumsdaine and Jarek Nieplocha and Greg Pfister and Jamie Riotto and Vikram Saletore and Evan Speight and Pete Wyckoff and D.K. Panda and Duato, Jose and Mazin Yousif",
	abstract = "No abstract available",
	address = "Miami, FL, United States",
	journal = "IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM",
	publisher = "IEEE Computer Society",
	title = "{W}orkshop 9 {I}ntroduction: {T}he {W}orkshop on {C}ommunication {A}rchitecture for {C}lusters - {CAC} 2008",
	url = "http://dx.doi.org/10.1109/IPDPS.2008.4536118",
	year = 2008
}

Crispin Gomez Requena, Francisco Gilabert Villamon, Maria E Gomez, Pedro Lopez and Jose Duato. Beyond fat-tree: Unidirectional load-balanced multistage interconnection network. IEEE Computer Architecture Letters 7(2):49 - 52, 2008. URL BibTeX

@article{20090211850984,
	author = "Crispin Gomez Requena and Francisco Gilabert Villamon and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "The fat-tree is one of the most widely-used topologies by interconnection network manufacturers. Recently, it has been demonstrated that a deterministic routing algorithm that optimally balances the network traffic can not only achieve almost the same performance than an adaptive routing algorithm but also outperforms it. On the other hand, fat-trees require a high number of switches with a non-negligible wiring complexity. In this paper, we propose replacing the fat-tree by a unidirectional multistage interconnection network (UMIN) that uses a traffic balancing deterministic routing algorithm. As a consequence, switch hardware is almost reduced to the half, decreasing, in this way, the power consumption, the arbitration complexity, the switch size itself, and the network cost. Preliminary evaluation results show that the UMIN with the load balancing scheme obtains lower latency than fat-tree for low and medium traffic loads. Furthermore, in networks with a high number of stages or with high radix switches, it obtains the same, or even higher, throughput than fat-tree. © 2006 IEEE.",
	address = "3 Park Avenue, 17th Floor, New York, NY 10016-5997, United States",
	issn = "1556-6056",
	journal = "IEEE Computer Architecture Letters",
	key = "Computer networks",
	keywords = "Adaptive algorithms;Interconnection networks;Internet;Metropolitan area networks;Routing algorithms;Switches;Switching circuits;Telecommunication networks;Trees;",
	note = "Butterfly network;Deterministic routing;Fat-trees;Multistage Interconnection networks;Traffic balancing;",
	number = 2,
	pages = "49 - 52",
	title = "{B}eyond fat-tree: {U}nidirectional load-balanced multistage interconnection network",
	url = "http://dx.doi.org/10.1109/L-CA.2008.8",
	volume = 7,
	year = 2008
}

Rafael Ubal, Julio Sahuquillo, Salvador Petit and Pedro Lopez. Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors. In Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on. 2007, 62 -68. URL, DOI BibTeX

@conference{4384043,
	author = "Ubal, Rafael and Sahuquillo, Julio and Petit, Salvador and Lopez, Pedro",
	abstract = "Current microprocessors are based on complex designs, integrating different components on a single chip, such as hardware threads, processor cores, memory hierarchy or interconnection networks. The permanent need of evaluating new designs on each of these components motivates the development of tools which simulate the system working as a whole. In this paper, we present the Multi2Sim simulation framework, which models the major components of incoming systems, and is intended to cover the limitations of existing simulators. A set of simulation examples is also included for illustrative purposes.",
	booktitle = "Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on",
	doi = "10.1109/SBAC-PAD.2007.17",
	issn = "1550-6533",
	keywords = "Multi2Sim;hardware threads;interconnection networks;memory hierarchy;microprocessors;multicore-multithreaded processors;processor cores;multi-threading;multiprocessor interconnection networks;",
	month = "oct.",
	pages = "62 -68",
	title = "{M}ulti2{S}im: {A} {S}imulation {F}ramework to {E}valuate {M}ulticore-{M}ultithreaded {P}rocessors",
	url = "http://dx.doi.org/10.1109/SBAC-PAD.2007.17",
	year = 2007
}

Rafael Ubal, Julio Sahuquillo, Salvador Petit, H Hassan and Pedro Lopez. Leakage Current Reduction in Data Caches on Embedded Systems. In Intelligent Pervasive Computing, 2007. IPC. The 2007 International Conference on. 2007, 45 -50. URL, DOI BibTeX

@conference{4438392,
	author = "Ubal, Rafael and Sahuquillo, Julio and Petit, Salvador and H. Hassan and Lopez, Pedro",
	abstract = "Nowadays, embedded systems can be found in a wide range of pervasive devices (e.g., smart phones, PDAs, or video/digital cameras). These devices contain large cache memories, whose power consumption can reach about 50% of the total spent energy, from which leakage energy is the predominant fraction in current technologies. This paper proposes a technique to reduce leakage energy consumption in data caches on embedded systems, which is based on the fact that most stored bits take a logical value of zero. The proposed technique has been evaluated on a model of a contemporary high-end embedded microprocessor, namely the ARM Cortex A8 processor, executing a set of standard embedded benchmarks. Experimental results show that leakage energy savings reach about 40% with no IPC loss.",
	booktitle = "Intelligent Pervasive Computing, 2007. IPC. The 2007 International Conference on",
	doi = "10.1109/IPC.2007.95",
	keywords = "ARM Cortex A8 processor;cache memories;data caches;high-end embedded microprocessor;leakage energy consumption reduction;pervasive devices;cache storage;microprocessor chips;power consumption;ubiquitous computing;",
	month = "oct.",
	pages = "45 -50",
	title = "{L}eakage {C}urrent {R}eduction in {D}ata {C}aches on {E}mbedded {S}ystems",
	url = "http://dx.doi.org/10.1109/IPC.2007.95",
	year = 2007
}

Rafael Ubal, Julio Sahuquillo, Salvador Petit, Pedro Lopez and Jose Duato. VB-MT: Design Issues and Performance of the Validation Buffer Microarchitecture for Multithreaded Processors. In Parallel Architecture and Compilation Techniques, 2007. PACT 2007. 16th International Conference on. 2007, 429 -429. URL, DOI BibTeX

@conference{4336257,
	author = "Ubal, Rafael and Sahuquillo, Julio and Petit, Salvador and Lopez, Pedro and Duato, Jose",
	abstract = "The validation buffer (VB) Microarchitecture retires instructions out of order, by substituting the classical ROB by the VB structure. The VB removes the negative effect of long latency instructions located at the ROB head, which prevent other instructions from retiring and cause frequent pipeline stalls due to lack of space in the ROB. This work analyzes different multithreading models (coarse grain, fine grain and simultaneous multithreading) and a set of different instruction fetch policies.",
	booktitle = "Parallel Architecture and Compilation Techniques, 2007. PACT 2007. 16th International Conference on",
	doi = "10.1109/PACT.2007.4336257",
	issn = "1089-795X",
	keywords = "ROB head;VB structure;instruction fetch policies;multithreaded processors;validation buffer microarchitecture;buffer storage;multi-threading;parallel architectures;storage allocation;",
	month = "sept.",
	pages = "429 -429",
	title = "{VB}-{MT}: {D}esign {I}ssues and {P}erformance of the {V}alidation {B}uffer {M}icroarchitecture for {M}ultithreaded {P}rocessors",
	url = "http://dx.doi.org/10.1109/PACT.2007.4336257",
	year = 2007
}

Crispín Gomez, Francisco Gilabert, Maria E Gomez, Pedro Lopez and Jose Duato. Deterministic versus Adaptive Routing in Fat-Trees. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International. March 2007, 1 -8. URL, DOI BibTeX

@conference{4228210,
	author = "Gomez, Crisp{\'i}n and Gilabert, Francisco and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Clusters of PCs have become very popular to build high performance computers. These machines use commodity PCs linked by a high speed interconnect. Routing is one of the most important design issues of interconnection networks. Adaptive routing usually better balances network traffic, thus allowing the network to obtain a higher throughput. However, adaptive routing introduces out-of-order packet delivery, which is unacceptable for some applications. Concerning topology, most of the commercially available interconnects are based on fat-tree. Fat-trees offer a rich connectivity among nodes, making possible to obtain paths between all source-destination pairs that do not share any link. We exploit this idea to propose a deterministic routing algorithm for fat-trees, comparing it with adaptive routing in several workloads. The results show that deterministic routing can achieve a similar, and in some scenarios higher, level of performance than adaptive routing, while providing in-order packet delivery.",
	booktitle = "Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International",
	doi = "10.1109/IPDPS.2007.370482",
	isbn = "1-4244-0910-1",
	keywords = "PC clusters;adaptive routing;deterministic routing algorithm;fat-tree topology;interconnection networks;packet delivery;multistage interconnection networks;telecommunication network routing;telecommunication network topology;telecommunication traffic;tree",
	month = "march",
	pages = "1 -8",
	title = "{D}eterministic versus {A}daptive {R}outing in {F}at-{T}rees",
	url = "http://dx.doi.org/10.1109/IPDPS.2007.370482",
	year = 2007
}

Crispín Gomez, Francisco Gilabert, Maria E Gomez, Pedro Lopez and Jose Duato. Deterministic versus adaptive routing in fat-trees. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International. 2007, 1 - 8. URL, DOI BibTeX

@conference{9516533,
	author = "Gomez, Crisp{\'i}n and Gilabert, Francisco and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Clusters of PCs have become very popular to build high performance computers. These machines use commodity PCs linked by a high speed interconnect. Routing is one of the most important design issues of interconnection networks. Adaptive routing usually better balances network traffic, thus allowing the network to obtain a higher throughput. However, adaptive routing introduces out-of-order packet delivery, which is unacceptable for some applications. Concerning topology, most of the commercially available interconnects are based on fat-tree. Fat-trees offer a rich connectivity among nodes, making possible to obtain paths between all source-destination pairs that do not share any link. We exploit this idea to propose a deterministic routing algorithm for fat-trees, comparing it with adaptive routing in several workloads. The results show that deterministic routing can achieve a similar, and in some scenarios higher, level of performance than adaptive routing, while providing in-order packet delivery.",
	booktitle = "Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International",
	doi = "10.1109/IPDPS.2007.370482",
	isbn = "1-4244-0910-1",
	journal = "2007 IEEE International Parallel and Distributed Processing Symposium (IEEE Cat. No.07TH8938)",
	keywords = "multistage interconnection networks;telecommunication network routing;telecommunication network topology;telecommunication traffic;trees;",
	month = "Mar.",
	note = "adaptive routing;fat-tree topology;PC clusters;interconnection networks;packet delivery;deterministic routing algorithm;",
	pages = "1 - 8",
	publisher = "IEEE Computer Society",
	title = "{D}eterministic versus adaptive routing in fat-trees",
	url = "http://dx.doi.org/10.1109/IPDPS.2007.370482",
	year = 2007
}

Marina Alonso, Salvador Coll, Vicente Santonja, Juan Miguel Martínez, Pedro Lopez and Jose Duato. Power-aware fat-tree networks using on/off links. In R Perrott, BM Chapman, J Subhlok, RF DeMello and LT Yang (eds.). HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, PROCEEDINGS 4782. 2007, 472-483. BibTeX

@conference{ISI:000250940200040,
	author = "Alonso, Marina and Coll, Salvador and Santonja, Vicente and Mart{\'i}nez, Juan Miguel and Lopez, Pedro and Duato, Jose",
	abstract = "Nowadays, power consumption reduction techniques are being increasingly used in computer systems, and high-performance computing systems are not an exception. In particular, the power consumed by the interconnect circuitry has a non-negligible contribution to the total system budget. In this scenario, fat-tree interconnection networks are one of the most popular topologies. This topology is particularly well-suited for applying power consumption reduction techniques since it provides multiple alternative paths for each source/destination pair. In this paper, we present a mechanism that dynamically adjusts the available network bandwidth by switching links on and off, according to the traffic requirements. This mechanism provides significant reduction in power consumption while maintaining the original underlying routing algorithm, at the expense of slight latency increase for low loads.",
	booktitle = "HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, PROCEEDINGS",
	editor = "Perrott, R and Chapman, BM and Subhlok, J and DeMello, RF and Yang, LT",
	isbn = "978-3-540-75443-5",
	issn = "0302-9743",
	note = "3rd International Conference on High Performance Computing and Communications (HPCC 2007), Houston, TX, SEP 26-28, 2007",
	pages = "472-483",
	series = "LECTURE NOTES IN COMPUTER SCIENCE",
	title = "{P}ower-aware fat-tree networks using on/off links",
	volume = 4782,
	year = 2007
}

Joan-Lluis Ferrer, Elvira Baydal, Antonio Robles, Pedro Lopez and Jose Duato. Congestion management in MINs through marked validated packets. 2007, 260 - 267. BibTeX

@conference{10266202,
	author = "Ferrer, Joan-Lluis and Baydal, Elvira and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = {Congestion management is a very critical problem tackled in interconnection networks for years but not solved yet. Although several mechanisms have been recently proposed for lossless multistage interconnection networks (MINs), they either have drawbacks or are partial solutions. Some of them introduce penalty over packets not really addressed to the hot-spots, whereas others can cope only with congestion situations that last a short time. In this paper, we propose an effective and efficient congestion management mechanism for lossless interconnection networks based on explicit congestion notification. The mechanism uses two different flags in ACK packets, a Marking Bit (MB) and a Validation Bit (VB), to detect congestion and warn the origin hosts. In this way, packets belonging to "coldflows" but stopped because of head-of-line (HOL) blocking can be distinguished from "hotflow" packets which are really causing congestion. In response, origin hosts can apply corrective actions only to the "hotflows", minimizing the negative impact on "coldflows" performance. Evaluation results show that the proposed congestion management strategy is able to avoid the degradation of network performance, regardless of traffic load and the location of the congestion in the network.},
	address = "Piscataway, NJ, USA",
	journal = "15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (PDP'07)",
	keywords = "multistage interconnection networks;",
	note = "congestion management;lossless multistage interconnection network;validated packet;marked packet;ACK packet;head-of-line blocking;marking bit;validation bit;",
	pages = "260 - 7",
	title = "{C}ongestion management in {MIN}s through marked validated packets",
	year = 2007
}

Jose Flich, , Pedro Lopez and Jose Duato. Region-Based Routing: An Efficient Routing Mechanism to Tackle Unreliable Hardware in Network on Chips. In Networks-on-Chip, 2007. NOCS 2007. First International Symposium on. 2007, 183 -194. URL, DOI BibTeX

@conference{4209007,
	author = "Flich, Jose and , and Lopez, Pedro and Duato, Jose",
	abstract = "The design of scalable and reliable interconnection networks for system on chips (SoCs) introduces new design constraints not present in current multicomputer systems. Although regular topologies are preferred for building NoCs, heterogeneous blocks, fabrication faults and reliability issues derived from the high integration scale may lead to irregular topologies. In this situation, efficient routing becomes a challenge. Although table-based routing allows the use of most routing algorithms on any topology, it does not scale in terms of latency and area. In this paper we propose the region-based routing mechanism that avoids the scalability problems of table-based solutions. From an initial topology and routing algorithm, the mechanism groups, at every switch, destinations into different regions based on the output ports. By doing this, redundant routing information typically found in routing tables is eliminated. Evaluation results show that the mechanism requires only four regions to support several routing algorithms in a 2D mesh with no performance degradation. Moreover, when dealing with link failures, our results indicate that the mechanism combined with the segment-based routing algorithm is able to pack all the routing information into eight regions providing high throughput. The paper also provides a simple and efficient hardware implementation of the mechanism requiring only 240 logic gates per switch to support eight regions in a 2D mesh topology.",
	booktitle = "Networks-on-Chip, 2007. NOCS 2007. First International Symposium on",
	doi = "10.1109/NOCS.2007.39",
	keywords = "2D mesh topology;interconnection networks;multicomputer systems;network on chips;region-based routing;segment-based routing algorithm;system on chips;table-based routing;integrated circuit interconnections;logic design;microprocessor chips;network routing",
	month = "7-9",
	pages = "183 -194",
	title = "{R}egion-{B}ased {R}outing: {A}n {E}fficient {R}outing {M}echanism to {T}ackle {U}nreliable {H}ardware in {N}etwork on {C}hips",
	url = "http://dx.doi.org/10.1109/NOCS.2007.39",
	year = 2007
}

Crispín Gomez, Maria E Gomez, Pedro Lopez and Jose Duato. An efficient fault-tolerant routing methodology for fat-tree interconnection networks*. 2007, 509 - 22. BibTeX

@conference{9683889,
	author = "Gomez, Crisp{\'i}n and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "In large cluster-based machines, fault-tolerance in the interconnection network is an issue of growing importance, since their increasing size raises the probability of failure. The topology used in these machines is usually a fat-tree. This paper proposes a new distributed fault-tolerant routing methodology for fat-trees. It does not require additional network hardware. It is scalable, since the required memory, switch hardware and routing delay do not depend on the network size. The methodology is based on enhancing the interval routing scheme with exclusion intervals. Exclusion intervals are associated to each switch output port, and represent the set of nodes that are unreachable from this port after a failure appears. We propose a mechanism to identify the exclusion intervals that must be updated after detecting a failure, and the values to write on them. Our methodology is able to support a relatively high number of network failures with a low degradation in network performance.",
	address = "Berlin, Germany",
	journal = "Parallel and Distributed Processing and Applications. Proceedings 5th International Symposium, ISPA 2007. (Lecture Notes in Computer Science vol. 4742)",
	keywords = "failure analysis;fault tolerant computing;multiprocessor interconnection networks;network routing;network topology;probability;trees;",
	note = "distributed fault-tolerant routing methodology;fat-tree interconnection networks;large cluster-based machines;failure probability;interval routing scheme;switch output port;",
	pages = "509 - 22",
	title = "{A}n efficient fault-tolerant routing methodology for fat-tree interconnection networks*",
	year = 2007
}

Maria E Gomez, N A Nordbotten, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, T Skeie and O Lysne. A routing methodology for achieving fault tolerance in direct networks. Computers, IEEE Transactions on 55(4):400 - 415, April 2006. URL, DOI BibTeX

@article{1608003,
author = "Gomez, Maria E. and N.A. Nordbotten and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and T. Skeie and O. Lysne",
abstract = "Massively parallel computing systems are being built with thousands of nodes. The interconnection network plays a key role for the performance of such systems. However, the high number of components significantly increases the probability of failure. Additionally, failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism to keep the system running, even in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. In order to avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. However, there are scenarios where the faults cannot be avoided solely by using an intermediate node. Thus, we also provide some extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis. We also propose the use of more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance.",
doi = "10.1109/TC.2006.46",
issn = "0018-9340",
journal = "Computers, IEEE Transactions on",
keywords = "adaptive routing; checkpoint-restart mechanism; direct networks; fault-tolerant routing methodology; interconnection network; parallel computing system; fault tolerant computing; multiprocessor interconnection networks; network routing; parallel processi",
month = "april",
number = 4,
pages = "400 - 415",
title = "{A} routing methodology for achieving fault tolerance in direct networks",
url = "http://dx.doi.org/10.1109/TC.2006.46",
volume = 55,
year = 2006
}

Marina Alonso, Salvador Coll, Jose Maria Martínez, Vicente Santonja, Pedro Lopez and Jose Duato. Dynamic power saving in fat-tree interconnection networks using on/off links. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International. April 2006, 8 pp.. URL, DOI BibTeX

@conference{1639599,
	author = "Alonso, Marina and Coll, Salvador and Mart{\'i}nez, Jose Maria and Santonja, Vicente and Lopez, Pedro and Duato, Jose",
	abstract = "Current trends in high-performance parallel computers show that fat-tree interconnection networks are one of the most popular topologies. The particular characteristics of this topology, that provide multiple alternative paths for each source/destination pair, make it an excellent candidate for applying power consumption reduction techniques. Such techniques are being increasingly applied in computer systems and the interconnection network is not an exception, since its contribution to the system power budget is not negligible. In this paper, we present a mechanism that dynamically switches on and off network links as a function of traffic. The mechanism is designed to guarantee network connectivity, according to the underlying routing algorithm. In this way, the default routing algorithm can be used regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that significant network power consumption reductions can be obtained at no cost. Latency remains the same although the number of operating network links is dynamically adjusted.",
	booktitle = "Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International",
	doi = "10.1109/IPDPS.2006.1639599",
	isbn = "0-7695-0990-8",
	keywords = "dynamic power saving; fat-tree interconnection networks; high-performance parallel computers; network power consumption reduction; on-off links; routing algorithm; energy conservation; multiprocessor interconnection networks; parallel processing;",
	month = "april",
	pages = "8 pp.",
	title = "{D}ynamic power saving in fat-tree interconnection networks using on/off links",
	url = "http://dx.doi.org/10.1109/IPDPS.2006.1639599",
	year = 2006
}

Marina Alonso, Salvador Coll, Juan Miguel Martínez, Vicente Santonja, Pedro Lopez and Jose Duato. Dynamic power saving in fat-tree interconnection networks using on/off links. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International. 2006, 8 pp. -. URL, DOI BibTeX

@conference{8978456,
	author = "Alonso, Marina and Coll, Salvador and Mart{\'i}nez, Juan Miguel and Santonja, Vicente and Lopez, Pedro and Duato, Jose",
	abstract = "Current trends in high-performance parallel computers show that fat-tree interconnection networks are one of the most popular topologies. The particular characteristics of this topology, that provide multiple alternative paths for each source/destination pair, make it an excellent candidate for applying power consumption reduction techniques. Such techniques are being increasingly applied in computer systems and the interconnection network is not an exception, since its contribution to the system power budget is not negligible. In this paper, we present a mechanism that dynamically switches on and off network links as a function of traffic. The mechanism is designed to guarantee network connectivity, according to the underlying routing algorithm. In this way, the default routing algorithm can be used regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that significant network power consumption reductions can be obtained at no cost. Latency remains the same although the number of operating network links is dynamically adjusted",
	booktitle = "Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International",
	doi = "10.1109/IPDPS.2006.1639599",
	isbn = "1-4244-0054-6",
	journal = "Proceedings. 20th International Parallel and Distributed Processing Symposium (IEEE Cat. No.06TH8860)",
	keywords = "energy conservation;multiprocessor interconnection networks;parallel processing;trees;",
	month = "Apr.",
	note = "dynamic power saving;fat-tree interconnection networks;on/off links;high-performance parallel computers;routing algorithm;network power consumption reduction;",
	pages = "8 pp. -",
	publisher = "IEEE Computer Society",
	title = "{D}ynamic power saving in fat-tree interconnection networks using on/off links",
	url = "http://dx.doi.org/10.1109/IPDPS.2006.1639599",
	year = 2006
}

Gaspar Mora, Jose Flich, Jose Duato, Pedro Lopez, Elvira Baydal and O Lysne. Towards an efficient switch architecture for high-radix switches. 2006, 11 - 20. URL, DOI BibTeX

@conference{10091275,
	author = "Mora, Gaspar and Flich, Jose and Duato, Jose and Lopez, Pedro and Baydal, Elvira and O. Lysne",
	abstract = "The interconnection network plays a key role in the overall performance achieved by high performance computing systems, also contributing an increasing fraction of its cost and power consumption. Current trends in interconnection network technology suggest that high-radix switches will be preferred as networks will become smaller (in terms of switch count) with the associated savings in packet latency, cost, and power consumption. Unfortunately, current switch architectures have scalability problems that prevent them from being effective when implemented with a high number of ports. In this paper, an efficient and cost-effective architecture for high-radix switches is proposed. The architecture, referred to as partitioned crossbar input queued (PCIQ), relies on three key components: a partitioned crossbar organization that allows the use of simple arbiters and crossbars, a packet-based arbiter, and a mechanism to eliminate the switch-level HOL blocking. Under uniform traffic, maximum switch efficiency is achieved. Furthermore, switch-level HOL blocking is completely eliminated under hot-spot traffic, again delivering maximum throughput. Additionally, PCIQ inherently implements an efficient congestion management technique that eliminates all the network-wide HOL blocking. On the contrary, the previously proposed architectures either show poor performance or they require significantly higher costs than PCIQ (in both components and complexity).",
	address = "Piscataway, NJ, USA",
	doi = "10.1109/ANCS.2006.4579519",
	journal = "ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS 2006)",
	keywords = "multistage interconnection networks;",
	note = "high-radix switch architecture;interconnection network;power consumption;partitioned crossbar input queued;switch-level head-of-line block elimination;congestion management technique;",
	pages = "11 - 20",
	title = "{T}owards an efficient switch architecture for high-radix switches",
	url = "http://dx.doi.org/10.1109/ANCS.2006.4579519",
	year = 2006
}

Francisco Gilabert, Maria E Gomez, Pedro Lopez and Jose Duato. On the influence of the selection function on the performance of fat-trees. 2006, 864 - 73. BibTeX

@conference{9112992,
	author = "Gilabert, Francisco and Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Fat-tree topology has become very popular among switch manufacturers. Routing in fat-trees is composed of two phases, an adaptive upwards phase, and a deterministic downwards phase. The unique downwards path to the destination depends on the switch that has been reached in the upwards phase. As adaptive routing is used in the ascending phase, several output ports are possible at each switch and the final choice depends on the selection function. The impact of the selection function on performance has been previously studied for direct networks and has not resulted to be very important. In fat-trees, the decisions made in the upwards phase by the selection function can be critical, since it determines the switch reached in the upwards phase, and therefore the unique downwards path to the destination. In this paper, we analyze the effect of the selection function on fat-trees. Several selection functions are defined, compared and evaluated. The evaluation shows that selection function has a great impact on fat-trees",
	address = "Berlin, Germany",
	journal = "Euro-Par 2006 Parallel Processing. 12th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol. 4128)",
	keywords = "telecommunication network routing;telecommunication network topology;telecommunication switching;trees;",
	note = "selection function;fat-trees;adaptive routing;interconnection networks;",
	pages = "864 - 73",
	title = "{O}n the influence of the selection function on the performance of fat-trees",
	year = 2006
}

Maria E Gomez, Pedro Lopez and Jose Duato. FIR: An efficient routing strategy for tori and meshes. Journal of Parallel and Distributed Computing 66(7):907 - 21, 2006. URL, DOI BibTeX

@article{8981461,
	author = "Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Recent massively parallel computers are based on clusters of PCs. These machines use one of the recently proposed standard interconnects. These interconnects either use source routing or distributed routing based on forwarding tables. While source routers are simpler, distributed routers provide more flexibility, allowing the network to achieve a higher performance. Distributed routing can be implemented by a fixed hardware specific to a routing function on a given topology or by using forwarding tables. The main problem of this approach is the lack of scalability of forwarding tables. In this paper, we propose a distributed routing strategy for commercial switches, flexible interval routing, that is scalable, both in memory and routing time because it is not based on tables. At the same time, the strategy is easy to reconfigure, being able to implement the most commonly used routing algorithms in the most widely used regular topologies. [All rights reserved Elsevier]",
	address = "USA",
	doi = "10.1016/j.jpdc.2005.12.012",
	issn = "0743-7315",
	journal = "Journal of Parallel and Distributed Computing",
	keywords = "multiprocessor interconnection networks;telecommunication network routing;workstation clusters;",
	note = "FIR;flexible interval routing;network routing;PC clusters;network topology;",
	number = 7,
	pages = "907 - 21",
	title = "{FIR}: {A}n efficient routing strategy for tori and meshes",
	url = "http://dx.doi.org/10.1016/j.jpdc.2005.12.012",
	volume = 66,
	year = 2006
}

Maria E Gomez, N A Nordbotten, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, T Skeie and O Lysne. A routing methodology for achieving fault tolerance in direct networks. IEEE Transactions on Computers 55(4):400 - 15, 2006. URL, DOI BibTeX

@article{8935111,
author = "Gomez, Maria E. and N.A. Nordbotten and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and T. Skeie and O. Lysne",
abstract = "Massively parallel computing systems are being built with thousands of nodes. The interconnection network plays a key role for the performance of such systems. However, the high number of components significantly increases the probability of failure. Additionally, failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism to keep the system running, even in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. In order to avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. However, there are scenarios where the faults cannot be avoided solely by using an intermediate node. Thus, we also provide some extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis. We also propose the use of more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance",
address = "USA",
doi = "10.1109/TC.2006.46",
issn = "0018-9340",
journal = "IEEE Transactions on Computers",
keywords = "fault tolerant computing;multiprocessor interconnection networks;network routing;parallel processing;",
note = "direct networks;parallel computing system;interconnection network;fault-tolerant routing methodology;adaptive routing;checkpoint-restart mechanism;",
number = 4,
pages = "400 - 15",
title = "{A} routing methodology for achieving fault tolerance in direct networks",
url = "http://dx.doi.org/10.1109/TC.2006.46",
volume = 55,
year = 2006
}

Elvira Baydal, Pedro Lopez and Jose Duato. A family of mechanisms for congestion control in wormhole networks. Parallel and Distributed Systems, IEEE Transactions on 16(9):772 - 784, 2005. URL, DOI BibTeX

@article{1490509,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Multiprocessor interconnection networks may reach congestion with high traffic loads, which prevents reaching the wished performance. Unfortunately, many of the mechanisms proposed in the literature for congestion control either suffer from a lack of robustness, being unable to work properly with different traffic patterns or message lengths, or detect congestion relying on global information that wastes some network bandwidth. This paper presents a family of mechanisms to avoid network congestion in wormhole networks. All of them need only local information, applying message throttling when it is required. The proposed mechanisms use different strategies to detect network congestion and also apply different corrective actions. The mechanisms are evaluated and compared for several network loads and topologies, noticeably improving network performance with high loads but without penalizing network behavior for low and medium traffic rates, where no congestion control is required.",
	doi = "10.1109/TPDS.2005.102",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "message throttling; multiprocessor interconnection network; network bandwidth; network congestion control; traffic load; wormhole network; wormhole switching; multiprocessor interconnection networks; telecommunication congestion control; telecommunicatio",
	month = "sept.",
	number = 9,
	pages = "772 - 784",
	title = "{A} family of mechanisms for congestion control in wormhole networks",
	url = "http://dx.doi.org/10.1109/TPDS.2005.102",
	volume = 16,
	year = 2005
}

Maria E Gomez, Pedro Lopez and Jose Duato. A Memory-Effective Fault-Tolerant Routing Strategy for Direct Interconnection Networks. In Parallel and Distributed Computing, 2005. ISPDC 2005. The 4th International Symposium on. July 2005, 341 -348. URL, DOI BibTeX

@conference{1609988,
	author = "Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "High-performance interconnection networks are crucial in massively parallel computers. Routing is one of the most important design issues of interconnection networks. Moreover, the huge amount of hardware of these machines makes fault-tolerance another important design issue. In this paper, we propose a mechanism that combines scalable routing and fault-tolerance for commercial switches to build direct regular topologies, which are the topologies used in large machines. The hardware required is not complex. Furthermore, it allows a high degree of fault-tolerance inflicting a minimal decrease of performance",
	booktitle = "Parallel and Distributed Computing, 2005. ISPDC 2005. The 4th International Symposium on",
	doi = "10.1109/ISPDC.2005.6",
	keywords = "adaptive routing;direct interconnection networks;distributed routing;memory-effective fault-tolerant routing;fault tolerance;multiprocessor interconnection networks;telecommunication network reliability;telecommunication network routing;",
	month = "july",
	pages = "341 -348",
	title = "{A} {M}emory-{E}ffective {F}ault-{T}olerant {R}outing {S}trategy for {D}irect {I}nterconnection {N}etworks",
	url = "http://dx.doi.org/10.1109/ISPDC.2005.6",
	year = 2005
}

Marina Alonso, Juan Miguel Martínez, Vicente Santonja, Pedro Lopez and Jose Duato. Power Saving in Regular Interconnection Networks Built with High-Degree Switches. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International. April 2005, 5b - 5b. URL, DOI BibTeX

@conference{1419820,
	author = "Alonso, Marina and Mart{\'i}nez, Juan Miguel and Santonja, Vicente and Lopez, Pedro and Duato, Jose",
	abstract = "Nowadays, high-degree switches are available as building blocks of the interconnection network of clusters of PCs. An alternative to take advantage of the high number of switch ports is to connect every pair of switches through not only one but several links (this is known as link trunking in other environments). This extra connectivity can be exploited by using adaptive routing algorithms, thus improving network throughput and reducing network congestion. However, with low traffic loads, all the links that compose the trunk link will not be utilized, but these idle links continue consuming power. Power consumption reduction techniques are being applied everywhere in computer systems and the interconnection network is not an exception, as its contribution is not negligible. In this paper, we present a mechanism that dynamically switches on and off network links as a function of traffic. It is specially targeted to those networks where trunk links are used. The mechanism can switch off any link, provided that network connectivity is guaranteed (i.e. every pair of switches should be connected through at least one active link). Indeed, this restriction makes it possible to use the same routing algorithm regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that the network power consumption can be greatly reduced, at the expense of some increase in latency. Nevertheless, it is shown that the power reduction is always higher than this latency increase.",
	booktitle = "Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International",
	doi = "10.1109/IPDPS.2005.349",
	isbn = "0-7695-2312-9",
	keywords = "PC clusters; adaptive routing algorithm; high-degree switch; link trunking; network congestion; network link; network throughput; power consumption; power saving; regular interconnection network; telecommunication traffic; power consumption; telecommunic",
	month = "april",
	pages = "5b - 5b",
	title = "{P}ower {S}aving in {R}egular {I}nterconnection {N}etworks {B}uilt with {H}igh-{D}egree {S}witches",
	url = "http://dx.doi.org/10.1109/IPDPS.2005.349",
	year = 2005
}

Maria E Gomez, Pedro Lopez and Jose Duato. A Memory-Effective Routing Strategy for Regular Interconnection Networks. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International. April 2005, 41b - 41b. URL, DOI BibTeX

@conference{1419862,
	author = "Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Massively parallel computing systems have been or are being built with thousands of nodes. In such systems, high-performance interconnection networks are crucial to achieve the maximum performance. Routing is one of the most important design issues of interconnection networks. Routing strategies can be mainly classified as source and distributed routing. Source routing has been used in some networks because routers are very simple. On the other hand, distributed routing allows more flexibility, but the routers are more complex. Distributed routing can be implemented by a fixed hardware specific to a routing function on a given topology, or by using forwarding tables that are very flexible but suffer from a lack of scalability. In this paper, we propose a distributed routing strategy for commercial switches, Flexible Interval Routing, that is scalable for the most widely used regular topologies (tori and meshes) because it is not based on tables. At the same time, the strategy is easy to reconfigure to deal with changes in the topology or in the routing algorithm for a given topology, being able to implement the most commonly-used routing algorithms in regular topologies.",
	booktitle = "Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International",
	doi = "10.1109/IPDPS.2005.44",
	keywords = "distributed routing; flexible interval routing; high-performance interconnection networks; memory-effective routing strategy; parallel computing system; multiprocessor interconnection networks; network routing; parallel machines; performance evaluation;",
	month = "april",
	pages = "41b - 41b",
	title = "{A} {M}emory-{E}ffective {R}outing {S}trategy for {R}egular {I}nterconnection {N}etworks",
	url = "http://dx.doi.org/10.1109/IPDPS.2005.44",
	year = 2005
}

Elvira Baydal, Pedro Lopez and Jose Duato. A family of mechanisms for congestion control in wormhole networks. IEEE Transactions on Parallel and Distributed Systems 16(9):772 - 84, 2005. URL BibTeX

@article{8570709,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Multiprocessor interconnection networks may reach congestion with high traffic loads, which prevents reaching the wished performance. Unfortunately, many of the mechanisms proposed in the literature for congestion control either suffer from a lack of robustness, being unable to work properly with different traffic patterns or message lengths, or detect congestion relying on global information that wastes some network bandwidth. This paper presents a family of mechanisms to avoid network congestion in wormhole networks. All of them need only local information, applying message throttling when it is required. The proposed mechanisms use different strategies to detect network congestion and also apply different corrective actions. The mechanisms are evaluated and compared for several network loads and topologies, noticeably improving network performance with high loads but without penalizing network behavior for low and medium traffic rates, where no congestion control is required",
	address = "USA",
	issn = "1045-9219",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	keywords = "multiprocessor interconnection networks;telecommunication congestion control;telecommunication network routing;telecommunication network topology;telecommunication switching;telecommunication traffic;",
	note = "multiprocessor interconnection network;traffic load;network congestion control;network bandwidth;wormhole network;message throttling;wormhole switching;",
	number = 9,
	pages = "772 - 84",
	title = "{A} family of mechanisms for congestion control in wormhole networks",
	url = "http://dx.doi.org/10.1109/TPDS.2005.102",
	volume = 16,
	year = 2005
}

Marina Alonso, Juan Miguel Martínez, Vicente Santonja, Pedro Lopez and Jose Duato. Power saving in regular interconnection networks built with high-degree switches. 2005, 10 pp. -. BibTeX

@conference{8539357,
	author = "Alonso, Marina and Mart{\'i}nez, Juan Miguel and Santonja, Vicente and Lopez, Pedro and Duato, Jose",
	abstract = "Nowadays, high-degree switches are available as building blocks of the interconnection network of clusters of PCs. An alternative to take advantage of the high number of switch ports is to connect every pair of switches through not only one but also several links (this is known as link trunking in other environments). This extra connectivity can be exploited by using adaptive routing algorithms, thus improving network throughput and reducing network congestion. However, with low traffic loads, all the links that compose the trunk link will not be utilized, but these idle links continue consuming power. Power consumption reduction techniques are being applied everywhere in computer systems and the interconnection network is not an exception, as its contribution is not negligible. In this paper, we present a mechanism that dynamically switches on and off network links as a function of traffic. It is specially targeted to those networks where trunk links are used. The mechanism can switch off any link, provided that network connectivity is guaranteed (i.e. every pair of switches should be connected through at least one active link). Indeed, this restriction makes it possible to use the same routing algorithm regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that the network power consumption can be greatly reduced, at the expense of some increase in latency. Nevertheless, it is shown that the power reduction is always higher than this latency increase.",
	address = "Los Alamitos, CA, USA",
	booktitle = "Proceedings. 19th IEEE International Parallel and Distributed Processing Symposium",
	keywords = "power consumption;telecommunication congestion control;telecommunication links;telecommunication network routing;telecommunication switching;telecommunication traffic;workstation clusters;",
	note = "power saving;regular interconnection network;high-degree switch;PC clusters;link trunking;adaptive routing algorithm;network throughput;network congestion;power consumption;telecommunication traffic;network link;",
	pages = "10 pp. -",
	title = "{P}ower saving in regular interconnection networks built with high-degree switches",
	year = 2005
}

Michihiro Koibuchi, Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Enforcing in-order packet delivery in system area networks with adaptive routing. Journal of Parallel and Distributed Computing 65(10):1223 - 1236, 2005. URL BibTeX

@article{2005379355213,
	author = "Michihiro Koibuchi and Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = "Adaptive routing, which dynamically selects the route of packets, has been widely studied for interconnection networks in massively parallel computers and system area networks. Although adaptive routing has the advantage of providing high bandwidth, it may deliver packets out-of-order, which some message passing libraries do not accept. In this paper, we propose two mechanisms called (1) FIFO transmission and (2) couple limitation to guarantee in-order packet delivery in adaptive routing. Both of them limit packet injection at source hosts. The FIFO transmission completely avoids packet sorting at destination hosts, while the couple limitation uses a few buffers to sort packets at destination hosts. Evaluation results show that the FIFO transmission and the couple limitation achieve a similar throughput to that of a method equipped with huge (infinite) buffers enough to store all out-of-order packets at destination hosts under both synthetic traffic and NAS Parallel Benchmarks. © 2005 Elsevier Inc. All rights reserved.",
	issn = 07437315,
	journal = "Journal of Parallel and Distributed Computing",
	key = "Packet networks",
	keywords = "Bandwidth;Benchmarking;Interconnection networks;Routers;Telecommunication traffic;",
	note = "Adaptive routing;In-order packet delivery;PC clusters;System area networks;",
	number = 10,
	pages = "1223 - 1236",
	title = "{E}nforcing in-order packet delivery in system area networks with adaptive routing",
	url = "http://dx.doi.org/10.1016/j.jpdc.2005.04.007",
	volume = 65,
	year = 2005
}

Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez, Jose Duato and M Koibuchi. In-Order Packet Delivery in Interconnection Networks using Adaptive Routing. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International. 2005, 101 - 101. DOI BibTeX

@conference{1419928,
	author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose and M. Koibuchi",
	abstract = "Most commercial switch-based network technologies for PC clusters use deterministic routing. Alternatively, adaptive routing could be used to improve network performance. In this case, switches decide the path to reach the destination by using local information about the state of the possible outgoing links. However, there are two drawbacks that discourage adaptive routing from being applied to commercial interconnects. The first one concerns the possible switch complexity increase with respect to deterministic routing. The second drawback is due to the fact that adaptive routing may introduce out-of-order packet delivery, which is not acceptable for some applications. To the best of our knowledge, there are no works that analyze the degree of out-of-order packet delivery caused by different network and traffic conditions. In this paper, we take on such a challenge. We show that out-of-order delivery is introduced only for high traffic conditions (reaching saturation). Moreover, by using small buffers and simple sorting mechanisms at destination, we show that high network throughput can be obtained while packets are delivered in order. Thus, the paper demonstrates that it is possible to use adaptive routing, while still guaranteeing in-order packet delivery, without using large buffer resources or significantly degrading performance.",
	booktitle = "Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International",
	doi = "10.1109/IPDPS.2005.255",
	keywords = "PC clusters; adaptive routing; deterministic routing; interconnection networks; out-of-order packet delivery; sorting mechanisms; switch-based network technologies; multiprocessor interconnection networks; network routing; packet switching; sorting; work",
	month = "04-08",
	pages = "101 - 101",
	title = "{I}n-{O}rder {P}acket {D}elivery in {I}nterconnection {N}etworks using {A}daptive {R}outing",
	year = 2005
}

Maria E Gomez, Pedro Lopez and Jose Duato. A memory-effective routing strategy for regular interconnection networks. 2005, 41 -. BibTeX

@conference{2005509538034,
	author = "Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "Massively parallel computing systems are being built with thousands of nodes. In such systems, high-performance inter-connection networks are crucial to achieve the maximum performance. Routing is one of the most important design issues of interconnection networks. Routing strategies can be mainly classified as source and distributed routing. Source routing has been used in some networks because routers are very simple. On the other hand, distributed routing allows more flexibility, but the routers are more complex. Distributed routing can be implemented by a fixed hardware specific to a routing function on a given topology, or by using forwarding tables that are very flexible but suffer from a lack of scalability. In this paper, we propose a distributed routing strategy for commercial switches, Flexible Interval Routing, that is scalable for the most widely used regular topologies (tori and meshes) because it is not based on tables. At the same time, the strategy is easy to reconfigure to deal with changes in the topology or in the routing algorithm for a given topology, being able to implement the most commonly-used routing algorithms in regular topologies.",
	address = "Denver, CO, United States",
	booktitle = "Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium",
	key = "Interconnection networks",
	keywords = "Algorithms;Computer hardware;Data storage equipment;Parallel processing systems;Routers;Switches;Topology;",
	note = "Distributed routing;Routing algorithms;Routing strategies;Source routing;",
	pages = "41 -",
	title = "{A} memory-effective routing strategy for regular interconnection networks",
	year = 2005
}

Maria E Gomez, Pedro Lopez and Jose Duato. A memory-effective fault-tolerant routing strategy for direct interconnection networks. 2005, 341 - 8. BibTeX

@conference{8762349,
	author = "Gomez, Maria E. and Lopez, Pedro and Duato, Jose",
	abstract = "High-performance interconnection networks are crucial in massively parallel computers. Routing is one of the most important design issues of interconnection networks. Moreover, the huge amount of hardware of these machines makes fault-tolerance another important design issue. In this paper, we propose a mechanism that combines scalable routing and fault-tolerance for commercial switches to build direct regular topologies, which are the topologies used in large machines. The hardware required is not complex. Furthermore, it allows a high degree of fault-tolerance inflicting a minimal decrease of performance",
	address = "Los Alamitos, CA, USA",
	booktitle = "ISPDC 2005. The 4th International Workshop on Parallel and Distributed Computing",
	keywords = "fault tolerance;multiprocessor interconnection networks;telecommunication network reliability;telecommunication network routing;",
	note = "memory-effective fault-tolerant routing;direct interconnection networks;distributed routing;adaptive routing;",
	pages = "341 - 8",
	title = "{A} memory-effective fault-tolerant routing strategy for direct interconnection networks",
	year = 2005
}

Maria E Gomez, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, N A Nordbotten, O Lysne and T Skeie. An effective fault-tolerant routing methodology for direct networks. In Parallel Processing, 2004. ICPP 2004. International Conference on. 2004, 222 - 231 vol.1. URL, DOI BibTeX

@conference{1327925,
author = "Gomez, Maria E. and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and N.A. Nordbotten and O. Lysne and T. Skeie",
abstract = "Current massively parallel computing systems are being built with thousands of nodes, which significantly affect the probability of failure. M. E. Gomez proposed a methodology to design fault-tolerant routing algorithms for direct interconnection networks. The methodology uses a simple mechanism: for some source-destination pairs, packets are first forwarded to an intermediate node, and later, from this node to the destination node. Minimal adaptive routing is used along both subpaths. For those cases where the methodology cannot find a suitable intermediate node, it combines the use of intermediate nodes with two additional mechanisms: disabling adaptive routing and using misrouting on a per-packet basis. While the combination of these three mechanisms tolerates a large number of faults, each one requires adding some hardware support in the network and also introduces some overhead. In this paper, we perform an in-depth detailed analysis of the impact of these mechanisms on network behaviour. We analyze the impact of the three mechanisms separately and combined. The ultimate goal of this paper is to obtain a suitable combination of mechanisms that is able to meet the trade-off between fault-tolerance degree, routing complexity, and performance.",
booktitle = "Parallel Processing, 2004. ICPP 2004. International Conference on",
doi = "10.1109/ICPP.2004.1327925",
issn = "0190-3918",
keywords = "direct networks; fault-tolerant routing algorithm; in-depth detailed analysis; interconnection networks; minimal adaptive routing; parallel computing system; communication complexity; fault tolerant computing; multiprocessor interconnection networks; par",
month = "aug.",
pages = "222 - 231 vol.1",
title = "{A}n effective fault-tolerant routing methodology for direct networks",
url = "http://dx.doi.org/10.1109/ICPP.2004.1327925",
year = 2004
}

J M Montañana, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. A transition-based fault-tolerant routing methodology for InfiniBand networks. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. April 2004, 186. URL, DOI BibTeX

@conference{1303198,
author = "Monta{\~n}ana, J. M. and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
abstract = "Summary form only given. Currently, clusters of PCs are considered a cost-effective alternative to large parallel computers. As the number of elements increases in these systems, the probability of faults increases dramatically. Therefore, it is critical to keep the system running even in the presence of faults. The interconnection network plays a key role in its performance. InfiniBand (IBA) is a new standard interconnect suitable for clusters. Most of the fault-tolerant routing strategies proposed for massively parallel computers cannot be applied to IBA because routing and virtual channel transitions are deterministic, which prevents packets from avoiding the faults. A possible approach to provide fault-tolerance in IBA consists of using several disjoint paths between every source-destination pair of nodes and selecting the appropriate path at the source host. However, to this end, a routing algorithm able to provide enough disjoint paths, while still guaranteeing deadlock freedom, is required. We propose a simple and effective fault-tolerant methodology for IBA networks that can be applied to any network topology and meets the trade-off between fault-tolerance degree and the number of network resources devoted to it. Preliminary results show that the proposed methodology scales well and supports up to three faults in 2D and five in 3D tori using only two virtual channels.",
booktitle = "Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International",
doi = "10.1109/IPDPS.2004.1303198",
isbn = "0-7695-2132-0",
keywords = "fault tolerant computing;multiprocessor interconnection networks;network topology;parallel machines;telecommunication network routing;workstation clusters;",
month = "april",
pages = 186,
title = "{A} transition-based fault-tolerant routing methodology for {I}nfini{B}and networks",
url = "http://dx.doi.org/10.1109/IPDPS.2004.1303198",
year = 2004
}

Maria E Gomez, Jose Duato, Jose Flich, Pedro Lopez, Antonio Robles, N A Nordbotten, O Lysne and T Skeie. An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori. Computer Architecture Letters 3(1):3 - 3, 2004. URL, DOI BibTeX

@article{1650124,
	author = "Gomez, Maria E. and Duato, Jose and Flich, Jose and Lopez, Pedro and Robles, Antonio and N.A. Nordbotten and O. Lysne and T. Skeie",
	abstract = "In this paper we present a methodology to design fault-tolerant routing algorithms for regular direct interconnection networks. It supports fully adaptive routing, does not degrade performance in the absence of faults, and supports a reasonably large number of faults without significantly degrading performance. The methodology is mainly based on the selection of an intermediate node (if needed) for each source-destination pair. Packets are adaptively routed to the intermediate node and, at this node, without being ejected, they are adaptively forwarded to their destinations. In order to allow deadlock-free minimal adaptive routing, the methodology requires only one additional virtual channel (for a total of three), even for tori. Evaluation results for a 4 x 4 x 4 torus network show that the methodology is 5-fault tolerant. Indeed, for up to 14 link failures, the percentage of fault combinations supported is higher than 99.96%. Additionally, network throughput degrades by less than 10% when injecting three random link faults without disabling any node. In contrast, a mechanism similar to the one proposed in the BlueGene/L, that disables some network planes, would strongly degrade network throughput by 79%.",
	doi = "10.1109/L-CA.2004.1",
	issn = "1556-6056",
	journal = "Computer Architecture Letters",
	month = "january-december",
	number = 1,
	pages = "3 - 3",
	title = "{A}n {E}fficient {F}ault-{T}olerant {R}outing {M}ethodology for {M}eshes and {T}ori",
	url = "http://dx.doi.org/10.1109/L-CA.2004.1",
	volume = 3,
	year = 2004
}

Maria E Gomez, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, N A Nordbotten, O Lysne and T Skeie. An effective fault-tolerant routing methodology for direct networks. 2004, 222 - 31. BibTeX

@conference{8279975,
author = "Gomez, Maria E. and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and N.A. Nordbotten and O. Lysne and T. Skeie",
abstract = "Current massively parallel computing systems are being built with thousands of nodes, which significantly affect the probability of failure. M. E. Gomez proposed a methodology to design fault-tolerant routing algorithms for direct interconnection networks. The methodology uses a simple mechanism: for some source-destination pairs, packets are first forwarded to an intermediate node, and later, from this node to the destination node. Minimal adaptive routing is used along both subpaths. For those cases where the methodology cannot find a suitable intermediate node, it combines the use of intermediate nodes with two additional mechanisms: disabling adaptive routing and using misrouting on a per-packet basis. While the combination of these three mechanisms tolerates a large number of faults, each one requires adding some hardware support in the network and also introduces some overhead. In this paper, we perform an in-depth detailed analysis of the impact of these mechanisms on network behaviour. We analyze the impact of the three mechanisms separately and combined. The ultimate goal of this paper is to obtain a suitable combination of mechanisms that is able to meet the trade-off between fault-tolerance degree, routing complexity, and performance.",
address = "Los Alamitos, CA, USA",
booktitle = "2004 International Conference on Parallel Processing",
keywords = "communication complexity;fault tolerant computing;multiprocessor interconnection networks;parallel processing;",
note = "parallel computing system;fault-tolerant routing algorithm;interconnection networks;minimal adaptive routing;in-depth detailed analysis;direct networks;",
pages = "222 - 31",
title = "{A}n effective fault-tolerant routing methodology for direct networks",
volume = "vol.1",
year = 2004
}

Maria E Gomez, Jose Duato, Jose Flich, Pedro Lopez, Antonio Robles, N A Nordbotten, T Skeie and O Lysne. A new adaptive fault-tolerant routing methodology for direct networks. 2004, 462 - 73. BibTeX

@conference{8426282,
	author = "Gomez, Maria E. and Duato, Jose and Flich, Jose and Lopez, Pedro and Robles, Antonio and N.A. Nordbotten and T. Skeie and O. Lysne",
	abstract = "Interconnection networks play a key role in the fault tolerance of massively parallel computers, since faults may isolate a large fraction of the machine containing many healthy nodes. In this paper, we present a methodology to design fully adaptive fault-tolerant routing algorithms for direct interconnection networks that can be applied to different regular topologies. The methodology is mainly based on the selection of an intermediate node (if needed) for each source-destination pair. Packets are adaptively routed to the intermediate node and, from this node, they are adaptively forwarded to their destination. This methodology requires only one additional virtual channel, even for tori. Evaluation results show that the methodology is 7-fault tolerant, and for up to 14 faults, more than 99% of the combinations are tolerated, also without significantly degrading performance in the presence of faults",
	address = "Berlin, Germany",
	booktitle = "High Performance Computing-HiPC 2004. 11th International Conference (Lecture notes in Computer Science Vol.3296)",
	keywords = "fault tolerant computing;multiprocessor interconnection networks;parallel processing;telecommunication network routing;telecommunication network topology;",
	note = "adaptive fault-tolerant routing;direct interconnection networks;massively parallel computers;",
	pages = "462 - 73",
	title = "{A} new adaptive fault-tolerant routing methodology for direct networks",
	year = 2004
}

Marina Alonso, J M Martinez, Vicente Santonja and Pedro Lopez. Reducing power consumption in interconnection networks by dynamically adjusting link width. 2004, 882 - 90. BibTeX

@conference{8314163,
	author = "Alonso, Marina and J.M. Martinez and Santonja, Vicente and Lopez, Pedro",
	abstract = "The huge increase both in size and complexity of high-end multiprocessor systems has triggered their power consumption. Air or liquid cooling systems are needed, which, in turn, increases power consumption. Another important percentage of the consumption is due to the interconnection network. In this paper, we propose a mechanism that dynamically reduces the available network bandwidth when traffic becomes low. Unlike other approaches that completely switch links off when they are not fully utilized, our mechanism is based on reducing their bandwidth by narrowing their width. As the topology of the network is not modified, the same routing algorithm can be used regardless of the power consumption level, which simplifies the router design. By using this strategy, the consumption may be strongly reduced. In fact, the lower bound of this reduction is a design parameter of the mechanism. The price to pay is an increase in the message latency with low network loads",
	address = "Berlin, Germany",
	booktitle = "Euro-Par 2004 Parallel Processing. 10th International Euro-Par Conference. Proceedings (Lecture Notes in Comput. Sci. Vol.3149)",
	keywords = "bandwidth allocation;multiprocessor interconnection networks;power consumption;telecommunication links;telecommunication network routing;telecommunication traffic;",
	note = "power consumption reduction;interconnection networks;link width adjustment;multiprocessor systems;network bandwidth;",
	pages = "882 - 90",
	title = "{R}educing power consumption in interconnection networks by dynamically adjusting link width",
	year = 2004
}

T Skeie, O Lysne, Jose Flich, Pedro Lopez, Antonio Robles and Jose Duato. LASH-TOR: a generic transition-oriented routing algorithm. In Parallel and Distributed Systems, 2004. ICPADS 2004. Proceedings. Tenth International Conference on. 2004, 595 - 604. URL, DOI BibTeX

@conference{1316144,
	author = "T. Skeie and O. Lysne and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose",
	abstract = "Cluster networks are seen as the future access networks for multimedia streaming, e-commerce, network storage, etc. For these applications, performance and high availability are particularly crucial. Regular topologies are preferred when performance is the primary concern. However, due to spatial constraints or fault-related issues, the network structure may become irregular, which makes it more difficult to find deadlock-free minimal paths. Over the recent years, several solutions have been proposed. One of them is the LASH routing, which enables minimal routing by assigning paths to different virtual layers. In this paper, we propose an extension of LASH in order to reduce the number of required virtual layers by allowing transitions between virtual layers. Evaluation results show that the new routing scheme (LASH-TOR) is able to obtain full minimal routing with a reduced number of virtual channels. For torus and mesh networks, with only two virtual channels, LASH throughput is increased by an average factor of improvement of 3.30 for large networks. For regular networks with some unconnected (faulty) links, equal performance improvements are achieved. Even for highly irregular networks of size up to 128 switches the new routing scheme only needs three virtual channels for guaranteeing minimal routing. Besides, LASH-TOR performs well compared to dimension order routing for mesh and torus networks.",
	booktitle = "Parallel and Distributed Systems, 2004. ICPADS 2004. Proceedings. Tenth International Conference on",
	doi = "10.1109/ICPADS.2004.1316144",
	isbn = "0-7695-2152-5",
	issn = "1521-9097",
	keywords = "LASH routing; LASH-TOR; access networks; cluster networks; deadlock-free minimal paths; e-commerce; mesh network; multimedia streaming; network storage; network structure; spatial constraints; torus network; transition-oriented routing algorithm; virtual",
	month = "7-9",
	pages = "595 - 604",
	title = "{LASH}-{TOR}: a generic transition-oriented routing algorithm",
	url = "http://dx.doi.org/10.1109/ICPADS.2004.1316144",
	year = 2004
}

N A Nordbotten, Maria E Gomez, Jose Flich, Pedro Lopez, Antonio Robles, T Skeie, O Lysne and Jose Duato. A fully adaptive fault-tolerant routing methodology based on intermediate nodes. 2004, 341 - 56. BibTeX

@conference{8322959,
	author = "N.A. Nordbotten and Gomez, Maria E. and Flich, Jose and Lopez, Pedro and Robles, Antonio and T. Skeie and O. Lysne and Duato, Jose",
	abstract = "Massively parallel computing systems are being built with thousands of nodes. Because of the high number of components, it is critical to keep these systems running even in the presence of failures. Interconnection networks play a key role in these systems, and this paper proposes a fault-tolerant routing methodology for use in such networks. The methodology supports any minimal routing function (including fully adaptive routing), does not degrade performance in the absence of faults, does not disable any healthy node, and is easy to implement both in meshes and tori. In order to avoid network failures, the methodology uses a simple mechanism: for some source-destination pairs, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network). The methodology is shown to tolerate a large number of faults (e.g., five/nine faults when using two/three intermediate nodes in a 3D torus). Furthermore, the methodology offers a graceful performance degradation: in an 8 × 8 × 8 torus network with 14 faults the throughput is only decreased by 6.49%.",
	address = "Germany, Germany",
	booktitle = "Network and Parallel Computing. IFIP International Conference, NPC 2004. Proceedings (Lecture Notes in Computer Science Vol.3222)",
	keywords = "fault tolerant computing;multiprocessor interconnection networks;packet switching;parallel processing;telecommunication network routing;",
	note = "fully adaptive fault-tolerant routing;intermediate nodes;massively parallel computing systems;interconnection networks;minimal routing function;network failures;source-destination pairs;",
	pages = "341 - 56",
	title = "{A} fully adaptive fault-tolerant routing methodology based on intermediate nodes",
	year = 2004
}

Jose Flich, Pedro Lopez, M P Malumbres, Jose Duato and T Rokicki. Applying in-transit buffers to boost the performance of networks with source routing. Computers, IEEE Transactions on 52(9):1134 - 1153, 2003. DOI BibTeX

@article{1228510,
author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose and T. Rokicki",
abstract = "In this paper, we analyze in depth the effect of using ITB in the network, showing that they not only serve for guaranteeing minimal routing, but also that they are a powerful mechanism able to balance network traffic and reduce network contention. To demonstrate these capabilities, we apply the ITB mechanism to improved routing schemes, such as DFS and smart-routing. These routing algorithms (without ITB) are able to improve the performance of up*/down* by 30 percent and 90 percent, respectively, for a 32-switch network. The evaluation results show that, when ITB are used together with these improved routing algorithms, network throughput achieved by DFS and smart-routing can still be improved by 56 percent and 23 percent, respectively. However, smart-routing requires a time to compute the routing tables that rapidly grows with network size, it being impossible in practice to build networks with more than 32 switches. This high computational cost is mainly motivated by the need of obtaining deadlock-free routing tables. However, when ITB are used, one can decouple the stages of computing routing tables and breaking cycles. Moreover, as stated above, ITB can be used to reduce network contention. In this way, in this paper, we also propose a completely new routing algorithm that tries to balance network traffic by using a simple and low time consuming strategy. The proposed algorithm guarantees deadlock freedom and reduces network contention with the use of ITB. The evaluation results show that our algorithm obtains unprecedented throughputs in 32-switch networks, tripling the original up*/down* and almost doubling smart-routing.",
doi = "10.1109/TC.2003.1228510",
issn = "0018-9340",
journal = "Computers, IEEE Transactions on",
keywords = "32-switch network; DFS; ITB; NOW; breaking cycles; deadlock-free routing tables; in-transit buffers; minimal routing; network contention reduction; network performance; network throughput; network traffic balancing; networks of workstations; performance;",
month = "sept.",
number = 9,
pages = "1134 - 1153",
title = "{A}pplying in-transit buffers to boost the performance of networks with source routing",
volume = 52,
year = 2003
}

J M M Rubio, Pedro Lopez and Jose Duato. FC3D: flow control-based distributed deadlock detection mechanism for true fully adaptive routing in wormhole networks. Parallel and Distributed Systems, IEEE Transactions on 14(8):765 - 779, 2003. URL, DOI BibTeX

@article{1225056,
	author = "J.M.M. Rubio and Lopez, Pedro and Duato, Jose",
	abstract = "Two general approaches have been proposed for deadlock handling in wormhole networks. Traditionally, deadlock-avoidance strategies have been used. In this case, either routing is restricted so that there are no cyclic dependencies between channels or cyclic dependencies between channels are allowed provided that there are some escape paths to avoid deadlock. More recently, deadlock recovery strategies have begun to gain acceptance. These strategies allow the use of unrestricted fully adaptive routing, usually outperforming deadlock avoidance techniques. However, they require a deadlock detection mechanism and a deadlock recovery mechanism that is able to recover from deadlocks faster than they occur. In particular, progressive deadlock recovery techniques are very attractive because they allocate a few dedicated resources to quickly deliver deadlocked messages, instead of killing them. Unfortunately, distributed deadlock detection is usually based on crude time-outs, which detect many false deadlocks. As a consequence, messages detected as deadlocked may saturate the bandwidth offered by recovery resources, thus degrading performance. Additionally, the threshold required by the detection mechanism (the time-out) strongly depends on network load, which is not known in advance at the design stage. This limits the applicability of deadlock recovery on actual networks. We propose a novel distributed deadlock detection mechanism that uses only local information, detects all the deadlocks, considerably reduces the probability of false deadlock detection over previously proposed techniques, and is not significantly affected by variations in message length and/or message destination distribution.",
	doi = "10.1109/TPDS.2003.1225056",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "FC3D mechanism; crude time-out; deadlock detection mechanism; deadlock-avoidance strategy; deadlocked message; false deadlock detection probability; flow control-based distributed deadlock detection; message destination distribution; message length distr",
	month = "aug.",
	number = 8,
	pages = "765 - 779",
	title = "{FC}3{D}: flow control-based distributed deadlock detection mechanism for true fully adaptive routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/TPDS.2003.1225056",
	volume = 14,
	year = 2003
}

Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting fully adaptive routing in InfiniBand networks. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International. April 2003, 10 pp.. URL, DOI BibTeX

@conference{1213130,
author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed routing. However, routing in IBA is deterministic because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However, using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper, we propose a simple strategy to implement forwarding tables for IBA switches that support adaptive routing while still maintaining compatibility with the IBA specs. Adaptive routing can be enabled or disabled individually for each packet at the source node. Also, the proposed strategy enables the use in IBA of fully adaptive routing algorithms without using additional network resources to improve network performance. Evaluation results show that extending IBA switch capabilities with fully adaptive routing noticeably increases network performance. In particular, network throughput increases up to an average factor of 3.9.",
booktitle = "Parallel and Distributed Processing Symposium, 2003. Proceedings. International",
doi = "10.1109/IPDPS.2003.1213130",
issn = "1530-2075",
keywords = "InfiniBand networks; distributed routing; fully adaptive routing; interprocessor communication; network performance; network throughput; processing nodes; computer networks; multiprocessor interconnection networks; performance evaluation;",
month = "april",
pages = "10 pp.",
title = "{S}upporting fully adaptive routing in {I}nfini{B}and networks",
url = "http://dx.doi.org/10.1109/IPDPS.2003.1213130",
year = 2003
}

Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting adaptive routing in InfiniBand networks. In Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on. 2003, 165 - 172. URL, DOI BibTeX

@conference{1183583,
author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed deterministic routing because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper we propose a simple strategy to implement forwarding tables for IBA switches that supports adaptive routing while still maintaining compatibility with the IBA specifications. Adaptive routing can be individually enabled or disabled for each packet at the source node. The proposed strategy enables the use in IBA of any adaptive routing algorithm with an acyclic channel dependence graph. In this paper, we have taken advantage of the partial adaptivity provided by the well-known up*/down* routing algorithm. Evaluation results show that extending IBA switch capabilities with adaptive routing may noticeably increase network performance. In particular network throughput improvement can be, on average, as high as 46%.",
booktitle = "Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on",
doi = "10.1109/EMPDP.2003.1183583",
issn = "1066-6192",
keywords = "I-O devices; IBA switches; InfiniBand Architecture; InfiniBand networks; acyclic channel dependence graph; adaptive routing; deterministic routing; forwarding tables; interprocessor communication; network performance; network throughput; processing node",
month = "feb",
pages = "165 - 172",
title = "{S}upporting adaptive routing in {I}nfini{B}and networks",
url = "http://dx.doi.org/10.1109/EMPDP.2003.1183583",
year = 2003
}

Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting fully adaptive routing in InfiniBand networks. 2003, 10 pp. URL BibTeX

@conference{7891311,
author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed routing. However, routing in IBA is deterministic because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However, using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper, we propose a simple strategy to implement forwarding tables for IBA switches that support adaptive routing while still maintaining compatibility with the IBA specs. Adaptive routing can be enabled or disabled individually for each packet at the source node. Also, the proposed strategy enables the use in IBA of fully adaptive routing algorithms without using additional network resources to improve network performance. Evaluation results show that extending IBA switch capabilities with fully adaptive routing noticeably increases network performance. In particular, network throughput increases up to an average factor of 3.9",
address = "Los Alamitos, CA, USA",
booktitle = "Proceedings International Parallel and Distributed Processing Symposium",
keywords = "computer networks;multiprocessor interconnection networks;performance evaluation;",
note = "fully adaptive routing;InfiniBand networks;processing nodes;interprocessor communication;distributed routing;network performance;network throughput;",
pages = "10 pp.",
title = "{S}upporting fully adaptive routing in {I}nfini{B}and networks",
url = "http://dx.doi.org/10.1109/IPDPS.2003.1213130",
year = 2003
}

Maria E Gomez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. VOQSW: a methodology to reduce HOL blocking in InfiniBand networks. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International. 2003, 10 pp.. DOI BibTeX

@conference{1213134,
	author = "Gomez, Maria E. and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = "InfiniBand is a new switch-based standard interconnect for communication between processor nodes and I/O devices as well as for interprocessor communication. InfiniBand architecture allows switches to support up to 15 virtual lanes per port for data traffic. To route packets through a given virtual lane (VL), packets are labeled with a certain service level (SL) at injection time, and SLtoVL mapping tables are used at each switch to determine the VL to be used. Many previous works in the literature have shown that separate virtual lanes are able to reduce the influence of the well-known head-of-line (HOL) blocking effect on network performance. However, using virtual lanes to form separate virtual networks is not enough to eliminate the HOL blocking problem. Alternative solutions such as Virtual Output Queuing (VOQ) are able to eliminate it at the expense of modifying the switch buffer organization. In this paper, we propose an effective strategy to implement the VOQ scheme in IBA switches by using virtual lanes. This strategy does not require modifying the switch architecture; the SL to VL tables simply must be properly filled. Evaluation results show that our proposed VOQ scheme is able to outperform the results obtained with the virtual network approach using the same number of resources. Moreover, the methodology proposed to implement the VOQ scheme in IBA only requires a small number of resources in order to significantly improve network throughput.",
	booktitle = "Parallel and Distributed Processing Symposium, 2003. Proceedings. International",
	doi = "10.1109/IPDPS.2003.1213134",
	keywords = "HOL blocking; InfiniBand networks; SL to VL mapping tables; head-of-line blocking effect; interprocessor communication; network performance; network throughput; switch buffer organization; switch-based standard interconnect; virtual lane; virtual output",
	month = "april 22-26",
	pages = "10 pp.",
	title = "{VOQSW}: a methodology to reduce {HOL} blocking in {I}nfini{B}and networks",
	year = 2003
}

JC Sancho, Antonio Robles, Pedro Lopez, Jose Flich and Jose Duato. Routing in InfiniBand (TM) torus network topologies. In P Sadayappan and CS Yang (eds.). 2003 International Conference on Parallel Processing, Proceedings. 2003, 509-518. BibTeX

@conference{ISI:000186828800056,
	author = "Sancho, J. C. and Robles, Antonio and Lopez, Pedro and Flich, Jose and Duato, Jose",
	abstract = "InfiniBand is an interconnect standard for communication between processing nodes and I/O devices as well as for interprocessor communication (NOWs). The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links whose topology can be established by the customer. When performance is the primary concern, regular topologies are preferred. Low-dimensional tori (2D and 3D) are some of the regular topologies most widely used in commercial parallel computers. Routing in tori requires the use of virtual channels. Although InfiniBand provides support for deterministic routing and virtual channels, they are selected at each switch by service level (SL) identifiers associated to packets and do not depend on packet destination. This makes routing algorithm implementation more complex. In particular, a large number of SLs may be required, which is a scarce resource. In this paper we analyze the way several routing strategies can be applied in tori InfiniBand networks, also evaluating their resource requirements. In particular, we analyze and compare the well-known e-cube and up{*}/down{*} routing algorithms and the Flexible routing algorithm recently proposed.",
	booktitle = "2003 International Conference on Parallel Processing, Proceedings",
	editor = "Sadayappan, P and Yang, CS",
	isbn = "0-7695-2017-0",
	note = "International Conference on Parallel Processing, Kaohsiung, Taiwan, Oct 06-09, 2003",
	pages = "509-518",
	title = "{R}outing in {I}nfini{B}and ({TM}) torus network topologies",
	year = 2003
}

Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting adaptive routing in IBA switches. 2003, 441 - 456. URL BibTeX

@article{2003487758791,
author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed deterministic routing because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However, using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper, we propose a simple strategy to implement forwarding tables for IBA switches that supports adaptive routing while still maintaining compatibility with the IBA specs. Adaptive routing can be individually enabled or disabled for each packet at the source node. The proposed strategy enables the use in IBA of any adaptive routing algorithm with an acyclic channel dependence graph. In this paper, we have taken advantage of the partial adaptivity provided by the well-known up*/down* routing algorithm. Evaluation results show that extending IBA switch capabilities with adaptive routing may noticeably increase network performance. In particular, network throughput improvement can be, on average, as high as 66%. © 2003 Elsevier B.V. All rights reserved.",
issn = "1383-7621",
journal = "Journal of Systems Architecture",
key = "Systems engineering",
keywords = "Algorithms;Communication;Information technology;Switches;Telecommunication networks;",
note = "Adaptive routing;",
number = "10-11",
pages = "441 - 456",
title = "{S}upporting adaptive routing in {IBA} switches",
url = "http://dx.doi.org/10.1016/S1383-7621(03)00103-6",
volume = 49,
year = 2003
}

J C Sancho, Juan Carlos Martinez, Antonio Robles, Pedro Lopez, Jose Flich and Jose Duato. Performance evaluation of COWS under real parallel applications. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International. 2003, 10 pp.. DOI BibTeX

@conference{1213371,
	author = "Sancho, J. C. and Martinez, Juan Carlos and Robles, Antonio and Lopez, Pedro and Flich, Jose and Duato, Jose",
	abstract = "Clusters of workstations (COWS) are often arranged as a switch-based network with irregular topology. Usually, the evaluation of interconnection networks for COWS has been carried out by simulation using synthetic traffic and by traces from real parallel applications. Although both types of traffics are used as a first approximation of the behavior of the system, a more accurate behavior can be obtained by using real parallel applications. In this paper, a new simulation framework has been developed in order to evaluate interconnection networks under real parallel applications by using an execution-driven simulator. Moreover, the new simulator can be used to evaluate the impact on the performance of the whole system of several design parameters in addition to the interconnection network. Evaluation results show that the execution time of real parallel applications can be reduced by using an effective routing algorithm. Moreover, in some cases, the achieved improvements are higher than the ones achieved by improving other design issues, such as the processor instruction issue rate, the cache size or the network bandwidth.",
	booktitle = "Parallel and Distributed Processing Symposium, 2003. Proceedings. International",
	doi = "10.1109/IPDPS.2003.1213371",
	issn = "1530-2075",
	keywords = "COWS; cache size; clusters of workstations; execution-driven simulator; interconnection networks; network bandwidth; performance evaluation; processor instruction issue rate; simulation framework; switch-based network; discrete event simulation; performa",
	month = "april 22-26",
	pages = "10 pp.",
	title = "{P}erformance evaluation of {COWS} under real parallel applications",
	year = 2003
}

Juan-Miguel Martinez Rubio, Pedro Lopez and Jose Duato. FC3D: Flow control-based distributed deadlock detection mechanism for true fully adaptive routing in wormhole networks. IEEE Transactions on Parallel and Distributed Systems 14(8):765 - 779, 2003. URL BibTeX

@article{2003407655842,
	author = "Martinez Rubio, Juan-Miguel and Lopez, Pedro and Duato, Jose",
	abstract = "Two general approaches have been proposed for deadlock handling in wormhole networks. Traditionally, deadlock avoidance strategies have been used. In this case, either routing is restricted so that there are no cyclic dependencies between channels or cyclic dependencies between channels are allowed provided that there are some escape paths to avoid deadlock. More recently, deadlock recovery strategies have begun to gain acceptance. These strategies allow the use of unrestricted fully adaptive routing, usually outperforming deadlock avoidance techniques. However, they require a deadlock detection mechanism and a deadlock recovery mechanism that is able to recover from deadlocks faster than they occur. In particular, progressive deadlock recovery techniques are very attractive because they allocate a few dedicated resources to quickly deliver deadlocked messages, instead of killing them. Unfortunately, distributed deadlock detection is usually based on crude time-outs, which detect many false deadlocks. As a consequence, messages detected as deadlocked may saturate the bandwidth offered by recovery resources, thus degrading performance. Additionally, the threshold required by the detection mechanism (the time-out) strongly depends on network load, which is not known in advance at the design stage. This limits the applicability of deadlock recovery on actual networks. In this paper, we propose a novel distributed deadlock detection mechanism that uses only local information, detects all the deadlocks, considerably reduces the probability of false deadlock detection over previously proposed techniques, and is not significantly affected by variations in message length and/or message destination distribution.",
	issn = "1045-9219",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Distributed computer systems",
	keywords = "Adaptive control systems;Command and control systems;Congestion control (communication);Data communication systems;Probability distributions;Requirements engineering;Resource allocation;",
	note = "Adaptive routing;Deadlock recovery;Distributed deadlock detection;Wormhole networks;",
	number = 8,
	pages = "765 - 779",
	title = "{FC}3{D}: {F}low control-based distributed deadlock detection mechanism for true fully adaptive routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/TPDS.2003.1225056",
	volume = 14,
	year = 2003
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Boosting the performance of Myrinet networks. Parallel and Distributed Systems, IEEE Transactions on 13(11):1166 - 1182, November 2002. URL, DOI BibTeX

@article{1058099,
	author = "Flich, Jose and Lopez, Pedro and Malumbres, M. P. and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. These networks allow the customer to connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Some of these networks use source routing and wormhole switching. In particular, we are interested in Myrinet networks because it is a well-known commercial product and its behavior can be controlled by the software running in network interfaces (Myrinet Control Program, MCP). Usually, the Myrinet network uses up*/down* routing for computing the paths for every source-destination pair. We propose the In-Transit Buffer (ITB) mechanism to improve network performance. We apply the ITB mechanism to NOWs with up*/down* source routing, like Myrinet, analyzing its behavior on both networks with regular and irregular topologies. The proposed scheme can be implemented on Myrinet networks by only modifying the MCP, without changing the network hardware. We evaluate by simulation several networks with different traffic patterns using timing parameters taken from the Myrinet network. Results show that the current routing schemes used in Myrinet networks can be strongly improved by applying the ITB mechanism. In general, our proposed scheme is able to double the network throughput on medium and large NOWs. Finally, we present a first implementation of the ITB mechanism on a Myrinet network.",
	doi = "10.1109/TPDS.2002.1058099",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "In-Transit Buffer; Myrinet network; irregular topologies; network interfaces; network performance boosting; network traffic; parallel computers; performance evaluation; scalability; simulation; throughput; up down source routing; workstation networks; wo",
	month = "nov",
	number = 11,
	pages = "1166 - 1182",
	title = "{B}oosting the performance of {M}yrinet networks",
	url = "http://dx.doi.org/10.1109/TPDS.2002.1058099",
	volume = 13,
	year = 2002
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Boosting the performance of Myrinet networks. Parallel and Distributed Systems, IEEE Transactions on 13(7):693 - 709, July 2002. URL, DOI BibTeX

@article{1019859,
	author = "Flich, Jose and Lopez, Pedro and Malumbres, M. P. and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. These networks allow the customer to connect processors using irregular topologies, providing the wiring flexibility, scalability and incremental expansion capability required in this environment. Some of these networks use source routing and wormhole switching. In particular, we are interested in Myrinet networks because they are a well-known commercial product and their behavior can be controlled by the software running on the network interfaces (the Myrinet Control Program, MCP). Usually, the Myrinet network uses up*/down* routing for computing the paths for every source-destination pair. In this paper, we propose an in-transit buffer (ITB) mechanism to improve the network performance. We apply the ITB mechanism to NOWs with up*/down* source routing, like the Myrinet, analyzing its behavior on networks with both regular and irregular topologies. The proposed scheme can be implemented on Myrinet networks by simply modifying the MCP, without changing the network hardware. We evaluate by simulation several networks with different traffic patterns using timing parameters taken from the Myrinet network. The results show that the current routing schemes used in Myrinet networks can be strongly improved by applying the ITB mechanism. In general, our proposed scheme is able to double the network throughput on medium and large NOWs. Finally, we present a first implementation of the ITB mechanism on a Myrinet network",
	doi = "10.1109/TPDS.2002.1019859",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "Myrinet Control Program;Myrinet network performance;in-transit buffer mechanism;incremental expansion capability;irregular topologies;minimal routing;network interfaces;network throughput;network traffic patterns;performance evaluation;regular topologies;",
	month = "jul",
	number = 7,
	pages = "693 - 709",
	title = "{B}oosting the performance of {M}yrinet networks",
	url = "http://dx.doi.org/10.1109/TPDS.2002.1019859",
	volume = 13,
	year = 2002
}

Elvira Baydal, Pedro Lopez and Jose Duato. Increasing the adaptivity of routing algorithms for k-ary n-cubes. In Parallel, Distributed and Network-based Processing, 2002. Proceedings. 10th Euromicro Workshop on. 2002, 455 - 462. URL, DOI BibTeX

@conference{994333,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "In this paper, we show that routing algorithms may exploit not only the flexibility obtained by crossing network dimensions in any order but also that obtained in the same network dimension, thanks to the availability of bidirectional channels. We analyze the behavior of adaptive routing algorithms both for deadlock avoidance and recovery, exploiting this increased routing flexibility, and compare them with previous proposals in order to evaluate the contribution of the additional routing freedom on network performance. Simulation results show that this simple improvement in the routing algorithm allows one to achieve throughput improvements of up to 45% in networks with low radix, for a uniform distribution of message destinations",
	booktitle = "Parallel, Distributed and Network-based Processing, 2002. Proceedings. 10th Euromicro Workshop on",
	doi = "10.1109/EMPDP.2002.994333",
	isbn = "0-7695-1444-8",
	keywords = "adaptive routing algorithms;additional routing freedom;algorithm adaptivity;bidirectional channels;deadlock avoidance;deadlock recovery;hypercube networks;k-ary n-cubes;network dimension crossing;network performance;network radix;routing flexibility;simul",
	pages = "455 - 462",
	title = "{I}ncreasing the adaptivity of routing algorithms for k-ary n-cubes",
	url = "http://dx.doi.org/10.1109/EMPDP.2002.994333",
	year = 2002
}

Elvira Baydal, Pedro Lopez and Jose Duato. Avoiding network congestion with local information. 2002, 35 - 48. URL BibTeX

@conference{20093412265277,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Congestion leads to a severe performance degradation in multiprocessor interconnection networks. Therefore, the use of techniques that prevent network saturation is of crucial importance. Some recent proposals use global network information, thus requiring that nodes exchange some control information, which consumes a far from negligible bandwidth. As a consequence, the behavior of these techniques in practice is not as good as expected. In this paper, we propose a mechanism that uses only local information to avoid network saturation. Each node estimates traffic locally by using the percentage of free virtual output channels that can be used to forward a message towards its destination. When this percentage is below a threshold value, network congestion is assumed to exist and message throttling is applied. The main contributions of the proposed mechanism are two: i) it is more selective than previous approaches, as it only prevents the injection of messages when they are destined to congested areas; and ii) it outperforms recent proposals that rely on global information. © 2002 Springer Berlin Heidelberg.",
	address = "Kansai Science City, Japan",
	issn = "0302-9743",
	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
	key = "Interconnection networks",
	keywords = "Computer science;Telecommunication networks;",
	note = "Control information;Global informations;Global network information;Local information;Message throttling;Multiprocessor interconnections;Network congestions;Network saturation;Performance degradation;Virtual output;",
	pages = "35 - 48",
	title = "{A}voiding network congestion with local information",
	url = "http://dx.doi.org/10.1007/3-540-47847-7_6",
	volume = 2327,
	year = 2002
}

Elvira Baydal, Pedro Lopez and Jose Duato. Increasing the adaptivity of routing algorithms for k-ary n-cubes. 2002, 455 - 462. URL BibTeX

@conference{7205121,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "In this paper, we show that routing algorithms may exploit not only the flexibility obtained by crossing network dimensions in any order but also that obtained in the same network dimension, thanks to the availability of bidirectional channels. We analyze the behavior of adaptive routing algorithms both for deadlock avoidance and recovery, exploiting this increased routing flexibility, and compare them with previous proposals in order to evaluate the contribution of the additional routing freedom on network performance. Simulation results show that this simple improvement in the routing algorithm allows one to achieve throughput improvements of up to 45% in networks with low radix, for a uniform distribution of message destinations",
	address = "Los Alamitos, CA, USA",
	booktitle = "Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing",
	keywords = "adaptive systems;concurrency control;hypercube networks;network routing;parallel algorithms;performance evaluation;system recovery;",
	note = "adaptive routing algorithms;algorithm adaptivity;k-ary n-cubes;hypercube networks;wormhole switching;routing flexibility;network dimension crossing;bidirectional channels;deadlock avoidance;deadlock recovery;additional routing freedom;network performance;simulation;throughput;network radix;uniform message destination distribution;",
	pages = "455 - 462",
	title = "{I}ncreasing the adaptivity of routing algorithms for k-ary n-cubes",
	url = "http://dx.doi.org/10.1109/EMPDP.2002.994333",
	year = 2002
}

Maria E Gomez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Evaluation of routing algorithms for InfiniBand networks. 2002, 775 - 780. BibTeX

@conference{7568237,
	author = "Gomez, Maria E. and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	abstract = "Storage area networks (SAN) provide the scalability required by the IT servers. The InfiniBand (IBA) interconnect is very likely to become the de facto standard for SAN as well as for NOW. The routing algorithm is a key design issue in irregular networks. Moreover, as several virtual lanes can be used and different network issues can be considered, the performance of the routing algorithms may be affected. In this paper we evaluate three existing routing algorithms (up*/down*, DFS, and smart-routing) suitable for being applied to IBA. Evaluation has been performed by simulation under different synthetic traffic patterns and I/O traces. Simulation results show that the smart-routing algorithm achieves the highest performance",
	address = "Berlin, Germany",
	booktitle = "Euro-Par 2002 Parallel Processing. 8th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol.2400)",
	keywords = "parallel algorithms;performance evaluation;telecommunication network routing;telecommunication standards;telecommunication traffic;workstation clusters;",
	note = "routing algorithms;InfiniBand networks;storage area networks;SAN;scalability;de facto standard;IBA interconnect;NOW;irregular networks;virtual lanes;performance;up*/down* routing;DFS routing;smart routing;synthetic traffic patterns;I/O traces;simulation;IT servers;",
	pages = "775 - 780",
	title = "{E}valuation of routing algorithms for {I}nfini{B}and networks",
	year = 2002
}

Elvira Baydal, Pedro Lopez and Jose Duato. Congestion control based on transmission times. 2002, 781 - 790. BibTeX

@conference{7568238,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Congestion leads to a severe performance degradation in multiprocessor interconnection networks. Therefore, the use of techniques that prevent network saturation is of crucial importance to avoid high execution times. We propose a new mechanism that uses only local information to avoid network saturation in wormhole networks. In order to detect congestion, each network node computes the quotient between the real transmission time of messages and its minimum theoretical value. If this ratio is greater than a threshold, the physical channel used by the message is considered congested. Depending on the number of congested channels, the available bandwidth to inject messages is reduced. The main contributions of the new mechanism are three: i) it can detect congestion in a remote way, but without transmitting control information through the network; ii) it tries to dynamically adjust the effective injection bandwidth available at each node; and iii) it is starvation-free. Evaluation results show that the proposed mechanism avoids network performance degradation for different network loads and topologies. Indeed, the mechanism does not introduce any penalty for low and medium network loads, where no congestion control mechanism is required",
	address = "Berlin, Germany",
	booktitle = "Euro-Par 2002 Parallel Processing. 8th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol.2400)",
	keywords = "multiprocessor interconnection networks;network routing;parallel architectures;parallel machines;performance evaluation;",
	note = "congestion control;transmission times;performance degradation;multiprocessor interconnection networks;network saturation;execution times;massively parallel computers;wormhole networks;bandwidth;starvation-free;network topologies;",
	pages = "781 - 790",
	title = "{C}ongestion control based on transmission times",
	year = 2002
}

Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Removing the latency overhead of the ITB mechanism in COWs with source routing. 2002, 463 - 470. URL BibTeX

@conference{7205122,
	author = "Flich, Jose and Malumbres, M. P. and Lopez, Pedro and Duato, Jose",
	abstract = "Clusters of workstations (COWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. The in-transit buffer (ITB) mechanism can improve network performance when applied to COWs with irregular topology and source routing. This mechanism considerably improves the performance of this kind of network when compared to current source routing algorithms; however, it introduces a latency penalty. An implementation of this mechanism was performed, showing that the latency overhead of the mechanism may be noticeable, especially for short messages and at low network loads. In this paper, we analyze in detail the latency overhead of ITBs, proposing several mechanisms to reduce, hide and remove it. Firstly, we show, by simulation, the effect of an ITB implementation that is much slower than the one implemented. Then we propose three mechanisms that try to overcome the latency penalty. All the mechanisms are simple and can be easily implemented; also, they are out of the critical path of the ITB packet-processing procedure. The results show very good behaviour of the proposed mechanisms, considerably reducing or even completely removing the latency overhead",
	address = "Los Alamitos, CA, USA",
	booktitle = "Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing",
	keywords = "buffer storage;delays;performance evaluation;telecommunication network routing;workstation clusters;",
	note = "latency overhead removal;in-transit buffer mechanism;workstation clusters;source routing;network performance;irregular network topology;short messages;network loads;simulation;latency penalty;critical path;packet processing procedure;",
	pages = "463 - 70",
	title = "{R}emoving the latency overhead of the {ITB} mechanism in {COW}s with source routing",
	url = "http://dx.doi.org/10.1109/EMPDP.2002.994334",
	year = 2002
}

Jose Flich, Pedro Lopez, J C Sancho, Antonio Robles and Jose Duato. Improving InfiniBand routing through multiple virtual networks. 2002, 49 - 63. BibTeX

@conference{7387421,
	author = "Flich, Jose and Lopez, Pedro and J.C. Sancho and Robles, Antonio and Duato, Jose",
	abstract = "InfiniBand is very likely to become the de facto standard for communication between nodes and I/O devices as well as for interprocessor communication. Often, the interconnection pattern is irregular. Up*/down* is the most popular routing scheme currently used in NOWs with irregular topologies. However, the main drawbacks of up*/down* routing are the unbalanced channel utilization and the difficulties to route most packets through minimal paths, which negatively affects network performance. Using additional virtual lanes can improve up*/down* routing performance by reducing the head-of-line blocking effect, but its use is not aimed to remove its main drawbacks. We propose a methodology that uses a reduced number of virtual lanes in an efficient way to achieve a better traffic balance and a higher number of minimal paths. This methodology is based on routing packets simultaneously through several properly selected up*/down* trees. To guarantee deadlock freedom, each up*/down* tree is built over a different virtual network. Simulation results show that the proposed methodology increases throughput up to an average factor ranging from 1.18 to 2.18 for 8, 16, and 32-switch networks by using only two virtual lanes. For larger networks with an additional virtual lane, network throughput is tripled, on average",
	address = "Berlin, Germany",
	booktitle = "High Performance Computing. 4th International Symposium, ISHPC 2002. Proceedings (Lecture Notes in Computer Science Vol.2327)",
	keywords = "multiplexing;multiprocessor interconnection networks;telecommunication network routing;workstation clusters;",
	note = "InfiniBand routing;networks of workstations;multiple virtual networks;interprocessor communication;NOWs;switch-based network;point-to-point links;up*/down* routing;head-of-line blocking effect;deadlock freedom;",
	pages = "49 - 63",
	title = "{I}mproving {I}nfini{B}and routing through multiple virtual networks",
	year = 2002
}

J C Sancho, Antonio Robles, Jose Flich, Pedro Lopez and Jose Duato. Effective methodology for deadlock-free minimal routing in InfiniBand networks. In Parallel Processing, 2002. Proceedings. International Conference on. 2002, 409 - 418. DOI BibTeX

@conference{1040897,
	author = "J.C. Sancho and Robles, Antonio and Flich, Jose and Lopez, Pedro and Duato, Jose",
	abstract = "The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links whose topology is arbitrarily established by the customer. We propose a simple and effective methodology for designing deadlock-free routing strategies that are able to route packets through minimal paths in InfiniBand networks. This methodology can meet the trade-off between network performance and the number of resources dedicated to deadlock avoidance. Evaluation results show that the resulting routing strategies significantly outperform up*/down* routing. In particular, throughput improvement ranges, on average, from 1.33 for small networks to 4.05 for large networks. Also, it is shown that just two virtual lanes and three service levels are enough to achieve more than 80% of the throughput improvement achieved by the best proposed routing strategy (the one that always provides minimal paths without limiting the number of resources).",
	booktitle = "Parallel Processing, 2002. Proceedings. International Conference on",
	doi = "10.1109/ICPP.2002.1040897",
	issn = "0190-3918",
	keywords = "InfiniBand architecture; InfiniBand networks; NOWs; deadlock-free minimal routing; interconnection pattern; minimal paths; network performance; packet routing; point-to-point links; service levels; switch-based network; throughput improvement; up*/down*",
	pages = "409 - 418",
	title = "{E}ffective methodology for deadlock-free minimal routing in {I}nfini{B}and networks",
	year = 2002
}

Jose Flich, Pedro Lopez, M Perez Malumbres and Jose Duato. Boosting the performance of Myrinet networks. IEEE Transactions on Parallel and Distributed Systems 13(7):693 - 709, 2002. URL BibTeX

@article{2002367073594,
	author = "Flich, Jose and Lopez, Pedro and M. Perez Malumbres and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. These networks allow the customer to connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Some of these networks use source routing and wormhole switching. In particular, we are interested in Myrinet networks because it is a well-known commercial product and its behavior can be controlled by the software running in network interfaces (Myrinet Control Program, MCP). Usually, the Myrinet network uses up*/down* routing for computing the paths for every source-destination pair. In this paper, we propose the In-Transit Buffer (ITB) mechanism to improve network performance. We apply the ITB mechanism to NOWs with up*/down* source routing, like Myrinet, analyzing its behavior on both networks with regular and irregular topologies. The proposed scheme can be implemented on Myrinet networks by only modifying the MCP, without changing the network hardware. We evaluate by simulation several networks with different traffic patterns using timing parameters taken from the Myrinet network. Results show that the current routing schemes used in Myrinet networks can be strongly improved by applying the ITB mechanism. In general, our proposed scheme is able to double the network throughput on medium and large NOWs. Finally, we present a first implementation of the ITB mechanism on a Myrinet network.",
	issn = "1045-9219",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Computer networks",
	keywords = "Buffer storage;Computer hardware;Computer simulation;Computer workstations;Interfaces;Parallel processing systems;Program processors;Routers;Telecommunication traffic;Topology;",
	note = "Myrinet networks;",
	number = 7,
	pages = "693 - 709",
	title = "{B}oosting the performance of {M}yrinet networks",
	url = "http://dx.doi.org/10.1109/TPDS.2002.1019859",
	volume = 13,
	year = 2002
}

J C Sancho, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Analyzing the influence of virtual lanes on the performance of InfiniBand networks. In Parallel and Distributed Processing Symposium., Proceedings International, IPDPS 2002, Abstracts and CD-ROM. 2002, 166 -175. BibTeX

@conference{1016568,
	author = "J.C. Sancho and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
	booktitle = "Parallel and Distributed Processing Symposium., Proceedings International, IPDPS 2002, Abstracts and CD-ROM",
	pages = "166 -175",
	title = "{A}nalyzing the influence of virtual lanes on the performance of {I}nfini{B}and networks",
	year = 2002
}

Juan Miguel Martínez, Pedro Lopez and Jose Duato. A cost-effective approach to deadlock handling in wormhole networks. Parallel and Distributed Systems, IEEE Transactions on 12(7):716 -729, July 2001. URL, DOI BibTeX

@article{940746,
	author = "Mart{\'i}nez, Juan Miguel and Lopez, Pedro and Duato, Jose",
	abstract = "Wormhole networks have traditionally used deadlock avoidance strategies. More recently, deadlock recovery strategies have begun to gain acceptance. In particular, progressive deadlock recovery techniques allocate a few dedicated resources to quickly deliver deadlocked packets. Deadlock recovery is based on the assumption that deadlocks are rare; otherwise, recovery techniques are not efficient. Measurements of deadlock occurrence frequency show that deadlocks are highly unlikely when enough routing freedom is provided. However, networks are more prone to deadlocks when the network is close to or beyond saturation, causing some network performance degradation. Similar performance degradation behavior at saturation was also observed in networks using deadlock avoidance strategies. In this paper, we take a different approach to handling deadlocks and performance degradation. We propose the use of an injection limitation mechanism that prevents performance degradation near the saturation point and, at the same time, reduces the probability of deadlock to negligible values. We also propose an improved deadlock detection mechanism that uses only local information, detects all deadlocks, and considerably reduces the probability of false deadlock detection over previous proposals. In the rare case when impending deadlock is detected, our proposal consists of using a simple recovery technique that absorbs the deadlocked message at the current node and later reinjects it for continued routing toward its destination. Performance evaluation results show that our new approach to handling deadlock is more efficient than previously proposed techniques",
	doi = "10.1109/71.940746",
	issn = "1045-9219",
	journal = "Parallel and Distributed Systems, IEEE Transactions on",
	keywords = "cost-effective approach;deadlock avoidance;deadlock handling;deadlock occurrence frequency;deadlock recovery;injection limitation mechanism;network performance degradation;performance degradation;performance evaluation;wormhole networks;concurrency contro",
	month = "jul",
	number = 7,
	pages = "716 -729",
	title = "{A} cost-effective approach to deadlock handling in wormhole networks",
	url = "http://dx.doi.org/10.1109/71.940746",
	volume = 12,
	year = 2001
}

Salvador Coll, Jose Flich, M P Malumbres, Pedro Lopez, Jose Duato and F J Mora. A first implementation of in-transit buffers on Myrinet GM software. In Parallel and Distributed Processing Symposium., Proceedings 15th International. April 2001, 1640 -1647. URL, DOI BibTeX

@conference{925150,
author = "Coll, Salvador and Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose and F.J. Mora",
abstract = "Clusters of workstations (COWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. In these systems, the interconnection network connects hosts using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Myrinet is the most popular network used to build COWs. It uses source routing with the up*/down* routing algorithm. In previous papers we proposed the In-Transit Buffer (ITB) mechanism that improves network performance by allowing minimal routing, balancing network traffic, and reducing network contention. The mechanism is based on ejecting packets at some intermediate hosts and later re-injecting them into the network. Moreover, the ITB mechanism does not require additional hardware as it can be implemented on the software running at Myrinet network adapters. In this paper, we present a first implementation of the ITB mechanism on Myrinet GM software. We show the changes required in packet format and the modifications performed in the Myrinet Control Program (MCP). In addition, both the overhead introduced by the new code and the cost of extracting and re-injecting packets are measured. Results show that, even for this simple implementation, code overhead is only about 125 ns per packet and the message latency increase for messages that use the ITB mechanism is around 1.3 $\mu$s per ITB. This is the first attempt to implement this mechanism, showing that a real implementation of ITBs is feasible on Myrinet COWs, and the associated overhead does not restrict the potential benefits of this mechanism.",
booktitle = "Parallel and Distributed Processing Symposium., Proceedings 15th International",
doi = "10.1109/IPDPS.2001.925150",
isbn = "0-7695-0990-8",
issn = "1530-2075",
month = "apr",
pages = "1640 -1647",
title = "{A} first implementation of in-transit buffers on {M}yrinet {GM} software",
url = "http://dx.doi.org/10.1109/IPDPS.2001.925150",
year = 2001
}

Jose Flich, Pedro Lopez, M P Malumbres, Jose Duato and T Rokicki. Improving network performance by reducing network contention in source-based COWs with a low path-computation overhead. In Parallel and Distributed Processing Symposium., Proceedings 15th International. April 2001, 8 pp. DOI BibTeX

@conference{925016,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose and T. Rokicki",
	abstract = "In previous papers, we have proposed the in-transit buffer mechanism (ITB) to improve network performance in COWs with irregular topology and source routing. This mechanism allows the use of minimal paths among all hosts, breaking cyclic dependences between channels by storing and later re-injecting packets at some intermediate hosts. However it also has two additional features that can improve even more network performance. First, the ITB mechanism reduces network contention because some messages are ejected from the network freeing network links. Second the ITB mechanism allows the use of any path between each source-destination pair improving traffic balance. In this paper we present a new routing algorithm that takes advantage of ITB by exploiting both issues: traffic balance and network contention reduction. The evaluation results show that network throughput can be considerably improved. On average, network throughput increases with respect to up*/down* by factors of 2.51 and 3.77 in 32 and 64-switch networks, respectively",
	booktitle = "Parallel and Distributed Processing Symposium., Proceedings 15th International",
	doi = "10.1109/IPDPS.2001.925016",
	keywords = "in-transit buffer mechanism;network contention;network performance;network throughput;source routing;source-based COWS;traffic balance;performance evaluation;workstation clusters;",
	month = "apr",
	pages = "8 pp.",
	title = "{I}mproving network performance by reducing network contention in source-based {COW}s with a low path-computation overhead",
	year = 2001
}

Elvira Baydal, Pedro Lopez and Jose Duato. A congestion control mechanism for wormhole networks. In Parallel and Distributed Processing, 2001. Proceedings. Ninth Euromicro Workshop on. 2001, 19 -26. URL, DOI BibTeX

@conference{904965,
author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
abstract = "Deadlock avoidance and recovery techniques suffer from severe performance degradation when the network is close to or beyond saturation. Many parallel applications produce bursty traffic that may saturate the network during some intervals, and increase execution time. Therefore, the use of techniques that prevent network saturation are of crucial importance in both deadlock avoidance and recovery strategies. Several mechanisms have been proposed in the literature to reach this goal. However some of them do not work well under all network load conditions. Others introduce some penalty when the network is not fully saturated, or complicate network and/or node implementation. In this paper we propose a new mechanism to avoid network saturation that overcomes these drawbacks. In this mechanism, each node estimates network traffic locally by using the percentage of free virtual output channels that can be used for forwarding a message towards its destination. When this number surpasses a threshold value, network congestion is assumed to exist and message injection is forbidden",
booktitle = "Parallel and Distributed Processing, 2001. Proceedings. Ninth Euromicro Workshop on",
doi = "10.1109/EMPDP.2001.904965",
keywords = "bursty traffic;congestion control mechanism;deadlock avoidance;deadlock recovery;free virtual output channels;message injection;network congestion;network load conditions;network saturation;network traffic;performance degradation;threshold value;wormhole",
pages = "19 -26",
title = "{A} congestion control mechanism for wormhole networks",
url = "http://dx.doi.org/10.1109/EMPDP.2001.904965",
year = 2001
}

Juan Miguel Martínez, Pedro Lopez and Jose Duato. A cost-effective approach to deadlock handling in wormhole networks. IEEE Transactions on Parallel and Distributed Systems 12(7):716 - 729, 2001. URL, DOI BibTeX

@article{2001376648866,
	author = "Mart{\'i}nez, Juan Miguel and Lopez, Pedro and Duato, Jose",
	abstract = "Wormhole networks have traditionally used deadlock avoidance strategies. More recently, deadlock recovery strategies have begun to gain acceptance. In particular, progressive deadlock recovery techniques allocate a few dedicated resources to quickly deliver deadlocked packets. Deadlock recovery is based on the assumption that deadlocks are rare; otherwise, recovery techniques are not efficient. Measurements of deadlock occurrence frequency show that deadlocks are highly unlikely when enough routing freedom is provided. However, networks are more prone to deadlocks when the network is close to or beyond saturation, causing some network performance degradation. Similar performance degradation behavior at saturation was also observed in networks using deadlock avoidance strategies. In this paper, we take a different approach to handling deadlocks and performance degradation. We propose the use of an injection limitation mechanism that prevents performance degradation near the saturation point and, at the same time, reduces the probability of deadlock to negligible values. We also propose an improved deadlock detection mechanism that uses only local information, detects all deadlocks, and considerably reduces the probability of false deadlock detection over previous proposals. In the rare case when impending deadlock is detected, our proposal consists of using a simple recovery technique that absorbs the deadlocked message at the current node and later reinjects it for continued routing toward its destination. Performance evaluation results show that our new approach to handling deadlock is more efficient than previously proposed techniques.",
	doi = "10.1109/71.940746",
	issn = "1045-9219",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	key = "Interconnection networks",
	keywords = "Communication channels;Computer system recovery;Multiprocessing programs;Packet networks;",
	note = "Wormhole networks;",
	number = 7,
	pages = "716 - 729",
	title = "{A} cost-effective approach to deadlock handling in wormhole networks",
	url = "http://dx.doi.org/10.1109/71.940746",
	volume = 12,
	year = 2001
}

Elvira Baydal, Pedro Lopez and Jose Duato. A congestion control mechanism for wormhole networks. 2001, 19 - 26. URL BibTeX

@conference{6867163,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Deadlock avoidance and recovery techniques suffer from severe performance degradation when the network is close to or beyond saturation. Many parallel applications produce bursty traffic that may saturate the network during some intervals, and increase execution time. Therefore, the use of techniques that prevent network saturation are of crucial importance in both deadlock avoidance and recovery strategies. Several mechanisms have been proposed in the literature to reach this goal. However some of them do not work well under all network load conditions. Others introduce some penalty when the network is not fully saturated, or complicate network and/or node implementation. In this paper we propose a new mechanism to avoid network saturation that overcomes these drawbacks. In this mechanism, each node estimates network traffic locally by using the percentage of free virtual output channels that can be used for forwarding a message towards its destination. When this number surpasses a threshold value, network congestion is assumed to exist and message injection is forbidden",
	address = "Los Alamitos, CA, USA",
	booktitle = "Proceedings Ninth Euromicro Workshop on Parallel and Distributed Processing",
	keywords = "multiprocessor interconnection networks;network routing;performance evaluation;system recovery;telecommunication congestion control;",
	note = "congestion control mechanism;wormhole networks;deadlock avoidance;deadlock recovery;performance degradation;bursty traffic;network saturation;network load conditions;network traffic;free virtual output channels;threshold value;network congestion;message injection;",
	pages = "19 - 26",
	title = "{A} congestion control mechanism for wormhole networks",
	url = "http://dx.doi.org/10.1109/EMPDP.2001.904965",
	year = 2001
}

Pedro Lopez, Jose Flich and Jose Duato. Deadlock-free routing in InfiniBandTM through destination renaming. In Parallel Processing, International Conference on, 2001. 2001, 427 - 434. DOI BibTeX

@conference{952089,
	author = "Lopez, Pedro and Flich, Jose and Duato, Jose",
	abstract = "The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links that supports any topology defined by the user including irregular ones, in order to provide flexibility and incremental expansion capability. Routing in IBA is distributed, based on forwarding tables, and only considers the packet destination ID for routing within subnets in order to drastically reduce forwarding table size. Unfortunately, the forwarding tables for most of the previously proposed routing algorithms for irregular topologies consider both the destination ID and the input channel. Therefore, these popular routing algorithms for irregular topologies may not be usable in InfiniBand networks because they do not conform to the IBA specifications. In this paper we propose an easy-to-implement strategy to adapt the forwarding tables already computed following any routing algorithm that considers the destination ID and the input channel into the required IBA forwarding table format. The resulting routing algorithm is deadlock-free on IBA. Indeed, the originally computed paths are not modified at all. Hence, the proposed strategy does not degrade performance with respect to the original routing scheme.",
	booktitle = "Parallel Processing, International Conference on, 2001.",
	doi = "10.1109/ICPP.2001.952089",
	keywords = "InfiniBand Architecture; deadlock-free; destination renaming; packet destination; routing algorithms; switch-based network; multiprocessor interconnection networks; network routing;",
	month = "sep",
	pages = "427 - 434",
	title = "{D}eadlock-free routing in {I}nfini{B}and{TM} through destination renaming",
	year = 2001
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Improving routing performance in Myrinet networks. In Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International. 2000, 27 -32. URL, DOI BibTeX

@conference{845961,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. In some of these networks, packets are delivered using source routing. Due to the irregular topology, the routing scheme is often non-minimal. In this paper we analyze the routing scheme used in Myrinet networks in order to improve its performance. We propose new routing algorithms that balance the utilization of the available routes and always use minimal paths. We show through simulation that the current routing schemes used in Myrinet networks can be improved by modifying only the routing software without increasing the software overhead significantly. The overall throughput can be doubled without modifying the network hardware",
	booktitle = "Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International",
	doi = "10.1109/IPDPS.2000.845961",
	keywords = "Myrinet networks;NOWs;networks of workstations;routing performance;routing scheme;network routing;workstation clusters;",
	pages = "27 -32",
	title = "{I}mproving routing performance in {M}yrinet networks",
	url = "http://dx.doi.org/10.1109/IPDPS.2000.845961",
	year = 2000
}

Elvira Baydal, Pedro Lopez and Jose Duato. A simple and efficient mechanism to prevent saturation in wormhole networks. In Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International. 2000, 617 -622. URL, DOI BibTeX

@conference{846043,
author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
abstract = "Both deadlock avoidance and recovery techniques suffer from severe performance degradation when the network is close to or beyond saturation. This performance degradation appears because messages block in the network faster than they are drained by the escape paths in the deadlock avoidance strategies or the deadlock recovery mechanism. Many parallel applications produce bursty traffic that may saturate the network during some intervals, significantly increasing execution time. Therefore, the use of techniques that prevent network saturation are of crucial importance. Although several mechanisms have been proposed in the literature to reach this goal, some of them introduce some penalty when the network is not fully saturated, require complex hardware to be implemented or do not behave well under all network load conditions. In this paper we propose a new mechanism to avoid network saturation that overcomes these drawbacks",
booktitle = "Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International",
doi = "10.1109/IPDPS.2000.846043",
keywords = "deadlock avoidance;deadlock recovery;network saturation;performance degradation;wormhole networks;computer networks;concurrency control;multiprocessor interconnection networks;",
pages = "617 -622",
title = "{A} simple and efficient mechanism to prevent saturation in wormhole networks",
url = "http://dx.doi.org/10.1109/IPDPS.2000.846043",
year = 2000
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Improving the performance of regular networks with source routing. In Parallel Processing, 2000. Proceedings. 2000 International Conference on. 2000, 353 -361. URL, DOI BibTeX

@conference{876151,
author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. In these machines, the network connects processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Also, when performance is the primary concern, these network products are being used to build large commodity clusters with regular topologies. In previous papers, we have proposed the in-transit buffer mechanism to improve network performance, applying it to NOWs with irregular topology and source routing. This mechanism allows the use of minimal paths among all hosts, breaking cyclic dependencies between channels by storing and later re-injecting packets at some intermediate hosts. In this paper we apply the in-transit buffer mechanism to regular networks with source routing in order to improve their performance. Also, two path selection policies are evaluated. The first one will always choose the same minimal path from source to destination, whereas the second one will choose from different alternative minimal paths in a round-robin fashion. The evaluation results show that the overall network throughput can be doubled for large networks",
booktitle = "Parallel Processing, 2000. Proceedings. 2000 International Conference on",
doi = "10.1109/ICPP.2000.876151",
keywords = "NOWs;networks of workstations;parallel computers;path selection policies;regular networks;round-robin;source routing;buffer storage;network routing;performance evaluation;workstation clusters;",
pages = "353 -361",
title = "{I}mproving the performance of regular networks with source routing",
url = "http://dx.doi.org/10.1109/ICPP.2000.876151",
year = 2000
}

Elvira Baydal, Pedro Lopez and Jose Duato. Simple and efficient mechanism to prevent saturation in wormhole networks. Proceedings of the International Parallel Processing Symposium, IPPS, pages 617 - 622, 2000. BibTeX

@article{2000265175264,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Both deadlock avoidance and recovery techniques suffer from severe performance degradation when the network is close to or beyond saturation. This performance degradation appears because messages block in the network faster than they are drained by the escape paths in the deadlock avoidance strategies or the deadlock recovery mechanism. Many parallel applications produce bursty traffic that may saturate the network during some intervals, significantly increasing execution time. Therefore, the use of techniques that prevent network saturation are of crucial importance. Although several mechanisms have been proposed in the literature to reach this goal, some of them introduce some penalty when the network is not fully saturated, require complex hardware to be implemented or do not behave well under all network load conditions. In this paper, we propose a new mechanism to avoid network saturation that overcomes these drawbacks.",
	address = "United States",
	issn = "1063-7133",
	journal = "Proceedings of the International Parallel Processing Symposium, IPPS",
	key = "Parallel processing systems",
	keywords = "Computer system recovery;Congestion control;Fault tolerant computer systems;Response time;Telecommunication traffic;",
	note = "Deadlock recovery methods;Wormhole networks;",
	pages = "617 - 622",
	title = "{S}imple and efficient mechanism to prevent saturation in wormhole networks",
	year = 2000
}

Juan Carlos Martinez, Federico Silla, Pedro Lopez and Jose Duato. On the influence of the selection function on the performance of networks of workstations. 2000, 292 - 9. BibTeX

@conference{6977556,
	author = "Martinez, Juan Carlos and Silla, Federico and Lopez, Pedro and Duato, Jose",
	abstract = "Previous research has pointed out the influence of adaptive routing on the performance improvement of interconnection networks for clusters of workstations. One of the design issues of adaptive routing algorithms is the selection function, which selects the output channel among all the available choices. We analyze in detail several selection functions in order to evaluate their influence on network performance. Simulation results show that network throughput may be increased up to 10%. When the network is close to saturation, improvements in latency up to 40% may be achieved",
	address = "Berlin, Germany",
	booktitle = "High Performance Computing. Third International Symposium, ISHPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1940)",
	keywords = "delays;multiprocessor interconnection networks;network routing;network topology;performance evaluation;workstation clusters;",
	note = "selection function;networks of workstations;interconnection networks;workstation clusters;adaptive routing algorithms;performance evaluation;network throughput;latency;",
	pages = "292 - 9",
	title = "{O}n the influence of the selection function on the performance of networks of workstations",
	year = 2000
}

Elvira Baydal, Pedro Lopez and Jose Duato. A simple and efficient mechanism to prevent saturation in wormhole networks. 2000, 617 - 22. URL BibTeX

@conference{6590366,
	author = "Baydal, Elvira and Lopez, Pedro and Duato, Jose",
	abstract = "Both deadlock avoidance and recovery techniques suffer from severe performance degradation when the network is close to or beyond saturation. This performance degradation appears because messages block in the network faster than they are drained by the escape paths in the deadlock avoidance strategies or the deadlock recovery mechanism. Many parallel applications produce bursty traffic that may saturate the network during some intervals, significantly increasing execution time. Therefore, the use of techniques that prevent network saturation is of crucial importance. Although several mechanisms have been proposed in the literature to reach this goal, some of them introduce some penalty when the network is not fully saturated, require complex hardware to be implemented or do not behave well under all network load conditions. In this paper we propose a new mechanism to avoid network saturation that overcomes these drawbacks",
	address = "Los Alamitos, CA, USA",
	booktitle = "Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000",
	keywords = "computer networks;concurrency control;multiprocessor interconnection networks;",
	note = "wormhole networks;deadlock avoidance;performance degradation;deadlock recovery;network saturation;",
	pages = "617 - 22",
	title = "{A} simple and efficient mechanism to prevent saturation in wormhole networks",
	url = "http://dx.doi.org/10.1109/IPDPS.2000.846043",
	year = 2000
}

Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Performance evaluation of a new routing strategy for irregular networks with source routing. 2000, 34 - 43. URL BibTeX

@conference{7144248,
author = "Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose",
abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. In some of these networks, messages are delivered using the up*/down* routing algorithm. However, the up*/down* routing scheme is often non-minimal. Also, some of these networks use source routing. With this technique, the entire path to destination is generated at the source host before the message is sent. In this paper we develop a new mechanism in order to improve the performance of irregular networks with source routing, increasing overall throughput. With this mechanism, messages always use minimal paths. To avoid possible deadlocks, when necessary, routes between a pair of hosts are divided into sub-routes, and a special kind of virtual cut-through is performed at some intermediate hosts. We evaluate the new mechanism by simulation using parameters taken from the Myrinet network. We show that the current routing schemes used in Myrinet can be improved by modifying only the routing software without increasing its overhead significantly and, most importantly, without modifying the network hardware. The benefits of using the new routing scheme are noticeable for networks with 16 or more switches, and increase with network size. For 32 and 64-switch networks, throughput is increased on average by a factor ranging from 1.3 to 3.3",
address = "New York, NY, USA",
booktitle = "Conference Proceedings of the 2000 International Conference on Supercomputing",
keywords = "multiprocessor interconnection networks;network routing;performance evaluation;",
note = "performance evaluation;routing strategy;irregular networks;source routing;networks of workstations;deadlocks;virtual cut-through;Myrinet network;routing software;wormhole switching;minimal routing;",
pages = "34 - 43",
title = "{P}erformance evaluation of a new routing strategy for irregular networks with source routing",
url = "http://dx.doi.org/10.1145/335231.335235",
year = 2000
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Improving the performance of regular networks with source routing. 2000, 353 - 61. URL BibTeX

@conference{6742420,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. In these machines, the network connects processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Also, when performance is the primary concern, these network products are being used to build large commodity clusters with regular topologies. In previous papers, we have proposed the in-transit buffer mechanism to improve network performance, applying it to NOWs with irregular topology and source routing. This mechanism allows the use of minimal paths among all hosts, breaking cyclic dependencies between channels by storing and later re-injecting packets at some intermediate hosts. In this paper we apply the in-transit buffer mechanism to regular networks with source routing in order to improve their performance. Also, two path selection policies are evaluated. The first one will always choose the same minimal path from source to destination, whereas the second one will choose from different alternative minimal paths in a round-robin fashion. The evaluation results show that the overall network throughput can be doubled for large networks",
	address = "Los Alamitos, CA, USA",
	booktitle = "Proceedings 2000 International Conference on Parallel Processing",
	keywords = "buffer storage;network routing;performance evaluation;workstation clusters;",
	note = "regular networks;source routing;networks of workstations;NOWs;parallel computers;path selection policies;round-robin;",
	pages = "353 - 61",
	title = "{I}mproving the performance of regular networks with source routing",
	url = "http://dx.doi.org/10.1109/ICPP.2000.876151",
	year = 2000
}

Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Improving routing performance in Myrinet networks. 2000, 27 - 32. URL BibTeX

@conference{6590291,
	author = "Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose",
	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. In some of these networks, packets are delivered using source routing. Due to the irregular topology, the routing scheme is often non-minimal. In this paper we analyze the routing scheme used in Myrinet networks in order to improve its performance. We propose new routing algorithms that balance the utilization of the available routes and always use minimal paths. We show through simulation that the current routing schemes used in Myrinet networks can be improved by modifying only the routing software without increasing the software overhead significantly. The overall throughput can be doubled without modifying the network hardware",
	address = "Los Alamitos, CA, USA",
	booktitle = "Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000",
	keywords = "network routing;workstation clusters;",
	note = "routing performance;Myrinet networks;NOWs;networks of workstations;routing scheme;",
	pages = "27 - 32",
	title = "{I}mproving routing performance in {M}yrinet networks",
	url = "http://dx.doi.org/10.1109/IPDPS.2000.845961",
	year = 2000
}

Jose Flich, Pedro Lopez, M P Malumbres, Jose Duato and T Rokicki. Combining in-transit buffers with optimized routing schemes to boost the performance of networks with source routing. 2000, 300 - 9. BibTeX

@conference{6977557,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose and T. Rokicki",
	abstract = "In previous papers we proposed the ITB mechanism to improve the performance of up*/down* routing in irregular networks with source routing. With this mechanism, both minimal routing and a better use of network links are guaranteed, resulting in an overall network performance improvement. In this paper, we show that the ITB mechanism can be used with any source routing scheme in the NOW environment. In particular, we apply ITB to DFS and Smart routing algorithms, which provide better routes than up*/down* routing. Results show that ITB strongly improves DFS (by 63%, for 64-switch networks) and Smart throughput (23%, for 32-switch networks)",
	address = "Berlin, Germany",
	booktitle = "High Performance Computing. Third International Symposium, ISHPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1940)",
	keywords = "buffer storage;network routing;performance evaluation;workstation clusters;",
	note = "in-transit buffers;optimized routing schemes;network performance;source routing;ITB mechanism;NOW;Smart routing algorithm;DFS routing algorithm;",
	pages = "300 - 9",
	title = "{C}ombining in-transit buffers with optimized routing schemes to boost the performance of networks with source routing",
	year = 2000
}

Pedro Lopez, Rosa Alcover, Jose Duato and L Zunica. Optimizing network throughput: optimal versus robust design. In Parallel and Distributed Processing, 1999. PDP '99. Proceedings of the Seventh Euromicro Workshop on. February 1999, 45 - 52. URL, DOI BibTeX

@conference{746644,
author = "Lopez, Pedro and Alcover, Rosa and Duato, Jose and L. Zunica",
abstract = "Interconnection network performance is usually measured in terms of its latency (time required to deliver a message) and throughput (maximum traffic accepted by the network). At first glance, minimizing average message latency is the main designer goal, because average network traffic is usually far from saturation. However, applications can also generate very high peak traffic. In order to deal with such situations, it is important that network throughput is also high. On the other hand, interconnection network performance depends on several parameters. Some of them can be chosen by the designer: routing algorithm, switching technique, topology and node design parameters. However, there are other parameters that cannot be selected by the designer. Among these, there are parameters that depend on the application, such as message size, message destination distribution and message traffic, as well as parameters defined by the customer, such as network size. The network designer can select the design parameters that maximize average performance (optimal design) or the design parameters that achieve a good performance under all the feasible combinations of the parameters that cannot be selected by him (robust design). Notice that both alternatives do not always lead to the same parameter configuration. Previously we chose the design parameters of a k-ary n-cube network to optimize latency. In this case, optimal and robust design lead to the same choice. In this paper, we obtain these design parameters to optimize network throughput. Unfortunately, there is a discrepancy between optimal and robust design criteria, the former being the best choice",
booktitle = "Parallel and Distributed Processing, 1999. PDP '99. Proceedings of the Seventh Euromicro Workshop on",
doi = "10.1109/EMPDP.1999.746644",
isbn = "0-7695-0059-5",
issn = "1066-6192",
keywords = "average message latency;average network traffic;interconnection network performance;latency;message destination distribution;network throughput optimisation;node design parameters;optimal design;parameter configuration;robust design;routing algorithm;swit",
month = "feb",
pages = "45 - 52",
title = "{O}ptimizing network throughput: optimal versus robust design",
url = "http://dx.doi.org/10.1109/EMPDP.1999.746644",
year = 1999
}

Pedro Lopez, Rosa Alcover, Jose Duato and L Zunica. Optimizing network throughput: optimal versus robust design. 1999, 45 - 52. URL BibTeX

@conference{6169182,
author = "Lopez, Pedro and Alcover, Rosa and Duato, Jose and L. Zunica",
abstract = "Interconnection network performance is usually measured in terms of its latency (time required to deliver a message) and throughput (maximum traffic accepted by the network). At first glance, minimizing average message latency is the main designer goal, because average network traffic is usually far from saturation. However, applications can also generate very high peak traffic. In order to deal with such situations, it is important that network throughput is also high. On the other hand, interconnection network performance depends on several parameters. Some of them can be chosen by the designer: routing algorithm, switching technique, topology and node design parameters. However, there are other parameters that cannot be selected by the designer. Among these, there are parameters that depend on the application, such as message size, message destination distribution and message traffic, as well as parameters defined by the customer, such as network size. The network designer can select the design parameters that maximize average performance (optimal design) or the design parameters that achieve a good performance under all the feasible combinations of the parameters that cannot be selected by him (robust design). Notice that both alternatives do not always lead to the same parameter configuration. Previously we chose the design parameters of a k-ary n-cube network to optimize latency. In this case, optimal and robust design lead to the same choice. In this paper, we obtain these design parameters to optimize network throughput. Unfortunately, there is a discrepancy between optimal and robust design criteria, the former being the best choice",
address = "Los Alamitos, CA, USA",
booktitle = "Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99",
keywords = "multiprocessor interconnection networks;performance evaluation;telecommunication network routing;",
note = "network throughput optimisation;robust design;optimal design;interconnection network performance;latency;average message latency;average network traffic;routing algorithm;switching technique;node design parameters;message destination distribution;parameter configuration;",
pages = "45 - 52",
title = "{O}ptimizing network throughput: optimal versus robust design",
url = "http://dx.doi.org/10.1109/EMPDP.1999.746644",
year = 1999
}

Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Performance evaluation of networks of workstations with hardware shared memory model using execution-driven simulation. In Parallel Processing, 1999. Proceedings. 1999 International Conference on. 1999, 146 - 153. DOI BibTeX

@conference{797399,
author = "Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose",
abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Similar to the evolution of parallel computers, NOWs are also evolving from distributed memory to shared memory programming model. However, physical distances between processors are longer in NOWs than in tightly-coupled distributed shared-memory multiprocessors (DSMs), leading to higher message latency and lower network bandwidth. Therefore, the network may be a bottleneck when executing some parallel applications in a NOW supporting a shared-memory programming paradigm. In this paper we analyze whether the interconnection network is able to efficiently handle the traffic generated in a NOW with the shared memory model. In particular, we are interested in analyzing the influence of the routing mechanism in the performance of the system. We evaluate the behavior of a NOW with irregular topology by means of an execution-driven simulator using SPLASH-2 applications as the input load. The results show that the routing algorithm can considerably reduce the total execution time of applications. In particular routing adaptivity can reduce the total execution time by 58% in some applications. These results confirm the behavior observed in previous works using synthetic traffic loads",
booktitle = "Parallel Processing, 1999. Proceedings. 1999 International Conference on",
doi = "10.1109/ICPP.1999.797399",
keywords = "SPLASH-2;distributed shared-memory multiprocessors;execution-driven simulation;execution-driven simulator;hardware shared memory model;incremental expansion capability;interconnection network;irregular topologies;message latency;networks of workstations;p",
pages = "146 - 153",
title = "{P}erformance evaluation of networks of workstations with hardware shared memory model using execution-driven simulation",
year = 1999
}

J M Martinez, Pedro Lopez and Jose Duato. Impact of buffer size on the efficiency of deadlock detection. 1999, 315 - 18. URL BibTeX

@conference{6169109,
author = "J.M. Martinez and Lopez, Pedro and Duato, Jose",
abstract = "Deadlock detection is one of the most important design issues in recovery strategies for routing in interconnection networks. In a previous paper, we presented an efficient deadlock detection mechanism. This mechanism requires that when a message header blocks it must be quickly notified to all the channels reserved by that message. To achieve this goal, the detection mechanism uses the information provided by flow control. Some recent commercial multiprocessors use deep buffers, since they may increase network throughput and efficiently allow transmission over long wires. However, deep buffers may increase the elapsed time between header blocking at a router and the propagation of flow control signals, thus negatively affecting the behavior of our deadlock detection mechanism. On the other hand, deeper buffers reduce deadlock frequency. As a consequence, buffer size has opposing effects on deadlock detection. In this paper, we analyze by simulation the influence of these effects on the efficiency of our deadlock detection mechanism, showing that overall performance improves with buffer size",
address = "Los Alamitos, CA, USA",
booktitle = "Proceedings Fifth International Symposium on High-Performance Computer Architecture",
keywords = "concurrency control;multiprocessor interconnection networks;",
note = "buffer size;deadlock detection;recovery strategies;interconnection networks routing;multiprocessors;deep buffers;simulation;",
pages = "315 - 18",
title = "{I}mpact of buffer size on the efficiency of deadlock detection",
url = "http://dx.doi.org/10.1109/HPCA.1999.744385",
year = 1999
}

Pedro Lopez, Juan Miguel Martínez and Jose Duato. DRIL: dynamically reduced message injection limitation mechanism for wormhole networks. In Parallel Processing, 1998. Proceedings. 1998 International Conference on. August 1998, 535 - 542. URL, DOI BibTeX

@conference{708527,
	author = "Lopez, Pedro and Mart{\'i}nez, Juan Miguel and Duato, Jose",
	abstract = "Deadlock avoidance and recovery techniques are alternatives to deal with the interconnection network deadlock problem. Both techniques allow fully adaptive routing on some set of resources while providing dedicated resources to escape from deadlock. They mainly differ in the way they supply escape paths and when those paths are used. As the escape paths only provide limited bandwidth to escape from deadlocks, both techniques suffer from severe performance degradation when the network is close to saturation. On the other hand, deadlock recovery is based on the assumption that deadlocks are rare. Several studies show that deadlocks are more likely when the network is close to or beyond saturation. In this paper we propose a new mechanism that prevents network saturation by dynamically adjusting message injection limitation into the network. As a consequence, this mechanism will avoid the performance degradation problem that typically occurs in both deadlock avoidance and recovery techniques, making fully adaptive routing feasible. Also, it will guarantee that the frequency of deadlock is really negligible, allowing the use of simple low-cost recovery strategies",
	booktitle = "Parallel Processing, 1998. Proceedings. 1998 International Conference on",
	doi = "10.1109/ICPP.1998.708527",
	isbn = "0-8186-8650-2",
	issn = "0190-3918",
	keywords = "DRIL;deadlock avoidance;interconnection network deadlock;message injection limitation;network saturation;performance degradation;recovery techniques;wormhole networks;concurrency control;multiprocessor interconnection networks;performance evaluation;syste",
	month = "aug",
	pages = "535 - 542",
	title = "{DRIL}: dynamically reduced message injection limitation mechanism for wormhole networks",
	url = "http://dx.doi.org/10.1109/ICPP.1998.708527",
	year = 1998
}

Pedro Lopez, Juan Miguel Martínez and Jose Duato. Very efficient distributed deadlock detection mechanism for wormhole networks. 1998, 57 - 66. BibTeX

@conference{1998534159795,
	author = "Lopez, Pedro and Mart{\'i}nez, Juan Miguel and Duato, Jose",
	abstract = "Networks using wormhole switching have traditionally relied upon deadlock avoidance strategies for the design of routing algorithms. More recently, deadlock recovery strategies have begun to gain acceptance. Progressive deadlock recovery techniques are very attractive because they allocate a few dedicated resources to quickly deliver deadlocked messages, instead of killing them. However, the distributed deadlock detection techniques proposed up to now detect many false deadlocks, especially when the network is heavily loaded and messages have different lengths. As a consequence, messages detected as deadlocked may saturate the bandwidth offered by recovery resources, thus degrading performance considerably. In this paper we propose an improved distributed deadlock detection mechanism that uses only local information, detects all the deadlocks, considerably reduces the probability of false deadlock detection and is not strongly affected by variations in message length and message destination distribution.",
	address = "Las Vegas, NV, USA",
	booktitle = "IEEE High-Performance Computer Architecture Symposium Proceedings",
	key = "Computer system recovery",
	keywords = "Algorithms;Bandwidth;Computer networks;Distributed computer systems;Error detection;",
	note = "Distributed deadlock detection mechanisms;Wormhole networks;",
	pages = "57 - 66",
	title = "{V}ery efficient distributed deadlock detection mechanism for wormhole networks",
	year = 1998
}

Pedro Lopez, J M Martinez and Jose Duato. A very efficient distributed deadlock detection mechanism for wormhole networks. 1998, 57 - 66. URL BibTeX

@conference{5842955,
	author = "Lopez, Pedro and J.M. Martinez and Duato, Jose",
	abstract = "Networks using wormhole switching have traditionally relied upon deadlock avoidance strategies for the design of routing algorithms. More recently, deadlock recovery strategies have begun to gain acceptance. Progressive deadlock recovery techniques are very attractive because they allocate a few dedicated resources to quickly deliver deadlocked messages, instead of killing them. However, the distributed deadlock detection techniques proposed up to now detect many false deadlocks, especially when the network is heavily loaded and messages have different lengths. As a consequence, messages detected as deadlocked may saturate the bandwidth offered by recovery resources, thus degrading performance considerably. In this paper we propose an improved distributed deadlock detection mechanism that uses only local information, detects all the deadlocks, considerably reduces the probability of false deadlock detection and is not strongly affected by variations in message length and message destination distribution",
	address = "Los Alamitos, CA, USA",
	booktitle = "Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture (Cat. No.98TB100224)",
	keywords = "multiprocessor interconnection networks;performance evaluation;system recovery;",
	note = "distributed deadlock detection mechanism;wormhole networks;wormhole switching;deadlock avoidance strategies;routing algorithms;deadlock recovery strategies;deadlock recovery techniques;performance degradation;local information;false deadlock detection;message length;message destination distribution;",
	pages = "57 - 66",
	title = "{A} very efficient distributed deadlock detection mechanism for wormhole networks",
	url = "http://dx.doi.org/10.1109/HPCA.1998.650546",
	year = 1998
}

Pedro Lopez, J M Martinez, Jose Duato and F Petrini. On the reduction of deadlock frequency by limiting message injection in wormhole networks. 1998, 295 - 307. BibTeX

@conference{5992388,
	author = "Lopez, Pedro and J.M. Martinez and Duato, Jose and F. Petrini",
	abstract = "Recently, deadlock recovery strategies have begun to gain acceptance in networks using wormhole switching. In particular, progressive deadlock recovery techniques are very attractive because they allocate a few dedicated resources to quickly deliver deadlocked packets, instead of killing them. Deadlock recovery is based on the assumption that deadlocks are really rare. Otherwise, recovery techniques are not efficient. We propose the use of a message injection limitation mechanism that reduces the probability of deadlock to negligible values, even when fully adaptive routing is used. The main new feature is that it can be used with different message destination distributions. The proposed mechanism can be combined with any deadlock detection mechanism. In particular, we use the deadlock detection mechanism proposed in Martinez (1997). In addition, the proposed injection limitation mechanism considerably reduces performance degradation when the network reaches the saturation point",
	address = "Berlin, Germany",
	booktitle = "Parallel Computer Routing and Communication. Second International Workshop, PCRCW'97. Proceedings",
	keywords = "multiprocessor interconnection networks;network routing;packet switching;performance evaluation;probability;resource allocation;system recovery;",
	note = "deadlock frequency;wormhole switching;progressive deadlock recovery;resource allocation;deadlocked packets;message injection limitation;probability;fully adaptive routing;message destination distributions;network performance;",
	pages = "295 - 307",
	title = "{O}n the reduction of deadlock frequency by limiting message injection in wormhole networks",
	year = 1998
}

Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Edinet: an execution driven interconnection network simulator for DSM systems. 1998, 336 - 9. BibTeX

@conference{6161583,
	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
	abstract = "Evaluation studies on interconnection networks for distributed memory multiprocessors usually assume synthetic or trace-driven workloads. However, when the final design choices must be made, a more precise evaluation study should be performed. In this paper, we describe a new execution-driven simulation tool to evaluate interconnection networks for distributed memory multiprocessors using real application workloads. As an example, we have developed an NCC-NUMA memory model and obtained some simulation results from the SPLASH-2 suite, using different network routing algorithms",
	address = "Berlin, Germany",
	booktitle = "Computer Performance Evaluation. Modelling Techniques and Tools. 10th International Conference, Tools'98. Proceedings",
	keywords = "discrete event simulation;distributed shared memory systems;multiprocessor interconnection networks;performance evaluation;",
	note = "Edinet;execution driven interconnection network simulator;distributed memory multiprocessors;trace-driven workloads;execution-driven simulation tool;NCC-NUMA memory model;simulation results;SPLASH-2 suite;network routing algorithms;",
	pages = "336 - 9",
	title = "{E}dinet: an execution driven interconnection network simulator for {DSM} systems",
	year = 1998
}

Pedro Lopez, J M Martinez and Jose Duato. DRIL: dynamically reduced message injection limitation mechanism for wormhole networks. 1998, 535 - 42. URL BibTeX

@conference{6034749,
author = "Lopez, Pedro and J.M. Martinez and Duato, Jose",
abstract = "Deadlock avoidance and recovery techniques are alternatives to deal with the interconnection network deadlock problem. Both techniques allow fully adaptive routing on some set of resources while providing dedicated resources to escape from deadlock. They mainly differ in the way they supply escape paths and when those paths are used. As the escape paths only provide limited bandwidth to escape from deadlocks, both techniques suffer from severe performance degradation when the network is close to saturation. On the other hand, deadlock recovery is based on the assumption that deadlocks are rare. Several studies show that deadlocks are more likely when the network is close to or beyond saturation. In this paper we propose a new mechanism that prevents network saturation by dynamically adjusting message injection limitation into the network. As a consequence, this mechanism will avoid the performance degradation problem that typically occurs in both deadlock avoidance and recovery techniques, making fully adaptive routing feasible. Also, it will guarantee that the frequency of deadlock is really negligible, allowing the use of simple low-cost recovery strategies",
address = "Los Alamitos, CA, USA",
booktitle = "Proceedings. 1998 International Conference on Parallel Processing (Cat. No.98EX205)",
keywords = "concurrency control;multiprocessor interconnection networks;performance evaluation;system recovery;",
note = "DRIL;wormhole networks;interconnection network deadlock;network saturation;message injection limitation;performance degradation;deadlock avoidance;recovery techniques;",
pages = "535 - 42",
title = "{DRIL}: dynamically reduced message injection limitation mechanism for wormhole networks",
url = "http://dx.doi.org/10.1109/ICPP.1998.708527",
year = 1998
}

Pedro Lopez, Rosa Alcover, Jose Duato and L Zunica. Cost-effective methodology for the evaluation of interconnection networks. Journal of Systems Architecture 44(9-10):815 - 830, 1998. URL BibTeX

@article{1998384306573,
	author = "Lopez, Pedro and Alcover, Rosa and Duato, Jose and L. Zunica",
	abstract = "Interconnection network performance depends on several parameters, including network design parameters, network size, message traffic and message length. Simulation is the methodology usually followed in evaluation studies, because the model can more faithfully represent hardware implementation, taking into account more details. Nevertheless, the number of parameter combinations is often very high, and simulations also take long to complete. Therefore, evaluation studies must choose a subset of the parameters and restrict the variability of each of them. In this paper, we propose a methodology for evaluating interconnection networks. It is based on experimental design used in statistical studies. Using this methodology, we can study network behavior considering many parameters, running only a subset of the simulations required to study all the combinations. In addition, the methodology makes it possible to quantify the effect of interactions among the parameters. We apply this methodology to adjust node design parameters such as number of virtual channels, input buffer size, and output buffer size for an 8-ary 3-cube with adaptive (both partially and fully) wormhole routing. We show that running only one third of the simulations required to study all the combinations, the most significant effects can be estimated without a noticeable loss in precision.",
	address = "Amsterdam, Netherlands",
	issn = 13837621,
	journal = "Journal of Systems Architecture",
	key = "Interconnection networks",
	keywords = "Buffer storage;Communication channels (information theory);Computer simulation;Cost effectiveness;Data communication systems;Statistical methods;Telecommunication traffic;",
	note = "Adaptive routing;Virtual channels;Wormhole routing;",
	number = "9-10",
	pages = "815--830",
	title = "{C}ost-effective methodology for the evaluation of interconnection networks",
	url = "http://dx.doi.org/10.1016/S1383-7621(97)00019-2",
	volume = 44,
	year = 1998
}

Juan Miguel Martínez, Pedro Lopez, Jose Duato and T M Pinkston. Software-based deadlock recovery technique for true fully adaptive routing in wormhole networks. In Parallel Processing, 1997., Proceedings of the 1997 International Conference on. August 1997, 182-189.

@conference{622586,
	author = "Mart{\'i}nez, Juan Miguel and Lopez, Pedro and Duato, Jose and T.M. Pinkston",
	abstract = "In this paper, we take a different approach to handle deadlocks and performance degradation. We propose the use of an injection limitation mechanism that prevents performance degradation near the saturation point and reduces the probability of deadlock to negligible values even when fully adaptive routing is used. We also propose an improved deadlock detection mechanism that only uses local information, detects all the deadlocks, and considerably reduces the probability of false deadlock detection over previous proposals. In the rare case when impending deadlock is detected, our proposed recovery technique absorbs the deadlocked message at the current node and later re-injects it for continued routing towards its destination. Performance evaluation results show that our new approach to deadlock handling is more efficient than previously proposed techniques",
	booktitle = "Parallel Processing, 1997., Proceedings of the 1997 International Conference on",
	doi = "10.1109/ICPP.1997.622586",
	keywords = "deadlock detection mechanism;deadlocked message;fully adaptive routing;injection limitation mechanism;performance degradation;performance evaluation;software-based deadlock recovery technique;true fully adaptive routing;wormhole networks;concurrency contr",
	month = "aug",
	pages = "182--189",
	title = "{S}oftware-based deadlock recovery technique for true fully adaptive routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/ICPP.1997.622586",
	year = 1997
}

Jose Duato, Pedro Lopez and S Yalamanchili. Deadlock- and livelock-free routing protocols for wave switching. In Parallel Processing Symposium, 1997. Proceedings., 11th International. April 1997, 570-577.

@conference{580958,
author = "Duato, Jose and Lopez, Pedro and S. Yalamanchili",
abstract = "Wave switching is a hybrid switching technique for high performance routers. It combines wormhole switching and circuit switching in the same router architecture. Wave switching achieves very high performance by exploiting communication locality. When two nodes are going to communicate frequently, a physical circuit is established between them. By combining circuit switching, pre-established physical circuits and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, also reducing latency for communications that use pre-established physical circuits. In this paper we propose two protocols for routers implementing wave switching. The first protocol handles the network as a cache of circuits, automatically establishing a circuit when two nodes are going to communicate. Subsequent communications use the previously established circuit. When a new circuit requests channels belonging to another circuit, a replacement algorithm selects the circuit to be torn down. The second protocol relies on the programmer and/or the compiler to decide when a circuit should be established or torn down for a set of messages. Also, we show that the proposed protocols are always able to deliver messages, and are deadlock- and livelock-free",
booktitle = "Parallel Processing Symposium, 1997. Proceedings., 11th International",
doi = "10.1109/IPPS.1997.580958",
keywords = "circuit switching;deadlock-free;high performance routers;livelock-free;protocol;routing protocols;wave switching;wormhole switching;circuit switching;concurrency control;multiprocessor interconnection networks;network routing;protocols;",
month = "apr",
pages = "570--577",
title = "{D}eadlock- and livelock-free routing protocols for wave switching",
url = "http://dx.doi.org/10.1109/IPPS.1997.580958",
year = 1997
}

Fabrizio Petrini, Jose Duato, Pedro Lopez and Juan Miguel Martínez. LIFE: A Limited Injection, Fully adaptivE, recovery-based routing algorithm. 1997, 316-321.

@conference{1998104020143,
	author = "Fabrizio Petrini and Duato, Jose and Lopez, Pedro and Mart{\'i}nez, Juan Miguel",
	abstract = "Networks using wormhole switching have traditionally relied upon deadlock avoidance strategies for the design of deadlock-free algorithms. The past few years have seen a rise in popularity of deadlock recovery strategies, that are based on the property that deadlocks are quite rare in practice and happen only at or beyond the network saturation point. In fact, recovery-based routing algorithms have a higher potential performance over the deadlock avoidance-based ones which allow less routing freedom. In this paper we present a recovery-based fully adaptive routing algorithm, LIFE, which is based on an innovative injection policy that reduces the probability of deadlocks to negligible values, both with uniform and non-uniform traffic patterns. The experimental results, conducted on a 8-ary 3-cube with 512 nodes, show that it is possible to implement true fully adaptive routing using only two virtual channels. Also, LIFE outperforms state-of-the-art avoidance- and recovery-based algorithms of the same cost, both in terms of throughput and message latency under uniform traffic and provides stable throughput under non-uniform traffic patterns.",
	address = "Bangalore, India",
	booktitle = "Proceedings of the International Conference on High Performance Computing, HiPC",
	key = "Computer system recovery",
	keywords = "Adaptive algorithms;Communication channels;Computer networks;Congestion control;Switching circuits;Telecommunication traffic;",
	note = "Deadlock free algorithms;Non uniform traffic patterns;",
	pages = "316--321",
	title = "{LIFE}: {A} {L}imited {I}njection, {F}ully adaptiv{E}, recovery-based routing algorithm",
	year = 1997
}

J M Martinez, Pedro Lopez, Jose Duato and T M Pinkston. Software-based deadlock recovery technique for true fully adaptive routing in wormhole networks. 1997, 182-189.

@conference{5698560,
	author = "J.M. Martinez and Lopez, Pedro and Duato, Jose and T.M. Pinkston",
	abstract = "In this paper, we take a different approach to handle deadlocks and performance degradation. We propose the use of an injection limitation mechanism that prevents performance degradation near the saturation point and reduces the probability of deadlock to negligible values even when fully adaptive routing is used. We also propose an improved deadlock detection mechanism that only uses local information, detects all the deadlocks, and considerably reduces the probability of false deadlock detection over previous proposals. In the rare case when impending deadlock is detected, our proposed recovery technique absorbs the deadlocked message at the current node and later re-injects it for continued routing towards its destination. Performance evaluation results show that our new approach to deadlock handling is more efficient than previously proposed techniques",
	address = "Los Alamitos, CA, USA",
	booktitle = "Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)",
	keywords = "concurrency control;hypercube networks;network routing;software performance evaluation;system recovery;",
	note = "software-based deadlock recovery technique;true fully adaptive routing;wormhole networks;performance degradation;injection limitation mechanism;fully adaptive routing;deadlock detection mechanism;deadlocked message;performance evaluation;",
	pages = "182--189",
	title = "{S}oftware-based deadlock recovery technique for true fully adaptive routing in wormhole networks",
	url = "http://dx.doi.org/10.1109/ICPP.1997.622586",
	year = 1997
}

Jose Duato, Pedro Lopez and S Yalamanchili. Deadlock- and livelock-free routing protocols for wave switching. 1997, 570-577.

@conference{5559828,
author = "Duato, Jose and Lopez, Pedro and S. Yalamanchili",
abstract = "Wave switching is a hybrid switching technique for high performance routers. It combines wormhole switching and circuit switching in the same router architecture. Wave switching achieves very high performance by exploiting communication locality. When two nodes are going to communicate frequently, a physical circuit is established between them. By combining circuit switching, pre-established physical circuits and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, also reducing latency for communications that use pre-established physical circuits. In this paper we propose two protocols for routers implementing wave switching. The first protocol handles the network as a cache of circuits, automatically establishing a circuit when two nodes are going to communicate. Subsequent communications use the previously established circuit. When a new circuit requests channels belonging to another circuit, a replacement algorithm selects the circuit to be torn down. The second protocol relies on the programmer and/or the compiler to decide when a circuit should be established or torn down for a set of messages. Also, we show that the proposed protocols are always able to deliver messages, and are deadlock- and livelock-free",
address = "Los Alamitos, CA, USA",
booktitle = "Proceedings. 11th International Parallel Processing Symposium (Cat. No.97TB100107)",
keywords = "circuit switching;concurrency control;multiprocessor interconnection networks;network routing;protocols;",
note = "wave switching;routing protocols;high performance routers;wormhole switching;circuit switching;protocol;livelock-free;deadlock-free;",
pages = "570--577",
title = "{D}eadlock- and livelock-free routing protocols for wave switching",
url = "http://dx.doi.org/10.1109/IPPS.1997.580958",
year = 1997
}

Rosa Alcover, Pedro Lopez, Jose Duato and L Zunica. A methodology for optimal interconnection network design. 1997, 81-84.

@conference{5863027,
	author = "Alcover, Rosa and Lopez, Pedro and Duato, Jose and L. Zunica",
	abstract = "Interconnection network performance depends on several parameters. Some of them can be chosen by the designer: routing algorithm, switching technique, topology and node design parameters. However, there are other parameters that cannot be selected by the designer. Among these, there are parameters that depend on the application, such as message size, message destination distribution and message traffic, as well as parameters defined by the customer, such as network size. The optimization criteria that the network designer should follow is not only maximizing performance, but also selecting the design parameters that achieve a good performance under all the feasible combinations of the parameters that cannot be selected by the designer. We propose a methodology for optimal network design based on robust experimental design techniques used in statistics. As an application, we choose the most important design parameters of a k-ary n-cube network based on that methodology",
	address = "Raleigh, NC, USA",
	booktitle = "Proceedings of the ISCA 10th International Conference on Parallel and Distributed Computing Systems",
	keywords = "design of experiments;message passing;multiprocessor interconnection networks;optimisation;parallel architectures;performance evaluation;",
	note = "optimal interconnection network design;interconnection network performance;routing algorithm;switching technique;topology;node design parameters;message size;message destination distribution;message traffic;network size;optimization criteria;experimental design techniques;statistics;k-ary n-cube network;",
	pages = "81--84",
	title = "{A} methodology for optimal interconnection network design",
	year = 1997
}

Rosa Alcover, Pedro Lopez, Jose Duato and L Zunica. Interconnection network design: a statistical analysis of interactions between factors. In Parallel and Distributed Processing, 1996. PDP '96. Proceedings of the Fourth Euromicro Workshop on. January 1996, 211-218.

@conference{500589,
author = "Alcover, Rosa and Lopez, Pedro and Duato, Jose and L. Zunica",
abstract = "Interconnection network performance depends on several parameters, including network design parameters, network size, message traffic and message length. Simulation is the methodology usually followed in evaluation studies, because the model can more faithfully represent hardware implementation, taking into account more details. Nevertheless, the number of parameter combinations is often high, and simulations also take long to complete. Therefore, evaluation studies must choose a subset of the parameters and restrict the variability of each of them. In a previous paper (IEEE Computer Soc. TCCA Newsletter, pp. 32-37, Aug. 1995), we have proposed a methodology for evaluating interconnection networks. It is based on experimental design used in statistical studies. Using this methodology, we can study network behavior considering many parameters, running only a subset of the simulations required to study all the combination. In addition, the methodology permits us to quantify the effect of interactions among the parameters. In this paper, we make use of the second advantage of this methodology, analysing the effect of node design parameters and their interactions for an 8-ary 3-cube with adaptive wormhole routing",
booktitle = "Parallel and Distributed Processing, 1996. PDP '96. Proceedings of the Fourth Euromicro Workshop on",
doi = "10.1109/EMPDP.1996.500589",
keywords = "8-ary 3-cube;adaptive wormhole routing;evaluation studies;interconnection network design;interconnection network performance;message length;message traffic;network behavior;network design parameters;network size;node design parameters;parameter combinatio",
month = "jan",
pages = "211--218",
title = "{I}nterconnection network design: a statistical analysis of interactions between factors",
url = "http://dx.doi.org/10.1109/EMPDP.1996.500589",
year = 1996
}

Rosa Alcover, Pedro Lopez, Jose Duato and L Zunica. Interconnection network design: a statistical analysis of interactions between factors. 1996, 211-218.

@conference{5242395,
	author = "Alcover, Rosa and Lopez, Pedro and Duato, Jose and L. Zunica",
	abstract = "Interconnection network performance depends on several parameters, including network design parameters, network size, message traffic and message length. Simulation is the methodology usually followed in evaluation studies, because the model can more faithfully represent hardware implementation, taking into account more details. Nevertheless, the number of parameter combinations is often high, and simulations also take long to complete. Therefore, evaluation studies must choose a subset of the parameters and restrict the variability of each of them. In a previous paper (IEEE Computer Soc. TCCA Newsletter, pp. 32-37, Aug. 1995), we have proposed a methodology for evaluating interconnection networks. It is based on experimental design used in statistical studies. Using this methodology, we can study network behavior considering many parameters, running only a subset of the simulations required to study all the combination. In addition, the methodology permits us to quantify the effect of interactions among the parameters. In this paper, we make use of the second advantage of this methodology, analysing the effect of node design parameters and their interactions for an 8-ary 3-cube with adaptive wormhole routing",
	address = "Los Alamitos, CA, USA",
	booktitle = "Proceedings of the Fourth Euromicro Workshop on Parallel and Distributed Processing - PDP '96",
	keywords = "design of experiments;multiprocessor interconnection networks;network routing;network synthesis;network topology;performance evaluation;statistical analysis;",
	note = "interconnection network design;statistical analysis;interconnection network performance;network design parameters;network size;message traffic;message length;simulations;parameter combinations;evaluation studies;parameter variability;network behavior;parameter interactions;node design parameters;8-ary 3-cube;adaptive wormhole routing;",
	pages = "211--218",
	title = "{I}nterconnection network design: a statistical analysis of interactions between factors",
	url = "http://dx.doi.org/10.1109/EMPDP.1996.500589",
	year = 1996
}

Jose Duato, Pedro Lopez, Federico Silla and S Yalamanchili. A high performance router architecture for interconnection networks. 1996, 61-68.

@conference{5376067,
	author = "Duato, Jose and Lopez, Pedro and Silla, Federico and S. Yalamanchili",
	abstract = "We propose a new router architecture that supports wormhole switching and circuit switching concurrently. This architecture has been designed to take advantage of temporal communication locality. This can be done by establishing a circuit between nodes that are going to communicate frequently. Messages using those circuits face no contention. By combining circuit switching, pre-established physical circuits and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, also reducing latency for communications that use pre-established physical circuits. This router architecture also allows to reduce the overhead of the software messaging layer in multicomputers by offering a better hardware support. Preliminary performance evaluation results show a drastic reduction in latency and increment in throughput when messages are long enough, even if circuits are established for a single transmission and locality is not exploited",
	address = "Los Alamitos, CA, USA",
	booktitle = "Proceedings of the 1996 International Conference on Parallel Processing. Vol.1 Architecture",
	keywords = "message passing;multiprocessor interconnection networks;parallel architectures;performance evaluation;",
	note = "high performance router architecture;interconnection networks;wormhole switching;circuit switching;temporal communication locality;router architecture;software messaging layer;performance evaluation;",
	pages = "61--68",
	title = "{A} high performance router architecture for interconnection networks",
	url = "http://dx.doi.org/10.1109/ICPP.1996.537144",
	volume = 1,
	year = 1996
}

Jose Duato and Pedro Lopez. Highly adaptive wormhole routing algorithms for n-dimensional torus. 1995, 87-104.

@conference{5513276,
author = "Duato, Jose and Lopez, Pedro",
abstract = "Deadlock avoidance is a key issue in wormhole networks. A first approach consists of removing the cyclic dependencies between channels. Many deterministic and adaptive routing algorithms have been proposed based on that approach. The absence of cyclic dependencies is a necessary and sufficient condition for deadlock-free deterministic routing. However, it can be relaxed for adaptive routing. A more powerful approach was proposed by us. It only requires the absence of cyclic dependencies on a connected channel subset. The remaining channels can be used in almost any way. In this paper, we show that there exists a more relaxed condition for deadlock-free adaptive routing. This condition is the key for the design of more powerful adaptive routing algorithms. We apply this condition to the design of adaptive routing algorithms for n-dimensional torus. In particular, we propose a partially adaptive routing algorithm which doubles the throughput achieved by the deterministic algorithm without increasing the hardware complexity significantly",
address = "New York, NY, USA",
booktitle = "Interconnection Networks and Mapping and Scheduling Parallel Computations. DIMACS Workshop",
keywords = "deterministic algorithms;multiprocessor interconnection networks;telecommunication network routing;",
note = "wormhole networks;n-dimensional torus;wormhole routing;deadlock avoidance;cyclic dependencies;deterministic routing;deterministic algorithm;",
pages = "87--104",
title = "{H}ighly adaptive wormhole routing algorithms for n-dimensional torus",
year = 1995
}

Pedro Lopez and Jose Duato. Deadlock-free fully-adaptive minimal routing algorithms: limitations and solutions. Computers and Artificial Intelligence 14(2):105-125, 1995.

@article{5024414,
	author = "Lopez, Pedro and Duato, Jose",
	abstract = "In previous papers, a theory for the design of deadlock-free adaptive routing algorithms as well as a design methodology have been proposed. In this paper, an adaptive routing algorithm, obtained from the application of this theory to the 3D-torus, is evaluated under different load conditions and compared with other algorithms. The results show that this algorithm is very fast, also increasing the network throughput considerably. Nevertheless, this adaptive algorithm has cycles in its channel dependency graph. Consequently, when the network is heavily loaded messages may temporarily block cyclically, drastically reducing the performance of the algorithm. Two mechanisms are proposed to avoid this problem",
	address = "Slovakia",
	issn = "0232-0274",
	journal = "Computers and Artificial Intelligence",
	keywords = "concurrency control;distributed algorithms;distributed memory systems;distributed processing;message passing;processor scheduling;",
	note = "deadlock-free fully-adaptive minimal routing algorithm;distributed memory computer;interconnection network;multiprocessor design;theory;3D-torus;three dimensional torus;network throughput;channel dependency graph;message passing;temporary block;",
	number = 2,
	pages = "105--125",
	title = "{D}eadlock-free fully-adaptive minimal routing algorithms: limitations and solutions",
	volume = 14,
	year = 1995
}

Jose Duato and Pedro Lopez. Performance evaluation of adaptive routing algorithms for k-ary n-cubes. Number 853, pages 45-45, 1994.

@inbook{1994122484814,
	author = "Duato, Jose and Lopez, Pedro",
	address = "Seattle, WA, United States",
	issn = 03029743,
	series = "Lecture Notes in Computer Science",
	number = 853,
	pages = "45 - 45",
	title = "{P}erformance evaluation of adaptive routing algorithms for k-ary n-cubes",
	year = 1994
}

Jose Duato and Pedro Lopez. Performance evaluation of adaptive routing algorithms for k-ary n-cubes. 1994, 45-59.

@conference{4897362,
	author = "Duato, Jose and Lopez, Pedro",
	abstract = "Deadlock avoidance is a key issue in wormhole networks. A first approach consists in removing the cyclic dependencies between channels. Although the absence of cyclic dependencies is a necessary and sufficient condition for deadlock-free deterministic routing, it is only a sufficient condition for deadlock-free adaptive routing. A more powerful approach only requires the absence of cyclic dependencies on a connected channel subset. Moreover, we proposed a necessary and sufficient condition for deadlock-free adaptive routing previously (1994). In this paper, we design adaptive routing algorithms for k-ary n-cubes. In particular, we propose partially adaptive and fully adaptive routing algorithms which considerably increase the throughput achieved by the deterministic routing algorithm. Also, we evaluate the performance of the new routing algorithms under both, uniform and non-uniform distribution of message destinations",
	address = "Berlin, Germany",
	booktitle = "Parallel Computer Routing and Communication. First International Workshop, PCRCW '94. Proceedings",
	keywords = "concurrency control;multiprocessor interconnection networks;performance evaluation;telecommunication network routing;",
	note = "performance evaluation;adaptive routing algorithms;k-ary n-cubes;deadlock avoidance;wormhole networks;cyclic dependencies;necessary and sufficient condition;connected channel subset;deterministic routing algorithm;routing algorithms;",
	pages = "45--59",
	title = "{P}erformance evaluation of adaptive routing algorithms for k-ary n-cubes",
	year = 1994
}

Pedro Lopez and Jose Duato. Deadlock-free adaptive routing algorithms for the 3D-torus: limitations and solutions. 1993, 684-687.

@conference{4585304,
	author = "Lopez, Pedro and Duato, Jose",
	abstract = "A deadlock-free adaptive routing algorithm, obtained from the application of the theory proposed by J. Duato (1991) to the 3D-torus, is evaluated under different load conditions and compared with other algorithms. The results show that this algorithm is very fast, also increasing the network throughput considerably. Nevertheless, this adaptive algorithm has cycles in its channel dependency graph. As a consequence, when the network is heavily loaded messages may temporarily block cyclically, drastically reducing the performance of the algorithm. Two mechanisms are proposed to avoid this problem",
	address = "Berlin, Germany",
	booktitle = "PARLE '93 Parallel Architectures and Languages Europe. 5th International PARLE Conference Proceedings",
	keywords = "multiprocessor interconnection networks;performance evaluation;",
	note = "deadlock-free adaptive routing algorithms;3D-torus;network throughput;channel dependency graph;",
	pages = "684--687",
	title = "{D}eadlock-free adaptive routing algorithms for the 3{D}-torus: limitations and solutions",
	year = 1993
}

Thesis

Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors. Julio Sahuquillo, Pedro López (Processor Architecture)

Low-Memory Techniques for Routing and Fault-Tolerance on the Fat-Tree Topology. Maria E. Gómez, Pedro López (Routing Algorithms)

Improvement of interconnection networks for clusters: direct-indirect hybrid topology and HoL-blocking reduction routing. Pedro López, Maria E. Gómez (High Performance Clusters)