Salvador Petit

Contact

Position:: Associate Professor

Address:: Valencia
Phone:: +34963877007x85709
Website:: http://www.disca.upv.es/spetit

Image & Curriculum Vitae

Salvador Petit received his Ph.D. degree in computer engineering from the Universidad Politécnica de Valencia. Currently, he is an Associate Professor in the Computer Engineering Department at the UPV. His research topics include multithreaded and multicore processors, as well as memory hierarchy design and real-time systems. He is a member of the IEEE Computer Society.

Publications

Carlos Navarro, Josué Feliu, Salvador Petit, Maria E Gomez and Julio Sahuquillo. Bandwidth-Aware Dynamic Prefetch Configuration for IBM POWER8. IEEE Transactions on Parallel and Distributed Systems PP (99), 2020. BibTeX

@article{10.1109/TPDS.2020.2982392,
	author = "Navarro, Carlos and Feliu, Josu{\'e} and Petit, Salvador and Gomez, Maria E. and Sahuquillo, Julio",
	abstract = "Advanced hardware prefetch engines are being integrated in current high-performance processors. Prefetching can boost the performance of most applications; however, the induced bandwidth consumption can lead the system to a high contention for main memory bandwidth, which is a scarce resource in current multicores. In such a case, the system performance can be severely damaged. This work characterizes the applications’ behavior in an IBM POWER8 machine, which presents many prefetch settings, varying the bandwidth contention degree. The study reveals that the best prefetch setting for each application depends on the main memory bandwidth availability, that is, it depends on the co-running applications. Based on this study, we propose Bandwidth-Aware Prefetch Configuration (BAPC), a scalable adaptive prefetching algorithm that improves the performance of multi-program workloads. BAPC increases the performance of the applications by 8%, 11%, and 12% for workload mixes composed of 6, 8, and 10 applications over the IBM POWER8 default configuration. In addition to performance, BAPC reduces bandwidth consumption by 39%, 42%, and 45%, respectively.",
	journal = "IEEE Transactions on Parallel and Distributed Systems PP",
	number = 99,
	title = "{B}andwidth-{A}ware {D}ynamic {P}refetch {C}onfiguration for {IBM} {POWER}8",
	year = 2020
}

Jose Puche, Salvador Petit, Maria E Gomez and Julio Sahuquillo. An efficient cache flat storage organization for multithreaded workloads for low power processors. Future Generation Computer Systems, 2019. BibTeX

@article{10.1016/j.future.2019.11.024,
	author = "Puche, Jose and Petit, Salvador and Gomez, Maria E. and Sahuquillo, Julio",
	journal = "Future Generation Computer Systems",
	title = "{A}n efficient cache flat storage organization for multithreaded workloads for low power processors",
	year = 2019
}

Francisco Candel, Alejandro Valero, Salvador Petit and Julio Sahuquillo. Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance. IEEE Transactions on Computers 68(10):1442-1454, 2019. BibTeX

@article{10.1109/TC.2019.2907591,
	author = "Candel, Francisco and Valero, Alejandro and Petit, Salvador and Sahuquillo, Julio",
	abstract = "To support the massive amount of memory accesses that GPGPU applications generate, GPU memory hierarchies are becoming more and more complex, and the Last Level Cache (LLC) size considerably increases each GPU generation. This paper shows that, counter-intuitively, enlarging the LLC brings marginal performance gains in most applications. In other words, increasing the LLC size scales neither in performance nor in energy consumption. We examine how LLC misses are managed in typical GPUs, and we find that in most cases the way LLC misses are managed is precisely the main performance limiter. This paper proposes a novel approach that addresses this shortcoming by leveraging a tiny additional Fetch and Replacement Cache-like structure (FRC) that stores control and coherence information of the incoming blocks until they are fetched from main memory. Then, the fetched blocks are swapped with the victim blocks (i.e., selected to be replaced) in the LLC, and the eviction of such victim blocks is performed from the FRC. This approach improves performance due to three main reasons: i) the lifetime of blocks being replaced is enlarged, ii) the main memory path is unclogged on long bursts of LLC misses, and iii) the average LLC miss latency is reduced. The proposal improves the LLC hit ratio and memory-level parallelism, and reduces the miss latency compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and with much less area requirements. Experimental results show that the proposed FRC cache scales in performance with the number of GPU compute units and the LLC size, since, depending on the FRC size, performance improves ranging from 30% to 67% for a modern baseline GPU card, and from 32% to 118% for a larger GPU. In addition, energy consumption is reduced on average from 49% to 57% for the larger GPU. These benefits come with a small area increase (by 7.3%) over the LLC baseline.",
	journal = "IEEE Transactions on Computers",
	number = 10,
	pages = "1442-1454",
	title = "{E}fficient {M}anagement of {C}ache {A}ccesses to {B}oost {GPGPU} {M}emory {S}ubsystem {P}erformance",
	volume = 68,
	year = 2019
}

Josué Feliu, Salvador Petit and Julio Sahuquillo. Thread Isolation to Improve Symbiotic Scheduling on SMT Multicore Processors. IEEE Transactions on Parallel and Distributed Systems PP (99), 2019. BibTeX

@article{10.1109/TPDS.2019.2934955,
	author = "Feliu, Josu{\'e} and Petit, Salvador and Sahuquillo, Julio",
	abstract = "Resource sharing is a critical issue in simultaneous multithreading (SMT) processors as threads running simultaneously on an SMT core compete for shared resources. Symbiotic job scheduling, which co-schedules applications with complementary resource demands, is an effective solution to maximize hardware utilization and improve overall system performance. However, symbiotic job scheduling typically distributes threads evenly among cores, i.e., all cores get assigned the same number of threads, which we find to lead to sub-optimal performance. In this paper, we show that asymmetric schedules (i.e., schedules that assign a different number of threads to each SMT core) can significantly improve performance compared to symmetric schedules. To leverage this finding, we propose thread isolation, a technique that turns symmetric schedules into asymmetric ones yielding higher overall system performance. Thread isolation identifies SMT-adverse applications and schedules them in isolation on a dedicated core to mitigate their sharp performance degradation under SMT. Our experimental results on an IBM POWER8 processor show that thread isolation improves system throughput by up to 5.5% compared to a state-of-the-art symmetric symbiotic job scheduler.",
	journal = "IEEE Transactions on Parallel and Distributed Systems PP",
	number = 99,
	title = "{T}hread {I}solation to {I}mprove {S}ymbiotic {S}cheduling on {SMT} {M}ulticore {P}rocessors",
	year = 2019
}

Jose Puche, Salvador Petit, Maria E Gomez and Julio Sahuquillo. FOS: a low-power cache organization for multicores. The Journal of Supercomputing 3s(75):1-32, 2019. BibTeX

@article{10.1007/s11227-019-02858-x,
	author = "Puche, Jose and Petit, Salvador and Gomez, Maria E. and Sahuquillo, Julio",
	abstract = "The cache hierarchy of current multicore processors typically consists of one or two levels of private caches per core and a large shared last-level cache. This approach incurs area and energy wasting due to oversizing the private cache space, data replication through the inclusive cache levels, as well as the use of highly set-associative caches. In this paper, we claim that although this is the commonly adopted approach, it presents important design issues that can be addressed by a more energy efficient organization. This work proposes Flat On-chip Storage (FOS), a novel cache organization that, aimed at addressing energy and area on low-power processors, resolves the mentioned issues. For this purpose, FOS combines L2 and L3 cache levels into a single one, organized as a flat space, and composed of a pool of private small cache slices. These slices are initially powered off to save energy, and they are powered on and assigned to cores provided that the system performance is expected to improve. To provide fast and uniform access from the private L1 caches to the FOS’s cache slices, multiple architectural challenges are overcome, which entails the design of a custom optical network-on-chip. Experimental results show that FOS achieves significant energy savings on both static and dynamic energy over conventional cache organizations with the same storage capacity. FOS static energy savings are as much as 60% over an electrically connected shared cache; these savings grow up to 75% compared to optically connected baselines. Moreover, despite deactivating part of the cache space, FOS achieves similar performance values as those achieved by conventional approaches.",
	journal = "The Journal of Supercomputing",
	number = 75,
	pages = "1-32",
	title = "{FOS}: a low-power cache organization for multicores",
	volume = "3s",
	year = 2019
}

Francisco Candel, Alejandro Valero, Salvador Petit and Julio Sahuquillo. An Aging-Aware GPU Register File Design Based on Data Redundancy. IEEE Transactions on Computers 68(1):4-20, 2019. BibTeX

@article{10.1109/TC.2018.2849376,
	author = "Candel, Francisco and Valero, Alejandro and Petit, Salvador and Sahuquillo, Julio",
	abstract = "Nowadays, GPUs sit at the forefront of high-performance computing thanks to their massive computational capabilities. Internally, thousands of functional units, architected to be fed by large register files, fuel such a performance. At deep nanometer technologies, the SRAM memory cells that implement GPU register files are very sensitive to the Negative Bias Temperature Instability (NBTI) effect. NBTI ages cell transistors by degrading their threshold voltage $V_{th}$ over the lifetime of the GPU. This degradation, which manifests when a cell keeps the same logic value for a relatively long period of time, compromises the cell read stability and increases the transistor switching delay, which can lead to wrong read values and eventually exceed the processor cycle time, respectively, so resulting in faulty operation. This work proposes architectural mechanisms leveraging the redundancy of the data stored in GPU register files to attack NBTI aging. The proposed mechanisms are based on data compression, power gating, and register address rotation techniques. All these mechanisms working together balance the distribution of logic values stored in the cells along the execution time, reducing both the overall $V_{th}$ degradation and the increase in the transistor switching delays. Experimental results show that a conventional GPU register file suffers the worst case for NBTI, since a significant fraction of the cells maintain the same logic value during the entire application execution (i.e., a 100 percent ‘0’ and ‘1’ duty cycle distributions). On average, the proposal reduces these distributions by 58 and 68 percent, respectively, which translates into $V_{th}$ degradation savings by 54 and 62 percent, respectively.",
	journal = "IEEE Transactions on Computers",
	number = 1,
	pages = "4-20",
	title = "{A}n {A}ging-{A}ware {GPU} {R}egister {F}ile {D}esign {B}ased on {D}ata {R}edundancy",
	volume = 68,
	year = 2019
}

Lucía Pons, Vicent Selfa, Salvador Petit and Julio Sahuquillo. Improving System Turnaround Time with Intel CAT by Identifying LLC Critical Applications. Euro-Par 2018: Parallel Processing, pages 603-615, 2018. BibTeX

@article{10.1007/978-3-319-96983-1_43,
	author = "Pons, Luc{\'i}a and Selfa, Vicent and Petit, Salvador and Sahuquillo, Julio",
	journal = "Euro-Par 2018: Parallel Processing",
	pages = "603-615",
	title = "{I}mproving {S}ystem {T}urnaround {T}ime with {I}ntel {CAT} by {I}dentifying {LLC} {C}ritical {A}pplications",
	year = 2018
}

Jose Duro, Salvador Petit and Julio Sahuquillo. Modeling and analysis of the performance of exascale photonic networks. Concurrency and Computation Practice and Experience (31), 2018. BibTeX

@article{10.1002/cpe.4773,
	author = "Duro, Jose and Petit, Salvador and Sahuquillo, Julio",
	abstract = "Photonics technology has become a promising and viable alternative for both on‐chip and off‐chip interconnection networks of future Exascale systems. Nevertheless, this technology is not mature enough yet in this context, so research efforts focusing on photonic networks are still required to achieve realistic suitable network implementations. In this regard, system‐level photonic network simulators can help guide designers to assess the multiple design choices. Most current research is done on electrical network simulators, whose components work widely different from photonics components. In this work, we summarize and compare the working behavior of both technologies which includes the use of optical routers, wavelength‐division multiplexing and circuit switching among others. After implementing them into a well‐known simulation framework, an extensive simulation study has been carried out using realistic photonic network configurations with synthetic and realistic traffic. Experimental results show that, compared to electrical networks, optical networks can reduce the execution time of the studied real workloads in almost one order of magnitude. Our study also reveals that the photonic configuration highly impacts on the network performance, being the bandwidth per channel and the message length the most important parameters.",
	journal = "Concurrency and Computation Practice and Experience",
	number = 31,
	title = "{M}odeling and analysis of the performance of exascale photonic networks",
	year = 2018
}

Francisco Candel, Salvador Petit, Alejandro Valero and Julio Sahuquillo. Improving GPU Cache Hierarchy Performance with a Fetch and Replacement Cache. The 24th International European Conference on Parallel and Distributed Computing, 2018. BibTeX

@article{GPU,
	author = "Candel, Francisco and Petit, Salvador and Valero, Alejandro and Sahuquillo, Julio",
	abstract = "In the last few years, GPGPU computing has become one of the most popular computing paradigms in high-performance computers due to its excellent performance to power ratio. The memory requirements of GPGPU applications widely differ from the requirements of CPU counterparts. The amount of memory accesses is several orders of magnitude higher in GPU applications than in CPU applications, and they present disparate access patterns. Because of this fact, large and highly associative Last-Level Caches (LLCs) bring much lower performance gains in GPUs than in CPUs. This paper presents a novel approach to manage LLC misses that efficiently improves LLC hit ratio, memory-level parallelism, and miss latencies in GPU systems. The proposed approach leverages a small additional Fetch and Replacement Cache (FRC) that stores control and coherence information of incoming blocks until they are fetched from main memory. Then, fetched blocks are swapped with victim blocks to be replaced in the LLC. After that, the eviction of victim blocks is performed from the FRC. This management approach improves performance due to three main reasons: i) the lifetime of blocks being replaced is increased, ii) the main memory path is unclogged on long bursts of LLC misses, and iii) the average L2 miss delaying latency is reduced. Experimental results show that our proposal increases the performance (OPC) over 25% in most of the studied applications, reaching improvements up to 400% in some applications.",
	journal = "The 24th International European Conference on Parallel and Distributed Computing",
	title = "{I}mproving {GPU} {C}ache {H}ierarchy {P}erformance with a {F}etch and {R}eplacement {C}ache",
	year = 2018
}

Clara Furió, Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duro. A Workload Generator for Evaluating SMT Real-Time Systems. 2018 International Conference on High Performance Computing & Simulation (HPCS), 2018. BibTeX

@article{10.1109/HPCS.2018.00067,
	author = "Furi{\'o}, Clara and Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duro, Jose",
	journal = "2018 International Conference on High Performance Computing {\&} Simulation (HPCS)",
	title = "{A} {W}orkload {G}enerator for {E}valuating {SMT} {R}eal-{T}ime {S}ystems",
	year = 2018
}

Jose Duro, Salvador Petit, Julio Sahuquillo and Maria E Gomez. Workload Characterization for Exascale Computing Networks. 2018 International Conference on High Performance Computing & Simulation (HPCS), 2018. BibTeX

@article{10.1109/HPCS.2018.00069,
	author = "Duro, Jose and Petit, Salvador and Sahuquillo, Julio and Gomez, Maria E.",
	journal = "2018 International Conference on High Performance Computing {\&} Simulation (HPCS)",
	title = "{W}orkload {C}haracterization for {E}xascale {C}omputing {N}etworks",
	year = 2018
}

Vicent Selfa, Julio Sahuquillo, Maria E Gomez and Crispín Gomez. Efficient selective multicore prefetching under limited memory bandwidth. Journal of Parallel and Distributed Computing (120), 2018. BibTeX

@article{10.1016/j.jpdc.2018.05.002,
	author = "Selfa, Vicent and Sahuquillo, Julio and Gomez, Maria E. and Gomez, Crisp{\'i}n",
	abstract = "Current multicore systems implement multiple hardware prefetchers to tolerate long main memory latencies. However, memory bandwidth is a scarce shared resource which becomes critical with the increasing core count. To deal with this fact, recent works have focused on adaptive prefetchers, which control the prefetcher aggressiveness to regulate the main memory bandwidth consumption. Nevertheless, in limited bandwidth machines or under memory-hungry workloads, keeping active the prefetcher can damage the system performance and increase energy consumption. This paper introduces selective prefetching, where individual prefetchers are activated or deactivated to improve both main memory energy and performance, and proposes ADP, a prefetcher that deactivates local prefetchers in some cores when they present low performance and co-runners need additional bandwidth. Based on heuristics, an individual prefetcher is reactivated when performance enhancements are foreseen. Compared to a state-of-the-art adaptive prefetcher, ADP provides both performance and energy enhancements in limited memory bandwidth.",
	journal = "Journal of Parallel and Distributed Computing",
	number = 120,
	title = "{E}fficient selective multicore prefetching under limited memory bandwidth",
	year = 2018
}

Francisco Candel, Julio Sahuquillo, Salvador Petit and Alejandro Valero. Improving GPU Cache Hierarchy Performance with a Fetch and Replacement Cache. 24th International Conference on Parallel and Distributed Computing, 2018. BibTeX

@article{10.1007/978-3-319-96983-1_17,
	author = "Candel, Francisco and Sahuquillo, Julio and Petit, Salvador and Valero, Alejandro",
	journal = "24th International Conference on Parallel and Distributed Computing",
	title = "{I}mproving {GPU} {C}ache {H}ierarchy {P}erformance with a {F}etch and {R}eplacement {C}ache",
	year = 2018
}

Vicent Selfa, Julio Sahuquillo, Salvador Petit and Maria E Gomez. Application Clustering Policies to Address System Fairness with Intel’s Cache Allocation Technology. 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017. BibTeX

@article{10.1109/PACT.2017.19,
	author = "Selfa, Vicent and Sahuquillo, Julio and Petit, Salvador and Gomez, Maria E.",
	journal = "2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)",
	title = "{A}pplication {C}lustering {P}olicies to {A}ddress {S}ystem {F}airness with {I}ntel’s {C}ache {A}llocation {T}echnology",
	year = 2017
}

Francisco Candel, Alejandro Valero, Salvador Petit and Julio Sahuquillo. Exploiting Data Compression to Mitigate Aging in GPU Register Files. 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2017. BibTeX

@article{10.1109/SBAC-PAD.2017.15,
	author = "Candel, Francisco and Valero, Alejandro and Petit, Salvador and Sahuquillo, Julio",
	abstract = "Nowadays, GPUs sit at the forefront of high-performance computing thanks to their massive computational capabilities. Internally, thousands of functional units, architected to be fed by large register files, fuel such a performance. At nanometer technologies, the SRAM cells that implement register files suffer the Negative Bias Temperature Instability (NBTI) effect, which degrades the transistor threshold voltage Vth and, in turn, can make cells faulty or unreliable when they hold the same logic value for long periods of time. Fortunately, the GPU single-thread multiple-data execution model writes data in recognizable patterns. This work proposes mechanisms to detect those patterns, and to compress and shuffle the data, so that compressed register file entries can be safely powered off, mitigating NBTI aging. Experimental results show that a conventional GPU register file experiences the worst case for NBTI, since it maintains cells with a single logic value during the entire application execution (i.e., a 100% ‘0’ and ‘1’ duty cycle distributions). On average, the proposal reduces these distributions by 61% and 72%, respectively, which translates into Vth degradation savings by 57% and 64%, respectively.",
	journal = "2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)",
	title = "{E}xploiting {D}ata {C}ompression to {M}itigate {A}ging in {GPU} {R}egister {F}iles",
	year = 2017
}

Jose Duro, Salvador Petit, Julio Sahuquillo and Maria E Gomez. Modeling a Photonic Network for Exascale Computing. 2017 International Conference on High Performance Computing & Simulation (HPCS), 2017. BibTeX

@article{10.1109/HPCS.2017.82,
	author = "Duro, Jose and Petit, Salvador and Sahuquillo, Julio and Gomez, Maria E.",
	journal = "2017 International Conference on High Performance Computing {\&} Simulation (HPCS)",
	title = "{M}odeling a {P}hotonic {N}etwork for {E}xascale {C}omputing",
	year = 2017
}

Vicent Selfa, Julio Sahuquillo, Maria E Gomez and Salvador Petit. A Hardware Approach to Fairly Balance the Inter-Thread Interference in Shared Caches. IEEE Transactions on Parallel and Distributed Systems PP (99), 2017. BibTeX

@article{10.1109/TPDS.2017.2713778,
	author = "Selfa, Vicent and Sahuquillo, Julio and Gomez, Maria E. and Petit, Salvador",
	abstract = "Shared caches have become the common design choice in the vast majority of modern multi-core and many-core processors, since cache sharing improves throughput for a given silicon area. Sharing the cache, however, has a downside: the requests from multiple applications compete among them for cache resources, so the execution time of each application increases over isolated execution. The degree to which the performance of each application is affected by the interference becomes unpredictable, leading the system to unfairness situations. This paper proposes Fair-Progress Cache Partitioning (FPCP), a low-overhead hardware-based cache partitioning approach that addresses system fairness. FPCP reduces the interference by allocating to each application a cache partition and adjusting the partition sizes at runtime. To adjust partitions, our approach estimates during multicore execution the time each application would have taken in isolation, which is challenging. The proposed approach has two main differences over existing approaches. First, FPCP distributes cache ways incrementally, which makes the proposal less prone to estimation errors. Second, the proposed algorithm is much less costly than the state-of-the-art ASM-Cache approach. Experimental results show that, compared to ASM-Cache, FPCP reduces unfairness by 48% in four-application workloads and by 28% in eight-application workloads, without harming performance.",
	journal = "IEEE Transactions on Parallel and Distributed Systems PP",
	number = 99,
	title = "{A} {H}ardware {A}pproach to {F}airly {B}alance the {I}nter-{T}hread {I}nterference in {S}hared {C}aches",
	year = 2017
}

Josué Feliu, Salvador Petit and Julio Sahuquillo. Improving IBM POWER8 Performance through Symbiotic Job Scheduling. IEEE Transactions on Parallel and Distributed Systems PP (99), 2017. BibTeX

@article{10.1109/TPDS.2017.2691708,
	author = "Feliu, Josu{\'e} and Petit, Salvador and Sahuquillo, Julio",
	abstract = "Symbiotic job scheduling, i.e., scheduling applications that co-run well together on a core, can have a considerable impact on the performance of processors with simultaneous multithreading (SMT) cores. SMT cores share most of their microarchitectural components among the co-running applications, which causes performance interference between them. Therefore, scheduling applications with complementary resource requirements on the same core can greatly improve the throughput of the system. This paper enhances symbiotic job scheduling for the IBM POWER8 processor. We leverage the existing cycle accounting mechanism to build an interference model that predicts symbiosis between applications. The proposed models achieve higher accuracy than previous models by predicting job symbiosis from throttled CPI stacks, i.e., CPI stacks of the applications when running in the same SMT mode to consider the statically partitioned resources, but without interference from other applications. The symbiotic scheduler uses these interference models to decide, at run-time, which applications should run on the same core or on separate cores. We prototype the symbiotic scheduler as a user-level scheduler in the Linux operating system and evaluate it on an IBM POWER8 server running multiprogram workloads. The symbiotic job scheduler significantly improves performance compared to both an agnostic random scheduler and the default Linux scheduler. Across all evaluated workloads in SMT4 mode, throughput improves by 12.4% and 5.1% on average over the random and Linux schedulers, respectively.",
	journal = "IEEE Transactions on Parallel and Distributed Systems PP",
	number = 99,
	title = "{I}mproving {IBM} {POWER}8 {P}erformance through {S}ymbiotic {J}ob {S}cheduling",
	year = 2017
}

Josué Feliu, Salvador Petit, Julio Sahuquillo and Jose Duato. Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores. IEEE Transactions on Computers PP (99), 2016. BibTeX

@article{10.1109/TC.2016.2620977,
	author = "Feliu, Josu{\'e} and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
	abstract = "Nowadays, high performance multicore processors implement multithreading capabilities. The processes running concurrently on these processors are continuously competing for the shared resources, not only among cores, but also within the core. While resource sharing increases the resource utilization, the interference among processes accessing the shared resources can strongly affect the performance of individual processes and its predictability. In this scenario, process scheduling plays a key role to deal with performance and fairness.",
	journal = "IEEE Transactions on Computers PP",
	number = 99,
	title = "{P}erf{\&}{F}air: {A} {P}rogress-{A}ware {S}cheduler to {E}nhance {P}erformance and {F}airness in {SMT} {M}ulticores",
	year = 2016
}

Joan Josep Valls, Alberto Ros, Maria E Gomez and Julio Sahuquillo. A Directory Cache with Dynamic Private-Shared Partitioning. 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), 2016. BibTeX

@article{10.1109/HiPC.2016.051,
	author = "Valls, Joan Josep and Ros, Alberto and Gomez, Maria E. and Sahuquillo, Julio",
	journal = "2016 IEEE 23rd International Conference on High Performance Computing (HiPC)",
	title = "{A} {D}irectory {C}ache with {D}ynamic {P}rivate-{S}hared {P}artitioning",
	year = 2016
}

Alejandro Valero, Salvador Petit and Julio Sahuquillo. Enhancing the L1 Data Cache Design to Mitigate HCI. IEEE Computer Architecture Letters 15(2):93-96, 2016. BibTeX

@article{10.1109/LCA.2015.2460736,
	author = "Valero, Alejandro and Petit, Salvador and Sahuquillo, Julio",
	abstract = "Over the lifetime of a microprocessor, the Hot Carrier Injection (HCI) phenomenon degrades the threshold voltage, which causes slower transistor switching and eventually results in timing violations and faulty operation. This effect appears when the memory cell contents flip from logic ‘0’ to ‘1’ and vice versa. In caches, the majority of cell flips are concentrated into only a few of the total memory cells that make up each data word. In addition, other researchers have noted that zero is the most commonly-stored data value in a cache, and have taken advantage of this behavior to propose data compression and power reduction techniques. Contrary to these works, we use this information to extend the lifetime of the caches by introducing two microarchitectural techniques that spread and reduce the number of flips across the first-level (L1) data cache cells. Experimental results show that, compared to the conventional approach, the proposed mechanisms reduce the highest cell flip peak up to 65.8%, whereas the threshold voltage degradation savings range from 32.0% to 79.9% depending on the application.",
	journal = "IEEE Computer Architecture Letters",
	number = 2,
	pages = "93-96",
	title = "{E}nhancing the {L}1 {D}ata {C}ache {D}esign to {M}itigate {HCI}",
	volume = 15,
	year = 2016
}

Francisco Candel, Salvador Petit, Julio Sahuquillo and Jose Duato. Impact of Memory-Level Parallelism on the Performance of GPU Coherence Protocols. 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2016. BibTeX

@article{10.1109/PDP.2016.67,
	author = "Candel, Francisco and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
	journal = "2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)",
	title = "{I}mpact of {M}emory-{L}evel {P}arallelism on the {P}erformance of {GPU} {C}oherence {P}rotocols",
	year = 2016
}

Joan Josep Valls, Alberto Ros, Maria E Gomez and Julio Sahuquillo. The Tag Filter Architecture: An energy-efficient cache and directory design. Journal of Parallel and Distributed Computing (100), 2016. BibTeX

@article{10.1016/j.jpdc.2016.04.016,
author = "Valls, Joan Josep and Ros, Alberto and Gomez, Maria E. and Sahuquillo, Julio",
abstract = "Power consumption in current high-performance chip multiprocessors (CMPs) has become a major design concern that aggravates with the current trend of increasing the core count. A significant fraction of the total power budget is consumed by on-chip caches which are usually deployed with a high associativity degree (even L1 caches are being implemented with eight ways) to enhance the system performance. On a cache access, each way in the corresponding set is accessed in parallel, which is costly in terms of energy. On the other hand, coherence protocols also must implement efficient directory caches that scale in terms of power consumption. Most of the state-of-the-art techniques that reduce the energy consumption of directories are at the cost of performance, which may become unacceptable for high-performance CMPs. In this paper, we propose an energy-efficient architectural design that can be effectively applied to any kind of cache memory. The proposed approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target cache set, and just a few ways are searched in the tag and data arrays. This allows the approach to reduce the dynamic energy consumption of caches without hurting their access time. For this purpose, the proposed architecture holds the X least significant bits of each tag in a small auxiliary X-bit-wide array. These bits are used to filter the ways where the least significant bits of the tag do not match with the bits in the X-bit array. Experimental results show that, on average, the TF Architecture reduces the dynamic power consumption across the studied applications up to 74.9%, 85.9%, and 84.5% when applied to L1 caches, L2 caches, and directory caches, respectively.",
journal = "Journal of Parallel and Distributed Computing",
volume = 100,
title = "{T}he {T}ag {F}ilter {A}rchitecture: {A}n energy-efficient cache and directory design",
year = 2016
}

Julio Sahuquillo, Vicent Selfa, Crispín Gomez and Maria E Gomez. A Simple Activation/Deactivation Prefetching Scheme for Chip Multiprocessors. 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2016. BibTeX

@conference{10.1109/PDP.2016.47,
	author = "Sahuquillo, Julio and Selfa, Vicent and Gomez, Crisp{\'i}n and Gomez, Maria E.",
	booktitle = "2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)",
	title = "{A} {S}imple {A}ctivation/{D}eactivation {P}refetching {S}cheme for {C}hip {M}ultiprocessors",
	year = 2016
}

Julio Sahuquillo, Josué Feliu and Salvador Petit. Symbiotic job scheduling on the IBM POWER8. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016. BibTeX

@conference{10.1109/HPCA.2016.7446103,
	author = "Sahuquillo, Julio and Feliu, Josu{\'e} and Petit, Salvador",
	booktitle = "2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)",
	title = "{S}ymbiotic job scheduling on the {IBM} {POWER}8",
	year = 2016
}

Julio Sahuquillo, Houcine Hassan Mohamed, Salvador Petit, Jose Duato and José Luis March. A dynamic execution time estimation model to save energy in heterogeneous multicores running periodic tasks. Future Generation Computer Systems 56, 2015. BibTeX

@article{10.1016/j.future.2015.06.011,
	author = "Sahuquillo, Julio and Mohamed, Houcine Hassan and Petit, Salvador and Duato, Jose and March, Jos{\'e} Luis",
	abstract = "Nowadays, real-time embedded applications have to cope with an increasing demand of functionalities, which require increasing processing capabilities. With this aim real-time systems are being implemented on top of high-performance multicore processors that run multithreaded periodic workloads by allocating threads to individual cores. In addition, to improve both performance and energy savings, the industry is introducing new multicore designs such as ARM’s big.LITTLE that include heterogeneous cores in the same package. A key issue to improve energy savings in multicore embedded real-time systems and reduce the number of deadline misses is to accurately estimate the execution time of the tasks considering the supported processor frequencies. Two main aspects make this estimation difficult. First, the running threads compete among them for shared resources. Second, almost all current microprocessors implement Dynamic Voltage and Frequency Scaling (DVFS) regulators to dynamically adjust the voltage/frequency at run-time according to the workload behavior. Existing execution time estimation models rely on off-line analysis or on the assumption that the task execution time scales linearly with the processor frequency, which can bring important deviations since the memory system uses a different power supply. In contrast, this paper proposes the Processor–Memory (Proc–Mem) model, which dynamically predicts the distinct task execution times depending on the implemented processor frequencies. A power-aware EDF (Earliest Deadline First)-based scheduler using the Proc–Mem approach has been evaluated and compared against the same scheduler using a typical Constant Memory Access Time model, namely CMAT. Results on a heterogeneous multicore processor show that the average deviation of Proc–Mem is only by 5.55% with respect to the actual measured execution time, while the average deviation of the CMAT model is 36.42%. These results turn in important energy savings, by 18% on average and up to 31% in some mixes, in comparison to CMAT for a similar number of deadline misses.",
	journal = "Future Generation Computer Systems",
	volume = 56,
	title = "{A} dynamic execution time estimation model to save energy in heterogeneous multicores running periodic tasks",
	year = 2015
}

Francisco Candel, Salvador Petit, Julio Sahuquillo and Jose Duato. Accurately modeling the GPU memory subsystem. 2015 International Conference on High Performance Computing & Simulation (HPCS), 2015. BibTeX

@conference{10.1109/HPCSim.2015.7237038,
	author = "Candel, Francisco and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
	booktitle = "2015 International Conference on High Performance Computing {\&} Simulation (HPCS)",
	title = "{A}ccurately modeling the {GPU} memory subsystem",
	year = 2015
}

Joan Josep Valls, Julio Sahuquillo, Alberto Ros and Maria E Gomez. PS directory: A scalable multilevel directory cache for CMPs. The Journal of Supercomputing 71(8):2847-2876, 2015. BibTeX

@article{10.1007/s11227-014-1332-5,
	author = "Valls, Joan Josep and Sahuquillo, Julio and Ros, Alberto and Gomez, Maria E.",
	abstract = "As the number of cores increases in current and future chip-multiprocessor (CMP) generations, coherence protocols must rely on novel hardware structures to scale in terms of performance, power, and area. Systems that use directory information for coherence purposes are currently the most scalable alternative. This paper studies the important differences between the directory behavior of private and shared blocks, which claim for a separate management of both types of blocks at the directory. We propose the PS directory, a two-level directory cache that keeps the reduced number of frequently accessed shared entries in a small and fast first-level cache, namely Shared cache, and uses a larger and slower second-level Private cache to track the large amount of private blocks. Entries in the Private cache do not implement the sharer vector, which allows important silicon area savings. Speed and area reasons suggest the use of eDRAM technology, much denser but slower than SRAM technology, for the Private cache, which in turn brings energy savings. Experimental results for a 16-core CMP show that, compared to a conventional directory, the PS directory improves performance by 14 % while reducing silicon area and energy consumption by 34 and 27 %, respectively. Also, compared to the state-of-the-art Multi-Grain Directory, the PS directory apart from increasing performance, it reduces power by 18.7 %, and provides more scalability in terms of area.",
	journal = "The Journal of Supercomputing",
	number = 8,
	pages = "2847-2876",
	title = "{PS} directory: {A} scalable multilevel directory cache for {CMP}s",
	volume = 71,
	year = 2015
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Addressing Fairness in SMT Multicores with a Progress-Aware Schedule. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2015. BibTeX

@conference{10.1109/IPDPS.2015.48,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "IEEE International Parallel and Distributed Processing Symposium (IPDPS)",
	title = "{A}ddressing {F}airness in {SMT} {M}ulticores with a {P}rogress-{A}ware {S}chedule",
	year = 2015
}

Alejandro Valero, Julio Sahuquillo, Salvador Petit and Jose Duato. Design of Hybrid Second-Level Caches. IEEE Transactions on Computers 64(7):1884-1897, 2015. BibTeX

@article{10.1109/TC.2014.2346185,
	author = "Valero, Alejandro and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	abstract = "In recent years, embedded dynamic random-access memory (eDRAM) technology has been implemented in last-level caches due to its low leakage energy consumption and high density. However, the fact that eDRAM presents slower access time than static RAM (SRAM) technology has prevented its inclusion in higher levels of the cache hierarchy. This paper proposes to mingle SRAM and eDRAM banks within the data array of second-level (L2) caches. The main goal is to achieve the best trade-off among performance, energy, and area. To this end, two main directions have been followed. First, this paper explores the optimal percentage of banks for each technology. Second, the cache controller is redesigned to deal with performance and energy. Performance is addressed by keeping the most likely accessed blocks in fast SRAM banks. In addition, energy savings are further enhanced by avoiding unnecessary destructive reads of eDRAM blocks. Experimental results show that, compared to a conventional SRAM L2 cache, a hybrid approach requiring similar or even lower area speedups the performance on average by 5.9 percent, while the total energy savings are by 32 percent. For a 45 nm technology node, the energy-delay-area product confirms that a hybrid cache is a better design than the conventional SRAM cache regardless of the number of eDRAM banks, and also better than a conventional eDRAM cache when the number of SRAM banks is an eighth of the total number of cache banks.",
	journal = "IEEE Transactions on Computers",
	number = 7,
	pages = "1884-1897",
	title = "{D}esign of {H}ybrid {S}econd-{L}evel {C}aches",
	volume = 64,
	year = 2015
}

Vicent Selfa, Julio Sahuquillo, Crispín Gomez and Maria E Gomez. Methodologies and Performance Metrics to Evaluate Multiprogram Workloads. 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2015. BibTeX

@conference{10.1109/PDP.2015.74,
	author = "Selfa, Vicent and Sahuquillo, Julio and Gomez, Crisp{\'i}n and Gomez, Maria E.",
	booktitle = "23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)",
	title = "{M}ethodologies and {P}erformance {M}etrics to {E}valuate {M}ultiprogram {W}orkloads",
	year = 2015
}

Joan Josep Valls, Julio Sahuquillo, Alberto Ros and Maria E Gomez. The Tag Filter Cache: An Energy-Efficient Approach. 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2015. BibTeX

@conference{10.1109/PDP.2015.58,
	author = "Valls, Joan Josep and Sahuquillo, Julio and Ros, Alberto and Gomez, Maria E.",
	abstract = "Power consumption in current high-performance chip multiprocessors (CMPs) has become a major design concern. The current trend of increasing the core count aggravates this problem. On-chip caches consume a significant fraction of the total power budget. Most of the proposed techniques to reduce the energy consumption of these memory structures are at the cost of performance, which may become unacceptable for high-performance CMPs. On-chip caches in multi-core systems are usually deployed with a high associativity degree in order to enhance performance. Even first-level caches are currently implemented with eight ways. The concurrent access to all the ways in the cache set is costly in terms of energy.",
	booktitle = "23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)",
	title = "{T}he {T}ag {F}ilter {C}ache: {A}n {E}nergy-{E}fficient {A}pproach",
	year = 2015
}

Alejandro Valero, Salvador Petit, Julio Sahuquillo and Jose Duato. A reuse-based refresh policy for energy-aware eDRAM caches. 39(1):37-48, 2015. BibTeX

@article{10.1016/j.micpro.2014.12.001,
author = "Valero, Alejandro and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
abstract = "DRAM technology requires refresh operations to be performed in order to avoid data loss due to capacitance leakage. Refresh operations consume a significant amount of dynamic energy, which increases with the storage capacity. To reduce this amount of energy, prior work has focused on reducing refreshes in off-chip memories. However, this problem also appears in on-chip eDRAM memories implemented in current low-level caches. The refresh energy can dominate the dynamic consumption when a high percentage of the chip area is devoted to eDRAM cache structures. Replacement algorithms for high-associativity low-level caches select the victim block avoiding blocks more likely to be reused soon. This paper combines the state-of-the-art MRUT replacement algorithm with a novel refresh policy. Refresh operations are performed based on information produced by the replacement algorithm. The proposed refresh policy is implemented on top of an energy-aware eDRAM cache architecture, which implements bank-prediction and swap operations to save energy. Experimental results show that, compared to a conventional eDRAM design, the proposed energy-aware cache can achieve by 72% refresh energy savings. Considering the entire on-chip memory hierarchy consumption, the overall energy savings are 30%. These benefits come with minimal impact on performance (by 1.2%) and area overhead (by 0.4%).",
number = 1,
pages = "37-48",
title = "{A} reuse-based refresh policy for energy-aware e{DRAM} caches",
volume = 39,
year = 2015
}

Josué Feliu, Salvador Petit, Julio Sahuquillo and Jose Duato. Bandwidth-Aware On-Line Scheduling in SMT Multicores. IEEE Transactions on Computers 65(1), 2015. BibTeX

@article{10.1109/TC.2015.2428694,
	author = "Feliu, Josu{\'e} and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
	abstract = "The memory hierarchy plays a critical role on the performance of current chip multiprocessors. Main memory is shared by all the running processes, which can cause important bandwidth contention. In addition, when the processor implements SMT cores, the L1 bandwidth becomes shared among the threads running on each core. In such a case, bandwidth-aware schedulers emerge as an interesting approach to mitigate the contention. This work investigates the performance degradation that the processes suffer due to memory bandwidth constraints. Experiments show that main memory and L1 bandwidth contention negatively impact the process performance; in both cases, performance degradation can grow up to 40% for some of applications. To deal with contention, we devise a scheduling algorithm that consists of two policies guided by the bandwidth consumption gathered at runtime. The process selection policy balances the number of memory requests over the execution time to address main memory bandwidth contention. The process allocation policy tackles L1 bandwidth contention by balancing the L1 accesses among the L1 caches. The proposal is evaluated on a Xeon E5645 platform using a wide set of multiprogrammed workloads, achieving performance benefits up to 6.7% with respect to the Linux scheduler.",
	journal = "IEEE Transactions on Computers",
	number = 1,
	title = "{B}andwidth-{A}ware {O}n-{L}ine {S}cheduling in {SMT} {M}ulticores",
	volume = 65,
	year = 2015
}

Joan Josep Valls, Alberto Ros, Maria E Gomez and Julio Sahuquillo. PS-Cache: an energy-efficient cache design for chip multiprocessors. The Journal of Supercomputing 71(1):67-86, 2015. BibTeX

@article{10.1007/s11227-014-1288-5,
	author = "Valls, Joan Josep and Ros, Alberto and Gomez, Maria E. and Sahuquillo, Julio",
	abstract = "Power consumption has become a major design concern in current high-performance chip multiprocessors, and this problem exacerbates with the number of core counts. A significant fraction of the total power budget is often consumed by on-chip caches, thus important research has focused on reducing energy consumption in these structures. To enhance performance, on-chip caches are being deployed with a high associativity degree. Consequently, accessing concurrently all the ways in the cache set is costly in terms of energy. This paper presents the PS-Cache architecture, an energy-efficient cache design that reduces the number of accessed ways without hurting the performance. The PS-Cache takes advantage of the private-shared knowledge of the referenced block to reduce energy by accessing only those ways holding the kind of block looked up. Experimental results show that, on average, the PS-Cache architecture can reduce the dynamic energy consumption of L1 and L2 caches by 22 and 40%, respectively.",
	journal = "The Journal of Supercomputing",
	number = 1,
	pages = "67-86",
	title = "{PS}-{C}ache: an energy-efficient cache design for chip multiprocessors",
	volume = 71,
	year = 2015
}

Josué Feliu, Salvador Petit, Julio Sahuquillo and Jose Duato. Cache-hierarchy Contention Aware Scheduling in CMPs. IEEE Transactions on Parallel and Distributed Systems 25(3):581-590, March 2014. DOI BibTeX

@article{DBLP:journals/tpds/josue2013,
	author = "Feliu, Josu{\'e} and Petit, Salvador and Sahuquillo, Julio and Duato, Jose",
	doi = "10.1109/TPDS.2013.61",
	journal = "IEEE Transactions on Parallel and Distributed Systems",
	month = "March",
	number = 3,
	pages = "581-590",
	title = "{C}ache-hierarchy {C}ontention {A}ware {S}cheduling in {CMP}s",
	volume = 25,
	year = 2014
}

José Luis March, Salvador Petit, Julio Sahuquillo and Houcine Hassan Mohamed. Dynamic WCET Estimation for Real-Time Multicore Embedded Systems Supporting DVFS. 2014 IEEE International Conference on High Performance Computing and Communications (HPCC), 2014. BibTeX

@conference{10.1109/HPCC.2014.11,
	author = "March, Jos{\'e} Luis and Petit, Salvador and Sahuquillo, Julio and Mohamed, Houcine Hassan",
	booktitle = "2014 IEEE International Conference on High Performance Computing and Communications (HPCC)",
	title = "{D}ynamic {WCET} {E}stimation for {R}eal-{T}ime {M}ulticore {E}mbedded {S}ystems {S}upporting {DVFS}",
	year = 2014
}

Salvador Petit, Rafael Ubal, Julio Sahuquillo and Pedro Lopez. Efficient Register Renaming and Recovery for High-Performance Processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 22(7):1506-1514, 2014. BibTeX

@article{10.1109/TVLSI.2013.2270001,
	author = "Petit, Salvador and Ubal, Rafael and Sahuquillo, Julio and Lopez, Pedro",
	abstract = "Modern superscalar processors implement register renaming using either random access memory (RAM) or content-addressable memories (CAM) tables. The design of these structures should address both access time and misprediction recovery penalty. Although direct-mapped RAMs provide faster access times, CAMs are more appropriate to avoid recovery penalties. The presence of associative ports in CAMs, however, prevents them from scaling with the number of physical registers and pipeline width, negatively impacting performance, area, and energy consumption at the rename stage. In this paper, we present a new hybrid RAM–CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM provides fast and energy-efficient access to register mappings. On misspeculation, a low-complexity CAM enables immediate recovery. Experimental results show that in a four-way state-of-the-art superscalar processor, the new approach provides almost the same performance as an ideal CAM-based renaming scheme, while dissipating only between 17% and 26% of the original energy and, in some cases, consuming less energy than purely RAM-based renaming schemes. Overall, the silicon area required to implement the hybrid RAM–CAM scheme does not exceed the area required by conventional renaming mechanisms.",
	journal = "IEEE Transactions on Very Large Scale Integration (VLSI) Systems",
	number = 7,
	pages = "1506-1514",
	title = "{E}fficient {R}egister {R}enaming and {R}ecovery for {H}igh-{P}erformance {P}rocessors",
	volume = 22,
	year = 2014
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Addressing bandwidth contention in SMT multicores through scheduling. In International Conference on Supercomputing, ICS'14. 2014, 167. BibTeX

@conference{DBLP:conf/ics/FeliuSPD14,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "International Conference on Supercomputing, ICS'14",
	crossref = "DBLP:conf/ics/2014",
	pages = 167,
	title = "{A}ddressing bandwidth contention in {SMT} multicores through scheduling",
	year = 2014
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Planificación Considerando Degradación de Prestaciones por Contención. In XXIV Jornadas de Paralelismo, JP 2013, Madrid, Sep 17-20. 2013, 62-67. BibTeX

@conference{JP/Feliu/13,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "XXIV Jornadas de Paralelismo, JP 2013, Madrid, Sep 17-20",
	isbn = "978-84-695-8330-2",
	pages = "62-67",
	title = "{P}lanificaci{\'o}n {C}onsiderando {D}egradaci{\'o}n de {P}restaciones por {C}ontenci{\'o}n",
	year = 2013
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors. In 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT'13, Edinburgh, United Kingdom, Sep 7-11. 2013, 123-132. BibTeX

@conference{PACT/Feliu/13,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "22nd International Conference on Parallel Architectures and Compilation Techniques, PACT'13, Edinburgh, United Kingdom, Sep 7-11",
	isbn = "978-1-4799-1021-2",
	pages = "123-132",
	title = "{L}1-{B}andwidth {A}ware {T}hread {A}llocation in {M}ulticore {SMT} {P}rocessors",
	year = 2013
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Using huge pages and performance counters to determine the LLC architecture. In International Conference on Computational Science, ICCS'13, Barcelona, Jun 5-7. 2013, 2557-2560. BibTeX

@conference{josue_iccs_2013,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "International Conference on Computational Science, ICCS'13, Barcelona, Jun 5-7",
	pages = "2557-2560",
	title = "{U}sing huge pages and performance counters to determine the {LLC} architecture",
	year = 2013
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Planificación considerando el ancho de banda de la jerarquía de cache. In XXIII Jornadas de Paralelismo, JP 2012, Elche, Sep 19-21. 2012, 472-477. BibTeX

@conference{JP/Feliu/12,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "XXIII Jornadas de Paralelismo, JP 2012, Elche, Sep 19-21",
	isbn = "978-84-695-4473-0",
	pages = "472-477",
	title = "{P}lanificaci{\'o}n considerando el ancho de banda de la jerarqu{\'i}a de cache",
	year = 2012
}

Josué Feliu, Julio Sahuquillo, Salvador Petit and Jose Duato. Understanding Cache Hierarchy Contention in CMPs to Improve Job Scheduling. In 26th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, Shanghai, China, May 21-25. 2012, 508-519. BibTeX

@conference{DBLP:conf/ipps/FeliuSPD12,
	author = "Feliu, Josu{\'e} and Sahuquillo, Julio and Petit, Salvador and Duato, Jose",
	booktitle = "26th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, Shanghai, China, May 21-25",
	isbn = "978-1-4673-0975-2",
	pages = "508-519",
	title = "{U}nderstanding {C}ache {H}ierarchy {C}ontention in {CMP}s to {I}mprove {J}ob {S}cheduling",
	year = 2012
}

Monica Serrano, Julio Sahuquillo, Salvador Petit, Houcine Hassan and Jose Duato. A cost-effective heuristic to schedule local and remote memory in cluster computers. Journal of Supercomputing, pages 1-19, 2011. URL BibTeX

@article{IP51286180,
	author = "Serrano, Monica and Sahuquillo, Julio and Petit, Salvador and Hassan, Houcine and Duato, Jose",
	abstract = "Cluster computers represent a cost-effective alternative solution to supercomputers. In these systems, it is common to constrain the memory address space of a given processor to the local motherboard. Constraining the system in this way is much cheaper than using a full-fledged shared memory implementation among motherboards. However, memory usage among motherboards can be unfairly balanced. On the other hand, remote memory access (RMA) hardware provides fast interconnects among the motherboards of a cluster. RMA devices can be used to access remote RAM memory from a local motherboard. This work focuses on this capability in order to achieve a better global use of the total RAM memory in the system. More precisely, the address space of local applications is extended to remote motherboards and is used to access remote RAM memory. This paper presents an ideal memory scheduling algorithm and proposes a cost-effective heuristic to allocate local and remote memory among local applications. Compared to the devised ideal algorithm, the heuristic obtains the same (or closely resembling) results while largely reducing the computational cost. In addition, we analyze the impact on the performance of stand alone applications varying the memory distribution among regions (local, local to board, and remote). Then, this study is extended to any number of concurrent applications. Experimental results show that a QoS parameter is needed in order to avoid unacceptable performance degradation. {\copyright} 2011 Springer Science+Business Media, LLC.",
	issn = "0920-8542",
	journal = "Journal of Supercomputing",
	key = "Multitasking",
	keywords = "Cost effectiveness;Costs;Printed circuits;Random access storage;Scheduling algorithms;Supercomputers;",
	note = "Address space;Cluster computer;Computational costs;Global use;Memory address space;Memory usage;Performance degradation;QoS parameters;Remote memory;Remote memory access;Shared memories;Standalone applications;Work Focus;",
	pages = "1-19",
	title = "{A} cost-effective heuristic to schedule local and remote memory in cluster computers",
	url = "http://dx.doi.org/10.1007/s11227-011-0566-8",
	year = 2011
}

M Serrano, Julio Sahuquillo, Houcine Hassan Mohamed, Salvador Petit and Jose Duato. A Scheduling Heuristic to Handle Local and Remote Memory in Cluster Computers. In High Performance Computing and Communications (HPCC), 2010 12th IEEE International Conference on. 2010, 35-42. URL, DOI BibTeX

@conference{5581321,
	author = "Serrano, M. and Sahuquillo, Julio and Mohamed, Houcine Hassan and Petit, Salvador and Duato, Jose",
	abstract = "In cluster computers, RAM memory is spread among the motherboards hosting the running applications. In these systems, it is common to constrain the memory address space of a given processor to the local motherboard. Constraining the system in this way is much cheaper than using a full-fledged shared memory implementation among motherboards. However, in this case, memory usage might widely differ among motherboards depending on the memory requirements of the applications running on each motherboard. In this context, if an application requires a huge quantity of RAM memory, the only feasible solution is to increase the amount of available memory in its local motherboard, even if the remaining ones are underused. Nevertheless, beyond a certain memory size, this memory budget increase becomes prohibitive. In this paper, we assume that the Remote Memory Access hardware used in a Hyper Transport based system allows applications to allocate the required memory from remote motherboards. We also analyze how the distribution of memory accesses among different memory locations (local or remote) impact on performance. Finally, an heuristic is devised to schedule local and remote memory among applications according to their requirements, and considering quality of service constraints.",
	booktitle = "High Performance Computing and Communications (HPCC), 2010 12th IEEE International Conference on",
	doi = "10.1109/HPCC.2010.75",
	isbn = "978-1-4244-8335-8",
	keywords = "hyper transport based system;local memory handling;random access memory;remote memory access hardware;remote memory handling;remote motherboards;scheduling heuristic;random-access storage;scheduling;storage management;",
	month = "sept.",
	pages = "35-42",
	title = "{A} {S}cheduling {H}euristic to {H}andle {L}ocal and {R}emote {M}emory in {C}luster {C}omputers",
	url = "http://dx.doi.org/10.1109/HPCC.2010.75",
	year = 2010
}

Diana B Rayo, Julio Sahuquillo, Houcine Hassan Mohamed, Salvador Petit and Jose Duato. Balancing Task Resource Requirements in Embedded Multithreaded Multicore Processors to Reduce Power Consumption. In Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010). February 2010, 200-204. URL, DOI BibTeX

@conference{11260697,
	author = "Rayo, Diana B. and Sahuquillo, Julio and Mohamed, Houcine Hassan and Petit, Salvador and Duato, Jose",
	abstract = "Power consumption is a major design issue in modern microprocessors. Hence, power reduction techniques, like dynamic voltage scaling (DVS), are being widely implemented. Unfortunately, they impact on the task execution time so difficulting schedulability of hard real-time applications. To deal with this problem, this paper proposes a power-aware scheduler for coarse-grain embedded multicore processors implementing global DVS. To this end, this work presents two heuristics, namely Balanced Memory and Balanced CPU, which distribute the task set among cores focusing on resource utilization. Results show that with respect to a system not implementing DVS, two or five DVS levels achieve energy savings by about 35% or 51%, respectively.",
	booktitle = "Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010)",
	doi = "10.1109/PDP.2010.64",
	isbn = "978-1-4244-5672-7",
	issn = "1066-6192",
	keywords = "microprocessor chips;multi-threading;power consumption;scheduling;",
	month = "Feb",
	note = "task resource requirements;embedded multithreaded multicore processors;power consumption reduction;power reduction techniques;dynamic voltage scaling;global DVS;",
	pages = "200-204",
	title = "{B}alancing {T}ask {R}esource {R}equirements in {E}mbedded {M}ultithreaded {M}ulticore {P}rocessors to {R}educe {P}ower {C}onsumption",
	url = "http://dx.doi.org/10.1109/PDP.2010.64",
	year = 2010
}

Salvador Petit, Rafael Ubal, Julio Sahuquillo and Pedro Lopez. A power-aware hybrid RAM-CAM renaming mechanism for fast recovery. In Computer Design, 2009. ICCD 2009. IEEE International Conference on. 2009, 150-157. URL, DOI BibTeX

@conference{5413160,
	author = "Petit, Salvador and Ubal, Rafael and Sahuquillo, Julio and Lopez, Pedro",
	abstract = "Modern superscalar processors implement register renaming by using either RAM or CAM tables. The design of these structures should address their access time and misprediction recovery penalty. While direct-mapped RAMs provide faster access times, CAMs are more appropriate to avoid recovery penalties. Although they are more complex and slower, CAMs usually match the processor cycle in current designs. However, they do not scale with the number of physical registers and the pipeline width. In this paper we present a new hybrid RAM-CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM provides the current mappings quickly; on mispeculation, a low-complexity CAM enables immediate recovery and further register renaming. Compared to an ideal CAM in a 4-way state-of-the-art superscalar microprocessor, and for almost the same performance (1% slowdown) and area (95% of the ideal CAM size), the proposed scheme consumes about 90% less dynamic energy.",
	booktitle = "Computer Design, 2009. ICCD 2009. IEEE International Conference on",
	doi = "10.1109/ICCD.2009.5413160",
	issn = "1063-6404",
	keywords = "direct-mapped RAM;misprediction recovery penalty;physical registers;pipeline width;power-aware hybrid RAM-CAM renaming mechanism;processor cycle;register renaming;superscalar processors;microprocessor chips;power aware computing;random-access storage;",
	month = "oct.",
	pages = "150-157",
	title = "{A} power-aware hybrid {RAM}-{CAM} renaming mechanism for fast recovery",
	url = "http://dx.doi.org/10.1109/ICCD.2009.5413160",
	year = 2009
}

Salvador Petit, Rafael Ubal, Julio Sahuquillo, Pedro Lopez and Jose Duato. An Efficient Low-Complexity Alternative to the ROB for Out-of-Order Retirement of Instructions. In Antonio Nunez and Pedro P Carballo (eds.). Digital System Design, Architectures, Methods and Tools, 2009. DSD '09. 12th Euromicro Conference on. 2009, 635-642. URL, DOI BibTeX

@conference{5350186,
	author = "Petit, Salvador and Ubal, Rafael and Sahuquillo, Julio and Lopez, Pedro and Duato, Jose",
	abstract = "Current superscalar processors use a reorder buffer (ROB) to support speculation, precise exceptions, and register reclamation. Instructions are retired from this structure in program order, which may lead to significant performance degradation if a long latency operation blocks the ROB head. In this paper, a checkpoint-free out-of-order commit architecture is proposed, which replaces the ROB with a small structure called validation buffer (VB) from which instructions are retired as soon as their speculative state is resolved. An aggressive register reclamation mechanism targeted to this microarchitecture is also devised. Experimental results show that the VB microarchitecture is much more efficient than a ROB-based microprocessor. For example, a 32-entry VB provides similar performance to a 256-entry ROB, while reducing the utilization of other major processor structures.",
	booktitle = "Digital System Design, Architectures, Methods and Tools, 2009. DSD '09. 12th Euromicro Conference on",
	doi = "10.1109/DSD.2009.237",
	editor = "Antonio Nunez and Pedro P. Carballo",
	isbn = "978-0-7695-3782-5",
	keywords = "ROB-based microprocessor;checkpoint-free out-of-order commit architecture;out-of-order instruction retirement;register reclamation;register reclamation mechanism;superscalar reorder buffer processors;validation buffer;buffer circuits;microprocessor chips;",
	month = "aug.",
	pages = "635 -642",
	title = "{A}n {E}fficient {L}ow-{C}omplexity {A}lternative to the {ROB} for {O}ut-of-{O}rder {R}etirement of {I}nstructions",
	url = "http://dx.doi.org/10.1109/DSD.2009.237",
	year = 2009
}

Rafael Ubal, Julio Sahuquillo, Salvador Petit and Pedro Lopez. Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors. In Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on. 2007, 62 -68. URL, DOI BibTeX

@conference{4384043,
	author = "Ubal, Rafael and Sahuquillo, Julio and Petit, Salvador and Lopez, Pedro",
	abstract = "Current microprocessors are based on complex designs, integrating different components on a single chip, such as hardware threads, processor cores, memory hierarchy or interconnection networks. The permanent need to evaluate new designs for each of these components motivates the development of tools which simulate the system working as a whole. In this paper, we present the Multi2Sim simulation framework, which models the major components of upcoming systems, and is intended to cover the limitations of existing simulators. A set of simulation examples is also included for illustrative purposes.",
	booktitle = "Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on",
	doi = "10.1109/SBAC-PAD.2007.17",
	issn = "1550-6533",
	keywords = "Multi2Sim;hardware threads;interconnection networks;memory hierarchy;microprocessors;multicore-multithreaded processors;processor cores;multi-threading;multiprocessor interconnection networks;",
	month = "oct.",
	pages = "62 -68",
	title = "{M}ulti2{S}im: {A} {S}imulation {F}ramework to {E}valuate {M}ulticore-{M}ultithreaded {P}rocessors",
	url = "http://dx.doi.org/10.1109/SBAC-PAD.2007.17",
	year = 2007
}

Rafael Ubal, Julio Sahuquillo, Salvador Petit, H Hassan and Pedro Lopez. Leakage Current Reduction in Data Caches on Embedded Systems. In Intelligent Pervasive Computing, 2007. IPC. The 2007 International Conference on. 2007, 45 -50. URL, DOI BibTeX

@conference{4438392,
	author = "Ubal, Rafael and Sahuquillo, Julio and Petit, Salvador and H. Hassan and Lopez, Pedro",
	abstract = "Nowadays, embedded systems can be found in a wide range of pervasive devices (e.g., smart phones, PDAs, or video/digital cameras). These devices contain large cache memories, whose power consumption can reach about 50% of the total spent energy, from which leakage energy is the predominant fraction in current technologies. This paper proposes a technique to reduce leakage energy consumption in data caches on embedded systems, which is based on the fact that most stored bits take a logical value of zero. The proposed technique has been evaluated on a model of a contemporary high-end embedded microprocessor, namely the ARM Cortex A8 processor, executing a set of standard embedded benchmarks. Experimental results show that leakage energy savings reach about 40% with no IPC loss.",
	booktitle = "Intelligent Pervasive Computing, 2007. IPC. The 2007 International Conference on",
	doi = "10.1109/IPC.2007.95",
	keywords = "ARM Cortex A8 processor;cache memories;data caches;high-end embedded microprocessor;leakage energy consumption reduction;pervasive devices;cache storage;microprocessor chips;power consumption;ubiquitous computing;",
	month = "oct.",
	pages = "45 -50",
	title = "{L}eakage {C}urrent {R}eduction in {D}ata {C}aches on {E}mbedded {S}ystems",
	url = "http://dx.doi.org/10.1109/IPC.2007.95",
	year = 2007
}

Rafael Ubal, Julio Sahuquillo, Salvador Petit, Pedro Lopez and Jose Duato. VB-MT: Design Issues and Performance of the Validation Buffer Microarchitecture for Multithreaded Processors. In Parallel Architecture and Compilation Techniques, 2007. PACT 2007. 16th International Conference on. 2007, 429 -429. URL, DOI BibTeX

@conference{4336257,
	author = "Ubal, Rafael and Sahuquillo, Julio and Petit, Salvador and Lopez, Pedro and Duato, Jose",
	abstract = "The validation buffer (VB) Microarchitecture retires instructions out of order, by substituting the classical ROB by the VB structure. The VB removes the negative effect of long latency instructions located at the ROB head, which prevent other instructions from retiring and cause frequent pipeline stalls due to lack of space in the ROB. This work analyzes different multithreading models (coarse grain, fine grain and simultaneous multithreading) and a set of different instruction fetch policies.",
	booktitle = "Parallel Architecture and Compilation Techniques, 2007. PACT 2007. 16th International Conference on",
	doi = "10.1109/PACT.2007.4336257",
	issn = "1089-795X",
	keywords = "ROB head;VB structure;instruction fetch policies;multithreaded processors;validation buffer microarchitecture;buffer storage;multi-threading;parallel architectures;storage allocation;",
	month = "sept.",
	pages = "429 -429",
	title = "{VB}-{MT}: {D}esign {I}ssues and {P}erformance of the {V}alidation {B}uffer {M}icroarchitecture for {M}ultithreaded {P}rocessors",
	url = "http://dx.doi.org/10.1109/PACT.2007.4336257",
	year = 2007
}

Julio Sahuquillo, N Tomas, Salvador Petit and A Pont. Spim-Cache: A Pedagogical Tool for Teaching Cache Memories Through Code-Based Exercises. Education, IEEE Transactions on 50(3):244 -250, 2007. URL, DOI BibTeX

@article{4287124,
	author = "Sahuquillo, Julio and N. Tomas and Petit, Salvador and A. Pont",
	abstract = "Cache memories represent a core topic in all computer organization and architecture courses offered at universities around the world. As a consequence, educational proposals and textbooks address important efforts to this topic. A valuable pedagogical help when studying cache memories is to perform exercises based on simple algorithms, which allow the identification of cache accesses, for instance, a program accessing the elements of an array. These exercises, referred to as code-based exercises, have a good acceptance among instructors of computer organization courses. Nevertheless, no tool (e.g., simulator) has been developed to be used in undergraduate courses working with this kind of exercises; therefore, students perform such exercises by means of the classic paper and pencil methodology. To fill this gap, this paper proposes a new pedagogical tool, namely Spim-cache. A laboratory example is also presented for illustrative purposes.",
	doi = "10.1109/TE.2007.900021",
	issn = "0018-9359",
	journal = "Education, IEEE Transactions on",
	keywords = "Spim-cache;cache memories;code-based exercises;computer architecture courses;computer organization courses;pedagogical tool;undergraduate courses;cache storage;computer aided instruction;computer science education;educational courses;",
	month = "aug.",
	number = 3,
	pages = "244 -250",
	title = "{S}pim-{C}ache: {A} {P}edagogical {T}ool for {T}eaching {C}ache {M}emories {T}hrough {C}ode-{B}ased {E}xercises",
	url = "http://dx.doi.org/10.1109/TE.2007.900021",
	volume = 50,
	year = 2007
}

B Ossa, J A Gil, Julio Sahuquillo and A Pont. Improving Web Prefetching by Making Predictions at Prefetch. In Next Generation Internet Networks, 3rd EuroNGI Conference on. May 2007, 21 -27. URL, DOI BibTeX

@conference{4231816,
	author = "de la Ossa, B. and J.A. Gil and Sahuquillo, Julio and A. Pont",
	abstract = "Most research attempts to improve Web prefetching techniques have focused on the prediction algorithm, with the objective of increasing its precision or, in the best case, reducing the user's perceived latency. In contrast, to improve prefetching performance, this work concentrates on the prefetching engine and proposes the Prediction at Prefetch (P@P) technique. This paper explains how a prefetching technique can be extended to include our P@P proposal under real-world conditions without changes to the web architecture or the HTTP protocol. To show how this proposal can improve prefetching performance, an extensive performance evaluation study has been carried out; the results show that P@P can considerably reduce the user's perceived latency with no additional cost over the basic prefetch mechanism.",
	booktitle = "Next Generation Internet Networks, 3rd EuroNGI Conference on",
	doi = "10.1109/NGI.2007.371193",
	isbn = "1-4244-0857-1",
	keywords = "HTTP protocol;Web browser;Web prefetching techniques;Web server;prediction algorithm;prediction at prefetch technique;user perceived latency;Internet;information retrieval;",
	month = "may",
	pages = "21 -27",
	title = "{I}mproving {W}eb {P}refetching by {M}aking {P}redictions at {P}refetch",
	url = "http://dx.doi.org/10.1109/NGI.2007.371193",
	year = 2007
}

B Ossa, J A Gil, Julio Sahuquillo and A Pont. Delfos: the Oracle to Predict Next Web User's Accesses. In Advanced Information Networking and Applications, 2007. AINA '07. 21st International Conference on. May 2007, 679 -686. URL, DOI BibTeX

@conference{4220957,
	author = "de la Ossa, B. and J.A. Gil and Sahuquillo, Julio and A. Pont",
	abstract = "Despite the wide and intensive research efforts focused on Web prediction and prefetching techniques aimed at reducing the user's perceived latency, few attempts have been made to implement and use them in real environments, mainly due to their complexity and the supposed limitations that low user bandwidths imposed a few years ago. Nevertheless, current user bandwidths open a new scenario in which prefetching again becomes an interesting option to improve web performance. This paper presents Delfos, a framework to perform web predictions and prefetching in a real environment that tries to close the existing gap between research and practice. Delfos is integrated in the web architecture without modifying the standard HTTP 1.1 protocol, and works by inserting predictions on the web server side, while prefetches are carried out by the client. In addition, it can also be used as a flexible framework to evaluate and compare existing prefetching techniques and algorithms and to assist in the design of new ones, because it provides detailed statistics reports.",
	booktitle = "Advanced Information Networking and Applications, 2007. AINA '07. 21st International Conference on",
	doi = "10.1109/AINA.2007.50",
	isbn = "0-7695-2846-5",
	keywords = "Delfos;Web architecture;Web prediction;Web prefetching;Web server;Web user access;oracle;Web services;authorisation;software architecture;storage management;transport protocols;",
	month = "may",
	pages = "679 -686",
	title = "{D}elfos: the {O}racle to {P}redict {N}ext {W}eb {U}ser's {A}ccesses",
	url = "http://dx.doi.org/10.1109/AINA.2007.50",
	year = 2007
}

J Domenech, Julio Sahuquillo, J A Gil and A Pont. The Impact of the Web Prefetching Architecture on the Limits of Reducing User's Perceived Latency. In Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on. 2006, 740 -744. URL, DOI BibTeX

@conference{4061463,
	author = "J. Domenech and Sahuquillo, Julio and J.A. Gil and A. Pont",
	abstract = "Web prefetching is a technique that has been researched for years to reduce the latency perceived by users. For this purpose, several Web prefetching architectures have been used, but no comparative study has been performed to identify the best architecture dealing with prefetching. This paper analyzes the impact of the Web prefetching architecture focusing on the limits of reducing the user's perceived latency. To this end, the factors that constrain the predictive power of each architecture are analyzed and these theoretical limits are quantified. Experimental results show that the best element of the Web architecture to locate a single prediction engine is the proxy, whose implementation could reduce the perceived latency up to 67%. Schemes for collaborative predictors located at diverse elements of the Web architecture are also analyzed. These predictors could dramatically reduce the perceived latency, reaching a potential limit of about 97% for a mixed proxy-server collaborative prediction engine",
	booktitle = "Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on",
	doi = "10.1109/WI.2006.166",
	isbn = "0-7695-2747-7",
	keywords = "Web prefetching architecture;mixed proxy-server collaborative prediction engine;user perceived latency;Internet;groupware;online front-ends;search engines;",
	month = "dec.",
	pages = "740 -744",
	title = "{T}he {I}mpact of the {W}eb {P}refetching {A}rchitecture on the {L}imits of {R}educing {U}ser's {P}erceived {L}atency",
	url = "http://dx.doi.org/10.1109/WI.2006.166",
	year = 2006
}

J Domenech, J A Gil, Julio Sahuquillo and A Pont. DDG: An Efficient Prefetching Algorithm for Current Web Generation. In Hot Topics in Web Systems and Technologies, 2006. HOTWEB '06. 1st IEEE Workshop on. 2006, 1 -12. URL, DOI BibTeX

@conference{4178377,
	author = "J. Domenech and J.A. Gil and Sahuquillo, Julio and A. Pont",
	abstract = "Web prefetching is one of the techniques proposed to reduce user's perceived latencies in the World Wide Web. The spatial locality shown by user's accesses makes it possible to predict future accesses based on the previous ones. A prefetching engine uses these predictions to prefetch the Web objects before the user demands them. The existing prediction algorithms achieved an acceptable performance when they were proposed but the high increase in the amount of embedded objects per page has reduced their effectiveness in the current Web. In this paper we show that most of the predictions made by the existing algorithms are useless to reduce the user's perceived latency because these algorithms do not take into account how current Web pages are structured, i.e., an HTML object with several embedded objects. Thus, they predict the accesses to the embedded objects in an HTML after reading the HTML itself. For this reason, the prediction advance is not enough to prefetch the objects and therefore there is no latency reduction. As a result of a wide analysis of the behaviour of the most commonly used algorithms, in this paper we present the DDG algorithm that distinguishes between container objects (HTML) and embedded objects to create a new prediction model according to the structure of the current Web. Results show that, for the same amount of extra requests to the server, DDG always outperforms the existing algorithms by reducing the perceived latency between 15% and 150% more without increasing the computing complexity",
	booktitle = "Hot Topics in Web Systems and Technologies, 2006. HOTWEB '06. 1st IEEE Workshop on",
	doi = "10.1109/HOTWEB.2006.355260",
	isbn = "1-4244-0596-3",
	keywords = "HTML object;Web object prefetching;Web pages;World Wide Web;container objects;embedded objects;latency reduction;spatial locality;user access;Internet;hypermedia markup languages;information retrieval;storage management;",
	month = "nov.",
	pages = "1 -12",
	title = "{DDG}: {A}n {E}fficient {P}refetching {A}lgorithm for {C}urrent {W}eb {G}eneration",
	url = "http://dx.doi.org/10.1109/HOTWEB.2006.355260",
	year = 2006
}

Rafael Ubal, José Cano Reyes, Salvador Petit and Julio Sahuquillo. RACFP: a training tool to work with floating-point representation, algorithms, and circuits in undergraduate courses. Education, IEEE Transactions on 49(3):321 -331, 2006. URL, DOI BibTeX

@article{1668276,
	author = "Ubal, Rafael and Cano Reyes, Jos{\'e} and Petit, Salvador and Sahuquillo, Julio",
	abstract = "The design of pedagogical tools to train students is an interesting challenge for academic instructors in any educational area. Some approaches have appeared focusing on computer arithmetic, both integer and floating point. Floating-point arithmetic involves much more complexity; nevertheless, little time is usually devoted to this topic in computer engineering undergraduate courses. In this paper, RACFP is proposed as a pedagogical tool to work with floating-point in undergraduate courses. The tool has been designed with three abstraction levels according to the following learning outcomes: representation, arithmetic operation algorithms, and manufactured hardware circuits. The abstraction levels work independently, allowing for the use of RACFP in other courses, such as discrete mathematics or numerical methods, in which floating representation and related issues are also learning topics. RACFP design pursues two main goals: to minimize the complexity of the learning process and to encourage students when working with floating point. The first goal is achieved as a result of the multilevel design of the tool, while the second goal is achieved as RACFP shows how manufactured hardware implements generic algorithms",
	doi = "10.1109/TE.2006.879240",
	issn = "0018-9359",
	journal = "Education, IEEE Transactions on",
	keywords = "RACFP;computer engineering;floating-point algorithms;floating-point circuits;floating-point representation;pedagogical tool;training tools;undergraduate courses;computer science education;educational aids;educational courses;floating point arithmetic;trai",
	month = "aug.",
	number = 3,
	pages = "321 -331",
	title = "{RACFP}: a training tool to work with floating-point representation, algorithms, and circuits in undergraduate courses",
	url = "http://dx.doi.org/10.1109/TE.2006.879240",
	volume = 49,
	year = 2006
}

J Domenech, Julio Sahuquillo, A Pont and J A Gil. Design Keys to Adapt Web Prefetching Algorithms to Environment Conditions. In Communication System Software and Middleware, 2006. Comsware 2006. First International Conference on. 2006, 1 -7. URL, DOI BibTeX

@conference{1665179,
	author = "J. Domenech and Sahuquillo, Julio and A. Pont and J.A. Gil",
	abstract = "This paper focuses on the design process of Web prefetching algorithms. The main goal of prefetching techniques in web is to reduce user perceived latency. Since these techniques present a high number of non-desired collateral effects that can negatively affect the system performance, the design process of new algorithms must be carefully performed. In a previous work we proposed some performance metrics to evaluate Web prefetching and introduced the byte recall index. In this work we present a statistical analysis which identifies how the environment conditions impact on the most significant indexes (recall and byte recall) used to evaluate prefetch algorithms. Our experimental results show that, depending on the user available bandwidth and the server processing time of each request, the recall is more correlated to the user's perceived latency than the byte recall and vice versa, so that we specify and suggest guidelines to adapt an algorithm to different environment conditions",
	booktitle = "Communication System Software and Middleware, 2006. Comsware 2006. First International Conference on",
	doi = "10.1109/COMSWA.2006.1665179",
	isbn = "0-7803-9575-1",
	keywords = "Web prefetching algorithm;byte recall index;nondesired collateral effect;server processing time;statistical analysis;Internet;statistical analysis;storage management;",
	pages = "1 -7",
	title = "{D}esign {K}eys to {A}dapt {W}eb {P}refetching {A}lgorithms to {E}nvironment {C}onditions",
	url = "http://dx.doi.org/10.1109/COMSWA.2006.1665179",
	year = 2006
}

L G Cardenas, J A Gil, Julio Sahuquillo and A Pont. Emulating Web cache replacement algorithms versus a real system. In Computers and Communications, 2005. ISCC 2005. Proceedings. 10th IEEE Symposium on. June 2005, 891 - 897. URL, DOI BibTeX

@conference{1493829,
	author = "L.G. Cardenas and J.A. Gil and Sahuquillo, Julio and A. Pont",
	abstract = "This paper presents a powerful framework to simulate Web proxy cache systems. Our tool provides a comfortable environment to simulate and explore cache management techniques. It also includes an extension to design and simulate new structures consisting of several interconnected caches, which are very convenient for our current research projects. In addition, a statistics module is extended to obtain supplementary performance measures. We compared the results obtained from our framework against a commercial proxy cache system by using several replacement algorithms and input traces. Experimental results show that proxy cache hit ratio deviations fall very close to the real system, since they never exceed 3.5%. Although the simulation time varies depending on the input trace size and the modeled management technique, in all experiments the run time has been several hundred times shorter than the time the real system takes.",
	booktitle = "Computers and Communications, 2005. ISCC 2005. Proceedings. 10th IEEE Symposium on",
	doi = "10.1109/ISCC.2005.63",
	isbn = "0-7695-2373-0",
	keywords = "Web cache replacement algorithms; Web proxy cache systems; cache management techniques; statistics module; Internet; cache storage;",
	month = "june",
	pages = "891 - 897",
	title = "{E}mulating {W}eb cache replacement algorithms versus a real system",
	url = "http://dx.doi.org/10.1109/ISCC.2005.63",
	year = 2005
}

I J Nino, B Ossa, J A Gil, Julio Sahuquillo and A Pont. CARENA: a tool to capture and replay Web navigation sessions. In End-to-End Monitoring Techniques and Services, 2005. Workshop on. May 2005, 127 - 141. URL, DOI BibTeX

@conference{1564474,
author = "I.J. Nino and de la Ossa, B. and J.A. Gil and Sahuquillo, Julio and A. Pont",
abstract = "Web user behavior has widely changed over the last years. To perform precise and up-to-date Web user behavior characterization is important to carry out representative Web performance studies. In this sense, it is valuable to capture detailed information about the user's experience, which permits to perform a fine grain characterization. Two main types of tools are distinguishable: complex commercial software tools like workload generators and academic tools. The latter mainly concentrate on the development of Windows applications which gather Web events (e.g., browser events) or tools modifying a part of the web browser code. In this paper, we present CARENA, a client-side browser-embedded tool to capture and replay user navigation sessions. Like some commercial software packages our tool captures information about the user session, which can be used later to replay or mimic the gathered user navigation. Nevertheless, unlike these software packages, our tool emulates the original user think times since these times are important to obtain precise and reliable performance results. Among the main features of CARENA are: multiplatform, open source, lightweight, standards based, easily installable and usable, programmed in JavaScript and XUL.",
booktitle = "End-to-End Monitoring Techniques and Services, 2005. Workshop on",
doi = "10.1109/E2EMON.2005.1564474",
isbn = "0-7803-9249-3",
keywords = "Web navigation sessions; client-side browser-embedded tool; user navigation sessions; Internet; online front-ends; software tools;",
month = "may",
pages = "127 - 141",
title = "{CARENA}: a tool to capture and replay {W}eb navigation sessions",
url = "http://dx.doi.org/10.1109/E2EMON.2005.1564474",
year = 2005
}

L G Cardenas, J A Gil, J Domenech, Julio Sahuquillo and A Pont. Performance comparison of a Web cache simulation framework. In Advanced Information Networking and Applications, 2005. AINA 2005. 19th International Conference on 2. March 2005, 281 - 284. URL, DOI BibTeX

@conference{1423694,
	author = "L.G. Cardenas and J.A. Gil and J. Domenech and Sahuquillo, Julio and A. Pont",
	abstract = "Performance comparison studies are primarily carried out through real systems or simulation environments. Simulation is the most commonly used method to explore new proposals due to both its flexibility and the relatively reduced time taken to obtain performance results. This paper presents a powerful framework to simulate Web proxy cache systems. Our tool provides a comfortable environment to simulate and explore cache management techniques. In order to validate our framework and show how accurately it executes, a performance comparison has been carried out. We analyzed the details of a commercial proxy cache system and compared its results with those obtained from our simulator using the most commonly used replacement algorithm (LRU). For this purpose, the proposed environment was adapted to match the performance of the real proxy cache. Experimental results show that proxy cache hit ratio deviations fall very close to the real system, since they never exceed 3.42%.",
	booktitle = "Advanced Information Networking and Applications, 2005. AINA 2005. 19th International Conference on",
	doi = "10.1109/AINA.2005.275",
	isbn = "0-7695-2249-1",
	keywords = "LRU; Web cache simulation; Web proxy cache system; cache management; performance comparison; real system; replacement algorithm; simulation environment; simulation techniques; Internet; cache storage; digital simulation;",
	month = "march",
	pages = "281 - 284",
	title = "{P}erformance comparison of a {W}eb cache simulation framework",
	url = "http://dx.doi.org/10.1109/AINA.2005.275",
	volume = 2,
	year = 2005
}

Salvador Petit, Julio Sahuquillo and A Pont. A comparison study of the HLRC-DU protocol versus a HLRC hardware assisted protocol. In Parallel, Distributed and Network-Based Processing, 2005. PDP 2005. 13th Euromicro Conference on. 2005, 197 - 204. URL, DOI BibTeX

@conference{1386059,
	author = "Petit, Salvador and Sahuquillo, Julio and A. Pont",
	abstract = "SVM systems are a cheaper and flexible way to implement the shared memory programming paradigm. Their huge flexibility is due to their software implementation; however, this is also the main cause of their performance drawbacks with respect to hardware systems. In this paper we compare a pure software HLRC protocol, called HLRC-DU, with an improved version of the HLRC protocol that uses hardware support to reduce asynchronous communication. The performance of both protocols is compared against a baseline HLRC protocol. Results show that, in about half of the benchmarks, our protocol performs better than the hardware approach; moreover, in some cases our protocol reaches a speedup higher than 22% with respect to the baseline protocol.",
	booktitle = "Parallel, Distributed and Network-Based Processing, 2005. PDP 2005. 13th Euromicro Conference on",
	doi = "10.1109/EMPDP.2005.2",
	isbn = "0-7695-2280-7",
	issn = "1066-6192",
	keywords = "HLRC hardware assisted protocol; SVM systems; asynchronous communication; memory consistency protocols; shared memory programming paradigm; shared virtual memory systems; software HLRC-DU protocol; distributed programming; memory protocols; microprogramm",
	month = "feb.",
	pages = "197 - 204",
	title = "{A} comparison study of the {HLRC}-{DU} protocol versus a {HLRC} hardware assisted protocol",
	url = "http://dx.doi.org/10.1109/EMPDP.2005.2",
	year = 2005
}

Salvador Petit, Julio Sahuquillo, A Pont and D Kaeli. Characterizing the dynamic behavior of workload execution in SVM systems. In Computer Architecture and High Performance Computing, 2004. SBAC-PAD 2004. 16th Symposium on. 2004, 230 - 237. URL, DOI BibTeX

@conference{1364758,
	author = "Petit, Salvador and Sahuquillo, Julio and A. Pont and D. Kaeli",
	abstract = "The overhead associated with software management of shared virtual memory (SVM) systems can seriously impact overall system performance. One way to remedy this situation is to design more efficient SVM consistency protocols. In this paper we study a number of parallel workload characteristics that can negatively impact the performance of SVM systems. We attempt to quantify the sources of performance loss in some parallel workloads. Our goal is to better understand these characteristics, enabling us to develop SVM protocols that can adjust to dynamics in workload behavior. This paper has three main contributions: i) we measure the contention for synchronization resources, showing how applications exhibit distinct phases during their execution, ii) we quantify the relationship between page size and fragmentation/false sharing while varying the sharing unit size, and iii) we study the synergies between the contention for synchronization resources and fragmentation/false sharing, providing hints for developing improved protocols.",
	booktitle = "Computer Architecture and High Performance Computing, 2004. SBAC-PAD 2004. 16th Symposium on",
	doi = "10.1109/SBAC-PAD.2004.12",
	isbn = "0-7695-2240-8",
	keywords = "SVM consistency protocols; parallel workload characteristics; shared virtual memory system; software management; synchronization resources; workload execution; performance evaluation; protocols; resource allocation; shared memory systems; synchronisation",
	month = "oct.",
	pages = "230 - 237",
	title = "{C}haracterizing the dynamic behavior of workload execution in {SVM} systems",
	url = "http://dx.doi.org/10.1109/SBAC-PAD.2004.12",
	year = 2004
}

L G Cardenas, Julio Sahuquillo, A Pont and J A Gil. The multikey Web cache simulator: a platform for designing proxy cache management techniques. In Parallel, Distributed and Network-Based Processing, 2004. Proceedings. 12th Euromicro Conference on. 2004, 390 - 397. URL, DOI BibTeX

@conference{1271471,
	author = "L.G. Cardenas and Sahuquillo, Julio and A. Pont and J.A. Gil",
	abstract = "Proxy caches have become an important mechanism to reduce latencies. Efficient management techniques for proxy caches that exploit the inherent characteristics of Web objects are essential to reach good performance. One important segment of the replacement algorithms being applied today are the multikey algorithms, which use several keys or object characteristics to decide which object or objects must be replaced. This feature is not considered in most current simulators. In this paper we propose a proxy-cache platform to check the performance of Web object based on multikey management techniques and algorithms. The proposed platform is coded in a modular way, which allows the implementation of new algorithm or policy proposals in an easy and robust manner. In addition to the classical performance metrics like the hit ratio and the byte hit ratio, the proposed framework also offers the response time perceived by users.",
	booktitle = "Parallel, Distributed and Network-Based Processing, 2004. Proceedings. 12th Euromicro Conference on",
	doi = "10.1109/EMPDP.2004.1271471",
	isbn = "0-7695-2083-9",
	keywords = "Web-objects; byte hit ratio; multikey Web cache simulator; multikey algorithms; multikey management techniques; proxy cache replacement algorithms; proxy caches management techniques; Internet; cache storage; digital simulation; performance evaluation;",
	month = "feb.",
	pages = "390 - 397",
	title = "{T}he multikey {W}eb cache simulator: a platform for designing proxy cache management techniques",
	url = "http://dx.doi.org/10.1109/EMPDP.2004.1271471",
	year = 2004
}

J Domenech, A Pont, Julio Sahuquillo and J A Gil. An experimental framework for testing Web prefetching techniques. In Euromicro Conference, 2004. Proceedings. 30th. 2004, 214 - 221. URL, DOI BibTeX

@conference{1333374,
	author = "J. Domenech and A. Pont and Sahuquillo, Julio and J.A. Gil",
	abstract = "The popularity of Web objects, and by extension the popularity of the Web sites, besides the appearance of clear footprints in user's accesses that show a considerable spatial locality, make it possible to predict future accesses based on the current ones. This fact also permits implementing prefetching techniques in Web architecture in order to reduce the latency perceived by the users. Although the open literature presents some approaches in this sense, the huge variety of prefetching algorithms, and the different scenarios and conditions where they are applied, make it very difficult to compare performance and to obtain correct conclusions that permit researchers to improve their proposals or even detect in which conditions one solution is more convenient than others. This is the main reason why we propose a new and freely available environment in order to implement and study prefetching techniques efficiently. Our framework is a hybrid implementation that combines both real and simulated parts in order to provide flexibility and accuracy. It reproduces in detail the behavior of Web users, proxy servers and original servers. The simulator also includes a module to provide performance results, such as precision (prefetching accuracy), recall, response time, and byte transference.",
	booktitle = "Euromicro Conference, 2004. Proceedings. 30th",
	doi = "10.1109/EURMIC.2004.1333374",
	isbn = "0-7695-2199-1",
	keywords = "Internet latency; Web architecture performance; Web prefetching techniques; Web sites; user access; Internet; computer network reliability; storage management;",
	month = "31 aug.-3 sept.",
	pages = "214 - 221",
	publisher = "IEEE Computer Society",
	title = "{A}n experimental framework for testing {W}eb prefetching techniques",
	url = "http://dx.doi.org/10.1109/EURMIC.2004.1333374",
	year = 2004
}

Salvador Petit, Julio Sahuquillo and A Pont. Characterizing parallel workloads to reduce multiple writer overhead in shared virtual memory systems. In Parallel, Distributed and Network-based Processing, 2002. Proceedings. 10th Euromicro Workshop on. 2002, 261 -268. URL, DOI BibTeX

@conference{994285,
	author = "Petit, Salvador and Sahuquillo, Julio and A. Pont",
	abstract = "Shared virtual memory (SVM) systems, because of their software implementation, enable shared-memory programming at a low design and maintenance cost. Nevertheless, as hardware implementations become faster, their performance is still far from that achieved by distributed shared memory (DSM) systems. Nowadays, SVM systems use relaxed memory consistency models and multiple writer protocols as techniques to reduce latencies and false sharing, respectively. However, these techniques induce additional overhead that decreases system performance. We performed a study of workload behavior aimed at improving the design of SVM protocols. The work focused on the identification of the type of shared data patterns that can appear in the accesses to protected sections using semaphores. Most coherence actions in SVM systems are performed as a consequence of the write operations executed in critical sections, so we pay special attention to the write operations performed when multiple writers are allowed. As these write operations may present spatial locality, we also study the write patterns on shared pages with similar behaviour. Different software filters are applied in the instrumented parallel workloads selected to capture and classify the most common sharing patterns. This enables the recognition of those patterns in which coherence overhead can be reduced by modifying the coherence actions performed by the protocol. Despite the fact that the performance evaluation of new coherence solutions is not our main goal, the ideas presented to improve the behaviour of SVM systems can be implemented at a reasonable hardware/software cost",
	booktitle = "Parallel, Distributed and Network-based Processing, 2002. Proceedings. 10th Euromicro Workshop on",
	doi = "10.1109/EMPDP.2002.994285",
	isbn = "0-7695-1444-8",
	keywords = "coherence actions;coherence overhead;critical sections;design cost;false sharing reduction;hardware cost;hardware implementations;instrumented parallel workloads;latency reduction;maintenance cost;memory consistency protocols;multiple writer protocols;par",
	pages = "261 -268",
	title = "{C}haracterizing parallel workloads to reduce multiple writer overhead in shared virtual memory systems",
	url = "http://dx.doi.org/10.1109/EMPDP.2002.994285",
	year = 2002
}

F Buendia, P Diaz, Julio Sahuquillo, J V Benlloch, J A Gil and M Agusti. XEDU, a framework for developing XML-based didactic resources. In Euromicro Conference, 2001. Proceedings. 27th. 2001, 427 -434. URL, DOI BibTeX

@conference{952484,
	author = "F. Buendia and P. Diaz and Sahuquillo, Julio and J.V. Benlloch and J.A. Gil and M. Agusti",
	abstract = "Recent educational software applications use Web technologies like XML to improve teaching methods in distance learning environments. Though XML has already been used to implement a high number of didactic resources, specification methodologies to develop these resources are rarely applied. As a consequence, the reuse and maintenance of those resources becomes a difficult task. This paper emphasises the use of hypermedia models to deal with this problem. Hypermedia models have long been considered to have a great potential to represent educational applications. The current work proposes the XEDU framework, which works over the Labyrinth hypermedia model, to manage and organise didactic resources. The proposed framework provides a set of abstract didactic structures and the interface to associate them either to XML-based contents or to other complex didactic resources",
	booktitle = "Euromicro Conference, 2001. Proceedings. 27th",
	doi = "10.1109/EURMIC.2001.952484",
	isbn = "0-7695-1236-4",
	keywords = "Labyrinth hypermedia model;Web technologies;XEDU;XML-based contents;XML-based didactic resources;abstract didactic structures;complex didactic resources;distance learning environments;educational software;framework;hypermedia models;specification methodol",
	pages = "427 -434",
	title = "{XEDU}, a framework for developing {XML}-based didactic resources",
	url = "http://dx.doi.org/10.1109/EURMIC.2001.952484",
	year = 2001
}

Salvador Petit, Julio Sahuquillo and A Pont. About the sensitivity of the HLRC-DU protocol on diff and page sizes. In Performance Analysis of Systems and Software, 2001. ISPASS. 2001 IEEE International Symposium on. 2001, 45 -48. URL, DOI BibTeX

@conference{990675,
	author = "Petit, Salvador and Sahuquillo, Julio and A. Pont",
	booktitle = "Performance Analysis of Systems and Software, 2001. ISPASS. 2001 IEEE International Symposium on",
	doi = "10.1109/ISPASS.2001.990675",
	pages = "45 -48",
	publisher = "IEEE Computer Society Press",
	title = "{A}bout the sensitivity of the {HLRC}-{DU} protocol on diff and page sizes",
	url = "http://dx.doi.org/10.1109/ISPASS.2001.990675",
	year = 2001
}

Julio Sahuquillo and A Pont. Splitting the data cache: a survey. Concurrency, IEEE 8(3):30 -35, 2000. URL, DOI BibTeX

@article{865890,
	author = "Sahuquillo, Julio and A. Pont",
	abstract = "Recent cache-memory research has focused on approaches that split the first-level data cache into two independent subcaches. The authors introduce a methodology for helping cache designers devise splitting schemes and survey a representative set of the published cache schemes",
	doi = "10.1109/4434.865890",
	issn = "1092-3063",
	journal = "Concurrency, IEEE",
	keywords = "cache design;cache memory;cache splitting scheme design;first-level data cache splitting;independent subcaches;survey;cache storage;reviews;",
	month = "jul-sep",
	number = 3,
	pages = "30 -35",
	title = "{S}plitting the data cache: a survey",
	url = "http://dx.doi.org/10.1109/4434.865890",
	volume = 8,
	year = 2000
}

J -C Cano, A Pont, Julio Sahuquillo and J A Gil. The differences between distributed shared memory caching and proxy caching. Concurrency, IEEE 8(3):45 -47, 2000. URL, DOI BibTeX

@article{865892,
	author = "J. -C. Cano and A. Pont and Sahuquillo, Julio and J.A. Gil",
	abstract = "The authors discuss the similarities in caching between the extensively studied distributed shared memory systems and the emerging proxy systems. They believe that several of the techniques used in distributed shared memory systems can be adapted and applied to proxy systems",
	doi = "10.1109/4434.865892",
	issn = "1092-3063",
	journal = "Concurrency, IEEE",
	keywords = "caching;distributed shared memory systems;proxy systems;cache storage;distributed shared memory systems;",
	month = "jul-sep",
	number = 3,
	pages = "45 -47",
	title = "{T}he differences between distributed shared memory caching and proxy caching",
	url = "http://dx.doi.org/10.1109/4434.865892",
	volume = 8,
	year = 2000
}

J -C Cano, Teresa Nachiondo, Julio Sahuquillo, A Pont and J A Gil. WWW client/server traffic characterization: a proxy server point of view. In System Sciences, 2000. Proceedings of the 33rd Annual Hawaii International Conference on. 2000, 10 pp.. URL, DOI BibTeX

@conference{926874,
	author = "J. -C. Cano and Nachiondo, Teresa and Sahuquillo, Julio and A. Pont and J.A. Gil",
	abstract = "When performance studies about proxy cache server systems are made, one of the most common difficulties is to identify and to obtain representative workloads. Traces have been used as traditional workload. Gathering traces implies a large amount of time. If a self-similar traffic generator could be used, this problem would be solved; therefore, evaluation studies would become faster and more flexible. This work contains two parts; first, we perform a study of the self-similar property of several characteristics of the arrival collected traces, such as response size pattern, elapsed request time pattern and so on. Secondly, we model a source and develop a self-similar traffic arrival pattern generator.",
	booktitle = "System Sciences, 2000. Proceedings of the 33rd Annual Hawaii International Conference on",
	doi = "10.1109/HICSS.2000.926874",
	isbn = "0-7695-0493-0",
	keywords = "World Wide Web; client server traffic; elapsed request time pattern; performance studies; proxy cache server systems; response size pattern; self-similar traffic generator; traces; workloads; Internet; client-server systems; information resources; teleco",
	month = "jan.",
	pages = "10 pp.",
	title = "{WWW} client/server traffic characterization: a proxy server point of view",
	url = "http://dx.doi.org/10.1109/HICSS.2000.926874",
	year = 2000
}

Julio Sahuquillo and A Pont. Designing competitive coherence protocols taking advantage of reuse information. In Euromicro Conference, 2000. Proceedings of the 26th 1. 2000, 378 -385. URL, DOI BibTeX

@conference{874656,
	author = "Sahuquillo, Julio and A. Pont",
	abstract = "The filter data cache scheme introduces two independent L1 data caches with different organizations placed in parallel. In this scheme, each cache block has a small counter attached for storing information needed for management, called reuse information. The Filter Data Cache micro-architecture offers lower miss rates and better speedups than conventional organizations, as well as saving die area. The reuse information included is directly responsible for improving the overall cache hit-ratio and reducing bus utilization, and this makes it relevant for multiprocessor systems. In this paper, we show how the reuse information of the Filter Data Cache scheme can also be used to design competitive coherence protocols tailored to that scheme. These offer better performance results than traditional write-invalidate and write-update policies",
	booktitle = "Euromicro Conference, 2000. Proceedings of the 26th",
	doi = "10.1109/EURMIC.2000.874656",
	isbn = "0-7695-0780-8",
	keywords = "cache hit-ratio;competitive coherence protocols;filter data cache scheme;microarchitecture;multiprocessor systems;reuse information;write-invalidate policies;write-update policies;multiprocessing systems;performance evaluation;protocols;software reusabili",
	pages = "378 -385",
	publisher = "IEEE Computer Society",
	title = "{D}esigning competitive coherence protocols taking advantage of reuse information",
	url = "http://dx.doi.org/10.1109/EURMIC.2000.874656",
	volume = 1,
	year = 2000
}

Julio Sahuquillo, Teresa Nachiondo, J -C Cano, J A Gil and A Pont. Self-similarity in SPLASH-2 workloads on shared memory multiprocessors systems. In Parallel and Distributed Processing, 2000. Proceedings. 8th Euromicro Workshop on. 2000, 293 -300. URL, DOI BibTeX

@conference{823423,
	author = "Sahuquillo, Julio and Nachiondo, Teresa and J. -C. Cano and J.A. Gil and A. Pont",
	abstract = "The workloads used for evaluating and obtaining performance results in shared memory multiprocessors are widely heterogeneous. Traces have been used over several decades and as computer systems grew in power, semantic benchmarks, like SPLASH2, became the most common workloads. Unfortunately, few benchmarks are available. Recently, self-similar studies have been performed in several computer domains. In this paper, we study the self-similar properties of several SPLASH2 benchmarks. Each benchmark has been studied independently, and all exhibit a clearly self-similar behaviour. The results enable the construction of a self-similar memory reference generator that makes a wide variety of parallel workload traces in a flexible manner, as well as quickly",
	booktitle = "Parallel and Distributed Processing, 2000. Proceedings. 8th Euromicro Workshop on",
	doi = "10.1109/EMPDP.2000.823423",
	isbn = "0-7695-0500-7",
	keywords = "SPLASH-2 workloads;parallel workload traces;self-similar memory reference generator;self-similarity;semantic benchmarks;shared memory multiprocessors systems;parallel processing;shared memory systems;software performance evaluation;system monitoring;",
	pages = "293 -300",
	title = "{S}elf-similarity in {SPLASH}-2 workloads on shared memory multiprocessors systems",
	url = "http://dx.doi.org/10.1109/EMPDP.2000.823423",
	year = 2000
}

Julio Sahuquillo and A Pont. The split data cache in multiprocessor systems: an initial hit ratio analysis. In Parallel and Distributed Processing, 1999. PDP '99. Proceedings of the Seventh Euromicro Workshop on. February 1999, 27 -34. URL, DOI BibTeX

@conference{746641,
	author = "Sahuquillo, Julio and A. Pont",
	abstract = "As current first level (L1) data caches are poorly and inefficiently managed, new approaches to achieve better performance in uniprocessor systems have been proposed. The L1 data cache management system is basically the same as it was three decades ago. New organizations have recently been proposed, where two multi-lateral caches are included in the first level in accordance with the data locality where they are stored. The processor simultaneously sends the same memory request to both caches located in L1. These caches work independently and have different organizations. The main objective is to minimize the average data access time. These new organizations will normally increase the hit ratio. Additionally, the chip area occupied by these caches, including the necessary management hardware, is smaller than in a conventional organization. As the proposed cache size is smaller, it can work faster and improve access time at this level. Several authors have studied different approaches around this idea in uniprocessors. In this work we have made extensions for shared memory multiprocessors and studied the advantages",
	booktitle = "Parallel and Distributed Processing, 1999. PDP '99. Proceedings of the Seventh Euromicro Workshop on",
	doi = "10.1109/EMPDP.1999.746641",
	isbn = "0-7695-0059-5",
	issn = "1066-6192",
	keywords = "L1 data cache management;data caches;hit ratio analysis;multiprocessor systems;performance;shared memory multiprocessors;split data cache;cache storage;performance evaluation;shared memory systems;",
	month = "feb",
	pages = "27 -34",
	publisher = "IEEE Computer Society",
	title = "{T}he split data cache in multiprocessor systems: an initial hit ratio analysis",
	url = "http://dx.doi.org/10.1109/EMPDP.1999.746641",
	year = 1999
}

Julio Sahuquillo and A Pont. The filter cache: a run-time cache management approach. In EUROMICRO Conference, 1999. Proceedings. 25th 1. 1999, 424 -431. URL, DOI BibTeX

@conference{794504,
	author = "Sahuquillo, Julio and A. Pont",
	abstract = "This work presents a new hardware cache management approach for improving the cache hit ratio and reducing the bus traffic. Increasing the L1 cache hit ratio is a crucial aspect of obtaining good performance with the current processors. The proposed approach also increases the overall (L1 plus L2) cache hit ratio, especially in multiprocessor systems, where the bus latencies are low. This work focuses on multiprocessor systems, where a fourth kind of miss (the coherence miss) and the bus utilization problem appear; however, the model can also be applied to uniprocessor systems. Our organization increases the overall cache hit ratio and thus reduces the bus utilization. The proposed model introduces two independent L1 caches with different organizations placed in parallel. Each cache block has attached to it a small counter for storing the reuse-related information. The proposed microarchitecture not only reduces the bus traffic and speeds up better than the conventional organization, but it also saves die area. The performance (versus conventional cache organizations) increases as the number of processors increases",
	booktitle = "EUROMICRO Conference, 1999. Proceedings. 25th",
	doi = "10.1109/EURMIC.1999.794504",
	isbn = "0-7695-0321-7",
	keywords = "bus traffic;cache hit ratio;data cache management;data locality;filter cache;hardware cache management;memory architectures;multi-lateral cache;multiprocessor systems;performance;run-time cache management;cache storage;memory architecture;performance eval",
	pages = "424 -431",
	title = "{T}he filter cache: a run-time cache management approach",
	url = "http://dx.doi.org/10.1109/EURMIC.1999.794504",
	volume = 1,
	year = 1999
}

Julio Sahuquillo and A Pont. Impact of reducing miss write latencies in multiprocessors with two level cache. In Euromicro Conference, 1998. Proceedings. 24th 1. August 1998, 333 -336. URL, DOI BibTeX

@conference{711822,
	author = "Sahuquillo, Julio and A. Pont",
	abstract = "In this paper a multiprocessor system with a two-level cache hierarchy is modeled and extensions of two write invalidate snoopy protocols are implemented in the L2 cache controller for coherence maintenance. The paper focuses on the use of different techniques for reducing miss penalty and a comparative performance study is done for each possibility. To solve efficiently a miss read, the early restart technique is implemented in the second level of cache hierarchy and the critical word first technique is used in the first level cache controller. To obtain better performance in the case of a write miss the write allocate technique is implemented at the L2 cache controller. Two models, with different L1 cache controllers are considered in our study, one of them using the non-write allocate technique and the other using the write allocate. We show that the write allocate and non-write allocate techniques are independent over the processors number. The major conclusion of this work is that the non-write allocate technique is not only less complex for implementation but also better in performance if the L1 write miss rate represents a high percentage of L1 miss rate",
	booktitle = "Euromicro Conference, 1998. Proceedings. 24th",
	doi = "10.1109/EURMIC.1998.711822",
	keywords = "L2 cache controller;coherence maintenance;critical word first technique;early restart technique;miss write latencies;multiprocessor system;nonwrite allocate technique;performance study;two-level cache hierarchy;write allocate technique;write invalidate sn",
	month = "aug",
	pages = "333 -336",
	title = "{I}mpact of reducing miss write latencies in multiprocessors with two level cache",
	url = "http://dx.doi.org/10.1109/EURMIC.1998.711822",
	volume = 1,
	year = 1998
}

Thesis

Hybrid caches: design and data management. Julio Sahuquillo, Salvador Petit (Processor Architecture)

Dynamic Power-Aware Techniques for Real-Time Multicore Embedded Systems. Salvador Petit, Julio Sahuquillo (Processor Architecture)

Contention-Aware Scheduling for SMT Multicore Processors. Julio Sahuquillo, Salvador Petit (Processor Architecture)

Cache Architectures Based on Heterogeneous Technologies to deal with Manufacturing Errors. Julio Sahuquillo, Salvador Petit (Processor Architecture)

Efficient L2 Cache Management to Boost GPGPU Performance. Salvador Petit, Julio Sahuquillo (Computer Architecture)

Efficient Home-Based protocols for reducing asynchronous communication in shared virtual memory systems. Julio Sahuquillo (Computer Architecture)