Por que é dito que 'quanto maior o pipeline, maior a taxa de clock do processador' [duplicado]

0

Recentemente, passei pela tecnologia hyper-threading do Pentium 4. O número de estágios do pipeline é alto em P4 e é dito que aumentará a velocidade do clock. Como isso é possível?

Meu material de referência

    
por Nithin Jose 30.06.2015 / 07:16

2 respostas

3

Para responder a essa pergunta, precisamos entender algumas coisas sobre a eletrônica digital básica.

Vamos começar dando uma olhada em um pipeline típico.

Como podemos ver que cada estágio do pipeline é seguido por um registro de armazenamento (os retângulos verdes) que mantém a saída de cada estágio. Agora cada estágio de um pipeline consiste em um circuito combinacional. Circuitos combinacionais são basicamente a combinação de várias portas lógicas como NAND, NOR, etc. Cada uma dessas portas lógicas tem alguma latência, ou seja, quando você fornece alguma entrada, leva algum tempo para produzir a saída (alguns nanossegundos para picossegundos). Assim, a latência máxima de cada estágio depende da maior sequência dessas portas lógicas presentes naquele estágio.

Agora, para que uma etapa produza uma saída válida ao fornecer alguma entrada, precisamos fornecer a ela um período de clock suficiente. Portanto, para um estágio com um circuito combinacional complexo e grande, a latência é alta e, portanto, exigirá um longo período de clock e, portanto, um clock mais lento. Da mesma forma, para um estágio com circuito combinacional curto e simples, a latência é baixa e, portanto, exigirá um período de relógio mais curto e, portanto, maior taxa de clock.

Esta é a razão por trás, quanto maior o pipeline, maior a taxa de clock do processador. Como em um pipeline mais longo, os estágios são divididos em subestados mais e, portanto, menores, o que torna cada estágio no pipeline mais simples e um circuito combinacional mais curto, reduzindo assim a latência de cada estágio. Isso, por sua vez, abre espaço para uma taxa de relógio maior.

    
por 30.06.2015 / 11:39
0

seu breve resumo para escrever sobre isso, Para entender por que CPUs diferentes em diferentes velocidades de clock funcionam de maneiras diferentes, deixe-me resumir como a CPU processa as instruções.

Em tech Website

A CPU processes instructions in an assembly-line manner, with different instructions existing in different stages of completion as they move down the line. For instance, each instruction on the original Pentium passes through the following, five-stage pipeline:

Prefetch/Fetch: Instructions are fetched from the instruction cache and aligned for decoding. Decode1: Instructions are decoded into the Pentium's internal instruction format. Branch prediction also takes place at this stage. Decode2: Same as above. Also, address computations take place at this stage. Execute: The integer hardware executes the instruction. Write-back: The results of the computation are written back to the register file. An instruction enters the pipeline at stage 1, and leaves it at stage 5. Since the instruction stream that flows into the CPU's front-end is an ordered sequence of instructions that are to be executed one after the other, it makes sense to feed them into the pipeline one after the other. When the pipeline is full, there is an instruction at each stage.

Each pipeline stage takes one clock cycle to complete, so the smaller the clock cycle, the more instructions per second the CPU can push through its pipeline. This is why, in general, a faster clockspeed means more instructions per second and therefore higher performance.

Most modern processors, however, divide their pipelines up into many more, smaller stages than the Pentium. The later iterations of the Pentium 4 had some 21 stages in their pipelines. This 21-stage pipeline accomplished the same basic steps (with some important additions for instruction reordering) as the Pentium pipeline above, but it sliced each stage into many small stages. Because each pipeline stage was smaller and took less time, the Pentium 4's clock cycles were much shorter and its clockspeed much higher.

In a nutshell, the Pentium 4 took many more clock cycles to do the same amount of work as the original Pentium, so its clockspeed was much higher for the equivalent amount of work. This is one core reason why there's little point in comparing clockspeeds across different processor architectures and families—the amount of work done per clock cycle is different for each architecture, so the relationship between clockspeed and performance (measured in instructions per second) is different.

Um exemplo do mundo real a partir do tópico :

Lets take A VERY simple processor. It is just a programmable calculator - instructions available are add a, b, c and subtract a, b, c. (a, b, c are numbers in memory. no way to load these numbers from constants ). One way to do it would be to do the following all in one clock cycle:

  1. read the instruction and figure out what we're going to do
  2. read memory location a
  3. read memory location b
  4. perform the add or subtract
  5. write the result to location c

With this setup, the IPC is exactly 1, because one instruction takes one (VERY long) clock cycle. Now, let's improve this design. We're going to have 5 clock cycles per instruction, and each doing one of the 5 things above. So, on cycle 1, we decide what to do, on cycle 2, we read a, on cycle 3, we read b, and so on. Note that the IPC will be 1/5th. The thing you have to remember is, ideally each of those steps takes 1/5th of the time, so the end result is the SAME performance.

A more advanced implementation is a pipelined processor - multicycle like the one described, but we do more than one thing at a time: 1. read instruction i 2. read a (for instruction i), and read instruction ii 3. read b (for instruction i), a (for instruction ii), and instruction iii 4. do the op for instruction i, read b for instruction ii, read a for instruction iii, and read instruction iv 5. write c for instruction i, operate for ii, read b for iii, read a for iv, and read the instruction v 6. store c for ii, operate for iii, read b for iv, read a for v, and read vi

(note that this requires the ability to do 3 or 4 memory accesses in a cycle, which I didn't have in the other 2, but for the sake of understanding the concepts this can be ignored)

A picture would really help, but I don't have one offhand. To see how this performs, note that a given instruction takes 5 cycles from start to finish, but at any time, multiple instructions are being processed. Also, every single cycle, one instruction is completed (well, from the 5th cycle forward). So, the IPC is 1, even though each individual instruction takes a bunch of cycles, and the actual performance of the machine is 5 times the performance of the original, since the clock is 5 times faster.

Now, a modern processor is MUCH more advanced than this - there are multiple pipelines working on multiple instructions, instructions are executed out of order, etc., so you can't just do a simple analysis like this to see how an Athlon will perform vs. a P4. In general, a longer pipeline lets you do less in each stage, so you can clock the design faster. The P4's 20 stage pipeline lets it run at up to 3ghz currently, whereas the shorter pipeline of the Athlon results in more work per clock, and therefore a slower max clock speed

Se você estava procurando por informações de hardware, tenha um motivo acima de aqui

    
por 30.06.2015 / 08:39