janusng wrote: Please find this old fossil a reference saying the P4 is an emulator.
Does ulysses even know what he is talking about?

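janusng, when we discuss technical questions, please keep it a gentlemen's debate and don't let it slide into a spiteful quarrel that makes us a laughingstock. When you take that tone, I can't accept it, whether or not you are right. The point of this reply is not to judge whether your view is right or wrong; it simply describes my own understanding of the P4 structure.
I come from a VLSI background, so I am used to analyzing things from the bottom up: the architecture, the register/cache schema, the pipeline, and the other circuit behavior along the instruction execution path. My understanding is this: the ISA itself is an emulator.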
http://arstechnica.com/cpu/2q00/x86futu ... ure-1.html
Seen from certain angles, the ISA plays much the same role as Rosetta.
http://arstechnica.com/cpu/2q00/x86futu ... ure-2.html
...Once the design and specification of the instruction set, or the set of instructions available to a programmer for writing programs, was separated from the nitty gritty details of a particular machine's design it meant that programs written for a particular ISA could now run on any machine that implemented that ISA...
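To make the quoted point concrete, here is a minimal sketch in which one program runs on two different "machines" behind the same instruction set. The three-instruction ISA, the micro-op names, and both machines are invented for illustration; they are assumptions, not a model of any real CPU.

```python
# Hypothetical three-instruction ISA, invented for this sketch.
PROGRAM = [("load", "a", 5), ("addi", "a", 7), ("muli", "a", 2)]


def run_direct(program):
    """Machine A: executes each ISA instruction directly."""
    regs = {}
    for op, reg, value in program:
        if op == "load":
            regs[reg] = value
        elif op == "addi":
            regs[reg] += value
        elif op == "muli":
            regs[reg] *= value
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs


def translate(program):
    """Machine B front end: rewrites ISA instructions into internal micro-ops."""
    uops = []
    for op, reg, value in program:
        if op == "load":
            uops.append(("set", reg, value))
        elif op == "addi":
            uops.append(("add_imm", reg, value))
        elif op == "muli":
            # Assume this machine has no multiplier: expand into repeated adds.
            if value < 1:
                raise ValueError("sketch only handles multipliers >= 1")
            uops.append(("copy", "tmp", reg))
            uops.extend(("add_reg", reg, "tmp") for _ in range(value - 1))
        else:
            raise ValueError(f"unknown instruction: {op}")
    return uops


def run_uops(uops):
    """Machine B back end: only understands the internal micro-ops."""
    regs = {}
    for op, dst, src in uops:
        if op == "set":
            regs[dst] = src
        elif op == "add_imm":
            regs[dst] += src
        elif op == "copy":
            regs[dst] = regs[src]
        elif op == "add_reg":
            regs[dst] += regs[src]
    return regs


result_a = run_direct(PROGRAM)["a"]
result_b = run_uops(translate(PROGRAM))["a"]
print(result_a, result_b)
```

Both machines leave the same value in register `a`, even though machine B never executes a `muli` at all. That is the sense in which the program only sees the ISA and not the hardware underneath.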
The first to use an ISA in this sense was not even the CISC camp but the "RISC-like" IBM System/360:
http://portal.acm.org/citation.cfm?id=32232.32233
...Interestingly enough, the design set almost qualifies as a RISC (reduced instruction set computer), in that many of the common instructions execute in one cycle, at least on the faster processors in the family...
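With an ISA layer standing in front, the hardware underneath is free to evolve substantially. That is also why the Pentium series could adopt a pipelined design. With the "pure CISC" architecture of the traditional 80x86 series, every instruction would engage all of its related circuitry at once instead of being split into separate stages for execution.
In other words, no matter how complex your instruction set is, the circuitry of the CPU's execution path can be greatly simplified.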
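A rough sketch of what splitting execution into stages buys, under idealized assumptions: every stage takes one cycle and there are no stalls, hazards, or flushes. The five-stage figure is only an example value, not a claim about any particular chip.

```python
def cycles_unpipelined(instructions, stages):
    """Each instruction occupies the whole datapath until it finishes."""
    return instructions * stages


def cycles_pipelined(instructions, stages):
    """Ideal pipeline: fill it once, then retire one instruction per cycle."""
    if instructions == 0:
        return 0
    return stages + (instructions - 1)


n, k = 100, 5  # example values, chosen only for illustration
print(cycles_unpipelined(n, k), cycles_pipelined(n, k))
```

Under these assumptions the pipelined machine approaches one instruction per cycle as the instruction count grows, which is the throughput argument for staging the execution path.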
http://www.xbitlabs.com/articles/cpu/di ... eview.html
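Intel's P4 family implemented this technique in the "Willamette" chip, and Intel also used a technique called the "trace cache" to overcome the drawbacks of an interpreter. Strictly speaking, what the P4 does is not just "emulation" but "translation". Another similar technique is Java's HotSpot JIT engine.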
http://arstechnica.com/cpu/2q00/x86futu ... ure-5.html
...Another technology that works similarly to that is Willamette's trace cache. Modern x86 processors like the Athlon or the PIII translate x86 instructions into an internal instruction format, which they execute natively. The problem is that they don't cache those translated instructions anywhere, so every time they execute them they've got to translate them again. Details are scarce, but as far as we know Willamette's trace cache stores translated blocks of code and reuses them, making it conform roughly to the previous diagram...
ftp://download.intel.com/technology/itj ... /art_2.pdf
...IA-32 instructions are cumbersome to decode. The instructions have a variable number of bytes and have many different options. The instruction decoding logic needs to sort this all out and convert these complex instructions into simple uops that the machine knows how to execute.... In this location the Trace Cache is able to store the already decoded IA32 instructions or uops. Storing already decoded instructions removes the IA-32 decoding from the main execution loop....
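The idea in the two excerpts above can be sketched as a decoder with a cache of already decoded micro-ops. This is a toy model only: the instruction strings, the micro-op names, and the decode table are made up, and a real trace cache stores traces of uops along the predicted execution path rather than one entry per address.

```python
# Toy decode table: invented instructions mapped to invented micro-ops.
DECODE_TABLE = {
    "add [mem], reg": ["load tmp, [mem]", "add tmp, reg", "store [mem], tmp"],
    "inc reg": ["add reg, 1"],
    "jnz loop": ["branch_if_not_zero loop"],
}


class Frontend:
    def __init__(self, use_cache):
        self.use_cache = use_cache
        self.cache = {}        # address -> list of decoded micro-ops
        self.decode_count = 0  # how many times the slow decoder ran

    def decode(self, instruction):
        """Stand-in for the expensive variable-length decode step."""
        self.decode_count += 1
        return DECODE_TABLE[instruction]

    def fetch_uops(self, address, instruction):
        """Return micro-ops, reusing a cached decode when one exists."""
        if self.use_cache and address in self.cache:
            return self.cache[address]
        uops = self.decode(instruction)
        if self.use_cache:
            self.cache[address] = uops
        return uops


def run_loop(frontend, program, iterations):
    """Walk the loop body repeatedly and count the micro-ops issued."""
    issued = 0
    for _ in range(iterations):
        for address, instruction in enumerate(program):
            issued += len(frontend.fetch_uops(address, instruction))
    return issued


PROGRAM = ["add [mem], reg", "inc reg", "jnz loop"]

no_cache = Frontend(use_cache=False)
with_cache = Frontend(use_cache=True)
uops_a = run_loop(no_cache, PROGRAM, 100)
uops_b = run_loop(with_cache, PROGRAM, 100)
print(no_cache.decode_count, with_cache.decode_count)
```

Both front ends issue the same micro-ops, but the cached one runs the decoder only once per instruction in the loop body. That is what "removes the IA-32 decoding from the main execution loop" means in this simplified picture.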
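As for whether the P4 has run into a dead end, or whether AMD or Intel is better, I make no comment.
Part of the reason the P4 underperforms is that its many pipeline stages make the overall execution path too long, so it needs a higher clock speed to reach the same level of results (Good For Megahertz Myth Believers?). The other part is poor circuit design. From the "program" point of view, execution "efficiency" is not only about executing time; "throughput" matters as well. That is why the degree of code optimization shows up especially clearly on the P4.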
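One way to see why a longer pipeline needs a higher clock is a back-of-the-envelope model in which every mispredicted branch costs a flush roughly as long as the pipeline. This is a deliberate simplification that ignores caches, stalls, and everything else; the flush penalties of 10 and 20 cycles and the misprediction rate below are illustrative assumptions, not measurements.

```python
def cycles_per_instruction(flush_penalty, mispredicts_per_instruction):
    """Ideal CPI of 1 plus the average cycles lost to pipeline flushes."""
    return 1.0 + flush_penalty * mispredicts_per_instruction


def run_time_seconds(instructions, cpi, clock_hz):
    """Classic estimate: time = instructions * CPI / clock rate."""
    return instructions * cpi / clock_hz


def clock_needed_to_match(target_seconds, instructions, cpi):
    """Clock rate at which a design with this CPI finishes in target_seconds."""
    return instructions * cpi / target_seconds


INSTRUCTIONS = 1_000_000_000
MISPREDICT_RATE = 0.01  # assumed: one flush per 100 instructions

cpi_short = cycles_per_instruction(10, MISPREDICT_RATE)  # shorter pipeline
cpi_long = cycles_per_instruction(20, MISPREDICT_RATE)   # longer pipeline

baseline = run_time_seconds(INSTRUCTIONS, cpi_short, 1.0e9)
needed_hz = clock_needed_to_match(baseline, INSTRUCTIONS, cpi_long)
print(f"longer pipeline needs about {needed_hz / 1e9:.2f} GHz to match 1.00 GHz")
```

With these made-up numbers the deeper design has to clock higher just to break even, and the gap widens as branch prediction gets worse. That is also one reason well-optimized code with predictable branches helps so much on a deep pipeline.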
http://answers.google.com/answers/threadview?id=388350
http://www.xbitlabs.com/articles/cpu/di ... -1_19.html
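Another thing that affects P4 performance is Hyper-Threading. Simply put, it lets the operating system use one CPU as if it were two. Of course, this only works if the BIOS and the operating system support it. On this point, the multi-threading / SMP performance of both Windows and Linux falls short of BSD (no theoretical basis, purely my personal impression from use).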
http://www.xbitlabs.com/articles/cpu/di ... -1_23.html
http://www.itdoor.net/pages/13,10104,1,1040896326.html
In the end, the ISA was created to overcome legacy baggage. But as AMD-64 and Itanium make plain, the ISA itself has become baggage.