Software Architecture for Fault-Tolerant Multicore Computing with Hybridized Non-Volatile Memories


Grant Data
Project Title
Software Architecture for Fault-Tolerant Multicore Computing with Hybridized Non-Volatile Memories
Principal Investigator
Professor Wang, Cho Li (Principal Investigator)
Duration
36 months
Start Date
2015-09-01
Completion Date
2018-08-31
Amount
871036
Keywords
Fault tolerance, STT-MRAM, Many-core, Operating systems, Non-volatile memory
Discipline
Software
Panel
Engineering (E)
Sponsor
RGC General Research Fund (GRF)
HKU Project Code
17210615
Grant Type
General Research Fund (GRF)
Funding Year
2015/2016
Status
On-going
Objectives
1) Propose a new fault-tolerant multicore architecture built on next-generation non-volatile memory (e.g., STT-MRAM), used as last-level cache (LLC), on-chip programmable scratchpad memory, and off-chip memory.
2) Revamp the Linux operating system to exploit this hybridized memory hierarchy, which combines caches and programmable scratchpads in SRAM and STT-MRAM. A novel persistent process model and a persistent page table design are proposed to provide native fault tolerance for program execution against transient and crash failures.
3) Explore new data affinity techniques tailored to big data computing across this memory hierarchy. In particular, we propose anti-caching of datasets whose access through the conventional cacheable datapath could cause serious cache pollution. The anti-caching mechanism exploits the on-chip programmable memory.
4) Failure atomicity using non-volatile memory: design and implement kernel modules that ensure a globally consistent state of data stored across non-volatile MRAM and volatile memory components (DRAM, caches, store buffers, etc.), using variants of read-copy-update (RCU) synchronization (and transactional semantics, for comparison) in the OS.
5) Propose a new fault-tolerant programming model, built atop a new object abstraction called "FT-Object", to facilitate fast recovery. It is supplemented with a library of common data structures, such as trees, lists, and maps, built from FT-Object, so that fault tolerance can be combined with user-friendly programming.
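
The copy-then-publish idea behind objective 4 can be illustrated with a minimal user-space sketch. It is not the project's kernel design: it assumes an ordinary file stands in for non-volatile memory and an atomic rename stands in for the RCU pointer swap, and the name atomic_update is hypothetical.

```python
import json
import os
import tempfile


def atomic_update(path, mutate):
    """Apply mutate() to the JSON state at path using copy-then-publish.

    Illustrative assumption: the file at `path` plays the role of persistent
    state, and os.replace plays the role of the atomic publish step.
    """
    # Read: load the currently published version (empty state if none yet).
    try:
        with open(path, "r", encoding="utf-8") as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {}

    # Copy and update: change a private copy; the published file is untouched.
    new_state = mutate(dict(state))

    # Persist the new version to a temporary file in the same directory.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(new_state, f)
            f.flush()
            os.fsync(f.fileno())
        # Publish: os.replace atomically swaps the old version for the new one.
        os.replace(tmp_path, path)
    except BaseException:
        # A failure before publication leaves the old version in place.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return new_state
```

A caller would pass a function that edits and returns the copied state, for example one that increments a counter. If the update fails at any point before the rename, the previously published state remains the only visible version, which is the failure-atomicity property the objective describes.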