Employment

Industrial Experience

NVIDIA
Director, Computational Imaging Group, Tegra Product Team 2013-Present
Leading both the architecture and RTL teams for NVIDIA's computational imaging pipeline (camera), which is responsible for the processing and enhancement of raw sensor images in the Tegra camera pipeline. Developing a hardware pipeline (Image Signal Processor) to convert raw Bayer sensor data to high-quality images and video. Focusing on developing innovative imaging features and supporting SOC system architectures and designs. Working to improve the 3As (auto-focus, auto-white-balance, auto-exposure), noise reduction and image quality.

Senior Research Manager, NVIDIA Research 2011-2013
Responsible for a cross-functional research team designing GPU-based SOC architectures for future systems ranging from high-performance computing to graphics to mobile devices. Focused on a broad application space - graphics, bioinformatics, web servers, computer vision, machine learning, databases, simulations, etc. Built a cross-functional, matrixed team to focus on all aspects of the SOC system design - applications, architecture, micro-architecture, implementation, circuit technology, compilers, system software, and more. Successfully designed and executed a product transition plan in which both NV Research ideas and staff were transferred to the product groups.

Senior Architecture Manager, GPU System Architecture 2008-2011
Responsible for the architectural development of a significant portion of NVIDIA's Kepler generation of GPU System-On-Chip (SOC) architecture, focusing on the memory system. This work included the on-chip interconnect (network-on-chip), memory controller, memory caches, virtual memory, memory access protocols, system interface, and multi-chip interconnects. Explored CPU/GPU interactions and supporting memory structures including hardware-based coherent cache architectures. Responsible for the architecture, micro-architecture, performance modeling and validation, and functional modeling and validation of the GPU memory system. Responsibilities included current GPU development, next-generation GPU development and roadmap development for future GPU memory systems. Managed multiple GPUs in development, each with multiple teams spread across multiple sites. Worked closely with hardware and physical design teams.

Architecture Unit Lead, GPU Architecture 2004-2008
Architecture unit lead responsible for the development of the memory system including the on-chip interconnect (NoC), memory protocols, memory cache and virtual memory system of NVIDIA's Fermi generation of GPU SOCs. Responsibilities included the development of the architecture, micro-architecture, performance modeling and validation, and functional modeling and validation. Drove multiple performance analysis efforts that impacted full-chip performance. Also responsible for the unit management - plans, schedule, staffing, assignments, etc.

Architect, GPU Architecture 2003 - 2004
Responsible for the development of the virtual memory architecture for NVIDIA's GPU in support of Microsoft's Longhorn Device Driver Model (LDDM). Contributed to many other aspects of architecture work on NVIDIA GPUs. Contributed to and later drove NVIDIA's involvement in Microsoft's Virtualized Graphics effort.

Newisys, Inc. Austin, TX
Chief Architect, Silicon Development 2000 - 2003
Responsible for the overall architectural and micro-architectural development of Newisys' CC-NUMA cache coherence controller and low-latency packet switch (0.13 um ASIC technology) for scalable multiprocessor systems based on AMD's Opteron x86-64 processors (Horus). Architected the overall system design, coherence controller design and coherence protocol. Co-developed the design's micro-architecture. Responsible for the extended HyperTransport protocol development, which included coherence directory and remote data cache functionality, and subsequent behavioral modeling. Drove the development of a cycle-accurate performance model and subsequent analysis. Worked closely with the BIOS and service processor development teams.

Drove the development of an advanced prototyping environment for analysis and validation of the multiprocessor system. Built the prototype on top of existing 2-processor Opteron systems. Effort included cross-functional technical leadership, system design, board design, software development, logic design, logic synthesis and place & route. Utilized high-end Xilinx FPGAs to prototype the controller design.

Developed and drove the Newisys intellectual property development for the scalable multiprocessor systems. Built a strong patent portfolio (40+ patents). Developed a blocking IP strategy and reviewed the strategy with two external IP law firms.

Technical Manager, Architecture Group/Chip Design Group 2000 - 2003
Built and managed two groups within Newisys: initially the chip design group and finally the architecture group. Chip design group responsibilities included project plan development, interviewing and hiring, culture development, micro-architecture and initial RTL development. Built an advanced architecture group once the design group was stable. Architecture group responsibilities included performance modeling and analysis; prototype development, validation and behavior analysis; cache coherence protocol development and analysis; and future product development.

Intel/Texas Development Center, Desktop Products Group Austin, TX
Architect/Technical Manager, CPU System Cluster 1999 - 2000
Technical manager responsible for building and leading an engineering team that was responsible for the system components of a high-performance IA-32 processor with integrated memory controller and micro-architectural support for multiple, heterogeneous on-die cores. Provided technical leadership for the team's efforts, which included multiprocessor cache coherence protocol architecture and development; protocol engine micro-architecture and implementation; memory controller micro-architecture and implementation; protocol formal verification; and system level performance modeling and analysis.

IBM Research/Austin Research Laboratory Austin, TX
Architect/Research Staff Member 1996 - 1999
Provided technical leadership for a small research team that successfully implemented and demonstrated IBM's first Intel-based CC-NUMA hardware prototype. The team architected, designed and implemented the system using a combination of off-the-shelf components and programmable logic. Functionality included a patented hardware performance monitor to understand system performance. Worked closely with the CC-NUMA software team to understand system performance and drive performance monitoring and enhancements into Windows NT and SCO UnixWare. See "Experience with building a commodity Intel-based ccNUMA system" and "Windows NT in a CC-NUMA System."

Architected, designed and implemented the cache coherence mechanism for IBM's first functional PowerPC-based CC-NUMA hardware prototype. Work included the development of the cache coherence protocol and the micro-architecture & implementation of the coherence directory and pending request mechanism. Co-architected the overall CC-NUMA adapter. Demonstrated hardware functionality of a three-node CC-NUMA system implemented using high-speed programmable logic (FPGA). Developed several innovative and patented protocol features to overcome deficiencies in the PowerPC bus architecture.

HP Laboratories Palo Alto, CA
Architect/Member Technical Staff 1995-1996
Investigated cache coherence structures for HP's future generation shared-memory multiprocessor systems. Focused on cache coherence protocols, directory structures and the interaction between the cache coherence protocol, processor cache hierarchy, operating system and application software. Investigated other memory latency tolerating and reducing techniques to give HP an advantage over its competition.

HaL Computer Systems Campbell, CA
Architect/Verification Engineer 1994 - 1995
Developed a verification strategy, which was based on high-level modeling (HLM), for a CC-NUMA system. The strategy included formal verification (FV) of the cache coherence protocol and a verification tool that was able to compare the results of cycle-accurate and non-cycle-accurate models. Implemented portions of the HLM using Verilog and developed early FV models of the protocol. Worked with Prof. Dill of Stanford to improve the FV tool and methodology - funding one graduate student.

MIT Lincoln Laboratory Boston, MA
Micro-Architect & Logic Design Engineer 1988 - 1990
Architected a radar adaptive nulling hardware prototype designed around a systolic array, which was constructed from an array of custom CORDIC data processors. Effort included the design of a high-speed dual-banked memory system and data path control logic for the systolic array. Additionally, the effort included system design, board design, and discrete and programmable logic. System included a micro-controller that required extensive programming. Significant software was also developed on the host systems to feed data to the systolic array, analyze output data, and present results. Successfully demonstrated the system in both IBM PC and Sun workstation environments.

Artisoft, Inc Tucson, AZ
Logic Design Engineer 1985 - 1987
While attending U of Arizona, worked as a part-time engineer and developed several products for the IBM PC including a hardware access control card, laptop-to-desktop networking system software, and portions of a local area network card and software. Involved in all aspects of product development including product conception, logic design, implementation, board design, debug, verification, manufacturing and marketing.

Academic Experience

University of Texas at Austin Austin, TX
Ph.D. Committee, EE Department 2001-2003
Participated in the orals committee for a Ph.D. student in the Electrical Engineering department. The student's work focused on high-end processor design with an emphasis on power-wise design.

Stanford University Stanford, CA
Consulting Assistant Professor, EE Department 1995 - 1999
Consulting Professor working with Professor Michael Flynn. Developed and taught a graduate level course on shared-memory multiprocessors. Obtained an industrial grant to fund research in fault-tolerant multiprocessors. Actively participated in the research and advised a graduate student funded by this grant. Graduated one Ph.D. student.

Stanford University Stanford, CA
Research Assistant, Ph.D. Degree Program, EE Department 1990 - 1994
Designed update-based cache coherence protocols for scalable shared-memory multiprocessors. Designed protocols for both distributed and centralized directory structures. Developed a set of architectural models for shared-memory multiprocessors and several shared-memory applications. Analyzed the performance of the update-based protocols with respect to common invalidate-based protocols through full system simulations. Identified protocol limitations and evaluated possible protocol enhancements to overcome these limitations. Formally verified the update-based protocols using the Murphi model checking tool from Stanford.