High Performance Computing Engineer, Howard University, Washington, DC


Howard University
Washington, DC, US

High Performance Computing Engineer

Job description

The Talent Acquisition department hires qualified candidates to fill positions that contribute to the overall strategic success of Howard University. Hiring staff "for fit" makes a significant contribution to Howard University's overall mission.

BASIC FUNCTION:

The purpose of this position is to serve as the High-Performance Computing Engineer for the Research Institute for Tactical Autonomy (RITA), a University Affiliated Research Center (UARC).

SUPERVISORY ACCOUNTABILITY:

Typically responsible for performing some non-supervisory duties in addition to supervisory responsibilities.

NATURE AND SCOPE:

Serves as the High-Performance Computing Engineer for the Research Institute for Tactical Autonomy (RITA). Internal contacts include executives, administrators, faculty, staff, and students of the departments and the University at large, with special emphasis on HR, ETS, Finance, the Office of Research, and the Office of the CFO. External contacts include representatives from the federal government, other colleges and universities, professional associations, consultants, vendors, and the general public.

PRINCIPAL ACCOUNTABILITIES:
  • Understands and applies concepts, procedures, and guidelines to solve highly complex problems in the maintenance of hardware/software network infrastructure.
  • Performs system set-up, experiments, and diagnostics to evaluate printed circuit board exchanges; troubleshoots and makes component repairs based on test results.
  • Configures and operates multipurpose, multitasking computer systems.
  • Supports day-to-day operations for the Computing team by monitoring computing resource performance, managing configurations, and addressing security administration. Applies revisions to system firmware and software. Engages and collaborates with vendors to assist with support activities as required.
  • Performs training and support for technical staff in the use of new software and hardware, either developed or acquired.
  • Creates, deletes, maintains and manages HPC researcher accounts and logins for RITA staff, performs system back-ups, and maintains system configuration files.
  • Installs, configures, modifies, tunes and maintains various research software applications for access on HPC clusters.
  • Performs researcher support and documentation for software applications, programs and enterprise services.
  • Designs, installs, configures, and maintains documentation for cluster infrastructure, including operating systems, job schedulers, resource managers, provisioning managers, configuration managers, network devices, and other components.
  • Investigates, debugs, and addresses researcher inquiries and requests efficiently through a customer issue ticketing system. Communicates complex technical concepts in simple, straightforward language.
  • Explores emerging technologies and technical developments to address expanding analytical requirements. Identifies new services and develops implementation plans. Stays current with best practices in the HPC field. Maintains collaborative relationships with peer HPC research organizations.
  • Performs other related duties as assigned or requested.


CORE COMPETENCIES:
  • Familiarity with low-latency/high-bandwidth, interconnected infrastructure (including InfiniBand, 10/100GigE, and others).
  • Expertise with HPC system software, cluster management tools, job schedulers, and other HPC tools, including Slurm and Ansible.
  • Proficiency in fundamental programming (Python, C/C++, or similar languages) and ML/AI tools (TensorFlow, PyTorch). Expertise in administering, monitoring, and maintaining secure Linux/Unix operating systems (e.g., CentOS).
  • Knowledge of HPC storage (FC, SAS) principles, file systems (NFS, Lustre, BeeGFS, ZFS, etc.), compute node storage, and Network Attached Storage.
  • Proficiency with web interfacing of ML/AI tools such as TensorFlow and PyTorch.
  • Ability to drive technical leadership and management of complex, large-scale computing system projects.
  • Proficiency with multi-vendor management, security and network/Internet protocols.
  • Demonstrated expertise in design, configuration, and planning, with excellent organizational skills and the ability to identify and resolve problems and manage performance.
  • Excellent written and oral communication skills, with experience presenting technical topics to nontechnical audiences.
  • Ability to establish processes for maintaining system performance and managing best-in-class standards.
  • Knowledge of computer applications and experience with accompanying user-friendly software, e.g., Workday, word processing, spreadsheet, database, Outlook, presentation, etc.
  • Excellent leadership, training and developmental skills.
  • Skill in oral and written (English) communications, with the ability to explain complicated fiscal and budgetary processes to laypersons and to make public presentations.
  • Strong organizational skills to establish priorities, meet deadlines, and perform in a responsible, professional manner.
  • Ability to maintain harmonious working relationships with staff, students, faculty, University officials, and the general public.
  • Skill in leadership with ability to delegate tasks and assignments appropriately.
  • Ability to manage cross-functional teams, delegate tasks, and promote and direct staff development.
  • Ability to conduct research and to compile and prepare comprehensive, complex financial and budget reports.
  • Ability to keep abreast of and adhere to new policies initiated by changes in federal, District of Columbia, or University regulations and to communicate this information to others.
  • Strong decision-making skills.


MINIMUM REQUIREMENTS:
  • Bachelor's degree (foreign equivalent or higher) in a relevant field, such as computer science, computer information systems, etc.
  • Four or more years of experience in one of the following fields: information technology, HPC system administration, network engineering, or large-scale HPC file systems.
  • Relevant experience must include Linux and scripting (for example, Python) and ML/AI programming with tools such as TensorFlow, PyTorch, etc. In addition, it may also include maintaining hardware or software over its lifecycle (i.e., requirements analysis, implementation, testing, integration, deployment/installation, and maintenance), computer/network security, or high-performance computing.
  • Experience with cloud computing and container technologies.


Special Note:

Resume/CV and cover letter should be included with the online application.

Due to U.S. Export Control laws and regulations, the candidate hired will need to be a U.S. citizen, lawful permanent resident, or other "protected individual" (as defined by 8 U.S.C. Sec. 1324b(a)(3)).

Full-time 2024-07-09