OBJECTIVE: Natural language processing (NLP) can generate diagnosis codes from imaging reports. International Classification of Diseases, Tenth Revision (ICD-10) codes are the United States standard for billing and coding and enable tracking of disease burden and outcomes. This cross-sectional study aimed to test the feasibility of an NLP algorithm by comparing its performance with radiologists' and physicians' manual coding.
METHODS: Three neuroradiologists and one non-radiologist physician manually coded a randomly selected sample of 200 craniospinal CT and MRI reports drawn from a pool of >10,000. The NLP algorithm (Radnosis, VEEV, Inc., Minneapolis, MN) subdivided each report's Impression into "phrases," with multiple ICD-10 matches for each phrase. Viewing only the Impression, the physician reviewers selected the single best ICD-10 code for each phrase. Codes selected by the reviewers and the algorithm were compared for agreement.
RESULTS: The algorithm parsed the reports' Impressions into 645 phrases, each with ranked ICD-10 matches. Pairwise agreement among the reviewers' selected codes was unreliable (Krippendorff α = 0.39-0.63). Using unanimous reviewer agreement as "ground truth," the algorithm's sensitivity/specificity/F2 were 0.88/0.80/0.83 for the top 5 codes and 0.67/0.82/0.67 for the single best code. The engine tabulated "pertinent negatives" as negative codes for explicitly stated findings (e.g., "no intracranial hemorrhage"). The engine's matching was more specific for shortened than for full-length ICD-10 codes (p = 0.00582 × 10⁻³).
CONCLUSIONS: Manual coding by physician reviewers is time-consuming and shows substantial variability, whereas the NLP algorithm's top 5 diagnosis codes are relatively accurate. This preliminary work demonstrates the feasibility of automated code generation and its potential for reliability and consistency. Future work may include correlating diagnosis codes with clinical encounter codes to evaluate imaging's impact on, and relevance to, patient care.
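For readers unfamiliar with the F2 score reported above, the sketch below shows how sensitivity, specificity, and the F-beta score (β = 2, which weights recall more heavily than precision) are conventionally computed from per-phrase confusion-matrix counts. This is an illustrative sketch only; the counts and function names are hypothetical and are not taken from the study's implementation.

```python
def sensitivity(tp: int, fn: int) -> float:
    # Recall: fraction of truly positive phrases the algorithm coded correctly.
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    # Fraction of truly negative phrases the algorithm did not code.
    return tn / (tn + fp)

def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    # F2 (beta = 2) weights recall more heavily than precision.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Hypothetical counts, for illustration only (not study data).
tp, fp, tn, fn = 90, 20, 80, 12
print(sensitivity(tp, fn), specificity(tn, fp), f_beta(tp, fp, fn))
```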