After the 2020 U.S. presidential election, handwriting analysis researcher Mara Merlino got a strange phone call. The person on the line seemed to seek her endorsement of a highly dubious report claiming signature discrepancies on ballots. Merlino, a psychology professor at Kentucky State University, asked some pointed questions. Learning, among other things, that the person involved in the signature verification was not a trained professional, she concluded that “everything about that report was bad and wrong.”

New research lends support to Merlino’s intuitions. It finds that forensic handwriting comparison is an effective method but requires years of professional training to ensure the validity of conclusions made in legal and administrative cases. According to findings published on August 1 in the Proceedings of the National Academy of Sciences USA, highly trained professionals can reliably identify samples written by the same hand, making the correct call most of the time. In contrast, people with more limited training perform measurably worse.

In 2009, amid concerns that some forensic techniques might be flawed, the National Research Council issued a report that noted there is scant support for methods of comparing sample evidence from crime scenes, including hair, bite marks and handwriting. In 2016 a group of advisers to then-President Barack Obama followed up with a report calling for more studies of the accuracy of comparison techniques.

The new findings show “the validity and reliability in this process,” says Merlino, who was not involved in the work. The results of this study and others all “point in the same direction, which is that document examiners who are fully trained are fairly accurate in the calls that they’re making.” The growing evidence helps address criticisms “that there were no studies that empirically supported that people in this field could do what they claimed they could do,” she adds.

David Weitz, a physics professor at Harvard University and an associate editor at PNAS, who oversaw the peer-review process for the work, says he thought it would be important as a “serious scientific study of forensic analysis,” which he’s not “always convinced is done in a really scientific way.” Scientific American contacted four authors of the study, which was funded by the FBI, but none was available for an interview before publication.

For the work, 86 forensic document examiners, most of them government employees, undertook 100 handwriting comparisons using digital images of handwriting produced by 230 people. Of the 100 tasks, 44 were comparisons of documents handwritten by the same person, and the remaining 56 were comparisons of documents written by two different individuals. Unknown to the participants, a tenth of the comparison sets were repeats of sets they had already seen—a way to test how consistent each participant was over time.

Forensic document examiners compare samples based on a long list of factors, says Linda Mitchell, a certified forensic document examiner and researcher, who was not involved in the study. Features of interest include letter spacing, how letters connect, and the drop or rise of “legs” below or above a letter, such as the tail of a small letter “g” or the upsweep of a small letter “d.” “There are tons of things,” she says.

Examiners in the FBI study expressed their conclusions in the form of five ratings: definitive that the same writer had or had not written the compared samples, probable that the same writer had or had not written them, or no conclusion.

Overall, in 3.1 percent of cases, examiners incorrectly concluded that the same writer had composed the comparison samples. Different writers who were twins tripped the examiners up more often, pushing the false-positive rate to 8.7 percent. The false-negative rate—samples from a single writer that were incorrectly attributed to two different writers—was even lower, at 1.1 percent.
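To make the arithmetic behind such figures concrete, here is a minimal Python sketch of how rates like these might be tallied. It is not the study's own analysis code: the counts are placeholders, and the convention of dividing false positives by the number of different-writer comparisons (and false negatives by the number of same-writer comparisons) is an assumption about how these rates are typically defined, not something confirmed by the paper.

```python
# Illustrative sketch (not the study's code): tallying false-positive and
# false-negative rates for handwriting comparisons. Counts are placeholders,
# and inconclusive calls are ignored here for simplicity.

def error_rates(results):
    """results: list of (ground_truth, conclusion) pairs, each 'same' or 'different'."""
    same_truth = [call for truth, call in results if truth == "same"]
    diff_truth = [call for truth, call in results if truth == "different"]

    false_pos = sum(call == "same" for call in diff_truth)       # said "same" but writers differed
    false_neg = sum(call == "different" for call in same_truth)  # said "different" but same writer

    return false_pos / len(diff_truth), false_neg / len(same_truth)

# Hypothetical example mirroring the study's mix of 56 different-writer and
# 44 same-writer comparisons, with made-up conclusions.
demo = ([("different", "same")] * 2 + [("different", "different")] * 54
        + [("same", "different")] * 1 + [("same", "same")] * 43)
fp_rate, fn_rate = error_rates(demo)
print(f"false-positive rate: {fp_rate:.1%}, false-negative rate: {fn_rate:.1%}")
```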

Twins are tricky because a similar environment can drive these handwriting similarities in some families, Mitchell says. “But in the end,” she adds, “there’s always going to be something to individualize your handwriting from somebody else.”

Levels of experience played into examiner accuracy and confidence. The 63 examiners who had two years or more of training performed better and were more cautious—tending to lean more heavily on “probable” conclusions—than novices. The nine participants with the worst performance were among the 23 people who had less training. “That is what the training is all about: making sure you know what you don’t know,” Mitchell says.

Of the 86 participants, 42 erroneously concluded that the samples were from the same writer in at least one comparison. But a core group of eight participants was responsible for more than half of these errors—61 of the 114 wrong guesses. For false negatives, the group did better overall in raw numbers: only 17 of the 86 made this error, and one of the 17 stood out for making more than a fifth of these wrong calls.

The examiners proved to be fairly consistent when called upon to review the same documents again, reversing judgment completely in only 1 percent of cases that they reviewed twice. They were more likely to hedge from definitive to probable, or vice versa, than to flip a finding entirely. Only one person in the whole group reached the same conclusion on all of their second reviews. Consistency between examiners was similar: different examiners evaluating the same comparison reached completely opposite conclusions in only 1.2 percent of cases.

The five-level scale that expresses the strength of an examiner’s opinion represents a metric in flux, Merlino says. Some groups who are developing these assessments have tried out as many as nine levels to express the strength of opinions, but a gap separates how laypeople and experts interpret these findings, with laypeople tending to want a simple “yes” or “no” rather than a qualified opinion. “From what I’ve seen, five levels is not enough, nine levels probably isn’t necessary, but seven levels might be just right,” Merlino adds. Sorting this out and clarifying how laypeople interpret this “opinion language” may be the most important things to establish in further studies, she says.