Genealogy Search Tips

Historical Newspapers, Books and Optical Character Recognition (OCR)

The vast majority of pages from historical newspapers and books that are available for online reading are images of previously published pages. Computers cannot read or index the text in these images directly.

OCR Text and the Search Index

The text must be extracted from these images and put into a form that computers can read before keywords can be selected and a search index created. In almost all cases, Optical Character Recognition (OCR) is used to convert the images into text that computers can process. The text generated by Optical Character Recognition is called the OCR text.

Computer programs use the OCR text to create an index of the keywords and phrases found in the OCR text. The index can then be used to find relevant newspaper articles and pages in books in response to search queries.
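The indexing step described above can be sketched in a few lines of Python. This is only a minimal illustration of a keyword index, not the method used by any particular newspaper site; the function names build_index and search, the page ids, and the sample OCR text are all assumptions made for the example.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")

def build_index(pages):
    """Map each lowercase word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, ocr_text in pages.items():
        for word in WORD.findall(ocr_text.lower()):
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages whose OCR text contains every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical OCR text for two pages.
pages = {
    "page-1": "William Jones of this city was married on Tuesday.",
    "page-2": "Mr. Jones sold his farm.",
}
index = build_index(pages)
print(search(index, "William Jones"))
```

Note that the index is built only from the OCR text, so a search can only find words that the OCR step actually produced.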

When a page in a historical newspaper is displayed online, the OCR text generated for that page is usually shown under the image of the page. In most cases there are many differences between the OCR text and the text printed in the newspaper articles. In general, OCR text generated in the past is not as accurate as OCR text generated with today's technology.

Mistakes in the OCR Text

The text in the images of most old newspaper pages is not as clear as the text in the images of most book pages. As a result, the OCR text generated from historical newspapers is generally less accurate than the OCR text generated from books.

If some of the OCR text does not match the text that appeared in the newspaper articles or books, then the index will not be accurate. For example, if an article about "William Jones" was indexed, based on the OCR text, as "William James", then a search for "William Jones" would not find the article even though the actual article contains a reference to "William Jones".

So any mistakes in the OCR text will create an inaccurate Search Index.
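The effect of a single OCR mistake on an exact keyword search can be shown with a short Python sketch. The two sample sentences and the helper page_matches are made up for illustration; real search engines are more elaborate, but the basic problem is the same.

```python
import re

def page_matches(text, query):
    """True when every word of the query occurs in the text (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return all(word in words for word in query.lower().split())

printed_text = "William Jones was elected mayor."  # what the newspaper printed
ocr_text = "William James was elected mayor."      # hypothetical OCR misreading

print(page_matches(printed_text, "William Jones"))  # True
print(page_matches(ocr_text, "William Jones"))      # False
```

Because the search only ever sees the OCR text, the second check is the one that decides whether the article is found.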

Searching the OCR Text

When searching historical newspapers and books that have been indexed with OCR text, it is important to remember that the OCR text usually does not capture all of the information that is actually in the articles or on the book pages.

For example, you might search for a newspaper article about an ancestor named "William Jones" and not find anything. It's possible that there is an article about him, but it can't be found by searching for his name because his name is not indexed correctly.

So try some alternative search terms if you don't have any positive results with your initial search terms.
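One way to come up with alternative search terms is to swap letters that OCR software might plausibly confuse. The Python sketch below generates spellings that differ from the original by one such swap. The CONFUSIONS table is a small, hypothetical list chosen for illustration, not a measured set of OCR errors, and the function name variants is also an assumption.

```python
# Hypothetical look-alike substitutions; extend or change to suit your material.
CONFUSIONS = {
    "o": ["a", "e", "c"],
    "e": ["c", "o"],
    "n": ["u"],
    "m": ["rn"],
    "l": ["1", "i"],
    "i": ["l", "1"],
    "s": ["5"],
}

def variants(term):
    """Return spellings of term with exactly one character swapped for a look-alike."""
    found = set()
    for position, char in enumerate(term):
        for substitute in CONFUSIONS.get(char.lower(), []):
            found.add(term[:position] + substitute + term[position + 1:])
    found.discard(term)
    return sorted(found)

for spelling in variants("Jones"):
    print(spelling)
```

Each spelling printed can be tried as a search term in place of the original surname. A misreading that involves more than one swap, such as "James" for "Jones", would need the function applied again to its own results.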

Related Case Study

Historical Newspapers

See our case study, Search Historical Newspapers for Julius Gilbert, to learn how we used historical newspapers in our search for more information about Julius Gilbert.

Related Search Tip

OCR and Historical Newspapers

If you are searching a historical newspaper that is displayed online as an image, then you are searching the text generated by Optical Character Recognition (OCR) software. Usually the OCR text is printed below the image of the newspaper page.

So if you don't see your search term in the newspaper page displayed on the screen, you can use your browser's "Find on Page" feature to locate your search term in the OCR text. Then you will be able to find where the article appears in the newspaper page.

Even if the window for the OCR text is very small and your search term does not appear in the window, "Find on Page" will locate your search term and scroll the window so that you can see your search term and the entire article.

You can also copy the article's OCR text and save it as text in one of your files.
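Once the OCR text has been saved, it can also be scanned for near-matches to a name, which may catch spellings that an exact "Find on Page" search would miss. The sketch below uses Python's standard difflib module; the function name fuzzy_find, the 0.8 similarity cutoff, and the sample text are assumptions for illustration only.

```python
import difflib
import re

def fuzzy_find(ocr_text, name, cutoff=0.8):
    """Return word sequences in ocr_text that closely resemble name."""
    words = re.findall(r"[A-Za-z0-9']+", ocr_text)
    size = len(name.split())
    # Every run of consecutive words with the same word count as the name.
    candidates = [
        " ".join(words[i:i + size]) for i in range(len(words) - size + 1)
    ]
    return difflib.get_close_matches(name, candidates, n=5, cutoff=cutoff)

# Hypothetical OCR text copied from a newspaper page.
saved_text = "Mr. William James was elected mayor of the town."
print(fuzzy_find(saved_text, "William Jones"))
```

A lower cutoff finds more candidates but also returns more unrelated text, so some trial and error is usually needed.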

For more search ideas, see Our Tips for Searching Historical Newspapers.

© 2019 All Rights Reserved.