E-Day

Searchable PDF Pilot Program

After playing with Acrobat DC at work and talking to someone in the Creative Services group about Acrobat and PDF quality these days, I decided to start a pilot program to test how Acrobat converts our JPG scans to PDFs with searchable text.

I ran an initial test last month on an issue of Tips & Tricks that I edited. The file size was significantly smaller, and the image quality was indistinguishable from the original on my monitors, but Acrobat created some graphical anomalies where there were background patterns or something similar. Not great. Then, after editing the three Antic issues this month with their boring black-on-white text, I decided to try again. This time things worked quite well, at least for the magazine content. The ad pages are hit and miss, though they are pretty good.

So the whole point of this is to see if there is interest and if it's worth it to release PDF versions with searchable text alongside the traditional CBZ files. These PDFs would not replace the CBZ files; they would supplement them for those who are interested in having a searchable format for research purposes.

These PDFs will have funky graphical issues, which should be fine since the CBZ is there to be the true version of the issue. The files will be located here:

https://www.retromags.com/files/category/227-searchable-pdf-pilot/

Right now there are three Antic issues in there. I will add more issues from various publications to see how Acrobat handles the text from various magazines with various layouts and backgrounds. Those interested can download the files, look through them, and post any problems, issues, and opinions here. Then we'll go from there.

45 minutes ago, E-Day said:

There is a new searchable PDF file for testing purposes: Searchable PDF test file - EGM Issue 138 January 2001

This one has far more complex layouts than Antic, so it should prove interesting to see how much text was actually converted to something searchable.

You mean you haven't even checked it yet?:lol:

 

You know, an alternative to offering searchable PDFs here is to just put up a sidebar in the download section with links to any of the free software programs or online tools that will create searchable PDFs, for anyone who doesn't already have Acrobat. For that matter, we could also provide links/recommendations for free software programs for converting our CBR files into normal PDFs for any who prefer that format. With only one person having downloaded any of the previous 3 files, interest might not be high enough to warrant offering multiple formats of all of our files here.

Btw, the Fujitsu Scansnap is capable of scanning directly into searchable PDF format, but I've never bothered experimenting with it.


I have not found a downloadable file for several days. Are these set for a higher access level to get them? It comes up with "

Sorry, there is a problem

We could not locate the item you are trying to view.

Error code: 2D161/2   "

1 hour ago, kitsunebi77 said:

You mean you haven't even checked it yet?:lol:

Btw, the Fujitsu Scansnap is capable of scanning directly into searchable PDF format, but I've never bothered experimenting with it.

Nope, not yet :D

Both my scanners can scan directly to searchable PDF, which I can test out. It would just mean scanning everything twice, and the scans wouldn't be edited, which I guess is okay.

On 8/11/2017 at 9:05 AM, E-Day said:

I ran an initial test last month on an issue of Tips & Tricks that I edited. The file size was significantly smaller, and the image quality was indistinguishable from the original on my monitors, but Acrobat created some graphical anomalies where there were background patterns or something similar. Not great. Then, after editing the three Antic issues this month with their boring black-on-white text, I decided to try again. This time things worked quite well, at least for the magazine content. The ad pages are hit and miss, though they are pretty good.

So the whole point of this is to see if there is interest and if it's worth it to release PDF versions with searchable text alongside the traditional CBZ files. These PDFs would not replace the CBZ files; they would supplement them for those who are interested in having a searchable format for research purposes.

These PDFs will have funky graphical issues, which should be fine since the CBZ is there to be the true version of the issue. The files will be located here:

Those interested can download the files, look through them, and post any problems, issues, and opinions here. Then we'll go from there.

I'm using Acrobat Pro XI and converting part 1 of Philly's torrent from PDF to PDF with OCR text recognition. I never got any graphical anomalies although, like you said, advertisements are hit and miss, as are covers. What I have learned about Adobe's character recognition is that images need to be deskewed, letters need to be at least 20 pixels in height, 300 dpi is the minimum recommended scan resolution, and lines of text need to be close to horizontal. Background color is likely to contribute to pages not being OCR'd.
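To put the 20-pixel and 300 dpi figures above in physical terms, here is a rough sketch of the arithmetic. The function names are made up for illustration, and the 1.0 mm letter height is only an example value, not something measured from a real scan.

```python
MM_PER_INCH = 25.4


def letter_height_mm(pixels, dpi):
    """Physical height on paper of something `pixels` tall in a scan at `dpi`."""
    return pixels / dpi * MM_PER_INCH


def required_dpi(letter_mm, min_pixels=20):
    """Scan resolution needed for a letter `letter_mm` tall to span `min_pixels`."""
    return min_pixels * MM_PER_INCH / letter_mm


# A 20-pixel letter in a 300 dpi scan is about this tall on the printed page.
print(f"{letter_height_mm(20, 300):.2f} mm")  # 1.69 mm

# Example only: print that is 1.0 mm tall would need a higher scan resolution.
print(f"{required_dpi(1.0):.0f} dpi")  # 508 dpi
```

In other words, at 300 dpi anything printed smaller than roughly 1.7 mm tall falls under the 20-pixel guideline, which may be part of why fine print in ads gets missed.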

With the high standard Retromags has for its scans, we are a gauge for the potential of all other OCR programs out there.

It would be so easy to run a batch on all the magazines if they were in PDF. Correct me if I'm wrong, but most are in the CBR format and would need to be converted to PDF before running the Acrobat OCR tool. Microsoft OneNote and Evernote are both capable of character recognition on raw picture files such as those inside a CBR. Just extract the folder out of the CBR and they work, but I haven't done a whole lot of investigation into this yet.
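For anyone who wants to script the "extract the folder" step, here is a minimal sketch using only Python's standard library. It handles CBZ files, which are ordinary ZIP archives; CBR files are RAR archives and would need a separate RAR tool, which is not shown. The function name and the assumption that page filenames sort into page order are mine, not anything the site guarantees.

```python
import zipfile
from pathlib import Path

# Image types to treat as pages (an assumption; adjust as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def extract_cbz_pages(cbz_path, out_dir):
    """Extract the image pages of a CBZ archive and return their paths.

    Pages are returned sorted by archive filename, assuming the names
    sort into page order (e.g. 001.jpg, 002.jpg, ...).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = []
    with zipfile.ZipFile(cbz_path) as archive:
        for name in sorted(archive.namelist()):
            # Skip folders and non-image extras such as text files.
            if Path(name).suffix.lower() in IMAGE_SUFFIXES:
                pages.append(Path(archive.extract(name, out_dir)))
    return pages
```

Calling `extract_cbz_pages("some_issue.cbz", "pages")` (hypothetical filename) leaves a folder of loose images that can be fed to whatever OCR tool you prefer.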

It would be nice if somebody could just search all the magazines like Google searches the web and then have a list of results. Google Drive and archive.org have these features, but the quality is downgraded and Retromags should be getting credit for this.

 

2 minutes ago, Data said:

It would be nice if somebody could just search all the magazines like Google searches the web and then have a list of results. Google Drive and archive.org have these features, but the quality is downgraded and Retromags should be getting credit for this.

 

There are other things holding this back.  I have a ton of OCR'd PDFs, but doing a keyword search through all of them is SSSSSSSSSSSSLLLLLOOOOOOOOOOWWWWWWWWW to the point that I would never want to do it unless absolutely necessary.  Definitely nothing like google.

Of course, in all honesty, I've never once had a desire to search an archive of video game magazines for something specific.  There's very little knowledge hidden in old game magazines that isn't more easily found by...using google.  But I guess it would be useful if you were writing a book or something and wanted offline primary sources.

On 9/17/2017 at 7:26 PM, kitsunebi77 said:

There are other things holding this back.  I have a ton of OCR'd PDFs, but doing a keyword search through all of them is SSSSSSSSSSSSLLLLLOOOOOOOOOOWWWWWWWWW to the point that I would never want to do it unless absolutely necessary.  Definitely nothing like google.

Of course, in all honesty, I've never once had a desire to search an archive of video game magazines for something specific.  There's very little knowledge hidden in old game magazines that isn't more easily found by...using google.  But I guess it would be useful if you were writing a book or something and wanted offline primary sources.

I took 27 magazines that are already PDFs in Philly's new torrent. They were Amiga through Micro User.
I ran the OCR tool and embedded an index for each of these 27 files and saved them to an SSD. I then searched these 27 magazines for 3 different words and used a stopwatch. I searched for the words "coin", "the", and "computer". Each search took 2 seconds or less.

If I assume every batch of 27 magazines takes up to 2 seconds and apply that to Philly's torrent of 1337 different magazines, then each of these words should take about 99 seconds to find across all 1337 magazines (1337 / 27 × 2). I'm currently embedding indexes in the second half of the PDFs found in Philly's torrent. I'm curious how accurate my estimate will be after adding more magazines to the search queue.
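The estimate above is a straight linear scale-up of the 27-magazine test. As a quick sketch (the function name is just for illustration, and it assumes search time grows in proportion to the number of magazines, which may not hold in practice):

```python
def estimate_search_seconds(total_magazines, sample_size=27, sample_seconds=2.0):
    """Scale a timed sample search linearly up to a larger collection."""
    return total_magazines / sample_size * sample_seconds


estimate = estimate_search_seconds(1337)
print(f"{estimate:.0f} seconds")  # 99 seconds
```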

I disagree about there being no use for searching the magazines. People often ask if anyone remembers a certain item in a magazine; they give us an approximate year and we can do a search. I would also like to search for magazine game reviews and mentions of upcoming games the writers promote. Sometimes hardware and games get mentioned once and then become vaporware. Besides this, I would just be more interested if I could control what I'm looking for.

 

 

 


Imagine if google took a minute and a half to do a search:lol:  Now imagine that you hadn't embedded an index for every mag already on an SSD, and you can see what I meant by slow.

OCR'ing a PDF makes searches possible.  But it doesn't make it easy to do, in my opinion.  If every magazine ever printed existed in html as a bunch of webpages spread all over the web, google would still get results a lot easier and a lot faster than searching your own computer's files.

But it's cool that you're planning on OCRing your collection.  If anyone has any obscure questions on which issues mentioned the Game.com, we'll send them your way.:P


I have an update on my results regarding this latest torrent dump of 1338 magazines recently released.

I converted all the Comic Book Reader files to PDF, complete with an embedded index, using Adobe Acrobat Pro. I looked for easier methods of converting the files and found that Comic Rack is capable of converting an entire directory automatically. Unfortunately, the DPI is reduced to 150 from the original. This is likely because the feature is considered a premium one in Acrobat Pro, and if free programs like Comic Rack did it for nothing, that would be considered money lost.

 

With all the magazines in one folder on a mechanical hard drive, I opened Acrobat and clicked Edit > Preferences, then selected “Search” in the left pane. In the right pane I changed the default “Maximum number of documents returned in Results” from 500 to 2000. I then enabled “Fast Find” to decrease subsequent search times, changed the default “Maximum Cache Size” to 10,000 MB, and closed the dialog.

Next I pressed Ctrl + Shift + F to open Advanced Search, selected “All PDF Documents in”, and browsed to the directory where the magazines are stored.

I searched for the word “computer”. The search took 5 minutes and returned 33,127 results. The second attempt took 42 seconds.

I searched for the word “the”. The search took 1:43 and returned 1.83 M results. The second attempt also took 1:43.

I learned that you can speed up searches across a large number of PDFs by creating an IDX file. This file contains a full-text index for an entire catalog of PDFs.
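For anyone wondering why a prebuilt index helps so much, here is a toy sketch of the general idea (an inverted index). This is not Acrobat's IDX format or how Acrobat works internally; the filenames and text are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, word):
    """Look a word up in the index instead of rescanning every document."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical documents standing in for OCR'd magazine text.
docs = {
    "antic_01.pdf": "The computer show had a coin-op arcade",
    "egm_138.pdf": "Preview of the new console",
}
index = build_index(docs)
print(search(index, "computer"))  # ['antic_01.pdf']
```

The slow part (reading every file) happens once when the index is built; each search afterwards is a single lookup.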

To do this I had to enable the tools first. Select View > Tool Sets > Manage Tool Sets > Edit. In the new window, in the left pane under “Choose tools to add”, select “Document Processing”, click “Manage Embedded Index”, and select “Add to Quick Tools Toolbar”. Also add “Full Text Index with Catalog” and save.

On the home screen, select Tools > Document Processing > Full Text Index with Catalog. In the new window select New Index, give it a name, add the directory where the magazines are stored, and select Build. This took about an hour for me. Make sure the index files are saved in the same directory as the magazines.

After you create the IDX file, open Advanced Search again with Ctrl + Shift + F. Select “More Options”. Under “Look In”, choose “Select Index”, pick the index you built, and then enter the search term you are looking for.

Now my searches have been sped up tremendously.  Searching for the word “computer” took 5 seconds with 33,127 results.

The word “the” took 1 min with 1.83 M results.

The reason the word “the” took 1 minute and not 5 seconds is not that it takes so long to find each instance; rather, it takes time to organize 1.83 million instances in the search window and provide the ability to jump to each location.
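Putting the before and after timings from this post side by side, as a quick sketch (the labels and the helper name are just for illustration):

```python
def speedup(before_seconds, after_seconds):
    """How many times faster the indexed search was."""
    return before_seconds / after_seconds


# (seconds without the IDX file, seconds with it), taken from the timings above.
timings = {
    "computer (first run)": (5 * 60, 5),
    "computer (cached second run)": (42, 5),
    "the": (1 * 60 + 43, 60),
}

for label, (before, after) in timings.items():
    print(f"{label}: {speedup(before, after):.1f}x faster")
```

That works out to roughly 60x for the first “computer” search, about 8.4x against the cached run, and only about 1.7x for “the”, where listing the results dominates.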
