pdf2puz?
  • A few online crosswords are perversely published as pdf documents, so you have to print them and waste paper and pencil. I realize that general-purpose display of pdfs is a complicated business, but I have this suspicion that the crosswords published this way have a very predictable format. If so, their interesting elements could be extracted and converted to a .puz file without great effort. With this feature in Black Ink, you could support subscriptions to the Jonesin' and Newsday puzzles.
  • That would be pretty cool, but I think I'd better focus on other features first. If somebody out there gets ambitious and can figure out how to build such a tool, I would be interested in including it in Black Ink.
  • Extracting the clues is easy: pdftotext (from the xpdf package) does all the heavy lifting. In my near-perfect ignorance of PDF, though, I haven't found an easy way to extract the diagram. The numbering of the clues gives a lot of information about the diagram -- I need to think about that.

    $ xpdf-3.02/xpdf/pdftotext crossword.pdf 
    $ head crossword.txt
    JONESIN'
    Across

    "Encyclopedic Knowledge"--what you might find on the spine.
    Down
    1 Title for Gandhi 2 Canadian craft 3 Time for a cookout 4 Center of a debate 5 Fastening device 6 Buck follower 7 Not so extraordinary 8 Eva and Zsa Zsa's sister 9 Looter's paradise 10 It's grounded in Australia 11 Roulette picks: abbr. 12 Compass dir. 15 Additive that sponsors NASCAR racers 18 Pawn 20 "___ Calling" (cancelled Fox show) 23 Commands 24 Song with the lyric "she really shows you all she can" 25 State lines? 28 Holds the title to 30 In the past 31 Invitation request 33 Jersey jersey wearers 35 Prefix for an ear doctor 36 Beloved beef 39 "___ Maria" 40 Some chickens 41 Prefix for appropriation 42 Colony member 45 Concept found in Hinduism 46 Lets in on the joke 48 Cremona closing 50 Reason to stop on a road trip 54 "Me ___" (1987 Roger Waters song) 55 "In ___" (Nirvana album) 57 Some vegans won't lick it 58 Sioux Falls is there: abbr. 59 Some Audi models

    by Matt Jones

    1 Mariah Carey's "The Emancipation of ___" 5 He created Oz 9 Auberjonois of "Boston Legal" 13 "Dead man's hand" cards 14 Word before major or minor 15 He flirts with Paula 16 Does some tailoring 17 Like broken or worn-out tools 19 Primed for parenthood, perhaps 21 Bull's taunter 22 Fond du ___, Wisconsin 23 Money for later 26 Month after avril 27 Skip-Bo relative 29 Like some justice 32 Mussorgsky's "Pictures ___ Exhibition" 34 Movie creature that's about two feet tall 37 Zone named for Dr. Grafenberg 38 Gradually adore 41 "SNL" rival 43 Drink with a lizard logo 44 Ship front 47 Momentarily 49 Prank someone's house, maybe 51 Singer DiFranco 52 Pigpen 53 Sketchy substitute for cash 56 Move quickly 58 Become noticeable, like old food in the fridge 61 Faint 64 Lines on city maps: abbr. 65 Brain output 66 Soldering tool 67 Arizona city 68 Join in space 69 Jarvis of the Denver Broncos 70 "Yo, over here!"
    $
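
The remark that clue numbering says a lot about the diagram can be made concrete with a rough Python sketch. Under standard crossword numbering, a candidate grid fixes exactly which numbers carry Across and Down clues, so a guessed grid can be checked against the numbers pulled from the pdftotext output. The function names, the 'X' convention for black squares, and the digit-token heuristic are assumptions made for illustration; none of this comes from the actual pdf2puz code.

```python
def clue_numbers(run):
    """Pull clue numbers out of one run-together line of clues.

    Heuristic (an assumption): a token is a clue number only if it is all
    digits and larger than the previous clue number. Digits with attached
    punctuation, such as "(1987", are skipped. A bare larger number inside
    a clue (e.g. "Top 40 hit") would still be misread.
    """
    numbers = []
    for token in run.split():
        if token.isdigit() and (not numbers or int(token) > numbers[-1]):
            numbers.append(int(token))
    return numbers


def number_grid(rows):
    """Return (across, down) clue numbers for a grid of equal-length strings.

    'X' marks a black square; any other character is a white square.
    """
    height, width = len(rows), len(rows[0])

    def white(r, c):
        return 0 <= r < height and 0 <= c < width and rows[r][c] != "X"

    across, down, n = [], [], 0
    for r in range(height):
        for c in range(width):
            if not white(r, c):
                continue
            # An entry starts where the previous square is black or off the
            # edge and the next square is white.
            starts_across = not white(r, c - 1) and white(r, c + 1)
            starts_down = not white(r - 1, c) and white(r + 1, c)
            if starts_across or starts_down:
                n += 1
                if starts_across:
                    across.append(n)
                if starts_down:
                    down.append(n)
    return across, down


def grid_matches(rows, across_run, down_run):
    """True if the grid's numbering agrees with both clue lists."""
    return number_grid(rows) == (clue_numbers(across_run), clue_numbers(down_run))
```

A grid recovered some other way (from an image, say) that fails `grid_matches` is either wrong or paired with a badly parsed clue list, which makes this a cheap sanity check before writing a .puz file.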
  • This project is coming right along. Getting a better clue list was just a matter of choosing the right pdftotext options. pdfimages gives me the grid as a Portable Pixmap (ppm) file, and I've just written a proof-of-concept program that analyzes the ppm and produces the grid (the black squares, anyway) as bad ASCII art.

    $ ppm2grid jonesin-000.ppm
    Pixmap: 1131 columns, 1131 rows
    Maximum line length is 1130, grid size is 15, interval is 75
    X XX
    X X
    X
    XXX
    X XX
    X XX
    X X
    XX XX
    X X
    XX X
    XX X
    XXX
    X
    X X
    XX X
    $

    Next I'll stir in the .puz code from joshisanerd.com/puz and my hacky solution will be complete.
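
For readers who want to try the pixmap step themselves, here is a minimal Python sketch of one way to find the black squares. It is not the ppm2grid source: it assumes an 8-bit binary (P6) PPM that contains only the square grid, takes the grid size as a parameter instead of inferring it from line lengths, and samples the center pixel of each cell. The function names and the brightness threshold of 128 are illustrative assumptions.

```python
def read_ppm(data):
    """Parse a binary (P6) PPM held in a bytes object.

    Returns (width, height, pixels), where pixels is the raw RGB byte string.
    """
    fields, pos = [], 0
    while len(fields) < 4:
        # Skip whitespace and '#' comment lines between header fields.
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":
            pos = data.index(b"\n", pos) + 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        fields.append(data[start:pos])
    magic = fields[0]
    width, height, maxval = int(fields[1]), int(fields[2]), int(fields[3])
    if magic != b"P6" or maxval > 255:
        raise ValueError("this sketch only handles 8-bit binary PPM (P6)")
    # Exactly one whitespace byte separates the header from the pixel data.
    return width, height, data[pos + 1:]


def black_squares(width, height, pixels, size, threshold=128):
    """Return the grid as a list of strings: 'X' for black, '.' for white.

    Only the center pixel of each cell is examined, so clue numbers in the
    cell corners are ignored; decorative fills could still fool it.
    """
    rows = []
    for grid_row in range(size):
        y = (2 * grid_row + 1) * height // (2 * size)
        row = ""
        for grid_col in range(size):
            x = (2 * grid_col + 1) * width // (2 * size)
            i = 3 * (y * width + x)  # three bytes (R, G, B) per pixel
            brightness = (pixels[i] + pixels[i + 1] + pixels[i + 2]) / 3
            row += "X" if brightness < threshold else "."
        rows.append(row)
    return rows
```

With a file like the one above, usage would look something like `black_squares(*read_ppm(open("jonesin-000.ppm", "rb").read()), size=15)`. Writing '.' for white squares also keeps the column positions visible in the printed grid.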
  • Wow - nice work!
  • Aw, shucks. It's easy enough with stuff like libnetpbm lying around and when you've been programming since the Nixon administration.
  • Ephraim - I wonder if you've made enough progress with this that you would mind sharing the resulting script? I have had another inquiry about supporting PDF and it might be nice to offer a pointer to a solution - even if it's a nerdy one :)
  • I hope to get back to this soon. I was in Barcelona on business all last week and shoveling snow this week, so there's been no progress.
  • Daniel, I've e-mailed you an early version of pdf2puz. It's only known to work on one puzzle, so I expect some bug reports.
  • Given half a dozen Jonesin' puzzle pdfs, the version I sent you chokes on three, in three different ways. I'm working on it.
  • Tons of progress over the holidays. The current version handles all the 2007 Jonesin' puzzles, all the available Newsday puzzles (about three months' worth), and some of the Random House sample puzzles. Decorative elements in that last group are a major PITA, yet it's fun to think about how to make the page scanning more robust.

    There are some surprising crossword PDFs on the web, such as these ethics training puzzles from the U.S. Office of Government Ethics.
  • This is really impressive, Ephraim. I haven't had a chance to look at your script yet, but it sounds like you're really opening a treasure trove of new puzzles.
  • The script's not very interesting, just 50 lines of dumb stuff. It's the C program (700 lines and growing) that does the real work.
  • I've posted the latest sources. Works great for Jonesin' and Newsday puzzles, mixed results on others.
  • Oh cool! I need to try this out. I recently tackled a similar problem, trying to convert the PDF for my gym's pool schedule into a reasonable text format.

    Daniel