Is there a simple way to identify if a pdf is scanned?
I have thousands of documents and some of them are scanned, so I need a script to test all the PDF files in a directory. Is there a simple way to do that?
- Most of the PDFs are reports, so they contain a lot of text.
They are all very different, but even in the scanned ones (examples below) some text can be found, because a rough OCR process was coupled to the scan.
- NotScanned
- Scanned1
- Scanned2
The proposal by sudodus in the comments below seems very interesting. Look at the difference between a scanned and a not-scanned PDF:
Scanned:
grep --color -a 'Image' AR-G1002.pdf
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 340615/Name/Obj13/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40452/Name/Obj18/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41680/Name/Obj23/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41432/Name/Obj28/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59084/Name/Obj33/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 472681/Name/Obj38/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 469340/Name/Obj43/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 371863/Name/Obj48/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 344092/Name/Obj53/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59416/Name/Obj58/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 48308/Name/Obj63/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 51564/Name/Obj68/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 63184/Name/Obj73/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40824/Name/Obj78/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 23320/Name/Obj83/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 31504/Name/Obj93/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 18996/Name/Obj98/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 292932/Name/Obj103/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 27720/Name/Obj108/Subtype/Image/Type/XObject/Width 1698>>stream
<rdf:li xml:lang="x-default">Image</rdf:li>
<rdf:li xml:lang="x-default">Image</rdf:li>
Not Scanned:
grep --color -a 'Image' AR-G1003.pdf
<</Lang(en-US)/MarkInfo<</Marked true>>/Metadata 167 0 R/Pages 2 0 R/StructTreeR<</Contents 4 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F4 11 0 R/F5 13 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/StructParents 0/Tabs/S/Type/<</Filter/FlateDecode/Length 5463>>stream
<</BaseFont/Times#20New#20Roman,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontD<</Ascent 891/AvgWidth 427/CapHeight 677/Descent -216/Flags 32/FontBBox[-558 -216 2000 677]/FontName/Times#20New#20Roman,Bold/FontWeight 700/ItalicAngle 0/Leadi<</BaseFont/Times#20New#20Roman/Encoding/WinAnsiEncoding/FirstChar 32/FontDescri<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontName/Times#20New#20Roman/FontWeight 400/ItalicAngle 0/Leading 42<</BaseFont/Arial,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 10 0<</Ascent 905/AvgWidth 479/CapHeight 728/Descent -210/Flags 32/FontBBox[-628 -210 2000 728]/FontName/Arial,Bold/FontWeight 700/ItalicAngle 0/Leading 33/MaxWidth<</BaseFont/Times#20New#20Roman,Italic/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 12 0 R/LastChar 118/Name/F4/Subtype/TrueType/Type/Font/Widths 164 0 <</Ascent 891/AvgWidth 402/CapHeight 694/Descent -216/Flags 32/FontBBox[-498 -216 1333 694]/FontName/Times#20New#20Roman,Italic/FontWeight 400/ItalicAngle -16.4<</BaseFont/Arial/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 14 0 R/La<</Ascent 905/AvgWidth 441/CapHeight 728/Descent -210/Flags 32/FontBBox[-665 -210 2000 728]/FontName/Arial/FontWeight 400/ItalicAngle 0/Leading 33/MaxWidth 2665<</Contents 16 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 7534>>streamarents 1/Tabs/S/Type/Page>>
<</Contents 18 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 6137>>streamarents 2/Tabs/S/Type/Page>>
<</Contents 20 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R/F6 21 0 R><</Filter/FlateDecode/Length 6533>>stream>>/StructParents 3/Tabs/S/Type/Page>>
<</BaseFont/Times#20New#20Roman/DescendantFonts 22 0 R/Encoding/Identity-H/Subty<</BaseFont/Times#20New#20Roman/CIDSystemInfo 24 0 R/CIDToGIDMap/Identity/DW 100<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontFile2 160 0 R/FontName/Times#20New#20Roman/FontWeight 400/Italic<</Contents 27 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</ExtGState<</GS28 28 0 R/GS29 29 0 R>>/Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F5 13 0 R/F6 21 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC<</Filter/FlateDecode/Length 5369>>streamge>>
In the scanned file the number of images is much larger (roughly one per page)!
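To turn that observation into a check, here is a minimal sketch assuming poppler-utils (pdfinfo and pdfimages) is installed; the 0.8 images-per-page threshold is only a guess to tune on your own files:
#!/bin/bash
# Estimate images per page for every PDF in the current directory.
for f in *.pdf; do
    pages=$(pdfinfo "$f" 2>/dev/null | awk '/^Pages:/ {print $2}')
    # pdfimages -list prints a two-line header, then one line per image
    images=$(pdfimages -list "$f" 2>/dev/null | tail -n +3 | wc -l)
    [ -z "$pages" ] || [ "$pages" -eq 0 ] && continue
    # roughly one image per page suggests a scanned document
    if [ $((images * 10 / pages)) -ge 8 ]; then
        echo "scanned?      $f ($images images, $pages pages)"
    else
        echo "not scanned?  $f ($images images, $pages pages)"
    fi
done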
command-line pdf
Do you mean whether they're text or images?
– DK Bose
Nov 19 at 12:03
Why do you want to know if a pdf file is scanned or not? How do you intend to use that information?
– sudodus
Nov 19 at 16:04
@sudodus asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you make a difference between such files and text files? Do you know the source of your PDFs?
– pipe
Nov 19 at 16:06
Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
– dessert
Nov 19 at 19:11
If a pdf file contains an image (inserted in a document alongside text or as whole pages, a 'scanned pdf'), the file often (maybe always) contains the string /Image/, which can be found with the command line grep --color -a 'Image' filename.pdf. This will separate files that contain only text from those containing images (full-page images as well as text pages with small logos and medium-sized illustrating pictures).
– sudodus
Nov 19 at 19:21
5 Answers
Accepted answer:
Shellscript
If a pdf file contains an image (inserted in a document alongside text or as whole pages, a 'scanned pdf'), the file often (maybe always) contains the string /Image/. In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).
I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.
#!/bin/bash

echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
 exit
fi
# target directories for the sorted files
mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"

for file in *.pdf
do
 # does the file contain at least one image object?
 if grep -aq '/Image/' "$file"
 then
  image=true
 else
  image=false
 fi
 # does the file contain text content?
 if grep -aq '/Text' "$file"
 then
  text=true
 else
  text=false
 fi
 # sort the file according to what was found
 if $image && $text
 then
  mv "$file" "s-and-t"
 elif $image
 then
  mv "$file" "scanned"
 elif $text
 then
  mv "$file" "text"
 else
  echo "$file undecided"
 fi
done
Make the shellscript executable,
chmod ugo+x pdf-text-or-image
Change directory to where you have the pdf files and run the shellscript.
Identified files are moved to the following subdirectories:
- scanned
- text
- s-and-t (for documents with both [scanned?] images and text content)
Unidentified file objects, 'UFOs', remain in the current directory.
Test
I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some of my own pdf files (which I created using LibreOffice Impress).
$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y
$ ls -1 *
pdf-text-or-image
pdf-text-or-image0
s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf
scanned:
AR-G1002.pdf
text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf
Let us hope that
- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images
instead of redirecting to /dev/null you can just use grep -q
– phuclv
Nov 20 at 1:07
@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because grep -q exits immediately with zero status if any match is found (instead of searching through the whole file).
– sudodus
Nov 20 at 12:22
Answer:
- Put all the .pdf files in one folder.
- Make sure there is no .txt file in that folder.
- In a terminal, change directory to that folder with
cd <path to dir>
- Make one more directory for the non-scanned files. Example:
mkdir ./x
for file in *.pdf; do
    # pdftotext with '-' writes the extracted text to stdout instead of a .txt file;
    # if any text is extracted, treat the file as not scanned and move it to ./x
    if [ -n "$(pdftotext "$file" - 2>/dev/null)" ]; then mv "$file" ./x; fi
done
All the scanned PDF files will remain in the folder, and the other files will be moved to the other folder.
this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
– DanielTheRocketMan
Nov 19 at 14:34
Scanned PDFs often contain the OCRed text content as well, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
– Joey
Nov 19 at 15:00
Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
– jamesqf
Nov 19 at 17:26
@DanielTheRocketMan The version of the PDF file likely has an impact on the tool you are using to select text. The output of file pdf-filename.pdf will produce a version number. I was unable to search for specific text in BR-L1411-3.pdf (BR-L1411-3.pdf: PDF document, version 1.3) but was able to search for text and get one or more matches in both of the other files you provided, which are versions 1.5 and 1.6. I used PDF-XChange Viewer to search these files but had similar results with evince; the version 1.3 document matched nothing.
– Elder Geek
Nov 19 at 20:10
@DanielTheRocketMan If that's the case, you might find it helpful for your project to sort the documents by version using the output of file. Although I, as it seems others are too, am still unclear on exactly what you are attempting to accomplish.
– Elder Geek
Nov 19 at 20:17
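Following up on that version idea, a quick sketch (assuming the file utility is available) that lists each PDF with its reported version so the documents can be grouped:
# Print each PDF with the version reported by file(1),
# e.g. "AR-G1002.pdf: PDF document, version 1.3", sorted by version.
for f in *.pdf; do
    file "$f"
done | sort -t, -k2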
Answer:
Hobbyist offers a good solution if the document collection's scanned documents do not have text added with optical character recognition (OCR). If this is a possibility, you may want to do some scripting that reads the output of pdfinfo -meta and checks for the tool used to create the file, or employ a Python routine that uses one of the Python libraries to examine them. Searching for text with a tool like strings will be unreliable because PDF content can be compressed. And checking the creation tool is not failsafe, either, since PDF pages can be combined; I routinely combine PDF text documents with scanned images to keep things together.
I'm sorry that I am unable to offer specific suggestions. It's been a while since I poked at the PDF internal structure, but depending on how stringent your requirements are, you may want to know that it's kind of complicated. Good luck!
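As a rough illustration of the metadata idea, a sketch that reads the Creator/Producer fields with pdfinfo and flags files created by typical scanner software; the pattern of scanner names is only an assumption and would need to be extended for your own devices:
#!/bin/bash
# Flag PDFs whose Creator/Producer metadata mentions scanner software.
# Requires pdfinfo from poppler-utils.
for f in *.pdf; do
    meta=$(pdfinfo "$f" 2>/dev/null | grep -iE '^(Creator|Producer):')
    if echo "$meta" | grep -qiE 'canon|epson|scan'; then
        echo "likely scanned:       $f"
    else
        echo "unknown/not scanned:  $f"
    fi
done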
I am also trying to use Python, but it is not trivial to know whether a PDF is scanned or not. The point is that even documents in which you cannot select text produce some text when converted to txt. For instance, I am using pdfminer in Python and I can find some text in the conversion even for PDFs where the select tool does not work.
– DanielTheRocketMan
Nov 19 at 15:03
Answer:
If this is more about actually detecting whether a PDF was created by scanning, rather than whether it has images instead of text, then you might need to dig into the metadata of the file, not just its content.
In general, for the files I could find on my computer and your test files, the following is true:
- Scanned files have fewer than 1000 chars/page, vs. non-scanned ones, which always have more than 1000 chars/page
- Multiple independent scanned files had "Canon" listed as the PDF creator, probably referencing Canon scanner software
- PDFs with "Microsoft Word" as creator are likely not to be scanned, as they are Word exports. But someone could scan to Word and then export to PDF; some people have very strange workflows.
I'm using Windows at the moment, so I used node.js for the following example:
const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");

const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;
const debug = DEBUG ? console.error : () => { };

(async () => {
    const pdfs = (await fs.readdir(".")).filter((fname) => { return fname.endsWith(".pdf"); });
    for (let i = 0, l = pdfs.length; i < l; ++i) {
        const pdffilename = pdfs[i];
        try {
            debug("\n\nFILE: ", pdffilename);
            const buffer = await fs.readFile(pdffilename);
            const data = await pdf_parse(buffer);
            if (!data.info)
                data.info = {};
            if (!data.metadata) {
                data.metadata = {
                    _metadata: {}
                };
            }
            // PDF info
            debug(data.info);
            // PDF metadata
            debug(data.metadata);
            // text length
            const textLen = data.text ? data.text.length : 0;
            const textPerPage = textLen / data.numpages;
            debug("Text length: ", textLen);
            debug("Chars per page: ", textPerPage);
            // PDF.js version
            // check https://mozilla.github.io/pdf.js/getting_started/
            debug(data.version);
            if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
                console.log(path.resolve(".", pdffilename));
            }
        }
        catch (e) {
            if (STRICT && !DEBUG) {
                console.error("Failed to evaluate " + pdffilename);
            }
            debug("Failed to evaluate " + pdffilename);
            debug(e.stack);
            if (STRICT) {
                process.exit(1);
            }
        }
    }
})();

const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity of return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;

function evalScanned(pdfdata, textLen, textPerPage) {
    if (textPerPage < 300 && pdfdata.numpages > 1) {
        // really low number, definitely not a text pdf
        return IS_SCANNED;
    }
    // definitely has enough text;
    // might be scanned but OCRed,
    // so we return this if no
    // suspicion of scanning is found
    let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
    if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
        // this is always scanned, Canon is a scanner brand name
        return IS_SCANNED;
    }
    return implicitAssumption;
}
To run it, you need to have Node.js installed (should be a single command) and you also need to call:
npm install mz pdf-parse
Usage:
node howYouNamedIt.js [scanned] [debug] [strict]
- scanned: show PDFs thought to be scanned (otherwise shows the not-scanned ones)
- debug: shows debug info such as metadata and error stack traces
- strict: kills the program on the first error
This example is not a finished solution, but with the debug flag you get some insight into the meta information of a file:
FILE: BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
IsAcroFormPresent: false,
IsXFAPresent: false,
Creator: 'Canon ',
Producer: ' ',
CreationDate: 'D:20131212150500-03'00'',
ModDate: 'D:20140709104225-03'00'' }
Metadata {
_metadata:
{ 'xmp:createdate': '2013-12-12T15:05-03:00',
'xmp:creatortool': 'Canon',
'xmp:modifydate': '2014-07-09T10:42:25-03:00',
'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
'pdf:producer': '',
'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
'dc:format': 'application/pdf' } }
Text length: 772
Chars per page: 2
1.10.100
D:\webso-odpovedi\pdf\BR-L1411-3-scanned.pdf
The naive function that I wrote has 100% success on the documents that I could find on my computer (including your samples). I named the files based on what their status was before running the program, to make it possible to see whether the results are correct.
D:\xxxx\pdf>node detect_scanned.js scanned
D:\xxxx\pdf\AR-G1002-scanned.pdf
D:\xxxx\pdf\AR-G1002_scanned.pdf
D:\xxxx\pdf\BR-L1411-3-scanned.pdf
D:\xxxx\pdf\WHO_TRS_696-scanned.pdf
D:\xxxx\pdf>node detect_scanned.js
D:\xxxx\pdf\AR-G1003-not-scanned.pdf
D:\xxxx\pdf\ASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:\xxxx\pdf\MULTIMODE ABSORBER-not-scanned.pdf
D:\xxxx\pdf\ReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf
You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can pass the output of the program to other programs; it always prints one full path per line.
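For instance, assuming a Unix-like shell and the same script name as above, the paths could be fed straight into a move command:
mkdir -p scanned
# move every file the script classifies as scanned into ./scanned
node detect_scanned.js scanned | while IFS= read -r f; do
    mv "$f" scanned/
done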
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
Nov 20 at 18:04
Answer:
Two ways I can think of:
Using the select-text tool: in a scanned PDF the text cannot be selected; instead a box will appear. You can use this fact to create the script. I know there is a way in C++/Qt, though I am not sure about Linux.
Search for a word in the file: in a non-scanned PDF your search will work, but not in a scanned file. You just need to find some word common to all PDFs, or, I would rather say, search for the letter 'e' in all the PDFs. It has the highest frequency distribution, so chances are you will find it in all the documents that have text (unless it's Gadsby).
eg
grep -rnw '/path/to/pdf/' -e 'e'
Use any of the text processing tools
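Note that grepping the raw PDF bytes as above will often miss text because content streams are usually compressed; a variant of the same idea that extracts the text first, assuming pdftotext from poppler-utils is installed:
for f in *.pdf; do
    # pdftotext writes the extracted text to stdout when the output file is '-'
    if pdftotext "$f" - 2>/dev/null | grep -q 'e'; then
        echo "has extractable text:  $f"
    else
        echo "no extractable text (possibly scanned):  $f"
    fi
done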
a scanned PDF can also have selectable text, because OCR is not a strange thing nowadays and even many free PDF readers have an OCR feature
– phuclv
Nov 19 at 16:15
@phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
– jamesqf
Nov 19 at 17:30
@jamesqf please look at the examples above. They are scanned PDFs. Most of the text I cannot retrieve using a conventional pdfminer.
– DanielTheRocketMan
Nov 19 at 17:50
I think the OP needs to rethink/rephrase the definition of scanned in that case, or stop using Acrobat X, which takes a scanned copy and treats it as OCR rather than an image
– swapedoc
Nov 19 at 20:07
add a comment |
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
3
down vote
accepted
Shellscript
If a
pdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
.In the same way you can search for the string
/Text
to tell if a pdf file contains text (not scanned).
I made the shellscript pdf-text-or-image
, and it might work in most cases with your files. The shellscript looks for the text strings /Image/
and /Text
in the pdf
files.
#!/bin/bash
echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi
mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"
for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi
if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done
Make the shellscript executable,
chmod ugo+x pdf-text-or-image
Change directory to where you have the pdf
files and run the shellscript.
Identified files are moved to the following subdirectories
scanned
text
s-and-t
(for documents with both [scanned?] images and text content)
Unidentified file objects, 'UFOs', remain in the current directory.
Test
I tested the shellscript with two of your files, AR-G1002.pdf
and AR-G1003.pdf
, and with some own pdf
files (that I have created using Libre Office Impress).
$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y
$ ls -1 *
pdf-text-or-image
pdf-text-or-image0
s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf
scanned:
AR-G1002.pdf
text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf
Let us hope that
- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images
instead of redirecting to /dev/null you can just usegrep -q
– phuclv
Nov 20 at 1:07
1
@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, becausegrep -q
exits immediately with zero status if any match is found (instead of seaching through the whole files).
– sudodus
Nov 20 at 12:22
add a comment |
up vote
3
down vote
accepted
Shellscript
If a
pdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
.In the same way you can search for the string
/Text
to tell if a pdf file contains text (not scanned).
I made the shellscript pdf-text-or-image
, and it might work in most cases with your files. The shellscript looks for the text strings /Image/
and /Text
in the pdf
files.
#!/bin/bash
echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi
mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"
for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi
if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done
Make the shellscript executable,
chmod ugo+x pdf-text-or-image
Change directory to where you have the pdf
files and run the shellscript.
Identified files are moved to the following subdirectories
scanned
text
s-and-t
(for documents with both [scanned?] images and text content)
Unidentified file objects, 'UFOs', remain in the current directory.
Test
I tested the shellscript with two of your files, AR-G1002.pdf
and AR-G1003.pdf
, and with some own pdf
files (that I have created using Libre Office Impress).
$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y
$ ls -1 *
pdf-text-or-image
pdf-text-or-image0
s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf
scanned:
AR-G1002.pdf
text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf
Let us hope that
- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images
instead of redirecting to /dev/null you can just usegrep -q
– phuclv
Nov 20 at 1:07
1
@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, becausegrep -q
exits immediately with zero status if any match is found (instead of seaching through the whole files).
– sudodus
Nov 20 at 12:22
add a comment |
up vote
3
down vote
accepted
up vote
3
down vote
accepted
Shellscript
If a
pdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
.In the same way you can search for the string
/Text
to tell if a pdf file contains text (not scanned).
I made the shellscript pdf-text-or-image
, and it might work in most cases with your files. The shellscript looks for the text strings /Image/
and /Text
in the pdf
files.
#!/bin/bash
echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi
mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"
for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi
if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done
Make the shellscript executable,
chmod ugo+x pdf-text-or-image
Change directory to where you have the pdf
files and run the shellscript.
Identified files are moved to the following subdirectories
scanned
text
s-and-t
(for documents with both [scanned?] images and text content)
Unidentified file objects, 'UFOs', remain in the current directory.
Test
I tested the shellscript with two of your files, AR-G1002.pdf
and AR-G1003.pdf
, and with some own pdf
files (that I have created using Libre Office Impress).
$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y
$ ls -1 *
pdf-text-or-image
pdf-text-or-image0
s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf
scanned:
AR-G1002.pdf
text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf
Let us hope that
- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images
edited Nov 20 at 12:16
answered Nov 19 at 20:59
sudodus
instead of redirecting to /dev/null you can just use grep -q
– phuclv
Nov 20 at 1:07
1
@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because grep -q exits immediately with zero status if any match is found (instead of searching through the whole files).
– sudodus
Nov 20 at 12:22
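(For reference, a quick sketch of the two equivalent forms the comments discuss:)
grep -a '/Image/' file.pdf > /dev/null   # redirect the matches away, use the exit status
grep -aq '/Image/' file.pdf              # quiet mode: exits at the first match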
up vote
6
down vote
- Put all the .pdf files in one folder.
- Make sure there is no .txt file in that folder.
- In a terminal, change directory to that folder with
cd <path to dir>
- Make one more directory for the non-scanned files. Example:
mkdir ./x
for file in *.pdf; do
    # with '-' pdftotext writes the extracted text to stdout;
    # if any text comes out, the file is treated as not scanned and moved to ./x
    if [ "$(pdftotext "$file" - 2>/dev/null)x" != "x" ] ; then mv "$file" ./x; fi
done
All the scanned pdf files will remain in the folder and the other files will be moved to the ./x folder.
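To check a single file by hand with the same idea (just a sketch; AR-G1002.pdf is one of the scanned examples from the question), count the non-whitespace characters pdftotext can extract:
pdftotext AR-G1002.pdf - | tr -d '[:space:]' | wc -c
A scanned file without a text layer prints 0 or a number close to it.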
this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
– DanielTheRocketMan
Nov 19 at 14:34
8
Scanned PDFs often contain the OCRed text content as well, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
– Joey
Nov 19 at 15:00
2
Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
– jamesqf
Nov 19 at 17:26
1
@DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output of file pdf-filename.pdf will produce a version number. I was unable to search for specific text in BR-L1411-3.pdf (BR-L1411-3.pdf: PDF document, version 1.3) but was able to search for text in both of the other files you provided, which are version 1.5 and 1.6, and get one or more matches. I used PDF XChange viewer to search these files but had similar results with evince. The version 1.3 document matched nothing.
– Elder Geek
Nov 19 at 20:10
1
@DanielTheRocketMan If that's the case you might find sorting the documents by version using the output of file helpful in completing your project, although I, as it seems others are too, am still unclear on exactly what you are attempting to accomplish.
– Elder Geek
Nov 19 at 20:17
edited Nov 19 at 19:03
dessert
answered Nov 19 at 13:35
Hobbyist
up vote
1
down vote
Hobbyist offers a good solution if the document collection's scanned documents do not have text added with optical character recognition (OCR). If this is a possibility, you may want to do some scripting that reads the output of pdfinfo -meta and checks for the tool used to create the file, or employ a Python routine that uses one of the Python libraries to examine them. Searching for text with a tool like strings will be unreliable because PDF content can be compressed. And checking the creation tool is not failsafe either, since PDF pages can be combined; I routinely combine PDF text documents with scanned images to keep things together.
I'm sorry that I am unable to offer specific suggestions. It's been a while since I poked at the PDF internal structure, but depending on how stringent your requirements are, you may want to know that it's kind of complicated. Good luck!
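As a rough illustration of the metadata idea (just a sketch; the keyword list is only a guess at what scanner software might write into Creator/Producer):
#!/bin/bash
# list pdfs whose Creator/Producer metadata mentions typical scanner software
# the keywords below are assumptions, not a definitive list
for f in *.pdf; do
 meta=$(pdfinfo "$f" 2>/dev/null | grep -iE '^(Creator|Producer):')
 if echo "$meta" | grep -qiE 'scan|canon|epson|xerox'; then
  echo "$f looks scanned: $meta"
 fi
done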
2
I am also trying to use python, but it is not trivial to know whether a pdf is scanned or not. The point is that even documents in which you cannot select text present some text when converted to txt. For instance, I am using pdfminer in Python and I can find some text in the conversion even for pdfs where the select tool does not work.
– DanielTheRocketMan
Nov 19 at 15:03
answered Nov 19 at 14:55
ichabod
up vote
1
down vote
If this is more about actually detecting whether a PDF was created by scanning, rather than whether it contains images instead of text, then you might need to dig into the file's metadata, not just its content.
In general, for the files I could find on my computer and your test files, the following is true:
- Scanned files have fewer than 1000 chars/page, whereas non-scanned ones always have more than 1000 chars/page.
- Multiple independent scanned files had "Canon" listed as the PDF creator, probably referencing Canon scanner software.
- PDFs with "Microsoft Word" as creator are likely not to be scanned, as they are Word exports. But someone could scan to Word and then export to PDF - some people have very strange workflows.
I'm using Windows at the moment, so I used Node.js for the following example:
const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");

const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;
const debug = DEBUG ? console.error : () => { };

(async () => {
    const pdfs = (await fs.readdir(".")).filter((fname) => { return fname.endsWith(".pdf") });
    for (let i = 0, l = pdfs.length; i < l; ++i) {
        const pdffilename = pdfs[i];
        try {
            debug("\n\nFILE: ", pdffilename);
            const buffer = await fs.readFile(pdffilename);
            const data = await pdf_parse(buffer);
            if (!data.info)
                data.info = {};
            if (!data.metadata) {
                data.metadata = {
                    _metadata: {}
                };
            }
            // PDF info
            debug(data.info);
            // PDF metadata
            debug(data.metadata);
            // text length
            const textLen = data.text ? data.text.length : 0;
            const textPerPage = textLen / (data.numpages);
            debug("Text length: ", textLen);
            debug("Chars per page: ", textLen / data.numpages);
            // PDF.js version
            // check https://mozilla.github.io/pdf.js/getting_started/
            debug(data.version);
            if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
                console.log(path.resolve(".", pdffilename));
            }
        }
        catch (e) {
            if (STRICT && !DEBUG) {
                console.error("Failed to evaluate " + pdffilename);
            }
            else {
                debug("Failed to evaluate " + pdffilename);
                debug(e.stack);
            }
            if (STRICT) {
                process.exit(1);
            }
        }
    }
})();

const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity of return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;

function evalScanned(pdfdata, textLen, textPerPage) {
    if (textPerPage < 300 && pdfdata.numpages > 1) {
        // really low number, definitely not a text pdf
        return IS_SCANNED;
    }
    // definitely has enough text
    // might be scanned but OCRed
    // we return this if no
    // suspicion of scanning is found
    let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
    if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
        // this is always scanned, Canon is a brand name
        return IS_SCANNED;
    }
    return implicitAssumption;
}
To run it, you need to have Node.js installed (should be a single command) and you also need to call:
npm install mz pdf-parse
Usage:
node howYouNamedIt.js [scanned] [debug] [strict]
- scanned: show PDFs thought to be scanned (otherwise it shows the not-scanned ones)
- debug: show debug info such as metadata and error stack traces
- strict: kill the program on the first error
This example is not a finished solution, but with the debug flag you get some insight into the meta information of a file:
FILE: BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
IsAcroFormPresent: false,
IsXFAPresent: false,
Creator: 'Canon ',
Producer: ' ',
CreationDate: 'D:20131212150500-03\'00\'',
ModDate: 'D:20140709104225-03\'00\'' }
Metadata {
_metadata:
{ 'xmp:createdate': '2013-12-12T15:05-03:00',
'xmp:creatortool': 'Canon',
'xmp:modifydate': '2014-07-09T10:42:25-03:00',
'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
'pdf:producer': '',
'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
'dc:format': 'application/pdf' } }
Text length: 772
Chars per page: 2
1.10.100
D:\webso-odpovedi\pdf\BR-L1411-3-scanned.pdf
The naive function that I wrote has 100% success on the documents that I could find on my computer (including your samples). I named the files based on what their status was before running the program, to make it possible to see if results are correct.
D:\xxxx\pdf>node detect_scanned.js scanned
D:\xxxx\pdf\AR-G1002-scanned.pdf
D:\xxxx\pdf\AR-G1002_scanned.pdf
D:\xxxx\pdf\BR-L1411-3-scanned.pdf
D:\xxxx\pdf\WHO_TRS_696-scanned.pdf

D:\xxxx\pdf>node detect_scanned.js
D:\xxxx\pdf\AR-G1003-not-scanned.pdf
D:\xxxx\pdf\ASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:\xxxx\pdf\MULTIMODE ABSORBER-not-scanned.pdf
D:\xxxx\pdf\ReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf
You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can pass the output of the program to other programs; it will always print one full path per line.
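On Linux, for instance, the one-path-per-line output can be piped straight into another command (a sketch; the target directory name is made up):
node detect_scanned.js scanned | while IFS= read -r f; do
 mv "$f" /path/to/scanned-pdfs/
done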
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
Nov 20 at 18:04
answered Nov 19 at 21:50
Tomáš Zato
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
Nov 20 at 18:04
add a comment |
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
Nov 20 at 18:04
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
Nov 20 at 18:04
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
Nov 20 at 18:04
add a comment |
up vote
0
down vote
2 ways I can think of:
Using the select text tool: in a scanned PDF the text cannot be selected; rather, a selection box will appear. You can use this fact to create the script. I know in C++ Qt there is a way, not sure about Linux though.
Search for a word in the file: in a non-scanned PDF your search will work, however not in a scanned file. You just need to find some words common to all PDFs, or I would rather say search for the letter 'e' in all the PDFs. It has the highest frequency distribution, so chances are you will find it in all the documents which have text (unless it's Gadsby).
eg
grep -rnw '/path/to/pdf/' -e 'e'
Use any of the text processing tools.
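Since grep on the raw pdf can miss text stored in compressed streams, a variant of the same idea that extracts the text first (just a sketch, not tested on the question's files):
for f in /path/to/pdf/*.pdf; do
 pdftotext "$f" - 2>/dev/null | grep -q 'e' || echo "$f: no searchable text"
done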
1
a scanned PDF can also have selectable text because OCR is not a strange thing nowadays and even many free PDF readers have an OCR feature
– phuclv
Nov 19 at 16:15
@phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
– jamesqf
Nov 19 at 17:30
1
@jamesqf please look at the example above. They are scanned pdfs. I cannot retrieve most of the text using a conventional pdfminer.
– DanielTheRocketMan
Nov 19 at 17:50
1
I think the OP needs to rethink/rephrase the definition of 'scanned' in that case, or stop using Acrobat X, which takes a scanned copy and treats it as OCR rather than as an image
– swapedoc
Nov 19 at 20:07
edited Nov 19 at 20:23
phuclv
answered Nov 19 at 12:32
swapedoc
7
Do you mean whether they're text or images?
– DK Bose
Nov 19 at 12:03
8
Why do you want to know, if a pdf file is scanned or not? How do you intend to use that information?
– sudodus
Nov 19 at 16:04
4
@sudodus asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you distinguish between such files and text files? Do you know the source of your PDFs?
– pipe
Nov 19 at 16:06
1
Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
– dessert
Nov 19 at 19:11
1
If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/, which can be found with the command line grep --color -a 'Image' filename.pdf. This will separate files which contain only text from those containing images (full page images as well as text pages with small logos and medium-sized illustrating pictures).
– sudodus
Nov 19 at 19:21