There has long been a pressing need for code that extracts the text and images from a PDF file. Although online tools can convert PDF files into text and other formats, a programmatic way to extract PDF content from PHP was hard to come by until a useful class, “class.pdf2text.php”, became available.
Extract Text and Images from a PDF file Using PHP
With this class, you can not only read and use the content of a PDF file in a web application, but also check whether a specific text string is present in the file.
Here is the class file code “class.pdf2text.php”. Include this file in each example.
<?php
/*
SYNTAX:
    include('class.pdf2text.php');
    $a = new PDF2Text();
    $a->setFilename('test.pdf');
    $a->decodePDF();
    echo $a->output();

ALTERNATIVES:
    Other excellent options to search within a PDF:
    - Apache PDFbox (http://pdfbox.apache.org/). An open source Java solution
    - pdflib TET (http://www.pdflib.com/products/tet/)
    - Online converter: http://snowtide.com/PDFTextStream
*/

class PDF2Text {
    // Some settings
    public $multibyte = 4; // Use setUnicode(TRUE|FALSE)
    public $convertquotes = ENT_QUOTES; // ENT_COMPAT (double-quotes), ENT_QUOTES (Both), ENT_NOQUOTES (None)
    public $showprogress = true; // TRUE if you have problems with time-out

    // Variables
    public $filename = '';
    public $decodedtext = '';

    function setFilename($filename) {
        // Reset
        $this->decodedtext = '';
        $this->filename = $filename;
    }

    function output($echo = false) {
        if ($echo)
            echo $this->decodedtext;
        else
            return $this->decodedtext;
    }

    function setUnicode($input) {
        // 4 for unicode. But 2 should work in most cases just fine
        if ($input == true)
            $this->multibyte = 4;
        else
            $this->multibyte = 2;
    }

    function decodePDF() {
        // Read the data from pdf file
        $infile = @file_get_contents($this->filename);
        if (empty($infile))
            return "";

        // Get all text data.
        $transformations = array();
        $texts = array();

        // Get the list of all objects.
        preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile . "endobj\r", $objects);
        $objects = @$objects[1];

        // Select objects with streams.
        for ($i = 0; $i < count($objects); $i++) {
            $currentObject = $objects[$i];

            // Prevent time-out (set_time_limit() needs an argument; 0 removes the limit)
            @set_time_limit(0);
            if ($this->showprogress) {
                // echo ". ";
                flush();
                ob_flush();
            }

            // Check if an object includes data stream.
            if (preg_match("#stream[\n|\r](.*)endstream[\n|\r]#ismU", $currentObject . "endstream\r", $stream)) {
                $stream = ltrim($stream[1]);

                // Check object parameters and look for text data.
                $options = $this->getObjectOptions($currentObject);

                if (!(empty($options["Length1"]) && empty($options["Type"]) && empty($options["Subtype"])))
                // if ( $options["Image"] && $options["Subtype"] )
                // if (!(empty($options["Length1"]) && empty($options["Subtype"])) )
                    continue;

                // Hack, length doesnt always seem to be correct
                unset($options["Length"]);

                // So, we have text data. Decode it.
                $data = $this->getDecodedStream($stream, $options);

                if (strlen($data)) {
                    if (preg_match_all("#BT[\n|\r](.*)ET[\n|\r]#ismU", $data . "ET\r", $textContainers)) {
                        $textContainers = @$textContainers[1];
                        $this->getDirtyTexts($texts, $textContainers);
                    } else
                        $this->getCharTransformations($transformations, $data);
                }
            }
        }

        // Analyze text blocks taking into account character transformations and return results.
        $this->decodedtext = $this->getTextUsingTransformations($texts, $transformations);
    }

    function decodeAsciiHex($input) {
        $output = "";
        $isOdd = true;
        $isComment = false;

        for ($i = 0, $codeHigh = -1; $i < strlen($input) && $input[$i] != '>'; $i++) {
            $c = $input[$i];

            if ($isComment) {
                // Double quotes are required so that "\r" and "\n" are real control characters
                if ($c == "\r" || $c == "\n")
                    $isComment = false;
                continue;
            }

            switch ($c) {
                case "\0": case "\t": case "\r": case "\f": case "\n": case ' ':
                    break;
                case '%':
                    $isComment = true;
                    break;
                default:
                    $code = hexdec($c);
                    if ($code === 0 && $c != '0')
                        return "";

                    if ($isOdd)
                        $codeHigh = $code;
                    else
                        $output .= chr($codeHigh * 16 + $code);

                    $isOdd = !$isOdd;
                    break;
            }
        }

        // The data must be terminated by '>'
        if ($i >= strlen($input) || $input[$i] != '>')
            return "";

        // A dangling high nibble (odd number of digits) is padded with a zero nibble
        if (!$isOdd)
            $output .= chr($codeHigh * 16);

        return $output;
    }

    function decodeAscii85($input) {
        $output = "";
        $isComment = false;
        $ords = array();

        for ($i = 0, $state = 0; $i < strlen($input) && $input[$i] != '~'; $i++) {
            $c = $input[$i];

            if ($isComment) {
                if ($c == "\r" || $c == "\n")
                    $isComment = false;
                continue;
            }

            if ($c == "\0" || $c == "\t" || $c == "\r" || $c == "\f" || $c == "\n" || $c == ' ')
                continue;
            if ($c == '%') {
                $isComment = true;
                continue;
            }
            if ($c == 'z' && $state === 0) {
                $output .= str_repeat(chr(0), 4);
                continue;
            }
            if ($c < '!' || $c > 'u')
                return "";

            $code = ord($input[$i]) & 0xff;
            $ords[$state++] = $code - ord('!');

            if ($state == 5) {
                $state = 0;
                for ($sum = 0, $j = 0; $j < 5; $j++)
                    $sum = $sum * 85 + $ords[$j];
                for ($j = 3; $j >= 0; $j--)
                    $output .= chr(($sum >> ($j * 8)) & 0xff);
            }
        }

        if ($state === 1)
            return "";
        elseif ($state > 1) {
            for ($i = 0, $sum = 0; $i < $state; $i++)
                $sum += ($ords[$i] + ($i == $state - 1)) * pow(85, 4 - $i);
            // Append each remaining byte directly; a loose comparison against
            // false would wrongly discard the byte "0"
            for ($i = 0; $i < $state - 1; $i++)
                $output .= chr(($sum >> ((3 - $i) * 8)) & 0xff);
        }

        return $output;
    }

    function decodeFlate($data) {
        return @gzuncompress($data);
    }

    function getObjectOptions($object) {
        $options = array();

        if (preg_match("#<<(.*)>>#ismU", $object, $options)) {
            $options = explode("/", $options[1]);
            @array_shift($options);

            $o = array();
            for ($j = 0; $j < @count($options); $j++) {
                $options[$j] = preg_replace("#\s+#", " ", trim($options[$j]));
                if (strpos($options[$j], " ") !== false) {
                    $parts = explode(" ", $options[$j]);
                    $o[$parts[0]] = $parts[1];
                } else
                    $o[$options[$j]] = true;
            }
            $options = $o;
            unset($o);
        }

        return $options;
    }

    function getDecodedStream($stream, $options) {
        $data = "";
        if (empty($options["Filter"]))
            $data = $stream;
        else {
            $length = !empty($options["Length"]) ? $options["Length"] : strlen($stream);
            $_stream = substr($stream, 0, $length);

            foreach ($options as $key => $value) {
                if ($key == "ASCIIHexDecode")
                    $_stream = $this->decodeAsciiHex($_stream);
                elseif ($key == "ASCII85Decode")
                    $_stream = $this->decodeAscii85($_stream);
                elseif ($key == "FlateDecode")
                    $_stream = $this->decodeFlate($_stream);
                elseif ($key == "Crypt") {
                    // TO DO
                }
            }
            $data = $_stream;
        }
        return $data;
    }

    function getDirtyTexts(&$texts, $textContainers) {
        for ($j = 0; $j < count($textContainers); $j++) {
            if (preg_match_all("#\[(.*)\]\s*TJ[\n|\r]#ismU", $textContainers[$j], $parts))
                $texts = array_merge($texts, array(@implode('', $parts[1])));
            elseif (preg_match_all("#T[d|w|m|f]\s*(\(.*\))\s*Tj[\n|\r]#ismU", $textContainers[$j], $parts))
                $texts = array_merge($texts, array(@implode('', $parts[1])));
            elseif (preg_match_all("#T[d|w|m|f]\s*(\[.*\])\s*Tj[\n|\r]#ismU", $textContainers[$j], $parts))
                $texts = array_merge($texts, array(@implode('', $parts[1])));
        }
    }

    function getCharTransformations(&$transformations, $stream) {
        preg_match_all("#([0-9]+)\s+beginbfchar(.*)endbfchar#ismU", $stream, $chars, PREG_SET_ORDER);
        preg_match_all("#([0-9]+)\s+beginbfrange(.*)endbfrange#ismU", $stream, $ranges, PREG_SET_ORDER);

        for ($j = 0; $j < count($chars); $j++) {
            $count = $chars[$j][1];
            $current = explode("\n", trim($chars[$j][2]));
            for ($k = 0; $k < $count && $k < count($current); $k++) {
                if (preg_match("#<([0-9a-f]{2,4})>\s+<([0-9a-f]{4,512})>#is", trim($current[$k]), $map))
                    $transformations[str_pad($map[1], 4, "0")] = $map[2];
            }
        }

        for ($j = 0; $j < count($ranges); $j++) {
            $count = $ranges[$j][1];
            $current = explode("\n", trim($ranges[$j][2]));
            for ($k = 0; $k < $count && $k < count($current); $k++) {
                if (preg_match("#<([0-9a-f]{4})>\s+<([0-9a-f]{4})>\s+<([0-9a-f]{4})>#is", trim($current[$k]), $map)) {
                    $from = hexdec($map[1]);
                    $to = hexdec($map[2]);
                    $_from = hexdec($map[3]);

                    for ($m = $from, $n = 0; $m <= $to; $m++, $n++)
                        $transformations[sprintf("%04X", $m)] = sprintf("%04X", $_from + $n);
                } elseif (preg_match("#<([0-9a-f]{4})>\s+<([0-9a-f]{4})>\s+\[(.*)\]#ismU", trim($current[$k]), $map)) {
                    $from = hexdec($map[1]);
                    $to = hexdec($map[2]);
                    $parts = preg_split("#\s+#", trim($map[3]));

                    for ($m = $from, $n = 0; $m <= $to && $n < count($parts); $m++, $n++)
                        $transformations[sprintf("%04X", $m)] = sprintf("%04X", hexdec($parts[$n]));
                }
            }
        }
    }

    function getTextUsingTransformations($texts, $transformations) {
        $document = "";
        for ($i = 0; $i < count($texts); $i++) {
            $isHex = false;
            $isPlain = false;

            $hex = "";
            $plain = "";
            for ($j = 0; $j < strlen($texts[$i]); $j++) {
                $c = $texts[$i][$j];
                switch ($c) {
                    case "<":
                        $hex = "";
                        $isHex = true;
                        $isPlain = false;
                        break;
                    case ">":
                        $hexs = str_split($hex, $this->multibyte); // 2 or 4 (UTF8 or ISO)
                        for ($k = 0; $k < count($hexs); $k++) {
                            $chex = str_pad($hexs[$k], 4, "0"); // Add tailing zero
                            if (isset($transformations[$chex]))
                                $chex = $transformations[$chex];
                            $document .= html_entity_decode("&#x" . $chex . ";");
                        }
                        $isHex = false;
                        break;
                    case "(":
                        $plain = "";
                        $isPlain = true;
                        $isHex = false;
                        break;
                    case ")":
                        $document .= $plain;
                        $isPlain = false;
                        break;
                    case "\\":
                        $c2 = $texts[$i][$j + 1];
                        if (in_array($c2, array("\\", "(", ")")))
                            $plain .= $c2;
                        // Double quotes so that real control characters are appended
                        elseif ($c2 == "n") $plain .= "\n";
                        elseif ($c2 == "r") $plain .= "\r";
                        elseif ($c2 == "t") $plain .= "\t";
                        elseif ($c2 == "b") $plain .= "\x08";
                        elseif ($c2 == "f") $plain .= "\f";
                        elseif ($c2 >= '0' && $c2 <= '9') {
                            $oct = preg_replace("#[^0-9]#", "", substr($texts[$i], $j + 1, 3));
                            $j += strlen($oct) - 1;
                            $plain .= html_entity_decode("&#" . octdec($oct) . ";", $this->convertquotes);
                        }
                        $j++;
                        break;
                    default:
                        if ($isHex)
                            $hex .= $c;
                        elseif ($isPlain)
                            $plain .= $c;
                        break;
                }
            }
            $document .= "\n";
        }

        return $document;
    }
}
?>
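The core of decodePDF() can be hard to follow in a long listing, so here is a minimal, self-contained sketch of the same idea: find a stream, inflate it if it is Flate-compressed, then read the string shown by a Tj operator inside a BT ... ET block. The `$object` data and the `extractFirstText()` helper are made up for illustration and are not part of the class; the sketch assumes PHP's zlib extension is available and ignores the other filters and the character maps that the class handles.

```php
<?php
// A made-up, minimal PDF-style object: a dictionary followed by a
// Flate-compressed content stream (gzcompress() needs the zlib extension).
$pageContent = "BT\n/F1 12 Tf\n(Hello PDF) Tj\nET\n";
$object = "<< /Filter /FlateDecode >>\nstream\n"
        . gzcompress($pageContent)
        . "\nendstream";

// Hypothetical helper: returns the first literal string shown with Tj, or null.
function extractFirstText(string $object): ?string
{
    // 1. Isolate the raw bytes between "stream" and "endstream".
    if (!preg_match('#stream\n(.*)\nendstream#s', $object, $m)) {
        return null;
    }
    $data = $m[1];

    // 2. Inflate the bytes only when the dictionary names the FlateDecode filter.
    if (strpos($object, '/FlateDecode') !== false) {
        $data = @gzuncompress($data);
        if ($data === false) {
            return null;
        }
    }

    // 3. Inside a BT ... ET text block, read the string passed to the Tj operator.
    if (preg_match('#BT\s(.*?)\sET#s', $data, $block)
        && preg_match('#\((.*?)\)\s*Tj#s', $block[1], $text)) {
        return $text[1];
    }
    return null;
}

echo extractFirstText($object), "\n";
```

Real PDF files are far messier than this toy object, which is why the class uses more tolerant patterns and a separate pass for character transformations.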
For convenience, here are three examples that use this class:
pdf_to_text.php
To Extract the Content from a PDF File
This example extracts the text from a PDF file. Create a ‘pdf_to_text.php’ file and include ‘class.pdf2text.php’ so that the library functions are available.
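<?php
include('class.pdf2text.php');

// Print a message, converting newlines to <br> when not running from the command line
function output($message)
{
    if (php_sapi_name() == 'cli')
        echo($message);
    else
        echo(nl2br($message));
}

$file = 'sample'; // name of the PDF file without the .pdf extension (sample.pdf)

// The class listed above is named PDF2Text; these calls follow the SYNTAX note at the top of the class file
$pdf = new PDF2Text();
$pdf->setFilename("$file.pdf");
$pdf->decodePDF();

output("Extracted file contents :\n");
output($pdf->output());
?>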
pdf_to_image.php
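To Extract the Embedded Images (if any) from the Source PDF File
This example extracts the embedded images from a source PDF file. Note that it instantiates a class named PdfToText and relies on its Images property and PDFOPT_DECODE_IMAGE_DATA option, none of which are defined by the PDF2Text class listed above, so it needs a library file that provides them.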
<?php
// This example saves all images found in the 'sample.pdf' file, after having put the string
// "Hello world" in blue color, using the largest stock font
include('class.pdf2text.php');

function output($message)
{
    if (php_sapi_name() == 'cli')
        echo($message);
    else
        echo(nl2br($message));
}

$file = 'sample'; // name of the PDF file without the .pdf extension (sample.pdf)
$pdf = new PdfToText("$file.pdf", PdfToText::PDFOPT_DECODE_IMAGE_DATA);
$image_count = count($pdf->Images);

if ($image_count) {
    for ($i = 0; $i < $image_count; $i++) {
        // Get next image and generate a filename for it (there will be a file named "sample.x.jpg"
        // for each image found in file "sample.pdf")
        $img = $pdf->Images[$i]; // This is an object of type PdfImage
        $imgindex = sprintf("%02d", $i + 1);
        $output_image = "$file.$imgindex.jpg";

        // Allocate a color entry for "blue". Note that the ImageResource property of every PdfImage object
        // is a real image resource that can be specified to any of the image*() PHP functions
        $textcolor = imagecolorallocate($img->ImageResource, 0, 0, 255);

        // Put the string "Hello world" on top of the image.
        imagestring($img->ImageResource, 5, 0, 0, "Hello world #$imgindex", $textcolor);

        // Save the image (the default is IMG_JPG, but you can specify another IMG_* image type by specifying it
        // as the second parameter)
        $img->SaveAs($output_image);

        output("Generated image file \"$output_image\"");
    }
} else
    echo "No image was found in sample file \"$file.pdf\"";
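search_pdf.php
To Search a String in a PDF File
If you want to search for a specific text string in a PDF file, the example below shows how. Like the image example, it uses a class named PdfToText and its text_strpos method, which the PDF2Text class listed above does not define.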
<?php
include('class.pdf2text.php');

$file = 'sample'; // name of the PDF file without the .pdf extension (sample.pdf)
$pdf = new PdfToText("$file.pdf"); // use the $file variable, not the literal "file.pdf"

$search = 'john'; // keyword to search goes here
$result = $pdf->text_strpos($search, $start = 0); // $start is the start offset in the pdf text contents

echo "<pre>";
print_r($result); // the result gives the position of the searched string, indicating its presence in the PDF file
?>
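If you only have the PDF2Text class from the listing above, a search can also be run on the plain text it returns. The following is a minimal sketch under that assumption: `findAllPositions()` is a hypothetical helper (not part of any class here), and the `$text` string is a stand-in for what `$pdf->output()` would return after `decodePDF()`.

```php
<?php
/**
 * Hypothetical helper: return every byte offset at which $needle occurs
 * in $haystack, ignoring case. An empty needle yields no matches.
 */
function findAllPositions(string $haystack, string $needle): array
{
    $positions = [];
    if ($needle === '') {
        return $positions;
    }

    $offset = 0;
    // stripos() returns false once no further match exists after $offset
    while (($pos = stripos($haystack, $needle, $offset)) !== false) {
        $positions[] = $pos;
        $offset = $pos + 1; // continue just past the start of this match
    }
    return $positions;
}

// Stand-in for the extracted text; with the class you would use:
//   $pdf = new PDF2Text(); $pdf->setFilename('sample.pdf');
//   $pdf->decodePDF();     $text = $pdf->output();
$text = "John Smith met john doe.\nJOHN signed the form.";

$hits = findAllPositions($text, 'john');
echo count($hits) > 0 ? "Found at: " . implode(', ', $hits) . "\n" : "Not found\n";
```

An empty result array means the string is absent, which avoids the usual pitfall of confusing position 0 with false.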
Many web applications need to reuse data from PDF documents. Using this PHP class, you can easily get the text and images from a PDF file and use them. If you find this post useful, please share it with others. Thanks.
Hi, I need to scrape/copy all the data from a section of a PDF page that looks like this: http://prntscr.com/czngo4
I want to take all the variable values in that table and put them into a mysql database. Is this something you could give me a quote to do?
Hi,
Yes, I can help with your task. Do you need only the red-highlighted data or the complete PDF text?
Please send your complete requirement on [email protected].
Thanks