PHP Code to Extract Text and Images from a PDF file

There has long been a real need for code that extracts the text and images from a PDF file. Online tools can convert PDF files into text and other formats, but a programmatic way to pull the content out of a PDF file from PHP was hard to come by until a standalone PHP class, distributed as “class.pdf2text.php”, became available.

Extract Text and Images from a PDF file Using PHP

With this class, you can read the content of a PDF file inside a web application and also check whether a specific text string is present in the PDF file.

Here is the code of the class file “class.pdf2text.php”. Include this file in each example.

<?php
/*
SYNTAX:
include('class.pdf2text.php');
$a = new PDF2Text();
$a->setFilename('test.pdf');
$a->decodePDF();
echo $a->output();

ALTERNATIVES:
Other excellent options to search within a PDF:
- Apache PDFbox (http://pdfbox.apache.org/). An open source Java solution
- pdflib TET (http://www.pdflib.com/products/tet/)
- Online converter: http://snowtide.com/PDFTextStream
*/

class PDF2Text {
	// Some settings
	var $multibyte = 4; // Use setUnicode(TRUE|FALSE)
	var $convertquotes = ENT_QUOTES; // ENT_COMPAT (double-quotes), ENT_QUOTES (Both), ENT_NOQUOTES (None)
	var $showprogress = true; // TRUE if you have problems with time-out

	// Variables
	var $filename = '';
	var $decodedtext = '';

	function setFilename($filename) {
		// Reset
		$this->decodedtext = '';
		$this->filename = $filename;
	}

	function output($echo = false) {
		if($echo) echo $this->decodedtext;
		else return $this->decodedtext;
	}

	function setUnicode($input) {
		// 4 for unicode. But 2 should work in most cases just fine
		if($input == true) $this->multibyte = 4;
		else $this->multibyte = 2;
	}

	function decodePDF() {
		// Read the data from pdf file
		$infile = @file_get_contents($this->filename, FILE_BINARY);
		if (empty($infile))
			return "";

		// Get all text data.
		$transformations = array();
		$texts = array();

		// Get the list of all objects.
		preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile . "endobj\r", $objects);
		$objects = @$objects[1];

		// Select objects with streams.
		for ($i = 0; $i < count($objects); $i++) {
			$currentObject = $objects[$i];

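			// Prevent time-out: restart the execution timer for each object
			// (30 seconds is an assumed value; adjust as needed).
			@set_time_limit(30);
			if($this->showprogress) {
//				echo ". ";
				// Flush PHP's output buffer (if one is active), then the system buffer.
				if (ob_get_level() > 0) ob_flush();
				flush();
			}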

			// Check if an object includes data stream.
			if (preg_match("#stream[\n|\r](.*)endstream[\n|\r]#ismU", $currentObject . "endstream\r", $stream )) {
				$stream = ltrim($stream[1]);
				// Check object parameters and look for text data.
				$options = $this->getObjectOptions($currentObject);

				// Skip streams whose dictionary has Length1, Type, or Subtype
				// (fonts, images, and similar non-text objects).
				if (!(empty($options["Length1"]) && empty($options["Type"]) && empty($options["Subtype"])) )
					continue;

				// Hack, length doesnt always seem to be correct
				unset($options["Length"]);

				// So, we have text data. Decode it.
				$data = $this->getDecodedStream($stream, $options);

				if (strlen($data)) {
	                if (preg_match_all("#BT[\n|\r](.*)ET[\n|\r]#ismU", $data . "ET\r", $textContainers)) {
						$textContainers = @$textContainers[1];
						$this->getDirtyTexts($texts, $textContainers);
					} else
						$this->getCharTransformations($transformations, $data);
				}
			}
		}

		// Analyze text blocks taking into account character transformations and return results.
		$this->decodedtext = $this->getTextUsingTransformations($texts, $transformations);
	}


	function decodeAsciiHex($input) {
		$output = "";

		$isOdd = true;
		$isComment = false;

		for($i = 0, $codeHigh = -1; $i < strlen($input) && $input[$i] != '>'; $i++) {
			$c = $input[$i];

			if($isComment) {
				// Double quotes are required so that \r and \n are real control characters.
				if ($c == "\r" || $c == "\n")
					$isComment = false;
				continue;
			}

			switch($c) {
				// Whitespace is ignored (double quotes so the escapes are interpreted).
				case "\0": case "\t": case "\r": case "\f": case "\n": case ' ': break;
				case '%':
					$isComment = true;
				break;

				default:
					$code = hexdec($c);
					if($code === 0 && $c != '0')
						return "";

					if($isOdd)
						$codeHigh = $code;
					else
						$output .= chr($codeHigh * 16 + $code);

					$isOdd = !$isOdd;
				break;
			}
		}

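		// The data must end with the '>' end-of-data marker.
		if($i >= strlen($input) || $input[$i] != '>')
			return "";

		// An odd number of hex digits means the last digit is followed by an implied 0.
		if(!$isOdd)
			$output .= chr($codeHigh * 16);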

		return $output;
	}

	function decodeAscii85($input) {
		$output = "";

		$isComment = false;
		$ords = array();

		for($i = 0, $state = 0; $i < strlen($input) && $input[$i] != '~'; $i++) {
			$c = $input[$i];

			if($isComment) {
				// Double quotes are required so that \r and \n are real control characters.
				if ($c == "\r" || $c == "\n")
					$isComment = false;
				continue;
			}

			if ($c == "\0" || $c == "\t" || $c == "\r" || $c == "\f" || $c == "\n" || $c == ' ')
				continue;
			if ($c == '%') {
				$isComment = true;
				continue;
			}
			if ($c == 'z' && $state === 0) {
				$output .= str_repeat(chr(0), 4);
				continue;
			}
			if ($c < '!' || $c > 'u')
				return "";

			$code = ord($input[$i]) & 0xff;
			$ords[$state++] = $code - ord('!');

			if ($state == 5) {
				$state = 0;
				for ($sum = 0, $j = 0; $j < 5; $j++)
					$sum = $sum * 85 + $ords[$j];
				for ($j = 3; $j >= 0; $j--)
					$output .= chr($sum >> ($j * 8));
			}
		}
		if ($state === 1)
			return "";
		elseif ($state > 1) {
			for ($i = 0, $sum = 0; $i < $state; $i++)
				$sum += ($ords[$i] + ($i == $state - 1)) * pow(85, 4 - $i);
			for ($i = 0; $i < $state - 1; $i++) {
				try {
					if(false == ($o = chr($sum >> ((3 - $i) * 8)))) {
						throw new Exception('Error');
					}
					$output .= $o;
				} catch (Exception $e) { /*Dont do anything*/ }
			}
		}

		return $output;
	}

	function decodeFlate($data) {
		return @gzuncompress($data);
	}

	function getObjectOptions($object) {
		$options = array();

		if (preg_match("#<<(.*)>>#ismU", $object, $options)) {
			$options = explode("/", $options[1]);
			@array_shift($options);

			$o = array();
			for ($j = 0; $j < @count($options); $j++) {
				$options[$j] = preg_replace("#\s+#", " ", trim($options[$j]));
				if (strpos($options[$j], " ") !== false) {
					$parts = explode(" ", $options[$j]);
					$o[$parts[0]] = $parts[1];
				} else
					$o[$options[$j]] = true;
			}
			$options = $o;
			unset($o);
		}

		return $options;
	}

	function getDecodedStream($stream, $options) {
		$data = "";
		if (empty($options["Filter"]))
			$data = $stream;
		else {
			$length = !empty($options["Length"]) ? $options["Length"] : strlen($stream);
			$_stream = substr($stream, 0, $length);

			foreach ($options as $key => $value) {
				if ($key == "ASCIIHexDecode")
					$_stream = $this->decodeAsciiHex($_stream);
				elseif ($key == "ASCII85Decode")
					$_stream = $this->decodeAscii85($_stream);
				elseif ($key == "FlateDecode")
					$_stream = $this->decodeFlate($_stream);
				elseif ($key == "Crypt") { // TO DO
				}
			}
			$data = $_stream;
		}
		return $data;
	}

	function getDirtyTexts(&$texts, $textContainers) {
		for ($j = 0; $j < count($textContainers); $j++) {
			if (preg_match_all("#\[(.*)\]\s*TJ[\n|\r]#ismU", $textContainers[$j], $parts))
				$texts = array_merge($texts, array(@implode('', $parts[1])));
			elseif (preg_match_all("#T[d|w|m|f]\s*(\(.*\))\s*Tj[\n|\r]#ismU", $textContainers[$j], $parts))
				$texts = array_merge($texts, array(@implode('', $parts[1])));
			elseif (preg_match_all("#T[d|w|m|f]\s*(\[.*\])\s*Tj[\n|\r]#ismU", $textContainers[$j], $parts))
				$texts = array_merge($texts, array(@implode('', $parts[1])));
		}

	}

	function getCharTransformations(&$transformations, $stream) {
		preg_match_all("#([0-9]+)\s+beginbfchar(.*)endbfchar#ismU", $stream, $chars, PREG_SET_ORDER);
		preg_match_all("#([0-9]+)\s+beginbfrange(.*)endbfrange#ismU", $stream, $ranges, PREG_SET_ORDER);

		for ($j = 0; $j < count($chars); $j++) {
			$count = $chars[$j][1];
			$current = explode("\n", trim($chars[$j][2]));
			for ($k = 0; $k < $count && $k < count($current); $k++) {
				if (preg_match("#<([0-9a-f]{2,4})>\s+<([0-9a-f]{4,512})>#is", trim($current[$k]), $map))
					$transformations[str_pad($map[1], 4, "0")] = $map[2];
			}
		}
		for ($j = 0; $j < count($ranges); $j++) {
			$count = $ranges[$j][1];
			$current = explode("\n", trim($ranges[$j][2]));
			for ($k = 0; $k < $count && $k < count($current); $k++) {
				if (preg_match("#<([0-9a-f]{4})>\s+<([0-9a-f]{4})>\s+<([0-9a-f]{4})>#is", trim($current[$k]), $map)) {
					$from = hexdec($map[1]);
					$to = hexdec($map[2]);
					$_from = hexdec($map[3]);

					for ($m = $from, $n = 0; $m <= $to; $m++, $n++)
						$transformations[sprintf("%04X", $m)] = sprintf("%04X", $_from + $n);
				} elseif (preg_match("#<([0-9a-f]{4})>\s+<([0-9a-f]{4})>\s+\[(.*)\]#ismU", trim($current[$k]), $map)) {
					$from = hexdec($map[1]);
					$to = hexdec($map[2]);
					$parts = preg_split("#\s+#", trim($map[3]));

					for ($m = $from, $n = 0; $m <= $to && $n < count($parts); $m++, $n++)
						$transformations[sprintf("%04X", $m)] = sprintf("%04X", hexdec($parts[$n]));
				}
			}
		}
	}
	function getTextUsingTransformations($texts, $transformations) {
		$document = "";
		for ($i = 0; $i < count($texts); $i++) {
			$isHex = false;
			$isPlain = false;

			$hex = "";
			$plain = "";
			for ($j = 0; $j < strlen($texts[$i]); $j++) {
				$c = $texts[$i][$j];
				switch($c) {
					case "<":
						$hex = "";
						$isHex = true;
                        $isPlain = false;
					break;
					case ">":
						$hexs = str_split($hex, $this->multibyte); // 2 or 4 (UTF8 or ISO)
						for ($k = 0; $k < count($hexs); $k++) {

							$chex = str_pad($hexs[$k], 4, "0"); // Add tailing zero
							if (isset($transformations[$chex]))
								$chex = $transformations[$chex];
							$document .= html_entity_decode("&#x".$chex.";");
						}
						$isHex = false;
					break;
					case "(":
						$plain = "";
						$isPlain = true;
                        $isHex = false;
					break;
					case ")":
						$document .= $plain;
						$isPlain = false;
					break;
					case "\\":
						$c2 = $texts[$i][$j + 1];
						if (in_array($c2, array("\\", "(", ")"))) $plain .= $c2;
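						// Translate PDF escape sequences into the actual control characters.
						elseif ($c2 == "n") $plain .= "\n";
						elseif ($c2 == "r") $plain .= "\r";
						elseif ($c2 == "t") $plain .= "\t";
						elseif ($c2 == "b") $plain .= chr(8); // backspace (PHP has no "\b" escape)
						elseif ($c2 == "f") $plain .= "\f";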
						elseif ($c2 >= '0' && $c2 <= '9') {
							$oct = preg_replace("#[^0-9]#", "", substr($texts[$i], $j + 1, 3));
							$j += strlen($oct) - 1;
							$plain .= html_entity_decode("&#".octdec($oct).";", $this->convertquotes);
						}
						$j++;
					break;

					default:
						if ($isHex)
							$hex .= $c;
						elseif ($isPlain)
							$plain .= $c;
					break;
				}
			}
			$document .= "\n";
		}

		return $document;
	}
}
?>
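
To make the flow inside decodePDF() easier to follow, here is a minimal, self-contained sketch of the same idea on a tiny hand-written content stream. The stream text and variable names are hypothetical, the Tj pattern is a simplified version of the one in getDirtyTexts(), and the zlib extension is assumed to be enabled.

```php
<?php
// Hypothetical, hand-written page content stream (not taken from a real PDF).
$content = "BT\n/F1 12 Tf\n72 712 Td\n(Hello PDF) Tj\nET\n";

// PDF writers usually store such streams zlib-compressed (/FlateDecode).
$compressed = gzcompress($content);

// decodeFlate() in the class reverses this with gzuncompress().
$decoded = gzuncompress($compressed);

// Same pattern decodePDF() uses to isolate text objects (BT ... ET).
preg_match_all("#BT[\n|\r](.*)ET[\n|\r]#ismU", $decoded . "ET\r", $blocks);

// Simplified pattern: take the literal string shown by the Tj operator.
preg_match("#\((.*)\)\s*Tj#sU", $blocks[1][0], $m);
$extracted = $m[1];

echo $extracted, "\n";
```

Real PDF files add font encodings, hex strings, and character maps on top of this, which is what the remaining methods of the class deal with.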


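For convenience, here are three examples. Note that they instantiate a class named PdfToText (with a Text property, an Images array, and a text_strpos() method), not the PDF2Text class listed above, so they assume the included file also provides that class.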

pdf_to_text.php

To Extract the Content from a PDF File

This example extracts the text from a PDF file. Create a ‘pdf_to_text.php’ file and include ‘class.pdf2text.php’ to use the library functions.

<?php
	include ( 'class.pdf2text.php' ) ;

	function  output ( $message )
	   {
		if  ( php_sapi_name ( )  ==  'cli' )
			echo ( $message ) ;
		else
			echo ( nl2br ( $message ) ) ;
	    }

	$file	=  'sample' ; // name of the PDF file without the .pdf extension (here: sample.pdf)
	$pdf	=  new PdfToText ( "$file.pdf" ) ;

	
	output ( "Extracted file contents :\n" ) ;
	output ( $pdf -> Text ) ;
?>
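
For comparison, the same task can be sketched with the PDF2Text class listed above, following the calls shown in its SYNTAX comment. The helper name extractPdfText() and the convention of returning null when a file is missing are assumptions made for this illustration only.

```php
<?php
// Sketch: extract text with the PDF2Text class listed above.
// Returns null when the class file or the PDF cannot be read (assumed convention).
function extractPdfText(string $pdfPath, string $classFile = 'class.pdf2text.php'): ?string
{
    if (!is_file($classFile) || !is_readable($pdfPath)) {
        return null;
    }
    require_once $classFile;

    $pdf = new PDF2Text();
    $pdf->setFilename($pdfPath);   // also resets any previously decoded text
    $pdf->decodePDF();             // parse objects, decode streams, collect text

    return $pdf->output();         // with no argument, output() returns the text
}

$text = extractPdfText('sample.pdf');
echo $text === null ? "Nothing to read\n" : $text;
```

If the extracted characters look wrong, calling setUnicode(false) before decodePDF() switches the class from 4 to 2 hex digits per character, which its own comments suggest works in most cases.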


pdf_to_image.php

To Extract the Embedded Images (If Any) from the Source PDF File

This example extracts the embedded images from a source PDF file and saves each one as a JPEG file.

<?php
	// This example saves all images found in the 'sample.pdf' file, after having put the string
	// "Hello world" in blue color, using the largest stock font
	include ( 'class.pdf2text.php' ) ;

	function  output ( $message )
	   {
		if  ( php_sapi_name ( )  ==  'cli' )
			echo ( $message ) ;
		else
			echo ( nl2br ( $message ) ) ;
	    }

	$file		=  'sample' ; // name of the PDF file without the .pdf extension (here: sample.pdf)
	$pdf		=  new PdfToText ( "$file.pdf", PdfToText::PDFOPT_DECODE_IMAGE_DATA ) ;
	$image_count 	=  count ( $pdf -> Images ) ;
	
	if  ( $image_count )
	   {
		for  ( $i = 0 ; $i  <  $image_count ; $i ++ )
		   {
			// Get next image and generate a filename for it (there will be a file named "sample.x.jpg"
			// for each image found in file "sample.pdf")
			$img		=  $pdf -> Images [$i] ;			// This is an object of type PdfImage
			$imgindex 	=  sprintf ( "%02d", $i + 1 ) ;
			$output_image	=  "$file.$imgindex.jpg" ;
			
			// Allocate a color entry for "blue" (RGB 0, 0, 255). Note that the ImageResource property of every PdfImage object
			// is a real image resource that can be passed to any of the image*() PHP functions
			$textcolor	=  imagecolorallocate ( $img -> ImageResource, 0, 0, 255 ) ;
			
			// Put the string "Hello world" on top of the image. 
			imagestring ( $img -> ImageResource, 5, 0, 0, "Hello world #$imgindex", $textcolor ) ;
			
			// Save the image (the default is IMG_JPG, but you can specify another IMG_* image type by specifying it
			// as the second parameter)
			$img -> SaveAs ( $output_image ) ;
			
			output ( "Generated image file \"$output_image\"" ) ;
		    }
	    }
	else
		output ( "No image was found in sample file \"$file.pdf\"" ) ;
?>
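
The drawing calls in this example are ordinary GD functions, so they can be tried without any PDF at all. The following is a stand-alone sketch, assuming the GD extension with JPEG support is installed; the blank $canvas stands in for a PdfImage's ImageResource, and the temporary file path is just an illustrative choice.

```php
<?php
// Stand-alone GD sketch; $canvas stands in for a PdfImage's ImageResource.
if (!function_exists('imagecreatetruecolor') || !function_exists('imagejpeg')) {
    echo "GD with JPEG support is not available\n";
    $saved = false;
} else {
    $canvas = imagecreatetruecolor(200, 40);            // blank 200x40 image
    $blue   = imagecolorallocate($canvas, 0, 0, 255);   // RGB 0,0,255 = blue

    // Font 5 is the largest built-in GD font; text starts at the top-left corner.
    imagestring($canvas, 5, 0, 0, 'Hello world #01', $blue);

    // Write the image as a JPEG file into the system temp directory.
    $target = sys_get_temp_dir() . '/sample.01.jpg';
    $saved  = imagejpeg($canvas, $target);

    echo $saved ? "Generated image file \"$target\"\n" : "Could not write image\n";
}
```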


search_pdf.php

To Search for a String in a PDF File

If you want to search for a specific text string in a PDF file, use the example below.

<?php
	include ( 'class.pdf2text.php' ) ;


	$file	=  'sample' ; // name of the PDF file without the .pdf extension (here: sample.pdf)
	$pdf	=  new PdfToText ( "$file.pdf" ) ;

	$search	=  'john' ; // keyword to search for
	$start	=  0 ;      // start offset within the extracted PDF text

	$result	=  $pdf -> text_strpos ( $search, $start ) ;

	echo "<pre>" ;
	print_r ( $result ) ; // shows where the search string was found, which confirms its presence in the PDF file
?>
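
Once the text has been extracted into a plain string, the presence check itself needs nothing more than PHP's built-in string functions. Here is a small sketch; the helper findAllOffsets() and the sample text are hypothetical, and the sample string stands in for whatever the extraction step returned.

```php
<?php
// Return every byte offset at which $needle occurs in $haystack (case-insensitive).
function findAllOffsets(string $haystack, string $needle): array
{
    $offsets = [];
    if ($needle === '') {
        return $offsets;                 // nothing sensible to search for
    }
    $pos = 0;
    while (($pos = stripos($haystack, $needle, $pos)) !== false) {
        $offsets[] = $pos;
        $pos += strlen($needle);         // continue after this match
    }
    return $offsets;
}

// Stand-in for text extracted from a PDF (hypothetical sample).
$text = "Invoice for John Smith\nApproved by john";

$hits = findAllOffsets($text, 'john');
echo $hits ? "Found at offsets: " . implode(', ', $hits) . "\n" : "Not found\n";
```

An empty result array means the string does not occur in the extracted text.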


Many web applications need to work with data from PDF documents. Using this PHP class, you can easily get the text and images from a PDF file and use them. If you find this post useful, please share it with others. Thanks!


