Write a Program to Search in a PDF

by admin on September 3, 2014

Let's get started by downloading the required libraries. Please stick to the versions I am using, since later versions may need a different implementation.
1. Download Apache Lucene 3.6.1. Unzip the archive and find lucene-core-3.6.1.jar.
2. Download Apache PDFBox 0.7.3. Unzip it and find pdfbox-0.7.3.jar.
3. Download fontbox-0.1.0.jar. The project will fail with a class-not-found error if this library is missing.
Next, create a Java project in Eclipse. Right-click the project in Project Explorer, go to Configure Build Path -> Add External JARs, add lucene-core-3.6.1.jar, pdfbox-0.7.3.jar and fontbox-0.1.0.jar, and click OK.
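Before writing any code, it can help to confirm that the three jars really are on the build path. The following is only a minimal sketch and not part of the original project: the class name ClasspathCheck and the helper isOnClasspath are made up for this illustration, and the FontBox class name is my assumption about what fontbox-0.1.0.jar contains.

```java
public class ClasspathCheck {

    // Returns true if the named class can be located on the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.lucene.index.IndexWriter", // lucene-core-3.6.1.jar
            "org.pdfbox.pdmodel.PDDocument",       // pdfbox-0.7.3.jar
            "org.fontbox.afm.AFMParser"            // fontbox-0.1.0.jar (assumed class name)
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If any line reports MISSING, revisit the build path settings before moving on to the next step.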
4. Create a class named SimplePDFSearch.java. This is the main class, which performs each step in turn. Copy the code below into it and change the package name to the package in which you create the class.
package com.programmingfree.simplepdfsearch;

import org.apache.lucene.queryParser.ParseException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class SimplePDFSearch {
    // location where the index will be stored
    private static final String INDEX_DIR = "src/main/resources/index";
    private static final int DEFAULT_RESULT_SIZE = 100;

    public static void main(String[] args) throws IOException, ParseException {

        File pdfFile = new File("src/resources/SamplePDF.pdf");
        IndexItem pdfIndexItem = index(pdfFile);

        // create an instance of the Indexer class and index the item
        Indexer indexer = new Indexer(INDEX_DIR);
        indexer.index(pdfIndexItem);
        indexer.close();

        // create an instance of the Searcher class to query the index
        Searcher searcher = new Searcher(INDEX_DIR);
        int result = searcher.findByContent("Hello", DEFAULT_RESULT_SIZE);
        print(result);
        searcher.close();
    }

    // Extract the text from the PDF document
    public static IndexItem index(File file) throws IOException {
        PDDocument doc = PDDocument.load(file);
        try {
            String content = new PDFTextStripper().getText(doc);
            // the ID is derived from the file name, so it is the same on every run
            return new IndexItem((long) file.getName().hashCode(), file.getName(), content);
        } finally {
            doc.close(); // release the file even if text extraction fails
        }
    }

    // Print the result: 1 means at least one hit, anything else means no hit
    private static void print(int result) {
        if (result == 1) {
            System.out.println("The document contains the search keyword");
        } else {
            System.out.println("The document does not contain the search keyword");
        }
    }
}
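The index method above derives the document ID from the hash code of the file name. The sketch below is only an illustration (the class name FileNameId and the helper idFor are hypothetical). It shows that this ID is the same on every run, so re-indexing the same file targets the same entry, and that two different names can share an ID, because String.hashCode() is not collision-free.

```java
public class FileNameId {

    // Same derivation as in SimplePDFSearch.index(): widen the int hash code to a long.
    static long idFor(String fileName) {
        return (long) fileName.hashCode();
    }

    public static void main(String[] args) {
        // String.hashCode() is specified by the Java API, so the value is stable across runs.
        System.out.println(idFor("SamplePDF.pdf") == idFor("SamplePDF.pdf")); // prints "true"

        // "Aa" and "BB" are a well-known colliding pair: both hash to 2112.
        System.out.println(idFor("Aa") + " " + idFor("BB")); // prints "2112 2112"
    }
}
```

For a single PDF this does not matter, but with many files a collision would make one document replace another in the index, so a larger application might prefer an ID based on the full path.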
5. Next we need a class that holds the items to be indexed from a PDF file. Create a class named IndexItem.java and paste the code below into it. It tells the search engine which pieces of the PDF file to store and retrieve: a unique ID, the file name, and the contents (text) of the file.
package com.programmingfree.simplepdfsearch;

public class IndexItem {
    private Long id;
    private String title;
    private String content;

    // names of the fields as they are stored in the index
    public static final String ID = "id";
    public static final String TITLE = "title";
    public static final String CONTENT = "content";

    public IndexItem(Long id, String title, String content) {
        this.id = id;
        this.title = title;
        this.content = content;
    }

    public Long getId() {
        return id;
    }

    public String getTitle() {
        return title;
    }

    public String getContent() {
        return content;
    }

    @Override
    public String toString() {
        return "IndexItem{" +
                "id=" + id +
                ", title='" + title + '\'' +
                ", content='" + content + '\'' +
                '}';
    }

}

6. Next, create a class that indexes the contents of PDF documents. Create a new class named Indexer.java, as referenced in the main class, and paste the code below into it.

package com.programmingfree.simplepdfsearch;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class Indexer {
    private final IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        // open the index in the given directory, creating it if it does not exist
        writer = new IndexWriter(
                FSDirectory.open(new File(indexDir)),
                new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
    }

    /**
     * Adds the item to the index, replacing any existing entry with the same ID.
     */
    public void index(IndexItem indexItem) throws IOException {

        // delete the item if it already exists
        writer.deleteDocuments(new Term(IndexItem.ID, indexItem.getId().toString()));

        Document doc = new Document();

        // the ID is stored as a single exact term; title and content are tokenized by the analyzer
        doc.add(new Field(IndexItem.ID, indexItem.getId().toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field(IndexItem.TITLE, indexItem.getTitle(), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field(IndexItem.CONTENT, indexItem.getContent(), Field.Store.YES, Field.Index.ANALYZED));

        // add the document to the index
        writer.addDocument(doc);
    }

    /**
     * Closes the index writer and commits pending changes.
     */
    public void close() throws IOException {
        writer.close();
    }
}
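
The delete-then-add sequence in the index method is what keeps the index from collecting a duplicate each time the same file is indexed. Here is that idea in plain Java, with a list standing in for the index. ToyStore and its methods are hypothetical names for this illustration and are not part of Lucene. (Lucene's IndexWriter also offers updateDocument(Term, Document), which performs the delete and the add in one call.)

```java
import java.util.ArrayList;
import java.util.List;

public class ToyStore {

    // One stored entry: an ID plus some text.
    static final class Entry {
        final String id;
        final String content;

        Entry(String id, String content) {
            this.id = id;
            this.content = content;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Mirrors Indexer.index(): remove any entry with the same ID, then add the new one.
    void index(String id, String content) {
        entries.removeIf(e -> e.id.equals(id));
        entries.add(new Entry(id, content));
    }

    int size() {
        return entries.size();
    }

    // Returns the stored text for the ID, or null if there is no such entry.
    String contentOf(String id) {
        for (Entry e : entries) {
            if (e.id.equals(id)) {
                return e.content;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        store.index("42", "Hello World by PDFBox");
        store.index("42", "Hello again"); // same ID: replaces the first entry
        System.out.println(store.size());          // prints "1"
        System.out.println(store.contentOf("42")); // prints "Hello again"
    }
}
```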

7. The last step is a class that queries the index created by the Indexer class. Create a class named Searcher.java and paste the code below into it.

package com.programmingfree.simplepdfsearch;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class Searcher {

    private final IndexReader reader;
    private final IndexSearcher searcher;
    private final QueryParser contentQueryParser;

    public Searcher(String indexDir) throws IOException {
        // open the index directory to search
        reader = IndexReader.open(FSDirectory.open(new File(indexDir)));
        searcher = new IndexSearcher(reader);
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // define the query parser to search items by the content field
        contentQueryParser = new QueryParser(Version.LUCENE_36, IndexItem.CONTENT, analyzer);
    }

    /**
     * Finds indexed items by their content.
     * @param queryString the query string to search for
     * @param numOfResults the maximum number of hits to fetch
     * @return 1 if at least one document matches, 0 otherwise
     */
    public int findByContent(String queryString, int numOfResults) throws ParseException, IOException {
        // create a query from the incoming query string
        Query query = contentQueryParser.parse(queryString);
        // execute the query and get the results
        ScoreDoc[] queryResults = searcher.search(query, numOfResults).scoreDocs;

        return queryResults.length > 0 ? 1 : 0;
    }

    public void close() throws IOException {
        searcher.close();
        // the searcher does not close a reader that was passed to its constructor
        reader.close();
    }
}
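
To make clearer what an index buys us, here is a toy inverted index in plain Java. It is a simplified sketch and not how Lucene is implemented: the class ToyInvertedIndex and its methods are invented for this illustration, and the tokenizer (lowercase, then split on anything that is not a letter or digit) only approximates what StandardAnalyzer does. It does show why a query for "Hello" or "hello" finds a document containing "Hello World by PDFBox" without scanning the text again, and why matching works on whole terms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ToyInvertedIndex {

    // term -> IDs of the documents that contain it
    private final Map<String, Set<Long>> postings = new HashMap<>();

    // Lowercase the text and split it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Record every term of the content under the given document ID.
    void add(long docId, String content) {
        for (String term : tokenize(content)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Single-word lookup: the query goes through the same tokenizer as the content.
    Set<Long> find(String word) {
        List<String> tokens = tokenize(word);
        if (tokens.isEmpty()) {
            return Collections.emptySet();
        }
        return postings.getOrDefault(tokens.get(0), Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1L, "Hello World by PDFBox");
        System.out.println(index.find("Hello")); // prints "[1]"
        System.out.println(index.find("hello")); // prints "[1]"
        System.out.println(index.find("PDF"));   // prints "[]" because the indexed term is "pdfbox"
    }
}
```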

 

That is all we need before running the program to find, quickly and efficiently, whether a word is present in a PDF file. Note that the main class (SimplePDFSearch.java) has a field named INDEX_DIR, which holds the path where the index is stored. Every time the program runs, the Indexer deletes any existing entry with the same ID before adding the document again, so the index does not collect duplicates. I have used a sample PDF document that contains the following text:


“Hello World by PDFBox”

I am searching for the word “Hello”, which is passed as a parameter to the findByContent method of the Searcher class. Since the sample text contains that word, the program prints:

The document contains the search keyword
