Recently, I came across a task where files were stored on a file system and I needed to search them by content and by name.
I opted for a Lucene index to index the file contents. Below is the basic code that adds a Lucene document.
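    import java.nio.file.Paths;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Create the Lucene document ("file" is the java.nio.file.Path being indexed)
    Document document = new Document();
    document.add(new StringField("path", file.toString(), Field.Store.YES));
    document.add(new TextField("displayName", file.getFileName().toString(), Field.Store.YES));
    // Here the file contents ("content") are extracted using the Apache Tika library
    document.add(new TextField("contents", content, Field.Store.NO));

    // Create the analyzer
    Analyzer analyzer = new StandardAnalyzer();
    // Create the IndexWriter configuration and pass it the analyzer
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    indexWriterConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);

    // org.apache.lucene.store.Directory instance
    Directory directory = FSDirectory.open(
        Paths.get("Here is the directory for index files keep it separate from data")
    );

    // IndexWriter to write to the index
    IndexWriter writer = new IndexWriter(directory, indexWriterConfig);
    writer.addDocument(document);
    writer.close();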
The document has three fields: path, displayName, and contents.
You may notice the third parameter, Field.Store. It is YES for path and displayName, but NO for contents.
So the question is: when do we need to store a field? I will explain using my scenario.
In my case, the files are saved on the file system, and we only need to search the analyzed content within the Lucene index. A field added with Field.Store.NO is still indexed and searchable; its original value simply cannot be read back from a search hit. So in this scenario, it doesn’t make sense to store the file content a second time. We can get the file name and path from the stored fields of a hit and use them to fetch the actual file.
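To make the scenario above concrete, here is a minimal sketch of the search side. It is not the original project's code: the class name StoredFieldDemo, the sample path, and the sample text are made up for illustration, and it uses an in-memory ByteBuffersDirectory instead of FSDirectory so that it runs without touching the disk. It assumes Lucene 9.5 or later on the classpath; on older versions, searcher.doc(docId) plays the role of searcher.storedFields().document(docId).

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical demo class; names and sample values are illustrative only.
public class StoredFieldDemo {

    /** Indexes one sample document, searches its contents, and returns the hit's stored fields. */
    static Document searchSample() throws IOException {
        // In-memory directory, used here only to keep the demo self-contained.
        try (Directory directory = new ByteBuffersDirectory()) {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(directory, config)) {
                Document document = new Document();
                document.add(new StringField("path", "/data/report.txt", Field.Store.YES));
                document.add(new TextField("displayName", "report.txt", Field.Store.YES));
                // Indexed for search, but the original text is not kept in the index.
                document.add(new TextField("contents", "Quarterly sales figures", Field.Store.NO));
                writer.addDocument(document);
            } // closing the writer commits the document

            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // StandardAnalyzer lowercases tokens, so the query term is lowercase.
                TermQuery query = new TermQuery(new Term("contents", "sales"));
                ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;
                // Load only the stored fields of the first hit.
                return searcher.storedFields().document(hits[0].doc);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Document hit = searchSample();
        System.out.println("path = " + hit.get("path"));               // path = /data/report.txt
        System.out.println("displayName = " + hit.get("displayName")); // displayName = report.txt
        System.out.println("contents = " + hit.get("contents"));       // contents = null
    }
}
```

The query matches on contents even though that field was not stored, while hit.get("contents") returns null. The stored path is what you would then use to open the real file from the file system.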
So, store a field only when you need to read its value back from a search result and that data is not available outside of Lucene.