Azure Search PDF Indexing

06/03/2015

Azure Search has just reached general availability, and with it Microsoft announced a few nice new features.
One of them is a .NET library that makes it very easy to work with Azure Search. The library can be installed with NuGet.

Install-Package Microsoft.Azure.Search -IncludePrerelease

I have used the library to create an Azure WebJob that can index PDF documents stored in Azure Blob storage.

The first thing the WebJob does when it starts is check whether an index named documents exists. If it doesn’t, the WebJob creates it.
Creating an index with the library can be done with the following code:
[csharp]
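// Namespaces used: System, System.Configuration, System.Linq, System.Threading.Tasks,
// Microsoft.Azure.Search, Microsoft.Azure.Search.Models
private static void Setup()
{
    // "sjkp" is the name of the search service; the API key comes from app settings.
    var searchClient = new SearchServiceClient("sjkp", new SearchCredentials(ConfigurationManager.AppSettings["SearchApiKey"]));
    var indexName = "documents";

    // Nothing to do if the index already exists.
    var indexExists = searchClient.Indexes.ListNames().Any(s => s == indexName);
    if (indexExists)
    {
        Console.WriteLine("Index {0} exists", indexName);
        return;
    }

    var task = searchClient.Indexes.CreateAsync(new Microsoft.Azure.Search.Models.Index()
    {
        Name = indexName,
        Fields = new Field[] {
            new Field("filename", DataType.String, AnalyzerName.EnLucene) {
                IsRetrievable = true,
                IsSearchable = true,
            },
            new Field("content", DataType.String, AnalyzerName.EnLucene) {
                IsSearchable = true,
            },
            // Exactly one field must be marked as the key.
            new Field("id", DataType.String) {
                IsKey = true,
            },
            // Page number within the PDF, one index document per page.
            new Field("page", DataType.Int32) {
                IsSortable = true,
                IsFilterable = true
            }
        }
    });

    // Block until the index has been created (this is a console-style WebJob).
    Task.WaitAll(task);
    Console.WriteLine("Result of index creation: " + task.Result.StatusCode);
}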
[/csharp]
Connecting to Azure Search is a matter of creating a new SearchServiceClient, passing in the name of your Azure Search service (in my case sjkp) and a SearchCredentials object, which is basically a wrapper around the access key.

Creating the index is a matter of passing an Index, whose Fields property holds a collection of Field objects, into Indexes.Create or Indexes.CreateAsync. The collection of fields must contain exactly one field that has IsKey set to true, but I think that is the only restriction.

The id field is used as the primary key, which you need if you later want to refresh a document in the index.

In my PDF search example, I created a field named page, which I use to store the page number. That way, when I index PDF files I can create one document per page of the PDF file, so the user gets a good indication of where in the file a search result was found. The other fields are pretty self-explanatory.

To index documents from Azure Blob storage, I created a WebJob BlobTrigger method. A BlobTrigger method is called every time a file in the storage container is added or changed.
[csharp]
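// Namespaces used: System, System.Collections.Generic, System.Configuration, System.IO, System.Linq,
// Microsoft.Azure.WebJobs, Microsoft.Azure.Search, Microsoft.Azure.Search.Models,
// iTextSharp.text.pdf, iTextSharp.text.pdf.parser
public static void IndexPdfDocument([BlobTrigger("documents/{name}.{ext}")] Stream input, string name, string ext)
{
    // Only PDF files are indexed; everything else is ignored.
    if (!string.Equals(ext, "pdf", StringComparison.OrdinalIgnoreCase))
    {
        return;
    }
    List<IndexAction> actions = new List<IndexAction>();

    using (PdfReader reader = new PdfReader(input))
    {
        // iTextSharp page numbers are 1-based. One index document is created per page.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            actions.Add(new IndexAction(IndexActionType.MergeOrUpload, new Document()
            {
                {"filename", name + "." + ext},
                // MakeSafeId is a helper that is not shown in this post.
                {"id", MakeSafeId(name + "_" + i + "." + ext)},
                {"content", PdfTextExtractor.GetTextFromPage(reader, i)},
                {"page", i}
            }));
        }
    }

    var client = new SearchIndexClient("sjkp", "documents", new SearchCredentials(ConfigurationManager.AppSettings["SearchApiKey"]));

    // Send the actions in batches of at most 1000, the maximum per call.
    const int batchSize = 1000;
    for (int i = 0; i < (int)Math.Ceiling(actions.Count / (double)batchSize); i++)
    {
        client.Documents.Index(new IndexBatch(actions.Skip(i * batchSize).Take(batchSize)));
    }
}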
[/csharp]
The method first determines whether the file is a PDF. If it is, we create a PdfReader with iTextSharp, a .NET PDF library, which we use to read the text content of every page in the PDF. For every page we create a new document for the documents index, containing the text of that page. Each document is wrapped in an IndexAction of type MergeOrUpload and added to a collection.

Once we have processed all pages, we create a new SearchIndexClient in much the same way as we created the SearchServiceClient when we created the index. The difference is that the SearchIndexClient is used for operating on a single index, e.g. inserting or searching documents, while the SearchServiceClient is for more management-oriented operations.

Currently the maximum batch size is 1000 IndexActions per REST call, so larger collections have to be sent in several batches. Another detail worth noticing is that the value of the key field can only contain letters, digits, underscores, dashes and equal signs.
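As a rough sketch of these two details, the helpers below show one way to turn a file name into a safe key and to split a list of actions into batches. The names IndexingHelpers, MakeSafeKey and InBatches are made up for this illustration; they are not part of the Azure Search library, and MakeSafeKey is not necessarily how MakeSafeId is implemented. The key helper deliberately sticks to ASCII letters, digits and underscores.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helpers, for illustration only.
public static class IndexingHelpers
{
    // Replaces every character that is not an ASCII letter, digit or
    // underscore with '_', so the result is usable as a document key.
    public static string MakeSafeKey(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        foreach (char c in raw)
        {
            bool allowed = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                           || (c >= '0' && c <= '9') || c == '_';
            sb.Append(allowed ? c : '_');
        }
        return sb.ToString();
    }

    // Splits a list into consecutive chunks of at most batchSize items.
    public static IEnumerable<List<T>> InBatches<T>(IReadOnlyList<T> items, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        for (int start = 0; start < items.Count; start += batchSize)
        {
            yield return items.Skip(start).Take(batchSize).ToList();
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints: my_report_3_pdf
        Console.WriteLine(IndexingHelpers.MakeSafeKey("my report_3.pdf"));

        // 2500 items in batches of 1000. Prints: 1000,1000,500
        var sizes = IndexingHelpers.InBatches(Enumerable.Range(0, 2500).ToList(), 1000)
                                   .Select(b => b.Count);
        Console.WriteLine(string.Join(",", sizes));
    }
}
```

Note that simply replacing characters can map two different file names to the same key (for example a.b and a_b), so a real implementation might prefer an encoding that cannot collide.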

These two pieces of code are all that is needed to index PDF documents in the simplest possible way. The iTextSharp library is not perfect, and sometimes it won’t extract the text 100% correctly, but considering how simple it is to get started with, I feel it’s a great starting point.
Feel free to grab the code from https://github.com/sjkp/SJKP.AzureSearch.PdfIndexer and expand upon it.