
Arab World

Technical difficulties of digital publishing in Arabic: ePub and OCR

PDF or Flash formats do not present any technical complications for local publishers. But Ramy Habeeb, the co-founder of Kotobarabia, explains that even when a text in Arabic has been carefully laid out using InDesign and other similar tools, converting it to ePub proves extremely troublesome, for various reasons:

1) The text flows from right to left, which creates layout problems: indented paragraphs or quotations in the ePub file will not be displayed properly. Even when the right-to-left flow is shown correctly, certain formats make the characters illegible, a problem that has to be solved manually. It is therefore not possible to obtain an ePub directly from an RTF or InDesign file: the text has to be checked line by line.

2) Another difficulty is related to fonts. Habeeb points out that the same thing happened in European publishing several decades ago, when publishers differentiated themselves from each other by their typefaces: each publisher could create its own, and the result was a plethora of fonts. This is no longer a problem in Europe, thanks to standardized tools such as Microsoft Office or InDesign. With texts in Arabic, however, the difficulty remains: whenever there is a quote from the Qur’an, for example, publishers take great pains to set the lines in an extremely beautiful font, which the ePub converter is often unable to decipher. One solution would be to treat these passages as images, but this gives rise to a new problem: that of inserting an image into the file.

3) Thirdly, most Arabic letters can be represented in three or four different ways. A letter looks different depending on whether it stands alone or appears at the beginning, in the middle or at the end of a word. Often the publisher needs the letter to look as though it is in the middle of a word, but it appears as though it were alone, and the result is complete gibberish.

4) Lastly, classical Arabic, which is not always used but does appear in high-brow texts, has Teshkiel, or accentuation. It is possible to place five different accents on a single letter (on “a”, “b” or “c”, for example), which constitutes a technical challenge when it comes to converting to ePub.[1]
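Points 3 and 4 can be illustrated at the level of Unicode code points. The sketch below is not taken from Habeeb's account and does not reproduce any particular ePub converter. It rests on two assumptions. The first is that a file may contain the legacy "presentation form" code points, which hard-code one positional shape of a letter, so that a letter can look isolated where a medial shape was wanted. The second is that the Teshkiel marks are stored as separate combining characters that follow the letter. The helper names `to_base_letters` and `strip_tashkeel` are hypothetical.

```python
import unicodedata

# The letter BEH (U+0628) and its four legacy presentation-form code points.
# Each presentation form freezes one positional shape of the same letter.
BEH = "\u0628"
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def to_base_letters(text: str) -> str:
    """Fold presentation-form code points back to base letters (NFKC).

    After folding, choosing the positional shape is left to the
    rendering engine instead of being fixed in the text itself.
    """
    return unicodedata.normalize("NFKC", text)


def strip_tashkeel(text: str) -> str:
    """Drop non-spacing combining marks (category Mn).

    This removes the Arabic vowel marks, but also any other
    non-spacing mark in the text, so it is a blunt tool.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    for position, form in BEH_FORMS.items():
        print(position, unicodedata.name(form), to_base_letters(form) == BEH)

    # KAF + FATHA, TEH + FATHA, BEH + FATHA: a fully vocalized three-letter word.
    vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(len(vocalized), len(strip_tashkeel(vocalized)))
```

Stripping the marks is shown only to make them visible as separate characters. A publisher of classical texts would want to keep them, which is why their correct placement matters during conversion.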

 

The profusion of Arabic fonts that complicates conversion to ePub causes the same trouble for text recognition systems (OCR). Habeeb explains it as follows:

There are so many dots and lines and other things that an automated OCR system can mistake for a letter or convert into another letter. And to complicate matters even more, because the industry is relatively poor, the quality of the paper and the quality of the ink used isn’t always the highest. All of these factors combined make OCR an extremely difficult endeavour.

It is interesting to examine the strategy implemented by Kotobarabia to overcome these technical hurdles:

So as a result, each time we take on a book, it either goes through one of two processes: 1) One process is that we fully type it so that it’s fully searchable. We discovered that typing a book with a series of edits is cheaper than working with current OCR software that’s on the market. Then we’ll go through the whole process of creating the metadata behind it and uploading it to the site and converting it to the two formats that we are currently using commercially. 2) The thing that we do is to scan the pages, and then we’ll have people read the pages and pick out keywords, so that the books become semi-searchable. We do this for most of our books. But if we find that a book is being read over and over again, or that this title has a particular interest, then we’ll go back and retype it. It’s actually cheaper this way to do it. It’s a more sustainable business model.[2]
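The second process, scanned pages made "semi-searchable" through hand-picked keywords and retyped once a title proves popular, can be sketched as a small data structure. This is only an illustration under stated assumptions, not Kotobarabia's actual system: the class names, the `RETYPE_THRESHOLD` value and the idea of counting reads per title are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical cut-off: number of reads after which a scanned
# title is queued for full retyping.
RETYPE_THRESHOLD = 3


@dataclass
class ScannedBook:
    title: str
    page_keywords: dict  # page number -> keywords picked out by a human reader
    reads: int = 0


class Catalogue:
    def __init__(self):
        self.books = {}
        self.index = defaultdict(set)  # keyword -> {(title, page)}

    def add(self, book):
        """Register a scanned book and index its hand-picked keywords."""
        self.books[book.title] = book
        for page, words in book.page_keywords.items():
            for word in words:
                self.index[word].add((book.title, page))

    def search(self, keyword):
        """Return (title, page) pairs tagged with the keyword, sorted."""
        return sorted(self.index.get(keyword, set()))

    def record_read(self, title):
        self.books[title].reads += 1

    def retype_queue(self):
        """Titles read often enough to justify full retyping."""
        return [t for t, b in self.books.items() if b.reads >= RETYPE_THRESHOLD]
```

A search here only finds pages a reader has tagged, which is what makes the book semi-searchable rather than fully searchable. The expensive retyping step is deferred until the read count suggests it will pay off.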


Notes    
  1. Personal interview, December 2010.
  2. Turner, James: “Bringing e-Books to Africa and the Middle East”, O’Reilly Radar, 19 January 2010.


  1. Hedaya

     /  19/07/2011

    Very interesting. I hope we will continue to have people such as Habeeb following up on this. I’m hopeful that we will soon see a plethora of Arabic texts available in epub format.

  2. Soon this will be updated. We once struggled to OCR an Arabic book: what should have been just three days’ work took a week to complete. My technical guys tried their best to bring out the book.

  3. thierry quinqueton

     /  27/08/2011

    I think this question is a very important one for cultural diversity. Universities and major libraries, within the Arab region but also in other parts of the world, must be involved.

  4. No need to exaggerate. There is always a workaround, and the following is one of them. With current technology, the only way to get accurate text in Arabic is to use either PDF or images.

    With the following version you can read your book in two different font sizes, assuming your device has a landscape mode. If not, no problem: just download your epub in landscape mode. The rest is explained in the post.

    http://avaxhome.ws/ebooks/religion/The_Quran_Arabic_Hayrat_Nesriyat_Epub_Mobi_Prc_Kuran_Hayrat_Nesriyat.html

    Also, I tried InDesign and many other approaches, but to no avail. Using HTML is much easier than using intermediate software.
