PDF Pitfalls [Part 2] – How to Translate PDF Documents into Several Languages

In Part 2 of this post, we continue to look at PDF conversion and discuss the state of the art in preparing PDFs for translation.


“— What's the big deal with PDF conversion?
Why not simply cut and paste? It's easy to do and doesn't cost anything.”


Extracting text from PDFs, though time-consuming, can seem relatively easy at first glance. You can copy and paste, take screenshots, or even retype the information by hand. However, it becomes nearly impossible when copying from the PDF isn't allowed or when a pasted section comes out unusable. And just as with an iceberg, the visible part is misleading: the look and feel of a PDF document is only the tip, and there is much more to consider beneath the surface.


“— Are all PDFs created equal?”


No two PDFs are exactly alike. There are several different flavors of PDF, but they all reduce to basically two types: Distilled PDF and Scanned PDF. You get a Distilled PDF when you produce a PDF document from a text publishing tool (via Acrobat Distiller or another PDF writer). Other flavors of PDF contain raster images of each page of the document (with or without some text in the background to allow text searching). These documents are referred to as Scanned PDFs. You get them when you scan paper documents (via Acrobat Exchange or some other method).


“— Does the type of PDF created matter?”


Yes, it does. When it comes to converting PDFs into an editable format, the nature of the PDF matters. Extracting text from a Scanned PDF is not simple: it requires good OCR software and at least some tailoring to the problem at hand. Complications arise when, for instance, the image is noisy or the text pixels cannot be clearly distinguished from the background. In such cases the OCR process does not work as smoothly, because it depends on the quality of the PDF provided, and the converted documents will usually require a lot of cleanup.
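As a rough illustration of why the type of PDF matters in practice, here is a minimal sketch of a common triage heuristic: pages that yield almost no extractable text are probably image-only and will need OCR. It assumes you have already extracted the text of each page with whatever tool you use; the function names, the sample data and the `min_chars` threshold are hypothetical. Note that a Scanned PDF with a searchable text layer behind the image would pass this check even though its layout is still just a picture.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'text' or 'image-only' based on its extracted text.

    page_texts: list of strings, one per page, from any extraction tool
    min_chars:  hypothetical threshold below which a page counts as empty
    """
    labels = []
    for text in page_texts:
        # Ignore whitespace so blank lines and spaces do not count as content
        visible = "".join(text.split())
        labels.append("text" if len(visible) >= min_chars else "image-only")
    return labels


def needs_ocr(page_texts, min_chars=20):
    """Return True if at least one page appears to contain no real text."""
    return "image-only" in classify_pages(page_texts, min_chars)


# Made-up sample: one page with text, two pages that extracted as empty
sample = ["Chapter 1. It was a bright cold day in April.", "", "  \n "]
print(classify_pages(sample))
print(needs_ocr(sample))
```

A check like this only tells you which route a document is likely to take (direct conversion or OCR); it says nothing about how much cleanup the result will need.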


“— What types of documents will convert easily?”


It is important to note that a fully optimized process is a utopia when it comes to translating PDFs, but as a general rule, the simpler the layout of the source documents, the better the converted documents will be. For instance, if you are converting novels, which typically have little layout to speak of, you can expect a lot of success (and hence very little cleanup) in converting them to an editable format. If, on the other hand, you've got complex pages such as scanned scientific journal pages, which are likely to contain multiple columns, lots of complex tables, math, footnotes and bibliographies, you should expect to have to do a fair amount of cleanup on the converted documents.


“— Is there anything happening to make PDF conversion easier in the future?”


Several tools have been designed and developed to interact with PDF documents. Besides the common Adobe products and solutions, third-party developers offer many software packages and APIs, either under license or as freeware. Consequently, a wide range of PDF tools is available on the market. Most of them can extract textual content, but their practical use is limited in that the text's reading order is not necessarily preserved, especially in multi-column documents or in the presence of complex layouts.
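The reading-order problem is easy to picture with a small sketch. It assumes a hypothetical extraction step that returns text blocks with page coordinates; the `Block` type, the coordinates and the `column_gap` threshold are illustrative assumptions, not the output of any particular tool. Sorting the blocks purely from top to bottom interleaves the two columns, while grouping them into columns first recovers the order a reader would follow.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x: float   # left edge of the block (hypothetical page units)
    y: float   # top edge of the block, 0 = top of the page
    text: str


def naive_order(blocks):
    """Sort top to bottom, then left to right, ignoring columns."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_aware_order(blocks, column_gap=100.0):
    """Group blocks into columns by left edge, then read each column downward.

    A new column starts when a block's left edge is more than column_gap
    to the right of the previous block's left edge (an assumed threshold).
    """
    columns = []
    for b in sorted(blocks, key=lambda b: b.x):
        if columns and b.x - columns[-1][-1].x <= column_gap:
            columns[-1].append(b)
        else:
            columns.append([b])
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda b: b.y))
    return [b.text for b in ordered]


# A made-up two-column page: column A on the left, column B on the right
page = [
    Block(50, 100, "A1"), Block(320, 100, "B1"),
    Block(50, 200, "A2"), Block(320, 200, "B2"),
]
print(naive_order(page))         # ['A1', 'B1', 'A2', 'B2']
print(column_aware_order(page))  # ['A1', 'A2', 'B1', 'B2']
```

Real pages are far messier (headings that span columns, tables, footnotes), which is why even column-aware tools still leave cleanup work on complex layouts.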


Adobe Acrobat X Pro [http://www.adobe.com/products/acrobatpro.html] does a startlingly good job of exporting PDF files into editable Word or Excel documents. It isn't perfect, and it didn't select the correct fonts when exporting my test documents, but it did a far better job of preserving the original format than anything I've seen in third-party software. This export function worked best when I used Distilled PDFs rather than PDFs created from a scanned image. Scanned PDFs contain only a picture of the original text, and Acrobat can extract the text only by using its built-in Optical Character Recognition (OCR) software. Acrobat X has more accurate OCR than previous versions did, but it still lags far behind the best third-party OCR software, such as ABBYY FineReader 10 Professional Edition [http://finereader.abbyy.com/].

Our experience is that you need to experiment with various options to see which ones best fit your needs and work best with your PDF documents. Our approach is to constantly re-evaluate the tools, methods and techniques available and to incorporate the best of what's out there into what we do.

The fact is that all PDF files used as a source for translation need reworking before they can be translated into several languages. By making the native source documents available to your translation partner, you avoid that rework and any unnecessary preparation of the documents before translation can start. It allows us to perform a full analysis and lets you stay in control of your budget and schedule, with no surprises down the road. PDFs serve a purpose, but when it comes to translation, there is nothing better than the real thing: native source documents (such as FrameMaker, InDesign, QuarkXPress, etc.).


RULE OF THUMB:


“Native source documents are always needed (and preferred) for translation and are much more time and cost efficient to work with from the get-go.”



Missed part 1 of this post?
->Pick it up here [http://lifesciencestranslations.blogspot.com/2010/11/pdf-pitfalls-part-1-how-to-translate.html]

Translation: The Invaluable Art Form

The art of translation is essential to many industries that help save, sustain or enhance human life. Translated user manuals for medical equipment clearly aid health care, allowing medical professionals to monitor and improve the health of their patients worldwide. Translated software for air traffic controllers is an obvious key to safe global travel. And the value of translated nutrition facts on food labels is measured daily by appreciative consumers who want to judge the quality of what they eat.


Still, the art of translation can prove invaluable in other circumstances. On January 19th, 2011, Dan Gunderson of Minnesota Public Radio delivered a broadcast from Fargo, North Dakota. “For nearly 150 years,” he says, “the voices of Dakota men imprisoned after the Dakota Conflict of 1862 went unheard. But the details of their imprisonment are starting to emerge, in letters written by those prisoners after six weeks of fighting along the Minnesota River Valley that left hundreds of Indians, settlers and soldiers dead.” Clifford Canku, an elder of the Native American Dakota tribe and professor of Dakota language at North Dakota State, has been working tirelessly to uncover the meaning of these letters. For over a century, the only available firsthand accounts of the Dakota Conflict of 1862 were written by soldiers of the U.S. Army, prison guards, or other members of the bureaucracy. Canku’s work is bringing the other side of the story to life. Because he feels a duty to his ancestors, he is slowly and deliberately translating the 150 remaining prisoner letters, which until now have been stored in a vault at the Minnesota Historical Society. Some of them take nearly a week each. The letters detail horrifying prison conditions and are painful to read, but they are helping to bring a sense of closure to the prisoners’ descendants, who have been tormented by unanswered questions for so long.


While language translation makes an immense contribution to daily quality of life around the world, sometimes it takes a story of human dedication and compassion to remind us that communication goes beyond software and coding and machines; it connects us at the heart.

Translation at the service of innovation

On November 30, 2010, the European Patent Office (EPO) and Google signed a Memorandum of Understanding to improve access to patent translations in the languages spoken in EPO member states, as well as in several Asian languages. The agreement came in the middle of a controversy about the creation of a single pan-European patent.

The current process for registering a patent is tedious, complicated and costly. Many inventors end up forfeiting protection of their inventions, with the economically critical group of small and medium-sized businesses at the top of the list.

Patents need to be filed in one of the three official EPO languages (English, French and German) and then translated into the languages of the countries where the applicant wants patent protection. This process poses some problems. Inventors have difficulty searching for information about patents published in foreign languages. Furthermore, many European patents are not available in all national languages, and are therefore not protected in those countries.

In recent years, the governments of the European Union have been talking about implementing a single common patent in the 27 member states. The idea was to submit patents in one of the three official languages with a summary translated into the other two. However, the EU governments failed to reach an agreement, with Italy and Spain leading the opposition by demanding that their languages be represented on an equal footing with the three official ones.

So now the EPO, enabled by Google, has stepped into the void. The EPO will use Google's statistical machine translation tool (Google Translate) to translate patents into all the languages represented in the EPO, as well as patents from Asia, the United States, Canada, Australia, Russia and India that will receive protection in Europe.

In return, Google's machine translation technology stands to improve notably: the EPO will grant full access to its manually translated patents (almost 1.5 million documents, growing by more than 50,000 new patents every year).

Thanks to this collaboration, scientists and inventors, whose work is based on innovation, will have easier access to patents that are already registered. The partnership will also make the registration process easier for inventors, reducing costs and improving legal certainty, and will make the decision easier for EU member states that want to simplify the introduction of a pan-European patent.

The agreement has divided the scientific community. On the one hand, some think this is a good measure to boost competitiveness in Europe, provided the solution is used only to become familiar with a document and understand its overall meaning: the translations will have no legal status and will serve informational purposes only. On the other hand, there is a risk: machine translation is still vague and inaccurate, while patent translations demand precision and accuracy.

So, when will it be possible to talk about a single pan-European patent?

The Translation Industry in Numbers

Little is known about the translation industry other than the fact that companies and individuals translate millions of pages every year to help the world communicate. In fact, the translation industry is a colossal business: it is predicted that by 2015 it will be worth US$25 billion worldwide. Yet more than 99 percent of what people write, say, or generate is never translated and remains in the language in which it was created.


If you wanted to reach every person on earth in their own language, you would need to translate your content into 7,000 languages, but if you translated it into only 80 languages you could reach 80% of the world's population.


As far as the Internet is concerned, 10 languages account for 75 percent of the people on the web: English, Japanese, German, Spanish, French, Italian, Korean, Swedish, Chinese Simplified, and Norwegian. Translating your web content into 50 languages would provide access to almost 95% of the world's online residents. And since 75% of consumers say they would be more likely to buy a product with information in their own language, it makes sense to translate your literature into multiple languages. Yet despite spending millions of dollars on translation (the average corporate translation budget equals 0.25% to 2.5% of annual revenue), only 25% of companies measure and calculate the return on their localization investment.
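To put the budget range into concrete terms, here is a small worked example. The revenue figure of US$100 million is purely hypothetical, and the function name is made up for illustration; only the 0.25% and 2.5% shares come from the figures above.

```python
def translation_budget_range(annual_revenue, low_share=0.0025, high_share=0.025):
    """Return the (low, high) translation budget for a given annual revenue.

    The default shares are the 0.25% and 2.5% quoted in the text.
    """
    return annual_revenue * low_share, annual_revenue * high_share


# Hypothetical company with US$100 million in annual revenue
low, high = translation_budget_range(100_000_000)
print(f"US${low:,.0f} to US${high:,.0f}")  # US$250,000 to US$2,500,000
```

With sums of that size at stake, it is striking that only a quarter of companies track what the spending returns.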


The translation industry is extremely fragmented despite some consolidation over the past 15 years. The vast majority of translation companies (70%) employ only between 1 and 10 people, 11% employ between 10 and 100 people, and the rest employ 100 or more. Only six firms worldwide employ more than 1,000 people. There are about 4,000 translation companies employing five people or more, of which only 10% are in the USA, in addition to an uncounted number of individual translators (freelance or single language providers).


Last but not least, if you want to start a career in translation, you should move to Switzerland or Denmark where it is estimated that translators earn the highest salaries.
