Overview
IdxSub2Srt is a free program that converts
existing idx/sub files to the srt text format. Idx/sub files are mostly
generated from DVD rips and hold the subtitle contents of those ripped
DVDs. They store the subtitles as bitmaps, so converting them to a text
format like srt needs some kind of OCR (Optical Character Recognition)
function. IdxSub2Srt provides this function in a way that I think makes
the whole conversion process simple and comfortable: without much
hassle, in about 10 minutes a user can convert any subtitle contained
in an idx/sub file to its srt equivalent.
The OCR function is a simple one based on a kind of
pattern matching. The only effort asked of the user is to teach the
program what text (usually a single letter) each pattern found in the
subtitle bitmaps represents. Once the program has learnt the whole
alphabet in use and every other symbol (numbers, etc.), all subtitles
can be converted to text easily.
IdxSub2Srt makes the whole learning process as
comfortable and fast as possible, and I think it succeeds well in this
respect. It keeps an OCR database, so every new idx/sub file analyzed
is checked against this database. If its patterns are already known,
the user only has to recognize the characters missing from his/her
previous efforts.
At the moment the program can handle English subtitles
and those that match the default character set configured on your
Windows PC. For example, if my PC is set (through Control Panel/Regional
and Language Options/Advanced) to use Greek as the default character set
for non-Unicode text, then the program can convert English and Greek
subtitle text.
Converting idx/sub to srt has many advantages. For
example, you can recreate the idx/sub file, this time with your own
choice of font, font size and position on screen. This is my case with
the WDTV (Western Digital TV HD) media player, which has very good
support for idx/sub subtitle files. Most of the time the positioning
information in the original idx/sub is not correct for this media player
(not to mention the quality and size of the font), so I convert it to
srt and then, using AVIAddXSubs (in the same zip package as IdxSub2Srt),
convert it back to idx/sub, this time with the appropriate on-screen
positioning for WDTV and much better looking, bigger letters.
Another use is to help translators get the
original subtitles and translate them into another language.
An srt file is also a more versatile format for storing your
subtitles together with the related videos, and it takes much less
space.
Program Description

1. Subtitle language Selection. Select the
language to extract from the loaded idx/sub. Every idx/sub file can
contain many languages.
2. Load Idx/Sub. Select the idx file to be
processed. Only the selected language from this file will be loaded.
See 1.
3. Save. Save your work from time to time.
Please note that your work is also saved automatically every time you
exit the program.
4. Generate Srt. Generates the recognized text
for every subtitle bitmap and saves it in the same directory as the
loaded idx/sub, using the name of the idx but with the srt
extension.
5. Previous, Next Subtitle (<<, >>). When an
idx/sub file is loaded you can browse back and forth through the
subtitles. The "Only Unknown letters" option changes this
operation a bit. See 13.
6. Subtitle bitmap. Displays the subtitle
bitmap. At the same time the selected pattern (to be
learned/recognized) appears in red. See 7,
8, 9, 15, 16.
7. Previous, Next Pattern in currently selected
subtitle (<<, >>). When an idx file is loaded, the currently
selected subtitle has a list of all the unique patterns it
contains. With the <<, >> buttons you can browse these patterns and
enter at 9 the text/letter that corresponds to each one.
8. Current Pattern/Text to Display/Learn. The
current pattern of the current subtitle appears here in red. The
same pattern is shown in red at 6 to
help you enter the correct text for it.
9. Enter Text for currently selected Pattern.
In this edit box you enter the text that corresponds to the
selected pattern of the selected subtitle.
10. Use my Edited Text. The recognized text for
every subtitle appears at 14 and
is generated automatically. The user can overwrite this text
with his/her own modifications, which the program will use at srt
generation.
11. Current subtitle/Total subtitles. Displays
the currently selected subtitle and the total number of subtitles.
When "Only Unknown letters" (13)
is checked it displays the current subtitle with unrecognized
patterns (always the first) and the total number of subtitles with
unrecognized patterns.
12. AVRG Normal & AVRG Italics. These two options
control how the program separates words. "AVRG Normal" is for normal
style text and "AVRG Italics" for italic style text. It appears that
each of the two text styles needs its own value, with the one
for italics being lower. They work this way: when the distance
between two patterns is less than the AVRG number (in pixels), they
are considered as belonging to the same word. If the distance is
bigger than the AVRG number, a space is inserted between them. These
values are generated automatically through some statistics, but the
user can tweak them to get better results, checking the outcome
immediately at 14.
13. Only Unknown letters. When it is checked you can
browse only subtitles that contain unrecognized patterns (5)
and only the unrecognized patterns of the subtitle (7).
You cannot go back, and you can go forward only after the text of the
selected pattern has been entered. This function is very important
for the OCR learning process.
14. Generated Subtitle text. The generated text
for the current subtitle appears here. Every unrecognized pattern
appears as # in the text. This text cannot be modified unless
you check "Use my Edited Text" (10).
In that case the text provided by the user is used for the generation
of the final srt file.
15. Italic. Marks a pattern as italic.
Any line of text that contains at least one such letter will be
enclosed in <i></i> tags.
16. All Italics. All patterns of the selected
subtitle are marked as italic.
17. Ignore Subtitle. The subtitle is ignored and
is not included in srt generation. This is useful for skipping
subtitles intended for the hearing impaired, etc.
18. Enter here the number of the subtitle to jump to.
The jump is made when the Go button (19)
is pressed.
19. Go. Jumps to the subtitle whose number is
entered at 18.
Work Flow
The first thing is to select the language to be
extracted from the idx/sub. This is done through 1.
Then select the idx/sub file through
2. The program will take the selected language and extract the
corresponding bitmaps. The bitmaps will be analyzed and all the separate
patterns on them will be entered in a list. Next the program will check
those patterns against any existing OCR database, and if an OCR file is
found that has at least 10 patterns in common with the loaded idx/sub,
it will be used. The user now has to teach the program any new
patterns introduced. The analysis of an idx/sub file is done
only once. When you save your work by hand (Save button,
3) or automatically every time you exit the program, a .prj
file is created in the same directory as the idx/sub file. It
includes all the analysis information and the OCR file used. The
next time the idx/sub file is loaded and its corresponding prj file is
present in the same directory, all the needed analysis information will
be loaded from there.
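The "at least 10 patterns in common" check can be sketched as below. This is only an illustration and not the program's actual code: the function name `pick_ocr_database`, the representation of a pattern as a `bytes` value, and the choice of the database with the most shared patterns when several qualify are all assumptions of mine.

```python
def pick_ocr_database(loaded_patterns, databases, min_shared=10):
    """Return the name of a matching OCR database, or None.

    loaded_patterns: iterable of hashable patterns from the idx/sub.
    databases: dict mapping a database name to its known patterns.
    A database qualifies when it shares at least `min_shared` patterns
    with the loaded file. Preferring the one with the most shared
    patterns is an assumption, not documented behaviour.
    """
    loaded = set(loaded_patterns)
    best_name, best_shared = None, 0
    for name, known in databases.items():
        shared = len(loaded & set(known))
        if shared >= min_shared and shared > best_shared:
            best_name, best_shared = name, shared
    return best_name


# Hypothetical data: each pattern is stood in for by a one-byte value.
databases = {
    "OCR1": {bytes([i]) for i in range(5)},
    "OCR2": {bytes([i]) for i in range(20)},
}
loaded = [bytes([i]) for i in range(12)]
chosen = pick_ocr_database(loaded, databases)
```

Here `chosen` is "OCR2", because it shares 12 patterns with the loaded file while "OCR1" shares only 5.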
The first time an idx/sub file is loaded and analyzed
(no prj file present) a screen appears that helps the program
distinguish the text on the bitmaps as well as possible.

Choose the color that gives the most solid and
lean characters in the first subtitle of the idx/sub, which
is shown behind this screen on the main screen at 6.
The program suggests the color it thinks is best, but you may be able to
make a better selection. Generally, if the suggested color gives solid
and lean letters, keep it (it is the letters' inner/body color). Avoid
colors that represent the outline of the letters.
Please keep in mind that OCR learning is not
stored in these prj files (one for every loaded idx/sub file).
Your work is saved in the OCR database. The OCR database is a directory
named OCR, created in the same directory that IdxSub2Srt runs from. It
contains pairs of OCR*.txt/OCR*.bin files that actually hold your work.
The prj files do store some other information, however, like the text
you enter when selecting "Use my edited text" (10).
They also store which subtitles have to be ignored in srt generation (17).
Apart from this information, all other analysis data can be recovered if
the prj file is deleted. The program will load the appropriate OCR file
and eventually a new prj file will be created. Please note that if you
delete the OCR database for any reason, all prj files have to be deleted
too.
Now the real OCR learning starts. For every subtitle you
browse through 5 there is a number
of patterns extracted during the analysis phase. Your job is to
replace the # symbol assigned automatically, which means "unknown
pattern", with the text that really corresponds to the
selected pattern. The same pattern can occur many times in one subtitle
and, of course, in many other subtitles. For example, in the picture above
a pattern with the Greek letter o (omicron) is selected. It appears in
red at 8 and is painted red everywhere
it is found in the subtitle bitmap at 6.
The letter o is found in six places in the subtitle bitmap shown.
Every time the appropriate text is entered at
9, text is generated for the subtitle
and appears at 14. Progressively every
# will be replaced with the text entered by the user.
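The learning loop described above can be pictured with a small sketch. It is only an illustration under my own assumptions, not the program's implementation: patterns are stood in for by short strings used as dictionary keys, spaces between words are left out, and the names `recognize`, `learn` and `unknown_patterns` are hypothetical.

```python
UNKNOWN = "#"  # placeholder shown for a pattern that has no text yet


def recognize(patterns, learned):
    """Build the text of one subtitle from its patterns, in order."""
    return "".join(learned.get(p, UNKNOWN) for p in patterns)


def learn(learned, pattern, text):
    """Record the text the user entered for a pattern."""
    learned[pattern] = text


def unknown_patterns(patterns, learned):
    """Unique patterns of a subtitle that still have no text, in order."""
    seen = []
    for p in patterns:
        if p not in learned and p not in seen:
            seen.append(p)
    return seen


# Hypothetical subtitle: the same pattern "P1" occurs twice.
learned = {}
subtitle = ["P1", "P2", "P1"]
before = recognize(subtitle, learned)   # nothing learnt yet
learn(learned, "P1", "o")
after = recognize(subtitle, learned)    # every "P1" is now resolved
remaining = unknown_patterns(subtitle, learned)
```

Entering the text once resolves every occurrence of that pattern, which is why the number of # symbols drops quickly as you work through the first subtitles.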
To make the work faster, check the "Only
Unknown letters" option (13). This
helps you concentrate your efforts only on subtitles and patterns not
recognized yet. With this option checked you can browse only forward,
and only after you enter text for (recognize) the current pattern. At
11 you can watch, every time you go to the next unrecognized
pattern, the number of subtitles that remain not completely
recognized. If you make a mistake and wish to go back to fix the text
entered for a pattern in the current subtitle, just uncheck 13, browse
to the pattern, fix it, and check 13 again
to continue your work.
One aspect you must pay attention to is how the
program inserts spaces to organize the text into words. It uses the
distance between the patterns and two numbers ("AVRG Normal" & "AVRG
Italics", see
12). One number affects normal style
text, the other italic style text. When the distance between any two
consecutive patterns is less than the relevant AVRG number, they are
considered as belonging to the same word. If the distance is bigger than
this number, a space is inserted between them. The two numbers are
computed through some statistics, but the user can tweak them, see
the result for himself/herself (at 14)
and decide which value gives the best "word separation" results.
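The word separation rule can be sketched as follows. This is a minimal illustration, not the program's code. I assume that each recognized pattern carries its left and right pixel positions, that the distance is measured from the right edge of one pattern to the left edge of the next, and that a distance exactly equal to the AVRG number counts as the same word (the description above does not cover that case). The name `join_patterns` is hypothetical.

```python
def join_patterns(items, avrg):
    """Join recognized patterns into text, adding spaces between words.

    items: list of (left, right, text) tuples sorted from left to right,
           with positions in pixels.
    avrg:  the AVRG number for the text style (normal or italic).
    A space is inserted only when the gap is bigger than `avrg`.
    """
    out = []
    prev_right = None
    for left, right, text in items:
        if prev_right is not None and left - prev_right > avrg:
            out.append(" ")
        out.append(text)
        prev_right = right
    return "".join(out)


# Hypothetical positions: gaps of 2, 12 and 2 pixels.
line = [(0, 8, "H"), (10, 18, "i"), (30, 38, "y"), (40, 48, "o")]
good = join_patterns(line, avrg=6)      # only the 12 pixel gap splits
too_low = join_patterns(line, avrg=1)   # every gap becomes a space
```

With an AVRG of 6 the line reads "Hi yo", while a value of 1 gives "H i y o". This is the kind of effect you see at 14 when the number is set too low.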
When all patterns are recognized then you can press
"Generate Srt" (4) to generate the
srt file. This will be created in the same directory as the loaded
idx/sub file.
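For readers who have not looked inside an srt file, here is a rough sketch of what the generation step has to produce. The srt layout (running number, "start --> end" time line, text lines, blank line) is the standard one. Everything else is my assumption for the sake of illustration: the dictionary fields, the function names, renumbering the entries after ignored subtitles are dropped, and the timings being already available in milliseconds.

```python
def format_timestamp(ms):
    """Format milliseconds as the srt time stamp HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def build_srt(subtitles):
    """Return srt text for a list of subtitle dictionaries.

    Each dictionary has 'start_ms', 'end_ms', 'lines' (a list of
    (text, is_italic) pairs) and an optional 'ignored' flag.
    Ignored subtitles are skipped; italic lines get <i></i> tags.
    """
    blocks = []
    number = 0
    for sub in subtitles:
        if sub.get("ignored"):
            continue
        number += 1
        times = (f"{format_timestamp(sub['start_ms'])} --> "
                 f"{format_timestamp(sub['end_ms'])}")
        lines = [f"<i>{text}</i>" if italic else text
                 for text, italic in sub["lines"]]
        blocks.append("\n".join([str(number), times, *lines]) + "\n")
    return "\n".join(blocks)


# Hypothetical subtitles: the second one is marked as ignored.
subs = [
    {"start_ms": 1000, "end_ms": 2500, "lines": [("Hello", False)]},
    {"start_ms": 3000, "end_ms": 4000, "lines": [("[music]", False)],
     "ignored": True},
    {"start_ms": 3_661_001, "end_ms": 3_662_000, "lines": [("Bye", True)]},
]
srt_text = build_srt(subs)
```

The resulting text has two entries: "Hello" as number 1 and the italic "Bye" as number 2, with the ignored subtitle left out.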
What is New?
1.3
- A fix was made for better support of RTL languages
(Arabic, Farsi, Hebrew). Please note that the
program was created with the Greek and Latin alphabets in mind.
Languages like Arabic and Farsi, which connect their
letters to one another, will not be handled in the best way
by IdxSub2Srt. That means more work for Arabic and Farsi
users.
1.2
- The "AVRG Space" option has been split in two, "AVRG
Normal" and "AVRG Italics". Each of the two text styles needs its
own AVRG option, with "AVRG Italics" being lower. This
way the user, by tweaking these numbers, can get the best
word separation result for both text styles.
1.1
- Added "Ignore subtitle". Marks the currently selected
subtitle to be excluded from srt generation.
- Added the ability to jump to any of the subtitles. You enter
the number and press "Go".