Overview
IdxSub2Srt is a free program that converts existing idx/sub files to
the srt text format. Idx/sub files are mostly generated from DVD rips
and hold the subtitle contents of those DVDs. They store the subtitles
as bitmaps, so converting them to a text format like srt requires some
kind of OCR (Optical Character Recognition). IdxSub2Srt provides this
function in a way that I think makes the whole conversion simple and
comfortable: without much hassle, in about 10 minutes a user can
convert any subtitle contained in an idx/sub file to its srt
equivalent.
The OCR function is a simple one based on a kind of pattern matching.
The only effort required from the user is to teach the program what
text (usually a single letter) each pattern found in the subtitle
bitmaps represents. Once the program has learned the whole alphabet in
use and every other symbol (numbers, etc.), all subtitles can be
converted to text easily.
IdxSub2Srt makes the learning process as comfortable and fast as
possible, and I think it succeeds in this. It keeps an OCR database, so
every new idx/sub file that is analyzed is checked against it. If the
patterns of the file are already known, the user only has to recognize
the characters still missing from his/her previous efforts.
At the moment the program can handle English subtitles and subtitles
that match the default character set configured on your Windows PC. For
example, if the PC is set (through Control Panel/Regional and Language
Options/Advanced) to use Greek as the default character set for
non-Unicode text, then the program can convert English and Greek
subtitle text.
Converting idx/sub to srt has many advantages. For example, you can
recreate the idx/sub file, this time with your own choice of font, font
size and position on screen. This is my case with the WDTV (Western
Digital TV HD) media player, which has very good support for idx/sub
subtitle files. Most of the time the positioning information in the
original idx/sub is not correct for this media player (not to mention
the quality and size of the font), so I convert it to srt and then,
using AVIAddXSubs (in the same zip package as IdxSub2Srt), I convert it
back to idx/sub, this time with the appropriate on-screen positioning
for WDTV and much better looking, bigger letters.
Another use is to help translators obtain the original subtitles and
translate them into another language.
An srt file is also a more versatile format for storing your subtitles
together with the related videos, and it takes much less space.
Program Description

1. Subtitle language Selection. Select the
language to extract from the loaded idx/sub. Every idx/sub file can
contain many languages.
2. Load Idx/Sub. Select the idx file to be
processed. Only the selected language from this file will be loaded.
See 1.
3. Save. Save your work from time to time.
Please note that your work is also saved automatically every time you
exit the program.
4. Generate Srt. Generates the recognized text
for every subtitle bitmap and saves it in the same directory as the
loaded idx/sub, using the name of the idx file but with the srt
extension.
5. Previous, Next Subtitle (<<, >>). When an idx/sub
file is loaded you can browse back and forth through the subtitles.
The "Only Unknown letters" option changes this behavior a bit. See 13.
6. Subtitle bitmap. Displays the subtitle
bitmap. At the same time the selected pattern (to be
learned/recognized) appears in red. See 7, 8, 9, 15, 16.
7. Previous, Next Pattern in currently selected
subtitle (<<, >>). When an idx file is loaded, the currently selected
subtitle has a list of all the unique patterns it contains. With the
<<, >> buttons you can browse these patterns and enter at 9 the
text/letter that corresponds to each one.
8. Current Pattern/Text to Display/Learn. The
current pattern of the current subtitle appears here in red. The
same pattern is shown in red at 6 to help you enter the correct text
for it.
9. Enter Text for currently selected Pattern.
In this edit box you enter the text that corresponds to the selected
pattern of the selected subtitle.
10. Use my Edited Text. The recognized text for
every subtitle is generated automatically and appears at 14. When this
option is checked, the user can overwrite this text with his/her own
modifications, which the program will use at srt generation.
11. Current subtitle/Total subtitles. Displays
the number of the currently selected subtitle and the total number of
subtitles. When "Only Unknown letters" (13) is checked, it displays the
current subtitle with unrecognized patterns (always the first) and the
total number of subtitles with unrecognized patterns.
12. AVRG Normal & AVRG Italics. These two options control
how the program separates words. "AVRG Normal" is for normal style
text and "AVRG Italics" is for italic style text. It appears that each
text style needs its own value, with the one for italics being lower.
They work this way: when the distance between two patterns is less
than the AVRG number (in pixels), they are considered to belong to the
same word. If the distance is bigger than the AVRG number, a space is
inserted between them. These values are generated automatically
through some statistics, but the user can tweak them to get better
results, watching the effect immediately at 14.
13. Only Unknown letters. When checked, you can
browse only the subtitles that contain unrecognized patterns (5) and
only the unrecognized patterns of each subtitle (7). You cannot go
back, and you can go forward only after entering the text for the
selected pattern. This function is very important for the OCR learning
process.
14. Generated Subtitle text. The generated text
for the current subtitle appears here. Every unrecognized pattern
appears as # in the text. This text cannot be modified unless you
check "Use my Edited Text" (10). In that case the text provided by the
user is used for the generation of the final srt file.
15. Italic. Marks a pattern as italic.
A line of text that contains at least one such letter will be
enclosed in <i></i> tags.
16. All Italics. All patterns of the selected
subtitle are marked as italics.
17. Ignore Subtitle. The subtitle is ignored and
is not included in srt generation. This is useful to skip subtitles
that are intended for people with hearing problems, etc.
18. Enter here the number of the subtitle to jump to.
The jump is made when the Go button (19)
is pressed.
19. Go. Jumps to the subtitle whose number is
entered at 18.
Work Flow
The first thing to do is to select the language to be extracted from
the idx/sub. This is done through 1.
Select the idx/sub file through 2. The program will take the selected
language and extract the corresponding bitmaps. The bitmaps will be
analyzed and all separate patterns found on them will be entered in a
list. Next the program will check those patterns against any existing
OCR database, and if an OCR file is found that has at least 10 patterns
in common with the loaded idx/sub, that file will be used. The user
then has to teach the program any new patterns introduced. The analysis
of an idx/sub file is done only once. When you save your work by hand
(Save button - 3), or automatically every time you exit the program, a
.prj file is created in the same directory as the idx/sub file. It
includes all the analysis information and the OCR file used. The next
time the idx/sub file is loaded, if its corresponding prj file is
present in the same directory, all the needed analysis information
will be loaded from there.
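The database check described above can be sketched as follows. This is only an illustration, not the program's actual code: the function name, the use of sets of pattern identifiers, and the rule of preferring the database with the most shared patterns are all assumptions. Only the threshold of 10 shared patterns comes from the text.

```python
MIN_SHARED = 10  # threshold stated in the text: at least 10 common patterns


def pick_ocr_database(file_patterns, databases):
    """Return the name of a matching OCR database, or None.

    file_patterns: set of patterns found in the loaded idx/sub (hypothetical ids).
    databases: dict mapping a database name to its set of known patterns.
    Assumption: if several databases qualify, the one sharing the most
    patterns wins. The text does not say how ties or multiple matches are handled.
    """
    best_name, best_shared = None, 0
    for name, known in databases.items():
        shared = len(file_patterns & known)
        if shared >= MIN_SHARED and shared > best_shared:
            best_name, best_shared = name, shared
    return best_name


# Hypothetical data: integers stand in for real bitmap patterns.
file_patterns = set(range(20))
databases = {
    "OCR1": set(range(5)),         # only 5 shared patterns: not enough
    "OCR2": set(range(12)),        # 12 shared patterns: qualifies
    "OCR3": set(range(100, 130)),  # nothing in common
}
chosen = pick_ocr_database(file_patterns, databases)
```

When no database qualifies the function returns None, which would correspond to starting the learning process from scratch.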
The first time an idx/sub file is loaded and analyzed (no prj file
present), a screen appears to help the program distinguish the text on
the bitmaps in the best possible way.

Choose the color that gives the most solid and lean characters in
the first subtitle of the idx/sub, which appears in the background on
the main screen at 6. The program suggests the color it thinks is best,
but you may be able to make a better selection. Generally, if the
suggested color gives solid, lean letters (the letters' inner/body
color), keep it. Avoid colors that represent the outline of the
letters.
Please keep in mind that OCR learning is not stored in these prj files
(one for every loaded idx/sub file). Your work is saved in the OCR
database. The OCR database is a directory named OCR, created in the
same directory that IdxSub2Srt runs from. It contains pairs of
OCR*.txt/OCR*.bin files, and these are what really hold your work. The
prj files do store some other information, however, such as the text
you enter when selecting "Use my Edited Text" (10). They also store
which subtitles have to be ignored in srt generation (17). Apart from
this information, all other analysis data can be recovered if the prj
file is deleted: the program will load the appropriate OCR file and
eventually a new prj file will be created. Please note that if you
delete the OCR database for any reason, all prj files have to be
deleted too.
Now the real OCR learning starts. Every subtitle you browse through 5
has a number of patterns that were extracted during the analysis
phase. Your work is to replace the # symbol, which is assigned
automatically and means "unknown pattern", with the text that really
corresponds to the selected pattern. A pattern can occur many times in
the same subtitle and, of course, in many other subtitles. For example,
in the picture above a pattern with the Greek letter o (omicron) is
selected. It appears in red at 8 and is painted red everywhere it is
found in the subtitle bitmap at 6. The letter o is found in six places
in the subtitle bitmap shown.
Every time the appropriate text is entered at 9, the text for the
subtitle is regenerated and appears at 14. Progressively every # will
be replaced with the text entered by the user.
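As a rough illustration of this learning loop, the sketch below keeps a dictionary from pattern to text and renders every pattern that has not been learned yet as #. The tuple representation of a pattern and all names are hypothetical; the program's real pattern matching is not documented here.

```python
UNKNOWN = "#"  # placeholder shown for a pattern that has no text yet


def recognize(patterns, learned):
    """Return the text of a subtitle, using '#' for patterns not learned yet."""
    return "".join(learned.get(p, UNKNOWN) for p in patterns)


# Hypothetical patterns: each glyph bitmap is reduced to a tuple of pixel rows.
O = ("0110", "1001", "1001", "0110")
I = ("1", "1", "1", "1")

learned = {}            # plays the role of the OCR database
subtitle = [O, I, O]    # patterns of one subtitle, in reading order

before = recognize(subtitle, learned)      # nothing learned yet
learned[O] = "o"                           # the user types "o" for the selected pattern
after_one = recognize(subtitle, learned)   # every occurrence of O is filled in at once
learned[I] = "i"
after_all = recognize(subtitle, learned)
```

Because the lookup is keyed by pattern, entering the text once fills in every occurrence of that pattern, in this subtitle and in any other.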
To make the work faster, check the "Only Unknown letters" option (13).
This helps you concentrate your efforts only on the subtitles and
patterns that are not recognized yet. With this option checked you can
browse only forward, and only after you enter text for (recognize) the
current pattern. Every time you go to the next unrecognized pattern,
you can watch at 11 the number of subtitles that remain not completely
recognized. If you make a mistake and wish to go back to fix the text
entered for a pattern in the current subtitle, just uncheck 13, browse
to the pattern, fix it, and check 13 again to continue your work.
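The effect of "Only Unknown letters" on browsing and on the counter at 11 can be pictured with a small filter. This is a sketch under assumed data structures (subtitles as lists of pattern ids, learning as a dictionary), not the program's implementation.

```python
def unknown_subtitles(subtitles, learned):
    """Return the indexes of subtitles that still contain an unlearned pattern."""
    return [
        i
        for i, patterns in enumerate(subtitles)
        if any(p not in learned for p in patterns)
    ]


# Hypothetical data: short strings stand in for bitmap patterns.
subtitles = [["a", "b"], ["b", "c"], ["a"]]
learned = {"a": "a"}

remaining_start = unknown_subtitles(subtitles, learned)  # subtitles 0 and 1 need work
learned["b"] = "b"
remaining_mid = unknown_subtitles(subtitles, learned)    # only subtitle 1 is left
learned["c"] = "c"
remaining_end = unknown_subtitles(subtitles, learned)    # everything is recognized
```

The length of the returned list corresponds to the shrinking total shown at 11 while the option is checked.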
One aspect that needs your attention is how the program inserts spaces
to organize the text into words. It uses the distance between the
patterns and two numbers ("AVRG Normal" & "AVRG Italics" - see 12). One
number affects normal style text, the other italic style text. When the
distance between two consecutive patterns is less than the relevant
AVRG number, they are considered to belong to the same word. If the
distance is bigger than this number, a space is inserted between them.
The two numbers are computed through some statistics, but the user can
tweak them, see the result (at 14) and decide which value gives the
best "word separation".
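The word separation rule can be sketched like this. The glyph tuples, the function name, and two details are assumptions not stated in the text: the style of the right-hand pattern selects which AVRG value applies, and a distance exactly equal to the AVRG value is treated as "same word".

```python
def join_patterns(glyphs, avrg_normal, avrg_italics):
    """Join recognized glyphs into text, inserting spaces between words.

    glyphs: list of (text, left_x, right_x, is_italic) in reading order,
    with x positions in pixels (a hypothetical representation).
    """
    parts = []
    prev_right = None
    for text, left, right, italic in glyphs:
        if prev_right is not None:
            gap = left - prev_right
            # Assumption: the style of the current glyph picks the threshold.
            threshold = avrg_italics if italic else avrg_normal
            if gap > threshold:
                parts.append(" ")
        parts.append(text)
        prev_right = right
    return "".join(parts)


# Gaps between consecutive glyphs are 2, 8 and 2 pixels.
glyphs = [
    ("H", 0, 8, False),
    ("i", 10, 12, False),
    ("y", 20, 27, False),
    ("o", 29, 35, False),
]
good = join_patterns(glyphs, avrg_normal=5, avrg_italics=3)
too_high = join_patterns(glyphs, avrg_normal=10, avrg_italics=3)
too_low = join_patterns(glyphs, avrg_normal=1, avrg_italics=1)
```

With a threshold of 5 the text splits into two words, a value of 10 merges everything into one word, and a value of 1 separates every letter, which is why tweaking the two numbers while watching 14 is useful.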
When all patterns are recognized, press "Generate Srt" (4) to generate
the srt file. It will be created in the same directory as the loaded
idx/sub file.
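For reference, the sketch below shows what the generation step has to produce: standard srt entries (a counter, a start --> end time line, the text lines, a blank line), with italic lines wrapped in <i></i> and ignored subtitles left out. The dictionary layout and function names are hypothetical, and renumbering the entries after skipping ignored subtitles is an assumption about the program's behavior.

```python
from pathlib import Path


def srt_time(ms):
    """Format milliseconds as the srt timestamp HH:MM:SS,mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def build_srt(subtitles):
    """Build the srt text from a list of subtitle dicts.

    Each dict has start_ms, end_ms, ignored, and lines, where lines is a
    list of (text, is_italic) pairs. Ignored subtitles are skipped.
    """
    blocks = []
    index = 1
    for sub in subtitles:
        if sub["ignored"]:
            continue
        lines = [f"<i>{text}</i>" if italic else text for text, italic in sub["lines"]]
        timing = f"{srt_time(sub['start_ms'])} --> {srt_time(sub['end_ms'])}"
        blocks.append(f"{index}\n{timing}\n" + "\n".join(lines) + "\n")
        index += 1
    return "\n".join(blocks)


def srt_path(idx_path):
    """Same directory and name as the idx file, with the srt extension."""
    return Path(idx_path).with_suffix(".srt")


subtitles = [
    {"start_ms": 1000, "end_ms": 2500, "ignored": False,
     "lines": [("Hello", True), ("world", False)]},
    {"start_ms": 3000, "end_ms": 4000, "ignored": True,
     "lines": [("[music]", False)]},
    {"start_ms": 5000, "end_ms": 6000, "ignored": False,
     "lines": [("Bye", False)]},
]
srt_text = build_srt(subtitles)
```

Writing `srt_text` to `srt_path("movie.idx")` would place `movie.srt` next to the idx file, as described above.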
What is New?
1.3
- A fix was made for better support of RTL languages
(Arabic, Farsi, Hebrew). Please note that the
program was created with the Greek and Latin alphabets in mind.
Languages like Arabic and Farsi, which connect the
letters to one another, will not be handled in the best way
by IdxSub2Srt. That means more work for Arabic and Farsi
users.
1.2
- The "AVRG Space" option is split in two, "AVRG
Normal" and "AVRG Italics". Each of these two text styles needs a
dedicated AVRG option, with "AVRG Italics" being lower. This
way the user, by tweaking these numbers, can get the best
word separation for both text styles.
1.1
- Added "Ignore subtitle". Marks the currently selected
subtitle to be excluded from srt generation.
- Added the ability to jump to any subtitle. You enter
the number and press "Go".