Assessment of Achievement Programme: Report of the Sixth AAP Survey of Science (2003)

2. Knowledge and understanding

2.1 The assessment process

2.1.1 The assessment tasks

The 2000 revision of the National Guidelines for Environmental Studies 6, which replaced the 'stage band' classification of science content (P1-P3, P4-P6, P7-S2) with a framework of strands and level-based attainment targets, has had a major influence on the assessment and reporting of pupils' Knowledge and understanding attainment within the AAP. As a consequence, this 2003 survey is the first national survey to report pupils' Knowledge and understanding attainment with reference to the 5-14 levels.

But this welcome change in practice did not come without challenges. The new intention to report attainment by level, along with the decision to report attainment at more than one level at each stage, demanded an increase in the scale of the survey in this area. While just over 200 pencil and paper Knowledge and understanding tasks had been administered in the 1999 Science survey, the 2003 survey needed almost double this number. A new task development exercise on this scale was not an option, given the timescale available for survey preparation, even had all 200+ 'old' tasks been re-usable. In the event, rather few of the 200+ existing tasks proved to be re-usable, which complicated matters further. A task review, carried out during the autumn of 2002, identified 60 tasks as suitable for re-use. Most of the other tasks were no longer relevant to the new Guidelines, either because their content was no longer current or because they could not be classified unambiguously into 5-14 levels. In some cases tasks were rejected because their content duplicated that of others.

Fortunately, as noted in the previous chapter, the programme was able to benefit from a related initiative, in the form of a SEED-funded project based in the Universities of Aberdeen and Strathclyde, whose general remit had been to produce a set of exemplification materials to familiarise teachers with the intentions behind the new Guidelines for science 7. Among the materials produced by the project team was a large set of pencil and paper Knowledge and understanding tasks. The tasks covered all six 5-14 levels and all three outcomes fairly evenly, and by design were already level classified. Although the tasks had not been pre-tested in schools as part of the project, and there was no time for pre-testing prior to survey use, they were reviewed for their suitability. Many of the tasks were adopted unchanged, save for presentational modifications to conform to the current AAP 'house style', while others were revised, sometimes being split to produce two or more smaller tasks. In this way, the whole set of 360 tasks needed for the survey was produced.

The 360 tasks administered in the survey comprised 60 tasks at each 5-14 level (A to F), with 20 at each level from each of the three outcomes: 'Understanding Earth and Space', 'Understanding Energy and Forces' and 'Understanding Living Things and the Processes of Life'. Around one-sixth of the tasks had been used in the same or similar form in the 1999 AAP Science survey, while the remainder, the majority, were drawn from the set of 'Aberdeen' exemplification tasks. Figures 2.1a, 2.1b, 2.1c, 2.1d, 2.1e and 2.1f reproduce six exemplar tasks (in reduced size), one at each level and two from each outcome.

While most tasks would have taken no more than two or three minutes of pupil time to answer, tasks varied in length, format and general structure, as Figures 2.1a to 2.1f show, and also in their mark allocations (see section 2.1.5).

Figure 2.1a Level A - Understanding Living Things and the Processes of Life

graphic

Figure 2.1b Level B - Understanding Earth and Space

graphic

Figure 2.1c Level C - Understanding Living Things and the Processes of Life

graphic

Figure 2.1d Level D - Understanding Energy and Forces

graphic

Figure 2.1e Level E - Understanding Earth and Space

graphic

Figure 2.1f Level F - Understanding Energy and Forces

graphic

2.1.2 The test items within tasks

As Table 2.1 shows, just 44% of the tasks were 'single-item' tasks, such as that shown in Figure 2.1b. Almost a quarter (23%) were 2-item tasks, such as the one shown in Figure 2.1f. Just under a fifth (19%) were 3-item tasks, such as the one shown in Figure 2.1d. The remainder (14%) were tasks having between four and seven items/parts each, such as that shown in Figure 2.1c. Between them, the 360 tasks comprised 759 test items.

Table 2.1 Single-item and multi-item tasks

| No. items | No. tasks | % tasks |
|---|---|---|
| 7 | 1 | <1 |
| 6 | 5 | 1 |
| 5 | 12 | 3 |
| 4 | 33 | 9 |
| 3 | 69 | 19 |
| 2 | 83 | 23 |
| 1 | 157 | 44 |
| Total | 360 | 100 |
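The totals quoted alongside Table 2.1 can be checked with a few lines of arithmetic. The sketch below simply re-enters the table rows; the variable and function names are illustrative and not part of the survey documentation.

```python
# Rows of Table 2.1: number of items per task -> number of tasks.
tasks_by_item_count = {7: 1, 6: 5, 5: 12, 4: 33, 3: 69, 2: 83, 1: 157}

total_tasks = sum(tasks_by_item_count.values())
total_items = sum(
    n_items * n_tasks for n_items, n_tasks in tasks_by_item_count.items()
)


def share_of_tasks(n_items: int) -> float:
    """Percentage of all tasks that have exactly n_items items."""
    return 100 * tasks_by_item_count[n_items] / total_tasks


# Tasks with four to seven items, grouped as in the text.
multi_part_tasks = sum(
    n_tasks for n_items, n_tasks in tasks_by_item_count.items() if n_items >= 4
)

print(total_tasks, total_items, round(share_of_tasks(1)), multi_part_tasks)
```

The task and item totals reproduce the 360 tasks and 759 items reported in the text.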

Several different item formats featured in the tasks, including:

  • Multiple-choice
    A variety of multiple-choice techniques was used, including selecting one answer option from two or more given options (see Figure 2.1b and the first item, or gap, in Figure 2.1e), selecting two or more 'correct' statements from five or six, etc. Classification activities can be included here (see Figure 2.1a). In all, 30% of the 759 items were multiple-choice in format.
  • Matching
    Occasionally, pupils were asked to match attributes - see, for example, Figures 2.1c and 2.1d. These resemble multiple-choice items in some ways, the essential difference being that the inter-dependence among the matching items within any one task is high. Around a quarter of all items were matching items.
  • Sequencing
    Typically, pupils were presented with a set of pictures or statements, and required to put these into a correct sequence, for example to complete a food chain or to correctly order the stages in a life cycle. Just 6% of the items were sequencing items.
  • Short response
    These would typically take the form "Name…", "How much…?", "Which…?", with pupils responding with a single word or number. The first item in the task in Figure 2.1f is an example. One-fifth of the items were short-response items.
  • Open ended
    These tended to appear in higher-level tasks, and generally took the form "Describe…" or "Explain…". Figures 2.1e and 2.1f (second item) are examples. Some open-ended items required quite extended responses from pupils, running to several sentences - for example, "What is the 'big bang' theory of the origin of the universe?" (Level F). Just under a fifth (18%) of the items were open-ended, and about a quarter of these required extended responses.

The ratio of closed-format items (multiple choice, matching, sequencing) to open-format items (short response, open ended, extended response) changed in favour of open formats as 5-14 levels increased (see Table 2.2).

Table 2.2 Ratio of closed-format to open-format items at different 5-14 levels

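| Level | Total items | % closed-format | % open-format | Approx. ratio closed:open |
|---|---|---|---|---|
| F | 129 | 25 | 75 | 1:3 |
| E | 142 | 51 | 49 | 1:1 |
| D | 126 | 57 | 43 | 4:3 |
| C | 125 | 65 | 35 | 2:1 |
| B | 115 | 85 | 15 | 6:1 |
| A | 122 | 97 | 3 | 30:1 |
| Total | 759 | 62 | 38 | 3:2 |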
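As a consistency check on Table 2.2, the overall closed-format percentage can be recovered as an item-weighted average of the level percentages. This is only a sketch that re-enters the published (rounded) figures; the names are illustrative.

```python
# Table 2.2 rows: level -> (total items, % closed-format).
table_2_2 = {
    "F": (129, 25),
    "E": (142, 51),
    "D": (126, 57),
    "C": (125, 65),
    "B": (115, 85),
    "A": (122, 97),
}

total_items = sum(n for n, _ in table_2_2.values())

# Weight each level's percentage by its number of items.
overall_closed = sum(n * pct for n, pct in table_2_2.values()) / total_items
overall_open = 100 - overall_closed

print(total_items, round(overall_closed), round(overall_open))
```

The weighted figures round to the 62% closed and 38% open shown in the Total row.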

This changing ratio of closed-format to open-format as levels increased is an interesting phenomenon to note, reflecting as it presumably does science educators', or at least task developers', views about what content-based assessment in science should properly look like at the different stages. But it is also very relevant to bear in mind when the attainment findings presented later in this chapter are reviewed. This is because there was a very strong association between format and success rates, with open formats producing lower attainment on average than closed formats, partly because of the impact of often substantially higher non-response rates. Pupils' responses to these types of tasks were also the most vulnerable to 'transcriber error' (see section 2.1.4).

2.1.3 Task administration

The way in which the 360 tasks were packaged for administration in the schools has been described in Chapter 1 (see also Appendix B). Here we add a little more detail.

The 60 tasks at each level were randomly distributed into 10 sets of six tasks, ensuring only that every set of six comprised two tasks from each of the three outcomes. Level-specific task sets were then paired to produce 10 Level A/B booklets for use at P3, 10 Level B/C booklets for use at P5, 10 Level C/D/E booklets for use at P7, and 10 Level D/E/F booklets for use at S2. By design, every booklet contained an equal number of tasks from each of the three outcomes, and these were kept in 'outcome blocks'. Within outcome blocks tasks were presented in order of increasing level.
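The booklet assembly described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the task identifiers are hypothetical, sets at different levels are combined simply by position (the report does not say how sets were matched), and the linking of task sets across stages is not modelled.

```python
import random

OUTCOMES = ("Earth and Space", "Energy and Forces", "Living Things")


def make_task_sets(level, rng):
    """Split the 60 tasks at one level into 10 sets of six, two per outcome."""
    # Hypothetical task identifiers: (level, outcome, serial number 0-19).
    by_outcome = {o: [(level, o, i) for i in range(20)] for o in OUTCOMES}
    for tasks in by_outcome.values():
        rng.shuffle(tasks)
    # Set k takes the k-th pair of tasks from each shuffled outcome list.
    return [
        [t for o in OUTCOMES for t in by_outcome[o][2 * k : 2 * k + 2]]
        for k in range(10)
    ]


def make_booklets(levels, rng):
    """Build 10 booklets from one task set per level (levels in rising order)."""
    sets_by_level = {lv: make_task_sets(lv, rng) for lv in levels}
    booklets = []
    for k in range(10):
        booklet = []
        for o in OUTCOMES:        # tasks are kept in outcome blocks
            for lv in levels:     # increasing level within each block
                booklet += [t for t in sets_by_level[lv][k] if t[1] == o]
        booklets.append(booklet)
    return booklets


p3_booklets = make_booklets(["A", "B"], random.Random(2003))
print(len(p3_booklets), len(p3_booklets[0]))
```

With two levels each booklet holds 12 tasks, and with three levels 18 tasks, in line with the booklet sizes reported for P3/P5 and P7/S2.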

Every booklet was printed in three different versions, varying the order of presentation of tasks, so that no specific tasks suffered from possible test fatigue effects by being placed towards or at the end of booklets. Any task would then have appeared an equal number of times at or near the beginning of the booklet, towards the middle of the booklet, or at or near the end of the booklet. A single numeracy task was included in every booklet (see Chapter 4 for the results of the numeracy assessment).

Thus, in each P3 booklet a pupil would be faced first with two Level A tasks followed by two Level B tasks, all four relating to one of the three outcomes, then two Level A tasks followed by two Level B tasks relating to a second of the three outcomes, then, perhaps, the numeracy task, and then, finally, two Level A tasks followed by two Level B tasks relating to the third outcome.

To assist in attainment comparisons across stages, the Level B tasks which featured in a particular test booklet at P3, here mixed with Level A tasks, were transferred into one of the test booklets at P5, to be mixed with Level C tasks, and so on.

At each stage, up to 10 pupils in each school took part in the written science assessment (in very small primary schools fewer than 10 would be available). Each sample pupil was intended to attempt two different test booklets, and booklet pairs were allocated at random to pupils before the survey took place. Provided a school could supply 10 pupils for assessment, that school was sent ten pairs of booklets for the appropriate stage, with every booklet appearing twice in the set. Thus, a maximum of two pupils in any school would attempt any particular test booklet.

The schools organised their own assessment sessions within the timescale they were given, viz. mid-May to mid-June, and they were advised to organise two separate assessment sessions for their pupils, with a break between. They had the freedom to organise the two sessions to take place on the same day or on different days within the given period. The assessment sessions were supervised by the pupils' own class teachers, or by another teacher chosen by the head teacher of the school. The supervising teacher could explain what had to be done, but was not allowed to provide answers or confirm that a pupil's answers were correct. The sessions were not necessarily timed, but it was expected that they would vary from about 30-40 minutes at P3 to 50-60 minutes at S2. It was assumed that schools would organise the core skills reading and writing assessment at the same time - see Chapter 4 for details and results. Once a school's scripts were completed they were sent to SEED for processing (see below).

In the event, almost 6000 pupils in around 600 schools participated in the written survey of Knowledge and understanding: data were analysed for 1405 P3 pupils in 155 schools, 1463 P5 pupils in 156 schools, 1483 P7 pupils in 156 schools, and 1306 S2 pupils in 130 schools.

2.1.4 Script processing

The pupils' scripts were processed centrally by a team of undergraduates, during transcription meetings held in June and July 2003. The transcribers were not required to award marks to pupils' responses. They simply noted the response options selected by the pupils, transferring these onto specially designed response transcription forms, by circling response options. Where alternative responses were offered to pupils, as in multiple choice items or 'matching' items, the options were reproduced on the transcription form. Where items were open-ended, then whenever possible a set of keyterms was identified that adequately encapsulated alternative pupil responses. Where short keyterms were not possible to identify, then letter codes were offered to transcribers, each letter code being associated with a particular type of extended written response; in these cases transcribers were supplied with accompanying explanatory notes for use during transcription.

Random checks for consistency were carried out during the response transcription exercise. In general terms the procedure was as follows. Typically, 20-30 copies of each transcribed assessment booklet were newly transcribed "blind" by a second transcriber, i.e. the second transcriber did not have sight of the original transcription. The original transcription and the second independent transcription were then compared and discrepancies noted. The results are presented in Table 2.3.
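The comparison of an original transcription with a blind second transcription amounts to counting disagreements over the item-pupil codes checked. A minimal sketch, using invented codes and a hypothetical function name:

```python
def discrepancy_rate(first, second):
    """Percentage of item-pupil codes on which two transcriptions disagree."""
    if len(first) != len(second):
        raise ValueError("transcriptions must cover the same item-pupil codes")
    if not first:
        raise ValueError("no codes to compare")
    mismatches = sum(a != b for a, b in zip(first, second))
    return 100 * mismatches / len(first)


# Invented example: five codes for one script, one disagreement.
original = ["A", "C", "B", "keyterm:orbit", "D"]
blind = ["A", "C", "D", "keyterm:orbit", "D"]
rate = discrepancy_rate(original, blind)
print(rate)
```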

Table 2.3 Response transcription consistency

(Discrepancy rates across 10 booklets per stage*, 20-30 scripts per booklet)

| Stage | Task levels | No. item-pupil codes checked | % discrepancy |
|---|---|---|---|
| S2 | D/E/F | 10610 | 4.0 |
| P7 | C/D/E | 4551 | 2.3 |
| P5 | B/C | 7000 | 1.6 |
| P3 | A/B | 6954 | 2.2 |

* 5 booklets only at P7

Table 2.3 shows that the discrepancy rate was lowest at the primary stages, at around 2%, and highest at S2, at 4%. Just over half the discrepancies were associated with one or two 'problem' items in each booklet, where booklets typically contained 20-30 items at P3 and P5 (12 tasks) and 30-40 items at P7 and S2 (18 tasks). As might be expected, in every case these 'problem' items were of open-ended format, requiring a degree of subjective interpretation on the part of the transcribers as they decided which keyterm or code corresponded most closely with a pupil's written response.

The completed transcription forms were keyboarded by a professional data processing company, for later machine marking and analysis.

2.1.5 Marking

With very rare exceptions, test items were allocated a single mark, and item marking was a relatively straightforward automated process. Pupils' item responses, as indicated by the response options, keyterms or letter codes circled by the transcribers, were matched against the correct answers as recorded in the system for the items concerned, and marks were allocated accordingly.

Task marking, however, proved much less straightforward. Had task marks been the simple sum of item marks, then achievable task marks would have been the same as the number of items in a task. In other words, task marks would in principle range from one to seven, depending on the task. But this wide range of achievable task marks could pose problems for interpreting attainment results. This is because attainment results were to be based on the application of cut-off scores to pupils' mark totals for tasks at a level. Clearly, tasks with the highest maximum marks would have more influence on the results than tasks with the lowest maximum marks, without necessarily any real justification for this greater importance.

To impose a degree of control on this situation, task-specific criteria were agreed that, when applied, reduced the possible mark range to between one and three marks. This was a compromise policy, intended to accommodate the nature and variety of the tasks used in the survey.

But how to achieve rational scale mappings? This was the difficult challenge for subject specialists. The ways that the six tasks shown earlier were handled will serve to illustrate how this particular challenge was met.

The 'orbit' task shown in Figure 2.1b, which comprises a single classical multiple-choice item, was relatively unproblematic. This task was allocated a single mark.

The classification task shown in Figure 2.1a was processed as a single-item task, and was also allocated a single mark, even though three out of six pictured creatures were to be identified as birds. The reasoning here was that the pupil would need to be able to identify all three bird members correctly to have shown the required knowledge and understanding of bird characteristics, i.e. the concept being tested.

The matching task shown in Figure 2.1d has three items, and therefore has a three-mark total in principle. In practice the task was dichotomously scored. Like the task in Figure 2.1a, the decision here was that a pupil would need to match all three circuit components to their correct circuit diagram symbols in order to have demonstrated the relevant knowledge and understanding. Thus, three correct matches merited one mark while fewer than three merited none.

The task in Figure 2.1c produced a different decision. This task has four items, and therefore four marks in total. The mapping decision was that a pupil matching all four functions to the correct body organs deserved three marks, one who correctly matched three of the four deserved two marks, while a pupil successfully matching one or two of the four gained one mark. This was therefore a three-mark task.

The chemical reaction task in Figure 2.1e has two items, the first a multiple choice item (select the setup which would produce the fastest reaction) and the second an open-ended response item (explain your choice). The decision here was that a pupil would need to make the correct choice of setup and give a correct explanation for the choice to deserve a mark: in other words, pupils needed to answer both items correctly to merit the single mark for the task.

The spring balance task shown in Figure 2.1f proved different again. This task also has two items, so that the task mark is in principle also two, as for the reaction task. However, there is much less dependence between the two items in this task - indeed the two items could have been presented quite independently of one another, as two separate single-item tasks. Therefore, a pupil answering both items correctly merited two marks, while one correct item merited one mark. This, then, retained the status of a two-mark task.
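The three kinds of mapping illustrated by the six exemplar tasks can be written out as small functions. This is a sketch of the rules as described above; the function names are hypothetical, and the zero-mark case for the Figure 2.1c task (no correct matches) is an assumption, since the text does not state it explicitly.

```python
def all_or_nothing(n_correct: int, n_items: int) -> int:
    """One mark only if every item is correct (as for Figures 2.1b, 2.1d, 2.1e)."""
    return 1 if n_correct == n_items else 0


def organ_matching(n_correct: int) -> int:
    """Figure 2.1c: four items mapped onto a three-mark scale."""
    # 4 correct -> 3 marks, 3 -> 2 marks, 1 or 2 -> 1 mark; 0 -> 0 is assumed.
    return {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}[n_correct]


def independent_items(n_correct: int) -> int:
    """Figure 2.1f: items are independent, so one mark per correct item."""
    return n_correct


examples = {
    "2.1d, 2 of 3 symbols matched": all_or_nothing(2, 3),
    "2.1d, 3 of 3 symbols matched": all_or_nothing(3, 3),
    "2.1e, choice right, explanation wrong": all_or_nothing(1, 2),
    "2.1c, 3 of 4 organs matched": organ_matching(3),
    "2.1f, 1 of 2 items correct": independent_items(1),
}
for description, marks in examples.items():
    print(description, "->", marks)
```

Whichever rule applies, the task mark stays within the one-to-three range that the task-specific criteria were designed to enforce.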

2.1.6 Reporting Knowledge and understanding attainment

Across their two science booklets, each pupil would have attempted 12 tasks at each of the levels included. Performances on the 12 tasks at the same level determined pupils' attainment classifications at that level. The cut-off score criteria previously identified by English specialists as appropriate for the purpose, which were used in the 2001 English Language Survey and again in the 2002 Social Subjects Survey, were applied here.

Pupils achieving 65% or more of the marks for the 12 tasks at a particular level in their two booklets are classified as 'secure' at the level concerned 8, i.e. as having attained the level. Pupils achieving 80% or more of the marks are classified as demonstrating 'considerable strengths' at this level. Pupils achieving at least 50% of the marks but not as many as 65% are classified as having demonstrated 'basic' attainment at the level concerned.
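The cut-off scores translate directly into a banding rule. The sketch below is illustrative only: the function name and the 'below basic' label are not taken from the report, and the mark totals in the example are invented. Integer comparisons are used so that boundary cases such as exactly 65% are not affected by floating-point rounding.

```python
def classify(marks: int, max_marks: int) -> str:
    """Band a pupil's mark total on the 12 tasks at one level."""
    if max_marks <= 0:
        raise ValueError("max_marks must be positive")
    # Compare 100 * marks with cutoff * max_marks to stay in integers.
    if 100 * marks >= 80 * max_marks:
        return "considerable strengths"
    if 100 * marks >= 65 * max_marks:
        return "secure"
    if 100 * marks >= 50 * max_marks:
        return "basic"
    return "below basic"


# Invented example: a level whose 12 tasks carry 20 marks in total.
for marks in (9, 10, 13, 16):
    print(marks, classify(marks, 20))
```

Note that a pupil counted as having attained a level (65% or more) may fall in either the 'secure' or the 'considerable strengths' band under this finer classification.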

The proportions of pupils in each classification group were calculated for each booklet pair separately, i.e. for each set of tasks at the same level across a pair of booklets (the data were weighted during this process, to adjust for imbalance in sample representation - see Appendix B for details). The separate booklet results were then averaged to produce the national estimates of attainment presented in this chapter (Levels A and B at P3, Levels B and C at P5, Levels C, D and E at P7, Levels D, E and F at S2).
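The averaging step can be sketched as below. This is a simplified, unweighted illustration: the survey weighted the data as described in Appendix B, and that weighting is not reproduced here. The band labels, function names and pupil data are all invented for the example.

```python
from collections import Counter
from statistics import mean

BANDS = ("below basic", "basic", "secure", "considerable strengths")


def band_percentages(pupil_bands):
    """Percentage of pupils in each band for one booklet pair (unweighted)."""
    counts = Counter(pupil_bands)
    return {band: 100 * counts[band] / len(pupil_bands) for band in BANDS}


def national_estimate(per_pair_percentages):
    """Average the band percentages over booklet pairs."""
    return {
        band: mean(pair[band] for pair in per_pair_percentages)
        for band in BANDS
    }


# Invented data: bands for four pupils on each of two booklet pairs.
pair_1 = ["secure", "basic", "below basic", "considerable strengths"]
pair_2 = ["secure", "secure", "basic", "basic"]
estimate = national_estimate([band_percentages(pair_1), band_percentages(pair_2)])
print(estimate)
```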

2.2 Overview of pupils' attainments

2.2.1 The attainment picture across the stages

Table 2.4 provides an overview of attainment at all four stages, in terms of the proportions of pupils meeting the 65% success criterion on the tasks they attempted at particular levels, averaged over all booklet pairs. Figure 2.2 illustrates the picture.

Table 2.4 'Secure' Knowledge and understanding attainment * P3 to S2

(% pupils achieving 65% or more of the marks for 12 tasks at a level, averaged over booklet pairs: 1300-1500 pupils at each stage)

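| Stage | Level A | Level B | Level C | Level D | Level E | Level F |
|---|---|---|---|---|---|---|
| S2 | | | | 20 | 10 | <1 |
| P7 | | | 37 | 7 | <1 | |
| P5 | | 75 | 26 | | | |
| P3 | 76 | 54 | | | | |

* Margins of error for the estimated proportions vary between 1½ and 2 percentage points.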

As Table 2.4 shows, three-quarters of the P3 pupils were deemed to be working at Level A or higher, with just over half of all the pupils working at Level B or higher. At P5, three-quarters of the pupils were working at Level B or higher, and a quarter of all the pupils were working at Level C or higher.

Just over a third of the P7 pupils were classified as working at Level C or higher, fewer than 10% of all the pupils were classified as working at Level D or higher, and a mere handful of pupils (fewer than 1%) showed sufficiently good performances to be classified as working at Level E. Just one-fifth of the S2 pupils were classified as working at Level D, 10% at Level E, and a handful only (fewer than 1%) at Level F.

Let us look now at the finer classification of pupils, which distinguishes 'basic' Knowledge and understanding (50% or more of the marks achieved), 'secure' Knowledge and understanding (65% or more of the marks achieved) and demonstration of 'considerable strengths' (80% or more of the marks achieved). Table 2.5 presents the findings, and Figure 2.3 illustrates the picture of attainment.

Figure 2.2 'Secure' Knowledge and understanding attainment P3 to S2 *

chart

* Each bar shows the percentage of pupils demonstrating attainment at the level indicated or higher: 1300-1500 pupils at each stage.

Table 2.5 Pupils' Knowledge and understanding attainment

(% pupils classified into each band*, averaged over the task sets at each level)

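| Stage | Pupils | Level | < Basic | Basic | Secure | Strengths |
|---|---|---|---|---|---|---|
| S2 | 1306 | F | 95 | 5 | 0 | 0 |
| | | E | 75 | 15 | 8 | 2 |
| | | D | 56 | 24 | 15 | 5 |
| P7 | 1483 | E | 95 | 5 | 0 | 0 |
| | | D | 77 | 16 | 6 | 1 |
| | | C | 36 | 27 | 25 | 12 |
| P5 | 1463 | C | 49 | 25 | 19 | 7 |
| | | B | 10 | 15 | 32 | 43 |
| P3 | 1405 | B | 25 | 21 | 31 | 23 |
| | | A | 10 | 14 | 29 | 47 |

* '< basic' means fewer than 50% of marks achieved, 'basic' is between 50% and 64%, 'secure' is 65% to 79%, and 'strengths' is 80%+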

Noteworthy features in the attainment data shown in Table 2.5 are the high proportions of pupils at P3 and at P5 who performed sufficiently well on their task sets (80% or more of marks achieved) to be classified as having considerable strengths at the lower of the two levels at which they were assessed - Level A at P3, Level B at P5. Almost a quarter of the P3 pupils also showed considerable strengths at Level B. This is in contrast to the situation for Levels D and E at P7 and S2, where no more than 5% of the pupils performed so well.

When reflecting on these results, readers should remember that the proportions of open-format items increased with increasing task level (see Table 2.2), from 3% at Level A and 15% at Level B, through 35% at Level C, 43% at Level D and 49% at Level E, to fully 75% at Level F. The older the pupils, therefore, the less benefit they received from information support in their tasks, the more frequently they had to show evidence of their knowledge and understanding of science through the medium of writing (and from the evidence given in Chapter 4, pupils' writing skills were not well developed in general), and the more likely they were not to respond at all to the questions asked (non-response rates rose to 80-90% for some of the open-ended tasks). All these factors will have contributed to the apparently lower attainments at P7 and S2 compared with P3 and P5.

Figure 2.3 Pupils' Knowledge and understanding attainment

(% pupils classified into bands*, averaged over task sets at each level)

chart

Since each individual pupil attempted just four tasks from each outcome at any specific level (two in each booklet), it would be of very questionable value to produce level attainment figures by outcome using the usual cut-off score strategy. We look therefore at average task scores for evidence of similarity or difference.

On the evidence of average task scores, there were no significant performance differences between the three outcomes. The average percentage scores across the 120 tasks representing each outcome in the survey (all levels and stages combined) were: 44% for 'Understanding Energy and Forces', 42% for 'Understanding Earth and Space', and 43% for 'Understanding Living Things and the Processes of Life'.

2.2.2 Gender comparisons

Table 2.6 presents the level-based attainment results for boys and girls separately, averaged over the task sets at each level. While the table shows some small sample differences in one direction or the other, these are not statistically significant. The general picture is one of gender similarity. If we look at the three outcomes, however, we do find that although the sample differences are extremely small, they are in expected directions (see Table 2.7).

Table 2.6 'Secure' Knowledge and understanding attainment*: by gender

(% pupils achieving 65% or more marks for 12 tasks at a level, averaged over booklet pairs)

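| Stage | Gender | Level A | Level B | Level C | Level D | Level E | Level F |
|---|---|---|---|---|---|---|---|
| S2 | Boys | | | | 22 | 10 | <1 |
| | Girls | | | | 17 | 9 | <1 |
| | B-G | | | | 5 | 1 | 0 |
| P7 | Boys | | | 38 | 7 | <1 | |
| | Girls | | | 36 | 6 | <1 | |
| | B-G | | | 2 | 1 | 0 | |
| P5 | Boys | | 75 | 29 | | | |
| | Girls | | 75 | 23 | | | |
| | B-G | | 0 | 6 | | | |
| P3 | Boys | 74 | 52 | | | | |
| | Girls | 77 | 56 | | | | |
| | B-G | -3 | -4 | | | | |

* Figures show the percentages of pupils demonstrating attainment at the indicated level or higher: 650-750 pupils per gender at each stage.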

Table 2.7 Gender and outcome: average task scores

(average percentage task score over 120 tasks in each outcome - all levels and stages combined)

| Outcome | Boys | Girls |
|---|---|---|
| Energy and Forces | 45 | 43 |
| Earth and Space | 43 | 42 |
| Living Things and the Processes of Life | 43 | 44 |

2.2.3 Change over time

The continuing AAP strategy for exploring the issue of change over time is to rely on comparisons of pupils' attainments on 'common' tasks, i.e. on tasks used in identical form in two or more surveys, as the basis for comment. This same strategy was used again on this occasion, to compare pupils' attainment in 2003 with their attainment in 1999. But the strategy could only usefully be implemented at P7 and S2, given that this was the first time that P3 and P5 pupils had been assessed in an AAP Science survey. Moreover, the new need to offer attainment results with reference to the 5-14 levels has resulted in a 'common task' exercise at P7 and S2 on a very modest scale.

The level-based attainment framework for Knowledge and understanding in science was introduced in the 2000 revision of the National Guidelines for Environmental Studies, too late for implementation in the 1999 AAP Science survey. In that survey, therefore, tasks had been classified by stage band (P1-P3, P4-P6, P7-S2) rather than level. It has been noted earlier that when the tasks were reviewed in preparation for the 2003 survey rather few - just 60 - were found to have continuing content relevance and to be uniquely classifiable into appropriate 5-14 levels. In the event, just 47 tasks at Levels C, D or E were considered appropriate for re-use in unchanged form. Table 2.8 records the performances of P7 and S2 pupils on these 'common' tasks in 1999 and in 2003.

Table 2.8 Average facility values for re-used tasks at P7 and S2

| Stage | Year | Level C (17 tasks) | Level D (12 tasks) | Level E (18 tasks) |
|---|---|---|---|---|
| S2 | 2003 | | 46 | 43 |
| | 1999 | | 50 | 42 |
| P7 | 2003 | 49 | 35 | 25 |
| | 1999 | 53 | 35 | 24 |

The attainment comparisons in Table 2.8 are based on a total of 47 tasks that were administered in identical form and marked in identical ways in both surveys: 17 tasks at Level C (P7 only), 12 at Level D and 18 at Level E. For the purpose of the comparison all the tasks were dichotomously marked, and pupils needed to answer the task completely correctly to achieve the mark.
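Under dichotomous marking, a task's facility value is the percentage of pupils who answered every item in the task correctly, and the figures in Table 2.8 are averages of such values over the tasks at a level. A minimal sketch with invented responses (the function name and data are hypothetical):

```python
from statistics import mean


def task_facility(pupil_item_results):
    """Percentage of pupils with every item in the task correct."""
    if not pupil_item_results:
        raise ValueError("no pupil responses supplied")
    full_marks = sum(all(items) for items in pupil_item_results)
    return 100 * full_marks / len(pupil_item_results)


# Invented responses: one inner list of item results per pupil.
task_1 = [[True, True], [True, False], [True, True], [False, False]]
task_2 = [[True], [False], [False], [False]]

facilities = [task_facility(task_1), task_facility(task_2)]
average_facility = mean(facilities)
print(facilities, average_facility)
```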

On the basis of these small and rather arbitrary sets of assessment tasks, we can say that there is no evidence of attainment change over the period. The slight differences in average task facilities at Level C for P7 and Level D for S2 are not statistically significant.

2.3 Summary

Just under 6000 pupils in around 600 schools participated in the written science assessment, that is around 1300-1500 pupils at each stage. In total, 360 Knowledge and understanding tasks were administered to these pupils, 60 per level (A to F) and 120 from each of the three outcomes. The majority of pupils attempted two different test booklets, between them containing 12 tasks from each of two or three levels.

On the basis of their assessment results on the 12 tasks at a level, pupils were classified as being 'secure' at the level (using the criterion of 65% or more of the marks achieved on tasks at the same level), or as having shown 'basic' knowledge and understanding at the level (at least 50% of marks achieved, but not as many as 65%), or as having shown 'considerable strengths' at the level (80% or more of the marks achieved).

Three-quarters of the P3 and P5 pupils were classified as being secure or showing considerable strengths at Levels A and B, respectively. Just over half the P3 pupils were similarly classified at Level B, compared with a quarter of the P5 pupils for Level C. Just over a third of the P7 pupils were classified as secure or showing considerable strengths at Level C, while a fifth of the S2 pupils were similarly classified at Level D. At most 10% of the P7 and S2 pupils were secure at the next level up, i.e. Level D for P7, Level E for S2 - the target levels for these stages. Virtually no P7 or S2 pupils produced evidence of 'secure' attainment or considerable strengths at Levels E and F, respectively.

Looking at 'basic' levels of attainment and 'considerable strengths', we see a similar picture. While almost half of the P3 and the P5 pupils showed considerable strengths at Levels A and B, respectively, few, if any, of the P7 and S2 pupils produced such high performance at the levels assessed at their stages, and indeed high proportions failed to show even 'basic' attainment at their target levels (75-80% achieved fewer than half marks).

Contributory factors to this picture of lower achievement at the higher stages are the markedly higher proportions of open-format items that featured in the tasks at Levels D, E and, particularly, F. Such formats demand an additional appeal to writing ability in order to show evidence of science knowledge and understanding, they often lead also to high non-response rates, and response assessment is vulnerable to varying degrees of marker, or transcriber, subjectivity.

There was no evidence in the survey data of important, consistent gender differences in attainment in science overall - on the contrary, the general picture is one of similarity. But there were small sample differences in expected directions for the three outcomes: marginally in favour of the boys for 'Energy and Forces' and for 'Earth and Space' and marginally in favour of the girls for 'Living Things and the Processes of Life' (none of the very small differences reached statistical significance).

On the basis of a rather arbitrary and small set of 'common' tasks, i.e. tasks used in the same form and marked in the same way in 1999 and 2003, the survey has produced no evidence of any change in P7 or S2 attainment since 1999 (P3 and P5 were assessed for the first time in 2003).

Page updated: Thursday, March 24, 2005